  1. Jun 2025
    1. eLife Assessment

      The study introduces new tools for measuring the intracellular calcium concentration close to transmitter release sites, which may be relevant for synaptic vesicle fusion and replenishment. This approach yields important new information about the spatial and temporal profile of calcium concentrations near the site of entry at the plasma membrane. This experimental work is complemented by a coherent, open-source, computational model that successfully describes changes in calcium domains. Some of the conclusions are strongly supported by the data, but a few gaps in the data presented mean that the evidence for other conclusions is incomplete.

    2. Reviewer #1 (Public review):

      This paper describes technically impressive measurements of calcium signals near synaptic ribbons in goldfish bipolar cells. The data provide high spatial and temporal resolution information about calcium concentrations along the ribbon at various distances from the site of entry at the plasma membrane. This is important information. However, important gaps in the data presented mean that the evidence for the main conclusions is currently inadequate.

      Strengths

      • The technical aspects of the measurements are impressive. The authors use calcium indicators bound to the ribbon and high-speed line scans to resolve changes with a spatial resolution of ~250 nm and a temporal resolution of less than 10 ms. These spatial and temporal scales are much closer to those relevant for vesicle release than those of previous measurements.

      • The use of calcium indicators with very different affinities and of different intracellular calcium buffers helps provide confirmation of key results.

      Weaknesses

      • Multiple key points of the paper lack a statistical test or summary data from populations of cells. For example, the text states that the proximal and distal calcium kinetics in Figure 2A differ. This is not clear from the inset to Figure 2A - where the traces look like scaled versions of each other. Values for time to half-maximal peak fluorescence are given for one example cell but no statistics or summary are provided. Figure 8 shows examples from one cell with no summary data. This issue comes up in other places as well.

      • The rise time measurements in Figure 2 are very different for low and high affinity indicators, but no explanation is given for this difference. Similarly, the measurements of peak calcium concentration in Figure 4 are very different with the two indicators. That might suggest that the high affinity indicator is strongly saturated, which raises concerns about whether that is impacting the kinetic measurements.
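      The saturation concern in the last point can be illustrated with a minimal sketch, assuming simple 1:1 equilibrium binding between Ca2+ and the indicator. The dissociation constants and calcium concentrations below are hypothetical placeholders chosen only to contrast a high-affinity with a low-affinity indicator; they are not values from the manuscript.

```python
"""Sketch: why a high-affinity Ca2+ indicator may saturate near release sites.

Assumes 1:1 equilibrium binding. All Kd values and Ca2+ concentrations are
hypothetical illustrations, not measurements from the study under review.
"""


def bound_fraction(ca_um: float, kd_um: float) -> float:
    """Fraction of indicator bound to Ca2+ at equilibrium (1:1 binding)."""
    if ca_um < 0 or kd_um <= 0:
        raise ValueError("ca_um must be >= 0 and kd_um must be > 0")
    return ca_um / (ca_um + kd_um)


def ca_from_fraction(fraction: float, kd_um: float) -> float:
    """Invert the binding curve to estimate [Ca2+] from the bound fraction."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return kd_um * fraction / (1.0 - fraction)


# Hypothetical dissociation constants in micromolar.
KD_HIGH_AFFINITY_UM = 0.2
KD_LOW_AFFINITY_UM = 20.0

if __name__ == "__main__":
    for ca_um in (0.1, 1.0, 10.0, 50.0):
        high = bound_fraction(ca_um, KD_HIGH_AFFINITY_UM)
        low = bound_fraction(ca_um, KD_LOW_AFFINITY_UM)
        print(f"[Ca2+] = {ca_um:5.1f} uM  high-affinity: {high:.3f}  low-affinity: {low:.3f}")
```

      Under these assumptions, once the high-affinity indicator is mostly bound, further increases in calcium barely change its signal, so both the apparent peak concentration and the apparent rise time can be distorted. The inverse function shows the same problem from the other side: near a bound fraction of 1, small errors in fluorescence map onto large errors in the estimated concentration.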

    3. Reviewer #2 (Public review):

      Summary:

      The study introduces new tools for measuring intracellular Ca2+ concentration gradients around retinal rod bipolar cell (rbc) synaptic ribbons. This is done by comparing the Ca2+ profiles measured with mobile Ca2+ indicator dyes versus ribbon-tethered (immobile) Ca2+ indicator dyes. The Ca2+ imaging results provide a straightforward demonstration of Ca2+ gradients around the ribbon and validate their experimental strategy. This experimental work is complemented by a coherent, open-source, computational model that successfully describes changes in Ca2+ domains as a function of Ca2+ buffering. In addition, the authors try to demonstrate that there is heterogeneity among synaptic ribbons within an individual rbc terminal.

      Strengths:

      The study introduces a new set of tools for estimating Ca2+ concentration gradients at ribbon AZs, and the experimental results are accompanied by an open-source, computational model that nicely describes Ca2+ buffering at the rbc synaptic ribbon. In addition, the dissociated retinal preparation remains a valuable approach for studying ribbon synapses. Lastly, excellent EM.

      Comments on revisions:

      Specific minor comments:

      (1) Rewrite the final sentence of the Abstract. It is difficult to understand.

      (2) Add a definition in the Introduction (and revisit it in the Discussion) that distinguishes between micro- and nanodomains. A practical approach would be to round up and round down. If you round up from 0.6 um, then it is a microdomain, which means ~1 um or higher. Likewise, round down from 0.3 um to a nanodomain? If you are using confocal, or even STED, the resolution for Ca imaging will be in the 100 to 300 nm range. The point of your study is that your new immobile Ca2+ ribbon indicator may actually be operating on a tens-of-nm scale: nanophysiology. The Results are clearly written in a way that acknowledges this point, but consider making such a "definition" comment in the Introduction/Discussion in order to: 1) demonstrate the power of the new Ca2+ indicator to resolve signals at the base of the ribbon (effectively nano), and 2) (Discussion) acknowledge that some are achieving nanoscopic resolution (50 to 100 nm?) with light microscopy (as you ref'd Neef et al., 2018 Nat Comm).

      (3) Suggested reference: Grabner et al. 2022 (Sci Adv, Supp video 13, and Fig S5). Here rod Cav channels are shown to be expressed on both sides of the ribbon, at its base, and they are within nanometers of other AZ proteins. This agrees with the conclusions from your imaging work.

      (4) In the Discussion, add a little more context on what is known about synaptic transmission in the outer and inner retina. First, state that the postsynaptic receptors (for example: mGluR6-OnBCs vs KARs-Off-BCs, vs. AMPAR-HCs), and possibly the synaptic cleft (ground squirrel), are known to have a significant impact on signaling in the outer retina. In the inner retina, there are many more unknowns. For example, when I think of the pioneering Palmer JPhysio study, which you cite, I think of NMDAR vs AMPAR, and uncertainty about what type of postsynaptic cell was patched (GC or AC....). Once you have informed the reader that the postsynapse is known to have a significant impact on signaling, then promote your experimental work that addresses presynaptic processes: "...the new tool and results allow us to explore release heterogeneity, ribbon by ribbon in dissociated preps, which we eventually plan to use at ribbon synapses within slices......to better understand how the presynapse shapes signaling......".

    4. Reviewer #3 (Public review):

      Summary:

      In this study, the authors have developed a new Ca indicator conjugated to a peptide that likely recognizes synaptic ribbons, and have measured microdomain Ca near synaptic ribbons in retinal bipolar cells. This interesting approach allows one to measure Ca close to transmitter release sites, which may be relevant for synaptic vesicle fusion and replenishment. Though microdomain Ca at the active zone of ribbon synapses has been measured by Hudspeth and Moser, the new study uses a peptide recognizing synaptic ribbons, potentially measuring the Ca concentration relatively proximal to the release sites.

      Strengths:

      The study is, in principle, technically well done, and the peptide approach is technically interesting, which allows one to image Ca near the particular protein complexes. The approach is potentially applicable to other types of imaging.

      Weaknesses:

      Peptides may not be entirely specific, and a genetic approach that tags particular active zone proteins with fluorescent Ca indicator proteins may well be more specific. Although the authors are aware of this and the peptide approach is generally used for ribbon synapses, they should keep this limitation in mind when interpreting the results.

    1. eLife Assessment

      The authors take a synthetic approach by introducing synaptic ribbon proteins into HEK cells to analyze how these assemblies cluster calcium channels at the active zone. Using a synapse-naive heterologous expression system and overexpression-based strategy is valuable, as it establishes a promising model for studying molecular interactions at the active zone. The study is built on a solid combination of super-resolution microscopy and electrophysiology, though it currently falls short of replicating the full functional properties of native ribbon synapses and instead resembles a multiprotein complex that partially mimics ribbon-type active zones.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors attempt to reconstitute some active zone properties by introducing synaptic ribbon proteins into HEK cells. This "ground-up" approach can be valuable for assessing the necessity of specific proteins in synaptic function. Here, the authors co-transfect a membrane-targeted bassoon, RBP2, calcium channel subunits and Ribeye to generate what they call "synthetic ribbons". The resultant structures show an ability to cluster calcium channels (Figure 4B) and a modest ability to concentrate calcium entry locations (Figure 7J). At the light level, the Ribeye aggregates look spherical and localize to the membrane through their interaction with the membrane-targeted bassoon, and at the EM level the structures resemble those observed when Ribeye is overexpressed alone. It is a nice proof-of-principle in establishing a useful experimental system for studying calcium channel localization and, with expression of other proteins, perhaps a means of understanding the structure and function of the ribbon. The paper does establish that previously described protein interactions can be reconstituted in a heterologous system and that the addition of Ribeye can increase the size of calcium channel patches via indirect interactions.

      Strengths:

      (1) The authors establish a new experimental system for the study of calcium channel localization to active zones.

      (2) The clustering of calcium channels to bassoon via RBP2 is a nice confirmation, in a cell-based system, of a previously described interaction between bassoon and calcium channels.

      (3) The "ground-up" approach is an attractive one and theoretically allows one to learn a lot about the essential interactions for building a ribbon structure.

      (4) The finding that introducing Ribeye can enhance the size of calcium channel patches is novel and interesting.

      Weaknesses:

      (1) The addition of EM is welcome, but the structures seem to resemble those created by overexpression of Ribeye alone, albeit at the membrane. It is unclear to me whether the interaction with Bsn or indirect interactions with other proteins has any effect on these structures. Also, while the abstract mentions that the size and shape are similar to ribbons, the EM seems to show that the size and shape are quite variable.

      (2) The clustering of channels is accomplished by taking advantage of previously described interactions between RBP2, Ca channels and bassoon. While it is nice to see that it can be reconstituted in a naive cell, the interactions were previously described. The localization of Ribeye to bassoon takes advantage of a previously described interaction between the two and the membrane localization of the complexes required introduction of a membrane-anchoring motif. These factors limit the novelty of the findings.

      (3) The difference in Ca imaging between SyRibbons and other locations is subtle. While there are reasonable explanations for why this could be the case, it may limit the utility of this system for studying Ca-channel-ribbon dynamics moving forward.

    3. Reviewer #2 (Public review):

      Summary:

      The authors show that co-expression of bassoon, RIBEYE, Cav1.3-alpha1, Cav-beta3, Cav-alpha2delta1, and RBP2 in a heterologous system (HEK293 cells) is sufficient to generate a protein complex resembling a presynaptic ribbon-type active zone both in morphology and in function (in clustering voltage-gated Ca channels and creating sites for localized Ca2+ entry). If the 3 separate Cav gene products are taken as a single protein (i.e. a Ca channel), the conclusion is that the core of a ribbon synapse comprises 4 proteins: bassoon holds the RIBEYE-containing ribbon to the plasma membrane, and RBP2 binds to bassoon and Ca channels, tethering the Ca channels to the presynaptic active zone.

      Strengths:

      (1) Good use of a heterologous system with generally appropriate controls provides convincing evidence that a presynaptic ribbon-type active zone (without the ability to support exocytosis), with the ability to support localized Ca2+ entry (a key feature of ribbon-type pre-synapses) can be assembled from a few proteins.

      (2) In the revised manuscript, the authors do a good job of addressing the limitations of their cultured-cell system.

      Weaknesses:

      (1) Relies on over-expression, which almost certainly diminishes the experimentally measured parameters (e.g. pre-synapse clustering, localization of Ca2+ entry).

      (2) Are HEK cells the best model? HEK cells secrete substances and have a well-studied endocytic pathway, but they do not create neurosecretory vesicles. Initially, I asked why the authors did not try to reconstitute a ribbon synapse in a cell that makes neurosecretory vesicles, like a PC12 cell, and the authors addressed this question in their revision.

      (3) Related to 1 and 2: the Ca channel localization observed is significant but not so striking given the presence of Cav protein and measurements of Ca2+ influx distributed across the membrane. Presumably, this is the result of overexpression and an absence of pathways for pre-synaptic targeting of Ca channels. But, still, it was surprising that Ca channel localization was so diffuse. I suppose that the authors tried to reduce the effect of over-expression by using an inducible Cav1.3? Even so, the accessory subunits were constitutively over-expressed.

    4. Reviewer #3 (Public review):

      Summary:

      Ribbon synapses are complex molecular assemblies responsible for synaptic vesicle trafficking in sensory cells of the eye and the inner ear. Ca2+-dependent exocytosis occurs at the active zone (AZ); however, the molecular mechanisms orchestrating the structure and function of the AZs of ribbon synapses are not well understood. To advance the understanding of those mechanisms, the authors present a novel and interesting experimental strategy that pursues the reconstitution of a minimal active zone of a ribbon synapse within a synapse-naïve cell line: HEK293 cells. The authors have used stably transfected HEK293 cells that express voltage-gated Ca2+ channel subunits (constitutive CaV beta3 and CaV alpha2 delta1, and inducible CaV1.3 alpha1). They have expressed in those cells several proteins of the ribbon synapse active zone: (1) RIBEYE, (2) a modified version of Bassoon that binds to the plasma membrane through artificial palmitoylation (Palm-Bassoon) and (3) RIM-binding protein 2 (RBP2) to induce the formation of a minimal active zone that they called SyRibbons. The formation of such structures is convincing; however, the evidence that such structures have a functional impact (for example, enhancing Ca2+ currents), as the authors claim, is weak. In conclusion, the novel approach shows that expression of a multiprotein complex partially reproduces properties, especially structural properties, of ribbon-type active zones in a heterologous system. Although the approach opens interesting possibilities for further experiments, the evidence supporting the functional properties of the so-called "synthetic ribbon synapses" is incomplete.

      Strengths of the study:

      (1) The study is carefully carried out using a remarkable combination of (i) superresolution, correlative light microscopy and cryo-electron tomography, to analyze the formation and subcellular distribution of molecular assemblies, and (ii) functional assessment of voltage-gated Ca2+ channels using patch-clamp recording of Ca2+ currents and fluorometry to correlate Ca2+ influx with the molecular assemblies formed by AZ proteins. The results are of high quality and are in general accompanied by the required control experiments.

      (2) The method opens new opportunities to further investigate the minimal and basic properties of AZ proteins that are difficult to study using in vivo systems. The cells that operate through ribbon synapses (e.g. photoreceptors and hair cells) are particularly difficult to manipulate, so setting up and validating the use of a heterologous system more suitable for molecular manipulations is highly valuable.

      (3) The structures formed by RIBEYE and Palm-Bassoon in HEK293 cells identified by STED nanoscopy and cryo-electron microscopy share relevant similarities with the AZs of ribbon synapses found in rat inner hair cells.

      Weaknesses of the study:

      (1) The functional properties of the "synthetic ribbon-type active zones" have only been assessed through their effect on the modulation of Ca2+ channel function, and that effect is rather weak. The authors provide reasonable explanations for such a weak effect, but it remains difficult to conclude that the "synthetic ribbon-type active zones" are indeed bona fide functional multiprotein complexes.

    5. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      The authors use a synthetic approach to introduce synaptic ribbon proteins into HEK cells and analyze the ability of the resulting assemblies to cluster calcium channels at the active zone. The use of this ground-up approach is valuable as it establishes a system to study molecular interactions at the active zone. The work relies on a solid combination of super-resolution microscopy and electrophysiology, but would benefit from: (i) additional ultrastructural analysis to establish ribbon formation (in the absence of which the claim of these being synthetic ribbons might not be supported); (ii) data quantification (to confirm colocalization of different proteins); (iii) stronger validation of impact on Ca2+ function; (iv) in-depth discussion of problems derived from the use of an over-expression approach.

      We thank the editors and the reviewers for the constructive comments and appreciation of our work. Please find a detailed point-by-point response below. In response to the critique received, we have now (i) included an ultrastructural analysis of the SyRibbons using correlative light microscopy and cryo-electron tomography, (ii) performed quantifications to confirm the colocalisation of the various proteins, (iii) discussed and carefully rephrased our interpretation of the role of the ribbon in modulating Ca<sup>2+</sup> channel function and (iv) discussed concerns regarding the use of an overexpression system.

      Public Reviews:

      Reviewer #1 (Public Review):

      We would like to thank the reviewer for the comments and advice to further improve our manuscript. We have completely overhauled the manuscript taking the suggestions of the reviewer into account.

      (1) Are these truly "synthetic ribbons"? The ribbon synapse is traditionally defined by its morphology at the EM level. To what extent these structures recapitulate ribbons is not shown. It has been previously shown that Ribeye forms aggregates on its own. Do these structures look any more ribbon-like than Ribeye aggregates in the absence of its binding partners?

      We thank reviewer 1 for their constructive feedback and critique of the work. 

      We agree that, traditionally, ribbon synapses have always been defined by the distinct morphology observed at the EM level. However, since the discovery of the core components of ribbons (RIBEYE and Piccolino), confocal and super-resolution imaging of immunofluorescently labelled ribbons has gained importance for analysing ribbon synapses. A correspondence of RIBEYE immunofluorescent structures at the active zone to electron microscopy observations of ribbons has been established in numerous studies (Wong et al, 2014; Michanski et al, 2019, 2023; Maxeiner et al, 2016; Jean et al, 2018), although, to our knowledge, direct correlative approaches have yet to be performed. We have now analysed SyRibbons using cryo-correlative electron-light microscopy. We observe that GFP-positive RIBEYE spots corresponded well with electron-dense structures, as is characteristic for synaptic ribbons (Robertis & Franchi, 1956; Smith & Sjöstrand, 1961; Matthews & Fuchs, 2010). We could also observe SyRibbons within 100 nm of the plasma membrane (see Fig. 3). We have now added this qualitative ultrastructural analysis of SyRibbons in the main manuscript (lines 272 - 294, Fig. 3 and Supplementary Fig. 3).

      (2) No new biology is discovered here. The clustering of channels is accomplished by taking advantage of previously described interactions between RBP2, Ca channels and bassoon. The localization of Ribeye to bassoon takes advantage of a previously described interaction between the two. Even the membrane localization of the complexes required the introduction of a membrane-anchoring motif.

      We respectfully disagree with the overall assessment. Our study emphasizes the synthetic establishment of protein assemblies that mimic key aspects of ribbon-type active zones, defining minimum molecular requirements. Numerous previous studies have described the role of the synaptic ribbon in organising the spatial arrangement of Ca<sup>2+</sup> channels, regulating their abundance and possibly also modulating their physiological properties (Maxeiner et al, 2016; Frank et al, 2010; Jean et al, 2018; Wong et al, 2014; Grabner & Moser, 2021; Lv et al, 2016). We would like to highlight that there remain major gaps between existing in vitro and in vivo data; for instance, no evidence for direct or indirect interactions between Ca<sup>2+</sup> channels and RIBEYE has been demonstrated so far. While we do indeed take advantage of previously known interactions between RIBEYE and Bassoon (tom Dieck et al, 2005); between Bassoon, RBP2 and P/Q-type Ca<sup>2+</sup> channels (Davydova et al, 2014); and between RBP2 and L-type Ca<sup>2+</sup> channels (Hibino et al, 2002), our study tries to bridge these gaps by using a bottom-up approach to establish the indirect link between the synaptic ribbon (RIBEYE) and L-type CaV1.3 Ca<sup>2+</sup> channels, which has previously been only speculative. Our data show how, even in a synapse-naive heterologous expression system, ribbon synapse components assemble Ca<sup>2+</sup> channel clusters and even show a partial localisation of the Ca<sup>2+</sup> signal. Moreover, we argue that the established reconstitution approach provides other interesting insights, such as ground-up evidence supporting the anchoring of the synaptic ribbon by Bassoon.
      Finally, we expect that the established system will serve future studies aimed at deciphering the role of putative CaV1.3 or CaV1.4 interacting proteins in regulating Ca<sup>2+</sup> channels of ribbon synapses by providing a more realistic Ca<sup>2+</sup> channel assembly than has been available in the heterologous expression systems used so far. In response to the reviewer's comment, we have augmented the discussion accordingly.

      (3) The only thing ribbon-specific about these "syn-ribbons" is the expression of ribeye and ribeye does not seem to participate in the localization of other proteins in these complexes. Bsn, Cav1.3 and RBP2 can be found in other neurons.

      The synaptic ribbon made of RIBEYE is the key molecular difference in the AZ ultrastructure of ribbon synapses in the eye and the ear. We hypothesize that the ribbon acts as a super-scaffold that enables AZs with large Ca<sup>2+</sup> channel assemblies and readily releasable pools. In further support of this hypothesis, the present study on synthetic ribbons shows that CaV1.3 Ca<sup>2+</sup> channel clusters are larger in the presence of SyRibbons compared to SyRibbon-less CaV1.3 Ca<sup>2+</sup> channel clusters in tetra-transfected HEK cells (Ca<sup>2+</sup> channels, RBP, membrane-anchored Bassoon, and RIBEYE, Fig. 6). In response to the reviewer's comment, we have now added an analysis of triple-transfected HEK cells (Ca<sup>2+</sup> channels, RBP, membrane-anchored Bassoon), in which CaV1.3 Ca<sup>2+</sup> channel clusters again are significantly smaller than at the SyRibbons and indistinguishable from SyRibbon-less CaV1.3 Ca<sup>2+</sup> channel clusters (Fig. 6E, F).

      (4) As the authors point out, RBP2 is not necessary for some Ca channel clustering in hair cells, yet seems to be essential for clustering to bassoon here.

      Here we would like to clarify that RBP2 is indeed important in inner hair cells for promoting a larger complement of CaV1.3 and RBP2 KO mice show smaller CaV1.3 channel clusters and reduced whole cell and single-AZ Ca<sup>2+</sup> influx amplitudes (Krinner et al, 2017). However, a key point of difference we emphasize on is that even though CaV1.3 clusters appeared smaller, they did not appear broken or fragmented as they do upon genetic perturbation of Bassoon (Frank et al, 2010), RIBEYE (Jean et al, 2018) or Piccolino (Michanski et al, 2023). This highlights how there may be a hierarchy in the spatial assembly of CaV1.3 channels at the inner hair cell ribbon synapse (also described in the discussion section “insights into presynaptic Ca<sup>2+</sup> channel clustering and function”) with proteins like RBP2 regulating abundance of CaV1.3 channels at the synapse and organising them into smaller clusters – what we have termed as “nanoclustering”; while Bassoon and RIBEYE may serve as super-scaffolds further organizing these CaV1.3 nanoclusters into “microclusters”. Observations of fragmented Ca<sup>2+</sup> channel clusters and broader spread of Ca<sup>2+</sup> signal seen upon Ca<sup>2+</sup> imaging in RIBEYE and Bassoon mutants (Jean et al, 2018; Frank et al, 2010; Neef et al, 2018), and the absence of such a phenotype in RBP2 mutants (Krinner et al, 2017) may be explained by such a differential role of these proteins in organising Ca<sup>2+</sup> channel spatial assembly. The data of the present study on reconstituted ribbon containing AZs are in line with these observations in inner hair cells: RBP2 appears important to tether Ca<sup>2+</sup> channels to Bassoon and these AZ-like assemblies are organised to their full extent by the presence of RIBEYE. 
      As mentioned in our response to the reviewer's point 3, we have now further strengthened this point by adding the analysis of SyRibbon-less CaV1.3 Ca<sup>2+</sup> channel clusters in triple-transfected HEK cells (Ca<sup>2+</sup> channels, RBP, membrane-anchored Bassoon, Fig. 6E, F). Moreover, we have revised the discussion accordingly.

      (5) The difference in Ca imaging between SyRibbons and other locations is extremely subtle.

      We agree with the reviewer on the modest increase in Ca<sup>2+</sup> signal amplitude seen in the presence of  SyRibbons and provide the following reasoning for this observation: 

      (i) It is plausible that, due to the overexpression approach, Ca<sup>2+</sup> channels (along with RBP2 and Palm-Bassoon) still show considerably high expression throughout the membrane, even in regions where SyRibbons are not localised. Indeed, this is evident in the images shown in the lower panel in Fig. 6B, where Ca<sup>2+</sup> channel immunofluorescence is distributed across the plasma membrane with larger clusters formed underneath SyRibbons (for an opposing scenario, please see the cell in Fig. 6B upper panel with very localised CaV1.3 distribution underneath SyRibbons). This would of course diminish the difference in the Ca<sup>2+</sup> signals between membrane regions with and without SyRibbons. We note that while the contrast is greater for native synapses, extrasynaptic Ca<sup>2+</sup> channels have been described in numerous studies for hair cells alone (Roberts et al, 1990; Brandt, 2005; Zampini et al, 2010; Wong et al, 2014).

      (ii) Nevertheless, we do not expect a remarkably big difference in Ca<sup>2+</sup> influx due to the presence of SyRibbons in the first place. Ribbon-less AZs in inner hair cells of RIBEYE KO mice showed normal Ca<sup>2+</sup> current amplitudes at the whole-cell and the single-AZ level (Jean et al, 2018). However, it was the spatial spread of the Ca2+ signal at the single-AZ level which appeared to be broader and more diffuse in these mutants in the absence of the ribbon, in contrast to the more confined Ca2+ hotspots seen in the wild-type controls. 

      So, in agreement with these published observations – it appears that presence of SyRibbons helps in spatially confining the Ca<sup>2+</sup> signal by super scaffolding nanoclusters into microclusters (see also our response to points 3 and 4 of the reviewer): this is evident from seeing some spatial confinement of Ca<sup>2+</sup> signals near SyRibbons on top of the diffuse Ca<sup>2+</sup> signal across the rest of the membrane as a result of overexpression in HEK cells. 

      We have now carefully rephrased our interpretation throughout the manuscript and added further explanation in the discussion section.   

      (6) The effect of the expression of palm-Bsn, RBP2 and the combination of the two on Ca-current is ambiguous. It appears that while the combination is larger than the control, it probably isn't significantly different from either of the other two alone (Fig 5). Moreover, expression of Ribeye + the other two showed no effect on Ca current (Figure 7). Also, why is the IV curve right shifted in Figure 7 vs Figure 5?

      We agree with the reviewer that co-expression of palm-Bassoon and RBP2 seems to augment Ca<sup>2+</sup> currents, while the additional expression of RIBEYE results in no change when compared to wild-type controls. We currently do not have an explanation for this observation and would refrain from making any claims without concrete evidence. As the reviewer also correctly pointed out, while the expression of the combination of palm-Bassoon and RBP2 raises Ca<sup>2+</sup> currents, current amplitudes are not significantly different when compared to the individual expression of the two proteins (P > 0.05, Kruskal-Wallis test). In light of this, we have now carefully rephrased our manuscript. Moreover, we would like to thank reviewer 1 for pointing out the right shift in the IV curve, which was due to an error in the values plotted on the x-axis. This has been corrected in the updated version of the manuscript.
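      For readers less familiar with the rank-based comparison named above, the following is a minimal sketch of how a Kruskal-Wallis H statistic can be computed for three groups. It is an illustration under stated assumptions, not the analysis pipeline used for the manuscript: the current amplitudes are invented placeholders, the function names are hypothetical, no tie correction is applied, and significance is judged against the chi-square critical value for two degrees of freedom (5.991 at alpha = 0.05), which is only a rough approximation for small samples.

```python
"""Sketch: Kruskal-Wallis H statistic for three groups, standard library only.

All data are hypothetical placeholders, not values from the study.
"""
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """H statistic (no tie correction) from ranks of the pooled observations."""
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted_sum = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_sum += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted_sum - 3 * (n_total + 1)


# Chi-square critical value for df = 2 (three groups) at alpha = 0.05.
CHI2_CRITICAL_DF2 = 5.991

if __name__ == "__main__":
    # Hypothetical peak current amplitudes (arbitrary units) per condition.
    palm_bassoon = [21.0, 25.5, 19.8, 30.1, 23.4]
    rbp2 = [24.2, 27.9, 22.5, 26.0, 31.3]
    combined = [28.7, 33.0, 25.1, 29.9, 35.6]
    h = kruskal_wallis_h(palm_bassoon, rbp2, combined)
    print(f"H = {h:.3f}, exceeds critical value: {h > CHI2_CRITICAL_DF2}")
```

      A non-significant H, as reported above, indicates only that the rank distributions of the groups cannot be distinguished at the chosen threshold; it does not by itself show that the groups are equivalent.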

      (7) While some of the IHC is quantified, some of it is simply shown as single images. EV2, EV3 and Figure 4a in particular (4b looks convincing enough on its own, but could also benefit from a larger sample size and quantification)

      We have now added quantifications for the colocalisations of the various transfection combinations depicted in the above-mentioned figures collectively in Supplementary Figure 7 and added the corresponding results and methods accordingly. 

      Reviewer #2 (Public Review):

      We would like to thank the reviewer for the comments and advice to further improve our manuscript.

      (1) Relies on over-expression, which almost certainly diminishes the experimentally-measured parameters (e.g. pre-synapse clustering, localization of Ca2+ entry).

      We acknowledge this limitation highlighted by the reviewer arising from the use of an overexpression system and have carefully rephrased our interpretation and discussed possible caveats in the discussion section. 

      (2) Are HEK cells the best model? HEK cells secrete substances and have a well-studied endocytic pathway, but they do not create neurosecretory vesicles. Why didn't the authors try to reconstitute a ribbon synapse in a cell that makes neurosecretory vesicles, like a PC12 cell?

      This is a valid point, and one that we also discussed extensively. We did indeed consider pheochromocytoma cells (PC12 cells) for reconstitution of ribbon-type AZs and performed experiments with them in the initial stages of the project. PC12 cells offer the advantage of providing synaptic-like microvesicles and also endogenously express several components of the presynaptic machinery such as Bassoon, RIM2, ELKS etc (Inoue et al, 2006), such that overexpression of exogenous AZ proteins would have to be limited to RIBEYE only.

      However, a major drawback of PC12 cells as a model is the complex molecular background of these cells. We have also briefly described this in the discussion section (line 615 – 619). Naïve, undifferentiated PC12 cells show highly heterogeneous expression of various CaV channel types (Janigro et al, 1989); however, CaV1.3, the predominant type in ribbon synapses of the ear, does not seem to be expressed in these cells (Liu et al, 1996). Furthermore, our attempts at performing immunostainings against CaV1.3 and at overexpressing CaV1.3 in PC12 cells did not prove successful and we decided to refrain from pursuing this further (data not shown). 

      In contrast, HEK293 cells, being “synapse-naïve”, provide the advantage of serving as a “blank canvas” for performing such reconstitutions, e.g. they lack voltage-gated Ca<sup>2+</sup> channels and multidomain proteins of the active zone. Moreover, an important practical aspect for our choice was the availability of the HEK293 cell line with stable (and inducible) expression of the CaV1.3 Ca<sup>2+</sup> channel complex. Finally, as described in lines 613 – 614 of the discussion section, even though HEK293 cells lack SVs and the molecular machinery required for their release, our work paves the way for future studies which could employ delivery of SV machinery via co-expression (Park et al, 2021) which could then be analyzed by the correlative light and electron microscopy workflow we worked out and added during revision. 

      (3) Related to 1 and 2: the Ca channel localization observed is significant but not so striking given the presence of Cav protein and measurements of Ca2+ influx distributed across the membrane. Presumably, this is the result of overexpression and an absence of pathways for pre-synaptic targeting of Ca channels. But, still, it was surprising that Ca channel localization was so diffuse. I suppose that the authors tried to reduce the effect of over-expression by using an inducible Cav1.3? Even so, the accessory subunits were constitutively over-expressed.

      We agree with the reviewer on the modest increase in Ca<sup>2+</sup> signal amplitude seen in the presence of SyRibbons. Yes, we employed inducible expression of the CaV1.3a subunit and tried to reduce the effect of overexpression by testing different induction times. However, we did not observe any major differences in expression and observed large variability in CaV1.3 expression across cells irrespective of induction duration. At all time points, there were cells with diffuse CaV1.3 localisation also in regions without SyRibbons which likely reduced the contrast of the Ca<sup>2+</sup> signal we observe. We provide the following reasoning for this observation: 

      (i) It is plausible that due to the overexpression approach, Ca<sup>2+</sup> channels (along with RBP2 and PalmBassoon) still show considerable expression along the membrane also in regions where SyRibbons are not localised. Indeed, this is evident in the images shown in the lower panel in Fig. 6B where Ca<sup>2+</sup> channel immunofluorescence is distributed across the plasma membrane with larger clusters formed underneath SyRibbons. This would of course diminish the difference in the Ca<sup>2+</sup> signals between membrane regions with and without SyRibbons. We note that while the contrast is greater for native synapses, extrasynaptic Ca<sup>2+</sup> channels have been described in numerous studies for hair cells alone (Roberts et al, 1990; Brandt, 2005; Zampini et al, 2010; Wong et al, 2014).

      (ii) Nevertheless, we do not expect a striking difference in Ca<sup>2+</sup> influx amplitude due to the presence of SyRibbons in the first place. Ribbon-less AZs in inner hair cells of RIBEYE KO mice showed normal Ca<sup>2+</sup> current amplitudes at the whole-cell and the single-AZ level (Jean et al, 2018). Instead, it was the spatial spread of the Ca<sup>2+</sup> signal at the single-AZ level which appeared to be broader and more diffuse in these mutants in the absence of the ribbon, in contrast to the more confined Ca<sup>2+</sup> hotspots seen in the wildtype controls. 

      So, in agreement with these published observations, it appears that the presence of SyRibbons helps in spatially confining the Ca<sup>2+</sup> signal by super scaffolding nanoclusters into microclusters: this is evident from seeing some spatial confinement of Ca<sup>2+</sup> signals near SyRibbons on top of the diffuse Ca<sup>2+</sup> signal across the rest of the membrane as a result of overexpression in HEK cells. 

      We have now carefully rephrased our interpretation throughout the manuscript and added further explanation in the discussion section.   

      Reviewer #3 (Public Review):

      We would like to thank the reviewer for the comments and advice to further improve our manuscript.

      (1) The results obtained in a heterologous system (HEK293 cells) need to be interpreted with caution. They will importantly speed the generation of models and hypothesis that will, however, require in vivo validation.

      We acknowledge this limitation highlighted by Reviewer 3 arising from the use of an overexpression system and have carefully rephrased our interpretation and discussed possible caveats in the discussion section. We employed inducible expression of the CaV1.3a subunit and tried to reduce the effect of overexpression by testing different induction times. However, we did not observe any major differences in expression and observed large variability in CaV1.3 expression across cells irrespective of induction duration. At all time points, there were cells with diffuse CaV1.3 localisation, even in regions without SyRibbons and this could reduce the contrast of the Ca<sup>2+</sup> signal we observe. We provide the following reasoning for this observation: 

      (i) It is plausible that due to the overexpression approach, Ca<sup>2+</sup> channels (along with RBP2 and PalmBassoon) still show considerable expression along the membrane also in regions where SyRibbons are not localised. Indeed, this is evident in the images shown in the lower panel in Fig. 6B where Ca<sup>2+</sup> channel immunofluorescence is distributed across the plasma membrane with larger clusters formed underneath SyRibbons. This would of course diminish the difference in the Ca<sup>2+</sup> signals between membrane regions with and without SyRibbons. We note that while the contrast is greater for native synapses, extrasynaptic Ca<sup>2+</sup> channels have been described in numerous studies for hair cells alone (Roberts et al, 1990; Brandt, 2005; Zampini et al, 2010; Wong et al, 2014).

      (ii) Nevertheless, we do not expect a striking difference in Ca<sup>2+</sup> influx amplitude due to the presence of SyRibbons in the first place. Ribbon-less AZs in inner hair cells of RIBEYE KO mice showed normal Ca<sup>2+</sup> current amplitudes at the whole-cell and the single-AZ level (Jean et al, 2018). Instead, it was the spatial spread of the Ca<sup>2+</sup> signal at the single-AZ level which appeared to be broader and more diffuse in these mutants in the absence of the ribbon, in contrast to the more confined Ca<sup>2+</sup> hotspots seen in the wildtype controls. 

      So, in agreement with these published observations, it appears that the presence of SyRibbons helps in spatially confining the Ca<sup>2+</sup> signal by super scaffolding nanoclusters into microclusters: this is evident from seeing some spatial confinement of Ca<sup>2+</sup> signals near SyRibbons on top of the diffuse Ca<sup>2+</sup> signal across the rest of the membrane as a result of overexpression in HEK cells. 

      (2) The authors analyzed the distribution of RIBEYE clusters in different membrane compartments and correctly conclude that RIBEYE clusters are not trapped in any of those compartments, but it is soluble instead. The authors, however, did not carry out a similar analysis for Palm-Bassoon. It is therefore unknown if Palm-Bassoon binds to other membrane compartments besides the plasma membrane. That could occur because in non-neuronal cells GAP43 has been described to be in internal membrane compartments. This should be investigated to document the existence of ectopic internal Synribbons beyond the plasma membrane because it might have implications for interpreting functional data in case Ca2+-channels become part of those internal Synribbons.

      In response to this valid concern, we have now included the suggested experiment in Supplementary Figure 1. We investigated the subcellular localisation of Palm-Bassoon and did not find Palm-Bassoon puncta to colocalise with ER, Golgi, or lysosomal markers, which argues against binding to membrane compartments inside the cell. We have added the following sentence in the results section, line 145: “Palm-Bassoon does not appear to localize in the ER, Golgi apparatus or lysosomes (Supplementary Fig 1 D, E and F).”

      (3) The co-expression of RBP2 and Palm-Bassoon induces a rather minor but significant increase in Ca2+-currents (Figure 5). Such an increase does not occur upon expression of (1) Palm-Bassoon alone, (2) RBP2 alone or (3) RIBEYE alone (Figure 5). Intriguingly, the concomitant expression of PalmBassoon, RBP2 and RIBEYE does not translate into an increase of Ca2+-currents either (Figure 7).

      We agree with the reviewer that co-expression of palm-Bassoon and RBP2 seems to augment Ca<sup>2+</sup> currents, while the additional expression of RIBEYE results in no change when compared to wild-type controls. We currently do not have an explanation for this observation and would refrain from making any claims without concrete evidence. We also highlight that, while the expression of the combination of palm-Bassoon and RBP2 raises Ca<sup>2+</sup> currents, current amplitudes are not significantly different when compared to the individual expression of the two proteins (P > 0.05, Kruskal-Wallis test). In light of this, we have now carefully rephrased our MS. 
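      The group comparison referred to above (Kruskal-Wallis test, P > 0.05) can be illustrated with a minimal sketch. It assumes exactly three groups (for example palm-Bassoon alone, RBP2 alone, and their combination), so that the chi-square approximation has 2 degrees of freedom and its tail probability reduces to exp(-H/2). The function names and any input values are hypothetical; this is not the authors' analysis code or data, and the chi-square approximation is only rough for small samples.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_3(g1, g2, g3):
    """Kruskal-Wallis H and approximate p-value for exactly three groups.

    Assumption: with three groups the chi-square approximation has 2 degrees
    of freedom, whose survival function is exp(-H / 2).
    """
    groups = [list(g1), list(g2), list(g3)]
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / group size) over the three groups.
    acc, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        acc += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * acc - 3 * (n + 1)

    # Standard correction for tied observations.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if tie == 0:  # every observation identical: no evidence of a difference
        return 0.0, 1.0
    h /= tie
    return h, math.exp(-h / 2)
```

      Calling `kruskal_wallis_3(a, b, c)` on three lists of current amplitudes returns the H statistic and an approximate p-value; a value above 0.05 would correspond to the "not significantly different" statement in the response.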

      (4) The authors claim that Ca2+-imaging reveals increased Ca2+-signal intensity at synthetic ribbon-type AZs. That claim is a subject of concern because the increase is rather small and it does not correlate with an increase in Ca2+-currents.

      Thanks for the comment: please see our response to your first comment and the lines 585 – 610 in the discussion section.

      Recommendations for the authors:  

      Reviewer #2 (Recommendations For The Authors):

      (1) The authors should have a better discussion of problems derived from over-expression.

      Done. Please see above. 

      (2) Ideally, the authors would repeat the study using a secretory cell line, but this is of course not possible. The idea could be brought forth, though.

      As described above in our response to the public review of reviewer 2, we have discussed this idea in the discussion section (refer to lines 615 – 619), emphasizing both the advantages and the limitations of using a secretory cell line (e.g. PC12 cells) instead of HEK293 cells as a model for performing such reconstitutions. 

      Reviewer #3 (Recommendations For The Authors):

      (1) There are several figures in which colocalization between different proteins is studied only displaying images but without any quantitative data. This should be corrected by providing such a quantitative analysis.

      We have now added quantifications for the colocalisations of the various transfection combinations depicted in the above-mentioned figures collectively in Supplementary Figure 7 and added the corresponding results and methods accordingly. 

      (2) The small increase in Ca2+-currents and Ca2+-influx associated with the clustering of Ca2+-channels to Synribbons is a concern. The authors should discuss whether such a minor increase (found only when Palm-Bassoon and RBP2 are co-expressed) would have physiological consequences in an actual synapse. They might compare those results with results obtained in genetically modified mice in which Ca2+-currents are affected upon the removal of AZ proteins. On the other hand, they should explain why Ca2+-currents do not increase when the Synribbons are formed by RIBEYE, Palm-Bassoon and RBP2.

      Done. Please see above. 

      (3) The description of the patch-clamp experiments should be enriched by including representative currents. Did the authors measure tail currents?

      We would like to thank the reviewer for the valuable suggestion and have now added representative currents to the figures (see Supplementary Figure 5B). We agree with the reviewer on the importance of further characterizing the Ca<sup>2+</sup> currents in the presence and absence of SyRibbons by analysis of tail currents for counting the number of Ca<sup>2+</sup> channels by non-stationary fluctuation analysis but consider this to be out of scope of the current study and an objective for future studies. 

      (4) The current displayed in Figure 7 E should be explained better.

      Previous studies have shown that Ca<sup>2+</sup>-binding proteins (CaBPs) compete with Calmodulin to reduce Ca<sup>2+</sup>-dependent inactivation (CDI) and promote sustained Ca<sup>2+</sup> influx in Inner Hair Cells (Cui et al, 2007; Picher et al, 2017). In the absence of CaBPs, CaV1.3-mediated Ca<sup>2+</sup> currents show more rapid CDI, as is the case here upon heterologous expression in HEK cells (Koschak et al, 2001; see also Picher et al, 2017, where co-expression of CaBP2 with CaV1.3 inhibits CDI in HEK293 cells). The inactivation kinetics of CaV1.3 are also regulated by the subunit composition (Cui et al, 2007) and modulated by interaction partners; given the reconstitution used here, we do not find the currents very surprising. 

      (5) Is the difference in Ca2+-influx still significantly higher upon the removal of the maximum value measured in positive Syribbons spots (Figure 7, panel K)?

      Yes, on removing the maximum value, the P value increases from 0.01 to 0.03 but remains statistically significant. 
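      The kind of sensitivity check described here, re-running a two-group comparison after dropping the largest value from one group, can be sketched as follows. The statistical test used for Figure 7K is not stated in this response, so the sketch substitutes an exact two-sided permutation test on the difference in means as a stand-in; the function names are hypothetical, and exhaustive enumeration is only practical for small samples.

```python
from itertools import combinations


def exact_permutation_p(a, b):
    """Two-sided exact permutation p-value for the difference in group means.

    Every way of splitting the pooled values into groups of the original
    sizes is enumerated, so this is feasible for small samples only.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    hits = count = 0
    for idx in combinations(range(len(pooled)), n_a):
        s = sum(pooled[i] for i in idx)
        diff = abs(s / n_a - (total - s) / n_b)
        # Small tolerance guards against floating-point round-off.
        if diff >= observed - 1e-12:
            hits += 1
        count += 1
    return hits / count


def p_with_and_without_max(a, b):
    """Return (p using all of a, p after removing the largest value of a)."""
    trimmed = sorted(a)[:-1]
    return exact_permutation_p(a, b), exact_permutation_p(trimmed, b)
```

      Here `a` would hold the values measured at SyRibbon-positive spots and `b` those at SyRibbon-negative spots; if both returned p-values stay below the chosen threshold, the conclusion does not hinge on the single largest observation.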

      (6) In summary, although the approach pioneered by the authors is exciting and provides relevant results, there is a major concern regarding the interpretation of the modulation of Ca2+ channels.

      We have now carefully rephrased our interpretation on the modulation of Ca<sup>2+</sup> channels.  

      References

      Brandt A (2005) Few CaV1.3 Channels Regulate the Exocytosis of a Synaptic Vesicle at the Hair Cell Ribbon Synapse. Journal of Neuroscience 25: 11577–11585

      Cui G, Meyer AC, Calin-Jageman I, Neef J, Haeseleer F, Moser T & Lee A (2007) Ca2+-binding proteins tune Ca2+-feedback to Cav1.3 channels in mouse auditory hair cells. The Journal of Physiology 585: 791–803

      Davydova D, Marini C, King C, Klueva J, Bischof F, Romorini S, Montenegro-Venegas C, Heine M, Schneider R, Schröder MS, et al (2014) Bassoon specifically controls presynaptic P/Q-type Ca(2+) channels via RIM-binding protein. Neuron 82: 181–194

      tom Dieck S, Altrock WD, Kessels MM, Qualmann B, Regus H, Brauner D, Fejtová A, Bracko O, Gundelfinger ED & Brandstätter JH (2005) Molecular dissection of the photoreceptor ribbon synapse: physical interaction of Bassoon and RIBEYE is essential for the assembly of the ribbon complex. J Cell Biol 168: 825–836

      Frank T, Rutherford MA, Strenzke N, Neef A, Pangršič T, Khimich D, Fejtova A, Gundelfinger ED, Liberman MC, Harke B, et al (2010) Bassoon and the synaptic ribbon organize Ca2+ channels and vesicles to add release sites and promote refilling. Neuron 68: 724–738

      Grabner CP & Moser T (2021) The mammalian rod synaptic ribbon is essential for Cav channel facilitation and ultrafast synaptic vesicle fusion. eLife 10: e63844

      Hibino H, Pironkova R, Onwumere O, Vologodskaia M, Hudspeth AJ & Lesage F (2002) RIM-binding proteins (RBPs) couple Rab3-interacting molecules (RIMs) to voltage-gated Ca2+ channels. Neuron 34: 411–423

      Inoue E, Deguchi-Tawarada M, Takao-Rikitsu E, Inoue M, Kitajima I, Ohtsuka T & Takai Y (2006) ELKS, a protein structurally related to the active zone protein CAST, is involved in Ca2+-dependent exocytosis from PC12 cells. Genes to Cells 11: 659–672

      Janigro D, Maccaferri G & Meldolesi J (1989) Calcium channels in undifferentiated PC12 rat pheochromocytoma cells. FEBS Letters 255: 398–400

      Jean P, Morena DL de la, Michanski S, Tobón LMJ, Chakrabarti R, Picher MM, Neef J, Jung S, Gültas M, Maxeiner S, et al (2018) The synaptic ribbon is critical for sound encoding at high rates and with temporal precision. Elife 7: e29275

      Koschak A, Reimer D, Huber I, Grabner M, Glossmann H, Engel J & Striessnig J (2001) alpha 1D (Cav1.3) subunits can form l-type Ca2+ channels activating at negative voltages. J Biol Chem 276: 22100–22106

      Krinner S, Butola T, Jung S, Wichmann C & Moser T (2017) RIM-Binding Protein 2 Promotes a Large Number of CaV1.3 Ca2+-Channels and Contributes to Fast Synaptic Vesicle Replenishment at Hair Cell Active Zones. Front Cell Neurosci 11: 334

      Liu H, Felix R, Gurnett CA, De Waard M, Witcher DR & Campbell KP (1996) Expression and Subunit Interaction of Voltage-Dependent Ca2+ Channels in PC12 Cells. J Neurosci 16: 7557–7565

      Lv C, Stewart WJ, Akanyeti O, Frederick C, Zhu J, Santos-Sacchi J, Sheets L, Liao JC & Zenisek D (2016) Synaptic Ribbons Require Ribeye for Electron Density, Proper Synaptic Localization, and Recruitment of Calcium Channels. Cell Reports 15: 2784–2795

      Matthews G & Fuchs P (2010) The diverse roles of ribbon synapses in sensory neurotransmission. Nat Rev Neurosci 11: 812–822

      Maxeiner S, Luo F, Tan A, Schmitz F & Südhof TC (2016) How to make a synaptic ribbon: RIBEYE deletion abolishes ribbons in retinal synapses and disrupts neurotransmitter release. The EMBO Journal 35: 1098–1114

      Michanski S, Kapoor R, Steyer AM, Möbius W, Früholz I, Ackermann F, Gültas M, Garner CC, Hamra FK, Neef J, et al (2023) Piccolino is required for ribbon architecture at cochlear inner hair cell synapses and for hearing. EMBO Rep 24: e56702

      Michanski S, Smaluch K, Steyer AM, Chakrabarti R, Setz C, Oestreicher D, Fischer C, Möbius W, Moser T, Vogl C, et al (2019) Mapping developmental maturation of inner hair cell ribbon synapses in the apical mouse cochlea. PNAS 116: 6415–6424

      Neef J, Urban NT, Ohn T-L, Frank T, Jean P, Hell SW, Willig KI & Moser T (2018) Quantitative optical nanophysiology of Ca2+ signaling at inner hair cell active zones. Nat Commun 9: 290

      Park D, Wu Y, Lee S-E, Kim G, Jeong S, Milovanovic D, Camilli PD & Chang S (2021) Cooperative function of synaptophysin and synapsin in the generation of synaptic vesicle-like clusters in non-neuronal cells. Nat Commun 12

      Picher MM, Gehrt A, Meese S, Ivanovic A, Predoehl F, Jung S, Schrauwen I, Dragonetti AG, Colombo R, Camp GV, et al (2017) Ca2+-binding protein 2 inhibits Ca2+-channel inactivation in mouse inner hair cells. PNAS 114: E1717–E1726

      Robertis ED & Franchi CM (1956) Electron Microscope Observations on Synaptic Vesicles in Synapses of the Retinal Rods and Cones. J Biophys Biochem Cytol 2: 307–318

      Roberts WM, Jacobs RA & Hudspeth AJ (1990) Colocalization of ion channels involved in frequency selectivity and synaptic transmission at presynaptic active zones of hair cells. J Neurosci 10: 3664–3684

      Smith CA & Sjöstrand FS (1961) A synaptic structure in the hair cells of the guinea pig cochlea. Journal of Ultrastructure Research 5: 184–192

      Wong AB, Rutherford MA, Gabrielaitis M, Pangršič T, Göttfert F, Frank T, Michanski S, Hell S, Wolf F, Wichmann C, et al (2014) Developmental refinement of hair cell synapses tightens the coupling of Ca2+ influx to exocytosis. EMBO J 33: 247–264

      Zampini V, Johnson SL, Franz C, Lawrence ND, Münkner S, Engel J, Knipper M, Magistretti J, Masetto S & Marcotti W (2010) Elementary properties of CaV1.3 Ca(2+) channels expressed in mouse cochlear inner hair cells. J Physiol 588: 187–199

    1. eLife Assessment

      The reported cryo-EM imaging of a pentameric ligand-gated ion channel in liposomes as opposed to nanodiscs has both broad implications and contributes valuable methodological advances to the structural investigation of membrane receptors. The comparison of structures assigned to distinct functional states in liposomes versus nanodiscs is convincing and will aid membrane protein structural biologists in selection of functionally relevant membrane reconstitution environments.

    2. Reviewer #1 (Public review):

      Summary:

      The authors, Dalal et al., determined cryo-EM structures of open, closed, and desensitized states of the pentameric ligand-gated ion channel ELIC reconstituted in liposomes, and compared them to structures determined in varying nanodisc diameters. They argue that the liposomal reconstitution method is more representative of functional ELIC channels, as they were able to test and recapitulate channel kinetics through a stopped-flow thallium flux liposomal assay. The authors and others have described channel interactions with membrane scaffold proteins (MSP), initially thought to occur in a size-dependent manner. However, the authors reported that their cryo-EM ELIC structure interacts with the large nanodisc spNW25, contrary to their original hypotheses. This suggests that the channel's interactions with MSPs might alter its structure, possibly influencing the functional states of the channel. Thus, the authors argue that reconstitution in liposomes is more representative of the native structure and can recapitulate all channel states.

      Strengths:

      Cryo-EM structural determination from proteoliposomes is a promising methodology within the ion channel field due to their large surface area and lack of MSP or other membrane mimetics that could alter channel structure. The authors succeeded in comparing structures determined in liposomes to those in a wide range of nanodisc diameters. This comparison gives rise to important discussions for other membrane protein structural studies when deciding the best method for individual circumstances.

      Weaknesses:

      The overarching goal of the study was to determine structural differences of ELIC in detergent nanodiscs and liposomes. The authors stated they determined open, closed, and desensitized states of ELIC reconstituted in liposomes and suggest the desensitization gate is at the 9' region of the pore. However, limited functional data were provided when determining the functional states of the channel, with most of the evidence deriving from structures, which only provide snapshots of channels.

    3. Reviewer #2 (Public review):

      Summary

      The report by Dalal and colleagues introduces a significant novelty in the field of pentameric ligand-gated ion channels (pLGICs). Within this family of receptors, numerous structures are available, but a widely recognised problem remains in assigning structures to functional states observed in biological membranes. Here, the authors obtain both structural and functional information of a pLGIC in a liposome environment. The model receptor ELIC is captured in the resting, desensitised and open states. Structures in large nanodiscs, possibly biased by receptor-scaffold protein interactions, are also reported. Altogether these results set the stage for the adoption of liposomes as a proxy for the biological membranes, for cryoEM studies of pLGICs and membrane proteins in general.

      Strengths

      The structural data is comprehensive, with structures in liposomes in the 3 main states (and for each, both inward-facing and outward-facing), and an agonist-bound structure in the large spNW25 nanodisc (and a retreatment of previous data obtained in a smaller disc). It adds up to a series of work from the same team that constitutes a much-needed exploration of various types of environment for the transmembrane domain of pLGICs. The structural analysis is thorough.

      The tone of the report is particularly pleasant, in the sense that the authors' claims are not inflated. For instance, a sentence such as "By performing structural and functional characterization under the same reconstitution conditions, we increase our confidence in the functional annotation of these structures." is exemplary.

      Weakness

      All the details necessary to reproduce the work are present in the Methods. Nevertheless, the biochemistry might have been shown and discussed in greater detail. While I do believe that liposomes will be in most cases better than, say, nanodiscs, the process that leads from the protein in its membrane down to the liposome will play a big role in preserving the native structure.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors, Dalal et al., determined cryo-EM structures of open, closed, and desensitized states of the pentameric ligand-gated ion channel ELIC reconstituted in liposomes, and compared them to structures determined in varying nanodisc diameters. They argue that the liposomal reconstitution method is more representative of functional ELIC channels, as they were able to test and recapitulate channel kinetics through a stopped-flow thallium flux liposomal assay. The authors and others have described channel interactions with membrane scaffold proteins (MSP), initially thought to occur in a size-dependent manner. However, the authors reported that their cryo-EM ELIC structure interacts with the large nanodisc spNW25, contrary to their original hypotheses. This suggests that the channel's interactions with MSPs might alter its structure, so that it may not accurately reflect functional states of the channel.

      Strengths:

      Cryo-EM structural determination from proteoliposomes is a promising methodology within the ion channel field due to their large surface area and lack of MSP or other membrane mimetics that could alter channel structure. Comparing liposomal ELIC to structures in various-sized nanodiscs gives rise to important discussions for other membrane protein structural studies when deciding the best method for individual circumstances.

      Weaknesses:

      The overarching goal of the study was to determine structural differences of ELIC in detergent nanodiscs and liposomes. Including comparisons of the results to the native bacterial lipid environment would provide a more encompassing discussion of how the determined liposome structures might or might not relate to the native receptor in its native environment. The authors stated they determined open, closed, and desensitized states of ELIC reconstituted in liposomes and suggest the desensitization gate is at the 9' region of the pore. However, no functional studies were performed to validate this statement.

      The goal of this study was to determine structures of ELIC in the same lipid environment in which its function is characterized. However, it is also worth noting that phosphatidylethanolamine and phosphatidylglycerol, two lipids used for the liposome formation, are necessary for ELIC function (PMID 36385237) and are principal lipid components of the gram-negative bacterial membranes in which ELIC is expressed.

      The desensitized structure of ELIC in liposomes shows a pore diameter at the hydrophobic L240 (9’) residue of 3.3 Å, which is anticipated to pose a large energetic barrier to the passage of ions due to the hydrophobic effect. We have included a graphical representation of pore diameters from the HOLE analysis for all liposome structures in Supplementary Figure 6B. While we have not tested the role of L240 in desensitization with functional experiments, it was shown by Gonzalez-Gutierrez and colleagues (PMID 22474383) that the L240A mutation apparently eliminates desensitization in ELIC. This finding is consistent with L240 (9’) being the desensitization gate of ELIC. We have referenced this study when discussing the desensitization gate in the Results.

      Reviewer #2 (Public review):

      Summary

      The report by Dalal and colleagues introduces a significant novelty in the field of pentameric ligand-gated ion channels (pLGICs). Within this family of receptors, numerous structures are available, but a widely recognised problem remains in assigning structures to functional states observed in biological membranes. Here, the authors obtain both structural and functional information of a pLGIC in a liposome environment. The model receptor ELIC is captured in the resting, desensitized, and open states. Structures in large nanodiscs, possibly biased by receptor-scaffold protein interactions, are also reported. Altogether, these results set the stage for the adoption of liposomes as a proxy for the biological membranes, for cryoEM studies of pLGICs and membrane proteins in general.

      Strengths

      The structural data is comprehensive, with structures in liposomes in the 3 main states (and for each, both inward-facing and outward-facing), and an agonist-bound structure in the large spNW25 nanodisc (and a retreatment of previous data obtained in a smaller disc). It adds up to a series of work from the same team that constitutes a much-needed exploration of various types of environment for the transmembrane domain of pLGICs. The structural analysis is thorough.

      The tone of the report is particularly pleasant, in the sense that the authors' claims are not inflated. For instance, a sentence such as "By performing structural and functional characterization under the same reconstitution conditions, we increase our confidence in the functional annotation of these structures." is exemplary.

      Weaknesses

      Core parts of the method are not described and/or discussed in enough detail. While I do believe that liposomes will be, in most cases, better than, say, nanodiscs, the process that leads from the protein in its membrane down to the liposome will play a big role in preserving the native structure, and should be an integral part of the report. Therefore, I strongly felt that the biochemistry should be better described and discussed. The results section starts with "Optimal reconstitution of ELIC in liposomes [...] was achieved by dialysis". There is no information on why dialysis is optimal, what it was compared to, the distribution of liposome sizes using different preparation techniques, etc. Reading the title, I would have expected a couple of paragraphs and figure panels on liposome reconstitution. Similarly, potential biochemical challenges are not discussed. The methods section mentions that the sample was "dialyzed [...] over 5-7 days". In such a time window, most of the members of this protein family would aggregate, and it is therefore a protocol that cannot be directly generalised. This has to be mentioned explicitly, and a discussion of why this can't be done in two days and what else the authors tested (biobeads? ... ?) would strengthen the manuscript.

      To a lesser extent, the relative lack of both technical details and of a broad discussion also pertains to the cryoEM and thallium flux results. Regarding the cryoEM part, the authors focus their analysis on reconstructions from outward-facing particles on the basis of their better resolutions, yet there was little discussion about it. Is it common for liposome-based structures? Are inward-facing reconstructions worse because of the increased background due to electrons going through two membranes? Are there often impurities inside the liposomes (we see some in the figures)? The influence of the membrane mimetics on conformation could be discussed by referring to other families of proteins where it has been explored (for instance, ABC transporters, but I'm sure there are many other examples). If there are studies in other families of channels in liposomes that were inspirational, those could be mentioned. Regarding thallium flux assays, one argument is that they give access to kinetics and set the stage for time-resolved cryoEM, but if I did not miss it, no comparison of kinetics with other techniques, such as electrophysiology, nor references to eventual pioneer time-resolved studies are provided.

      Altogether, in my view, an updated version would benefit from insisting on every aspect of the methodological development. I may well be wrong, but I see this paper more like a milestone on sample prep for cryoEM imaging than being about the details of the ELIC conformations.

      Additions have been made to the Results and Discussion sections elaborating on the following points: 1) reconstitution of ELIC in liposomes using dialysis, the advantage of this over other methods such as biobeads, and whether the dialysis protocol can be shortened for other less stable proteins; 2) the issue of separating outward- and inward-facing channels; 3) referencing the effect of nanodiscs on ABC transporters, structures of membrane proteins in liposomes, and pioneering time-resolved cryo-EM studies; and 4) comparison of ELIC gating kinetics with electrophysiology measurements. With regard to the first point, it should be noted that all necessary details are provided in the Methods to reproduce the experiments including the reconstitution and stopped-flow thallium flux assay. It is also important to note that the same preparation for making proteoliposomes was used for assessing function using the stopped-flow thallium flux assay and for determining the structure by cryo-EM. This is now stated in the Results.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Major revisions:

      (1) The authors suggest that the desensitization gate is located at the 9' region within the pore. However, as stated by the authors, the 2' residues function as the desensitization gate in related channels. In a few of their HOLE analyzed structures (e.g. Figure 2B and 4B), there seems to be a constriction also at 2', but this finding is not discussed in the context of desensitization. Further functional testing of mutated 9' and/or 2' gates would bolster the argument for the location of the desensitization gate.

      As stated above, we have included HOLE plots of pore radius in Supplementary Fig. 6B and referenced the study showing that the L240A mutation (9’) in ELIC (PMID 22474383) appears to eliminate desensitization. This result along with the narrow pore diameter at 9’ in the desensitized structure suggests that 9’ is likely a desensitization gate in ELIC. In contrast, mutation of Q233 (2’) to a cysteine in a previous study produced a channel that still desensitizes (PMID 25960405). Since Q233 is a hydrophilic residue in contrast to L240, Q233 probably does not pose the same energetic barrier to ion translocation as L240 based on the structure.

      (2) In discussing functional states of ELIC and ELIC5 in different reconstitution methods, the authors reference constriction sites determined by HOLE analysis software. These constriction sites were key evidence for the authors to determine functional state, however, it is difficult to discern pore sizes based on the figures. Pore diameters and clear color designation (ie, green vs orange) with the figures would greatly aid their discussions.

      HOLE plots are displayed in Supplementary Fig. 6B and pore diameters are now provided in the text.

      (3) The authors had an intriguing finding that ELIC dimers are found in spNW25 scaffolds. Is there any functional evidence to suggest they could be functioning as dimers?

      There is no evidence that the function of ELIC or other pLGICs is altered by the formation of dimers of pentamers. Therefore, while this result is intriguing and likely facilitated by concentrating multiple ELIC pentamers within the nanodisc, it is not clear if these interactions have any functional importance. We have stated this in the Results.

      (4) Thallium flux assay to validate channel function within proteoliposomes. Proteoliposomes are known to be generally very leaky membranes; it would be good to have controls without ELIC added to determine baseline changes in fluorescence.

      We have established from multiple previous studies that liposomes composed of 2:1:1 POPC:POPE:POPG (PMID 36385237 and 31724949) do not show significant thallium flux as measured by the stopped-flow assay (PMID 29058195) in the absence of ELIC activity. Furthermore, in the present study, the data in Fig. 1A of WT ELIC shows a low thallium flux rate 60 seconds after exposure to agonist when the ion channel has mostly desensitized. Therefore, this data serves also as a control indicating that the high thallium flux rates in response to agonist (at earlier delay times) are not due to leak, but rather due to ELIC channel activity.

      Minor revisions:

      (1) Abstract and introduction. 'Liganded' should be ligand

      We removed this word and changed it to “agonist-bound” for consistency throughout the manuscript.

      (2) Inconsistent formatting of FSC graphs in Supplemental Figure 4

      The difference is a consequence of the different formatting between cryoSPARC and Relion FSC graphs.

      Reviewer #2 (Recommendations for the authors):

      Minor writing remarks:

      The present report builds on previous work from the same team, and to my eye it would be a plus if this were conveyed more explicitly. I see it as a strength to explore various developments in several papers that complement each other. E.g., in the introduction when citing reference 12 (Dalal 2024), and later when introducing ref 15 (Petroff 2022), I wish I was reminded of the main findings and how they fit with the new results.

      We have expanded on the Results and Discussion detailing key findings from these studies that are relevant to the current study.

      Suggestions for analysis:

      Data treatment. Maybe I missed it, but I wondered if C1 vs C5 treatment of the liposome data showed any interesting differences? When I think about the biological membrane, I picture it as a very crowded place with lots of neighbouring proteins. I would not be surprised if, similarly to what they do in discs, the receptor would tend to stick to, or bump into, anything present also in liposomes (a neighboring liposome, some undefined density inside the liposome).

      We attempted to perform C1 heterogeneous refinement jobs in cryoSPARC and C1 3D classification in Relion5. For the WT datasets, these did not produce 3D reconstructions that were of sufficient quality for further refinement. For ELIC5 with agonist, the C1 reconstructions were no different from the C5 reconstructions. Furthermore, there was no evidence of dimers of pentamers from the 2D or 3D treatments, unlike what was observed in the spNW25 nanodiscs. This is likely because the density of ELIC pentamers in the liposomes was too low to capture these transient interactions. We have included this information in the Methods.

      In data treatment, we sometimes find only what we're looking for. I wondered if the authors tried to find, for instance, the open and D conformations in the resting dataset during classifications.

      This is an interesting question since some population of ELIC channels could visit a desensitized conformation in the absence of agonist and this would not be detected in our flux assay. After extensive heterogeneous refinement jobs in cryoSPARC and 3D classification jobs in Relion5, we did not detect any unexpected structures such as open/desensitized conformations in the apo dataset.

      In the analysis of the M4 motions, is there info to be gained by looking at how it interacts with the rest of the TMD? For instance, I wondered if the buried surface area between M4 and the rest was changed. One could also imagine looking at M4 separately in outward-facing and inward-facing conformations (because the tension due to the bilayer will not be the same in the outer layer in both orientations; intuitively, I'd expect different levels of M4 motions).

      We have expanded our analysis of the structures as recommended. We determined the buried surface area between M4 and the rest of the channel in the liganded WT and ELIC5 structures in liposomes and nanodiscs, as well as the area between the TMD interfaces for these structures. There appears to be a pattern where liposome structures show less buried surface area between M4 and the rest of the channel, and less area at the TMD interfaces. Overall, this suggests that the liposome structures of ELIC in the open-channel or desensitized conformations are more loosely packed in the TMD compared to the nanodisc structures.

      We have also further discussed the issue of separating outward- and inward-facing conformations in the Results. The problem with classifying outward- and inward-facing orientations is that top/down or tilted views of the particles cannot be easily distinguished as coming from channels in one orientation or the other, unless there are conformational differences between outward- and inward-facing channels that would allow for their separation during 3D heterogeneous refinement or 3D classification. Furthermore, since the inward-facing reconstructions are of much lower resolution than the outward-facing reconstructions, we suspect that these particles are more heterogeneous possibly containing junk, multiple conformations, or particles that are both inward- and outward-facing. On the other hand, the outward-facing structures are of good quality, and therefore we are more confident that these come from a more homogeneous set of particles that are likely outward-facing (Note that most particles are outward facing based on side views of the 2D class averages). That said, when examining the conformation of M4 in outward- and inward-facing structures, we do not see any significant differences with the caveat that the inward-facing structures are of poor quality and that inward- and outward-facing particles may not have been well-separated.

    1. eLife Assessment

      This study makes the valuable claim that people track, specifically, the elasticity of control (that is, the degree to which outcome depends on how many resources - such as money - are invested), and that control elasticity is impaired in certain types of psychopathology. A novel task is introduced that provides solid evidence that this learning process occurs and that human behavior is sensitive to changes in the elasticity of control. Evidence that elasticity inference is distinct from more general learning mechanisms and is related to psychopathology remains incomplete.

    2. Reviewer #1 (Public review):

      Summary:

      The authors investigated the elasticity of controllability by developing a task that manipulates the probability of achieving a goal with a baseline investment (which they refer to as inelastic controllability) and the probability that additional investment would increase the probability of achieving a goal (which they refer to as elastic controllability). They found that a computational model representing the controllability and elasticity of the environment accounted better for the data than a model representing only the controllability. They also found that prior biases about the controllability and elasticity of the environment were associated with a composite psychopathology score. The authors conclude that elasticity inference and bias guide resource allocation.

      Strengths:

      This research takes a novel theoretical and methodological approach to understanding how people estimate the level of control they have over their environment and how they adjust their actions accordingly. The task is innovative and both it and the findings are well-described (with excellent visuals). They also offer thorough validation for the particular model they develop. The research has the potential to theoretically inform understanding of control across domains, which is a topic of great importance.

      Weaknesses:

      In its revised form, the manuscript addresses most of my previous concerns. The main remaining weakness pertains to the analyses aimed at addressing my suggestion of Bayesian updating as an alternative to the model proposed by the authors. My suggestion was to assume that people perform a form of function approximation to relate resource expenditure to success probability. The authors performed a version of this where people were weighing evidence for a few canonical functions (flat, step, linear), and found that this model underperformed theirs. However, this Bayesian model is quite constrained in its ability to estimate the function relating resources to success. A more robust test would be to assume a more flexible form of updating that is able to capture a wide range of distributions (e.g., using basis functions, Gaussian processes, or nonparametric estimators; see, e.g., work by Griffiths on human function learning). The benefit of testing this type of model is that it would make contact with a known form of inference that individuals engage in across various settings and therefore could offer a more parsimonious and generalizable account of function learning, whereby learning of resource elasticity is a special case. I defer to the authors as to whether they'd like to pursue this direction, but if not I think it's still important that they acknowledge that they are unable to rule out a more general process like this as an alternative to their model. This pertains also to inferences about individual differences, which currently hinge on their preferred model being the most parsimonious.
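
One very simple instance of the more flexible estimator suggested above is to track a separate Beta posterior over success probability at each investment level, without committing in advance to a flat, step, or linear shape. The sketch below illustrates only that idea; it is not the model the authors tested, nor necessarily the one the reviewer has in mind. The class name, the uniform Beta(1, 1) prior, and the example outcomes are all assumptions made for illustration.

```python
from collections import defaultdict


class TabularSuccessEstimator:
    """Hypothetical estimator: one Beta-Bernoulli posterior per ticket count."""

    def __init__(self, prior_a: float = 1.0, prior_b: float = 1.0):
        # Beta(prior_a, prior_b) prior on success probability (assumed uniform).
        self.prior_a = prior_a
        self.prior_b = prior_b
        self.successes = defaultdict(int)
        self.failures = defaultdict(int)

    def update(self, tickets: int, success: bool) -> None:
        """Record one observed outcome at a given investment level."""
        if success:
            self.successes[tickets] += 1
        else:
            self.failures[tickets] += 1

    def posterior_mean(self, tickets: int) -> float:
        """Posterior mean success probability at this investment level."""
        a = self.prior_a + self.successes[tickets]
        b = self.prior_b + self.failures[tickets]
        return a / (a + b)


# Made-up outcomes, for illustration only: (tickets purchased, success?)
est = TabularSuccessEstimator()
for tickets, success in [(1, False), (1, False), (3, True), (3, True), (3, True)]:
    est.update(tickets, success)

for tickets in (1, 2, 3):
    print(tickets, est.posterior_mean(tickets))
```

Because each level is estimated independently, the untested level (2 tickets) stays at the prior mean. Basis functions or a Gaussian process would instead share information across levels, which is the kind of generalization the suggestion above points to.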

    3. Reviewer #2 (Public review):

      Summary:

      In this paper, the authors test whether controllability beliefs and associated actions/resource allocation are modulated by things like time, effort, and monetary costs (what they call "elastic" as opposed to "inelastic" controllability). Using a novel behavioral task and computational modeling, they find that participants do indeed modulate their resources depending on whether they are in an "elastic," "inelastic," or "low controllability" environment. The authors also find evidence that psychopathology is related to specific biases in controllability.

      Strengths:

      This research investigates how people might value different factors that contribute to controllability in a creative and thorough way. The authors use computational modeling to try to dissociate "elasticity" from "overall controllability," and find some differential associations with psychopathology. This was a convincing justification for using modeling above and beyond behavioral output and yielded interesting results. Notably, the authors conclude that these findings suggest that biased elasticity could distort agency beliefs via maladaptive resource allocation. Overall, this paper reveals important findings about how people consider components of controllability.

      Weaknesses:

      The authors have gone to great lengths to revise the manuscript to clarify their definitions of "elastic" and "inelastic" and bolster evidence for their computational model, resulting in an overall strong manuscript that is valuable for elucidating controllability dynamics and preferences. One minor weakness is that the justification for the analysis technique for the relationships between the model parameters and the psychopathology measures remains lacking given the fact that simple correlational analyses did not reveal any significant associations.

    4. Reviewer #3 (Public review):

      A bias in how people infer the amount of control they have over their environment is widely believed to be a key component of several mental illnesses including depression, anxiety, and addiction. Accordingly, this bias has been a major focus in computational models of those disorders. However, all of these models treat control as a unidimensional property, roughly, how strongly outcomes depend on action. This paper proposes---correctly, I think---that the intuitive notion of "control" captures multiple dimensions in the relationship between action and outcome. In particular, the authors identify one key dimension: the degree to which outcome depends on how much *effort* we exert, calling this dimension the "elasticity of control". They additionally argue that this dimension (rather than the more holistic notion of controllability) may be specifically impaired in certain types of psychopathology. This idea has the potential to change how we think about several major mental disorders in a substantial way and can additionally help us better understand how healthy people navigate challenging decision-making problems. More concisely, it is a very good idea.

      Unfortunately, my view is that neither the theoretical nor empirical aspects of the paper really deliver on that promise. In particular, most (perhaps all) of the interesting claims in the paper have weak empirical support.

      Starting with theory, the authors do not provide a strong formal characterization of the proposed notion of elasticity. There are existing, highly general models of controllability (e.g., Huys & Dayan, 2009; Ligneul, 2021) and the elasticity idea could naturally be embedded within one of these frameworks. The authors gesture at this in the introduction; however, this formalization is not reflected in the implemented model, which is highly task-specific. Moreover, the authors present elasticity as if it is somehow "outside of" the more general notion of controllability. However, effort and investment are just specific dimensions of action; and resources like money, strength, and skill (the "highly trained biker") are just specific dimensions of state. Accordingly, the notion of elasticity is necessarily implicitly captured by the standard model. Personally, I am compelled by the idea that effort and resource (and therefore elasticity) are particularly important dimensions, ones that people are uniquely tuned to. However, by framing elasticity as a property that is different in kind from controllability (rather than just a dimension of controllability), the authors only make it more difficult to integrate this exciting idea into generalizable models.

      Turning to experiment, the authors make two key claims: (1) people infer the elasticity of control, and (2) individual differences in how people make this inference are importantly related to psychopathology.

      Starting with claim 1, there are three subclaims here; implicitly, the authors make all three. (1A) People's behavior is sensitive to differences in elasticity, (1B) people actually represent/track something like elasticity, and (1C) people do so naturally as they go about their daily lives. The results clearly support 1A. However, 1B and 1C are not strongly supported.

      (1B) The experiment cannot support the claim that people represent or track elasticity because effort is the only dimension over which participants can engage in any meaningful decision-making. The other dimension, selecting which destination to visit, simply amounts to selecting the location where you were just told the treasure lies. Thus, any adaptive behavior will necessarily come out in a sensitivity to how outcomes depend on effort.

      Notes on rebuttal: The argument that vehicle/destination choice is not trivial because people occasionally didn't choose the instructed location is not compelling to me; if anything, the exclusion rate is unusually low for online studies. The finding that people learn more from non-random outcomes is helpful, but this could easily be cast as standard model-based learning very much like what one measures with the Daw two-step task (nothing specific to control here). Their final argument is the strongest, that to explain behavior the model must assume "a priori that increased effort could enhance control." However, more literally, the necessary assumption is that each attempt increases the probability of success (e.g., you're more likely to get a heads in two flips than one). I suppose you can call that "elasticity inference", but I would call it basic probabilistic reasoning.

      For 1C, the claim that people infer elasticity outside of the experimental task cannot be supported because the authors explicitly tell people about the two notions of control as part of the training phase: "To reinforce participants' understanding of how elasticity and controllability were manifested in each planet, [participants] were informed of the planet type they had visited after every 15 trips." (line 384).

      Notes on rebuttal: The authors try to retreat, saying "our research question was whether people can distinguish between elastic and inelastic controllability." I struggle to reconcile this with the claim in the abstract "These findings establish the elasticity of control as a distinct cognitive construct guiding adaptive behavior". That claim is the interesting one, and the one I am evaluating the evidence in light of.

      Finally, I turn to claim 2, that individual differences in how people infer elasticity are importantly related to psychopathology. There is much to say about the decision to treat psychopathology as a unidimensional construct (the authors claim otherwise, but see Fig 6C). However, I will keep it concrete and simply note that CCA (by design) obscures the relationship between any two variables. Thus, as suggestive as Figure 6B is, we cannot conclude that there is a strong relationship between Sense of Agency (SOA) and the elasticity bias---this result is consistent with any possible relationship (even a negative one). As it turns out, Figure S3 shows that there is effectively no relationship (r=0.03).

      Notes on rebuttal: The authors argue for CCA by appeal to the need to "account for the substantial variance that is typically shared among different forms of psychopathology". I agree. A simple correlation would indeed be fairly weak evidence. Strong evidence would show a significant correlation after *controlling for* other factors (e.g. a regression predicting elasticity bias from all subscales simultaneously). CCA effectively does the opposite, asking whether, with the help of all the parameters and all the surveys, one can find any correlation between the two sets of variables. The results are certainly suggestive, but they provide very little statistical evidence that the elasticity parameter is meaningfully related to any particular dimension of psychopathology.
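
To make the idea of "controlling for" shared variance concrete, the toy example below contrasts a zero-order correlation with a partial correlation. It is a sketch under stated assumptions: the three variables are constructed by hand (they are not data from the study), the variable names are hypothetical, and a single-covariate partial correlation stands in for the fuller regression on all subscales mentioned above.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hand-built, mutually orthogonal patterns (illustrative only, not real data).
shared = [-1, -1, -1, -1, 1, 1, 1, 1]   # variance shared across measures
noise_a = [-1, -1, 1, 1, -1, -1, 1, 1]  # specific to the model parameter
noise_b = [-1, 1, -1, 1, -1, 1, -1, 1]  # specific to one questionnaire subscale

param_bias = [s + e for s, e in zip(shared, noise_a)]  # hypothetical parameter
subscale = [s + e for s, e in zip(shared, noise_b)]    # hypothetical subscale

print(pearson(param_bias, subscale))               # zero-order correlation
print(partial_corr(param_bias, subscale, shared))  # after controlling for shared
```

In this construction the two measures are related only through the shared component, so the zero-order correlation is substantial while the partial correlation vanishes. A specific relationship would be one that survives this kind of control.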

      There is also a feature of the task that limits our ability to draw strong conclusions about individual differences in elasticity inference. In the original submission, the authors stated that the study was designed to be "especially sensitive to overestimation of elasticity". A straightforward consequence of this is that the resulting *empirical* estimate of estimation bias (i.e., the gamma_elasticity parameter) is itself biased. This immediately undermines any claim that references the directionality of the elasticity bias (e.g. in the abstract). Concretely, an undirected deficit such as slower learning of elasticity would appear as a directed overestimation bias.

      When we further consider that elasticity inference is the only meaningful learning/decision-making problem in the task (argued above), the situation becomes much worse. Many general deficits in learning or decision-making would be captured by the elasticity bias parameter. Thus, a conservative interpretation of the results is simply that psychopathology is associated with impaired learning and decision-making.

      Notes on rebuttal: I am very concerned to see that the authors removed the discussion of this limitation in response to my first review. I quote the original explanation here:

      - In interpreting the present findings, it needs to be noted that we designed our task to be especially sensitive to overestimation of elasticity. We did so by giving participants free 3 tickets at their initial visits to each planet, which meant that upon success with 3 tickets, people who overestimate elasticity were more likely to continue purchasing extra tickets unnecessarily. Following the same logic, had we first had participants experience 1 ticket trips, this could have increased the sensitivity of our task to underestimation of elasticity in elastic environments. Such underestimation could potentially relate to a distinct psychopathological profile that more heavily loads on depressive symptoms. Thus, by altering the initial exposure, future studies could disambiguate the dissociable contributions of overestimating versus underestimating elasticity to different forms of psychopathology.

      The logic of this paragraph makes perfect sense to me. If you assume low elasticity, you will infer that you could catch the train with just one ticket. However, when elasticity is in fact high, you would find that you don't catch the train, leading you to quickly infer high elasticity, eliminating the bias. In contrast, if you assume high elasticity, you will continue purchasing three tickets and will never have the opportunity to learn that you could be purchasing only one, so the bias remains.

      The authors attempt to argue that this isn't happening using parameter recovery. However, they only report the *correlation* in the parameter, whereas the critical measure is the *bias*. Furthermore, in parameter recovery, the data-generating and data-fitting models are identical, which will yield the best possible recovery results. Although finding no bias in this setting would support the claims, it cannot outweigh the logical argument for the bias that they originally laid out. Finally, parameter recovery should be performed across the full range of plausible parameter values; using fitted parameters (a detail I could only determine by reading the code) yields biased results because the fitted parameters are themselves subject to the bias (if present). That is, if true low elasticity is inferred as high elasticity, then you will not have any examples of low elasticity in the fitted parameters and will not detect the inability to recover them.
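
The distinction between correlation and bias in parameter recovery can be shown with a toy calculation. The numbers below are hypothetical and are not simulated from the authors' model; the constant upward shift of 0.2 is an assumption chosen only to show that a systematic bias can coexist with a near-perfect recovery correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_bias(true_values, recovered_values):
    """Average signed error of the recovered parameter (recovered - true)."""
    diffs = [r - t for t, r in zip(true_values, recovered_values)]
    return sum(diffs) / len(diffs)


# Hypothetical ground-truth parameters spanning a full range, 0.0 to 1.0.
true_values = [i / 10 for i in range(11)]
# Hypothetical recovered parameters: every estimate shifted upward by 0.2.
recovered_values = [t + 0.2 for t in true_values]

recovery_r = pearson(true_values, recovered_values)
recovery_bias = mean_bias(true_values, recovered_values)
print(recovery_r, recovery_bias)
```

Reporting the signed error alongside the correlation, over generating values that cover the whole plausible range rather than only the fitted values, would address the concern raised above.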

      Minor comments:

      Below are things to keep in mind.

      The statistical structure of the task is inconsistent with the framing. In the framing, participants can make either one or two second boarding attempts (jumps) by purchasing extra tickets. The additional attempt(s) will thus succeed with probability p for one ticket and 2p - p^2 for two tickets; the p^2 captures the fact that you only take the second attempt if you fail on the first. A consequence of this is that buying more tickets has diminishing returns. In contrast, in the task, participants always jumped twice after purchasing two tickets, and the probability of success with two tickets was exactly double that with one ticket. Thus, if participants are applying an intuitive causal model to the task, they will appear to "underestimate" the elasticity of control. I don't think this seriously jeopardizes the key results, but any follow-up work should ensure that the task's structure is consistent with the intuitive causal model.
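
The gap between the two success structures described above can be checked numerically. This is a minimal sketch: the function names and the example value of p are assumptions for illustration and are not taken from the task code.

```python
def p_success_sequential(p: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds.

    For two attempts this equals 1 - (1 - p)**2 = 2p - p**2, matching the
    framing in which the second attempt matters only if the first one fails.
    """
    return 1.0 - (1.0 - p) ** attempts


def p_success_additive(p: float, attempts: int) -> float:
    """Probability when each extra ticket adds p, as the task is described."""
    return min(1.0, attempts * p)


p = 0.3  # hypothetical per-attempt success probability
for n in (1, 2):
    print(n, p_success_sequential(p, n), p_success_additive(p, n))
```

For any p strictly between 0 and 1, the sequential value for two attempts falls below the additive one by exactly p^2, which is the diminishing-returns gap noted above.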

      The model is heuristically defined and does not reflect Bayesian updating. For example, it over-estimates maximum control by not using losses with less than 3 tickets (intuitively, the inference here depends on your beliefs about elasticity). Including forced three-ticket trials at the beginning of each round makes this less of an issue; but if you want to remove those trials, you might need to adjust the model. The need to introduce the modified model with kappa is likely another symptom of the heuristic nature of the model updating equations.

    1. eLife Assessment

      This study establishes bathy phytochromes, a unique class of bacterial photoreceptors that respond to near-infrared light (NIR), as important tools for bacterial optogenetics. NIR light is a key control signal in optogenetics due to its deep tissue penetration and the ability to combine with existing red- and blue-light sensitive systems, but thus far, NIR-activated proteins have been poorly characterized. The strength of the evidence is solid overall, with comprehensive in vitro characterization, modular design strategies, and validation across different hosts. There are some questions that remain such as the rationale for linker choices, characterization of growth and performance relative to controls, and the physiological significance of color blind effects at alkaline pH but overall, this study should advance the fields of optogenetics and photobiology and inspire future work.

    2. Reviewer #1 (Public review):

      Summary:

      This is an interesting study characterizing and engineering so-called bathy phytochromes, i.e., those that respond to near infrared (NIR) light in the ground state, for optogenetic control of bacterial gene expression. Previously, the authors have developed a structure-guided approach to functionally link several light-responsive protein domains to the signaling domain of the histidine kinase FixL, which ultimately controls gene expression. Here, the authors use the same strategy to link bathy phytochrome light-responsive domains to FixL, resulting in sensors of NIR light. Interestingly, they also link these bathy phytochrome light-sensing domains to signaling domains from the tetrathionate-sensing SHK TtrS and the toluene-sensing SHK TodS, demonstrating the generality of their protein engineering approach more broadly across bacterial two-component systems.

      This is an exciting result that should inspire future bacterial sensor design. They go on to leverage this result to develop what is, to my knowledge, the first system for orthogonally controlling the expression of two separate genes in the same cell with NIR and Red light, a valuable contribution to the field.

      Finally, the authors reveal new details of the pH-dependent photocycle of bathy phytochromes and demonstrate that their sensors work in the gut- and plant-relevant strains E. coli Nissle 1917 and A. tumefaciens.

      Strengths:

      (1) The experiments are well-founded, well-executed, and rigorous.

      (2) The manuscript is clearly written.

      (3) The sensors developed exhibit large responses to light, making them valuable tools for optogenetic applications.

      (4) This study is a valuable contribution to photobiology and optogenetics.

      Weaknesses:

      (1) As the authors note, the sensors are relatively insensitive to NIR light due to the rapid dark reversion process in bathy phytochromes. Though NIR light is generally non-phototoxic, one would expect this characteristic to be a limitation in some downstream applications where light intensities are not high (e.g., in vivo).

      (2) Though they can be multiplexed with Red light sensors, these bathy phytochrome NIR sensors are more difficult to multiplex with other commonly used light sensors (e.g., blue) due to the broad light responsivity of the Pfr state. This challenge may be overcome by careful dosing of blue light, as the authors discuss, but other bacterial NIR sensing systems with less cross-talk may be preferred in some applications.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, Meier et al. engineer a new class of light-regulated two-component systems. These systems are built using bathy-bacteriophytochromes that respond to near-infrared (NIR) light. Through a combination of genetic engineering and systematic linker optimization, the authors generate bacterial strains capable of selective and tunable gene expression in response to NIR stimulation. Overall, these results are an interesting expansion of the optogenetic toolkit into the NIR range. The cross-species functionality of the system, modularity, and orthogonality have the potential to make these tools useful for a range of applications.

      Strengths:

      (1) The authors introduce a novel class of near-infrared light-responsive two-component systems in bacteria, expanding the optogenetic toolbox into this spectral range.

      (2) Through engineering and linker optimization, the authors achieve specific and tunable gene expression, with minimal cross-activation from red light in some cases.

      (3) The authors show that the engineered systems function robustly in multiple bacterial strains, including laboratory E. coli, the probiotic E. coli Nissle 1917, and Agrobacterium tumefaciens.

      (4) The combination of orthogonal two-component systems can allow for simultaneous and independent control of multiple gene expression pathways using different wavelengths of light.

      (5) The authors explore the photophysical properties of the photosensors, investigating how environmental factors such as pH influence light sensitivity.

      Weaknesses:

      (1) The expression of multi-gene operons and fluorescent reporters could impose a metabolic burden. The authors should present data comparing optical density for growth curves of engineered strains versus the corresponding empty-vector control to provide insight into the burden and overall impact of the system on host viability and growth.

      (2) The manuscript consistently presents normalized fluorescence values, but the method of normalization is not clear (Figure 2 caption describes normalizing to the maximal fluorescence, but the maximum fluorescence of what?). The authors should provide a more detailed explanation of how the raw fluorescence data were processed. In addition, or potentially in exchange for the current presentation, the authors should include the raw fluorescence values in supplementary materials to help readers assess the actual magnitude of the reported responses.

      (3) Related to the prior point, it would be useful to have a positive control for fluorescence that could be used to compare results across different figure panels.

      (4) Real-time gene expression data are not presented in the current manuscript, but it would be helpful to include a time-course for some of the key designs to help readers assess the speed of response to NIR light.
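      The ambiguity raised in point (2) can be illustrated with a minimal sketch. The numbers below are made up purely for illustration and are not taken from the manuscript; the two normalization schemes shown are assumptions about what "normalizing to the maximal fluorescence" could mean, not the authors' documented procedure.

```python
import numpy as np

# Hypothetical raw fluorescence values (arbitrary units), NOT data from the manuscript.
# Rows: two constructs; columns: dark, red light, NIR light.
raw = np.array([[100.0, 400.0, 2000.0],
                [50.0, 60.0, 500.0]])

# Option A (assumption): each construct is divided by its own maximum.
per_construct = raw / raw.max(axis=1, keepdims=True)

# Option B (assumption): everything is divided by the single largest value in the panel.
global_norm = raw / raw.max()

print(per_construct[1, 2])  # weaker construct under NIR, normalized to itself
print(global_norm[1, 2])    # same measurement, normalized to the panel maximum
```

      Under option A both constructs peak at 1 and their absolute difference is hidden, whereas option B preserves it, which is why stating the reference value (or reporting raw values) matters.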

    4. Reviewer #3 (Public review):

      Summary:

      This paper by Meier et al introduces a new optogenetic module for the regulation of bacterial gene expression based on "bathy-BphP" proteins. Their paper begins with a careful characterization of kinetics and pH dependence of a few family members, followed by extensive engineering to produce infrared-regulated transcriptional systems based on the authors' previous design of the pDusk and pDERusk systems, and closing with characterization of the systems in bacterial species relevant for biotechnology.

      Strengths:

      The paper is important from the perspective of fundamental protein characterization, since bathy-BphPs are relatively poorly characterized compared to their phytochrome and cyanobacteriochrome cousins. It is also important from a technology development perspective: the optogenetic toolbox currently lacks infrared-stimulated transcriptional systems. Infrared light offers two major advantages: it can be multiplexed with additional tools, and it can penetrate into deep tissues with ease relative to the more widely used blue light-activated systems. The experiments are performed carefully, and the manuscript is well written.

      Weaknesses:

      My major criticism is that some information is difficult to obtain, and some data is presented with limited interpretation, making it difficult to obtain intuition for why certain responses are observed. For example, the changes in red/infrared responses across different figures and cellular contexts are reported but not rationalized. Extensive experiments with variable linker sequences were performed, but the rationale for linker choices was not clearly explained. These are minor weaknesses in an overall very strong paper.

    1. eLife Assessment

      This work models reinforcement-learning experiments using a recurrent neural network. It examines if the detailed credit assignment necessary for back-propagation through time can be replaced with random feedback. In this important study the authors show that it yields a satisfactory approximation and the evidence to support that it holds within relatively simple tasks is solid.

    2. Reviewer #1 (Public review):

      Summary:

      Can a plastic RNN serve as a basis function for learning to estimate value? In previous work this was shown to be the case, with a similar architecture to that proposed here. The learning rule in previous work was back-prop with an objective function that was the TD error function (delta) squared. Such a learning rule is non-local, as the changes in weights within the RNN, and from inputs to the RNN, depend on the weights from the RNN to the output, which estimates value. In addition, these weights themselves change over learning. The main idea in this paper is to examine whether replacing the values of these non-local changing weights, used for credit assignment, with random fixed weights can still produce results similar to those obtained with complete bp. This random feedback approach is motivated by a similar approach used for deep feed-forward neural networks.

      This work shows that random feedback in credit assignment performs well, but not as well as the precise gradient-based approach. When more constraints due to biological plausibility are imposed, performance degrades. These results are consistent with previous results on random feedback.

      Strengths:

      • The authors show that random feedback can approximate well a model trained with detailed credit assignment.

      • The authors simulate several experiments including some with probabilistic reward schedules and show results similar to those obtained with detailed credit assignments as well as in experiments.

      • The paper examines the impact of more biologically realistic learning rules and the results are still quite similar to the detailed back-prop model.

      Weaknesses:

      • The impact of the article is limited by using a network with discrete time-steps, and only a small number of time steps from stimulus to reward. They assume that each time step is on the order of hundreds of ms. They justify this by pointing to some slow intrinsic mechanisms, but they do not implement these slow mechanisms in a network with short time steps; instead they assume without demonstration that these could work as suggested. This is a reasonable first approximation, but its validity should be explicitly tested.

      • As the delay between cue and reward increases, the performance decreases. This is not surprising given the proposed mechanism, but is still a limitation, especially given that we do not really know what a reasonable value of a single time step is.

    3. Reviewer #2 (Public review):

      Summary:

      Tsurumi et al. show that recurrent neural networks can learn state and value representations in simple reinforcement learning tasks when trained with random feedback weights. The traditional method of learning for recurrent networks in such tasks (backpropagation through time) requires feedback weights which are a transposed copy of the feed-forward weights, a biologically implausible assumption. This manuscript builds on previous work regarding "random feedback alignment" and "value-RNNs", and extends them to a reinforcement learning context. The authors also demonstrate that certain non-negative constraints can enforce a "loose alignment" of feedback weights. The authors' results suggest that random feedback may be a powerful tool of learning in biological networks, even in reinforcement learning tasks.

      Strengths:

      The authors describe well the issues regarding biologically plausible learning in recurrent networks and in reinforcement learning tasks. They take care to propose networks which might be implemented in biological systems and compare their proposed learning rules to those already existing in literature. Further, they use small networks on relatively simple tasks, which allows for easier intuition into the learning dynamics.

      Weaknesses:

      The principles discovered by the authors in these smaller networks are not applied to larger networks or more complicated tasks with long temporal delays (>100 timesteps), so it remains unclear to what degree these methods can scale or can be used more generally.

      Comments on revisions: I would still want to see how well the network learns tasks with longer time delays (on the order of 100 or even 1000 timesteps). Previous work has shown that random feedback struggles to encode longer timescales (see Murray 2019, Figure 2), so I would be interested to see how that translates to the RL context in your model.

    4. Reviewer #3 (Public review):

      Summary:

      The paper studies learning rules in a simple sigmoidal recurrent neural network setting. The recurrent network has a single layer of 10 to 40 units. It is first confirmed that feedback alignment (FA) can learn a value function in this setting. Then so-called bio-plausible constraints are added: (1) when the value weights (readout) are non-negative, (2) when the activity is non-negative (normal sigmoid rather than downscaled between -0.5 and 0.5), (3) when the feedback weights are non-negative, (4) when the learning rule is revised to be monotonic: the weights are not downregulated. In the simple task considered, none of these four biological features appears to prevent learning entirely.

      Strengths:

      (1) The learning rules are implemented in a low-level fashion, of the form (pre-synaptic activity) x (post-synaptic activity) x feedback x RPE, and are therefore interpretable in terms of quantities measurable in the wet lab.

      (2) I find that non-negative FA (FA with non-negative c and w) is the most valuable theoretical insight of this paper: I understand why the alignment between w and c is automatically better at initialization.

      (3) The task choice is relevant, since it connects with experimental settings of reward conditioning with possible plasticity measurements.
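      The intuition behind strength (2) can be checked numerically with a minimal sketch. It assumes, purely for illustration, that both the value weights w and the feedback vector c are drawn independently and uniformly at random; the function names and the initialization scheme are hypothetical and are not taken from the paper.

```python
import numpy as np

def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def mean_initial_angle(n_units, non_negative, n_draws=2000, seed=0):
    """Average angle between random w and c (illustrative uniform initialization)."""
    rng = np.random.default_rng(seed)
    low = 0.0 if non_negative else -1.0  # non-negative: [0, 1); signed: [-1, 1)
    angles = []
    for _ in range(n_draws):
        w = rng.uniform(low, 1.0, n_units)
        c = rng.uniform(low, 1.0, n_units)
        angles.append(angle_deg(w, c))
    return float(np.mean(angles))

signed_angle = mean_initial_angle(20, non_negative=False)
nonneg_angle = mean_initial_angle(20, non_negative=True)
print(signed_angle, nonneg_angle)
```

      With signed entries the two vectors are close to orthogonal on average, while non-negative entries share a common positive mean and therefore start out well below 90 degrees apart, which is the "loose alignment" at initialization.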

      Weaknesses:

      (4) The task is rather easy, so it's not clear that it really captures the computational gap between FA (gradient-like learning) and a simpler learning rule such as a delta rule: RPE x (pre-synaptic) x (post-synaptic). To check whether the task is too trivial, I suggest adding a control where the vector c is constant, c_i=1.

      (5) Related to point (3), the main strength of this paper is that it draws potential connections with experimental data. It would be good to highlight more concretely the predictions of the theory for experimental findings. (Ideally, what should be observed with non-negative FA that is not expected with FA or a delta rule (constant global feedback)?)

      (6a) Random feedback with RNNs in RL has been studied in the past, so it may be worth giving some insight into how the results and analyses compare with this previous line of work (for instance, this paper [1]). For instance, I am not very surprised that FA also works for value prediction with TD error. It is also expected from the literature that the RL + RNN + FA setting would scale to tasks that are more complex than the conditioning problem proposed here, so is there a more specific take-home message about non-negative FA, or a benefit from this simpler toy task?

      (6b) Related to task complexity, it is not clear to me whether non-negative value and feedback weights would generally scale to harder tasks. If the task is so simple that a global RPE signal is sufficient to learn (see 4 and 5), then it could be good to extend the task to find a substantial gap between global RPE, non-negative FA, FA, and BP. For a well-chosen task, I expect to see a performance gap between any pair of these four learning rules. In the context of the present paper, it would be particularly interesting to study the failure modes of non-negative FA and the cases where it does perform as well as FA.

      (7) I find that the writing could be improved; it mostly feels more technical and difficult than it should. Here are some recommendations:

      7a) For instance, the task (CSC) is not fully described and requires background knowledge from other papers, which is not desirable.

      7b) Also, the rationale for the added difficulty with the stochastic reward and new state is not well explained.

      7c) In the technical description of the results, the text dives into descriptive comments on the figures, but high-level take-home messages would help guide the reader. I got a bit lost, although I feel that there is probably a lot of depth in these paragraphs.

      (8) Related to the writing issue and 5), I wish that "bio-plausibility" were not the only reason to study positive feedback and value weights. Is it possible to develop more specifically what is interesting about this positivity, and why? Is there an expected finding with non-negative FA in terms of model capability? Or perhaps there is a simpler, crisper take-home message that would communicate the experimental predictions to the community?

      [1] https://www.nature.com/articles/s41467-020-17236-y
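      To make the comparison requested in points (4) and (6b) concrete, the update form described in strength (1) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name, the learning rate, and the use of the raw post-synaptic activity as the post-synaptic factor are assumptions. Setting the feedback vector c to all ones gives the global-RPE control suggested in point (4).

```python
import numpy as np

def local_update(pre, post, feedback, rpe, lr=0.1):
    """Weight change dW[i, j] = lr * rpe * feedback[i] * post[i] * pre[j].

    pre      : activities of pre-synaptic units (length n_pre)
    post     : post-synaptic factor per unit (length n_post); hypothetical choice
    feedback : feedback vector c (length n_post); random for FA, ones for global RPE
    rpe      : scalar reward prediction error (TD error)
    """
    return lr * rpe * np.outer(feedback * post, pre)

rng = np.random.default_rng(0)
n = 5
pre = rng.uniform(0.0, 1.0, n)
post = rng.uniform(0.0, 1.0, n)
rpe = 0.5

c_random = rng.uniform(0.0, 1.0, n)  # non-negative random feedback (FA-like)
c_global = np.ones(n)                # constant feedback, c_i = 1 (delta-rule control)

dW_fa = local_update(pre, post, c_random, rpe)
dW_global = local_update(pre, post, c_global, rpe)
```

      The two rules differ only in how the scalar RPE is distributed across post-synaptic units, so a task on which they perform equally well cannot distinguish unit-specific credit assignment from a global signal.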

      Comments on revisions:

      Thank you for addressing all my comments in your reply.

    5. Author response:

      The following is the authors’ response to the original reviews

      Summary of our revisions

      (1) We have explained the reason why the untrained RNN with readout (value-weight) learning only could not well learn the simple task: it is because we trained the models continuously across trials with random inter-trial intervals rather than separately for each episodic trial and so it was not trivial for the models to recognize that cue presentation in different trials constitutes a same single state since the activities of untrained RNN upon cue presentation should differ from trial to trial (Line 177-185).

      (2) We have shown that dimensionality was higher in the value-RNNs than in the untrained RNN (Fig. 2K,6H).

      (3) We have shown that even when distractor cue was introduced, the value-RNNs could learn the task (Fig. 10).

      (4) We have shown that extended value-RNNs incorporating excitatory and inhibitory units and conforming to the Dale's law could still learn the tasks (Fig. 9,10-right column).

      (5) In the original manuscript, the non-negatively constrained value-RNN showed loose alignment of value-weight and random feedback from the beginning but did not show further alignment over trials. We have clarified its reason and found a way, introducing a slight decay (forgetting), to make further alignment occur (Fig. 8E,F).

      (6) We have shown that the value-RNNs could learn the tasks with longer cue-reward delay (Fig. 2M,6J) or action selection (Fig. 11), and found cases where random feedback performed worse than symmetric feedback.

      (7) We compared our value-RNNs with e-prop (Bellec et al., 2020, Nat Commun). While e-prop incorporates the effects of changes in RNN weights across distant times through "eligibility trace", our value-RNNs do not. The reason why our models can still learn the tasks with cue-reward delay is considered to be because our models use TD error and TD learning itself, even TD(0) without eligibility trace, is a solution for temporal credit assignment. In fact, TD error-based e-prop was also examined, but for that, result with symmetric feedback, but not with random feedback, was shown (their Fig. 4,5) while for another setup of reward-based e-prop without TD error, result with random feedback was shown (their SuppFig. 5). We have noted these in Line 695-711 (and also partly in Line 96-99).

      (8) In the original manuscript, we emphasized only the spatial locality (random rather than symmetric feedback) of our learning rule. But we have now also emphasized the temporal locality (online learning) as it is also crucial for bio-plausibility and critically different from the original value-RNN with BPTT. We also changed the title.

      (9) We have realized that our estimation of true state values was invalid (as detailed in page 34 of this document). Effects of this error on performance comparisons were small, but we apologize for this error.
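      The claim in point (7), that TD(0) without an eligibility trace already handles a cue-reward delay, can be illustrated with a toy tabular sketch. It is an assumption-laden simplification: the task here is episodic with a fixed chain of states, whereas the models in the manuscript were trained continuously across trials with an RNN; the function name and parameter values are hypothetical.

```python
import numpy as np

def td0_chain_values(n_states=4, gamma=0.9, alpha=0.1, n_trials=2000):
    """Tabular TD(0) on a chain: cue at state 0, reward on leaving the last state."""
    values = np.zeros(n_states + 1)  # extra entry is the terminal state, fixed at 0
    for _ in range(n_trials):
        for s in range(n_states):
            reward = 1.0 if s == n_states - 1 else 0.0
            # TD error uses only the current transition (no eligibility trace).
            td_error = reward + gamma * values[s + 1] - values[s]
            values[s] += alpha * td_error
    return values[:n_states]

values = td0_chain_values()
print(values)  # approaches gamma ** (steps remaining before the reward step)
```

      Value information propagates backwards by one state per visit, so the cue state eventually acquires a discounted value even though each update only looks one step ahead.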

      Reviewer #1 (Public review):

      Summary:

      Can a plastic RNN serve as a basis function for learning to estimate value? In previous work this was shown to be the case, with a similar architecture to that proposed here. The learning rule in previous work was back-prop with an objective function that was the TD error function (delta) squared. Such a learning rule is non-local, as the changes in weights within the RNN, and from inputs to the RNN, depend on the weights from the RNN to the output, which estimates value. In addition, these weights themselves change over learning. The main idea in this paper is to examine whether replacing the values of these non-local changing weights, used for credit assignment, with random fixed weights can still produce results similar to those obtained with complete bp. This random feedback approach is motivated by a similar approach used for deep feed-forward neural networks.

      This work shows that random feedback in credit assignment performs well, but not as well as the precise gradient-based approach. When more constraints due to biological plausibility are imposed, performance degrades. These results are not surprising given previous results on random feedback. This work is incomplete because the delay times used were only a few time steps, and it is not clear how well random feedback would operate with longer delays. Additionally, the examples simulated with a single cue and a single reward are overly simplistic, and the field should move beyond these exceptionally simple examples.

      Strengths:

      • The authors show that random feedback can approximate well a model trained with detailed credit assignment.

      • The authors simulate several experiments including some with probabilistic reward schedules and show results similar to those obtained with detailed credit assignments as well as in experiments.

      • The paper examines the impact of more biologically realistic learning rules and the results are still quite similar to the detailed back-prop model.

      Weaknesses:

      *please note that we numbered your public review comments and recommendations for the authors as Pub1 and Rec1 etc so that we can refer to them in our replies to other comments.

      Pub1. The authors also show that an untrained RNN does not perform as well as the trained RNN. However, they never explain what they mean by an untrained RNN. It should be clearly explained.

      These results are actually surprising. An untrained RNN with enough units and sufficiently large variance of recurrent weights can have a high-dimensionality and generate a complete or nearly complete basis, though not orthonormal (e.g: Rajan&Abbott 2006). It should be possible to use such a basis to learn this simple classical conditioning paradigm. It would be useful to measure the dimensionality of network dynamics, in both trained and untrained RNN's.

      We have added an explanation of untrained RNN in Line 144-147:

      “As a negative control, we also conducted simulations in which these connections were not updated from initial values, referring to as the case with "untrained (fixed) RNN". Notably, the value weights w (i.e., connection weights from the RNN to the striatal value unit) were still trained in the models with untrained RNN.”

      We have also analyzed the dimensionality of network dynamics by calculating the contribution ratios of each principal component of the trajectory of RNN activities. It was revealed that the contribution ratios of later principal components were smaller in the cases with untrained RNN than in the cases with trained value RNN. We have added these results in Fig. 2K and Line 210-220 (for our original models without non-negative constraint):

      “In order to examine the dimensionality of RNN dynamics, we conducted principal component analysis (PCA) of the time series (for 1000 trials) of RNN activities and calculated the contribution ratios of PCs in the cases of oVRNNbp, oVRNNrf, and untrained RNN with 20 RNN units. Figure 2K shows a log of contribution ratios of 20 PCs in each case. Compared with the case of untrained RNN, in oVRNNbp and oVRNNrf, initial component(s) had smaller contributions (PC1 (t-test p = 0.00018 in oVRNNbp; p = 0.0058 in oVRNNrf) and PC2 (p = 0.080 in oVRNNbp; p = 0.0026 in oVRNNrf)) while later components had larger contributions (PC3~10,15~20 p < 0.041 in oVRNNbp; PC5~20 p < 0.0017 in oVRNNrf) on average, and this is considered to underlie their superior learning performance. We noticed that late components had larger contributions in oVRNNrf than in oVRNNbp, although these two models with 20 RNN units were comparable in terms of cue~reward state values (Fig. 2J-left).”

      and Fig. 6H and Line 412-416 (for our extended models with non-negative constraint):

      “Figure 6H shows contribution ratios of PCs of the time series of RNN activities in each model with 20 RNN units. Compared with the cases with naive/shuffled untrained RNN, in oVRNNbp-rev and oVRNNrf-bio, later components had relatively high contributions (PC5~20 p < 1.4×10<sup>−6</sup> (t-test vs naive) or < 0.014 (vs shuffled) in oVRNNbp-rev; PC6~20 p < 2.0×10<sup>−7</sup> (vs naive) or PC7~20 p < 5.9×10<sup>−14</sup> (vs shuffled) in oVRNNrf-bio), explaining their superior value-learning performance.”
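      The dimensionality analysis described above (contribution ratios of principal components of the RNN activity time series) can be sketched as follows. This is a generic PCA computation via singular value decomposition, offered as an illustration; the function name and the synthetic low-dimensional data are assumptions and do not reproduce the authors' code or results.

```python
import numpy as np

def pc_contribution_ratios(activity):
    """Fraction of total variance carried by each principal component.

    activity : array of shape (time_steps, n_units), one row per time step.
    Returns ratios sorted from the largest component to the smallest.
    """
    centered = activity - activity.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# Synthetic example: 20 units driven by only 2 latent signals plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 20))
activity = latent @ mixing + 0.01 * rng.normal(size=(1000, 20))

ratios = pc_contribution_ratios(activity)
print(np.log10(ratios))  # a log scale makes the later components visible
```

      A steep drop after the first few components indicates low-dimensional dynamics, whereas larger contributions from later components indicate that the activity spans more dimensions.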

      Regarding the poor performance of the model with untrained RNN, we would like to add a note. It is sure that untrained RNN with sufficient dimensions should be able to well represent just <10 different states, and state values should be able to be well learned through TD learning regardless of whatever representation is used. However, a difficulty (nontriviality) lies in that because we modeled the tasks in a continuous way, rather than in an episodic way, the activity of untrained RNN upon cue presentation should generally differ from trial to trial. Therefore, it was not trivial for RNN to know that cue presentation in different trials, even after random lengths of inter-trial interval, should constitute a same single state. We have added this note in Line 177-185:

      “This inferiority of untrained RNN may sound odd because there were only four states from cue to reward while random RNN with enough units is expected to be able to represent many different states (c.f., [49]) and the effectiveness of training of only the readout weights has been shown in reservoir computing studies [50-53]. However, there was a difficulty stemming from the continuous training across trials (rather than episodic training of separate trials): the activity of untrained RNN upon cue presentation generally differed from trial to trial, and so it is non-trivial that cue presentation in different trials should be regarded as the same single state, even if it could eventually be dealt with at the readout level if the number of units increases.”

      The original value RNN study (Hennig et al., 2023, PLoS Comput Biol) also modeled tasks in a continuous way (though using backprop-through-time (BPTT) for training) and their model with untrained RNN also showed considerably larger RPE error than the value RNN even when the number of RNN units was 100 (the maximum number plotted in their Fig. 6A).

      Pub2. The impact of the article is limited by using a network with discrete time-steps, and only a small number of time steps from stimulus to reward. What is the length of each time step? If it's on the order of the membrane time constant, then a few time steps are only tens of ms. In classical conditioning experiments, typical delays are on the order of hundreds of milliseconds to seconds. Authors should test if random feedback weights work as well for larger time spans. This can be done by simply using a much larger number of time steps.

      In the revised manuscript, we examined the cases in which the cue-reward delay (originally 3 time steps) was elongated to 4, 5, or 6 time-steps. Our online value RNN models with random feedback could still achieve better performance (smaller squared value error) than the models with untrained RNN, although the performance degraded as the cue-reward delay increased. We have added these results in Fig. 2M and Line 223-228 (for our original models without non-negative constraint)

      “We further examined the cases with longer cue-reward delays. As shown in Fig. 2M, as the delay increased, the mean squared error of state values (at 3000-th trial) increased, but the relative superiority of oVRNNbp and oVRNNrf over the model with untrained RNN remained to hold, except for cases with small number of RNN units (5) and long delay (5 or 6) (p < 0.0025 in Wilcoxon rank sum test for oVRNNbp or oVRNNrf vs untrained for each number of RNN units for each delay).”

      and Fig. 6J and Line 422-429 (for our extended models with non-negative constraint):

      “Figure 6J shows the cases with longer cue-reward delays, with default or halved learning rates. As the delay increased, the mean squared error of state values (at 3000-th trial) increased, but the relative superiority of oVRNNbp-rev and oVRNNrf-bio over the models with untrained RNN remained to hold, except for a few cases with 5 RNN units (5 delay oVRNNrf-bio vs shuffled with default learning rate, 6 delay oVRNNrf-bio vs naive or shuffled with halved learning rate) (p < 0.047 in Wilcoxon rank sum test for oVRNNbp-rev or oVRNNrf-bio vs naive or shuffled untrained for each number of RNN units for each delay).”

      Also, we have added the note about our assumption and consideration on the time-step that we described in our provisional reply in Line 136-142:

      “We assumed that a single RNN unit corresponds to a small population of neurons that intrinsically share inputs and outputs, for genetic or developmental reasons, and the activity of each unit represents the (relative) firing rate of the population. Cortical population activity is suggested to be sustained not only by fast synaptic transmission and spiking but also, even predominantly, by slower synaptic neurochemical dynamics [46] such as short-term facilitation, whose time constant can be around 500 milliseconds [47]. Therefore, we assumed that single time-step of our rate-based (rather than spike-based) model corresponds to 500 milliseconds.”

      Pub3. In the section with more biologically constrained learning rules, while the output weights are restricted to only be positive (as well as the random feedback weights), the recurrent weights and weights from input to RNN are still bi-polar and can change signs during learning. Why is the constraint imposed only on the output weights? It seems reasonable that the whole setup will fail if the recurrent weights were only positive as in such a case most neurons will have very similar dynamics, and the network dimensionality would be very low. However, it is possible that only negative weights might work. It is unclear to me how to justify that bipolar weights that change sign are appropriate for the recurrent connections and inappropriate for the output connections. On the other hand, an RNN with excitatory and inhibitory neurons in which weight signs do not change could possibly work.

      We examined extended models that incorporated inhibitory and excitatory units and followed Dale's law with certain assumptions, and found that these models could still learn the tasks. We have added these results in Fig. 9 and subsection “4.1 Models with excitatory and inhibitory units” and described the details of the extended models in Line 844-862:

      Pub4. Like most papers in the field this work assumes a world composed of a single cue. In the real world there are many more cues than rewards, some cues are not associated with any rewards, and some are associated with other rewards or even punishments. In the simplest case, it would be useful to show that this network could actually work if there are additional distractor cues that appear at random either before the CS, or between the CS and US. There are good reasons to believe such distractor cues will be fatal for an untrained RNN, but might work with a trained RNN, either using BPTT or random feedback. Although this assumption is a common flaw in most work in the field, we should no longer ignore these slightly more realistic scenarios.

      We examined the performance of the models in a task in which distractor cue randomly appeared. As a result, our model with random feedback, as well as the model with backprop, could still learn the state values much better than the models with untrained RNN. We have added these results in Fig. 10 and subsection “4.2 Task with distractor cue”

      Reviewer #1 (Recommendations for the authors):

      Detailed comments to authors

      Rec1. Are the untrained RNNs discussed in methods? It seems quite good in estimating value but has a strong dopamine response at time of reward. Is nothing trained in the untrained RNN or are the W values trained. Untrained RNN are not bad at estimating value, but not as good as the two other options. It would seem reasonable that an untrained RNN (if I understand what it is) will be sufficient for such simple Pavlovian conditioning paradigms. This is provided that the RNN generates a complete, or nearly complete basis. Random RNN's provided that the random weights are chosen properly can indeed generate a nearly complete basis. Once there is a nearly complete temporal basis, it seems that a powerful enough learning rule will be able to learn the very simple Pavlovian conditioning. Since there are only 3 time-steps from cue to reward, an RNN dimensionality of 3 would be sufficient. A failure to get a good approximation can also arise from the failure of the learning algorithm for the output weights (W).

      As we mentioned in our reply to your public comment Pub1 (page 3-5), we have added an explanation of "untrained RNN" (in which the value weights were still learnt) (Line 144-147). We also analyzed the dimensionality of network dynamics by calculating the contribution ratios of principal components of the trajectory of RNN activities, showing that the contribution ratios of later principal components were smaller in the cases with untrained RNN than in the cases with trained value RNN (Fig. 2K/Line 210-220, Fig.6H/Line 412-416). Moreover, also as we mentioned in our reply to your public comment Pub1, we have added a note that even learning of a small number of states was not trivially easy because we considered continuous learning across trials rather than episodic learning of separate trials and thus it was not trivial for the model to know that cue presentation in different trials after random lengths of inter-trial interval should still be regarded as a same single state (Line 177-185).

      Rec2. For all cases, it will be useful to estimate the dimensionality of the RNN. Is the dimensionality of the untrained RNN smaller than in the trained cases? If this is the case, this might depend on the choice of the initial random (I assume) recurrent connectivity matrix.

      As mentioned above, we have analyzed the dimensionality of the network dynamics, and as you said, the dimensionality of the model with untrained RNN (which was indeed the initial random matrix as you said, as we mentioned above) was on average smaller than the trained value RNN models (Fig. 2K/Line 210-220, Fig.6H/Line 412-416).

      Rec3. It is surprising that the error starts increasing for more RNN units above ~15. See discussion. This might indicate a failure to adjust the learning parameters of the network rather than a true and interesting finding.

      Thank you very much for this insightful comment. In the original manuscript, we set the learning rate to a fixed value (0.1), without normalization by the squared norm of the feature vector (as we mentioned in Line 656-7 of the original manuscript), because we thought such a normalization could not be locally (biologically) implemented. However, we have realized that the lack of normalization resulted in an excessively large learning rate when the number of RNN units was large, and that this could cause instability and an increase in error, as you suggested. Therefore, in the revised manuscript, we have implemented a normalization of the learning rate (of value weights) that does not require non-local computations, specifically, division by the number of RNN units. As a result, the error now decreased monotonically as the number of RNN units increased in the non-negatively constrained models (Fig. 6E-left), and also largely in the unconstrained model with random feedback, although still not in the unconstrained model with backprop or untrained RNN (Fig. 2J-left).
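
The effect of this normalization can be illustrated with a minimal sketch, assuming a generic linear value readout v = w·x updated by a TD error. The helper names, the constant activity of 0.5, and the TD error of 1.0 are hypothetical values chosen for the example and are not taken from the manuscript; only the base rate of 0.1 and the division by the number of units follow the description above.

```python
def td_value_weight_update(w, x, td_error, base_rate=0.1, normalize=True):
    """One TD update of value weights w given unit activities x.

    With normalize=True the learning rate is divided by the number of
    units, which needs no information beyond the (fixed) network size.
    """
    rate = base_rate / len(x) if normalize else base_rate
    return [w_i + rate * td_error * x_i for w_i, x_i in zip(w, x)]

def value(w, x):
    """Linear value readout v = w . x."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def value_change(n_units, normalize):
    """Change in the value estimate caused by a single update."""
    x = [0.5] * n_units   # hypothetical activities
    w = [0.0] * n_units
    w_new = td_value_weight_update(w, x, td_error=1.0, normalize=normalize)
    return value(w_new, x) - value(w, x)

for n in (10, 20, 40):
    print(n, value_change(n, normalize=False), value_change(n, normalize=True))
```

Without normalization, the change in the value estimate per update grows in proportion to the number of units, so a rate that is stable for a small network can overshoot in a large one; with division by the number of units, the change no longer depends on network size in this example.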

      Rec4. Not numbering equations is a problem. For example, the explanations of feedback alignment (lines 194-206) rely on equations in the methods section which are not numbered. This makes it hard to read these explanations. Indeed, it will also be better to include a detailed derivation of the explanation in these lines in a mathematical appendix. Key equations should be numbered.

      We have added numbers to key equations in the Methods, and references to the numbers of corresponding equations in the main text. Detailed derivations are included in the Methods.

      Rec5. What is shown in Figure 3C? - an equation will help.

      We have added an explanation using equations in the main text (Line 256-259).

      Rec6. The explanation of why alignment occurs is not satisfactory, but neither is it in previous work on feedforward networks. The least that should be done though

      Regarding why alignment occurs, what remained mysterious (to us) was that in the case of the non-negatively constrained model, while the angle between the value weight vector (w) and the random feedback vector (c) was relatively close (loosely aligned) from the beginning, it appeared (as mentioned in the manuscript) that there was no further alignment over trials, even though the same mechanism for feedback alignment that we derived for the model without non-negative constraint was expected to operate also under the non-negative constraint. We have now clarified the reason for this, and found a way, namely the introduction of slight decay (forgetting) of value weights, by which feedback alignment came to occur in the non-negatively constrained model. We have added these in the revised manuscript (Line 463-477):

      “As mentioned above, while the angle between w and c was on average smaller than 90° from the beginning, there was no further alignment over trials. This seemed mysterious because the mechanism for feedback alignment that we derived for the models without non-negative constraint was expected to work also for the models with non-negative constraint. As a possible reason for the non-occurrence of feedback alignment, we guessed that one or a few element(s) of w grew prominently during learning, and so w became close to an edge or boundary of the non-negative quadrant and thereby angle between w and other vector became generally large (as illustrated in Fig. 8D). Figure 8Ea shows the mean±SEM of the elements of w ordered from the largest to smallest ones after 1500 trials. As conjectured above, a few elements indeed grew prominently.

      We considered that if a slight decay (forgetting) of value weights (c.f., [59-61]) was assumed, such a prominent growth of a few elements of w may be mitigated and alignment of w to c, beyond the initial loose alignment because of the non-negative constraint, may occur. These conjectures were indeed confirmed by simulations (Fig. 8Eb,c and Fig. 8F). The mean squared value error slightly increased when the value-weight decay was assumed (Fig. 8G), however, presumably reflecting a decrease in developed values and a deterioration of learning because of the decay.”
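
The geometric intuition in this passage can be checked with a small worked example. This is only a sketch under stated assumptions: the 12-dimensional vectors, the uniform choice of c, and the two hand-made weight vectors are hypothetical and do not come from the simulations; they simply contrast a w dominated by one element with a w whose other elements remain comparable in size.

```python
import math

def angle_between_deg(u, v):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

n_units = 12                              # hypothetical network size
c = [1.0] * n_units                       # hypothetical non-negative feedback vector
w_peaked = [1.0] + [0.0] * (n_units - 1)  # one element grew prominently
w_spread = [1.0] + [0.5] * (n_units - 1)  # remaining elements stay comparable

print(angle_between_deg(w_peaked, c))
print(angle_between_deg(w_spread, c))
```

For the peaked vector, the cosine of the angle to c equals 1/sqrt(n_units), so the angle grows toward 90° as the network gets larger, even though both vectors lie in the non-negative quadrant. Keeping the other elements from vanishing, as the decay described above is conjectured to do indirectly, gives a much smaller angle in this example.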

      Rec7. I don't understand the qualitative difference between 4G and 4H. The difference seems to be smaller but there is still an apparent difference. Can this be quantified?

      We have added pointers indicating which were compared and statistical significance on Fig. 4D-H, and also Fig. 7 and Fig. 9C.

      Rec8. More biologically realistic constraints.

      Are the weights allowed to become negative? - No.

      Figure 6C - untrained RNN with non-negative x_i. Again - it was not explained what untrained RNN is. However, given my previous assumption, this is probably because the units developed in an untrained RNN are much further from representing a complete basis function. This cannot be done with only positive values. It would be useful to see network dynamics of units for untrained RNN. It might also be useful in all cases to estimate the dimensionality of the RNN. For 3 time-steps, it needs to be at least 3, and for more time steps as in Figure 4, larger.

      As we mentioned in our reply to your public comment Pub3 (page 6-8), in the revised manuscript we examined models that incorporated inhibitory and excitatory units and followed Dale's law, which could still learn the tasks (Fig. 9, Line 479-520). We have also analyzed the dimensionality of network dynamics as we mentioned in our replies to your public comment Pub1 and recommendations Rec1 and Rec2.

      Rec9. A new type of untrained RNN is introduced (Fig 6D); this is the first time an explanation of the untrained RNN is given. Indeed, the dimensionality of the second type of untrained RNN should be similar to the bioVRNNrf. The results are still not good.

      In the model with the new type of untrained RNN whose elements were shuffled from trained bioVRNNrf, contribution ratios of later principal components of the trajectory of RNN activities (Fig. 6H gray dotted line) were indeed larger than those in the model with native untrained RNN (gray solid line) but still much smaller than those in the trained value RNN models with backprop (red line) or random feedback (blue line). It is considered that in value RNN, RNN connections were trained to realize high-dimensional trajectory, and shuffling did not generally preserve such an ability.

      Rec10. The discussion is too long and verbose. This is not a review paper.

      We have made the original discussion much more compact (from 1686 words to 940 words). We have added new discussion in response to the review comments, but the total length remains shorter than before (1589 words).

      Reviewer #2 (Public review):

      Summary:

      Tsurumi et al. show that recurrent neural networks can learn state and value representations in simple reinforcement learning tasks when trained with random feedback weights. The traditional method of learning for recurrent networks in such tasks (backpropagation through time) requires feedback weights which are a transposed copy of the feed-forward weights, a biologically implausible assumption. This manuscript builds on previous work regarding "random feedback alignment" and "value-RNNs", and extends them to a reinforcement learning context. The authors also demonstrate that certain non-negative constraints can enforce a "loose alignment" of feedback weights. The authors' results suggest that random feedback may be a powerful tool of learning in biological networks, even in reinforcement learning tasks.

      Strengths:

      The authors describe well the issues regarding biologically plausible learning in recurrent networks and in reinforcement learning tasks. They take care to propose networks which might be implemented in biological systems and compare their proposed learning rules to those already existing in literature. Further, they use small networks on relatively simple tasks, which allows for easier intuition into the learning dynamics.

      Weaknesses:

      The principles discovered by the authors in these smaller networks are not applied to deeper networks or more complicated tasks, so it remains unclear to what degree these methods can scale up, or can be used more generally.

      We have examined extended models that incorporated inhibitory and excitatory units and followed Dale's law with certain assumptions, and found that these models could still learn the tasks. We have added these results in Fig. 9 and subsection “4.1 Models with excitatory and inhibitory units”.

      We have also examined the performance of the models in a task in which a distractor cue randomly appeared, finding that our models could still learn the state values much better than the models with untrained RNN. We have added these results in Fig. 10 and subsection “4.2 Task with distractor cue”.

      Regarding the depth, we continue to think about it but have not yet come up with concrete ideas.

      Reviewer #2 (Recommendations for the authors):

      (1) I think the work would greatly benefit from more proofreading. There are language errors/oddities throughout the paper, I will list just a few examples from the introduction:

      Thank you for pointing this out. We have made revisions throughout the paper.

      line 63: "simultaneously learnt in the downstream of RNN". Simultaneously learnt in networks downstream of the RNN? Simultaneously learnt in a downstream RNN? The meaning is not clear in the original sentence.

      We have revised it to "simultaneously learnt in connections downstream of the RNN" (Line 67-68).

      starting in line 65: " A major problem, among others.... value-encoding unit" is a run-on sentence and would be more readable if split into multiple sentences.

      We have extensively revised this part, which now consists of short sentences (Line 70-75).

      line 77: "in supervised learning of feed-forward network" should be either "in supervised learning of a feed-forward network" or "in supervised learning of feed-forward networks".

      We have changed "feed-forward network" to "feed-forward networks" (Line 83).

      (2) Under what conditions can you use an online learning rule which only considers the influence of the previous timestep? It's not clear to me how your networks solve the temporal credit assignment problem when the cue-reward delay in your tasks is 3-5ish time steps. How far can you stretch this delay before your networks stop learning correctly because of this one-step assumption? Further, how much does feedback alignment constrain your ability to learn long timescales, such as in Murray, J.M. (2019)?

      We consider that our models can solve the temporal credit assignment problem, at least to a certain extent, because temporal-difference (TD) learning, which we adopted, itself has the power to resolve temporal credit assignment, as exemplified by the fact that TD(0) algorithms without eligibility trace can still learn the value of distant rewards. We have added a discussion on this in Line 702-705:

      “…our models do not have "eligibility trace" (nor memorable/gated unit, different from the original value-RNN [26]), but could still solve temporal credit assignment to a certain extent because TD learning is by itself a solution for it (notably, recent work showed that combination of TD(0) and model-based RL well explained rat's choice and DA patterns [132]).”
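
The point that TD(0) alone can propagate value back from a delayed reward can be seen in a minimal tabular sketch. This is a simplified illustration, not the model in the manuscript: it uses a lookup table instead of an RNN, treats each trial as a separate episode (whereas the models above were trained continuously across trials), and the discount factor, learning rate, and number of episodes are assumed values.

```python
def td0_chain(n_steps=3, gamma=0.8, alpha=0.1, n_episodes=2000):
    """Tabular TD(0) on a chain: cue -> ... -> reward, no eligibility trace.

    State 0 is the cue; a reward of 1 is delivered on leaving the last
    state, after which the episode ends. Each update uses only the
    current state and its immediate successor.
    """
    values = [0.0] * n_steps
    for _ in range(n_episodes):
        for s in range(n_steps):
            is_last = s == n_steps - 1
            reward = 1.0 if is_last else 0.0
            next_value = 0.0 if is_last else values[s + 1]
            td_error = reward + gamma * next_value - values[s]
            values[s] += alpha * td_error
    return values

learned = td0_chain()
print(learned)
```

Although every update looks only one step ahead, the value at the cue approaches gamma squared, the discounted value of a reward two transitions later, because value is passed backward one state at a time over repeated episodes.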

      We have also examined the cases in which the cue-reward delay (originally 3 time-steps) was elongated to 4, 5, or 6 time-steps. Our models with random feedback could still achieve better performance than the models with untrained RNN, although the performance degraded as the cue-reward delay increased. We have added these results in Fig. 2M and Line 223-228 (for our original models without non-negative constraint):

      “We further examined the cases with longer cue-reward delays. As shown in Fig. 2M, as the delay increased, the mean squared error of state values (at 3000-th trial) increased, but the relative superiority of oVRNNbp and oVRNNrf over the model with untrained RNN remained to hold, except for cases with small number of RNN units (5) and long delay (5 or 6) (p < 0.0025 in Wilcoxon rank sum test for oVRNNbp or oVRNNrf vs untrained for each number of RNN units for each delay).”

      and Fig. 6J and Line 422-429 (for our extended models with non-negative constraint):

      “Figure 6J shows the cases with longer cue-reward delays, with default or halved learning rates. As the delay increased, the mean squared error of state values (at 3000-th trial) increased, but the relative superiority of oVRNNbp-rev and oVRNNrf-bio over the models with untrained RNN remained to hold, except for a few cases with 5 RNN units (5 delay oVRNNrf-bio vs shuffled with default learning rate, 6 delay oVRNNrf-bio vs naive or shuffled with halved learning rate) (p < 0.047 in Wilcoxon rank sum test for oVRNNbp-rev or oVRNNrf-bio vs naive or shuffled untrained for each number of RNN units for each delay).”

      As for the difficulty due to random feedback compared to backprop, there appeared to be little difference in the models without non-negative constraint (Fig. 2M), whereas in the models with nonnegative constraint, when the cue-reward delay was elongated to 6 time-steps, the model with random feedback performed worse than the model with backprop (Fig. 6J bottom-left panel).

      (3) Line 150: Were the RNN methods trained with continuation between trials?

      Yes, we have added

      “The oVRNN models, and the model with untrained RNN, were continuously trained across trials in each task, because we considered that it was ecologically more plausible than episodic training of separate trials.” in Line 147-150. This is considered to make learning of even the simple cue-reward association task nontrivial, as we describe in our reply to your comment 9 below.

      (4) Figure 2I, J: indicate the statistical significance of the difference between the three methods for each of these measures.

      We have added statistical information for Fig. 2J (Line 198-203):

      “As shown in the left panel of Fig. 2J, on average across simulations, oVRNNbp and oVRNNrf exhibited largely comparable performance and always outperformed the untrained RNN (p < 0.00022 in Wilcoxon rank sum test for oVRNNbp or oVRNNrf vs untrained for each number of RNN units), although oVRNNbp somewhat outperformed or underperformed oVRNNrf when the number of RNN units was small (≤10 (p < 0.049)) or large (≥25 (p < 0.045)), respectively.”

      and also Fig. 6E (for non-negative models) (Line 385-390):

      “As shown in the left panel of Fig. 6E, oVRNNbp-rev and oVRNNrf-bio exhibited largely comparable performance and always outperformed the models with untrained RNN (p < 2.5×10<sup>−12</sup> in Wilcoxon rank sum test for oVRNNbp-rev or oVRNNrf-bio vs naive or shuffled untrained for each number of RNN units), although oVRNNbp-rev somewhat outperformed or underperformed oVRNNrf-bio when the number of RNN units was small (≤10 (p < 0.00029)) or large (≥25 (p < 3.7×10<sup>−6</sup>)), respectively…”

      Fig. 2I shows distributions, whose means are plotted in Fig. 2J, and we did not add statistics to Fig. 2I itself.

      (5) Line 178: Has learning reached a steady state after 1000 trials for each of these networks? Can you show a plot of error vs. trial number?

      We have added a plot of error vs trial number for original models (Fig. 2L, Line 221-223):

      “We examined how learning proceeded across trials in the models with 20 RNN units. As shown in Fig. 2L, learning became largely converged by 1000-th trial, although slight improvement continued afterward.”

      and non-negatively constrained models (Fig. 6I, Line 417-422):

      “Figure 6I shows how learning proceeded across trials in the models with 20 RNN units. While oVRNNbp-rev and oVRNNrf-bio eventually reached a comparable level of errors, oVRNNrf-bio outperformed oVRNNbp-rev in early trials (at 200, 300, 400, or 500 trials; p < 0.049 in Wilcoxon rank sum test for each). This is presumably because the value weights did not develop well in early trials and so the backprop-type feedback, which was the same as the value weights, did not work well, while the non-negative fixed random feedback worked finely from the beginning.”

      As shown in these figures, learning became largely steady at 1000 trials, but still slightly continued, and we have added simulations with 3000 trials (Fig. 2M and Fig. 6J).

      (6) Line 191: Put these regression values in the figure caption, as well as on the plot in Figure 3B.

      We have added the regression values in Fig. 3B and its caption.

      (7) Line 199: This idea of being in the same quadrant is interesting, but I think the term "relatively close angle" is too vague. Is there another, more quantitative way to describe what you mean by this?

      We have revised this (Line 252-254) to “a vector that is in a relatively close angle with c, or more specifically, is in the same quadrant as (and thus within at maximum 90° from) c (for example, [c<sub>1</sub> c<sub>2</sub> c<sub>3</sub>]<sup>T</sup> and [0.5c<sub>1</sub> 1.2c<sub>2</sub> 0.8c<sub>3</sub>]<sup>T</sup>)”
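
The example vectors in the revised sentence can be checked numerically. The sketch below is illustrative only: the numerical entries chosen for c are hypothetical, and the scaling factors 0.5, 1.2, and 0.8 are those given in the example above. Because each element of the second vector has the same sign as the corresponding element of c, every term of the dot product is positive, so the angle must be below 90°.

```python
import math

def angle_deg(u, v):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

c = [0.3, -0.7, 0.2]                           # hypothetical feedback vector
scaled = [0.5 * c[0], 1.2 * c[1], 0.8 * c[2]]  # same signs, different magnitudes

print(angle_deg(c, scaled))
```

Note that sharing a quadrant is a sufficient condition for an angle below 90°, not a necessary one: two vectors that differ in sign on a small element can still be within 90° of each other.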

      (8) Line 275: I'd like to see this measure directly in a plot, along with the statistical significance.

      We have added pointers indicating which were compared and statistical significance on Fig. 4D-H, and also Fig. 7 and Fig. 9C.

      (9) Line 280: Surely the untrained RNN should be able to solve the task if the reservoir is big enough, no? Maybe much bigger than 50 units, but still.

      We are not sure that this is the case. A difficulty lies in the fact that, because we modeled the tasks in a continuous way rather than in an episodic way (as we mentioned in our reply to your comment 3), the activity of untrained RNN upon cue presentation should generally differ from trial to trial. Therefore, it was not trivial for the RNN to know that cue presentation in different trials, even after random lengths of inter-trial interval, should constitute the same single state. We have added this note in Line 177-185:

      “This inferiority of untrained RNN may sound odd because there were only four states from cue to reward while random RNN with enough units is expected to be able to represent many different states (c.f., [49]) and the effectiveness of training of only the readout weights has been shown in reservoir computing studies [50-53]. However, there was a difficulty stemming from the continuous training across trials (rather than episodic training of separate trials): the activity of untrained RNN upon cue presentation generally differed from trial to trial, and so it is non-trivial that cue presentation in different trials should be regarded as the same single state, even if it could eventually be dealt with at the readout level if the number of units increases.”

      The original value RNN study (Hennig et al., 2023, PLoS Comput Biol) also modeled tasks in a continuous way (though using BPTT for training) and their model with untrained RNN also showed considerably larger RPE error than the value RNN even when the number of RNN units was 100 (the maximum number plotted in their Fig. 6A).

      (10) It's a bit confusing to compare Figure 4C to Figure 4D-H because there are also many features of D-H which do not match those of C (response to cue, response to late reward in task 1). It would make sense to address this in some way. Is there another way to calculate the true values of the states (e.g., maybe you only start from the time of the cue) which better approximates what the networks are doing?

      As we mentioned in our replies to your comments 3 and 9, our models with RNN were trained continuously across trials rather than separately for each episodic trial, and whether the models could still learn the state representation is a key issue. Therefore, starting learning from the time of cue would not be an appropriate way to compare the models, and instead we have made statistical comparison regarding key features, specifically, TD-RPEs at early and late rewards, as indicated in Fig. 4D-H.

      (11) Line 309: Can you explain why this non-monotonic feature exists? Why do you believe it would be more biologically plausible to assume monotonic dependence? It doesn't seem so straightforward to me; I can imagine that competing LTP/LTD mechanisms may produce plasticity which would have a non-monotonic dependence on post-synaptic activity.

      Thank you for this insightful comment. As you suggested, non-monotonic dependence on the postsynaptic activity (BCM rule) has been proposed for unsupervised learning (cortical self-organization) (Bienenstock et al., 1982 J Neurosci), and there were suggestions that triplet-based STDP could be reduced to a BCM-like rule and additional components (Gjorgjieva et al., 2011 PNAS; Shouval, 2011 PNAS). However, the non-monotonicity that appeared in our model, derived from the backprop rule, is maximized at the middle and is thus opposite to the BCM rule, which is minimized at the middle (i.e., initially decreases and thereafter increases). Therefore, we consider that such an increase-then-decrease-type non-monotonicity would be less plausible than a monotonic increase, which could approximate an extreme case (with a minimum dip) of the BCM rule. We have added a note on this point in Line 355-358:

      “…the dependence on the post-synaptic activity was non-monotonic, maximized at the middle of the range of activity. It would be more biologically plausible to assume a monotonic increase (while an opposite shape of non-monotonicity, once decrease and thereafter increase, called the BCM (Bienenstock-Cooper-Munro) rule has actually been suggested [56-58]).”

      (12) Line 363: This is the most exciting part of the paper (for me). I want to learn way more about this! Don't hide this in a few sentences. I want to know all about loose vs. feedback alignment. Show visualizations in 3D space of the idea of loose alignment (starting in the same quadrant), and compare it to how feedback alignment develops (ending in the same quadrant). Does this "loose" alignment idea give us an idea why the random feedback seems to settle at 45 degree angle? it just needs to get the signs right (same quadrant) for each element?

      In reply to this encouraging comment, we have made further analyses of the loose alignment. By the term "loose alignment", we meant that the value weight vector w and the feedback vector c are in the same (non-negative) quadrant, as you said. But what remained mysterious (to us) was that while the angle between w and c was relatively close (loosely aligned) from the beginning, it appeared (as mentioned in the manuscript) that there was no further alignment over trials (and the angle actually settled at somewhat larger than 45°), even though the same mechanism for feedback alignment that we derived for the model without non-negative constraint was expected to operate also under the non-negative constraint. We have now clarified the reason for this, and found a way, namely the introduction of slight decay (forgetting) of value weights, by which feedback alignment came to occur in the non-negatively constrained model. We have added this in Line 463-477:

      “As mentioned above, while the angle between w and c was on average smaller than 90° from the beginning, there was no further alignment over trials. This seemed mysterious because the mechanism for feedback alignment that we derived for the models without non-negative constraint was expected to work also for the models with non-negative constraint. As a possible reason for the non-occurrence of feedback alignment, we guessed that one or a few element(s) of w grew prominently during learning, and so w became close to an edge or boundary of the non-negative quadrant and thereby angle between w and other vector became generally large (as illustrated in Fig. 8D). Figure 8Ea shows the mean±SEM of the elements of w ordered from the largest to smallest ones after 1500 trials. As conjectured above, a few elements indeed grew prominently.

      We considered that if a slight decay (forgetting) of value weights (c.f., [59-61]) was assumed, such a prominent growth of a few elements of w may be mitigated and alignment of w to c, beyond the initial loose alignment because of the non-negative constraint, may occur. These conjectures were indeed confirmed by simulations (Fig. 8Eb,c and Fig. 8F). The mean squared value error slightly increased when the value-weight decay was assumed (Fig. 8G), however, presumably reflecting a decrease in developed values and a deterioration of learning because of the decay.”

      As for visualization, because the model's dimension was high (e.g., 12), we could not come up with better ways of visualization than the trial versus angle plot (Fig. 3A, 8A,F). Nevertheless, we expect that the abovementioned additional analyses of loose alignment (with graphs) are useful for understanding what is going on.

      (13) Line 426: how does this compare to some of the reward-modulated Hebbian rules proposed in other RNNs? See Hoerzer, G. M., Legenstein, R., & Maass, W. (2014). Put another way, you arrived at this from a top-down approach (gradient descent->BP->approximated by RF->non-negativity constraint->leads to DA-dependent modulation of Hebbian plasticity). How might this compare to a bottom-up approach (i.e., starting from the principle of Hebbian learning, and adding in reward modulation)?

      The study of Hoerzer et al. 2014 used a stochastic perturbation, which we did not assume but can potentially be integrated. On the other hand, Hoerzer et al. trained the readout of untrained RNN, whereas we trained both RNN and its readout. We have added discussion to compare our model with Hoerzer et al. and other works that also used perturbation methods, as well as other top-down approximation method, in Line 685-711 (reference 128 is Hoerzer et al. 2014 Cereb Cortex):

      “As an alternative to backprop in hierarchical network, aside from feedback alignment [36], Associative Reward-Penalty (A<sub>R-P</sub>) algorithm has been proposed [124-126]. In A<sub>R-P</sub>, the hidden units behave stochastically, allowing the gradient to be estimated via stochastic sampling. Recent work [127] has proposed Phaseless Alignment Learning (PAL), in which high-frequency noise-induced learning of feedback projections proceeds simultaneously with learning of forward projections using the feedback in a lower frequency. Noise-induced learning of the weights on readout neurons from untrained RNN by reward-modulated Hebbian plasticity has also been demonstrated [128]. Such noise- or perturbation-based [40] mechanisms are biologically plausible because neurons and neural networks can exhibit noisy or chaotic behavior [129-131], and might improve the performance of value-RNN if implemented.

      Regarding learning of RNN, "e-prop" [35] was proposed as a locally learnable online approximation of BPTT [27], which was used in the original value RNN [26]. In e-prop, neuron-specific learning signal is combined with weight-specific locally-updatable "eligibility trace". Reward-based e-prop was also shown to work [35], both in a setup not introducing TD-RPE with symmetric or random feedback (their Supplementary Figure 5) and in another setup introducing TD-RPE with symmetric feedback (their Figure 4 and 5). Compared to these, our models differ in multiple ways.

      First, we have shown that alignment to random feedback occurs in the models driven by TD-RPE. Second, our models do not have "eligibility trace" (nor memorable/gated unit, different from the original value-RNN [26]), but could still solve temporal credit assignment to a certain extent because TD learning is by itself a solution for it (notably, recent work showed that combination of TD(0) and model-based RL well explained rat's choice and DA patterns [132]). However, as mentioned before, single time-step in our models was assumed to correspond to hundreds of milliseconds, incorporating slow synaptic dynamics, whereas e-prop is an algorithm for spiking neuron models with a much finer time scale. From this aspect, our models could be seen as a coarse-time-scale approximation of e-prop. On top of these, our results point to a potential computational benefit of biological non-negative constraint, which could effectively limit the parameter space and promote learning.”

      Related to your latter point (and also replying to other reviewer's comment), we also examined the cases where the random feedback in our model was replaced with uniform feedback, which corresponds to a simple bottom-up reward-modulated triplet plasticity rule. As a result, the model with uniform feedback showed largely comparable, but somewhat worse, performance than the model with random feedback. We have added the results in Fig. 2J-right and Line 206-209 (for our original models without non-negative constraint):

      “The green line in Fig. 2J-right shows the performance of a special case where the random feedback in oVRNNrf was fixed to the direction of (1, 1, ..., 1)<sup>T</sup> (i.e., uniform feedback) with a random coefficient, which was largely comparable to, but somewhat worse than, that for the general oVRNNrf (blue line).”

      and Fig. 6E-right and Line 402-407 (for our extended models with non-negative constraint):

      “The green and light blue lines in the right panels of Figure 6E and Figure 6F show the results for special cases where the random feedback in oVRNNrf-bio was fixed to the direction of (1, 1, ..., 1) <sup>T</sup> (i.e., uniform feedback) with a random non-negative magnitude (green line) or a fixed magnitude of 0.5 (light blue line). The performance of these special cases, especially the former (with random magnitude) was somewhat worse than that of oVRNNrf-bio, but still better than that of the models with untrained RNN.”

      and also added a biological implication of the results in Line 644-652:

      “We have shown that oVRNNrf and oVRNNrf-bio could work even when the random feedback was uniform, i.e., fixed to the direction of (1, 1, ..., 1) <sup>T</sup>, although the performance was somewhat worse. This is reasonable because uniform feedback can still encode scalar TD-RPE that drives our models, in contrast to a previous study [45], which considered DA's encoding of vector error and thus regarded uniform feedback as a negative control. If oVRNNrf/oVRNNrf-bio-like mechanism indeed operates in the brain and the feedback is near uniform, alignment of the value weights w to near (1, 1, ..., 1) is expected to occur. This means that states are (learned to be) represented in such a way that simple summation of cortical neuronal activity approximates value, thereby potentially explaining why value is often correlated with regional activation (fMRI BOLD signal) of cortical regions [113].”

      Reviewer #3 (Public review):

      Summary:

      The paper studies learning rules in a simple sigmoidal recurrent neural network setting. The recurrent network has a single layer of 10 to 40 units. It is first confirmed that feedback alignment (FA) can learn a value function in this setting. Then so-called bio-plausible constraints are added: (1) when the value weights (readout) are non-negative, (2) when the activity is non-negative (normal sigmoid rather than downscaled between -0.5 and 0.5), (3) when the feedback weights are non-negative, (4) when the learning rule is revised to be monotonic: the weights are not downregulated. In the simple task considered, none of the four biological features appears to totally impair the learning.

      Strengths:

      (1) The learning rules are implemented in a low-level fashion of the form: (pre-synaptic-activity) x (post-synaptic-activity) x feedback x RPE. Which is therefore interpretable in terms of measurable quantities in the wet-lab.

      (2) I find that non-negative FA (FA with non negative c and w) is the most valuable theoretical insight of this paper: I understand why the alignment between w and c is automatically better at initialization.

      (3) The task choice is relevant since it connects with experimental settings of reward conditioning with possible plasticity measurements.
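
      The update form named in strength (1) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function name `three_factor_update`, the learning rate `lr`, and the omission of any nonlinearity-derivative term are assumptions made for brevity.

```python
def three_factor_update(weights, pre, post, feedback, rpe, lr=0.1):
    """Return updated recurrent weights (hypothetical helper).

    Each weight from unit j to unit i changes by
    lr * rpe * feedback[i] * post[i] * pre[j],
    i.e. (pre-synaptic) x (post-synaptic) x feedback x RPE.
    """
    return [
        [w_ij + lr * rpe * feedback[i] * post[i] * pre[j]
         for j, w_ij in enumerate(row)]
        for i, row in enumerate(weights)
    ]


# Toy numbers chosen only to show the arithmetic.
new_weights = three_factor_update(
    weights=[[0.0, 0.0], [0.0, 0.0]],
    pre=[1.0, 0.5],
    post=[0.2, 0.4],
    feedback=[1.0, 2.0],
    rpe=0.5,
)
```

      Every factor in the product is local to the synapse except the scalar RPE and the per-unit feedback, which is what makes the rule readable in terms of measurable quantities.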

      Weaknesses:

      (4) The task is rather easy, so it's not clear that it really captures the computational gap that exists between FA (gradient-like learning) and simpler learning rules like a delta rule: RPE x (pre-synaptic) x (post-synaptic). To check that the task is not too trivial, I suggest adding a control where the vector c is constant, c_i=1.

      We have examined the cases where the feedback was uniform, i.e., in the direction of (1, 1, ..., 1) in both models without and with non-negative constraint. In both models, the models with uniform feedback performed somewhat worse than the original models with random feedback, but still better than the models with untrained RNN. We have added the results in Fig. 2J-right and Line 206-209 (for our original models without non-negative constraint):

      “The green line in Fig. 2J-right shows the performance of a special case where the random feedback in oVRNNrf was fixed to the direction of (1, 1, ..., 1) <sup>T</sup> (i.e., uniform feedback) with a random coefficient, which was largely comparable to, but somewhat worse than, that for the general oVRNNrf (blue line).”

      and Fig. 6E-right and Line 402-407 (for our extended models with non-negative constraint):

      “The green and light blue lines in the right panels of Figure 6E and Figure 6F show the results for special cases where the random feedback in oVRNNrf-bio was fixed to the direction of (1, 1, ..., 1) <sup>T</sup> (i.e., uniform feedback) with a random non-negative magnitude (green line) or a fixed magnitude of 0.5 (light blue line). The performance of these special cases, especially the former (with random magnitude) was somewhat worse than that of oVRNNrf-bio, but still better than that of the models with untrained RNN.”

      We have also added a discussion on the biological implication of the model with uniform feedback mentioned in our provisional reply in Line 644-652:

      “We have shown that oVRNNrf and oVRNNrf-bio could work even when the random feedback was uniform, i.e., fixed to the direction of (1, 1, ..., 1) <sup>T</sup>, although the performance was somewhat worse. This is reasonable because uniform feedback can still encode scalar TD-RPE that drives our models, in contrast to a previous study [45], which considered DA's encoding of vector error and thus regarded uniform feedback as a negative control. If oVRNNrf/oVRNNrf-bio-like mechanism indeed operates in the brain and the feedback is near uniform, alignment of the value weights w to near (1, 1, ..., 1) is expected to occur. This means that states are (learned to be) represented in such a way that simple summation of cortical neuronal activity approximates value, thereby potentially explaining why value is often correlated with regional activation (fMRI BOLD signal) of cortical regions [113].”

      In addition, while preparing the revised manuscript, we found a recent simulation study, which showed that uniform feedback coupled with positive forward weights was effective in supervised learning of one-dimensional output in feed-forward network (Konishi et al., 2023, Front Neurosci).

      We have briefly discussed this work in Line 653-655:

      “Notably, uniform feedback coupled with positive forward weights was shown to be effective also in supervised learning of one-dimensional output in feed-forward network [114], and we guess that loose alignment may underlie it.”

      (5) Related to point 3), the main strength of this paper is to draw potential connection with experimental data. It would be good to highlight more concretely the prediction of the theory for experimental findings. (Ideally, what should be observed with non-negative FA that is not expected with FA or a delta rule (constant global feedback) ?).

      We have added a discussion on the prediction of our models, mentioned in our provisional reply, in Line 627-638:

      “oVRNNrf predicts that the feedback vector c and the value-weight vector w become gradually aligned, while oVRNNrf-bio predicts that c and w are loosely aligned from the beginning. Element of c could be measured as the magnitude of pyramidal cell's response to DA stimulation. Element of w corresponding to a given pyramidal cell could be measured, if striatal neuron that receives input from that pyramidal cell can be identified (although technically demanding), as the magnitude of response of the striatal neuron to activation of the pyramidal cell. Then, the abovementioned predictions could be tested by (i) identify cortical, striatal, and VTA regions that are connected, (ii) identify pairs of cortical pyramidal cells and striatal neurons that are connected, (iii) measure the responses of identified pyramidal cells to DA stimulation, as well as the responses of identified striatal neurons to activation of the connected pyramidal cells, and (iv) test whether DA→pyramidal responses and pyramidal→striatal responses are associated across pyramidal cells, and whether such associations develop through learning.”

      Moreover, we have considered another (technically more doable) prediction of our model, and described it in Line 639-643:

      “Testing this prediction, however, would be technically quite demanding, as mentioned above. An alternative way of testing our model is to manipulate the cortical DA feedback and see if it will cause (re-)alignment of value weights (i.e., cortical striatal strengths). Specifically, our model predicts that if DA projection to a particular cortical locus is silenced, effect of the activity of that locus on the value-encoding striatal activity will become diminished.”

      (6a) Random feedback with RNN in RL have been studied in the past, so it is maybe worth giving some insights how the results and the analyzes compare to this previous line of work (for instance in this paper [1]). For instance, I am not very surprised that FA also works for value prediction with TD error. It is also expected from the literature that the RL + RNN + FA setting would scale to tasks that are more complex than the conditioning problem proposed here, so is there a more specific take-home message about non-negative FA? or benefits from this simpler toy task? [1] https://www.nature.com/articles/s41467-020-17236-y

      As for a specific feature of non-negative models, we did not describe (actually did not well recognize) an intriguing result that the non-negative random feedback model performed generally better than the models without non-negative constraint with either backprop or random feedback (Fig. 2J-left versus Fig. 6E-left (please mind the difference in the vertical scales)). This suggests that the non-negative constraint effectively limited the parameter space and thereby learning became efficient. We have added this result in Line 392-395:

      “Remarkably, oVRNNrf-bio generally achieved better performance than both oVRNNbp and oVRNNrf, which did not have the non-negative constraint (Wilcoxon rank sum test, vs oVRNNbp: p < 7.8×10<sup>−6</sup> for 5 or ≥25 RNN units; vs oVRNNrf: p < 0.021 for ≤10 or ≥20 RNN units).”

      Also, in the models with non-negative constraint, the model with random feedback learned more rapidly than the model with backprop although they eventually reached a comparable level of errors, at least in the case with 20 RNN units. This is presumably because the value weights did not develop well in early trials and so the backprop-based feedback, which was the same as the value weights, did not work well, while the non-negative fixed random feedback worked finely from the beginning. We have added this result in Fig. 6I and Line 417-422:

      “Figure 6I shows how learning proceeded across trials in the models with 20 RNN units. While oVRNNbp-rev and oVRNNrf-bio eventually reached a comparable level of errors, oVRNNrf-bio outperformed oVRNNbp-rev in early trials (at 200, 300, 400, or 500 trials; p < 0.049 in Wilcoxon rank sum test for each). This is presumably because the value weights did not develop well in early trials and so the backprop-type feedback, which was the same as the value weights, did not work well, while the non-negative fixed random feedback worked finely from the beginning.”

      We have also added a discussion on how our model can be positioned in relation to other models, including the study you mentioned (e-prop by Bellec, ..., Maass, 2020), in subsection “Comparison to other algorithms” of the Discussion.

      Regarding the slightly better performance of the non-negative model with random feedback than that of the non-negative model with backprop when the number of RNN units was large (mentioned in our provisional reply), state values in the backprop model appeared less developed than those in the random feedback model. Slightly better performance of random feedback than backprop held also in our extended model incorporating excitatory and inhibitory units (Fig. 9B).

      (6b) Related to task complexity, it is not clear to me if non-negative value and feedback weights would generally scale to harder tasks. If the task is so simple that a global RPE signal is sufficient to learn (see 4 and 5), then it could be good to extend the task to find a substantial gap between: global RPE, non-negative FA, FA, BP. For a well chosen task, I expect to see a performance gap between any pair of these four learning rules. In the context of the present paper, it would be particularly interesting to study the failure mode of non-negative FA and the cases where it does perform as well as FA.

      In the cue-reward association task with 3 time-steps delay, the non-negative model with random feedback performed largely comparably to the non-negative model with backprop, and this continued to hold in a task where a distractor cue, which was not associated with reward, appeared at random timings. We have added the results in Fig. 10 and subsection “4.2 Task with distractor cue”.

      We have also examined the cases where the cue-reward delay was elongated. In the case of longer cue-reward delay (6 time-steps), in the models without non-negative constraint, the model with random feedback performed comparably to (and slightly better than when the number of RNN units was large) the model with backprop (Fig. 2M). In contrast, in the models with non-negative constraint, the model with random feedback underperformed the model with backprop (Fig. 6J, left-bottom). This indicates a difference between the effect of non-negative random feedback and the effect of positive+negative random feedback.

      We have further examined the performance of the models in terms of action selection, by extending the models to incorporate an actor-critic algorithm. In a task with inter-temporal choice (i.e., immediate small reward vs delayed large reward), the non-negative model with random feedback performed worse than the non-negative model with backprop when the number of RNN units was small. When the number of RNN units increased, these models performed more comparably. These results are described in Fig. 11 and subsection “4.3 Incorporation of action selection”.

      (7) I find that the writing could be improved, it mostly feels more technical and difficult than it should. Here are some recommendations:

      7a) for instance the technical description of the task (CSC) is not fully described and requires background knowledge from other papers, which is not desirable.

      7b) Also the rationale for the added difficulty with the stochastic reward and new state is not well explained.

      7c) In the technical description of the results I find that the text dives into descriptive comments of the figures but high-level take home messages would be helpful to guide the reader. I got a bit lost, although I feel that there is probably a lot of depth in these paragraphs.

      As for 7a), 'CSC (complete serial compound)' was actually not the name of the task but the name of the 'punctate' state representation, in which each state (timing from cue) is represented in a punctate manner, i.e., by a one-hot vector such as (1, 0, ..., 0), (0, 1, ..., 0), ..., and (0, 0, ..., 1). As you pointed out, using the name 'CSC' would make the text appear more technical than it actually is, and so we have moved the reference to the name 'CSC' to the Methods (Line 903-907):

      “For the agents with punctate state representation, which is also referred to as the complete serial compound (CSC) representation [1, 48, 133], each timing from a cue in the tasks was represented by a 10-dimensional one-hot vector, starting from (1 0 0 ... 0)<sup>T</sup> for the cue state, with the next state (0 1 0 ... 0) <sup>T</sup> and so on.”

      and in the Results we have instead added a clearer explanation (Line 163-165):

      “First, for comparison, we examined traditional TD-RL agent with punctate state representation (without using the RNN), in which each state (time-step from a cue) was represented in a punctate manner, i.e., by a one-hot vector such as (1, 0, ..., 0), (0, 1, ..., 0), and so on.”
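
      The punctate (one-hot) representation quoted above can be illustrated with a minimal TD(0) sketch. It is not the manuscript's code: the discount factor `gamma`, the learning rate `lr`, the number of trials, and the simplification that each trial runs only from cue to reward (with no inter-trial interval) are all assumptions.

```python
def one_hot(index, size=10):
    """Punctate state vector: 1 at `index`, 0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td_learn(n_trials=500, delay=3, gamma=0.8, lr=0.1, size=10):
    """Learn value weights for a cue followed by reward `delay` steps later."""
    w = [0.0] * size
    for _ in range(n_trials):
        for t in range(delay):                      # cue state is t = 0
            x = one_hot(t, size)
            reached_reward = (t + 1 == delay)
            reward = 1.0 if reached_reward else 0.0
            # The trial ends at reward, so no value is bootstrapped from there.
            v_next = 0.0 if reached_reward else dot(w, one_hot(t + 1, size))
            delta = reward + gamma * v_next - dot(w, x)   # TD-RPE
            w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
    return w


values = td_learn()
```

      With a one-hot input, each value weight is simply the learned value of one time-step, so under these assumptions the weights approach 1, `gamma`, and `gamma` squared going backward from the reward.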

      As for 7b), we have added the rationale for our examination of the tasks with probabilistic structures (Line 282-294):

      “Previous work [54] examined the response of DA neurons in cue-reward association tasks in which reward timing was probabilistically determined (early in some trials but late in other trials). There were two tasks, which were largely similar but there was a key difference that reward was given in all the trials in one task whereas reward was omitted in some randomly determined trials in another task. Starkweather et al. [54] found that the DA response to later reward was smaller than the response to earlier reward in the former task, presumably reflecting the animal's belief that delayed reward will surely come, but the opposite was the case in the latter task, presumably because the animal suspected that reward was omitted in that trial. Starkweather et al.[54] then showed that such response patterns could be explained if DA encoded TD-RPE under particular state representations that incorporated the probabilistic structures of the task (called the 'belief state'). In that study, such state representations were 'handcrafted' by the authors, but the subsequent work [26] showed that the original value-RNN with backprop (BPTT) could develop similar representations and reproduce the experimentally observed DA patterns.”

      As for 7c), we have extensively revised the text of the results, adding high-level explanations while trying to reduce the lengthy low-level descriptions (e.g., Line 172-177 for Fig2E-G).

      (8) Related to the writing issue and 5), I wished that "bio-plausibility" was not the only reason to study positive feedback and value weights. Is it possible to develop a bit more specifically what and why this positivity is interesting? Is there an expected finding with non-negative FA both in the model capability? or maybe there is a simpler and crisp take-home message to communicate the experimental predictions to the community would be useful?

      There is actually an unexpected finding with the non-negative model: the non-negative random feedback model performed generally better than the models without non-negative constraint with either backprop or random feedback (Fig. 2J-left versus Fig. 6E-left), presumably because the non-negative constraint effectively limited the parameter space and thereby learning became efficient, as we mentioned in our reply to your point 6a above (we did not well recognize this at the time of original submission).

      Another potential merit of our present work is the simplicity of the model and the task. This simplicity enabled us to derive an intuitive explanation of why feedback alignment could occur. Such an intuitive explanation was lacking in previous studies while more precise mathematical explanations did exist. Related to the mechanism of feedback alignment, one thing remained mysterious to us at the time of original submission. Specifically, in the non-negatively constrained random feedback model, while the angle between the value weight (w) and the random feedback (c) was relatively small (loosely aligned) from the beginning, it appeared (as mentioned in the manuscript) that there was no further alignment over trials (and the angle actually settled at somewhat larger than 45°), even though the same mechanism for feedback alignment that we derived for the model without non-negative constraint was expected to operate also under the non-negative constraint. We have now clarified the reason for this, and found a way, introduction of slight decay (forgetting) of value weights, by which feedback alignment came to occur in the non-negatively constrained model. We have added this in Line 463-477:

      “As mentioned above, while the angle between w and c was on average smaller than 90° from the beginning, there was no further alignment over trials. This seemed mysterious because the mechanism for feedback alignment that we derived for the models without non-negative constraint was expected to work also for the models with non-negative constraint. As a possible reason for the non-occurrence of feedback alignment, we guessed that one or a few element(s) of w grew prominently during learning, and so w became close to an edge or boundary of the non-negative quadrant and thereby angle between w and other vector became generally large (as illustrated in Fig. 8D). Figure 8Ea shows the mean±SEM of the elements of w ordered from the largest to smallest ones after 1500 trials. As conjectured above, a few elements indeed grew prominently.

      We considered that if a slight decay (forgetting) of value weights (c.f., [59-61]) was assumed, such a prominent growth of a few elements of w may be mitigated and alignment of w to c, beyond the initial loose alignment because of the non-negative constraint, may occur. These conjectures were indeed confirmed by simulations (Fig. 8Eb,c and Fig. 8F). The mean squared value error slightly increased when the value-weight decay was assumed (Fig. 8G), however, presumably reflecting a decrease in developed values and a deterioration of learning because of the decay.”
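
      The geometric intuition in the quoted passage (loose alignment inside the non-negative quadrant, and large angles once a few elements dominate) can be checked with a small numerical sketch. The uniform sampling distributions, the dimension of 20, and the particular "dominated" vector are assumptions chosen for illustration, not values from the manuscript.

```python
import math
import random


def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos))


def mean_angle(n_units, sample, n_pairs=2000, seed=0):
    """Average angle between pairs of random vectors drawn with `sample`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        u = [sample(rng) for _ in range(n_units)]
        v = [sample(rng) for _ in range(n_units)]
        total += angle_deg(u, v)
    return total / n_pairs


# Signed elements: random vectors are close to orthogonal on average.
signed = mean_angle(20, lambda rng: rng.uniform(-1.0, 1.0))
# Non-negative elements: vectors share a quadrant, so the angle is well below 90 degrees.
non_negative = mean_angle(20, lambda rng: rng.uniform(0.0, 1.0))
# One dominant element pushes w toward an axis, far from a uniform vector.
dominated = angle_deg([10.0] + [0.1] * 19, [1.0] * 20)
```

      Under these assumptions the non-negative case starts loosely aligned without any learning, while a vector dominated by a single element sits at a large angle even to a uniform non-negative vector, consistent with the explanation illustrated in Fig. 8D.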

      Correction of an error in the original manuscript

      In addition to revising the manuscript according to your comments, we have made a correction on the way of estimating the true state values. Specifically, in the original manuscript, we defined states by relative time-steps from a reward and estimated their values by calculating the sums of discounted future rewards starting from them through simulations. However, we assumed variable inter-trial intervals (ITIs) (4, 5, 6, or 7 time-steps with equal probabilities), and so until receiving cue information, agent should not know when the next reward will come. Therefore, states for the timings up to the cue timing cannot be defined by the upcoming reward, but previously we did so (e.g., state of "one timestep before cue") without taking into account the ITI variability.

      We have now corrected this issue, having defined the states of timings with respect to the previous (rather than upcoming) reward. For example, when ITI was 4 time-steps and agent existed in its last time-step, agent will in fact receive a cue at the next time-step, but agent should not know it until actually receiving the cue information and instead should assume that s/he was at the last time-step of ITI (if ITI was 4), last − 1 (if ITI was 5), last − 2 (if ITI was 6), or last − 3 (if ITI was 7) with equal probabilities (in a similar fashion to what we considered when thinking about state definition for the probabilistic tasks). We estimated the true values of states defined in this way through simulations. As a result, the corrected true value of the cue-timing has become slightly smaller than the value described in the original manuscript (reflecting the uncertainty about ITI length), and consequently small positive TD-RPE has now appeared at the cue timing.
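
      The uncertainty about ITI length described above can be made concrete with a short sketch of the agent's belief. The function name `cue_hazard` and the convention that ITI time-steps are counted from 1 after the previous reward are assumptions; the four equally likely ITI lengths are taken from the text.

```python
from fractions import Fraction

ITI_LENGTHS = (4, 5, 6, 7)   # equally likely, as stated in the text


def cue_hazard(elapsed):
    """P(cue arrives at the next time-step | `elapsed` ITI steps, no cue yet)."""
    still_possible = [n for n in ITI_LENGTHS if n >= elapsed]
    if not still_possible:
        raise ValueError("elapsed exceeds the longest ITI")
    ending_now = sum(1 for n in still_possible if n == elapsed)
    return Fraction(ending_now, len(still_possible))


hazards = {k: cue_hazard(k) for k in range(1, 8)}
```

      At the fourth time-step all four ITI lengths remain possible, so the cue follows with probability 1/4. This is why the agent cannot treat that moment as "one time-step before cue", and why states up to the cue timing are better defined with respect to the previous reward.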

      Because we measured the performance of the models by squared errors in state values, this correction affected the results reporting the performance. Fortunately, the effects were relatively minor and did not largely alter the results of performance comparisons. However, we sincerely apologize for this error. In the revised manuscript, we have used the corrected true values throughout the manuscript, and we have described the ways of estimating these values in Line 919-976.

    1. eLife Assessment

      This important manuscript presents a thorough analysis of trans-specific polymorphism (TSP) in Major Histocompatibility Complex gene families across primates. The analysis makes the most of currently available genomic data and methods to substantially increase the amount and evolutionary time that TSPs can be observed. Both false negative TSPs due to missing genes at the assembly and/or annotation level, as well as false positives due to read mismapping with missing paralogs, are well assessed and discussed. Overall the evidence provided is compelling, and the manuscript clearly delineates the path for future progress on the topic.

    2. Reviewer #2 (Public review):

      Summary:

      In this study, the authors characterized population genetic variation in the MHC locus across primates and looked for signals of long-term balancing selection (specifically trans-species polymorphism, TSP) in this highly polymorphic region. To carry out these tasks, they used Bayesian methods for phylogenetic inference (i.e. BEAST2) and applied a new Bayesian test to quantify evidence supporting monophyly vs. trans-species polymorphism for each exon across different species pairs. Their results, although mostly confirmatory, represent the most comprehensive analyses of primate MHC evolution to date and novel findings or possible discrepancies are clearly pointed out. However, as the authors discuss, the available data are insufficient to fully capture primates' MHC evolution.

      Strengths of the paper include: using appropriate methods and statistically rigorous analyses; very clear figures and detailed description of the results and methods that make it easy to follow despite the complexity of the region and approach; a clever test for TSP that is then complemented by positive selection tests and the protein structures for a quite comprehensive study.

      That said, weaknesses include: lack of information about how many sequences are included and whether uneven sampling across taxa might result in some comparisons without evidence for TSP; frequent reference to the companion paper instead of summarizing (at least some of) the critical relevant information (e.g., how was orthology inferred?); no mention of the quality of sequences in the database and whether there are still potential effects of mismapping or copy number variation affecting the sequence comparison.

      Comments on revisions:

      The authors have sufficiently addressed the reviewers' comments or provided additional details justifying their work. In particular, expansion of the discussion section on limitations of the analysis and clearer reference to how this relates to their companion paper represent improvements. Remaining suggestions are to still make clearer how much sparsity of sequences in the database may impact the conclusions (e.g., is this more of a problem for some genes or taxa than others? Is it a small problem or a large problem?). The data summary tables are a bit hard to read and seem to contain some information not used in the article. Maybe the presentation of these could be improved, or a shorter summary table could be given in the main paper with the full details only in the supplement.

    3. Reviewer #3 (Public review):

      Summary:

      The study uses publicly available sequences of classical and non-classical genes from a number of primate species to assess the extent and depth of TSP across the primate phylogeny. The analyses were carried out in a coherent and, in my opinion, robust inferential framework and provide evidence for ancient (even > 30 million years) TSP at several classical class I and class II genes. The authors also characterise evolutionary rates at individual codons, map these rates onto MHC protein structures, and find that the fastest evolving codons are extremely enriched for autoimmune and infectious disease associations.

      Strengths:

      The study is comprehensive, relying on a large data set, state-of-the-art phylogenetic analyses and elegant tests of TSP. The results are not entirely novel, but a synthesis and re-analysis of previous findings is extremely valuable and timely.

      Weaknesses:

      Following the revision by the Authors I see mostly one weakness: older literature on the subject is duly cited, but the discussion of the findings in the context of this literature is limited.

      Comments on revisions:

      Lines 441-452 - In this section, you discuss an apparent paradox between long-lived balancing selection and strong directional selection, referencing elevated substitution rates. However, this issue is more nuanced and may not be best framed in terms of substitution rates. That terminology is common in phylogenetic analyses, where differences between sequences, or changes along phylogenetic branches, are often interpreted as true substitutions in the population genetic sense. In the case of MHC trees and the rates you're discussing here, the focus is more accurately on the rate at which new mutations become established within particular allelic lineages. So while this still concerns evolutionary rates at specific codons, equating them directly with substitution rates may be misleading. A more precise term or framing might be warranted in this context.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      MHC (Major Histocompatibility Complex) genes have long been mentioned as cases of trans-species polymorphism (TSP), where alleles might have their most recent common ancestor with alleles in a different species, rather than other alleles in the same species (e.g., a human MHC allele might coalesce with a chimp MHC allele, more recently than the two coalesce with other alleles in either species). This paper provides a more complete estimate of the extent and ages of TSP in primate MHC loci. The data clearly support deep TSP linking alleles in humans to (in some cases) old world monkeys, but the amount of TSP varies between loci.
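
      The tree pattern described in this summary can be illustrated with a toy monophyly check. This is only a sketch of the concept, not the Bayesian test used in the paper; the nested-tuple tree encoding and the allele labels are hypothetical.

```python
def leaves(tree):
    """Set of tip labels under a node; a tip is a plain string."""
    if isinstance(tree, str):
        return {tree}
    tips = set()
    for child in tree:
        tips |= leaves(child)
    return tips


def is_monophyletic(tree, group):
    """True if some node of the rooted tree has exactly `group` as its tips."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical alleles: each human allele pairs with a chimp allele.
tsp_tree = (("human_A1", "chimp_A1"), ("human_A2", "chimp_A2"))
# Hypothetical alleles: each species' alleles group together.
species_tree = (("human_A1", "human_A2"), ("chimp_A1", "chimp_A2"))

humans = {"human_A1", "human_A2"}
tsp_pattern = not is_monophyletic(tsp_tree, humans)
no_tsp_pattern = is_monophyletic(species_tree, humans)
```

      In the first toy tree the human alleles do not form their own clade, which is the signature of trans-species polymorphism; in the second they do, which is the pattern expected when alleles coalesce within species.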

      Strengths:

      The authors use publicly available datasets to build phylogenetic trees of MHC alleles and loci. From these trees they are able to estimate whether there is compelling support for Trans-species polymorphisms (TSPs) using Bayes Factor tests comparing different alternative hypotheses for tree shape. The phylogenetic methods are state-of-the-art and appropriate to the task.
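
      For readers unfamiliar with Bayes factor tests, the general quantity is the ratio of posterior odds to prior odds for one hypothesis over its alternative. The sketch below shows only that generic arithmetic; the function name and the idea of passing in probabilities estimated elsewhere are assumptions, and the paper's own procedure for obtaining those probabilities is not reproduced here.

```python
def bayes_factor(posterior_h1, prior_h1):
    """Bayes factor for H1 against its complement.

    Both arguments are probabilities of H1 strictly between 0 and 1,
    e.g. H1 = "the alleles of one species are not monophyletic".
    """
    for p in (posterior_h1, prior_h1):
        if not 0.0 < p < 1.0:
            raise ValueError("probabilities must lie strictly between 0 and 1")
    posterior_odds = posterior_h1 / (1.0 - posterior_h1)
    prior_odds = prior_h1 / (1.0 - prior_h1)
    return posterior_odds / prior_odds


# Toy numbers: the data move H1 from even odds to 9-to-1 odds.
example_bf = bayes_factor(posterior_h1=0.9, prior_h1=0.5)
```

      A Bayes factor well above 1 indicates that the data shifted belief toward H1 relative to the prior, which is why the prior probability of each tree shape matters as much as the posterior.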

      The authors supplement their analyses of TSP with estimates of selection (e.g., dN/dS ratios) on motifs within the MHC protein. They confirm what one would suspect: classical MHC genes exhibit stronger selection at amino acid residues that are part of the peptide binding region, and non-classical MHC exhibit less evidence of selection. The selected sites are associated with various diseases in GWAS studies.

      Weaknesses:

      An implication drawn from this paper (and previous literature) is that MHC has atypically high rates of TSP. However, rates of TSP are not estimated for other genes or gene families, so readers have no basis of comparison. No framework to know whether the depth and frequency of TSP is unusual for MHC family genes, relative to other random genes in the genome, or immune genes in particular. I expect (from previous work on the topic), that MHC is indeed exceptional in this regard, but some direct comparison would provide greater confidence in this conclusion.

      We agree that context is important! Although we expected to get the most interesting results from studying the classical genes, we did include the non-classical genes specifically for comparison. They are located in the same genomic region, have multiple sequences catalogued in different species (although they are less diverse), and perform critical immune functions. We think this is a more appropriate set to compare with the classical MHC genes than, say, a random set of genes. Interestingly, we did not detect TSP in these non-classical genes. This likely means that the classical MHC genes are truly exceptional, but it could also mean that not enough sequences are available for the non-classical genes to detect TSP. 

      It would be very interesting to repeat this analysis for another gene family to see whether such deep TSP also occurs in other immune or non-immune gene families. We are lucky that decades of past work and a dedicated database exist for cataloging MHC sequences. When this level of sequence collection is achieved for other highly polymorphic gene families, it will be possible to do a comparable analysis.

      Given the companion paper's evidence of genic gain/loss, it seems like there is a real risk that the present study under-estimates TSP, if cases of TSP have been obscured by the loss of the TSP-carrying gene paralog from some lineages needed to detect the TSP. Are the present analyses simply calculating rates of TSP of observed alleles, or are you able to infer TSP rates conditional on rates of gene gain/loss?

      We were not able to infer TSP rates conditional on rates of gene gain/loss. We agree that some cases of TSP were likely lost due to the loss of a gene paralog from certain species. Furthermore, the dearth of MHC whole-region and allele sequences available for most primates makes it difficult to detect TSP, even if the gene paralog is still present. Long-read sequencing of more primate genomes should help with this. We agree that it would also be very interesting to study TSPs that were maintained for millions of years but were lost recently.

      Figure 5 (and 6) provide regression model fits (red lines in panel C) relating evolutionary rates (y axis not labeled) to site distance from the peptide binding groove, on the protein product. This is a nice result. I wonder, however, whether a linear model (as opposed to non-linear) is the most biologically reasonable choice, and whether non-linear functions have been evaluated. The authors might consider generalized additive models (GAMs) as an alternative that relaxes linearity assumptions.

      We agree that a linear model is likely not the most biologically reasonable choice, as protein interactions are complex. However, we made the choice to implement the simplest model because the evolutionary rates we inferred were relative, making parameters relatively meaningless. We were mainly concerned with positive or negative slopes and we leave the rest to the protein interaction experts.

      The connection between rapidly evolving sites, and disease associations (lines 382-3) is very interesting. However, this is not being presented as a statistical test of association. The authors note that fast-evolving amino acids all have at least one association: but is this really more disease-association than a random amino acid in the MHC? Or, a randomly chosen polymorphic amino acid in MHC? A statistical test confirming an excess of disease associations would strengthen this claim.

      To strengthen this claim, we added Figure 6 - Figure Supplement 7 (NOTE: this needs to be renamed as Table 1 - Figure Supplement 1, which the eLife template does not allow). Here, we plot the number of associations for each amino acid against evolutionary rate, revealing a significant positive slope in Class I. We also added explanatory text for this figure in lines 400-404.

      Reviewer #2 (Public review):

      Summary

      In this study, the authors characterized population genetic variation in the MHC locus across primates and looked for signals of long-term balancing selection (specifically trans-species polymorphism, TSP) in this highly polymorphic region. To carry out these tasks, they used Bayesian methods for phylogenetic inference (i.e. BEAST2) and applied a new Bayesian test to quantify evidence supporting monophyly vs. trans-species polymorphism for each exon across different species pairs. Their results, although mostly confirmatory, represent the most comprehensive analyses of primate MHC evolution to date and novel findings or possible discrepancies are clearly pointed out. However, as the authors discuss, the available data are insufficient to fully capture primates' MHC evolution.

      Strengths of the paper include: using appropriate methods and statistically rigorous analyses; very clear figures and detailed description of the results and methods that make it easy to follow despite the complexity of the region and approach; a clever test for TSP that is then complemented by positive selection tests and the protein structures for a quite comprehensive study.

      That said, weaknesses include: lack of information about how many sequences are included and whether uneven sampling across taxa might result in some comparisons without evidence for TSP; frequent reference to the companion paper instead of summarizing (at least some of) the critical relevant information (e.g., how was orthology inferred?); no mention of the quality of sequences in the database and whether there are still potential effects of mismapping or copy number variation affecting the sequence comparison.

      To address these comments, we added Tables 2-4 to allow readers to more readily understand the data we included in each group. We refer to these tables in the introduction (line 95), in the “Data” section of the results (lines 128-129), and the “Data” section of the methods (lines 532-534).  We also added text (lines 216-219 and 250-252) to more explicitly point out that our method is conservative when few sequences are available.

      We also added a paragraph to the discussion which addresses data quality and mismapping issues (lines 473-499).

      We clarified the role of our companion paper (lines 49-50) by changing “In our companion paper, we explored the relationships between the different classical and non-classical genes” to “In our companion paper, we built large multi-gene trees to explore the relationships between the different classical and non-classical genes.” We also changed the text in lines 97-99 from “In our companion paper, we compared genes across dozens of species and learned more about the orthologous relationships among them” to “In our companion paper, we built trees to compare genes across dozens of species. When paired with previous literature, these trees helped us infer orthology and assign sequences to genes in some cases.”

      Reviewer #3 (Public review):

      Summary

      The study uses publicly available sequences of classical and non-classical genes from a number of primate species to assess the extent and depth of TSP across the primate phylogeny. The analyses were carried out in a coherent and, in my opinion, robust inferential framework and provided evidence for ancient (even > 30 million years) TSP at several classical class I and class II genes. The authors also characterise evolutionary rates at individual codons, map these rates onto MHC protein structures, and find that the fastest evolving codons are extremely enriched for autoimmune and infectious disease associations.

      Strengths

      The study is comprehensive, relying on a large data set, state-of-the-art phylogenetic analyses and elegant tests of TSP. The results are not entirely novel, but a synthesis and re-analysis of previous findings is extremely valuable and timely.

      Weaknesses

      I've identified weaknesses in several areas (details follow in the next section):

      -  Inadequate description and presentation of the data used

      -  Large parts of the results read like extended figure captions, which breaks the flow.

      -  Older literature on the subject is duly cited, but the authors don't really discuss their findings in the context of this literature.

      -  The potential impact of mechanisms other than long-term maintenance of allelic lineages by balancing selection, such as interspecific introgression and incorrect orthology assessment, needs to be discussed.

      We address these comments in the more detailed section below.

      Recommendations for the authors:  

      Reviewer #1 (Recommendations for the authors):

      The abstract could benefit from being sharpened. A personal pet peeve is a common habit of saying we don't know everything about a topic (line 16 - "lack a full picture of primate MHC evolution"); we never know everything on a topic, so this is hardly a strong rationale to do more work on it. This is followed by "to start addressing this gap" - which is vague because you haven't explicitly stated any gap, you simply said we are not yet omniscient on the topic. Please clearly identify a gap in our knowledge, a question that you will be able to answer with this paper.

      That makes sense! We added another sentence to the abstract to make the specific gap clearer. Inserted “In particular, we do not know to what extent genes and alleles are retained across speciation events” in lines 16-17.

      Reviewer #2 (Recommendations for the authors):

      - Some discussion of alternative explanations when certain comparisons were not found to have TSP - is this consistent with genetic drift sometimes leading to lineage loss, or does it suggest that the proposed tradeoff between autoimmunity and pathogen recognition might differ depending on primates' life history and/or exposure to similar pathogens? Could the trade-off of pathogen to self-recognition not be as costly in some species?

      This is consistent with genetic drift, as no lineages are expected to be maintained across these distantly-diverged primates under neutral evolution. The alternative explanations you raise are certainly possible, but our Bayes Factor test only reveals evidence (or lack thereof) for deviations from the species tree and cannot provide reasons why or why not.

      - It would be interesting to put these results on very long-term balancing selection in the context of what has been reported at the region for shorter term balancing selection. The discussion compares findings of previous genes in the literature but not regarding the time scale.

      Indeed, there is some evidence for the idea of “divergent allele advantage”, in which MHC-heterozygous individuals have a greater repertoire of peptides that they can present, leading to greater resistance against pathogens and greater fitness. This heterozygote advantage thus leads to balancing selection (Pierini and Lenz, 2018; Chowell et al., 2019). Our discussion mentions other time scales of balancing selection across the primates at the MHC and other loci, but we choose to focus more on long-term than short-term balancing selection.

      - Lines 223-226 - how is the difference in BF across exons in MHC-A to be interpreted? The paragraph is about MHC-A, but then the explanation in the last sentence is for when similar BF are observed which is not the case for MHC-A. Is this interpreted as lack of evidence for TSP? Or something about recombination or gene conversion? Or that one exon may be under balancing selection but not the other?

      Thank you for pointing out the confusing logic in this paragraph. 

      Previous: “For MHC-A, Bayes factors vary considerably depending on exon and species pair. Many sequences had to be excluded from MHC-A comparisons because they were identified as gene-converted in the \textit{GENECONV} analysis or were previously identified as recombinants \citep{Hans2017,Gleimer2011,Adams2001}. Importantly, for MHC-A we do not see concordance in Bayes factors across the different exons, whereas we do for the other gene groups. Similar Bayes factors across all exons for a given comparison is thus evidence in favor of TSP being the primary driver of the observed deep coalescence structure (rather than recombination or gene conversion).”

      Current (lines 228-238):

      “For MHC-A, Bayes factors vary considerably depending on exon and species pair. Past work suggests that this gene has had a long history of gene conversion affecting different exons, resulting in different evolutionary histories for different parts of the gene \citep{Hans2017,Gleimer2011,Adams2001}. Indeed, we excluded many MHC-A sequences from our Bayes factor calculations because they were identified as gene-converted in our \textit{GENECONV} analysis or were previously suggested to be recombinants. As shown in \FIG{bayes_factors_classI}, the lack of concordance in Bayes factors across the different exons for MHC-A is evidence for gene conversion, rather than balancing selection, being the most important factor in this gene's evolution. In contrast, the other gene groups generally show concordance in Bayes factors across exons. We interpret this as evidence in favor of TSP being the primary driver of the observed deep coalescence structure for MHC-B and -C (rather than recombination or gene conversion).”

      - In Figures 5C and 6C, the points sometimes show a kind of smile pattern of possibly higher rates further from the peptide. Did authors explore other fits like a polynomial? Or, whether distance only matters in close proximity to the peptide? Out of curiosity, is it possible to map substitution time/branch into the distance to the peptide binding region for each substitution? Is there any pattern with distance to interacting proteins in non-peptide binding MHC proteins like MHC-DOA? Although they don't have a PBR they do interact with other proteins.

      Thank you for these ideas! We did not explore other fits, such as a polynomial, because we wanted to implement the simplest model. Our evolutionary rates are relative, making parameters relatively meaningless. We were mainly concerned with positive or negative slopes and we leave the rest to the protein interaction experts.

      There is most likely a relationship between evolutionary rate and the distance to interacting proteins in the non-peptide-binding molecules MHC-DM and -DO. However, few structural models are currently available, and it is difficult to determine which residues in these models are actually interacting. Researchers with more experience in protein interactions would be better placed to undertake such an analysis.

      - How biased is the database towards human alleles? Could this affect some of the analyses, including the coincidence of rapidly evolving sites with associations? Are there more associations than expected under some null model?

      While the database is indeed biased toward human alleles, we included only a small subset of these in order to create a more balanced data set spanning the primates. This is unlikely to affect the coincidence of rapidly-evolving sites with associations; however, we note that there are no such association studies meeting our criteria in other species, meaning the associations are only coming from studies on humans.

      - To this reader, it is unnecessary and distracting to describe the figures within the text; there are frequent sentences in the text that belongs in the figure legend instead (e.g., lines 139-143, 208-211, 214-215, 328-330, etc). It would be better to focus on the results from the figures and then cite the figure, where the colors and exactly what is plotted can be in the figure legend.

      We appreciate these comments on overall flow. We removed lines 139-143 and lengthened the Figure 2 caption (and associated supplementary figure captions) to contain all necessary detail. We removed lines 208-211 and 214-215 and lengthened the captions for Figure 3, Figure 4, and associated supplementary figures. We removed a sentence from lines 303-304.  

      - I'm still concerned that the poor mappability of short-read data is contributing in some ways. Were the sequences in the database mostly from long-reads? Was nucleotide diversity calculated directly from the sequences in the database or from another human dataset? Is missing data at some sites accounted for in the denominator?

      The sequences in the database are mostly from short reads and come from a wide array of labs. We have added a paragraph to the discussion to explain the limitations of this (lines 473-499). However, the nucleotide diversity calculations shown in Figure 1 do not rely on the MHC database; rather, they are calculated from the human genomes in the 1000 Genomes project. Nucleotide diversity would be calculable for other species, but we did not do so for exactly the reason you mention–too much missing data.

      - The Figure 2 and Figure 3 supplements took me a little bit to understand - is it really worth pointing out the top 5 Bayes-factor comparisons when there is no evidence for TSP? A lot of the colored squares are not actually supporting TSP but in the grids you can't see which are and which aren't without looking at the Bayes Factor. I wonder if it would help if only those with BF > 100 were shown? Or if these were marked some other way so that it was easy to see where TSPs are supported.

      Thank you for your perspective on these figures! We initially limited them to only show >100 Bayes factors for each gene group and region, but some gene groups have no high Bayes factors. Additionally, the “summary” tree pictured in these figures is necessarily a simplification of the full space of posterior trees. We felt that showing low Bayes factor comparisons could help readers understand this relationship. For example, allele sets that look non-monophyletic on the summary tree may still have a low Bayes factor, showing that they are generally monophyletic throughout the larger (un-visualizable) space of trees.
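
      To illustrate how an allele set can look non-monophyletic on a single summary tree and still yield a low Bayes factor, here is a minimal sketch that assumes the Bayes factor is computed as posterior odds divided by prior odds for non-monophyly. The tree counts, the prior probability, and the function name `bayes_factor` are hypothetical; this is not the implementation used in the paper.

```python
def bayes_factor(posterior_prob, prior_prob):
    """Bayes factor as posterior odds divided by prior odds."""
    if not (0.0 < posterior_prob < 1.0 and 0.0 < prior_prob < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds


# Hypothetical example: the allele set is non-monophyletic in only 40 of
# 1,000 sampled posterior trees, with an assumed prior probability of 0.5.
posterior_nonmonophyly = 40 / 1000
prior_nonmonophyly = 0.5

bf = bayes_factor(posterior_nonmonophyly, prior_nonmonophyly)
```

      With these assumed numbers the Bayes factor is far below 100, even though any one of the 40 non-monophyletic trees could in principle resemble the displayed summary tree.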

      Reviewer #3 (Recommendations for the authors):

      Specific comments

      Abstract

      I think the abstract would benefit from some editing. For example, one might get the impression that you equate allele sharing, which would normally be understood as sharing identical sequences, with sharing ancestral allelic lineages. This distinction is important because you can have many TSPs without sharing identical allele sequences. In l. 20 you write about "deep TSP", which requires either definition or reformulation. In l. 21-23 you seem to suggest that long-term retention of allelic lineages is surprising in the light of rapid sequence evolution - it may be, depending on the evolutionary scenarios one is willing to accept, but perhaps it's not necessary to float such a suggestion in the abstract where it cannot be properly explained due to space constraints? The last sentence needs a qualifier like "in some cases".

      Thank you for catching these! For clarity, we changed several words:

      ● “alleles” to “allelic lineages” in line 13

      ● “deep” to “ancient” in line 21

      ● “Despite” to “in addition to” in line 22

      ● Added “in some cases” to line 28

      Results - Overall, parts of the results read like extended figure captions. I understand that the authors want to make the complex figures accessible to the reader. However, including so much information in the text disrupts the flow and makes it difficult to follow what the main findings and conclusions are.

      We appreciate these comments on overall flow. We removed lines 139-143 and lengthened the Figure 2 caption (and associated supplementary figure captions) to contain all necessary detail. We removed lines 208-211 and 214-215 and lengthened the captions for Figure 3, Figure 4, and associated supplementary figures. We removed a sentence from lines 303-304.  

      l. 37-39 such a short sentence on non-classical MHC is necessarily an oversimplification, I suggest it be expanded or deleted.

      There is certainly a lot to say about each of these genes! While we do not have space in this paper’s introduction to get into these genes’ myriad functions, we added a reference to our companion paper in lines 40-41:

      “See the appendices of our companion paper \citep{Fortier2024a} for more detail.”

      These appendices are extensive, and readers can find details and references for literature on each specific gene there. In addition, several genes are mentioned in analyses further on in the results, and their specific functions are discussed in more detail when they arise.

      l. 47 -49 It would be helpful to briefly outline your criteria for selecting these 17 genes, even if this is repeated later.

      Thank you! For greater clarity, we changed the text (lines 50-52) from “Here, we look within 17 specific genes to characterize trans-species polymorphism, a phenomenon characteristic of long-term balancing selection.” to “Here, we look within 17 specific genes---representing classical, non-classical, Class I, and Class II---to characterize trans-species polymorphism, a phenomenon characteristic of long-term balancing selection.”

      l.85-87 I may be completely wrong, but couldn't problems with establishing orthology in some cases lead to false inferences of TSP, even in primates? Or do you think the data are of sufficient quality to ignore such a possibility? (you touch on this in pp. 261-264)

      Yes, problems with establishing orthology can lead to false inferences of TSP, and it has happened before. For example, older studies that used only exon 2 (binding-site-encoding) of the MHC-DRB genes inferred trees that grouped NWM sequences with ape and OWM sequences. Thus, they named these NWM genes MHC-DRB3 and -DRB5 to suggest orthology with ape/OWM MHC-DRB3 and -DRB5, and they also suggested possible TSP between the groups. However, later studies that used non-binding-site-encoding exons or introns noticed that these NWM sequences did not group with ape/OWM sequences (which now shared the same name), providing evidence against orthology. This illustrates that establishing orthology is critical before assessing TSP (as is comparing across regions). This is part of the reason we published a companion paper (https://doi.org/10.7554/eLife.103545.1), which clears up questions of orthology and supports the analyses we did in this paper. In cases where orthology was ambiguous, this also helped us to be conservative in our conclusions here. The problems with ambiguous gene assignment are also discussed in lines 488-499.

      l. 88-93 is the first place (others are pp. 109-118 and 460-484) where a fuller description of the data used would be welcome. It's clear that the amount of data from different species varies enormously, not only in the number of alleles per locus, but also in the loci for which polymorphism data are available. In such a synthesis study, one would expect at least a tabulation of the data used in the appendices and perhaps a summary table in the main article.

      l. 109-118 Again, a more quantitative summary of the data used, with reference to a table, would be useful.

      Thank you! To address these comments, we added Tables 2-4 to allow readers to more readily understand the data we included in each group. We refer to these tables in the introduction (line 95), in the “Data” section of the results (lines 128-129), and the “Data” section of the methods (lines 532-534). Supplementary Files listing the exact alleles and sequences used in each group are also included in the resubmission.

      l. 123-124 here you say that the definition of the "16 gene groups" is in the methods (probably pp. 471-484), but it would be useful to present an informative summary of your rationale in the introduction or here

      Thank you! We agree that it is helpful to outline these groups earlier. We have changed the paragraph in lines 123-135 from: 

      “We considered 16 gene groups and two or three different genic regions for each group: exon 2 alone, exon 3 alone, and/or exon 4 alone. Exons 2 and 3 encode the peptide-binding region (PBR) for the Class I proteins, and exon 2 alone encodes the PBR for the Class II proteins. For the Class I genes, we also considered exon 4 alone because it is comparable in size to exons 2 and 3 and provides a good contrast to the PBR-encoding exons. See the Methods for more detail on how gene groups were defined. Because few intron sequences were available for non-human species, we did not include them in our analyses.”

      To:

      “We considered 16 gene groups spanning MHC classes and functions. These include the classical Class I genes (MHC-A-related, MHC-B-related, MHC-C-related), non-classical Class I genes (MHC-E-related, MHC-F-related, MHC-G-related), classical Class IIA genes (MHC-DRA-related, MHC-DQA-related, MHC-DPA-related), classical Class IIB genes (MHC-DRB-related, MHC-DQB-related, MHC-DPB-related), non-classical Class IIA genes (MHC-DMA-related, MHC-DOA-related), and non-classical Class IIB genes (MHC-DMB-related, MHC-DOB-related). We studied two or three different genic regions for each group: exon 2 alone, exon 3 alone, and (for Class I) exon 4 alone. Exons 2 and 3 encode the peptide-binding region (PBR) for the Class I proteins, and exon 2 alone encodes the PBR for the Class II proteins. For the Class I genes, we also considered exon 4 alone because it is comparable in size to exons 2 and 3 and provides a good contrast to the PBR-encoding exons. Because few intron sequences were available for non-human species, we did not include them in our analyses.”

      l. 100 "alleles" -> "allelic lineages"

      Thank you for catching this. We have changed this language in line 104.

      l. 227-238 it's important to discuss the possible effect of the number of sequences available on the detectability of TSP - this is particularly important as the properties of MHC genealogies may differ considerably from those expected for neutral genealogies.

      This is a good point that may not be obvious to readers. We have added several sentences to clarify this:

      Line 193-194: “In a neutral genealogy, monophyly of each species' sequences is expected.”

      Line 213-219: “Note that the number of sequences available for comparison also affects the detectability of TSP. For example, if the only sequences available are from the same allelic lineage, they will coalesce more recently in the past than they would with alleles from a different lineage and would not show evidence for TSP. This means our method is well-suited to detect TSP when a diverse set of allele sequences are available, but it is conservative when there are few alleles to test. There were few available alleles for some non-classical genes, such as MHC-F, and some species, such as gibbon.”

      Line 244-246: “However, since there are fewer alleles available for the non-classical genes, we note that our method is likely to be conservative here.”

      l. 301 and 624-41 it's been difficult for me to understand the rationale behind using rates at mostly gap positions as the baseline and I'd be grateful for a more extensive explanation

      Normalizing the rates posed a difficult problem. We couldn’t include every single sequence in the same alignment because BEAST’s computational needs scale with the number of sequences. Therefore, we had to run BEAST separately on smaller alignments focused on a single group of genes at a time. We still wanted to be able to compare evolutionary rates across genes, but because of the way SubstBMA is implemented, evolutionary rates are relative, not absolute. Recall that to help us compare the trees, we included a common set of “backbone” sequences in all of the 16 alignments. This set included some highly-diverged genes. Initially, we planned to use 4-fold degenerate sites as the baseline sites for normalization, but there simply weren’t enough of them once we included the “backbone” set on top of the already highly diverse set of sequences in each alignment. This diversity presented an opportunity.  In BEAST, gaps are treated as missing and do not contribute any probability to the relevant branch or site (https://groups.google.com/g/beast-users/c/ixrGUA1p4OM/m/P4R2fCDWMUoJ?pli=1). So, we figured that sites that were “mostly gap” (a gap in all the human backbone sequences but with an insertion in some sequence) were mostly not contributing to the inference of the phylogeny or evolutionary rates. Because the “backbone” sequences are common to all alignments, making the “mostly gap” sites somewhat comparable across sets while not affecting inferred rates, we figured they would be a reasonable choice for the normalization (for lack of a better option).

      We added text to lines 680 and 691-693 to clarify this rationale.
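
      The normalization described above can be sketched as follows. This is a toy illustration under stated assumptions, not the code used for the paper: the alignment, sequence names, and rates are invented, the helper names `mostly_gap_columns` and `normalize_rates` are hypothetical, and taking the mean rate over the "mostly gap" columns as the baseline is an assumption made for the example.

```python
from statistics import mean


def mostly_gap_columns(alignment, backbone_names, gap="-"):
    """Return columns that are a gap in every backbone sequence
    but carry a residue in at least one non-backbone sequence."""
    backbone = set(backbone_names)
    n_cols = len(next(iter(alignment.values())))
    columns = []
    for i in range(n_cols):
        backbone_all_gap = all(alignment[name][i] == gap for name in backbone)
        other_has_residue = any(
            seq[i] != gap for name, seq in alignment.items() if name not in backbone
        )
        if backbone_all_gap and other_has_residue:
            columns.append(i)
    return columns


def normalize_rates(rates, baseline_columns):
    """Divide every per-site rate by the mean rate over the baseline columns."""
    baseline = mean([rates[i] for i in baseline_columns])
    return [r / baseline for r in rates]


# Toy alignment: column 2 is a gap in both backbone sequences,
# with an insertion in one other sequence.
alignment = {
    "backbone_1": "AC-GT",
    "backbone_2": "AC-GA",
    "other_1": "ACTGT",
    "other_2": "AC-GT",
}
relative_rates = [0.5, 1.0, 0.25, 2.0, 1.0]

baseline_cols = mostly_gap_columns(alignment, ["backbone_1", "backbone_2"])
normalized = normalize_rates(relative_rates, baseline_cols)
```

      Because the same backbone sequences appear in every alignment, the baseline columns are defined in the same way for each gene group, which is what makes the normalized rates roughly comparable across separate runs.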

      l. 380-84 this overview seems rather superficial. Would it be possible to provide a more quantitative summary?

      To make this more quantitative, we plotted the number of associations for each amino acid against evolutionary rate, shown in Figure 6 - Figure Supplement 7 (NOTE: this needs to be renamed as Table 1 - Figure Supplement 1, which the template does not allow). This reveals a significant positive slope for the Class I genes, but not for Class II. We also added explanatory text for this figure in lines 400-404.

      Discussion - your approach to detecting TSP is elegant but deserves discussion of its limitations and, in particular, a clear explanation of why detecting TSP rather than quantifying its extent is more important in the context of this work. Another important point for discussion is alternative explanations for the patterns of TSP or, more broadly, gene tree - species tree discordance. Although long-term maintenance of allelic lineages due to long-term balancing selection is probably the most convincing explanation for the observed TSP, interspecific introgression and incorrect orthology assessment may also have contributed, and it would be good to see what the authors think about the potential contribution of these two factors.

      Overall, our goal was to use modern statistical methods and data to more confidently assess how ancient the TSP is at each gene. We have added several lines of text (as noted elsewhere in this document) to more clearly illustrate the limitations of our approach. We also agree that interspecific introgression and incorrect orthology assessment can cause similar patterns to arise. We attempted to minimize the effect of incorrect orthology assessment by creating multi-gene trees and exploring reference primate genomes, as described in our companion paper (https://doi.org/10.7554/eLife.103545.1), but cannot eliminate it completely. We have added a paragraph to the discussion to address this (lines 488-499). Interspecific introgression could also cause gene tree-species tree discordance, but we are not sure about how systematic this would have to be to cause the overall patterns we observe, nor about how likely it would have been for various clades of primates across the world.

      l. 421 -424 A more nuanced discussion distinguishing between positive selection, which facilitates the establishment of a mutation, and directional selection, which leads to its fixation, would be useful here.

      We added clarification to this sentence (lines 443-445), from “Indeed, within the phylogeny we find that the most rapidly-evolving codons are substituted at around 2--4-fold the baseline rate.” to “Indeed, within the phylogeny we find that the most rapidly-evolving codons are substituted at around 2--4-fold the baseline rate, generating ample mutations upon which selection may act.”

      l. 432-434 You write here about the shaping of TCR repertoires, but I couldn't find any such information in the paper, including Table 1.

      We did not include a separate column for these, so they can be hard to spot. They take the form of “TCR 𝛽 Interaction Probability >50%”, “TCR Expression (TRAV38-1)”, or “TCR 𝛼 Interaction Probability >50%” and can be found in Table 1.

      l. 436-442 Here a more detailed discussion in the context of divergent allelic advantage and even the evolution of new S-type specificities in plants would be valuable.

      We added an additional citation to a review article to this sentence (lines 438-439).  

      l. 443 The use of the word "training" here is confusing, suggesting some kind of "education" during the lifetime of the animal.

      We agree that “train” is not an entirely appropriate term, and have changed it to “evolve” (line 465).

      l. 489-491 What data were used for these calculations?

      Apologies for missing this citation! We used the 1000 Genomes Project data, and the citation has been updated (lines 541-542).

    1. eLife Assessment

      This study reports valuable findings on the role of Layilin in the motility and suppressive capacity of clonally expanded regulatory T cells (Tregs) in the skin. Although a strength of the study is its use of conditional knock-out mice and human skin samples, the analysis of the molecular mechanism by which Layilin affects Treg function is incomplete. The study will be of interest to medical scientists working on skin immunology.

    2. Reviewer #1 (Public review):

      Summary and Strengths:

      This work shows that the gene encoding Layilin is expressed preferentially in human skin Tregs, and that the fraction of Tregs expressing Layilin may overexpress genes related to T cell activation and adhesion. Expression of Layilin on Tregs would have no impact on activation markers or in vitro suppressive function. However, activation of Layilin either with a cross-linking antibody or collagen IV, its natural ligand, would promote cell adhesion via LFA1 activation. The in vivo functional role of Layilin in Tregs is studied in a conditional KO mouse model in a model of skin inflammation. Deletion of Layilin in Tregs led to an attenuation of the disease score and a reduction in the cutaneous lymphocyte infiltrate. This work is clearly innovative, but a number of major points limit its interest.

      Weakness and major points:

      (1) The number of panels and figures suggests that this story is quite complete but several data presented in the main figures do not provide essential information for a proper understanding of Layilin's role in Tregs.

      Figures 1I, 1J, and the whole of Figure 2 could be placed as supplementary figures. Also, for Figure 3E, it would be preferable to show the percentage of cells expressing cytokines rather than their absolute numbers. In fact, the drop in the numbers of cytokine-producing cells is probably due solely to the drop in total cell numbers and not to a decrease in the proportion of cells expressing cytokines. If this is the case, these data should be shown in supplementary figures. Finally, Figures 4 and 5 could be merged.

      (2) Some important data are not shown or not mentioned.

      (a) It would be important to show the proportion of Treg, Tconv, and CD8 expressing Layilin in healthy skin and in patients developing psoriasis, as well as in the blood of healthy subjects.

      (b) We lack information to be convinced that there is enrichment for migration and adhesion genes in Layilin+ Tregs in the GSEA data. The authors should indicate what geneset libraries they used. Indeed, it is tempting to show only the genesets that give results in line with the message you want to get across. If these genesets come from public banks, the bank used should be indicated, and the results of all gene sets shown in an unbiased way. In addition, it should be indicated whether the analyses were performed on untransformed or pseudobulk scRNAseq data analyses. Finally, it would be preferable to confirm the GSEA data with z-score analyses, as Ingenuity does, for example. Indeed, in GSEA-type analyses, there are genes that have activating but also inhibiting effects on a pathway in a given gene set.

      (c) For all FACS data, the raw data should be shown as histograms or dot plots for representative samples.

      (d) For Figure 5B, the number of samples analyzed is insufficient to draw clear conclusions.

      (3) For Figs. 4 and 5, the design of the experiment poses a problem. Indeed, the comparison between Layn+ and Layn- cells may, in part, not be directly linked to the expression or absence of expression of this protein. Indeed, Layn+ and Layn- Tregs may constitute populations with different biological properties, beyond the expression of Layn. However, in the experiment design used here, a significant fraction of the sorted Layn- Tregs will be cells belonging to the population that has never expressed this protein. It would have been preferable to sort first the Layn+ Tregs, then knock down this protein and re-sort the Layn- Tregs and Layn+ Tregs. If this experiment is too cumbersome to perform, I agree that the authors should not do it. However, it would be important to mention the point I have just made in the text.

    3. Reviewer #2 (Public review):

      Summary:

      In their manuscript, Gouirand et al. report on the role of Layilin expression for the motility and suppressive capacity of regulatory T cells (Tregs). In previous studies, the authors had already demonstrated that Layilin is expressed on Tregs, that it acts as a negative regulator of their suppressive capacity, that it functions to anchor Tregs in non-lymphoid tissues, and that it enhances the adhesive properties of Layilin-expressing cells by co-localization with the integrin αLβ2 (LFA-1). Building on these published data, the authors now show that Layilin is highly expressed on a subset of clonally expanded effector Tregs in both healthy and psoriatic skin and that deletion of Layilin in Tregs in vivo resulted in significantly attenuated skin inflammation. Furthermore, the authors addressed the molecular mechanism by which Layilin affects the suppressive capacity of Tregs and showed that Layilin increased Treg adhesion via modulation of LFA-1, resulting in distinct cytoskeletal changes.

      Strengths:

      Certainly, the strength of this study lies in the combination of data from mouse and human models.

      Weaknesses:

      Some of the conclusions drawn by the authors must be treated with caution, as the experimental conditions were not always appropriate, leading to a risk of misinterpretation.

    4. Reviewer #3 (Public review):

      Summary:

      Gouirand et al. explore the function of Layilin on Tregs in the context of psoriasis using both patient samples and a conditional mutant mouse model. They perform functional analysis in the patient samples using Cas9-mediated deletion. The authors suggest that Layilin works in concert with integrins to bind collagen IV to attenuate cell movement.

      The work is well done and built on solid human data. The report is a modest advance from the authors' previous report in 2021 that focused on tumor responses, with this report focusing on psoriasis. There are some experimental concerns that should be considered.

      Strengths:

      (1) Good complementation of patient and animal model data.

      (2) Solid experimentation using state-of-the-art approaches.

      (3) There is clearly a biological effect of LAYN deficiency in the mouse model.

      (4) The report adds some new information to what was already known from the previous reports.

      Weaknesses:

      (1) It is not clear that the assays used for functional analysis of the patient samples were optimal.

      (2) Several conclusions are not fully substantiated.

      (3) The report is lacking some experimental details.

    5. Author response:

      Reviewer 1:

      Concern 1: Figures 1I, 1J, and the whole of Figure 2 could be placed as supplementary figures. Also, for Figure 3E, it would be preferable to show the percentage of cells expressing cytokines rather than their absolute numbers. In fact, the drop in the numbers of cytokine-producing cells is probably due solely to the drop in total cell numbers and not to a decrease in the proportion of cells expressing cytokines. If this is the case, these data should be shown in supplementary figures. Finally, Figures 4 and 5 could be merged.

      We thank you for your recommendations. As rearranging figures is not critical to convey the data, we have decided to keep the figures and supplemental figures as they are currently presented.

      Concern 2a: It would be important to show the proportion of Treg, Tconv, and CD8 expressing Layilin in healthy skin and in patients developing psoriasis, as well as in the blood of healthy subjects.

      This data is published in a previous manuscript from our group. Please see Figure 1 in “Layilin Anchors Regulatory T Cells in Skin” (PMID: 34470859)

      Concern 2b: We lack information to be convinced that there is enrichment for migration and adhesion genes in Layilin+ Tregs in the GSEA data. The authors should indicate what geneset libraries they used. Indeed, it is tempting to show only the genesets that give results in line with the message you want to get across. If these genesets come from public banks, the bank used should be indicated, and the results of all gene sets shown in an unbiased way. In addition, it should be indicated whether the analyses were performed on untransformed or pseudobulk scRNAseq data analyses. Finally, it would be preferable to confirm the GSEA data with z-score analyses, as Ingenuity does, for example. Indeed, in GSEA-type analyses, there are genes that have activating but also inhibiting effects on a pathway in a given gene set.

      Given that we have already shown that layilin plays a major role in Treg and CD8+ T cell adhesion in tissues, we used a candidate approach for our GSEA. We tested the hypothesis that adhesion and motility pathways are enriched in Layilin-expressing Tregs. There was a statistically significant enrichment for these genes in Layilin+ Tregs compared to Layilin- Tregs, which we feel adequately tests our hypothesis.

      Concern 2c: For all FACS data, the raw data should be shown as histograms or dot plots for representative samples.

      We respect this concern. We have omitted these because of space constraints.

      Concern 2d: For Figure 5B, the number of samples analyzed is insufficient to draw clear conclusions.

      We respectfully disagree. Three donors were used in a paired fashion (internally controlled), achieving statistical significance.

      Concern 3: For Figs. 4 and 5, the design of the experiment poses a problem. Indeed, the comparison between Layn+ and Layn- cells may, in part, not be directly linked to the expression or absence of expression of this protein. Indeed, Layn+ and Layn- Tregs may constitute populations with different biological properties, beyond the expression of Layn. However, in the experiment design used here, a significant fraction of the sorted Layn- Tregs will be cells belonging to the population that has never expressed this protein. It would have been preferable to sort first the Layn+ Tregs, then knock down this protein and re-sort the Layn- Tregs and Layn+ Tregs. If this experiment is too cumbersome to perform, I agree that the authors should not do it. However, it would be important to mention the point I have just made in the text.

      We agree. However, as the reviewer points out, these experiments are not logistically and practically feasible at this point. We do perform several experiments in this manuscript in which layilin is reduced via gene editing with results supporting our hypotheses.

      Reviewer 2:

      Some of the conclusions drawn by the authors must be treated with caution, as the experimental conditions were not always appropriate, leading to a risk of misinterpretation.

      We have been transparent with all our methods and data. We will leave it to the reader to determine the level of rigor and the robustness of the data.

      Reviewer 3:

      Weaknesses:

      (1) It is not clear that the assays used for functional analysis of the patient samples were optimal. (2) Several conclusions are not fully substantiated. (3) The report is lacking some experimental details.

      We have tried to be as comprehensive and thorough as possible. We feel that the data supports our conclusions. We will leave this to the reader to interpret and conclude.

    1. eLife Assessment

      This revised study describes an important new model for in vivo manipulation of microglia, exploring how mutations in the Adar1 gene within microglia contribute to Aicardi-Goutières Syndrome. The methodology is validated with exceptional data, supporting the authors' conclusions. The paper underscores both the advantages and limitations of using transplanted cells as a surrogate for microglia, making it a resource that is of value for biologists studying macrophages and microglia.

    2. Reviewer #1 (Public review):

      Summary:

      Aicardi-Goutières Syndrome (AGS) is a genetic disorder that primarily affects the brain and immune system through excessive interferon production. The authors sought to investigate the role of microglia in AGS by first developing bone-marrow-derived progenitors in vitro that carry the estrogen-regulated (ER) Hoxb8 cassette, allowing them to expand indefinitely in the presence of estrogen and differentiate into macrophages when estrogen is removed. When injected into the brains of Csf1r-/- mice, which lack microglia, these cells engraft and resemble wild-type (WT) microglia in transcriptional and morphological characteristics, although they lack Sall1 expression. The authors then generated CRISPR-Cas9 Adar1 knockout (KO) ER-Hoxb8 macrophages, which exhibited increased production of inflammatory cytokines and upregulation of interferon-related genes. This phenotype could be rescued using a Jak-Stat inhibitor or by concurrently mutating Ifih1 (Mda5). However, these Adar1-KO macrophages fail to successfully engraft in the brain of both Csf1r-/- and Cx3cr1-creERT2:Csf1rfl/fl mice. To overcome this, the authors used a mouse model with a patient-specific Adar1 mutation (Adar1 D1113H) to derive ER-Hoxb8 bone marrow progenitors and macrophages. They discovered that Adar1 D1113H ER-Hoxb8 macrophages successfully engraft the brain, although at lower levels than WT-derived ER-Hoxb8 macrophages, leading to increased production of Isg15 by neighboring cells. These findings shed new light on the role of microglia in AGS pathology.

      Strengths:

      The authors convincingly demonstrate that ER-Hoxb8 differentiated macrophages are transcriptionally and morphologically similar to bone marrow-derived macrophages. They also show evidence that when engrafted in vivo, ER-Hoxb8 microglia are transcriptomically similar to WT microglia. Furthermore, ER-Hoxb8 macrophages engraft the Csf1r-/- brain with high efficiency and rapidly (2 weeks), showing a homogenous distribution. The authors also effectively use CRISPR-Cas9 to knock out TLR4 in these cells with little to no effect on their engraftment in vivo, confirming their potential as a model for genetic manipulation and in vivo microglia replacement.

      Overall, this paper demonstrates an innovative approach to manipulating microglia using ER-Hoxb8 cells as surrogates. The authors present convincing evidence of the model's efficacy and potential for broader application in microglial research, given its ease of production and rapid brain engraftment potential in microglia-deficient mice. Using mouse-derived cells for transplantation reduces complications that can come with the use of human cell lines, highlighting the utility of this system for research in mouse models.

    3. Reviewer #2 (Public review):

      Summary:

      Microglia have been implicated in brain development, homeostasis, and diseases. "Microglia replacement" has gained traction in recent years, using primary microglia, bone marrow or blood-derived myeloid cells, or human iPSC-induced microglia. Here, the authors extended their previous work in the area and provide evidence to support: (1) Estrogen-regulated (ER) homeobox B8 (Hoxb8) conditionally immortalized macrophages from bone marrow can serve as stable, genetically manipulated cell lines. These cells are highly comparable to primary bone marrow-derived (BMD) macrophages in vitro, and, when transplanted into a microglia-free brain, engraft the parenchyma and differentiate into microglia-like cells (MLCs). Taking advantage of this model system, the authors created stable, Adar1-mutated ER-Hoxb8 lines using CRISPR-Cas9 to study the intrinsic contribution of macrophages to Aicardi-Goutières Syndrome (AGS) disease mechanism.

      Strengths:

      The studies are carefully designed and well-conducted. The imaging data and gene expression analysis are carried out at a high level of technical competence and the studies provide strong evidence that ER-Hoxb8 immortalized macrophages from bone marrow are a reasonable source for "microglia replacement" exercise. The findings are clearly presented, and the main message will be of general interest to the neuroscience and microglia communities.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Aicardi-Goutières Syndrome (AGS) is a genetic disorder that primarily affects the brain and immune system through excessive interferon production. The authors sought to investigate the role of microglia in AGS by first developing bone-marrow-derived progenitors in vitro that carry the estrogen-regulated (ER) Hoxb8 cassette, allowing them to expand indefinitely in the presence of estrogen and differentiate into macrophages when estrogen is removed. When injected into the brains of Csf1r-/- mice, which lack microglia, these cells engraft and resemble wild-type (WT) microglia in transcriptional and morphological characteristics, although they lack Sall1 expression. The authors then generated CRISPR-Cas9 Adar1 knockout (KO) ER-Hoxb8 macrophages, which exhibited increased production of inflammatory cytokines and upregulation of interferon-related genes. This phenotype could be rescued using a Jak-Stat inhibitor or by concurrently mutating Ifih1 (Mda5). However, these Adar1-KO macrophages fail to successfully engraft in the brain of both Csf1r-/- and Cx3cr1-creERT2:Csf1rfl/fl mice. To overcome this, the authors used a mouse model with a patient-specific Adar1 mutation (Adar1 D1113H) to derive ER-Hoxb8 bone marrow progenitors and macrophages. They discovered that Adar1 D1113H ER-Hoxb8 macrophages successfully engraft the brain, although at lower levels than WT-derived ER-Hoxb8 macrophages, leading to increased production of Isg15 by neighboring cells. These findings shed new light on the role of microglia in AGS pathology.

      Strengths:

      The authors convincingly demonstrate that ER-Hoxb8 differentiated macrophages are transcriptionally and morphologically similar to bone marrow-derived macrophages. They also show evidence that when engrafted in vivo, ER-Hoxb8 microglia are transcriptomically similar to WT microglia. Furthermore, ER-Hoxb8 macrophages engraft the Csf1r-/- brain with high efficiency and rapidly (2 weeks), showing a homogenous distribution. The authors also effectively use CRISPR-Cas9 to knock out TLR4 in these cells with little to no effect on their engraftment in vivo, confirming their potential as a model for genetic manipulation and in vivo microglia replacement.

      Weaknesses:

      The robust data showing the quality of this model at the transcriptomic level can be strengthened with confirmation at protein and functional levels. The authors were unable to investigate the effects of Adar1-KO using ER-Hoxb8 cells and instead had to rely on a mouse model with a patient-specific Adar1 mutation (Adar1 D1113H). Additionally, ER-Hoxb8-derived microglia do not express Sall1, a key marker of microglia, which limits their fidelity as a full microglial replacement, as has been rightfully pointed out in the discussion.

      Overall, this paper demonstrates an innovative approach to manipulating microglia using ER-Hoxb8 cells as surrogates. The authors present convincing evidence of the model's efficacy and potential for broader application in microglial research, given its ease of production and rapid brain engraftment potential in microglia-deficient mice. While Adar1-KO macrophages do not engraft well, the success of TLR4-KO line highlights the model's potential for investigating other genes. Using mouse-derived cells for transplantation reduces complications that can come with the use of human cell lines, highlighting the utility of this system for research in mouse models.

      Thank you for this thoughtful and balanced assessment. The major suggestion from Reviewer 1 was that confirmation of RNAseq data with protein or functional studies would add strength.  We provided protein staining by IHC for IBA1 in vivo, as well as protein staining by FACS for CD11B, CD45, and TMEM119 in vitro and in vivo.  For TLR4, we showed successful protein KO and blunted response to LPS (a TLR4 ligand) challenge, which we believe provides some protein and functional data to support the approach.  To bolster these data, we added staining for P2RY12 on brain-engrafted ER-Hoxb8s.

      Regarding the non-engraftment phenotype of Adar1 KO cells: because ADAR1 KO mice die embryonically from hematopoietic failure, we see the health impacts of Adar1 KO on ER-Hoxb8s as a strength of the transplantation model, enabling the assessment of ADAR1 global function in macrophages and microglia-like cells without generation of a transgenic mouse line. In addition, it was a surprise that the health impact occurs at the macrophage and not the progenitor stage, perhaps providing insight for future studies of ADAR1’s role in hematopoiesis. We were able to show a significant impact of complete loss of Adar1 on survival and engraftment, suggesting an important biological function of ADAR1. The macrophage-specific D1113H mutation, which affects part of the deaminase domain, shows that disrupting the RNA deamination (but not the RNA binding) function of ADAR1 produces brain-wide interferonopathy. This is very exciting to our group and hopefully the community, as astrocytes are thought to be a major driver of brain interferonopathy in patients with ADAR1 mutations. Our result suggests that disruption of brain macrophages is also a major contributor.

      Reviewer #2 (Public review):

      Summary:

      Microglia have been implicated in brain development, homeostasis, and diseases. "Microglia replacement" has gained traction in recent years, using primary microglia, bone marrow or blood-derived myeloid cells, or human iPSC-induced microglia. Here, the authors extended their previous work in the area and provided evidence to support: (1) Estrogen-regulated (ER) homeobox B8 (Hoxb8) conditionally immortalized macrophages from bone marrow can serve as stable, genetically manipulated cell lines. These cells are highly comparable to primary bone marrow-derived (BMD) macrophages in vitro, and, when transplanted into a microglia-free brain, engraft the parenchyma and differentiate into microglia-like cells (MLCs). Taking advantage of this model system, the authors created stable, Adar1-mutated ER-Hoxb8 lines using CRISPR-Cas9 to study the intrinsic contribution of macrophages to the Aicardi-Goutières Syndrome (AGS) disease mechanism.

      Strengths:

      The studies are carefully designed and well-conducted. The imaging data and gene expression analysis are carried out at a high level of technical competence and the studies provide strong evidence that ER-Hoxb8 immortalized macrophages from bone marrow are a reasonable source for "microglia replacement" exercise. The findings are clearly presented, and the main message will be of general interest to the neuroscience and microglia communities.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      This is an elegant study, demonstrating both the utility and limitations of ER-Hoxb8 technology as a surrogate model for microglia in vivo. The manuscript is well-designed and clearly written, but authors should consider the following suggestions:

      (1) Validation of RNA hits at the protein level: To strengthen the comparison between ER-Hoxb8 macrophages and WT bone marrow-derived macrophages, validating several RNA hits at the protein level would be beneficial. As many of these hits are surface markers, flow cytometry could be employed for confirmation (e.g., Figure 1D, Figure 3E).

      In vitro, we show protein levels by flow cytometry for CD11B (ITGAM) and CD45 (PTPRC; Figure 1C), as well as TMEM119 (Supplemental Figure 2A) and TLR4 (Supplemental Figure 3C/D). In vivo, we show TMEM119 protein levels by flow cytometry (Figure 3A), as well as their CD11B/CD45 pregates (Supplemental Figure 2C), plus immunostaining for IBA1 (AIF1; Figure 2D). We now provide additional data showing P2RY12 immunostaining in brain-engrafted cells (Supplemental Figure 2B). 

      (2) The authors should consider testing the phagocytic capacity of ER-Hoxb8-derived macrophages to further validate their functionality.

      Thank you for the suggestion. We measured ER-Hoxb8 macrophage ability to engulf phosphatidylserine-coated beads that mimic apoptotic cells, compared with phosphatidylcholine-coated beads, now as new Supplemental Figure 1C/D. This agrees with existing literature showing efficient engulfment/phagocytosis by ER-Hoxb8-derived cells (Elhag et al., 2021).

      (3) For Figure 3E, incorporating a wild-type (WT) microglia reference would be beneficial to establish a baseline for comparison (e.g. including WT microglia data in the graph or performing a ratio analysis against WT expression levels).

      We agree - we now include bars representing our sequenced primary microglia data in Figure 3E as a comparison.  

      (4) Some statistical analyses may require refinement. Specifically, for Figure 4J, where the effects of Adar1 KO and Adar1 KO with Bari are compared, it would be more appropriate to use a two-way ANOVA.

      Thank you for noting this. We have now performed the more appropriate two-way ANOVA and included the updated results in Figure 4J and the corresponding Supplemental Figure 4G. Errors in the figure legend texts have also been corrected to reflect the statistical tests used.

      (5) Cx3cr1-creERT2 pups injected with tamoxifen: The authors could clarify the depletion ratio in these experiments before the engraftment and assess whether the depletion is global or regional. In comparison to Csf1r-/-, where TLR4-KO ER-Hoxb8 engraft globally, in Cx3cr1-creERT2, the engraftment seems more regional (Figure 5A vs Supplementary Figure 5B); is this due to the differences in depletion efficiency?

      This is an excellent question and observation, and one that we are very interested in, though that finding does not change the conclusions of this particular study. We find some region-specific differences in depletion early after tamoxifen injection, but all brain regions are >95% depleted by P7. For instance, in a recently published manuscript (Bastos et al., 2025) we find some differences in the depletion kinetics in the genetic model. By P3, we find 90% depletion in cortex with 50-60% in thalamus and hippocampus. In other studies, we typically deliver primary monocytes, and this is the first study where we report engraftment of ER-Hoxb8 cells in the inducible model. In this sense, it is possible that depletion kinetics may regionally affect engraftment, but future studies are required to more finely assess this point with ER-Hoxb8s, as it may change how these models are used in the future.

      Bastos et al., Monocytes can efficiently replace all brain macrophages and fetal liver monocytes can generate bonafide SALL1+ microglia, Immunity (2025), https://doi.org/10.1016/j.immuni.2025.04.006

      (6) It would be helpful for the authors to clarify whether Adar1 is predominantly expressed by microglia, especially since the study aims to show its role in dampening the interferon response.

      That’s a wonderful point. Adar1 is expressed by all brain cells, with highest transcript level in some neurons, astrocytes, and oligodendrocytes. It is an interferon-stimulated gene, and mutation itself leads to interferonopathy, we believe, due to poor RNA editing and detection of endogenous RNA as non-self by MDA5. We hope it can dampen the interferon response, but in the case of mutation, Adar1 is probably causal of interferonopathy.  It is induced in microglia upon systemic inflammatory challenge (LPS). We have edited the text to highlight its expression pattern.  See BrainRNAseq.org (Zhang*, Chen*, Sloan*, et al., 2014 and Bennett et al., 2016)

      Reviewer #2 (Recommendations for the authors):

      (1) There appears to be a morphological difference between wt and Adar1/Ifih1 double KO (dKO) cells in the engrafted brains (Figure 5). It would be good if the authors could systematically compare the morphology (e.g., soma size, number, and length of branches) of the engrafted MLCs between the wt and mutant cells.

      We agree. While cells did not differ in branch number or length, engrafted dKO cells had significantly larger somas compared with controls, which we now present in Figure S5A.

      (2) To fully appreciate the extent to which those engrafted ER-Hoxb8 immortalized macrophages resemble primary, engrafted yolk sac-myeloid cells, vs engrafted iPSC-induced microglia, it would be informative to provide a comparison of the RNAseq data derived from the engrafted ER-Hoxb8 immortalized macrophages with published transcriptomic data sets (e.g. Bennett et al. Neuron 2018; Chadarevian et al. Neuron 2024; Schafer et al. Cell 2023).

      Thank you for this suggestion. To address this, we provide our full dataset for additional experiments. To compare with a similar non-immortalized model, we compared top up- and down-regulated genes from our data to those of ICT yolk sac progenitor cells from our previous work (Bennett et al., 2018). We find overlap between brain-engrafted ER-Hoxb8-, bone marrow-, and yolk sac-derived cells (Supplemental Figure 2F, Supplemental Table 3).  

      Minor comments:

      Figure 6C: the red arrows indicating the zoomed-in regions do not match between conditions. It might be beneficial to provide bigger images with each channel for C and D as a Supplemental Figure.

      We fixed this in Figure 6C to show areas of interest in the cortex for both conditions. Figure S7A shows intermediate power images to aid in interpretation.

    1. eLife Assessment

      This valuable work proposes a novel, rapid S. aureus entry mechanism via Ca²⁺-dependent lysosomal exocytosis and acid sphingomyelinase release, which influences bacterial sub-cellular fate. However, reliance on chemical inhibitors and the absence of a knockout phenotype weakens the overall impact, making the study incomplete.

    2. Reviewer #2 (Public review):

      In the manuscript, Ruhling et al. propose a rapid uptake pathway that is dependent on lysosomal exocytosis, lysosomal Ca2+ and acid sphingomyelinase, and further suggest that the intracellular trafficking and fate of the pathogen are dictated by the mode of entry. Overall, this manuscript argues for an important mechanism of a 'rapid' cellular entry pathway of S. aureus that is dependent on lysosomal exocytosis and acid sphingomyelinase and links the intracellular fate of the bacterium, including phagosomal dynamics, cytosolic replication and host cell death, to different modes of uptake.

      A key strength is the nature of the idea proposed, while continued reliance on inhibitor treatment combined with the lack of a phenotype for the genetic knockout is a major weakness. While the authors argue a role for undetectable nano-scale Cer platforms on the cell surface caused by ASM activity, results do not rule out a SM independent role in the cellular uptake phenotype of ASM inhibitors.

      The authors have attempted to address many of the points raised in the previous revision. While the new data presented provide partial evidence, the reliance on chemical inhibitors and lack of clear results directly documenting release of lysosomal Ca2+, or single bacterial tracking, or clear distinction between ASM dependent and independent processes dampen the enthusiasm.

      I acknowledge the authors' argument that different ASM inhibitors showing similar phenotypes across different assays points to a role for ASM, but the lack of phenotype in ASM KO cells is concerning. The authors' argument that altered lipid composition in ASM KO cells could be overcoming the ASM-mediated infection effects by other ASM-independent mechanisms is speculative, as they acknowledge, and moderates the importance of the ASM-dependent pathway. The SM accumulation in ASM KO cells does not distinguish between localized alterations within the cells. If this pathway can be compensated, how central is it likely to be?

      The authors allude to a lower phagosomal escape rate in ASM KO cells compared to inhibitor treatment, which appears to contradict the notion of the uptake and intracellular trafficking phenotypes being tightly linked. As they point out, these results might be hard to interpret. Could an inducible KD system recapitulate (some of) the phenotype of inhibitor treatment? If S. aureus does not escape the phagosome in macrophages, could it provide a system to potentially decouple the uptake and intracellular trafficking effects of ASM (or its inhibitor treatment)?

      The role of ASM on cell surface remains unclear. The hypothesis proposed by the authors that the localized generation of Cer on the surface by released ASM leads to generation of Cer-enriched platforms could be plausible, but is not backed by data, technical challenges to visualize these platforms notwithstanding. These results do not rule out possible SM independent effects of ASM on the cell surface, if indeed the role of ASM is confirmed by controlled genetic depletion studies.

      The reviewer acknowledges the technical challenges in directly visualizing lysosomal Ca2+ using the methods outlined. A genetically encoded lysosomal Ca2+ sensor such as Gcamp3-ML1 might provide better ways to directly visualize this during inhibitor treatment or S. aureus infection.

    1. eLife Assessment

      The authors modified a common method to induce epilepsy in mice to provide an improved approach to screen new drugs for epilepsy. This is important because of the need to develop new drugs for patients who are refractory to current medications. The authors' method evokes seizures to circumvent a low rate of spontaneous seizures and the approach was validated using two common anti-seizure medications. The strength of evidence was solid, making the study invaluable, but there were some limitations to the approach and methods.

    2. Reviewer #1 (Public review):

      Summary:

      This important study by Chen et al. describes a novel approach for optogenetically evoking seizures in an etiologically relevant mouse model of epilepsy. The authors developed a model that can trigger seizures "on demand" using optogenetic stimulation of CA1 principal cells in mice rendered epileptic by an intra-hippocampal kainate (IHK) injection into CA3. The authors discuss their model in the context of the limitations of current animal models used in epilepsy drug development. In particular, their model addresses concerns regarding existing models where testing typically involves inducing acute seizures in healthy animals or waiting on infrequent, spontaneous seizures in epileptic animals.

      Strengths:

      A strength of this manuscript is that this approach may facilitate the evaluation of novel therapeutics since these evoked seizures, despite having some features that were significantly different from spontaneous seizures, are suggested to be sufficiently similar to spontaneous seizures, which are more laborious to analyze. The data demonstrating the commonality of pharmacology and EEG features between evoked seizures and spontaneous seizures in epileptic mice, while also being different from evoked seizures in naïve mice, are convincing. The structural, functional, and behavioral differences between a seizure-naïve and epileptic mouse, which emerge due to the enduring changes occurring during epileptogenesis, are complex and important. Accordingly, this study highlights the importance of using mice that have undergone epileptogenesis as model organisms for testing novel therapeutics. Furthermore, this study positively impacts the wider epilepsy research community by investigating seizure semiology in these populations.

      Weaknesses:

      This study convincingly demonstrates that the feature space measurements for stimulus-evoked seizures in epileptic mice were significantly different from those in naïve mice; this result allows the authors to conclude that "seizures induced in chronically epileptic animals differed from those in naïve animals". However, the authors also conclude that "induced seizures resembled naturally occurring spontaneous seizures in epileptic animals" despite their own data demonstrating similar, albeit fewer, significant differences in feature space measurements. It is unclear if and what the threshold is whereby significant differences in these feature space measurements lead to the conclusion that the differences are meaningful, as in the comparison of epileptic and naïve mice, or not meaningful, as in the comparison of evoked and spontaneous seizures.

    3. Reviewer #2 (Public review):

      The authors aimed to develop an animal model of temporal lobe epilepsy (TLE) that will generate "on-demand" seizures and an improved platform to advance our ability to find new anti-seizure drugs (ASDs) for drug-resistant epilepsy (DRE). Unlike some of the work in this field, the authors are studying actual seizures, and hopefully events that are similar to actual epileptic seizures. To develop an optimized screening tool, however, one also needs high-throughput systems with actual seizures as a quantitative, rigorous, and reproducible outcome measure. The authors aim to provide such a model; however, this approach may be over-stated here and seems unlikely to address the critical issue of drug resistance, which is their most important claim.

      Strengths:

      - The authors have generated an animal model of "on demand" seizures, which could be used to screen new ASDs and potentially other therapies. The authors and their model make a good-faith effort to emulate the epileptic condition and to use seizure susceptibility or probability as a quantitative output measure.

      - The events considered to be seizures appear to be actual seizures, with some evidence that the seizures are different from seizures in the naïve brain. Their effort to determine how different ASDs raise seizure probability or threshold to an optogenetic stimulus to the CA1 area of the rodent hippocampus is focused on an important problem, as many if not most ASD screening uses surrogate measures that may not be as well linked to actual epileptic seizures.

      - Another concern is their stimulation of dorsal hippocampus, while ventral hippocampus would seem more appropriate.

      - Use of optogenetic techniques allows specific stimulation of the targeted CA1 pyramidal cells, and it appears that this approach is reproducible and reliable with quantitative rigor.

      - The authors have taken on a critically important problem, and have made a good-faith effort to address many of the technical concerns raised in the reviews, but the underlying problem of DRE remains.

      Weaknesses:

      - Although the model has potential advantages, it also has disadvantages. As stated by the authors, the pre-test work-load to prepare the model may not be worth the apparent advantages. And most important, the paper frequently mentions DRE but does not directly address it, and yet drug resistance is the critical issue in this field.

      - Although the paper shows examples of actual seizures, there remains some concern that some of the events might not be seizures - or a homogeneous population of seizures. More quantitative assessment of the electrical properties (e.g., duration) of the seizures and their probability is likely to be more useful than the proposed quantification in the future of the behavioral seizure stages, because the former could be both more objective and automated, while the behavioral analysis of the seizures will likely be more subjective and less reliable (and also fraught with analytical problems). Nonetheless, the authors point out that the presence of "Racine 3 or above" behavioral seizures (in addition to their electrical data) is a good argument that many (if not all) of the "seizures" are actual epileptic seizures.

      - Optogenetic stimulation of CA1 provides cell-specificity for the stimulation, but it is not clear that this method would actually be better than electrical stimulation of a kindled rodent with superimposed hippocampal injury. The reader is unfortunately left with the concern of whether this model would be easier and more efficacious than kindling.

      - Although the authors have taken on a critically important problem, and have combined a variety of technologies, this approach may facilitate more rapid screening of ASDs against actual seizures (beneficial), but it does not really address the fundamentally critical yet difficult problem of DRE. A critical issue for DRE that is not well-addressed relates to adverse effects, which is often why many ASDs are not well tolerated by many patients (e.g., LEV). Thus, we are left with: how does this address anti-seizure DRE?

      - The focus of this paper seems to be more on seizures than on epilepsy. In the absence of seizure spontaneity, the work seems to primarily address the issues of seizure spread and duration. Although this is useful, it does not seem to be addressing the question of what trips the system to generate a seizure.

      An appraisal of whether the authors achieved their aims, and whether the results support their conclusions:

      - The authors seem to have developed a new and useful model; however, it is not clear how this will address that core problem of DRE, which was their stated aim.

      - A discussion of the likely impact of the work on the field, and the utility of the methods and data to the community.

      - As stated before in the original review, the potential impact would primarily be aimed at the ETSP or a drug-testing CRO; however, much more work will be required to convince the epilepsy community that this approach will actually identify new ASDs for DRE. The approach is potentially time-consuming with a steep and potentially difficult optimization curve, and thus may not be readily adaptable to the typical epilepsy-models neuroscience laboratory.

      Any additional context you think would help readers interpret or understand the significance of the work:

      - The problem of DRE is much more complicated than described by the authors here; however, the paper could end up being more useful than is currently apparent. Although this work could be seen as technically - and maybe conceptually - elegant and a technical tour de force, will it "deliver on the promise"? Is it better than kindling for DRE? In attempting to improve the discovery process, how will the model move us to another level? Will this model really be any better than others, such as kindling?

    4. Reviewer #3 (Public review):

      This revised paper develops and characterizes a new approach for screening drugs for epilepsy. The idea is to increase the ability to study seizures in animals with epilepsy because most animal models have rare seizures. Thus, the authors use the existing intrahippocampal kainic acid (IHKA) mouse model, which can have very unpredictable seizures with long periods of time between seizures. This approach is of clear utility to researchers who may need to observe many seizure events per mouse during screening of antiseizure medications. A key strength is also that more utility can be derived from each individual mouse. The authors modified the IHKA model to inject KA into CA3 instead of CA1 in order to preserve the CA1 pyramidal cells that they will later stimulate. To express the excitatory opsin channelrhodopsin (ChR2) in area CA1, they use a virus that expresses ChR2 in cells that express the Thy-1 promoter. The authors demonstrate that CA3 delivery of KA can induce a very similar chronic epilepsy phenotype to the injection of KA in CA1 and show that optical excitation of CA1 can reliably induce seizures. The authors evaluate the impact of repeated stimulation on the reliability of seizure induction and show that seizures can be reliably induced by CA1 stimulation, at least for the short term (up to 16 days). These are strengths of the study.

      However, there are several limitations: the seizures are evoked, not spontaneous. It is not clear how induced seizures can be used to investigate if antiseizure medication can reduce spontaneous seizures. Although seizure inducibility and severity can be assessed, the lack of spontaneous seizures is a limitation. To their credit, the authors show that electrophysiological signatures of induced vs spontaneous seizures are similar in many ways, but the authors also show several differences. Notably, the induced seizures are robustly inhibited by the antiseizure medication levetiracetam and variably but significantly inhibited by diazepam, similar to many mouse models with chronic recurrent seizure activity. One also wonders if using a mouse model with numerous seizures (such as the pilocarpine model) might be more efficient than using a modified IHKA protocol.

      In this revised manuscript, the authors address some previous concerns related to definitions of seizures and events that are trains of spikes, sex as a biological variable, and present new images of ChR2 expression (but these images could be improved to see the cells more clearly). A few key concerns remain unaddressed, however. For example, it is still not clear that evoked seizures triggered by stimulating CA1 are similar to spontaneous seizures, regardless of the idea that CA1 plays a role in seizure disorders. It also remains unclear whether repeated activation of the hippocampal circuit will result in additional alterations to this circuit that affect the seizure phenotype over prolonged intervals (after 16 days). Furthermore, the use of SVM with the number of seizures being used as replicates (instead of number of mice) is inappropriate. Another theoretical concern is whether the authors are correct in suggesting that one will be able to re-use the mice for screening multiple drugs in a row.

      Strengths:

      - The authors show that the IHKA model of chronic epilepsy can be modified to preserve CA1 pyramidal cells, allowing optogenetic stimulation of CA1 to trigger seizures.

      - The authors show that repeated optogenetic stimulation of CA1 in untreated mice can promote kindling and induce seizures, indeed generating two mouse models in total.

      - Many electrophysiological signatures are similar between the induced and spontaneous seizures, and induced seizures reliably respond to treatment with antiseizure medications.

      - Given that more seizures can be observed per mouse using on-demand optogenetics, this model enhances the utility of each individual mouse.

      - Mice of each sex were used.

      Weaknesses:

      - Evaluation of seizure similarity using the SVM modeling and clustering is not sufficiently justified when using number of seizures as the statistical replicate (vs mice).

      - Related to the first concern, the utility of increasing number of seizures for enhancing statistical power is limited because standard practice is for sample size to be numbers of mice.

      - The term "seizure burden" usually refers to the number of spontaneous seizures per day, not the severity of the seizures themselves. Because the authors are evoking the seizures being studied, this study design precludes assessment of seizure burden.

      - It seems likely that repeatedly inducing seizures will have a long-term effect, especially in light of the downward slope at day 13-16 for induced seizures seen in Figure 4C. A duration of evaluation that is longer than 16 days is warranted.

      - Human epilepsy is extensively heterogeneous in both etiology and individual phenotype, and it may be hard to generalize the approach.

    5. Author response:

      The following is the authors’ response to the original reviews

      Reviewer 1 (Public review):

      Weaknesses:

      While the data generally supports the authors' conclusions, a weakness of this manuscript lies in their analytical approach where EEG feature-space comparisons used the number of spontaneous or evoked seizures as their replicates as opposed to the number of IHK mice; these large data sets tend to identify relatively small effects of uncertain biological significance as being highly statistically significant. Furthermore, the clinical relevance of similarly small differences in EEG feature space measurements between seizure-naïve and epileptic mice is also uncertain.

      In this work, we used a linear mixed-effects model to address two levels of variability: between animals and within animals. The interactive linear mixed-effects model shows that most (~90%) of the variability in our data comes from within animals (residual), the random effect that the model accounts for, rather than between animals. Since variability between animals is low, the model identifies common changes in seizure propagation across animals while accounting for the variability in seizures within each animal. Therefore, the results we find reflect changes that occur across animals, not individual seizures. We made text edits to clarify the use of the linear mixed-effects model (page 6, second paragraph, and page 11, first paragraph).
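
      The split between within-animal and between-animal variability described in this response can be illustrated with a minimal sketch. It is not the authors' mixed-effects model: it uses a one-way random-effects ANOVA (method of moments) on a balanced design, and the synthetic data, the function names `variance_components` and `within_fraction`, and the chosen standard deviations are all assumptions made for illustration.

```python
import random
from statistics import mean


def variance_components(groups):
    """One-way random-effects ANOVA (method of moments), balanced design.

    groups: one list of per-seizure feature values per animal, all the same length.
    Returns (between_animal_variance, within_animal_variance).
    """
    k = len(groups)        # number of animals
    n = len(groups[0])     # seizures per animal
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes the same number of seizures per animal")
    grand = mean(v for g in groups for v in g)
    animal_means = [mean(g) for g in groups]
    ms_between = n * sum((m - grand) ** 2 for m in animal_means) / (k - 1)
    ms_within = sum(
        (v - m) ** 2 for g, m in zip(groups, animal_means) for v in g
    ) / (k * (n - 1))
    # The moment estimator can go negative by chance; clamp it at zero.
    var_between = max(0.0, (ms_between - ms_within) / n)
    return var_between, ms_within


def within_fraction(groups):
    """Share of total variance attributed to within-animal (residual) variation."""
    var_between, var_within = variance_components(groups)
    return var_within / (var_between + var_within)


# Synthetic example (invented values, not data from the study):
# 8 animals, 20 seizures each, animal-level SD 1 and seizure-level SD 3.
rng = random.Random(0)
animals = []
for _ in range(8):
    offset = rng.gauss(0.0, 1.0)
    animals.append([10.0 + offset + rng.gauss(0.0, 3.0) for _ in range(20)])

print(f"within-animal share of variance: {within_fraction(animals):.2f}")
```

      With a seizure-level SD three times the animal-level SD, the expected within-animal share is 9/(1+9) = 0.9; the estimate from a finite sample will scatter around that value.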

      Finally, the multiple surgeries and long timetable to generate these mice may limit the value compared to existing models in drug-testing paradigms.

      Thank you for the suggestion. We added a discussion in the ‘Comparison to other seizure models…’ section on pages 15 and 16. In an existing model investigating spontaneous tonic-clonic seizures (such as the intra-amygdala kainate injection model), the time investment is back-loaded, requiring two to three weeks per condition while counting spontaneous seizures, which may occur only once a day. In contrast, our model requires a front-loaded time investment. Once the animals are set up, we can test multiple drugs within a few weeks, providing significant time savings. Additionally, we did not pre-screen animals in our study. Existing models often pre-select mice with high rates of spontaneous seizures, whereas in our model, seizures can be induced even in animals with few spontaneous seizures. We believe that bypassing the need for pre-screening also is a key advantage of our induced seizure model.  

      Reviewer 1 (Recommendations for the authors):

      (1) Address why the EEG data comparisons were performed between seizures and not between animals (as explicitly described in the public review). Further, a discussion of the biological significance (or lack thereof) of the effect size differences observed is warranted. This is especially concerning when the authors make the claim that spontaneous and induced seizures are essentially the same while their analysis shows all evaluated feature space parameters were significantly different in the initial 1/3 of the EEG waveforms.

      We made text edits to clarify the use of the linear mixed-effects model (page 6, second paragraph, and page 11, first paragraph).

      (2) The authors place great emphasis on the use of clinically/etiologically relevant epilepsy models in drug discovery research. There is discussion criticizing the time points required to enact kindling and the artificial nature of acute seizure induction methods. However, the combination IHK-opto seizure induction model also requires a lengthy timeline. A more tempered discussion of this novel model's strengths may benefit readers.

      Thank you for the suggestion. We added a discussion in the ‘Comparison to other seizure models…’ section on pages 15 and 16.

      (3) The authors should further emphasize the benefit of having an inducible seizure model of focal epilepsy since other mouse models (e.g., genetic or TBI models) may have superior etiological relevance (construct and face validity) but may not be amenable to their optogenetic stimulation approach.

      Thank you for the suggestion. We revised the manuscript to better emphasize the potential significance of our approach. We added a discussion in the 'Application of Models...' section on page 15, second paragraph. The on-demand seizure model can be applied to address biologically and clinically relevant questions beyond its utility in drug screening. For example, crossing the Thy1-ChR2 mouse line with genetic epilepsy models, such as Scn1a mutants, could reveal how optogenetic stimulation differentially induces seizures in mutant versus non-mutant mice, providing insights into seizure generation and propagation in Dravet syndrome. Due to the cellular specificity of optogenetics, we also envision this approach being used to study circuit-specific mechanisms of seizure generation and propagation.

      (4) Suggestion: Provide immunolabeled imagery demonstrating ChR2 presence in Thy1 cells.

      Thank you for the suggestion. We added a fluorescence image showing ChR2 expression in Fig. 2A.

      (5) It might be prudent to mention any potential effects of laser heat on hippocampal cell damage, although the 10 Hz, ~10 mW, and 6 s stim is unlikely to cause any substantial burns. Without knowing the diameter and material of the optic fiber, this is left up to some interpretation.

      Thank you for the comments. In the Methods section, we listed the optical fiber diameter as 400 microns (page 17, EEG and Fiber Implantation section). Using 5–18 mW laser power with a relatively large fiber diameter of 400 microns, the power density falls within the range of commonly employed channelrhodopsin activation conditions in vivo. That said, we would like to investigate potential heat effects or cell damage in a follow-up study.
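
      The power density implied by the stated parameters can be worked out with a short calculation. The sketch below assumes the light is spread uniformly over the 400-micron fiber core and reports the mean irradiance at the fiber tip only; scattering and absorption in tissue are not modeled, and the function name `irradiance_mw_per_mm2` is chosen here for illustration.

```python
import math


def irradiance_mw_per_mm2(power_mw, core_diameter_um):
    """Mean irradiance at the fiber tip, assuming uniform output over the core."""
    radius_mm = core_diameter_um / 1000.0 / 2.0
    area_mm2 = math.pi * radius_mm ** 2
    return power_mw / area_mm2


# Laser powers quoted in the response (5 and 18 mW) with a 400-micron fiber.
for power_mw in (5.0, 18.0):
    value = irradiance_mw_per_mm2(power_mw, 400.0)
    print(f"{power_mw:.0f} mW -> {value:.1f} mW/mm^2")
# prints:
# 5 mW -> 39.8 mW/mm^2
# 18 mW -> 143.2 mW/mm^2
```

      These are tip values under the uniform-output assumption; the irradiance reaching CA1 neurons at depth would be lower.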

      (6) There are instances in the manuscript where the authors describe experimental and analytical parameters vaguely (e.g. "Seizures were induced several times a day", "stimulation was performed every 1 - 3 hours over many days"). These descriptions can and should be more precise.

      Thank you for the comments. To enhance clarity, we added the stimulation protocol in a flowchart format in Fig. S2A, describing how we determined the threshold and proceeded to the drug test. Following this protocol, there was variability in the number of stimulations per day.

      (7) In the second to last paragraph of the discussion, the authors state "However, HPDs are not generalizable across species - they are specific to the mouse model (55)." This statement is inaccurate. The paper cited comes from Dr. Corrine Roucard's lab at Synapcell. In fact, Dr. Roucard argues the opposite (See Neurochem Res (2017) 42:1919-1925).

      Thank you for pointing out the mistake. On page 16, in the first paragraph, reference 55 (now 58 in the revised version) was intended to refer to 'quickly produce dose-response curves with high confidence.' In the revision, we cited another paper reporting that hippocampal spikes were not reproduced in the rat IHK model. R. Klee, C. Brandt, K. Töllner, W. Löscher, Various modifications of the intrahippocampal kainate model of mesial temporal lobe epilepsy in rats fail to resolve the marked rat-to-mouse differences in type and frequency of spontaneous seizures in this model. Epilepsy Behav. 68, 129–140 (2017).

      (8) In the discussion, Levetiracetam is highlighted as an ASM that would not be detected in acute induced seizure models; the authors point out its lack of effect in MES and PTZ. However, LEV is effective in the 6Hz test (also an acute-induced seizure model). This should be stated.

      Thank you for the comments. We highlighted the discussion on LEV in the 'Application of Model to Testing Multiple Classes of ASMs...' section on page 14.

      (9) The results text indicates that 9 epileptic mice were used to test LEV and DZP. However, the individual data points illustrated in Figure 5B show N=8 mice. Please correct.

      Thank you for the comments. A total of nine epileptic mice were used to assess two drugs, with the animals being re-used as indicated in the schematic. A total of eight assessments were conducted for DZP with six mice and eight assessments for LEV with five mice. Each assessment included hourly ChR2 activations without an ASM and hourly ChR2 activations after ASM injection.

      (10) Figure 4D: Naïve mice are labeled as solid blue circles in the legend while the data points are solid blue triangles. Please correct.

      Thank you. We corrected the marker in Fig. 4D.

      Reviewer 2 (Public Review):

      Weaknesses:

      (1) Although the figures provide excellent examples of individual electrographic seizures and compare induced seizures in epileptic and naïve animals, it is unclear which criteria were used to identify an actual seizure induced by the optogenetic stimulus, versus a hippocampal paroxysmal discharge (HPD), an "afterdischarge", an "electrophysiological epileptiform event" (EEE, Ref #36, D'Ambrosio et al., 2010 Epilepsy Currents), or a so-called "spike-wave-discharge" (SWD). Were HPDs or these other non-seizure events ever induced using stimulation in animals with IH-KA? A critical issue is that these other electrical events are not actual seizures, and it is unclear whether they were included in the column showing data on "electrographic afterdischarges" in Figure 5 for the studies on ASDs. This seems to be a problem in other areas of the paper, also.

      Thank you for pointing out the unclear definition of the seizures analyzed. We added sentences at the beginning of the Results section (page 3) to clarify the terminology we used. We analyzed animal behavior during evoked events, and a high percentage of induced electrographic events were accompanied by behavioral seizures with a Racine scale of three or above. We added Supplemental Figure S9, which shows behavioral seizure severity scores observed before and during ASM testing. We hope these changes address the reviewer’s concern and improve the clarity of the manuscript.

      (2) The differences between the optogenetically evoked seizures in IH-KA vs naïve mice are interpreted to be due to the "epileptogenesis" that had occurred, but the lesion from the KA-induced injury would be expected to cause differences in the electrically and behaviorally recorded seizures - even if epileptogenesis had not occurred. This is not adequately addressed.

      Thank you for the comments. IHK-injected mice had spontaneous tonic-clonic seizures before the start of optical stimulation, as shown in Figure S1.

      (3) The authors offer little mention of other research using animal models of TLE to screen ASDs, of which there are many published studies - many of them with other strengths and/or weaknesses. For example, although Grabenstatter and Dudek (2019, Epilepsia) used a version of the systemic KA model to obtain dose-response data on the effects of carbamazepine on spontaneous seizures, that work required use of KA-treated rats selected to have very high rates of spontaneous seizures, which requires careful and tedious selection of animals. The ETSP has published studies with an intra-amygdala kainic acid (IA-KA) model (West et al., 2022, Exp Neurol), where the authors claim that they can use spontaneous seizures to identify ASDs for DRE; however, their lack of a drug effect of carbamazepine may have been a false negative secondary to low seizure rates. The approach described in this paper may help with confounds caused by low or variable seizure rates. These types of issues should be discussed, along with others.

      We appreciate the reviewer’s insights. We added a discussion comparing our model with other existing models in the Discussion section (pages 15 and 16, 'Comparison to Other Seizure Models Used in Pharmacologic Screening' section). In an existing model investigating spontaneous tonic-clonic seizures (such as the intra-amygdala kainate injection model), the time investment is back-loaded, requiring two to three weeks per condition while counting spontaneous seizures, which may occur only once a day. In contrast, our model requires a front-loaded time investment. Once the animals are set up, we can test multiple drugs within a few weeks, providing significant time savings. Additionally, we did not pre-screen animals in our study. Existing models often pre-select mice with high rates of spontaneous seizures, whereas in our model, seizures can be induced even in animals with few spontaneous seizures. We believe that bypassing the need for pre-screening is a key advantage of our induced seizure model.

      (4) The outcome measure for testing LEV and DZP on seizures was essentially the fraction of unsuccessful or successful activations of seizures, where high ASD efficacy is based on showing that the optogenetic stimulation causes fewer seizures when the drug is present. The final outcome measure is thus a percentage, which would still lead to a large number of tests to be assured of adequate statistical power. Thus, there is a concern about whether this proposed approach will have high enough resolution to be more useful than conventional screening methods so that one can obtain actual dose-response data on ASDs.

      Thank you for the comments. In this revision, we added Supplemental Figure S9, showing the severity of behavioral seizures observed before and during ASM testing for each animal. We observed a reduction in behavioral seizure severity for each subject. We would like to explore using behavioral severity as an outcome measure in a follow-up study.
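
      The resolution concern raised in point (4), that a fraction of successful activations needs many tests to be estimated well, can be made concrete with a confidence interval for a proportion. The sketch below uses the Wilson score interval; the stimulation counts are hypothetical and are not taken from the study, and `wilson_interval` is a name chosen here.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical counts: the same 25% seizure-induction rate under drug,
# observed over 8 stimulations versus 40 stimulations.
for seizures, stimulations in ((2, 8), (10, 40)):
    low, high = wilson_interval(seizures, stimulations)
    print(f"{seizures}/{stimulations}: {low:.2f} to {high:.2f}")
```

      The interval narrows as the number of stimulations grows, which is why the number of tests per dose matters if the goal is a dose-response curve. The interval treats stimulations as independent trials, which is itself an assumption when they come from the same animal.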

      (5) The authors state that this approach should be used to test for and discover new ASDs for DRE, and also used for various open/closed loop protocols with deep-brain stimulation; however, the paper does not actually discuss rigorously or critically the background literature on other published studies in these areas or how this approach will improve future research for a broader audience than the ETSP and CROs. Thus, it is not clear whether the utility will apply more widely and how extensive a readership will be attracted to this work.

      We appreciate the reviewer’s insights. We revised the manuscript to better emphasize the potential significance of our approach (page 15, second paragraph). The on-demand seizure model can be applied to address biologically and clinically relevant questions beyond its utility in drug screening. For example, crossing the Thy1-ChR2 mouse line with genetic epilepsy models, such as Scn1a mutants, could reveal how optogenetic stimulation differentially induces seizures in mutant versus non-mutant mice, providing insights into seizure generation and propagation in Dravet syndrome. Due to the cellular specificity of optogenetics, we also envision this approach being used to study circuit-specific mechanisms of seizure generation and propagation. Regarding drug-resistant epilepsy (DRE) and anti-seizure drug (ASD) screening, we agree with the reviewer that probing new classes of ASDs for DRE represents a critical goal. However, we believe that a full exploration of additional ASD classes and/or modeling DRE lies outside the scope of this manuscript, and we would like to explore it in a follow-up study.

      Reviewer 2 (Recommendations for the authors):

      (1) The authors should explain why 10 Hz was chosen as the stimulation frequency.

      Thank you for the comment. A frequency of 10 Hz was determined based on previous work using anesthetized animals prepared in an acute in vivo setting. To simplify the paper and avoid confusion, we did not include a discussion on how we determined the frequency. Instead, we added a detailed description of how we optimized the power in a flowchart format in Supplemental Figure S2. We hope this improves reproducibility.

      (2) After micro-injection of KA, morphological changes were observed in the hippocampus, but no comparison of ChR2 expression was made in naïve animals vs KA-injected animals. Presumably, the Thy1-ChR2 mouse expresses GFP in cells that express ChR2. Thus, it may be useful to show the expression of ChR2 in animals with hippocampal sclerosis. This may explain the lack of dramatic difference between stimulation parameters in naïve vs epileptic animals, as shown in supplemental Figure S2.

      Thank you for the suggestion. We added a fluorescence image of ChR2 expression in CA1, ipsilateral to the KA-injected site, in Fig. 2A.

      (3) The authors state that "During epileptogenesis, neural networks in the brain undergo various changes ranging from modification of membrane receptors to the formation of new synapses" and that these changes are critical for successful "on-demand" seizure induction. However, it is not clear or well-discussed whether changes in neuronal cell densities that occur during sclerosis are important for "on-demand" seizure induction as well. Also, the authors showed that naïve animals exhibit a kindling-like effect, but it was unclear whether a similar effect was present in epileptic animals (i.e. do stimulation thresholds to seizure induction change as the animal gets more induction stimulations)? If present, would the secondary kindling affect drug-testing studies (e.g., would the drug effect be different on induced seizure #2 vs induced seizure #20)?

      Thank you for the suggestion. Since this is an important aspect of the model, we would like to address the kindling effect, the secondary kindling effect, and histopathology in a longer-term setting (several weeks) in a follow-up study.

      (4) The authors show that in their model, LEV and DZP were both efficacious. The authors do not seem to mention that, over 25 years ago, LEV was originally missed in the standard ETSP screens; and, it was only discovered outside of the ETSP with the kindling model. The kindling model is now used to screen ASDs. The authors should consider adding this point to the Discussion. It remains unclear, however, if the authors' screening strategy shows advantages over kindling and other such approaches in the field.

      Thank you for the suggestion. We added a discussion on LEV in the 'Application of Model to Testing Multiple Classes of ASMs...' section on page 14.

      (5) P8 paragraph 2. The authors state values for naïve animals, but they should also provide values for epileptic animals since they state that the groups were not significantly different (p>0.05). It would be useful to show values for both and state the actual p-value from the test. This issue of stating mean/median values with SD and sample size should be addressed for all data throughout the paper. Additionally, Figure S2 should be added to the manuscript and discussed, as it has data that may be valuable for the reproducibility of the paper.

      Thank you for the suggestion. Figure S2 shows the threshold power required to induce electrographic activity for n = 10 epileptic animals (9.14 ± 4.75 mW) and n = 6 naïve animals (6.17 ± 1.58 mW) (Wilcoxon rank-sum test, p = 0.137). The threshold duration was comparable between the same epileptic animals (6.30 ± 1.64 s) and naïve animals (5.67 ± 1.03 s) (Wilcoxon rank-sum test, p = 0.7133). 
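
      For readers unfamiliar with the test named in this response, the following is a minimal sketch of a two-sided Wilcoxon rank-sum test computed exactly by enumerating group labelings, which is feasible for groups of 10 and 6. The threshold values are invented for illustration, so the result is not expected to reproduce the reported p-values, and the study's software may have used a different variant (for example a normal approximation). The function names are chosen here.

```python
from itertools import combinations


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 0-based positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_exact_p(x, y):
    """Two-sided exact p-value: share of labelings at least as extreme as observed."""
    ranks = average_ranks(list(x) + list(y))
    n_x = len(x)
    expected = n_x * (len(ranks) + 1) / 2  # mean rank sum of group x under the null
    observed = abs(sum(ranks[:n_x]) - expected)
    total = extreme = 0
    for combo in combinations(ranks, n_x):
        total += 1
        if abs(sum(combo) - expected) >= observed - 1e-9:
            extreme += 1
    return extreme / total


# Invented threshold powers in mW (NOT the study's data): 10 epileptic, 6 naive.
epileptic = [5.0, 6.5, 7.0, 8.0, 9.0, 10.0, 12.0, 14.0, 16.0, 18.0]
naive = [4.5, 5.5, 6.0, 6.0, 7.5, 8.5]
print(f"exact two-sided p = {rank_sum_exact_p(epileptic, naive):.3f}")
```

      Enumeration covers 8008 labelings for these group sizes, so the exact calculation runs quickly and handles ties through the shared ranks.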

      (6) In addition to the other stated references on synaptic reorganization in the CA1 area, the authors should mention similar studies from Esclapez et al. (1999, J Comp Neurol).

      Thank you. We have included the reference in the revision.

      (7) All of the raw EEG data on the seizures should be accessible to the readers.

      Thank you for the suggestion. We will consider depositing EEG data in a publicly accessible site.

      Reviewer 3 (Public review):

      Weaknesses:

      (1) Evaluation of seizure similarity using the SVM modeling and clustering is not sufficiently explained to show if there are meaningful differences between induced and spontaneous seizures. SVM modeling did not include analysis to assess the overfitting of each classifier since mice were modeled individually for classification.

      Thank you for the comment. We made text edits to clarify the purpose of the SVM analysis. It was not intended to identify meaningful differences between induced and spontaneous seizures. Rather, it was used to classify EEG epochs as 'seizures' based on spontaneous seizures as the training set, demonstrating the gross similarity between induced and spontaneous seizures.

      (2) The difference between seizures and epileptiform discharges or trains of spikes (which are not seizures) is not made clear.

      Thank you for pointing out the unclear definition of the seizures analyzed. We added sentences at the beginning of the Results section (page 3) to clarify the terminology we used. We analyzed animal behavior during evoked events, and a high percentage of induced electrographic events were accompanied by behavioral seizures with a Racine scale of three or above. We added Supplemental Figure S9 to show the types of seizures observed before and during ASM testing. We hope these changes address the reviewer’s concern and improve the clarity of the manuscript.

      (3) The utility of increasing the number of seizures for enhancing statistical power is limited unless the sample size under evaluation is the number of seizures. However, the standard practice is for the sample size to be the number of mice.

      In this work, we used a linear mixed-effects model to address two levels of variability—between animals and within animals. The interactive linear mixed-effects model shows that most (~90%) of the variability in our data comes from within animals (residual), the random effect that the model accounts for, rather than between animals. Since variability between animals is low, the model identifies common changes in seizure propagation across animals while accounting for the variability in seizures within each animal. Therefore, the results we find reflect changes that occur across animals, not individual seizures. We made text edits to clarify the use of the linear mixed-effects model.
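
      The variance partitioning described above can be illustrated with a minimal sketch. It is not the authors' analysis: it assumes a random-intercept model with no fixed effects (a one-way random-effects layout), uses the classical ANOVA estimator instead of a fitted mixed model, and the function names and inputs are hypothetical. Each inner list holds one animal's per-seizure measurements.

```python
def variance_components(groups):
    """Estimate between-animal and within-animal variance (ANOVA estimator).

    groups: list of lists, one inner list of measurements per animal.
    Assumes at least two animals and at least one animal with >1 measurement.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Sums of squares between animals and within animals
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Effective group size; equals the common size when the design is balanced
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)

    var_between = max(0.0, (ms_between - ms_within) / n0)  # truncated at zero
    var_within = ms_within
    return var_between, var_within


def within_fraction(groups):
    """Share of total variance attributable to within-animal (residual) variation."""
    var_between, var_within = variance_components(groups)
    return var_within / (var_between + var_within)
```

      A within-animal fraction close to 1 from `within_fraction` would correspond to the situation the response describes, in which residual variation dominates and animals differ little from one another.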

      (4) Seizure burden is not easily tested.

      Thank you for the comment. We added Supplemental Figure S9 to summarize the severity of behavioral seizures before and during ASM testing. This addresses the reviewer’s comment on seizure burden. In a follow-up study, we would like to explore this type of outcome measure for drug screening.

      Reviewer 3 (Recommendations for the authors):

      (1) Provide a stronger rationale to use area CA1. For example, the authors mention that CA1 is active during seizure activity, but can seizures originate from CA1? That would make the approach logical and also explain why induced and spontaneous seizures are similar.

      Thank you for the comment. We discussed it in the Discussion section (page 14, first and second paragraphs).

      (2) Explain the use of SVM classifiers so it is more convincing that induced and spontaneous seizures are similar. Or, if they are not similar, explain that this is a limitation.

      We made text edits to clarify the purpose of the SVM analysis. It was not intended to identify meaningful differences between induced and spontaneous seizures. Rather, it was used to classify EEG epochs as 'seizures' based on spontaneous seizures as the training set, demonstrating the gross similarity between induced and spontaneous seizures.

      (3) If feasible, extend the duration over which seizure induction reliability is assessed so that the long-term utility of the model can be demonstrated.

      Thank you for the suggestion. We would like to assess long-term utility in a follow-up study.

      (4) The GitHub link is not yet active. The authors will be required to supply their relevant code for peer evaluation as well as publication.

      Thank you. The GitHub repository is now active.

      (5) State and assess the impacts of sex as a biological variable.

      Thank you for pointing this out. Both female and male animals were included in this study: Epileptic cohort: 7 males, 3 females; Naïve cohort: 3 males, 4 females.

    1. eLife Assessment

      This useful manuscript reports on a new mouse model for LAMA2-MD, a rare but very severe congenital muscular dystrophy. The knockout mice were generated by removing exon3 in the Lama2 gene, which results in a frameshift in exon4 and a premature stop codon. These animals lack any laminin-alpha2 protein and confirm results from previous Lama2 knockout models. Additionally, this study includes weak transcriptomics data that might be a good resource for the field. However, experimental evidence, methods, and data analyses supporting the main claims of the manuscript are incomplete.

    2. Reviewer #1 (Public review):

      Strengths:

      This work adds another mouse model for LAMA2-MD that reiterates the phenotype of previously published models, such as dy3K/dy3K, dy/dy and dyW/dyW mice. The phenotype is fully consistent with the data from others.

      One of the major weaknesses of the manuscript initially submitted was the overinterpretation and the overstatements. The revised version is clearly improved as the authors toned-down their interpretation and now also cite the relevant literature of previous work.

      Comments on revisions:

      This is the second revision of a paper focusing on the generation of a CRISPR/Cas9-engineered mouse model for LAMA2-MD. I have reviewed the initial submission, the first revision, and now this second revision. While there have been improvements, several issues still need to be addressed by the authors. I will outline these points without dividing them into major and minor categories:

      Introduction:

      The statement regarding existing mouse models requires correction: The claim, "They were established in the pre-gene therapy era, leaving trace of engineering, such as bacterial elements in the Lama2 gene locus, thus unsuitable for testing various gene therapy strategies," is inaccurate. Current mouse models can indeed be used for testing gene therapy strategies, regardless of whether they contain elements in the Lama2 locus. The primary consideration is whether or not they express laminin-alpha2. Please revise this statement.

      Results Section:

      scRNA-seq:

      The authors note that they analyzed "a total of 8,111 cells from the dyH/dyH mouse brain and 8,127 cells from the WT mouse brain were captured using the 10X Genomics platform (Figure supplement 4A, B)." This is too few cells to support firm conclusions. Furthermore, there is a discrepancy in the referred figure S4, which indicates that 10,094 cells were analyzed for dyH/dyH mice and 10,496 for wild-type mice. Please correct this inconsistency.

      Figure 5C displays differences in cell populations between wild-type and dyH/dyH mice. Given the low number of cells analyzed and the lack of replicates, these differences cannot be considered reliable. More samples should be analyzed to support these findings.

      The data suggest a defect in the BBB for dyH/dyH mice, but this conclusion is based on minimal cell counts and remains purely correlative. If BBB issues exist, experimental validation is necessary, such as injecting dyes into the bloodstream to detect any leakage. I have previously highlighted this in my comments on earlier manuscript versions.

      Bulk RNA-seq:

      The number of samples analyzed here is substantial, making the data potentially more robust. These data could serve as a valuable resource for other researchers. However, it is important to note that all data are correlative and do not provide functional insights.

      Overall:

      The manuscript still lacks significant insights, partly because existing mouse models for LAMA2-MD have been extensively analyzed. While the bulk RNA-seq data offer some value as a resource, I recommend that the authors re-assess their writing and further temper their interpretations of the findings.

    3. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review):

      This work adds another mouse model for LAMA2-MD that reiterates the phenotype of previously published models, such as dy3K/dy3K, dy/dy and dyW/dyW mice. The phenotype is fully consistent with the data from others.

      Thank you for the valuable comments and good suggestions you have proposed, and we have added information and analysis of another mouse model for LAMA2-MD in the updated version 2 of this manuscript.

      One of the major weaknesses of the manuscript initially submitted was the overinterpretation and the overstatements. The revised version is clearly improved as the authors toned-down their interpretation and now also cite the relevant literature of previous work.

      Thank you for the good comments you have proposed, and we have carefully corrected the overinterpretation and overstatements in the previous updated version.

      Unfortunately, the data on RNA-seq and scRNA-seq are still rather weak. scRNA-seq was conducted with only one mouse resulting in only 8000 nuclei. I am not convinced that the data allow us to interpret them to the extent of the authors. Similar to the first version, the authors infer function by examining expression. Although they are a bit more cautious, they still argue that the BBB is not functional in dy<sup>H</sup>/dy<sup>H</sup> mice without showing leakiness. Such experiments can be done using dyes, such as Evans-blue or Cadaverin. Hence, I would suggest that they formulate the text still more carefully.

      Thank you for the valuable suggestions. We agree that functional experiments, such as Evans blue or cadaverine tracer assays, are needed to confirm the impaired BBB. However, these experiments have not been performed because the first author has been working in the clinic. We have therefore added a "Limitations" section, which states: "Even though RNA-seq and scRNA-seq have been performed, the data of scRNA-seq are still insufficient due to the limited number of mouse brains. This study has provided potentially important information for the molecular pathogenetic mechanisms of muscular dystrophy and brain dysfunction for LAMA2-CMD, however, some related functional experiments have not been further performed".

      A similar lack of evidence is true for the suggested cobblestone-like lissencephaly of the mice. There is no strong evidence that this is indeed occurring in the mice (might also be a problem because mice die early). Hence, the conclusions need to be formulated in such a way that readers understand that these are interpretations and not facts.

      Thank you for the valuable suggestions. We agree with this comment and have added a statement to the Limitations: "This study has provided potentially important information for the molecular pathogenetic mechanisms of muscular dystrophy and brain dysfunction for LAMA2-CMD, however, some related functional experiments have not been further performed". Regarding the cobblestone-like lissencephaly, which has been shown in LAMA2-CMD patients but was not found in the mouse model, we have added the following to the discussion: "Though the cortical malformations were not found in the dy<sup>H</sup>/dy<sup>H</sup> brains by MRI analysis probably due to the small volume in within 1 month old, Thus, the changes in transcriptomes and protein levels provided potentially useful data for the hypothesis of the impaired gliovascular basal lamina of the BBB, which might be associated with occipital pachygyria in LAMA2-CMD patients."

      Finally, I am surprised that the only improvement in the main figures is the Western blot for laminin-alpha2. The histology of skeletal muscle still looks rather poor. I do not know what the problems are but suggest that the authors try to make sections from fresh-frozen tissue. I anticipate that the mice were eventually perfused with PFA before muscles were isolated. This often results in the big gaps in the sections.

      Thank you for the valuable suggestions. We agree with this comment and that sections should be made from fresh-frozen tissue. We have therefore added a statement to the Limitations: "Moreover, due to making sections with PFA before muscles isolated, and not from fresh-frozen tissue, there have been big gaps in the sections which do affect the histology of skeletal muscle to some extent."

      Overall, the work is improved but still would need additional experiments to make it really an important addition to the literature in the LAMA-MD field.

      Thank you for all your good comments and the valuable suggestions.

      Reviewer #2 (Public Review):

      This revised manuscript describes the production of a mouse model for LAMA2-Related Muscular Dystrophy. The authors investigate changes in transcripts within the brain and blood barrier. The authors also investigate changes in the transcriptome associated with the muscle cytoskeleton. Strengths: (1) The authors produced a mouse model of LAMA2-CMD using CRISPR-Cas9. (2) The authors identify cellular changes that disrupted the blood-brain barrier.

      Thank you for your good comments.

      Weaknesses:

      The authors throughout the manuscript overstate "discoveries" which have been previously described, published and not appropriately cited.

      Thank you for your great suggestion. We have toned down the interpretations and overstatements throughout the manuscript, and added words such as "potentially", "possible", "some potential clues", "was speculated to probably", and so on.

      Alterations in the blood-brain barrier and in the muscle cell cytoskeleton in LAMA2-CMD have been extensively studied and published in the literature and are not cited appropriately.

      Thank you for your great suggestion. We agree that alterations in the muscle cell cytoskeleton in LAMA2-CMD have been extensively studied and published, and the related literature has been cited in the updated version 2.0. However, alterations in the blood-brain barrier in LAMA2-CMD have not been extensively studied; only a few papers (such as PMID: 25392494 and PMID: 32792907) have investigated or discussed this issue.

      The authors have increased the animal number to N=6, but this is still insufficient based on power analysis and results in statistical errors and conclusions that may be incorrect.

      Thank you for your great suggestion. We do agree that the animal number should be increased for Power analysis, and we have added statements in the Limitations with "Finally, due to the limited number of animal samples for the Power analysis, the statistical errors and conclusions might be affected."

      The use of "novel mouse model" in the manuscript overstates the impact of the study.

      Thank you for your great suggestion. We have changed the statement "novel mouse model" throughout the manuscript except the title.

      All studies presented are descriptive and do not add more to the field except for producing yet another mouse model of LAMA2-CMD that is the same as all the others produced.

      Thank you for your comment. We do agree that further functional experiments have not been performed to reveal and confirm the pathogenesis. However, the analysis of phenotype was systematic and comprehensive, including survival time, motor function, serum CK, muscle MRI, muscle histopathology in different stages, and brain histopathology. Moreover, RNA-seq and scRNA-seq in LAMA2-CMD have been seldom performed before, and the data in this study could provide potentially important information for the molecular pathogenetic mechanisms of muscular dystrophy and brain dysfunction for LAMA2-CMD.

      Grip strength measurements are considered error prone and do not give an accurate measurement of muscle strength, which is better achieved using ex vivo or in vivo muscle contractility studies.

      Thank you for your great suggestion. We agree that grip strength measurements are considered error prone and do not give an accurate measurement of muscle strength. We have added a related statement to the Limitations: "Grip strength measurements used in this study are considered error prone and do not give an accurate measurement of muscle strength, which would be better achieved using ex vivo or in vivo muscle contractility studies."

      A lack of blinded studies, as pointed out by the authors, is a concern for the scientific rigor of the study.

      Thank you for your great suggestion. Those scoring the outcome measures were not blinded to the groups. In practice, the dy<sup>H</sup>/dy<sup>H</sup> mice were very easy to distinguish from the WT/Het mice because they showed a much smaller body size than the other groups from as early as P7.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      There are multiple grammatical errors throughout the manuscript which should be corrected.

      Thank you for your recommendation. We have carefully corrected the grammatical errors within the manuscript.

      The authors mention no changes in intestinal muscles, but it is unclear if they are referring to skeletal or smooth muscle.

      Thank you for your good comment. The intestinal muscles with no changes in this study are smooth muscle, and we have changed the description to "intestinal smooth muscles".

    1. eLife Assessment

      The authors present useful findings on the use of a single-fly behavioral paradigm for assessing different Drosophila genetic models of neurodegeneration. The experimental design and analyses are solid and can be used for quick behavioral assessment in fly models of various neurodegenerative diseases, especially those having an impact on locomotion. The work will be of interest to Drosophila biologists using behavior as a readout for their studies.

    2. Reviewer #1 (Public review):

      Translating discoveries from model organisms to humans is often challenging, especially in neuropsychiatric diseases, due to the vast gaps in the circuit complexities and cognitive capabilities. Kajtor et al. propose to bridge this gap in the fly models of Parkinson's disease (PD) by developing a new behavioural assay where flies respond to a moving shadow by modifying their locomotor activities. The authors believe the flies' response to the shadow approximates their escape response to an approaching predator. To validate this argument, they tested several PD-relevant transgenic fly lines and showed that some of them indeed have altered responses in their assay.

      Strengths:

      This single-fly-based assay is easy and inexpensive to set up, scalable and provides sensitive, quantitative estimates to probe flies' optomotor acuity. The behavioural data is detailed, and the analysis parameters are well-explained.

      Weaknesses:

      The authors have yet to link cellular physiology to behaviour. It will be interesting to see how future use of this assay helps uncover connections between cellular pathology and behavioural changes.

    3. Reviewer #2 (Public review):

      The manifestation and progression of neurodegenerative disorders are poorly understood. Many of the neuronal disorders start by presenting subtle changes in neuronal circuits, and quantification and measurement of these subtle behavior responses could help one delineate the mechanisms involved. The present study very nicely uses the flies' behavioral response to predator-mimicking passing shadows to measure subtle changes in their behavior. The data from various fly genetic models of Parkinson's disease support their claim. This single-trial method is useful to capture the individual animal's response to the threatening stimuli but stops short of capturing the fine ambulatory responses which could provide further information on an individual's behavioral response. By capturing the fine features, the authors could get detailed observations, such as posture, gait or wing positioning, for a better understanding of the behavioral response to the passing shadow.

    4. Author response:

      The following is the authors’ response to the original reviews

      We thank the Reviewers for their constructive comments and the Editor for the possibility to address the Reviewers’ points in this rebuttal. We 

      (1) Conducted new experiments with NP6510-Gal4 and TH-Gal4 lines to address potential behavioral differences due to targeting dopaminergic vs. both dopaminergic and serotonergic neurons

      (2) Conducted novel data analyses to emphasize the strength of sampling distributions of behavioral parameters across trials and individual flies

      (3) Provided Supplementary Movies

      (4) Calculated additional statistics

      (5) Edited and added text to address all points of the Reviewers.

      Please see our point-by-point responses below.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Translating discoveries from model organisms to humans is often challenging, especially in neuropsychiatric diseases, due to the vast gaps in the circuit complexities and cognitive capabilities. Kajtor et al. propose to bridge this gap in the fly models of Parkinson's disease (PD) by developing a new behavioral assay where flies respond to a moving shadow by modifying their locomotor activities. The authors believe the flies' response to the shadow approximates their escape response to an approaching predator. To validate this argument, they tested several PD-relevant transgenic fly lines and showed that some of them indeed have altered responses in their assay.

      Strengths:

      This single-fly-based assay is easy and inexpensive to set up, scalable, and provides sensitive, quantitative estimates to probe flies' optomotor acuity. The behavioral data is detailed, and the analysis parameters are well-explained.

      We thank the Reviewer for the positive assessment of our study.

      Weaknesses:

      While the abstract promises to give us an assay to accelerate fly-to-human translation, the authors need to provide evidence to show that this is indeed the case. They have used PD lines extensively characterized by other groups, often with cheaper and easier-to-setup assays like negative geotaxis, and do not offer any new insights into them. The conceptual leap from a low-level behavioral phenotype, e.g. changes in walking speed, to recapitulating human PD progression is enormous, and the paper does not make any attempt to bridge it. It needs to be clarified how this assay provides a new understanding of the fly PD models, as the authors do not explore the cellular/circuit basis of the phenotypes. Similarly, they have assumed that the behavior they are looking at is an escape-from-predator response modulated by the central complex- is there any evidence to support these assumptions? Because of their rather superficial approach, the paper does not go beyond providing us with a collection of interesting but preliminary observations.

      We thank the Reviewer for pointing out some limitations of our study. We would like to emphasize that what we perceive as the main advantage of performing single-fly and single-trial analyses is the access to rich data distributions that provide more fine-scale information compared to bulk assays. We think that this is exactly going one step closer to ‘bridging the enormous conceptual leap from a low-level behavioral phenotype, e.g. changes in walking speed, to recapitulating human PD progression’, and we showcase this in our study by comparing the distributions over the entire repertoire of behavioral responses across fly mutants. Nevertheless, we agree with the Reviewer that many more steps in this direction are needed to improve translatability. Therefore, we toned down the corresponding statements in the Abstract and in the Introduction. Moreover, to further emphasize the strength of sampling distributions of behavioral parameters across trials and individual flies, we complemented our comparisons of central tendencies with testing for potential differences in data dispersion, demonstrated in the novel Supplementary Figure S4.

      Looming stimuli have been used to characterize flies’ escape behaviors. These studies uncovered a surprisingly rich behavioral repertoire (Zacarias et al., 2018), which was modulated by both sensory and motor context, e.g. walking speed at time of stimulus presentation (Card and Dickinson, 2008; Oram and Card, 2022; Zacarias et al., 2018). The neural basis of these behaviors was also investigated, revealing loom-sensitive neurons in the optic lobe and the giant fiber escape pathway (Ache et al., 2019; de Vries and Clandinin, 2012). Although less frequently, passing shadows were also employed as threat-inducing stimuli in flies (Gibson et al., 2015). We opted for this variant of the stimulus so that we could ensure that the shadow reached the same coordinates in all linear tracks concurrently, aiding data analysis and scalability. Similar to the cited study, we found the same behavioral repertoire as in studies with looming stimuli, with an equivalent dependence on walking speed, confirming that looming stimuli and passing shadows can both be considered as threat-inducing visual stimuli. We added a discussion on this topic to the main text.

      Reviewer #2 (Public Review):

      In this study, Kajtor et al. investigated the use of a single-animal trial-based behavioral assay for the assessment of subtle changes in the locomotor behavior of different genetic models of Parkinson's disease of Drosophila. Different genotypes used in this study were Ddc-GAL4>UAS-Parkin-275W and UAS-α-Syn-A53T. The authors measured Drosophila's response to predator-mimicking passing shadow as a threatening stimulus. Along with these, various dopamine (DA) receptor mutants, Dop1R1, Dop1R2 and DopEcR were also tested.

      The behavior was measured in a custom-designed apparatus that allows simultaneous testing of 13 individual flies in a plexiglass arena. The inter-trial intervals were randomized for 40 trials within 40 minutes duration and fly responses were defined into freezing, slowing down, and running by hierarchical clustering. Most of the mutant flies showed decreased reactivity to threatening stimuli, but the speed-response behavior was genotype invariant.

      These data nicely show that measuring responses to the predator-mimicking passing shadows could be used to assess the subtle differences in the locomotion parameters in various genetic models of Drosophila.

      The understanding of the manifestation of various neuronal disorders is a topic of active research. Many of the neuronal disorders start by presenting subtle changes in neuronal circuits and quantification and measurement of these subtle behavior responses could help one delineate the mechanisms involved. The data from the present study nicely uses the behavioral response to predator-mimicking passing shadows to measure subtle changes in behavior. However, there are a few important points that would help establish the robustness of this study.

      We thank the Reviewer for the constructive comments and the positive assessment of our study.

      (1) The visual threat stimulus for measuring response behavior in Drosophila is previously established for both single and multiple flies in an arena. A comparative analysis of data and the pros and cons of the previously established techniques (for example, Gibson et al., 2015) with the technique presented in this study would be important to establish the current assay as an important advancement.

      We thank the Reviewer for this suggestion. We included the following discussion on measuring response behavior to visual threat stimuli in the revised manuscript.

      Many earlier studies used looming stimuli, that is, concentrically expanding shadows mimicking the approach of a predator from above, to study escape responses in flies (Ache et al., 2019; Card and Dickinson, 2008; de Vries and Clandinin, 2012; Oram and Card, 2022; Zacarias et al., 2018) as well as rodents (Braine and Georges, 2023; Heinemans and Moita, 2024; Lecca et al., 2017). These assays have the advantage of closely resembling naturalistic, ecologically relevant threat-inducing stimuli, and allow a relatively complete characterization of the fly escape behavior repertoire. As a flip side of their large degree of freedom, they do not lend themselves easily to providing a fully standardized, scalable behavioral assay. Therefore, Gibson et al. suggested a novel threat-inducing assay operating with moving overhead translational stimuli, that is, passing shadows, and demonstrated that they induce escape behaviors in flies akin to looming discs (Gibson et al., 2015). This assay, coined ReVSA (repetitive visual stimulus-induced arousal) by the authors, had the advantage of scalability, while constraining flies to a walking arena that somewhat restricted the remarkably rich escape types flies otherwise exhibit. Here we carried this idea one step further by using a screen to present the shadows instead of a physically moving paddle and placing individual flies in linear corridors instead of the common circular fly arena. This ensured that the shadow reached the same coordinates in all linear tracks concurrently and made it easy to accurately determine when individual flies encountered the stimulus, aiding data analysis and scalability. We found the same escape behavioral repertoire as in studies with looming stimuli and ReVSA (Gibson et al., 2015; Zacarias et al., 2018), with a similar dependence on walking speed (Oram and Card, 2022; Zacarias et al., 2018), confirming that looming stimuli and passing shadows can both be considered as threat-inducing visual stimuli.

      (2) Parkinson's disease mutants should be validated with other GAL-4 drivers along with Ddc-GAL4, such as NP6510-Gal4 (Riemensperger et al., 2013). This would be important to delineate the behavioral differences due to dopaminergic neurons and serotonergic neurons and establish the Parkinson's disease phenotype robustly.

      We thank the Reviewer for pointing out this limitation. To address this, we repeated our key experiments in Fig.3. with both TH-Gal4 and NP6510-Gal4 lines, and their respective controls. These yielded largely similar results to the Ddc-Gal4 lines reported in Fig.3., reproducing the decreased speed and decreased overall reactivity of PD-model flies. Nevertheless, TH-Gal4 and NP6510-Gal4 mutants showed an increased propensity to stop. Stop duration showed a significant increase not only in α-Syn but also in Parkin fruit flies. These novel results have been added to the text and are demonstrated in Supplementary Figure S3.

      (3) The DopEcR mutant genotype used for behavior analysis is w1118; PBac{PB}DopEcRc02142TM6B, Tb1. Balancer chromosomes, such as TM6B,Tb can have undesirable and uncharacterised behavioral effects. This could be addressed by removing the balancer and testing the DopEcR mutant in homozygous (if viable) or heterozygous conditions.

      We appreciate the Reviewer's comment and acknowledge the potential for the DopEcR balancer chromosome to produce unintended behavioral effects. However, given that this mutant was not essential to our main conclusions, we opted not to repeat the experiment. Nevertheless, we now discuss the possible confounds associated with using the PBac{PB}DopEcRc02142 mutant allele over the balancer chromosome. “We recognize a limitation in using PBac{PB}DopEcRc02142 over the TM6B, Tb<sup>1</sup> balancer chromosome, as the balancer itself may induce behavioral deficits in flies. We consider this unlikely, as the PBac{PB}DopEcRc02142 mutation demonstrates behavioral effects even in heterozygotes (Ishimoto et al., 2013). Additionally, to our knowledge, no studies have reported behavioral deficits in flies carrying the TM6B, Tb<sup>1</sup> balancer chromosome over a wild-type chromosome.”

      (4) The height of the arena is restricted to 1mm. However, for the wild-type flies (Canton-S) and many other mutants, the height is usually more than 1mm. Also, a 1 mm height could restrict the fly movement. For example, it might not allow the flies to flip upside down in the arena easily. This could introduce some unwanted behavioral changes. A simple experiment with an arena of height at least 2.5mm could be used to verify the effect of 1mm height.

      We thank the Reviewer for this comment, which prompted us to reassess the dimensions of the apparatus. The height of the arena was 1.5 mm, which we have now corrected in the text. We observed that the arena did not restrict the flies' walking and that flies could flip in the arena. We now include two Supplementary Movies to demonstrate this.

      (5) The detailed model for Monte Carlo simulation for speed-response simulation is not described. The simulation model and its hyperparameters need to be described in more depth and with proper justification.

      We thank the Reviewer for pointing out a lack of details with respect to Monte Carlo simulations. We used a nested model built from actual data distributions, without any assumptions. Accordingly, the simulation did not have hyperparameters typical in machine learning applications, the only external parameter being the number of resamplings (3000 for each draw). We made these modeling choices clearer and expanded this part as follows.

      “The effect of movement speed on the distribution of behavioral response types was tested using a nested Monte Carlo simulation framework (Fig. S5). This simulation aimed to model how different movement speeds impact the probability distribution of response types, comparing these simulated outcomes to empirical data. This approach allowed us to determine whether observed differences in response distributions are solely due to speed variations across genotypes or if additional behavioral factors contribute to the differences. First, we calculated the probability of each response type at different specific speed values (outer model). These probabilities were derived from the grand average of all trials across each genotype, capturing the overall tendency at various speeds. Second, we simulated behavior of virtual flies (n = 3000 per genotype, which falls within the same order of magnitude as the number of experimentally recorded trials from different genotypes) by drawing random velocity values from the empirical velocity distribution specific to the given genotype and then randomly selecting a reaction based on the reaction probabilities associated with the drawn velocity (inner model). Finally, we calculated reaction probabilities for the virtual flies and compared them with real data from animals of the same genotype.

      Differences were statistically tested by Chi-squared test.”
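
      For illustration, the nested procedure quoted above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: the response labels, the function names, the binning of speed into fixed-width bins (`bin_width`), and the pooling of all genotypes for the outer model are hypothetical choices made here for the example. Only the Chi-squared statistic is computed; obtaining a p-value would additionally require the Chi-squared distribution.

```python
import random
from collections import Counter, defaultdict

# Hypothetical response labels; the manuscript's own categories may differ.
RESPONSES = ("stop", "slow_down", "speed_up", "no_reaction")


def response_probabilities_by_speed(trials, bin_width=1.0):
    """Outer model: estimate P(response | speed bin) from pooled trials.

    `trials` is a list of (speed, response) pairs from all genotypes.
    """
    counts = defaultdict(Counter)
    for speed, response in trials:
        counts[int(speed // bin_width)][response] += 1
    return {
        speed_bin: {r: c[r] / sum(c.values()) for r in RESPONSES}
        for speed_bin, c in counts.items()
    }


def simulate_genotype(speeds, prob_by_bin, n_virtual=3000, bin_width=1.0, seed=0):
    """Inner model: simulate virtual flies for one genotype.

    Each virtual fly draws a speed from the genotype's empirical speeds,
    then a response from the outer-model probabilities for that speed bin.
    Every speed in `speeds` must fall into a bin present in `prob_by_bin`.
    """
    rng = random.Random(seed)
    simulated = Counter()
    for _ in range(n_virtual):
        speed = rng.choice(speeds)
        probs = prob_by_bin[int(speed // bin_width)]
        weights = [probs[r] for r in RESPONSES]
        simulated[rng.choices(RESPONSES, weights=weights)[0]] += 1
    return {r: simulated[r] / n_virtual for r in RESPONSES}


def chi_squared_statistic(observed_counts, expected_probs):
    """Chi-squared statistic of observed counts against expected proportions."""
    n = sum(observed_counts.values())
    return sum(
        (observed_counts.get(r, 0) - n * p) ** 2 / (n * p)
        for r, p in expected_probs.items()
        if p > 0
    )
```

      If the simulated response proportions for a genotype match the observed ones, speed alone would account for that genotype's response distribution; a large statistic would point to additional behavioral factors.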

      (6) The statistical analysis in different experiments needs revisiting. It wasn't clear to me if the authors checked if the data is normally distributed. A simple remedy to this would be to check the normality of data using the Shapiro-Wilk test or Kolmogorov-Smirnov test. Based on the normality check, data should be further analyzed using either parametric or non-parametric statistical tests. Further, the statistical test for the age-dependent behavior response needs revisiting as well. Using two-way ANOVA is not justified given the complexity of the experimental design. Again, after checking for the normality of data, a more rigorous statistical test, such as split-plot ANOVA or a generalized linear model could be used.

      We thank the Reviewer for this comment. We performed Kolmogorov-Smirnov tests for normality on the data distributions underlying Figure 3, and normality was rejected for all data distributions at p = 0.05, which justifies the use of the non-parametric Mann-Whitney U-test. Regarding ANOVA, we would like to point out that the ANOVA hypothesis test design is robust to deviations from normality (Knief and Forstmeier, 2021; Mooi et al., 2018). While the Kruskal-Wallis test is considered a reasonable non-parametric alternative to one-way ANOVA, there is no clear consensus on a non-parametric alternative to two-way ANOVA. Therefore, we left the two-way ANOVA for Figure 5 in place; however, to increase the statistical confidence in our conclusions, we performed Kruskal-Wallis tests for the main effect of age and found significant effects in all genotypes in accordance with the ANOVA, confirming the results (Stop frequency, DopEcR, p = 0.0007; Dop1R1, p = 0.004; Dop1R2, p = 9.94 × 10<sup>-5</sup>; w<sup>1118</sup>, p = 9.89 × 10<sup>-13</sup>; y<sup>1</sup> w<sup>67</sup>c<sup>23</sup>, p = 2.54 × 10<sup>-5</sup>; Slowing down frequency, DopEcR, p = 0.0421; Dop1R1, p = 5.77 × 10<sup>-6</sup>; Dop1R2, p = 0.011; w<sup>1118</sup>, p = 2.62 × 10<sup>-5</sup>; y<sup>1</sup> w<sup>67</sup>c<sup>23</sup>, p = 0.0382; Speeding up frequency, DopEcR, p = 0.0003; Dop1R1, p = 2.06 × 10<sup>-7</sup>; Dop1R2, p = 2.19 × 10<sup>-6</sup>; w<sup>1118</sup>, p = 0.0044; y<sup>1</sup> w<sup>67</sup>c<sup>23</sup>, p = 1.36 × 10<sup>-5</sup>). We also changed the post hoc Tukey tests to post hoc Mann-Whitney tests in the text to be consistent with the statistical analyses for Figure 3. These yielded results very similar to those of the Tukey tests. Of note, unlike Tukey’s ‘honest significance’ approach, there is no straightforward way of correcting for multiple comparisons in this case; we therefore report uncorrected p values and suggest evaluating them at a threshold of p = 0.01, which limits type I errors. These notes have been added to the ‘Data analysis and statistics’ Methods section.
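
      As an illustration of the Kruskal-Wallis test mentioned above, the H statistic for a main effect such as age can be computed from pooled ranks. The following is a minimal pure-Python sketch, not the authors' analysis code: the function names are hypothetical, each group is assumed to hold the measurements of one age group, and no tie correction is applied. In practice a statistics library would be used, and the p-value is obtained by comparing H against a Chi-squared distribution with k - 1 degrees of freedom for k groups.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups (no tie correction)."""
    pooled = [x for group in groups for x in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
```

      Because the statistic depends only on ranks, it makes no normality assumption, which is why it is a common companion to a one-way ANOVA when normality is rejected.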

      (7) The dopamine receptor mutants used in this study are well characterized for learning and memory deficits. In the Parkinson's disease model of Drosophila, there is a loss of DA neurons in specific pockets in the central brain. Hence, it would be apt to use whole animal DA receptor mutants as general DA mutants rather than the Parkinson's disease model. The authors may want to rework the title to reflect the same.

      We thank the Reviewer for this comment, which suggests that we were not sufficiently clear on the Drosophila lines with DA receptor mutations. We used Mi{MIC} random insertion lines for dopamine receptor mutants, namely y<sup>1</sup> w<sup>*1</sup>; Mi{MIC}Dop1R1<sup>MI04437</sup> (BDSC 43773), y<sup>1</sup> w<sup>*1</sup>; Mi{MIC}Dop1R2<sup>MI08664</sup> (BDSC 51098) (Harbison et al., 2019; Pimentel et al., 2016), and w<sup>1118</sup>; PBac{PB}DopEcR<sup>c02142</sup>/TM6B, Tb<sup>1</sup> (BDSC 10847) (Ishimoto et al., 2013; Petruccelli et al., 2020, 2016). These lines carried reported mutations in dopamine receptors, most likely generating partial knock down of the respective receptors. We made this clearer by including the full names at the first occurrence of the lines in Results (beyond those in Methods) and adding references to each of the lines.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Please think about focusing the manuscript either on the escape response or the PD pathology and provide additional evidence to demonstrate that you indeed have a novel system to address open questions in the field.

      As detailed above, we now emphasize more that the main advantage of our single-trial-based approach lies in the appropriate statistical comparison of rich distributions of behavioral data. Please see our response to the ‘Weaknesses’ section for more details.

      (2) Please explain the rationale for choosing the genetic lines and provide appropriate genetic controls in the experiments, e.g. trans-heterozygotes. Why use Ddc-Gal4 instead of TH or other specific Split-Gal4 lines?

      We thank the Reviewer for this suggestion. We repeated our key experiments with TH-Gal4 and NP6510-Gal4 lines. Please see our response to Point #2 of Reviewer #2 for details.

      (3) Please proofread the manuscript for omissions, e.g. there's no legend for Fig 4b.

      We respectfully point out that the legend is there, and it reads “b, Proportion of a given response type as a function of average fly speed before the shadow presentation. Top, Parkin and α-Syn flies. Bottom, Dop1R1, Dop1R2 and DopEcR mutant flies.”

      Reviewer #2 (Recommendations For The Authors):

      (1) In figure 2(c), representing the average walking speed data for different mutants would be useful to visually correlate the walking differences.

      We thank the Reviewer for this suggestion. The average walking speed was added in a scatter plot format, as suggested in the next point of the Reviewer. 

      (2) The data could be represented more clearly using scatter plots. Also, the color scheme could be more color-blindness friendly.

      We thank the Reviewer for this suggestion. We added scatter plots to Fig.2c that indeed represent the distribution of behavioral responses better. We also changed the color scheme and removed red/green labeling.

      (3) The manuscript should be checked for typos such as in lines 252, 449, 484.

      Thank you. We fixed the typos.

      References

      Ache JM, Polsky J, Alghailani S, Parekh R, Breads P, Peek MY, Bock DD, von Reyn CR, Card GM. 2019. Neural Basis for Looming Size and Velocity Encoding in the Drosophila Giant Fiber Escape Pathway. Curr Biol 29:1073-1081.e4. doi:10.1016/j.cub.2019.01.079

      Braine A, Georges F. 2023. Emotion in action: When emotions meet motor circuits. Neurosci Biobehav Rev 155:105475. doi:10.1016/j.neubiorev.2023.105475

      Card G, Dickinson MH. 2008. Visually Mediated Motor Planning in the Escape Response of Drosophila. Curr Biol 18:1300–1307. doi:10.1016/j.cub.2008.07.094

      de Vries SEJ, Clandinin TR. 2012. Loom-Sensitive Neurons Link Computation to Action in the Drosophila Visual System. Curr Biol 22:353–362. doi:10.1016/j.cub.2012.01.007

      Gibson WT, Gonzalez CR, Fernandez C, Ramasamy L, Tabachnik T, Du RR, Felsen PD, Maire MR, Perona P, Anderson DJ. 2015. Behavioral Responses to a Repetitive Visual Threat Stimulus Express a Persistent State of Defensive Arousal in Drosophila. Curr Biol 25:1401–1415. doi:10.1016/j.cub.2015.03.058

      Harbison ST, Kumar S, Huang W, McCoy LJ, Smith KR, Mackay TFC. 2019. Genome-Wide Association Study of Circadian Behavior in Drosophila melanogaster. Behav Genet 49:60–82. doi:10.1007/s10519-018-9932-0

      Heinemans M, Moita MA. 2024. Looming stimuli reliably drive innate defensive responses in male rats, but not learned defensive responses. Sci Rep 14:21578. doi:10.1038/s41598-024-70256-2

      Ishimoto H, Wang Z, Rao Y, Wu C, Kitamoto T. 2013. A Novel Role for Ecdysone in Drosophila Conditioned Behavior: Linking GPCR-Mediated Non-canonical Steroid Action to cAMP Signaling in the Adult Brain. PLoS Genet 9:e1003843. doi:10.1371/journal.pgen.1003843

      Knief U, Forstmeier W. 2021. Violating the normality assumption may be the lesser of two evils. Behav Res Methods 53:2576–2590. doi:10.3758/s13428-021-01587-5

      Lecca S, Meye FJ, Trusel M, Tchenio A, Harris J, Schwarz MK, Burdakov D, Georges F, Mameli M. 2017. Aversive stimuli drive hypothalamus-to-habenula excitation to promote escape behavior. eLife 6:1–16. doi:10.7554/eLife.30697

      Mooi E, Sarstedt M, Mooi-Reci I. 2018. Market Research, Springer Texts in Business and Economics. Singapore: Springer Singapore. doi:10.1007/978-981-10-5218-7

      Oram TB, Card GM. 2022. Context-dependent control of behavior in Drosophila. Curr Opin Neurobiol 73:102523. doi:10.1016/j.conb.2022.02.003

      Petruccelli E, Lark A, Mrkvicka JA, Kitamoto T. 2020. Significance of DopEcR, a G-protein coupled dopamine/ecdysteroid receptor, in physiological and behavioral response to stressors. J Neurogenet 34:55–68. doi:10.1080/01677063.2019.1710144

      Petruccelli E, Li Q, Rao Y, Kitamoto T. 2016. The Unique Dopamine/Ecdysteroid Receptor Modulates Ethanol-Induced Sedation in Drosophila. J Neurosci 36:4647–4657. doi:10.1523/JNEUROSCI.3774-15.2016

      Pimentel D, Donlea JM, Talbot CB, Song SM, Thurston AJF, Miesenböck G. 2016. Operation of a homeostatic sleep switch. Nature 536:333–337. doi:10.1038/nature19055

      Zacarias R, Namiki S, Card GM, Vasconcelos ML, Moita MA. 2018. Speed dependent descending control of freezing behavior in Drosophila melanogaster. Nat Commun 9:1–11. doi:10.1038/s41467-018-05875-1

    1. eLife Assessment

      This is an important study that combines replications of findings and novel detailed MRI investigations to assess the impact of environmental enrichment and maternal behavior on mouse brain structure at different stages of development. The results and evidence supporting the conclusions are convincing, but in detail, the interpretation is challenging, in particular due to inter-individual and inter-litter variability. The extent to which maternal care mediates the impact of enrichment on brain development during the perinatal period also remains unclear because behavior was observed only during short periods, and the performed analyses are still incomplete. This study will nevertheless be of significant interest to neuroscientists and researchers interested in neurodevelopment in relation to environmental factors because of its in-depth use of MRI to study brain plasticity in mice.

    2. Reviewer #1 (Public review):

      Kaller et al. (2025) explore the impact of environmental enrichment (EE) on the developing mouse brain, specifically during the perinatal period. The authors use high-resolution MRI to examine structural brain changes in neonates (postnatal day 7, P7) and compare these changes to those observed in adulthood. A key aspect of the study is the investigation of maternal care as a potential mediating factor in the effects of perinatal EE on neonatal brain development.

      The work exhibits the following notable strengths:

      (1) The study addresses a significant gap in the literature by investigating the effects of perinatal EE on whole-brain structure in neonates. Previous research has primarily focused on the effects of EE on the adult brain or specific aspects of early development, such as the visual system.

      (2) The authors employ a combination of high-resolution MRI and behavioral analysis of maternal care, providing a comprehensive view of the effects of EE.

      (3) The study reveals that EE affects brain structure as early as P7, with distinct regional changes compared to adulthood. The finding that maternal care influences neonatal brain structure and correlates with the effects of EE is particularly noteworthy.

      (4) The paper is clearly written, well-organized, and easy to follow. The figures and tables are informative and effectively illustrate the key findings.

      However, some weaknesses should be addressed to improve the quality of this study:

      (1) While the study includes an assessment of maternal care, the observational period is relatively short. A more extended or continuous assessment of maternal behavior could provide a more comprehensive understanding of its role in mediating the effects of EE.

      (2) The study primarily focuses on structural brain changes. Investigating the functional consequences of these changes could provide further insights into the long-term impact of perinatal EE.

      (3) The study demonstrates a correlation between maternal care and neonatal brain structure but does not elucidate the underlying mechanisms. Future studies could explore potential molecular or cellular mechanisms involved in these effects.

    3. Reviewer #2 (Public review):

      This paper by Kaller and colleagues combines an interesting replication of findings on the importance of maternal behavior on brain development in the offspring with a state-of-the-art MRI analysis and a novel comparison between such perinatal and early postnatal enrichment via the activity of the mother and a classical enriched environment in the adult. In general, the observations are as one would have expected. Early postnatal enrichment and adult enrichment have differential effects, which is plausible because the source of these changes is environmental, and "environmental" means very different things at these different stages. The three data sets presented are really interesting, and while the comparison between them might not always be as straightforward as it seems, the cross-sectional phenotyping with MRI already provides very important material and allows for interesting insight. Most interesting is possibly the massive effect of housing conditions at P7.

      In particular, the role of individual behavior differs. The authors highlight this role of the interaction with the environment, rather than the environment alone. Maternal care is a process that involves the pup.

      Importantly, the study shows that being born into an enriched environment predates certain changes that are still available after exposure at a later stage, but that there are also important differences. Detailed interpretation of these effects is not easy, however.

      Notably, the study does not include a condition of enrichment from birth into adulthood, and no analysis of the perinatal enrichment effects at an adult age. The timeline can be guessed from Figure 1b, but the authors might in places be more explicit about the fact that, indirectly and sometimes directly, animals of different ages (young adult versus adult) are compared. There is obviously no experience of maternal care in adulthood and no active exploration, etc., in childhood. In part, this is what this paper is about, but it requires some thought for the reader to separate the more trivial from the more profound conclusions. Some more guidance would probably be welcome here. In general, Figure 4 is a great idea (and visually very appealing), but the content is not quite clear. "Adults born in EE vs. switched to EE in adulthood": this has, as far as I can tell, not been studied. What is compared are EE effects at two different time-points with two supposedly different mechanisms.

      From such a more mechanistic side, the authors might, for example, want to relate the observed patterns to what is known about the developmental (and plastic) dynamics in the respective brain regions at the given time. But age is a confounder here.

      There is another interesting point that the authors might discuss more prominently. The inter-individual differences in Z-score are dramatic within essentially all groups. So while the mean effects might still be statistically different, a large proportion of animals are within a range of values that could be found in either experimental group. The same is also true for the effects of maternal care, as depicted in Figure 3. While there is, for this ROI, a clear trend that overall relative volume decreases with maternal contact time at each time point, there is a large range of values for each maternal contact time bin. Consequently, neither genetics nor maternal care per se can be the driver of this variation. Part of it will be technical, but the trend in the data indicates that certainly not all of this is noise and technical error.

      This study has some open ends but also provides a very important and interesting direction for future study, corroborating the idea that behavior, maternal and own, does matter.

    4. Reviewer #3 (Public review):

      Summary:

      This study aimed to investigate the effect of environmental enrichment (EE) during the critical perinatal period on the developing brain structure and compare it with other periods. Different datasets of mice with EE or standard housing (SH) were compared with post-mortem MRI: dataset A (MRI at P96; 13 animals in EE during adulthood P53-P96, 14 animals in SH), dataset P (MRI at P43; 24 animals in EE during perinatal period and adulthood E17-P43, 25 animals in SH) and dataset N (MRI at P7; 52 animals in EE during perinatal period E13-P7, 67 animals in SH / resulting from 5 dams with 2 litters: 4 dams in EE and 6 dams in SH). The study replicated the effects observed during adulthood (main neuroanatomical EE/SH difference in datasets A and P: increase in the hippocampus volume) but also showed that volumetric changes for some regions differ between datasets A and P, suggesting different mechanisms of brain responses to enrichment depending on the period when EE was applied. Results on dataset N further showed that EE leads to lower brain size and differences for various regions: volume reduction in striatum, frontal, parietal, and occipital regions, hippocampus; volume increase for a few thalamic nuclei and hindbrain, suggesting different patterns of perinatal EE effects in datasets P and N. Since mice at P7 show little engagement with their environment, the authors further explored the hypothesis that the dams' behavior and interaction with neonates could be a mediator of brain differences observed at P7 between EE and SH animals. Maternal contact time was related to the P7 volumes for some regions (striatum, brainstem), but the variability and low sample size prevented a clear separation between EE and SH in terms of maternal behaviors.

      Strengths:

      (1) The question raised by this article is important at a fundamental level for our understanding of the complex interactions between the brain, behavior, and the environment.

      (2) This study replicates previous observations on the effects of EE in adult mice.

      (3) While some studies have been performed on neonates of dams exposed to EE during gestation, it is the first time that the effects of perinatal EE are investigated, in both the developing and mature brains with MRI. From a translational perspective, this is crucial for our understanding of human neurodevelopment in interaction with the environment.

      (4) The analyses carried out are numerous and detailed.

      Weaknesses:

      (1) The analyses carried out do not allow us to fully assess whether differences in maternal care mediate the effects of EE on brain structure during development. The observations support this causal hypothesis, but a complete mediation analysis would be useful if permitted by the sample size and the variability observed between litters.

      (2) The article is quite dense to read, given the number of analyses carried out. It is difficult at first reading to get a global view of the results. Figure 4 could be highlighted earlier to present the hypotheses and tests carried out.

      (3) The figures could be more explicit in terms of legends (particularly the supplementary figures).

    1. eLife Assessment

      This manuscript aims to identify the pacemaker cells in the lymphatic collecting vessels - the cells that initiate the autonomous action potentials and contractions needed to drive lymphatic pumping. Through the exemplary use of existing approaches (genetic deletions and cytosolic calcium detection in multiple cell types), the authors convincingly determine that lymphatic muscle cells are the origin of the action potential that triggers lymphatic contraction. The inclusion of scRNAseq and membrane potential data enhances a tremendous study. This fundamental discovery establishes a new standard for the field of lymphatic physiology.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript explores the multiple cell types present in the wall of murine collecting lymphatic vessels with the goal of identifying cells that initiate the autonomous action potentials and contractions needed to drive lymphatic pumping. Through the use of genetic models to delete individual genes or detect cytosolic calcium in specific cell types, the authors convincingly determine that lymphatic muscle cells are the origin of the action potential that triggers lymphatic contraction.

      Strengths:

      The experiments are rigorously performed, the data justify the conclusions and the limitations of the study are appropriately discussed.

      There is a need to identify therapeutic targets to improve lymphatic contraction and this work helps identify lymphatic muscle cells as potential cellular targets for intervention.

      Comments on revisions: The authors have addressed all of the reviewer comments. They should be congratulated on their precise and comprehensive study.

    3. Reviewer #2 (Public review):

      Summary:

      This is a well written manuscript describing studies directed at identifying the cell type responsible for pacemaking in murine collecting lymphatics. Using state of the art approaches, the authors identified a number of different cell types in the wall of these lymphatics and then using targeted expression of Channel Rhodopsin and GCaMP, the authors convincingly demonstrate that only activation of lymphatic muscle cells produces coordinated lymphatic contraction and that only lymphatic muscle cells display pressure-dependent Ca2+ transients as would be expected of a pacemaker in these lymphatics.

      Strengths:

      The use of targeted expression of channel rhodopsin and GCaMP to test the hypothesis that lymphatic muscle cells serve as the pacemakers in murine lymphatic collecting vessels.

      Weaknesses:

      The only significant weakness was the lack of quantitative analysis of most of the imaging data shown in Figures 1-11. In particular, the colocalization analysis should be extended to show cells not expected to demonstrate colocalization as a negative control for the colocalization analysis that the authors present. These weaknesses have been resolved by revision and the addition of novel RNAseq data, additional colocalization data and membrane potential measurements.

      Comments on revisions: No additional concerns.

    4. Reviewer #3 (Public review):

      Summary:

      Zawieja et al. aimed to identify the pacemaker cells in the lymphatic collecting vessels. The authors have used various Cre-based expression systems and optogenetic tools to identify these cells. Their findings suggest these cells are lymphatic muscle cells that drive the pacemaker activity in the lymphatic collecting vessels.

      Strengths:

      The authors have used multiple approaches to test their hypothesis. Some findings are presented as qualitative images, while some quantitative measurements are provided.

      Weaknesses:

      - More quantitative measurements.
      - Possible mechanisms associated with the pacemaker activity.
      - Membrane potential measurements.

      Comments on revisions: I do not have any additional comments.

    5. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Recommendations for the authors):

      The authors have done an impressive job in responding to the previous critique and even gone beyond what was asked. I have only very minor comments on this excellent manuscript. The manuscript also needs some light editing for grammar and readability.

      We have worked to improve the grammar and readability of the manuscript.

      Comments:

      Lines 227-234: At what age was tamoxifen administered to the various CreERTM mice?

      We have updated the ages of the mice used in this study in the methods sections.

      UMAP in Figure 5A is missing label for cluster 19.

      The UMAP in Figure 5A has the label for cluster 19 at the center-bottom of the image.

      Supplement Figure 6: Cluster 10 seems to be separate from the other AdvC clusters, and it includes some expression of Myh11 and Notch3. Further, there is low expression of Pdgfra in this cluster, which can be seen in panel B and panels D-I. Are the Pdgfra negative cells in the pie charts from cluster 10? Could the cells in this cluster be more LMC-like than AdvC-like?

      We agree with the reviewer that subcluster 10 of the fibroblast cells is intriguing, if only a minor population. This population comprises 77 cells out of 2261 total. Of these 77 cells, 40 were Pdgfra+; of the remaining 37 Pdgfra- cells, 11 were still CD34+. Thus at least half of these cells could be expected to have the PdgfraCreERTM. Only 8 of the 37 were Pdgfra-Notch3+ while 12 cells were Pdgfra+Notch3+, and only 3 were Pdgfra-Myh11+ while 3 were Pdgfra+Myh11+. 26 of 77 cells were Pdgfra+Pdgfrb+ double positive, while 12 of 37 Pdgfra- cells were still Pdgfrb+. Additionally, within the 77 cells of subcluster 10, 17 were positive for Scn3a (Nav1.3), 21 were positive for Kcnj8 (Kir6.1), and 33 were positive for Cacna1c (Cav1.2). These are typical LMC markers, which would support the reviewer's thinking that this group contains a fibroblast-LMC transitional cell type. Only 2 of 77 cells were positive for the BK subunit (Kcnma1), which is a classic smooth muscle marker. Another possibility is that this population represents the Pdgfra+Pdgfrb+ valve interstitial cells we identified in our IF staining and in our reporter mice. Of note, almost all cells in this cluster were Col3a1+ and Vim+. Even though we performed QC analysis to remove doublets, it is also possible that some of these cells represent doublets or contaminants; however, the low percentage of cells expressing Myh11, a very highly expressed gene in LMCs, especially compared to ion channels, would suggest this is less likely. Assessing the presence of this particular cell cluster in future RNAseq or with spatial transcriptomics will be enlightening.

      Line 360. Proofread section title.

      We have simplified this title to read “Optogenetic Stimulation of iCre-driven Channel Rhodopsin 2”

      Lines 370-371. Are the length units supposed to be microns or millimeters?

      We have corrected this to microns as was intended. Thank you for catching this error.

      The resolution for each UMAP analysis should be stated, particularly for the identification of subclusters. How was the resolution chosen?

      To select the optimal cluster resolution, we used Clustree with various resolutions. We examined the resulting tree to identify a resolution where the clusters were well-separated and biologically meaningful, ensuring minimal merging or splitting at higher resolutions. Our goal was to find a resolution that captures relevant cell subpopulations while maintaining distinct clusters without excessive fragmentation. We have now stated the resolution for the subclustering of the LECs, LMCs, and fibroblasts. We have also added greater detail regarding the total number of cells, QC analysis, and the marker identification criteria used to the methods sections. We used a resolution of 0.5 for sub-clustering LMCs, 0.87 for LECs, and 1.0 for fibroblasts. These details are now added to the manuscript.

    1. eLife Assessment

      This important work advances our understanding of the impact of malnutrition on hematopoiesis and subsequently infection susceptibility. Support for the overall claims is convincing in some respects and incomplete in terms of identifying mechanism as highlighted by reviewers. This work will be of general interest to those in the fields of hematopoiesis, malnutrition, and dietary influence on immunity.

    2. Reviewer #2 (Public review):

      Summary:

      Sukhina et al. use a chronic murine dietary restriction model to investigate the cellular mechanisms underlying nutritionally acquired immunodeficiency as well as the consequences of a refeeding intervention. The authors report a substantial impact of undernutrition on the myeloid compartment, which is not rescued by refeeding despite rescue of other phenotypes including lymphocyte levels, and which is associated with maintained partial susceptibility to bacterial infection.

      Strengths:

      Overall, this is a nicely executed study with an appropriate number of mice, robust phenotypes, and interesting conclusions, and the text is very well written. The authors' conclusions are generally well-supported by their data.

      Weaknesses:

      There is little evaluation of known critical drivers of myelopoiesis (e.g. PMID 20535209, 26072330, 29218601) over the course of the 40% diet, which would be of interest with regard to comparing this chronic model to other more short-term models of undernutrition.

      Further, the microbiota, well-established to be regulated by undernutrition (e.g. PMID 22674549, 27339978, etc.), and also well-established to be a critical regulator of hematopoiesis/myelopoiesis (e.g. PMID 27879260, 27799160, etc.), should be studied in any future explorations using this model.

      The authors have recognized these limitations to the study in their discussion.

    3. Reviewer #3 (Public review):

      This communication from Sukhina et al argues that a period of malnutrition (modeled by caloric restriction) causes lasting immune deficiencies (myelopoiesis) not rescued by re-feeding. This is a potentially important paper exploring the effects of malnutrition on immunity, which is a clinically important topic. The revised study adds some details with respect to the kinetics of immune compartment and body weight changes, but most aspects raised by the referees were deferred experimentally. Several textual changes have been made to avoid over-interpreting their data. My overall assessment of this revised study is similar to my impression before, which is that while the observations are interesting, there is both a lack of mechanistic understanding of the phenomena and a lack of resolution/detail about the phenomena themselves.

    4. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This important work advances our understanding of the impact of malnutrition on hematopoiesis and subsequently infection susceptibility. Support for the overall claims is convincing in some respects and incomplete in others as highlighted by reviewers. This work will be of general interest to those in the fields of hematopoiesis, malnutrition, and dietary influence on immunity.

      We would like to thank the editors for agreeing to review our work at eLife. We greatly appreciate them assessing this study as important and of general interest to multiple fields, as well as the opportunity to respond to reviewer comments. Please find our responses to each reviewer below.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this study, the authors used a chronic murine dietary restriction model to study the effects of chronic malnutrition on controls of bacterial infection and overall immunity, including cellularity and functions of different immune cell types. They further attempted to determine whether refeeding can revert the infection susceptibility and immunodeficiency. Although refeeding here improves anthropometric deficits, the authors of this study show that this is insufficient to recover the impairments across the immune cell compartments.

      Strengths:

      The manuscript is well-written and conceived around a valid scientific question. The data supports the idea that malnutrition contributes to infection susceptibility and causes some immunological changes. The malnourished mouse model also displayed growth and development delays. The work's significance is well justified. Immunological studies in the malnourished cohort (human and mice) are scarce, so this could add valuable information.

      Weaknesses:

      The assays on myeloid cells are limited, and the study is descriptive and overstated. The authors claim that "this work identifies a novel cellular link between prior nutritional state and immunocompetency, highlighting dysregulated myelopoiesis as a major." However, after reviewing the entire manuscript, I found no cellular mechanism defining the link between nutritional state and immunocompetency.

      We thank the reviewer for deeming our work significant and noting the importance of the study. We appreciate the referee’s point regarding the lack of specific cellular functional data for innate immune cells and have modified the conclusions stated in text to more accurately reflect the results presented.

      Reviewer #2 (Public review):

      Summary:

      Sukhina et al. use a chronic murine dietary restriction model to investigate the cellular mechanisms underlying nutritionally acquired immunodeficiency as well as the consequences of a refeeding intervention. The authors report a substantial impact of undernutrition on the myeloid compartment, which is not rescued by refeeding despite rescue of other phenotypes including lymphocyte levels, and which is associated with maintained partial susceptibility to bacterial infection.

      Strengths:

      Overall, this is a nicely executed study with appropriate numbers of mice, robust phenotypes, and interesting conclusions, and the text is very well-written. The authors' conclusions are generally well-supported by their data.

      Weaknesses:

      There is little evaluation of known critical drivers of myelopoiesis (e.g. PMID 20535209, 26072330, 29218601) over the course of the 40% diet, which would be of interest with regard to comparing this chronic model to other more short-term models of undernutrition.

      Further, the microbiota, which is well-established to be regulated by undernutrition (e.g. PMID 22674549, 27339978, etc.), and also well-established to be a critical regulator of hematopoiesis/myelopoiesis (e.g. PMID 27879260, 27799160, etc.), is completely ignored here.

      We thank the reviewer for agreeing that the data presented support the stated conclusions and noting the experimental rigor.  The referee highlights two important areas for future mechanistic investigation that we agree are of great importance and relevant to the submitted study. We have included further discussion of the potential role cytokines and the microbiota might play in our model.

      Reviewer #3 (Public review):

      Summary:

      Sukhina et al are trying to understand the impacts of malnutrition on immunity. They model malnutrition with a diet switch from ad libitum to 40% caloric restriction (CR) in post-weaned mice. They test impacts on immune function with listeriosis. They then test whether re-feeding corrects these defects and find aspects of emergency myelopoiesis that remain defective after a precedent period of 40% CR. Overall, this is a very interesting observational study on the impacts of sudden prolonged exposure to less caloric intake.

      Strengths:

      The study is rigorously done. The observation of lasting defects after a bout of 40% CR is quite interesting. Overall, I think the topic and findings are of interest.

      Weaknesses:

      While the observations are interesting, in this reviewer's opinion, there is both a lack of mechanistic understanding of the phenomena and also some lack of resolution/detail about the phenomena themselves. Addressing the following major issues would be helpful towards aspects of both:

      (1) Is it calories, per se, or macro/micronutrients that drive these phenotypes observed with 40% CR? At the least, I would want to see isocaloric diets (primarily protein, fat, or carbs) and then some of the same readouts after 40% CR, i.e., does low energy with relatively more protein, for example, prevent immunosuppression (as is commonly suggested)? Micronutrients would be harder to test experimentally and may be out of the scope of this study. However, it is worth noting that many of the malnutrition-associated diseases are micronutrient deficiencies.

      (2) Is immunosuppression a function of a certain weight loss threshold? Or something else? Some idea of either the tempo of immunosuppression (happens at 1, in which weight loss is detected; vs 2-3, when body length and condition appear to diverge; or 5 weeks), or grade of CR (40% vs 60% vs 80%) would be helpful since the mechanism of immunosuppression overall is unclear (but nailing it may be beyond the scope of this communication).

      (3) Does an obese mouse that gets 40% CR also become immunodeficient? As it stands, this ad libitum --> 40% CR model perhaps best models problems in the industrial world (as opposed to always being 40% CR from weaning, as might be more common in the developing world), and so modeling an obese person losing a lot of weight from CR (like would be achieved with GLP-1 drugs now) would be valuable to understanding generalizability.

      (4) Generalizing this phenomenon as "bacterial" with listeriosis, which is more like a virus in many ways (intracellular phase, requires type I IFN, etc.) and cannot be given by the natural route of infection in mice, may not be most accurate. I would want to see an experiment with E. coli, or some other bacteria, to test the statement of generalizability (i.e., is it bacteria, or type I IFN-pathway dominant infections, like viruses). If this is unique to listeriosis, it doesn't undermine the story as it is at all, but it would just require some word-smithing.

      (5) Previous reports (which the authors cite) implicate Leptin, the levels of which scale with fat mass, as "permissive" of a larger immune compartment (immune compartment as "luxury function" idea). Is their phenotype also leptin-mediated (ie leptin AAV)?

      (6) The inability of re-feeding to "rescue" the myeloid compartment is really interesting. Can the authors do a bone marrow transplantation (CR-->ad libitum) to test if this effect is intrinsic to the CR-experienced bone marrow?

      (7) Is the defect in emergency myelopoiesis a defect in G-CSF? Ie if the authors injected G-CSF in CR animals, do they equivalently mobilize neutrophils? Does G-CSF supplementation (as one does in humans) rescue host defense against Listeria in the CR or re-feeding paradigms?

      We thank the reviewer for considering our work of interest and noting the rigor with which it was conducted. The referee raises several excellent mechanistic hypotheses and follow-up studies to perform. We agree that defining the specific dietary deficiency driving the phenotypes is of great interest. The relative contribution of calories versus macro- and micronutrients is an area we are interested in exploring in future studies, especially given the literature on the role of micronutrients in malnutrition-driven wasting as the referee notes. We also agree that it will be key to determine whether non-hematopoietic cells contribute, and that the role of soluble factors such as G-CSF and Leptin in mediating the immunodeficiency warrants further study. Likewise, it will be important to evaluate how malnutrition impacts other models of infection to determine how generalizable these phenomena are. We have added these points to the discussion section as limitations of this study.

      Regarding how the phenotypes correspond to the timing of the immunosuppression relative to weight loss, we have performed new kinetics studies to provide some insight into this area. We now find that neutropenia in peripheral blood can be detected after as little as one week of dietary restriction, with neutropenia continuing to worsen after prolonged restriction. These findings indicate that the impact on myeloid cell production is indeed rapid and precedes maximum weight loss, though the severity of these phenotypes does increase as malnutrition persists. We wholeheartedly agree with the reviewer that it will be interesting to explore whether starting weight impacts these phenotypes and whether similar findings can be made in obese animals as they are treated for weight loss.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In this study, the authors used a chronic murine dietary restriction model to study the effects of chronic malnutrition on controls of bacterial infection and overall immunity, including cellularity and functions of different immune cell types. They further attempted to determine whether refeeding can revert the infection susceptibility and immunodeficiency. Although refeeding here improves anthropometric deficits, the authors of this study show that this is insufficient to recover the impairments across the immune cell compartments. The authors claim that "this work identifies a novel cellular link between prior nutritional state and immunocompetency, highlighting dysregulated myelopoiesis as a major." However, after reviewing the entire manuscript, I could not find any cellular mechanism defining the link between nutritional state and immunocompetency. The assays on myeloid cells are limited, and the study is descriptive and overstated.

      Major concerns:

      (1) Malnutrition has entirely different effects on adults and children. In this study, 6-8-week-old C57BL/6 mice were used that mimic adult malnutrition. I do not understand then why the refeeding strategy for inpatient treatment of severely malnourished children was utilized here.

      (2) Figure 1g shows BM cellularity is reduced, but the authors claim otherwise in the text.

      (3) What is the basis of the body condition score in Figure 1d? It will be good to have it in the supplement.

      (4) Listeria monocytogenes cause systemic infection, so bioload was not determined in tissues beyond the liver.

      (5) Figure 3; T cell functional assays were limited to CD8 T cells and lymphocytes isolated from the spleen.

      (6) Why was peripheral cell count not considered? Discrepancies exist with the absolute cell number and relative abundance data, except for the neutrophil and monocyte data, which makes the data difficult to interpret. For example, for B cells, CD4 and CD8 cells.

      (7) Also, if mice exhibit thymic atrophy, why does % abundance data show otherwise? Overall, the data is confusing to interpret.

      (8) No functional tests for neutrophil or monocyte function exist to explain the higher bacterial burden in the liver or to connect the numbers with the overall pathogen load

      The rationale for examining both innate and adaptive immunity is not clear; it is even more unclear since the exact timelines for examining both innate and adaptive immunity (D0 and D5) were used.

      (9) Figure 2e doesn't make sense - why is spleen cellularity measured when bacterial load is measured in the liver?

      (10) Although it is claimed that emergency myelopoiesis is affected, no specific marker for emergency myelopoiesis other than cell numbers was studied.

      (11) I suggest including neutrophil effector functions and looking for real markers of granulopoiesis, such as Cebp-b. Since the authors attempted to examine the entirety of immune responses, it is better to measure cell abundance, types, and functions beyond the spleen. Consider the systemic spread of m while measuring bioload.

      (12) Minor grammatical errors - please re-read the entire text and correct grammatical errors to improve the flow of the text.

      (13) Sample size details missing

      (14) Be clear on which markers were used to identify monocytes. Using just CD11b and Ly6G is insufficient for neutrophil quantification.

      (15) Also, instead of saying "undernourished patients," say "patients with undernutrition" - change throughout the text. I would recommend numbering citations (as is done for Nature citations) to ease in following the text, as there are areas when there are more than ten citations with author names.

      (16) No line numbers are provided

      (17) Abstract

      -  What does accelerated contraction mean?

      -  "In" is repeated in a sentence

      -  Be clear that the study is done in a mouse model - saying just "animals" is not sufficient

      -  Indicate how malnutrition is induced in these mice

      (18) Introduction

      -  "restriction," "immune organs," - what is this referring to?

      -  You mention lymphoid tissue and innate and adaptive immunity, which doesn't make sense. Please correct this.

      -  You mention a lot of lymphoid tissues, i.e. lymphoid mass gain, but how about the bone marrow and spleen, which are responsible for most innate immune compartments?

      (19) Results

      a) Figure 1

      -  Why 40% reduced diet?

      -  It would be interesting to report if the organs are smaller relative to body weight. It makes sense that the organ weight is lower in the 40RD mice, especially since they are smaller, so the novelty of this data is not apparent (Figure 1f).

      -  You say, "We observed a corresponding reduction in the cellularity of the spleen and thymus, while the cellularity of the bone marrow was unaffected (Fig. 1g)." however, your BM data is significant, so this statement doesn't reflect the data you present, please correct.

      b) Figure 2

      - Figure 2d - what tissue is this from, mentioned in the figure? And measure cellularity there. The rationale for why you look only at the spleen here is weak. Also, we would benefit from including the groups without infection here for comparison purposes.

      c) Figure 3

      - The rationale for why you further looked at T cells is weak, mainly because of the following sentence. "Despite this overall loss in lymphocyte number, the relative frequency of each population was either unchanged or elevated, indicating that while malnutrition leads to a global reduction in immune cell numbers, lymphocytes are less impacted than other immune cell populations (Supplemental 1)." Please explain in the main text.

      d) Figure 4

      -  You say the peak of the adaptive immune response, but you never looked at the peak of adaptive immune - when is this? If you have the data, please show it. You also only show d0 and d5 post-infection data for adaptive immunity, so I am unsure where this statement comes from.

      -  How did you identify neutrophils and monocytes through flow cytometry? Indicate the markers used. Also, your text does not match your data; please correct it. i.e. monocyte numbers reduced, and relative abundance increased, but your text doesn't say this.

      -  Show the flow graph first then, followed by the quantification.

      -  The study would benefit from examining markers of emergency myelopoiesis such as Cebpb through qPCR.

      -  Although the number of neutrophils is lower in the BM and spleen, how does this relate to increased bacterial load in the liver? This is especially true since you did not quantify neutrophil numbers in the liver.

      e) Figure 6

      -  Some figures are incorrectly labelled.

      -  For the refeeding data, also include the data from the 40RD group to compare the level of recovery in the outcome measures.

      (20) Discussion

      -  You claim that monocytes are reduced to the same extent as neutrophils, but this is not true. Please correct.

      -  Indicate some limitations of your work.

      We thank the reviewer for offering these recommendations and the constructive comments. 

      Several comments raised concerns over the rationale or reasoning behind aspects of the experimental design or the data presented, which we would like to clarify:

      • Regarding the refeeding protocol, we apologize for the confusion regarding the rationale. We based our methodology on the general guidelines for refeeding protocols for malnourished people. We elected to increase food intake 10% daily to avoid risk of refeeding syndrome or other complications. Our method by no means replicates the administration of specific vitamins, minerals, electrolytes, or precise caloric content as would be given to a human patient. The citation provided offers information from the WHO regarding the complications that can arise during refeeding syndrome; while it is from a document on pediatric care, we did not mean to imply that our method modeled refeeding intervention for children. We have modified the text to avoid this confusion.

      • The reviewer requested more clarity on why we studied both the innate and adaptive immune system as well as why we chose the time points studied. As referenced in the manuscript, prior work has observed that caloric restriction, fasting, and malnutrition all can impact the adaptive immune system. Given these previous findings, we felt it important to evaluate how malnutrition affected adaptive immune cell populations in our model. To this end, we provide data tracking the course of T-cell responses from the start of infection through day 14 at the time that the response undergoes contraction. However, since we find that bacterial burden is not properly controlled at earlier time points (day 5), when it is understood the innate immune system is more critical for mediating pathogen clearance, we elected to better characterize the effect malnutrition had on innate immune populations, something less well described in the literature. As phenotypes both in bacterial burden and within innate immune populations were observable as early as day 5, we chose to focus on that time point rather than later time points when readouts could be further confounded by secondary or compounding effects by the lack of early control of infection. We have tried to make this rationale clear in the text and have made changes to further emphasize this reasoning.

      • The reviewer also requested an explanation of why bacterial burden was measured in the liver and the immune response was measured in the spleen. While the reviewer is correct that our model is a systemic infection, it is well appreciated that bacteria rapidly disseminate to the liver and spleen and these organs serve as major sites of infection. Given the central role the spleen plays in organizing both the innate and adaptive immune response in this model, it is common practice in the field to phenotype immune cell populations in the spleen, while using the liver to quantify bacterial burden (see PMID: 37773751 as one example of many). We acknowledge this does not provide the full scope of bacterial infection or the immune response in every potentially affected tissue, but nonetheless believe the interpretation that malnourished and previously malnourished animals do not properly control infection and their immune responses are blunted compared to controls still stands.

      The reviewer raised several points about differences in the results for cell frequency and absolute number and why these may deviate in some circumstances. For example, the reviewer notes that we observe thymic atrophy yet the frequency of peripheral T-cells does not decline. It should be noted that absolute number can change when frequency does not and vice versa, due to changes in other cell types within the studied population of cells. As in the case of peripheral lymphocytes in our study, the frequency can stay the same or even increase when the absolute number declines (Supplemental 1). This can occur if other populations of cells decrease further, which is indeed the case as the loss of myeloid cells is greater than that of lymphocytes. Hence, we find that the frequency of T and B cells is unchanged or elevated, despite the loss in absolute number of peripheral cells, which is our stated interpretation. We believe this is consistent with our overall observations and is why it is important to report both frequency and absolute number, as we have done.
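
      The arithmetic behind this point can be sketched with a short example. The cell counts below are hypothetical values chosen only to illustrate the calculation; they are not measurements from the study, and the two-population split is a simplifying assumption.

```python
# Hypothetical cell counts (in millions), chosen only to illustrate the
# arithmetic; they are NOT measurements from the study.
control = {"lymphocytes": 60.0, "myeloid": 40.0}
restricted = {"lymphocytes": 30.0, "myeloid": 10.0}  # both fall; myeloid falls more


def frequency(counts, population):
    """Return one population's share of all counted cells, as a percentage."""
    return 100.0 * counts[population] / sum(counts.values())


for label, counts in (("control", control), ("restricted", restricted)):
    # Report the absolute lymphocyte number next to its relative frequency.
    print(label, counts["lymphocytes"], frequency(counts, "lymphocytes"))
```

      In this made-up case the absolute lymphocyte number is halved while its frequency rises, because the other population shrinks proportionally more.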

      We have made the requested changes to the text to address the reviewer's concerns as noted to improve clarity and accuracy for the description of experiments, results, and overall conclusions drawn in the manuscript. We have also included a discussion of the limitations of our work as well as additional areas for future investigation that remain open.

      Reviewer #2 (Recommendations for the authors):

      Regarding the known drivers of myelopoiesis, can the authors quantify circulating levels of relevant immune cytokines (e.g. type I and II IFNs, GM-CSF, etc.)?

      Regarding the microbiota (point #2), how dramatically does this undernutrition modulate the microbiota both in terms of absolute load and community composition, and how effectively/quickly is this rescued by refeeding?

      We thank the reviewer for raising these recommendations. We agree that the role of circulating factors like cytokines and growth factors in contributing to the defects in myelopoiesis is of interest and is the focus of future work. Similarly, the impact of malnutrition on the microbiota is of great interest and has been evaluated by other groups in separate studies. How the known impact of malnutrition on the microbiota affects the phenotypes we observe in myelopoiesis is unclear and warrants future investigation. We have added these points to the discussion section as limitations of this study.

    1. Author Response:

      In the Weaknesses, Reviewer 3 suggests that in the Discussion, we comment upon whether WRN ATPase/3’-5’ helicase and WRNIP1 ATPase work on Y-family Pols additively or synergistically to raise fidelity. However, in the Discussion on page 20, we do comment on the role of WRN and WRNIP1 ATPase activities in conferring an additive increase in the fidelity of TLS by Y-family Pols.

    2. eLife Assessment

      This manuscript reports an important finding for understanding the molecular mechanisms of mutagenesis, carcinogenesis, and senescence. It follows a previous report showing that the Werner syndrome protein WRN and its interacting protein WRNIP1 are indispensable for translesion DNA synthesis (TLS) by Y-family DNA polymerases (Pols). The manuscript provides convincing evidence that WRN and WRNIP1 ATPases, in addition to the previously reported role of the WRN 3'>5' exonuclease activity, are essential for promoting the fidelity of replication through DNA lesions by Y-family Pols in human cells.

    3. Reviewer #1 (Public review):

      Summary:

      Y-family polymerases, such as polymerases eta, iota, and kappa, have low fidelity relative to other polymerases involved in DNA replication and repair. This is believed to be due to their active sites being less constrained than those of other polymerases. Paradoxically, work by this lab and others shows that in vivo, these Y-family polymerases are more error-free (less error-prone) during DNA damage bypass than would be expected given their low fidelity. For this reason, the authors have been focusing on other cellular factors that may increase the fidelity of Y-family polymerases. The current paper focuses on two such factors: WRN, which possesses exonuclease and helicase activities, and WRNIP1, which possesses a DNA-dependent ATPase.

      Previously, this group showed that defects in the exonuclease function of WRN lead to a loss in the fidelity of polymerases eta and iota during DNA damage bypass, presumably by removing nucleotide misinsertions. The current paper extends this work by considering the ATPase activities of WRN and WRNIP1. The authors looked at the impact of various amino acid substitutions in these proteins on the fidelity of DNA damage bypass by Y-family polymerases. They did this by both measuring the mutation frequencies of these cell lines as well as the mutation spectra observed in them. They showed that the ATPase activities of both WRN and WRNIP1, as well as the exonuclease activities of WRN, are necessary for the high fidelity of Y-family polymerases in cells. They specifically examined the bypass of cyclobutane pyrimidine dimers by polymerase eta, the bypass of 6-4 photoproducts by polymerases eta and iota, and the bypass of ethenoadenine by polymerase iota. Moreover, they showed that WRNIP1 ATPase defects impair the WRN exonuclease from removing misinsertions by polymerase iota at thymine glycol lesions. These defects generally do not affect the efficiency of the bypass, only its fidelity.

      Strengths:

      The manuscript by Yoon et al is the latest in a series of important and impactful papers by this research group examining the cellular factors that enhance the fidelity of translesion synthesis by Y-family polymerases in human cell lines. Overall, the study is well designed, the data are clearly presented, and the conclusions are well supported and convincing. The authors also discuss a reasonable possibility that complex formation between the WRN and WRNIP1 proteins and Y-family polymerases could tighten the active sites of these polymerases to improve fidelity. Further studies are required to demonstrate this model, but it is a very exciting model that is well supported by the current data.

      Weaknesses:

      No weaknesses were identified by this reviewer.

    4. Reviewer #2 (Public review):

      The authors of the present study are responsible for a previous study, which also showed that in response to DNA damage, Werner syndrome protein WRN, WRN interacting protein WRNIP1, and Rev1 assemble together with Y-family Pols (Polη, Polι, or Polκ), and that they are indispensable for Trans-Lesion-Synthesis (TLS) (Genes Dev 2024). They also identified a role of WRN's 3'→5' exonuclease activity in the high in vivo fidelity of TLS by Y-family, through UV-induced CPDs by Polη, through N6 ethenodeoxyadenosine (εdA) by Polι, through thymine glycol by Polκ, and through UV-induced (6-4) photoproducts by Polη and Polι. Thus, by removing nucleotides misinserted opposite DNA lesions by the Y-family Pols, WRN's 3'→5' exonuclease activity improves the fidelity of TLS by these Pols. The present work, which follows up on this previous work, reports the crucial role also of the ATPase activities of WRN and WRNIP1 in raising the fidelity of TLS by Y family Pols, in addition to the exonuclease activity, with an entirely different mechanism, which normally consists in unwinding of DNA containing secondary structures.

      By using adequate cell line models and methodologies, notably DNA fiber, TLS, and mutation analyses assays, as well as specific ATPase point mutations, they found that progression of the replication forks through UV lesions was not affected in cells lacking the WRN exonuclease activity as well as the WRN and WRNIP1 ATPase activities, but occurs with a vast increase in error-prone TLS, notably through CPDs by Polη, with differential impacts on the nature of mutations between WRN ATPase and WRNIP1 ATPase. The relative contributions of these activities (exonuclease and ATPase) to the fidelity of TLS Pols, however, vary, depending upon the DNA lesion and the TLS Pol involved. Additionally, defects in these ATPase activities cause mutational hot spot formation in different sequence contexts. The authors provide evidence that the combined action of WRN and WRNIP1 ATPases, along with WRN 3' to 5' exonuclease, confers an enormous rise in the fidelity of TLS by Y-family Pols. They identify the means by which these otherwise highly error-prone TLS Pols have been adapted to function in an error-free manner. They suggest that WRNIP1 ATPases prevent misincorporations while WRN exonuclease removes misinserted nucleotides. This combination confers a vast increase in the fidelity of Y-family Pols, essential for genome stability.

      Overall, this is a comprehensive and thoughtful manuscript, and all the findings reported are convincing and well supported. The data cannot be considered as entirely novel, as they follow up on the recent 2024 publication by the same authors who unveiled that the exonuclease activity of WRN and WRNIP1 confers accuracy of TLS. The experimental methods are multiple and rigorous.

    5. Reviewer #3 (Public review):

      Summary:

      Replication through DNA lesions such as UV-induced pyrimidine dimers is mainly performed by Y-family pols. These translesion synthesis (TLS) pols are intrinsically error-prone. However, in living cells, TLS must be conducted in an error-free manner. This manuscript demonstrated that WRN and WRNIP1 ATPases play an important role in addition to WRN 3'>5' exonuclease in human cells.

      Strengths:

      The authors made use of WT human fibroblasts and a WRN-deficient cell line for TLS assays in human cells and siRNA knock-down experiments to analyze TLS efficiency. For the cII mutation assay, Big Blue mouse embryonic fibroblasts were used. These materials, as well as other Materials and Methods, had already been well established by this group or other groups. The authors used Pol eta, iota, kappa, and theta as TLS pols, and used UV-induced CPD, (6-4)PP, epsilon dA, and thymine glycol as DNA lesions. Thus, the authors examined the generality of their results in terms of TLS pols and DNA lesions.

      Weaknesses:

      Although the main part of this manuscript is the impact of the deficiencies of WRN and WRNIP1 ATPases on TLS by Y-family DNA polymerases, especially on TLS efficiency and mutation spectrum, many readers would be interested in how these ATPases could change the molecular structure of Pol eta, because its structure has been studied for some time.

    1. Author Response:

      We thank the reviewers for their thoughtful feedback and appreciate their recognition of the value of our findings. In response, we are refining the manuscript to clarify key terminology, more clearly describe our image analysis workflows, and temper the interpretation of our results where appropriate. We are planning to perform additional experiments to further investigate the specificity of mRNA co-localization between BK and CaV1.3 channels. We acknowledge the importance of understanding ensemble trafficking dynamics and the functional role of pre-assembly at the plasma membrane, and we plan to explore these questions in future work. We look forward to submitting a revised manuscript that addresses the reviewers’ comments in detail.

    2. eLife Assessment

      This valuable manuscript provides convincing evidence that BK and CaV1.3 channels can co-localize as ensembles early in the biosynthetic pathway, including in the ER and Golgi. The findings, supported by a range of imaging and proximity assays, offer insights into channel organization in both heterologous and endogenous systems. However, mechanistic questions remain unresolved, particularly regarding the specificity of mRNA co-localization, the dynamics of ensemble trafficking, and the functional significance of pre-assembly at the plasma membrane. While the data broadly support the central claims, certain conclusions would benefit from more restrained interpretation and additional clarification to enhance the manuscript's impact and rigor.

    3. Joint Public Review:

      This study presents a valuable contribution to our understanding of ion channel complex assembly by investigating whether BK and CaV1.3 channels begin to form functional associations early in the biosynthetic pathway, prior to reaching the plasma membrane. Using a combination of proximity ligation assays, single-molecule RNA imaging, and super-resolution microscopy, the authors provide convincing evidence that these channels co-localize intracellularly within the ER and Golgi, in both overexpression systems and a relevant endogenous cell model. The study addresses an important and underexplored aspect of membrane protein trafficking and organization, with broader implications for how ion channel signaling complexes are assembled and regulated. The experimental approaches are generally appropriate and the imaging data are clearly presented, with a commendable number of control experiments included. However, several limitations temper the interpretation of the results. The mechanisms underlying mRNA co-localization, and the role of co-translation in complex formation, remain insufficiently defined. Similarly, while intracellular colocalization is convincingly demonstrated, the study does not establish whether such early assembly is the predominant pathway for generating functional complexes at the plasma membrane. More rigorous quantification of channel co-association across compartments, and clarification of key terminology and image analysis methods, would strengthen the overall conclusions. Some of the language in the manuscript would also benefit from a more measured tone to avoid overstating the novelty of the findings. Despite these limitations, the study offers meaningful insights into intracellular ion channel organization and will be of interest to researchers in cell biology, membrane trafficking, and neurophysiology. With focused revisions addressing the outlined points, the manuscript has the potential to make a solid contribution to the field.

    1. eLife Assessment

      This important study explores the role of SIRT2 in regulating Japanese encephalitis virus replication and disease progression in rodent models. The findings presented are novel as sirtuins are known for their roles in aging, metabolism, and cell survival, but have not been studied in the context of viral infections until recently. The evidence supporting the claims is solid, although additional experiments to further characterize the clinical outcomes and directly test the link between acetylated NF-kB and SIRT2 expression would have strengthened the study. The work will be of interest to biologists studying viruses, sirtuins, and inflammation.

    2. Reviewer #1 (Public review):

      Summary:

      Desingu et al. show that JEV infection reduces SIRT2 expression. Upon JEV infection, 10-day-old SIRT2 KO mice showed increased viral titer, more severe clinical outcomes, and reduced survival. Conversely, SIRT2 overexpression reduced viral titer and the severity of clinical outcomes, and improved survival. Transcriptional profiling shows dysregulation of NF-KB and expression of inflammatory cytokines. Pharmacological NF-KB inhibition reduced viral titer. The authors conclude that SIRT2 is a regulator of JEV infection.

      Strengths:

      This paper is novel because sirtuins have been primarily studied for aging, metabolism, stem cells/regeneration. Their role in infection has not been explored until recently. Indeed, Barthez et al. showed that SIRT2 protects aged mice from SARS-CoV-2 infection (Barthez, Cell Reports 2025). Therefore, this is a timely and novel research topic. Mechanistically, the authors showed that SIRT2 suppresses the NF-KB pathway. Interestingly, SIRT2 has also been shown recently to suppress other major inflammatory pathways, such as cGAS-STING (Barthez, Cell Reports 2025) and the NLRP3 inflammasome (He, Cell Metabolism 2020; Luo, Cell Reports 2019). Together, these findings support the emerging concept that SIRT2 is a master regulator of inflammation.

      Weaknesses:

      (1) Figures 2 and 3. Although SIRT2 KO mice showed increased viral titer, more severe clinical outcomes, and reduced survival upon JEV infection, the difference is modest because even WT mice exhibited very severe disease at this viral dose. The authors should perform the experiment using a sub-lethal viral dose for WT mice, to allow the assessment of increased clinical outcomes and reduced survival in KO mice.

      (2) Figure 5K-N, the authors examined the expression of inflammatory cytokines in WT and SIRT2 KO cells upon JEV infection, in line with the dysregulation of NF-kB. It has been shown recently that SIRT2 also regulates the cGAS-STING pathway (Barthez, Cell Reports 2025) and the NLRP3 inflammasome (He, Cell Metabolism 2020; Luo, Cell Reports 2019). Do you also observe increased IFNb, IL1b, and IL18 in SIRT2 KO cells upon JEV infection? This may indicate that SIRT2 regulates systemic inflammatory responses and represents a potent protection upon viral infection. This is particularly important because in Figure 7F, the authors showed that SIRT2 overexpression reduced viral load even when NF-KB is inhibited, suggesting that NF-KB is not the only mediator of SIRT2 to suppress viral infection.

    3. Reviewer #2 (Public review):

      The manuscript by Desingu et al., explores the role of SIRT2 in regulating Japanese Encephalitis Virus (JEV) replication and disease progression in rodent models. Using both an in vitro and an in vivo approach, the authors demonstrate that JEV infection leads to decreased SIRT2 expression, which they hypothesize is exploited by JEV for viral replication. To test this hypothesis, the authors utilize SIRT2 inhibition (via AGK2 or genetic knockout) and demonstrate that it leads to increased viral load and worsens clinical outcomes in JEV-infected mice. Conversely, SIRT2 overexpression via an AAV delivery system reduces viral replication and improves survival among infected mice. The study proposes a mechanism in which SIRT2 suppresses JEV-induced autophagy and inflammation by deacetylating NF-κB, thereby reducing Beclin-1 expression (an NF-κB-dependent gene) and autophagy, which the authors consider a pathway that JEV exploits for replication. Transcriptomic analysis further supports that SIRT2 deficiency leads to NF-κB-driven cytokine hyperactivation. Additionally, pharmacological inhibition of NF-κB using Bay 11 (an IKK inhibitor) results in reduced viral load and improved clinical pathology in WT and SIRT2 KO mice. Overall, the findings from Desingu et al. are generally supported by the data and suggest that targeting SIRT2 may serve as a promising therapeutic approach for JEV infection and potentially other RNA viruses that SIRT2 helps control. However, the paper does fall short in some areas. Please see below for our comments to help improve the paper.

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Desingu et al. show that JEV infection reduces SIRT2 expression. Upon JEV infection, 10-day-old SIRT2 KO mice showed increased viral titer, more severe clinical outcomes, and reduced survival. Conversely, SIRT2 overexpression reduced viral titer and the severity of clinical outcomes, and improved survival. Transcriptional profiling shows dysregulation of NF-KB and expression of inflammatory cytokines. Pharmacological NF-KB inhibition reduced viral titer. The authors conclude that SIRT2 is a regulator of JEV infection.

      This paper is novel because sirtuins have been primarily studied for aging, metabolism, stem cells/regeneration. Their role in infection has not been explored until recently. Indeed, Barthez et al. showed that SIRT2 protects aged mice from SARS-CoV-2 infection (Barthez, Cell Reports 2025). Therefore, this is a timely and novel research topic. Mechanistically, the authors showed that SIRT2 suppresses the NF-KB pathway. Interestingly, SIRT2 has also been shown recently to suppress other major inflammatory pathways, such as cGAS-STING (Barthez, Cell Reports 2025) and the NLRP3 inflammasome (He, Cell Metabolism 2020; Luo, Cell Reports 2019). Together, these findings support the emerging concept that SIRT2 is a master regulator of inflammation.

      Weaknesses:

      (1) Figures 2 and 3. Although SIRT2 KO mice showed increased viral titer, more severe clinical outcomes, and reduced survival upon JEV infection, the difference is modest because even WT mice exhibited very severe disease at this viral dose. The authors should perform the experiment using a sub-lethal viral dose for WT mice, to allow the assessment of increased clinical outcomes and reduced survival in KO mice.

      (2) Figure 5K-N, the authors examined the expression of inflammatory cytokines in WT and SIRT2 KO cells upon JEV infection, in line with the dysregulation of NF-kB. It has been shown recently that SIRT2 also regulates the cGAS-STING pathway (Barthez, Cell Reports 2025) and the NLRP3 inflammasome (He, Cell Metabolism 2020; Luo, Cell Reports 2019). Do you also observe increased IFNb, IL1b, and IL18 in SIRT2 KO cells upon JEV infection? This may indicate that SIRT2 regulates systemic inflammatory responses and represents a potent protection upon viral infection. This is particularly important because in Figure 7F, the authors showed that SIRT2 overexpression reduced viral load even when NF-KB is inhibited, suggesting that NF-KB is not the only mediator of SIRT2 to suppress viral infection.

      We thank the reviewer for the valuable recommendation. We are willing to conduct an experiment using a sub-lethal viral dose in wild-type (WT) mice to assess increased clinical outcomes and reduced survival in knockout (KO) mice, as recommended.

      Furthermore, we acknowledge the reviewer's comments that SIRT2 regulates systemic inflammatory responses and provides potent protection against viral infection. Additionally, NF-κB is not the only mediator of SIRT2's suppression of viral infection; other possible molecular mechanisms are also involved in this process.

      Reviewer #2 (Public review):

      The manuscript by Desingu et al., explores the role of SIRT2 in regulating Japanese Encephalitis Virus (JEV) replication and disease progression in rodent models. Using both an in vitro and an in vivo approach, the authors demonstrate that JEV infection leads to decreased SIRT2 expression, which they hypothesize is exploited by JEV for viral replication. To test this hypothesis, the authors utilize SIRT2 inhibition (via AGK2 or genetic knockout) and demonstrate that it leads to increased viral load and worsens clinical outcomes in JEV-infected mice. Conversely, SIRT2 overexpression via an AAV delivery system reduces viral replication and improves survival among infected mice. The study proposes a mechanism in which SIRT2 suppresses JEV-induced autophagy and inflammation by deacetylating NF-κB, thereby reducing Beclin-1 expression (an NF-κB-dependent gene) and autophagy, which the authors consider a pathway that JEV exploits for replication. Transcriptomic analysis further supports that SIRT2 deficiency leads to NF-κB-driven cytokine hyperactivation. Additionally, pharmacological inhibition of NF-κB using Bay 11 (an IKK inhibitor) results in reduced viral load and improved clinical pathology in WT and SIRT2 KO mice. Overall, the findings from Desingu et al. are generally supported by the data and suggest that targeting SIRT2 may serve as a promising therapeutic approach for JEV infection and potentially other RNA viruses that SIRT2 helps control. However, the paper does fall short in some areas. Please see below for our comments to help improve the paper.

      We thank the reviewer for the valuable recommendation. We are willing to measure NF-kB acetylation in AdSIRT2 JEV-infected cells compared to WT-infected cells, to verify that the acetylation of NF-kB is truly linked to SIRT2 expression levels as per the reviewers' suggestion.

      We are willing to conduct an experiment using a sub-lethal viral dose in wild-type (WT) mice to assess increased clinical outcomes and reduced survival in knockout (KO) mice, as recommended.

      We accept the reviewer's suggestion that AGK2 can also inhibit other Sirtuins. Thus, to test the contribution of other Sirtuins, the experiment could be repeated using wild-type and Sirt2 KO mice. We are willing to conduct the AGK2 experiment using JEV-infected wild-type and Sirt2 knockout mice.

    1. eLife Assessment

      This valuable study tested whether several months of dolutegravir intensification alters the size of the HIV reservoir as well as immune activation in individuals already on suppressive ART. While the general study approach is appropriate and the paper is well written, the evidence supporting the claims of the authors is incomplete. The title of the paper is only partially supported by the data, based on specific issues with the study design and analysis plan highlighted by Reviewer 1. Specifically, the primary study outcomes were not clearly described a priori, the plausibility of a biologic effect is uncertain based on lack of a consistent effect across participants, and sample size is small. Given a possible observed partial effect and relevant hypothesis, this approach warrants study in a larger trial.

    2. Reviewer #1 (Public review):

      Fombellida-Lopez and colleagues describe the results of an ART intensification trial in people with HIV infection (PWH) on suppressive ART to determine the effect of increasing the dose of one ART drug, dolutegravir, on viral reservoirs, immune activation, exhaustion, and circulating inflammatory markers. The authors hypothesize that ART intensification will provide clues about the degree to which low-level viral replication is occurring in circulation and in tissues despite ongoing ART, which could be identified if reservoirs decrease and/or if immune biomarkers change. The trial design is straightforward and well-described, and the intervention appears to have been well tolerated. The investigators observed an increase in dolutegravir concentrations in circulation, and to a lesser degree in tissues, in the intervention group, indicating that the intervention has functioned as expected (ART has been intensified in vivo). Several outcome measures changed during the trial period in the intervention group, leading the investigators to conclude that their results provide strong evidence of ongoing replication on standard ART. The results of this small trial are intriguing, and a few observations in particular are hypothesis-generating and potentially justify further clinical trials to explore them in depth. However, I am concerned about over-interpretation of results that do not fully justify the authors' conclusions.

      (1) Trial objectives: What was the primary objective of the trial? This is not clearly stated. The authors describe changes in some reservoir parameters and no changes in others. Which of these was the primary outcome? No a priori hypothesis / primary objective is stated, nor is there explicit justification (power calculations, prior in vivo evidence) for the small n, unblinded design, and lack of placebo control. In the abstract (line 36, "significant decreases in total HIV DNA") and conclusion (lines 244-246), the authors state that total proviral DNA decreased as a result of ART intensification. However, in Figures 2A and 2E (and in line 251), the authors indicate that total proviral DNA did not change. These statements are confusing and appear to be contradictory. Regarding the decrease in total proviral DNA, I believe the authors may mean that they observed a transient decrease in total proviral DNA during the intensification period (day 28 in particular, Figure 2A); however, this level increases at Day 56 and then returns to baseline at Day 84, which is the source of the negative observation. Stating that total proviral DNA decreased as a result of the intervention when it ultimately did not is misleading, unless the investigators intended the day 28 timepoint as a primary endpoint for reservoir reduction - if so, this is never stated, and it is unclear why the intervention would then be continued until day 84. If, instead, reservoir reduction at the end of the intervention was the primary endpoint (again, unstated by the authors), then it is not appropriate to state that the total proviral reservoir decreased significantly when it did not.

      (2) Intervention safety and tolerability: The results section lacks a specific heading for participant safety and tolerability of the intervention. I was wondering about clinically detectable viremia in the study. Were there any viral blips? Was the increased DTG well tolerated? This drug is known to cause myositis, headache, CPK elevation, and hepatotoxicity. Were any of these observed? What is the authors' interpretation of the CD4:CD8 ratio change (line 198)? Is this a significant safety concern for a longer duration of intensification? Was there also a change in CD4% or only in absolute counts? Was there relative CD4 depletion observed in the rectal biopsy samples between days 0 and 84? Interestingly, T cells dropped at the same timepoints that reservoirs declined... how do the authors rule out that reservoir decline reflects transient T cell decline that is non-specific (not due to additional blockade of replication)?

      (3) The investigators describe a decrease in intact proviral DNA after 84 days of ART intensification in circulating cells (Figure 2D), but no changes to total proviral DNA in blood or tissue (Figures 2A and 2E; IPDA does not appear to have been done on tissue samples). It is not clear why ART intensification would result in a selective decrease in intact proviruses and not in total proviruses if the source of these reservoir cells is due to ongoing replication. These reservoir results have multiple interpretations, including (but not limited to) the investigators' contention that this provides strong evidence of ongoing replication. However, ongoing replication results in the production of both intact and mutated/defective proviruses that both contribute to reservoir size (with defective proviruses vastly outnumbering intact proviruses). The small sample size and well-described heterogeneity of the HIV reservoir (with regard to overall size and composition) raise the possibility that the study was underpowered to detect differences over the 84-day intervention period. No power calculations or prior studies were described to justify the trial size or the duration of the intervention. Readers would benefit from a more nuanced discussion of reservoir changes observed here.

      (4) While a few statistically significant changes occurred in immune activation markers, it is not clear that these are biologically significant. Lines 175-186 and Figure 3: The proportion of CD4 cells positive for TIGIT looks as though it declined by only 1-2%, and at day 84 the confidence interval appears to widen significantly, spanning an interquartile range of 4%. The only other immune activation/exhaustion marker change that reached statistical significance appears to be in CD8 cells positive for CD38 and HLA-DR; however, the decline appears to be a fraction of a percent, with the control group trending in the same direction. Despite marginal statistical significance, it is not clear there is any biological significance to these findings; Figure S6 supports the contention that there is no significant change in these parameters over time or between groups. With most markers showing no change and these two showing very small changes (and the latter moving in the same direction as the control group), these results do not justify the statement that intensifying DTG decreases immune activation and exhaustion (lines 38-40 in the abstract and elsewhere).

      (5) There are several limitations of the study design that deserve consideration beyond those discussed at line 327. The study was open-label and not placebo-controlled, which may have led to some medication adherence changes that confound results (authors describe one observation that may be evidence of this; lines 146-148). A randomized, blinded, or cross-over design would be more robust and help determine signal from noise, given relatively small changes observed in the intervention arm. There does not seem to be a measurement of key outcome variables after treatment intensification ceased - evidence of an effect on replication through ART intensification would be enhanced by observing changes once intensification was stopped. Why was intensification maintained for 84 days? More information about the study duration would be helpful. Table 1 indicates that participants were 95% male. Sex is known to be a biological variable, particularly with regard to HIV reservoir size and chronic immune activation in PWH. Worldwide, 50% of PWH are women. Research into improving management/understanding of disease should reflect this, and equal participation should be sought in trials. Table 1 shows differing baseline reservoir sizes between the control and intervention groups. This may have important implications, particularly for outcomes where reservoir size is used as the denominator.

      (6) Figure 1: the increase in DTG levels is interesting - it is not uniform across participants. Several participants had lower levels of DTG at the end of the intervention. Though unlikely to be statistically significant, it would be interesting to evaluate if there is a correlation between change in DTG concentrations and virologic / reservoir / inflammatory parameters. A positive relationship between increasing DTG concentration and decreased cell-associated RNA, for example, would help support the hypothesis that ongoing replication is occurring.

      (7) Figure 2: IPDA in tissue - was this done? scRNA in blood (single copy assay) - would this be expected to correlate with usCaRNA? The most unambiguous result is the decrease in cell-associated RNA - accompanying results using single-copy assay in plasma would be helpful to bolster this result. The use of the US RNA / Total DNA ratio is not helpful and is difficult to interpret, since the control and intervention arms were unmatched for total DNA reservoir size at study entry.

    3. Reviewer #2 (Public review):

      Summary:

      An intensification study with a double dose of a second-generation integrase inhibitor, on a background of nucleoside analog inhibitors of the HIV reverse transcriptase, in 20 individuals randomized with controls, with an impact on the levels of viral reservoirs and inflammation markers. Viral reservoirs in HIV are the main impediment to an HIV cure, and inflammation is associated with the development of co-morbidities.

      Strengths:

      The intervention that leads to a decrease of viral reservoirs and inflammation is quite straightforward, as a doubling of the INSTI is used in some individuals with INSTI resistance, with good tolerability.

      This is a very well documented study, both in blood and tissues, which is a great achievement due to the difficulty of body sampling in well-controlled individuals on antiretroviral therapy. The laboratory assays are performed by specialists in the field with state-of-the-art quantification assays. Both the introduction and the discussion are remarkably well presented and documented.

      The findings also have a potential impact on the management of chronic HIV infection.

      Weaknesses:

      I do not think that the size of the study can be considered a weakness, nor the fact that it is open-label.

    4. Reviewer #3 (Public review):

      The introduction does a very good job of discussing the issue around whether there is ongoing replication in people with HIV on antiretroviral therapy. Sporadic, non-sustained replication likely occurs in many PWH on ART, related to adherence, drug-drug interactions, and possibly penetration of antivirals into sanctuary areas of replication. As the authors point out, proving it does not occur is likely not possible, and proving it does occur is likely very dependent on the population studied and the design of the intervention. Whether the consequences of this replication in the absence of evolution toward resistance have clinical significance is a challenging question to address.

      It is important to note that INSTI-based therapy may have a different impact on HIV replication events that result in differences in virus release for specific cell types (those responsible for "second phase" decay) by blocking integration in cells that have completed reverse transcription prior to ART initiation but have yet to be fully activated. In a PI or NNRTI-based regimen, those cells will release virus, whereas with an INSTI-based regimen, they will not.

      Given the very small sample size, there is a substantial risk of imbalance between the groups in important baseline measures. Unfortunately, with the small sample size, a non-significant P value is not helpful when comparing baseline measures between groups. One suggestion would be to provide the full range as opposed to the inter-quartile range (essentially only 5 or 6 values). The authors could also report the proportion of participants with baseline HIV RNA target not detected in the two groups.

      A suggestion that there is a critical imbalance between groups is that the control group has significantly lower total HIV DNA in PBMC, despite the small sample size. The control group also has numerically longer time of continuous suppression, lower unspliced RNA, and lower intact proviral DNA. These differences may have biased the ability to see changes in DNA and US RNA in the control group. Notably, there was no significant difference in the change in US RNA/DNA between groups (Figure 2C). The fact that the median relative change appears very similar in Figure 2C, yet there is a substantial difference in P values, is also a comment on the limits of the current sample size. The text should report the median change in US RNA and US RNA/DNA when describing Figures 2A-2C. The statistical comparison of changes in IPDA results between groups should also be reported. The presentation of the absolute values of all the comparisons in the supplemental figures is a strength of the manuscript.

      In the assessment of ART intensification on immune activation and exhaustion, the fact that none of the comparisons between randomized groups were significant should be noted and discussed.

      The changes in CD4:CD8 ratio and sCD14 levels appear counterintuitive to the hypothesis and are commented on in the discussion.

      Overall, the discussion highlights the significant changes in the intensified group, which are suggestive. There is limited discussion of the comparisons between groups, where the results are less convincing.

      The limitations of the study should be more clearly discussed. The small sample size raises the possibility of imbalance at baseline. The supplemental figures (S3-S5) are helpful in showing the differences between groups at baseline, and the variability of measurements is more apparent. The lack of blinding is also a weakness, though the PK assessments do help (note 3TC levels rise substantially in both groups for most of the time on study (Figure S2)).

      The many assays and comparisons are listed as a strength. The many comparisons raise the possibility of finding significance by chance. In addition, if there is an imbalance at baseline, outcomes measuring related parameters will move in the same direction.
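
      As a rough illustration of the multiple-comparisons concern above, the following is a minimal sketch only. It assumes independent tests at a nominal alpha of 0.05, and the numbers of comparisons are hypothetical; none of these values are taken from the manuscript. Outcomes in a trial like this one are likely correlated, so the independent-test figure is an approximation rather than an exact rate.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among n independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    which is an illustrative simplification.
    """
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical numbers of comparisons, chosen only for illustration.
    for m in (1, 5, 10, 20):
        fwer = familywise_error_rate(m)
        thr = bonferroni_threshold(m)
        print(f"{m:>2} tests: FWER = {fwer:.3f}, Bonferroni threshold = {thr:.4f}")
```

      Under these assumptions the chance of at least one spurious significant result grows quickly with the number of comparisons, which is why a pre-specified primary outcome or an explicit correction is usually requested.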

      The limited impact on activation and inflammation should be addressed in the discussion, as they are highlighted as a potentially important consequence of intermittent, not sustained replication in the introduction.

      The study is provocative and well executed, with the limitations listed above. Pharmacokinetic analyses help mitigate the lack of blinding. The major impact of this work is if it leads to a much larger randomized, controlled, blinded study of a longer duration, as the authors point out.

    5. Author response:

      Reviewer #1 (Public Review):

      Fombellida-Lopez and colleagues describe the results of an ART intensification trial in people with HIV infection (PWH) on suppressive ART to determine the effect of increasing the dose of one ART drug, dolutegravir, on viral reservoirs, immune activation, exhaustion, and circulating inflammatory markers. The authors hypothesize that ART intensification will provide clues about the degree to which low-level viral replication is occurring in circulation and in tissues despite ongoing ART, which could be identified if reservoirs decrease and/or if immune biomarkers change. The trial design is straightforward and well-described, and the intervention appears to have been well tolerated. The investigators observed an increase in dolutegravir concentrations in circulation, and to a lesser degree in tissues, in the intervention group, indicating that the intervention has functioned as expected (ART has been intensified in vivo). Several outcome measures changed during the trial period in the intervention group, leading the investigators to conclude that their results provide strong evidence of ongoing replication on standard ART. The results of this small trial are intriguing, and a few observations in particular are hypothesis-generating and potentially justify further clinical trials to explore them in depth. However, I am concerned about over-interpretation of results that do not fully justify the authors' conclusions.

      We thank Reviewer #1 for their thoughtful and constructive comments, which will help us clarify and improve the manuscript. Below, we address each of the reviewer’s points and describe the changes that we intend to implement in the revised version. We acknowledge the reviewer’s concern regarding potential over-interpretation of certain findings, and we will take particular care to ensure that all conclusions are supported by the data and framed within the exploratory nature of the study.

      (1) Trial objectives: What was the primary objective of the trial? This is not clearly stated. The authors describe changes in some reservoir parameters and no changes in others. Which of these was the primary outcome? No a priori hypothesis / primary objective is stated, nor is there explicit justification (power calculations, prior in vivo evidence) for the small n, unblinded design, and lack of placebo control. In the abstract (line 36, "significant decreases in total HIV DNA") and conclusion (lines 244-246), the authors state that total proviral DNA decreased as a result of ART intensification. However, in Figures 2A and 2E (and in line 251), the authors indicate that total proviral DNA did not change. These statements are confusing and appear to be contradictory. Regarding the decrease in total proviral DNA, I believe the authors may mean that they observed transient decrease in total proviral DNA during the intensification period (day 28 in particular, Figure 2A), however this level increases at Day 56 and then returns to baseline at Day 84, which is the source of the negative observation. Stating that total proviral DNA decreased as a result of the intervention when it ultimately did not is misleading, unless the investigators intended the day 28 timepoint as a primary endpoint for reservoir reduction - if so, this is never stated, and it is unclear why the intervention would then be continued until day 84? If, instead, reservoir reduction at the end of the intervention was the primary endpoint (again, unstated by the authors), then it is not appropriate to state that the total proviral reservoir decreased significantly when it did not.

      We agree with the reviewer that the primary objective of the study was not explicitly stated in the submitted manuscript. We will clarify this in the revised manuscript. As registered on ClinicalTrials.gov (NCT05351684), the primary outcome was defined as “To evaluate the impact of treatment intensification at the level of total and replication-competent reservoir (RCR) in blood and in tissues”, with a time frame of 3 months. Accordingly, our aim was to explore whether any measurable reduction in the HIV reservoir (total or replication-competent) occurred during the intensification period, including at day 28, 56, or 84. The protocol did not prespecify a single time point for this effect to occur, and the exploratory design allowed for detection of transient or sustained changes within the intensification window.

      We recognize that this scope was not clearly articulated in the original text and may have led to confusion in interpreting the transient drop in total HIV DNA observed at day 28. While total DNA ultimately returned to baseline by the end of intensification, the presence of a transient reduction during this 3-month window still fits within the framework of the study’s registered objective. Moreover, although the change in total HIV DNA was transient, it aligns with the consistent direction of changes observed across the multiple independent measures, including CA HIV RNA, RNA/DNA ratio and intact HIV DNA, collectively supporting a biological effect of intensification.

      We would also like to stress that this is the first clinical trial in which ART intensification is performed not by adding an extra drug but by increasing the dosage of an existing drug. Therefore, we were more interested in the overall, cumulative effect of intensification throughout the entire trial period than in differences between groups at individual time points. We will clarify in the manuscript that this was a proof-of-concept phase 2 study, designed to generate biological signals rather than confirm efficacy in a powered comparison. The absence of a pre-specified statistical endpoint or sample size calculation reflects the exploratory nature of the trial.

      (2) Intervention safety and tolerability: The results section lacks a specific heading for participant safety and tolerability of the intervention. I was wondering about clinically detectable viremia in the study. Were there any viral blips? Was the increased DTG well tolerated? This drug is known to cause myositis, headache, CPK elevation, and hepatotoxicity. Were any of these observed? What is the authors' interpretation of the CD4:8 ratio change (line 198)? Is this a significant safety concern for a longer duration of intensification? Was there also a change in CD4% or only in absolute counts? Was there relative CD4 depletion observed in the rectal biopsy samples between days 0 and 84? Interestingly, T cells dropped at the same timepoints that reservoirs declined... how do the authors rule out that reservoir decline reflects transient T cell decline that is non-specific (not due to additional blockade of replication)?

      We will improve the Methods section to clarify how safety and tolerability were assessed during the study. Safety evaluations were conducted on day 28 and day 84 and included a clinical examination and routine laboratory testing (liver function tests, kidney function, and complete blood count). Medication adherence was also monitored through pill counts performed by the study nurses.

      No virological blips above 50 copies/mL were observed and no adverse events were reported by participants during the 3-month intensification period. Although CPK levels were not included in the routine biological monitoring, no participant reported muscle pain or other symptoms suggestive of muscle toxicity.

      The CD4:CD8 ratio decrease noted during intensification was not associated with significant changes in absolute CD4 or CD8 counts, as shown in Figure 5. We interpret this ratio change as a transient redistribution rather than an immunological risk; therefore, we do not consider it to represent a safety concern.

      We would like to clarify that CD4<sup>+</sup> T-cell counts did not significantly decrease in any of the treatment groups, as shown in Figure 5. The apparent decline observed concerns the CD4/CD8 ratio, which transiently dropped, but not the absolute number of CD4<sup>+</sup> T cells.

      (3) The investigators describe a decrease in intact proviral DNA after 84 days of ART intensification in circulating cells (Figure 2D), but no changes to total proviral DNA in blood or tissue (Figures 2A and 2E; IPDA does not appear to have been done on tissue samples). It is not clear why ART intensification would result in a selective decrease in intact proviruses and not in total proviruses if the source of these reservoir cells is due to ongoing replication. These reservoir results have multiple interpretations, including (but not limited to) the investigators' contention that this provides strong evidence of ongoing replication. However, ongoing replication results in the production of both intact and mutated/defective proviruses that both contribute to reservoir size (with defective proviruses vastly outnumbering intact proviruses). The small sample size and well-described heterogeneity of the HIV reservoir (with regard to overall size and composition) raise the possibility that the study was underpowered to detect differences over the 84-day intervention period. No power calculations or prior studies were described to justify the trial size or the duration of the intervention. Readers would benefit from a more nuanced discussion of reservoir changes observed here.

      We sincerely thank the reviewer for this insightful comment. We fully agree that the reservoir dynamics observed in our study raise several possible interpretations, and that its complexity, resulting from continuous cycles of expansion and contraction, reflects the heterogeneity of the latent reservoir.

      Total HIV DNA in PBMCs showed a transient decline during intensification (notably at day 28), ultimately returning to baseline by day 84. This biphasic pattern may reflect the combined effects of suppression of ongoing low-level replication by an increased DTG dosage, followed by the expansion of infected cell clones (mostly harboring defective proviruses). In other words, the transient decrease in total (intact + defective) DNA at day 28 may be due to an initial decrease in newly infected cells upon ART intensification; however, at the subsequent time points this effect was masked by proliferation (clonal expansion) of infected cells with defective proviruses. This explains why the intact proviruses decreased, but the total proviruses did not change, between days 0 and 84.

      Importantly, we observed a significant decrease in intact proviral DNA between day 0 and day 84 in the intensification group (Figure 2D). We will highlight this result more clearly in the revised manuscript, as it directly addresses the study’s primary objective: assessing the impact of intensification on the replication-competent reservoir. In comparison, as the reviewer rightly points out, total HIV DNA includes over 90% defective genomes, which limits its interpretability as a biomarker of biologically relevant reservoir changes.

      In addition, other reservoir markers, such as cell-associated unspliced RNA and RNA/DNA ratios, also showed consistent trends supporting a modest but biologically relevant effect of intensification. Even in the absence of sustained changes in total HIV DNA, the coherence across these independent measures suggests a signal indicative of ongoing replication in at least some individuals, and at specific timepoints.

      Regarding tissue reservoirs, the lack of substantial change in total HIV DNA between days 0 and 84 is also in line with the predominance of defective sequences in these compartments. Moreover, the limited increase in rectal tissue dolutegravir levels during intensification (from 16.7% to 20% of plasma concentrations) may have limited the efficacy of the intervention in this site.

      As for the IPDA on rectal biopsies, we attempted the assay using two independent DNA extraction methods (Promega Reliaprep and Qiagen Puregene), but both yielded high DNA Shearing Index values, and intact proviral detection was successful in only 3 of 40 samples. Given the poor DNA integrity and weak signals, these results were not interpretable.

      That said, we fully acknowledge the limitations of our study, especially the small sample size, and we agree with the reviewer that caution is needed when interpreting these findings. In the revised manuscript, we will adopt a more measured tone in the discussion, clearly stating that these observations are exploratory and hypothesis-generating, and require confirmation in larger, more powered studies. Nonetheless, we believe that the convergence of multiple reservoir markers pointing in the same direction constitutes a potentially meaningful biological signal that deserves further investigation.

      (4) While a few statistically significant changes occurred in immune activation markers, it is not clear that these are biologically significant. Lines 175-186 and Figure 3: The change in CD4 cells + for TIGIT looks as though it declined by only 1-2%, and at day 84, the confidence interval appears to widen significantly at this timepoint, spanning an interquartile range of 4%. The only other immune activation/exhaustion marker change that reached statistical significance appears to be CD8 cells + for CD38 and HLA-DR, however, the decline appears to be a fraction of a percent, with the control group trending in the same direction. Despite marginal statistical significance, it is not clear there is any biological significance to these findings; Figure S6 supports the contention that there is no significant change in these parameters over time or between groups. With most markers showing no change and these two showing very small changes (and the latter moving in the same direction as the control group), these results do not justify the statement that intensifying DTG decreases immune activation and exhaustion (lines 38-40 in the abstract and elsewhere).

      We agree with the reviewer that the observed changes in immune activation and exhaustion markers were modest. We will revise the manuscript to reflect this more accurately. We will also note that these differences, while statistically significant (e.g., in TIGIT+ CD4+ T cells and CD38+HLA-DR+ CD8+ T cells), were limited in magnitude. We will explicitly acknowledge these limitations and interpret the findings with appropriate caution.

      (5) There are several limitations of the study design that deserve consideration beyond those discussed at line 327. The study was open-label and not placebo-controlled, which may have led to some medication adherence changes that confound results (authors describe one observation that may be evidence of this; lines 146-148). Randomized/blinded / cross-over design would be more robust and help determine signal from noise, given relatively small changes observed in the intervention arm. There does not seem to be a measurement of key outcome variables after treatment intensification ceased - evidence of an effect on replication through ART intensification would be enhanced by observing changes once intensification was stopped. Why was intensification maintained for 84 days? More information about the study duration would be helpful. Table 1 indicates that participants were 95% male. Sex is known to be a biological variable, particularly with regard to HIV reservoir size and chronic immune activation in PWH. Worldwide, 50% of PWH are women. Research into improving management/understanding of disease should reflect this, and equal participation should be sought in trials. Table 1 shows differing baseline reservoir sizes between the control and intervention groups. This may have important implications, particularly for outcomes where reservoir size is used as the denominator.

      We will expand the limitations section to address several key aspects raised by the reviewer: the absence of blinding and placebo control, the predominantly male study population, and the lack of post-intervention follow-up. While we acknowledge that open-label designs can introduce behavioral biases, including potential changes in adherence, we will now explicitly state that placebo-controlled, blinded trials would provide a more robust assessment and are warranted in future research.

      The 84-day duration of intensification was chosen based on previous studies and provided sufficient time for observing potential changes in viral transcription and reservoir dynamics. However, we agree that including post-intervention follow-up would have strengthened the conclusions, and we will highlight this limitation and future direction in the revised manuscript.

      The sex imbalance is now clearly acknowledged as a limitation in the revised manuscript, and we fully support ongoing efforts to promote equitable recruitment in HIV research. We would like to add that, in our study, rectal biopsies were coupled with anal cancer screening through HPV testing. This screening is specifically recommended for younger men who have sex with men (MSM), as outlined in the current EACS guidelines (see: https://eacs.sanfordguide.com/eacs-part2/cancer/cancer-screening-methods). As a result, MSM participants had both a clinical incentive and medical interest to undergo this procedure, which likely contributed to the higher proportion of male participants in the study.

      Lastly, although baseline total HIV DNA was higher in the intensified group, our statistical approach is based on a within-subject (repeated-measures) design, in which the longitudinal change of a parameter within the same participant during the study was the main outcome. In other words, we are not comparing absolute values of any marker between the groups, we are looking at changes of parameters from baseline within participants, and these are not expected to be affected by baseline imbalances.

      (6) Figure 1: the increase in DTG levels is interesting - it is not uniform across participants. Several participants had lower levels of DTG at the end of the intervention. Though unlikely to be statistically significant, it would be interesting to evaluate if there is a correlation between change in DTG concentrations and virologic / reservoir / inflammatory parameters. A positive relationship between increasing DTG concentration and decreased cell-associated RNA, for example, would help support the hypothesis that ongoing replication is occurring.

      We agree with the reviewer that assessing correlations between DTG concentrations and virological, immunological, or inflammatory markers would be highly informative. In fact, we initially explored this question in a preliminary way by examining whether individuals who showed a marked increase in DTG levels after intensification also demonstrated stronger changes in the viral reservoir. While this exploratory analysis did not reveal any clear associations, we would like to emphasize that correlating biological effects with DTG concentrations measured at a single timepoint may have limited interpretability. A more comprehensive understanding of the relationship between drug exposure and reservoir dynamics would ideally require multiple pharmacokinetic measurements over time, including pre-intensification baselines. This is particularly important given that DTG concentrations vary across individuals and over time, depending on adherence, metabolism, and other individual factors. We will clarify these points in the revised manuscript.

      (7) Figure 2: IPDA in tissue- was this done? scRNA in blood (single copy assay) - would this be expected to correlate with usCaRNA? The most unambiguous result is the decrease in cell-associated RNA - accompanying results using single-copy assay in plasma would be helpful to bolster this result.

      As mentioned in our response to point 3, we attempted IPDA on tissue samples, but technical limitations prevented reliable detection of intact proviruses. Regarding residual viremia, we did perform ultra-sensitive plasma HIV RNA quantification, but a technical issue (an inadvertent PBMC contamination during plasma separation) affected the reliability of the results, and we felt uncomfortable including these data in the manuscript.

      The use of the US RNA / Total DNA ratio is not helpful/difficult to interpret since the control and intervention arms were unmatched for total DNA reservoir size at study entry.

      We respectfully disagree with this comment. The US RNA / Total DNA ratio is commonly used to assess the relative transcriptional activity of the viral reservoir, rather than its absolute size. While we acknowledge that the total HIV-1 DNA levels differed at baseline between the two groups, the US RNA / Total DNA ratio specifically reflects the relationship between transcriptional activity and reservoir size within each individual, and is therefore not directly confounded by baseline differences in total DNA alone.

      Moreover, our analyses focus on within-subject longitudinal changes from baseline, not on direct between-group comparisons of absolute marker values. As such, the observed changes in the US RNA / Total DNA ratio over time are interpreted relative to each participant's baseline, mitigating concerns related to baseline imbalances between groups.
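
      The within-subject framing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' analysis code: the participant values are invented placeholders, the function names are hypothetical, and the exact Wilcoxon signed-rank test is an assumed choice, since the test actually used is not named in this response.

```python
from itertools import product
from statistics import median
import math


def fold_changes(baseline, followup):
    """Per-participant ratio followup / baseline (values below 1 mean a decrease)."""
    return [f / b for b, f in zip(baseline, followup)]


def signed_rank_exact_p(differences):
    """Two-sided exact Wilcoxon signed-rank p-value by enumerating all sign patterns.

    Assumes no zero differences and no ties among absolute values.
    """
    abs_sorted = sorted(abs(d) for d in differences)
    ranks = [abs_sorted.index(abs(d)) + 1 for d in differences]
    observed = sum(r for r, d in zip(ranks, differences) if d > 0)
    expected = sum(ranks) / 2
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= abs(observed - expected):
            extreme += 1
    return extreme / 2 ** len(ranks)


# Hypothetical marker values for seven participants (placeholders, not study data).
baseline = [100, 80, 120, 60, 90, 150, 70]
day84 = [20, 25, 30, 10, 30, 40, 35]

fc = fold_changes(baseline, day84)
log_fc = [math.log(x) for x in fc]  # log scale makes increases and decreases symmetric
p_value = signed_rank_exact_p(log_fc)
print(f"median fold change = {median(fc):.3f}, exact p = {p_value:.4f}")
```

      Because each participant serves as their own control, a higher baseline in one arm shifts the absolute values but not the ratio to baseline. The sketch does not address the separate question of whether the size of a change can itself depend on the starting level.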

      Reviewer #2 (Public Review):

      Summary:

      An intensification study with a double dose of a second-generation integrase inhibitor, on a background of nucleoside analog inhibitors of the HIV reverse transcriptase, in 20 individuals randomized with controls, with an impact on the levels of viral reservoirs and inflammation markers. Viral reservoirs in HIV are the main impediment to an HIV cure, and inflammation is associated with co-morbidities.

      Strengths:

      The intervention that leads to a decrease of viral reservoirs and inflammation is quite straightforward, as a doubling of the INSTI is used in some individuals with INSTI resistance, with good tolerability.

      This is a very well documented study, both in blood and tissues, which is a great achievement due to the difficulty of body sampling in well-controlled individuals on antiretroviral therapy. The laboratory assays are performed by specialists in the field with state-of-the art quantification assays. Both the introduction and the discussion are remarkably well presented and documented.

      The findings also have a potential impact on the management of chronic HIV infection.

      Weaknesses:

      I do not think that the size of the study can be considered a weakness, nor the fact that it is open-label.

      We thank Reviewer #2 for their constructive and supportive comments. We appreciate their positive assessment of the study design, the translational relevance of the intervention, and the technical quality of the assays. We also take note of their perspective regarding sample size and study design, which supports our positioning of this trial as an exploratory, hypothesis-generating phase 2 study.

      Reviewer #3 (Public Review):

      The introduction does a very good job of discussing the issue around whether there is ongoing replication in people with HIV on antiretroviral therapy. Sporadic, non-sustained replication likely occurs in many PWH on ART related to adherence, drug interactions and possibly penetration of antivirals into sanctuary areas of replication, and, as the authors point out, proving it does not occur is likely not possible and proving it does occur is likely very dependent on the population studied and the design of the intervention. Whether the consequences of this replication in the absence of evolution toward resistance have clinical significance is a challenging question to address.

      It is important to note that INSTI-based therapy may have a different impact on HIV replication events that results in differences in virus release for specific cell types (those responsible for "second phase" decay) by blocking integration in cells that have completed reverse transcription prior to ART initiation but have yet to be fully activated. In a PI or NNRTI-based regimen, those cells will release virus, whereas with an INSTI-based regimen, they will not.

      Given the very small sample size, there is a substantial risk of imbalance between the groups in important baseline measures. Unfortunately, with the small sample size, a non-significant P value is not helpful when comparing baseline measures between groups. One suggestion would be to provide the full range as opposed to the inter-quartile range (essentially only 5 or 6 values). The authors could also report the proportion of participants with baseline HIV RNA target not detected in the two groups.

      We thank Reviewer #3 for their thoughtful and balanced review. We are grateful for the recognition of the strength of the Introduction, the complexity of evaluating residual replication, and the technical execution of the assays. We also appreciate the insightful suggestions for improving the clarity and transparency of our results and discussion.

      We will revise the manuscript to address several of the reviewer’s key concerns. We agree that the small sample size increases the risk of baseline imbalances. We will acknowledge these limitations in the revised manuscript. We will provide both the full range and the IQR in Table 1 in the revised manuscript.

      A suggestion that there is a critical imbalance between groups is that the control group has significantly lower total HIV DNA in PBMC, despite the small sample size. The control group also has numerically longer time of continuous suppression, lower unspliced RNA, and lower intact proviral DNA. These differences may have biased the ability to see changes in DNA and US RNA in the control group.

      We acknowledge the significant baseline difference in total HIV DNA between groups, which we have clearly reported. However, the other variables mentioned, duration of continuous viral suppression, unspliced RNA levels, and intact proviral DNA, did not differ significantly between groups at baseline, despite differences in the median values. These numerical differences do not necessarily indicate a critical imbalance.

      Notably, there was no significant difference in the change in US RNA/DNA between groups (Figure 2C).

      The nonsignificant difference in the change in US RNA/DNA between groups is not unexpected, given the significant between-group differences for both US RNA and total DNA changes. Since the ratio combines both markers, it is likely to show attenuated between-group differences compared to the individual components. However, while the difference did not reach statistical significance (p = 0.09), we still observed a trend towards a greater reduction in the US RNA/Total DNA ratio in the intervention group.

      The fact that the median relative change appears very similar in Figure 2C, yet there is a substantial difference in P values, is also a comment on the limits of the current sample size.

      Although we surely agree that in general, the limited sample size impacts statistical power, we would like to point out that in Figure 2C, while the medians may appear similar, the ranges do differ between groups. At days 56 and 84, the median fold changes from baseline are indeed close but the full interquartile range in the DTG group stays below 1, while in the control group, the interquartile range is wider and covers approximately equal distance above and below 1. This explains the difference in p values between the groups.

      The text should report the median change in US RNA and US RNA/DNA when describing Figures 2A-2C.

      These data are already reported in the Results section (lines 164–166): "By day 84, US RNA and US RNA/total DNA ratio had decreased from day 0 by medians (IQRs) of 5.1 (3.3–6.4) and 4.6 (3.1–5.3) fold, respectively (p = 0.016 for both markers)."

      This statistical comparison of changes in IPDA results between groups should be reported. The presentation of the absolute values of all the comparisons in the supplemental figures is a strength of the manuscript.

      In the assessment of ART intensification on immune activation and exhaustion, the fact that none of the comparisons between randomized groups were significant should be noted and discussed.

      We would like to point out that a statistically significant difference between the randomized groups was observed for the frequency of CD4<sup>+</sup> T cells expressing TIGIT, as shown in Figure 3A and reported in the Results section (p = 0.048).

      The changes in CD4:CD8 ratio and sCD14 levels appear counterintuitive to the hypothesis and are commented on in the discussion.

      Overall, the discussion highlights the significant changes in the intensified group, which are suggestive. There is limited discussion of the comparisons between groups where the results are less convincing.

      We will temper the language accordingly and add commentary on the limited and modest nature of these changes. Similarly, we will expand our discussion of counterintuitive findings such as the CD4:CD8 ratio and sCD14 changes.

      The limitations of the study should be more clearly discussed. The small sample size raises the possibility of imbalance at baseline. The supplemental figures (S3-S5) are helpful in showing the differences between groups at baseline, and the variability of measurements is more apparent. The lack of blinding is also a weakness, though the PK assessments do help (note 3TC levels rise substantially in both groups for most of the time on study (Figure S2)).

      The many assays and comparisons are listed as a strength. The many comparisons raise the possibility of finding significance by chance. In addition, if there is an imbalance at baseline, outcomes measuring related parameters will move in the same direction.

      We agree that the multiple comparisons raise the possibility of chance findings but would like to stress that in an exploratory study like this it is very important to avoid a type II error. In addition, the consistent directionality of the most relevant outcomes (US RNA and intact DNA) lends biological plausibility to the observed effects.
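
      The trade-off raised in this exchange, between chance findings and type II error, can be made concrete with a short Python sketch. It is illustrative only: the p-values are invented placeholders, the family-wise calculation assumes independent tests, and the Holm step-down procedure is shown as one common correction, not as a method used in the study.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false positive among m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1,
        # and adjusted values are forced to be non-decreasing in that order.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


# Hypothetical unadjusted p-values for three outcomes (placeholders, not study data).
raw = [0.01, 0.04, 0.03]
print(familywise_error_rate(0.05, 20))
print(holm_adjust(raw))
```

      Adjustment lowers the chance of a spurious finding at the cost of power, which is the type II error concern stated above; reporting both raw and adjusted values is one way to let readers weigh the two.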

      The limited impact on activation and inflammation should be addressed in the discussion, as they are highlighted as a potentially important consequence of intermittent, not sustained replication in the introduction.

      The study is provocative and well executed, with the limitations listed above. Pharmacokinetic analyses help mitigate the lack of blinding. The major impact of this work is if it leads to a much larger randomized, controlled, blinded study of a longer duration, as the authors point out.

      Finally, we fully endorse the reviewer’s suggestion that the primary contribution of this study lies in its value as a proof-of-concept and foundation for future randomized, blinded trials of greater scale and duration. We will highlight this more clearly in the revised Discussion.

    1. eLife Assessment

      Tropical single-island endemic bird populations are particularly vulnerable to climate change. The authors investigate genetic evidence of how such species dealt with climate changes in the past as a possible predictor for how they will respond to change in the future, which could provide an important example for the fields of conservation genetics and island biogeography. The authors' integration of genomics and habitat modeling is commendable, but we find that the support for their conclusions is incomplete: at times, the results presented appear to contradict each other, the authors do not fully account for key variables, and the limited taxonomic scope may cause problematic biases for the conclusions.

    2. Reviewer #1 (Public review):

      Summary:

      The authors combine PSMC and habitat modeling to try to connect habitat change during the Last Glacial Period to changes in Ne.

      Strengths:

      Observing how tropical single-island endemic bird species responded to habitat change in the past may help inform conservation interventions for these particularly vulnerable species. The combination of genomics and habitat modeling is a good idea - this sort of interdisciplinary thinking is what is needed to tackle these complex questions. Additionally, the use of PSMC makes it possible to perform this analysis on poorly-studied species with only a single genome available.

      Room for Improvement:

      Why coalescent Ne is a better predictor of extinction risk than current genomic diversity, or current Ne, isn't explicitly explained. PSMC in particular has many caveats, and some are not acknowledged or adequately addressed by the authors. For example, the authors note that population structure is a confounding factor with PSMC, but that it is not a problem in this instance. They do not provide compelling evidence for why this would be the case; they simply state that the species studied are all single-island endemics. However, single-island endemic species are not necessarily panmictic; this is even less likely to be true for species studied here that inhabit a large geographic area (i.e., Australian species). Differing PSMC parameters may also impact results: the differences between passerines and non-passerines were one of their main results, but they do not provide any analysis to show that this difference was not driven by the different mutation rates used for the two groups.
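
      The concern about mutation rates can be illustrated with the standard scaling of PSMC output, sketched here in Python. This is a minimal sketch under assumptions: the function name and all numeric inputs are hypothetical, the bin size of 100 bp is the PSMC default, and none of the values are taken from the manuscript under review.

```python
def scale_psmc(theta0, lambdas, times, mu, gen_time, bin_size=100):
    """Convert scaled PSMC output to effective size and years.

    theta0   : per-bin scaled mutation rate estimated by PSMC
    lambdas  : relative population sizes per time interval
    times    : scaled interval boundaries (units of 2 * N0 generations)
    mu       : assumed mutation rate per site per generation
    gen_time : assumed generation time in years
    """
    n0 = theta0 / (4 * mu * bin_size)  # reference effective size
    ne = [lam * n0 for lam in lambdas]
    years = [2 * n0 * t * gen_time for t in times]
    return ne, years


# Hypothetical inputs; only the assumed mutation rate differs between the two calls.
lambdas, times = [1.0, 2.0], [0.1, 0.5]
ne_fast, years_fast = scale_psmc(0.002, lambdas, times, mu=2e-9, gen_time=2.0)
ne_slow, years_slow = scale_psmc(0.002, lambdas, times, mu=1e-9, gen_time=2.0)
print(ne_fast, years_fast)
print(ne_slow, years_slow)
```

      Halving the assumed mutation rate doubles both the inferred Ne and the age of every interval, so the same curve can fall inside or outside a fixed window such as the LIG to LGM. Rescaling both groups under a shared rate would be one way to check whether a passerine versus non-passerine contrast survives.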

      Parameters for many steps are not described, and choices that are described (such as the PSMC parameters) are not always fully explained. It is unclear why all data was mapped to the autosomes rather than removing reads that map to the sex chromosomes first. Using all the data, the reads belonging to the sex chromosomes could potentially map to other areas of the genome. It does not seem like a mapping quality filter was used, so these potential spurious alignments would not have been removed prior to analysis.

      There are points where the results are described in ways that appear to potentially differ from the supplementary figures. The authors state that even for species where PSMC results differed between models, "trends of Ne increase or decrease from the LIG to LGM were robust across all three PSMC models considered." The figures in the supplement for Pachycephala philippinensis, Rhynochetos jubatus, and Zosterops hypoxanthus appear to potentially contradict this statement, but it is difficult to tell, as the time period observed is not clearly marked on the graphs. How this robustness of trends was determined is not explained, leaving the precision of the analysis unclear.

      Table 1 also includes some information that contradicts what is in the Supplementary Tables, leading to a lack of clarity. Centropus unirufus, Chaetorhynchus papuensis, and Cnemophilus loriae are not included in Supplementary Table 4. Table 1 says Eulacestoma nigropectus, Paradisaea rubra, and Parotia lawesii did not undergo PSMC analysis, but Supplementary Table 4 says PSMC and modeling trends matched for these species. Table 1 says Rhagologus leucostigma underwent both PSMC and climate modeling, but Supplementary Table 4 says "NA" as if it was missing one of these analyses.

      Additionally, some of the results appear to contradict each other. For example, they show that there is no impact of habitat change in larger-bodied species, but also that larger-bodied species saw a decrease in Ne during the LGP. In another example, they state that when a species saw an increase in habitat during the LGP, they also had an increase in Ne. However, they also state that this was not the case for non-passerines.

      Ecosystems are highly complex; there may also be other variables influencing past demographic change other than those explored here. Results should be interpreted with caution.

    3. Reviewer #2 (Public review):

      Summary and strengths:

      In this manuscript, Karjee and colleagues used coalescent-based effective population size reconstruction (PSMC) from single genomes to understand past population trends in island birds and related this to life history traits and glacial patterns. This concept is fairly new, as there are still relatively few multiple PSMC synthesis studies. I also thought that the focus on island endemics was unique and adds value to this paper. I enjoyed seeing a paper focused on South East Asia and think that this could help contribute to our knowledge of the important biodiversity within this region.

      Major weaknesses:

      My biggest concern with this paper is that the analyses are limited to 20-30 species, and significant taxonomic bias is present (there are multiple species of passerine but only 1-2 representatives of other groups). While this is not an issue alone, many of the life history traits or geographical traits are conflated with phylogenetic diversity (e.g., there are no large-bodied passerines). Thus, it is my opinion that the impact of these drivers of past population size is conflated and cannot be disentangled with the current data. The authors themselves state that the core hypothesis surrounding Ne and habitat availability is not supported by their entire dataset (only seen in Passerines). This was not clear enough in the abstract, and conclusions cannot be drawn here as the impact of taxonomy cannot be separated from data richness, traits, etc. The PSMC analysis was done according to the most recent recommendations, and this part of the manuscript is fairly robust. However, in several places, it is incorrectly stated that the PSMC measures or can infer genetic diversity; PSMC only infers past effective population size. It cannot measure genetic diversity in the past. I cannot review the habitat reconstruction modelling as I am a conservation genomics specialist.

      Appraisal:

      I am not convinced about the findings within the paper. I do not think that the results are sufficiently supported at this time, largely due to the conflation of taxonomy with other variables. As this type of comparison is new, I do think that there is a chance for reasonable impact on the field of genomics and island biogeography if the manuscript's constraints are addressed. I do not see scope for impact on conservation at this time and find the conclusions in the abstract regarding conservation relevance to be unfounded.

    4. Author response:

      We thank the editors and the reviewers for their positive comments regarding our manuscript and the methodological approach we have taken to understand the historical demographic response of endemic island birds to climate change. We acknowledge the issues of uneven sample sizes and plan to include additional species of island endemic birds for which genomic data is now available. As requested by reviewer 1, we will also address the issues related to the PSMC analysis in the revised version of the manuscript.

    1. eLife Assessment

      This study presents important findings that enhance our understanding of immune cell interactions in the context of chronic HIV-1 infection. The evidence supporting the conclusions is convincing. The authors have employed appropriate and validated methodologies, including detailed data reprocessing and batch correction to account for inter-donor variability. The inclusion of supplementary figures and analyses, such as cell communication inference, further substantiates the robustness of the findings. Overall, this work contributes to our understanding of HIV-1 immune evasion and highlights potential therapeutic targets for reservoir eradication.

    2. Reviewer #2 (Public review):

      Summary:

      The authors observed gene ontologies associated with upregulated KLF2 target genes in HIV-1 RNA+ CD4 T Cells using scRNA-seq and scATAC-seq datasets from the PBMCs of early HIV-1-infected patients, showing immune responses contributing to HIV pathogenesis and novel targets for viral elimination.

      Strengths:

      The authors carried out detailed transcriptomics profiling with scRNA-seq and scATAC-seq datasets to conclude upregulated KLF2 target genes in HIV-1 RNA+ CD4 T Cells.

      Comments on revisions:

      The authors have addressed my comments.

    3. Reviewer #3 (Public review):

      The revised manuscript demonstrates a marked improvement over the previous version. The authors have successfully incorporated feedback, and have moreover expanded their analyses.

      The Methods section is now more detailed and meets the requirements for reproducible research. The authors have reprocessed the data, creating an integrated dataset using a previously published single-cell RNA-Seq atlas, which includes both healthy donors and individuals with chronic HIV-1 infection. An additional batch correction step was included in the processing pipeline after the explicit analysis of inter-donor variability within immune subsets, as was suggested.

      Several supplementary figures were added, which both improve the understanding of data and address questions raised by the reviewers. The manuscript also provides additional analysis of cell communication inference, as suggested. The study of interactions between NK cells and infected CD4+ T cells, as well as between monocytes and infected CD4+ T cells, is valuable for understanding the influence of cell signaling on antiviral response and the production of HIV-1 transcripts in infected cells.

      The authors have addressed all the reviewers' suggestions, and the current version of the manuscript is both more comprehensive and more informative. Additional analysis has strengthened the narrative and the reproducibility of the research.

      The resulting manuscript is both more robust and more informative.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors aimed to elucidate the molecular mechanisms underlying HIV-1 persistence and host immune dysfunction in CD4+ T cells during early infection (<6 months). Using single-cell multi-omics technologies (including scRNA-seq, scATAC-seq, and single-cell multiome analyses), they characterized the transcriptional and epigenomic landscapes of HIV-1-infected CD4+ T cells. They identified key transcription factors (TFs), signaling pathways, and T cell subtypes involved in HIV-1 persistence, particularly highlighting KLF2 and Th17 cells as critical regulators of immune suppression. The study provides new insights into immune dysregulation during early HIV-1 infection and reveals potential epigenetic regulatory mechanisms in HIV-1-infected T cells.

      Strengths:

      The study excels through its innovative integration of single-cell multi-omics technologies, enabling detailed analysis of gene regulatory networks in HIV-1-infected cells. Focusing on early infection stages, it fills a crucial knowledge gap in understanding initial immune responses and viral reservoir establishment. The identification of KLF2 as a key transcription factor and Th17 cells as major viral reservoirs, supported by comprehensive bioinformatics analyses, provides robust evidence for the study's conclusions. These findings have immediate clinical relevance by identifying potential therapeutic targets for HIV-1 reservoir eradication.

      We sincerely appreciate the reviewer’s positive evaluation of our work.

      Weaknesses:

      Despite its strengths, the study has several limitations. By focusing exclusively on CD4+ T cells, the study overlooks other relevant immune cells such as CD14+ monocytes, NK cells, and B cells. Additionally, while the authors generated their own single-cell datasets, they need to validate their findings using other publicly available single-cell data from HIV-1-infected PBMCs.

      Thank you to Reviewer #1 for your feedback on our work. In response to this feedback, we have examined cell-cell interactions between HIV-1-infected CD4+ T cells and other innate immune cells, including monocytes and NK cells. We identified altered interaction signaling patterns (e.g., MIF, ICAM2, CCL5, CLEC2B) that contribute to immune dysfunction and viral persistence (page 9, Supplementary Fig. 5). In addition, we validated the expression of KLF2 and its target genes using a publicly available scRNA-seq dataset from HIV-1-infected PBMCs [1], which includes both healthy donors and individuals with chronic HIV-1 infection. The upregulation of key KLF2 targets in HIV-1-infected CD4+ T cells from this dataset supports the reproducibility of our findings. We have incorporated these results into the revised Results, Discussion, and Supplementary Materials (page 8, page 12 and Supplementary Fig. 4A).

      Reviewer #2 (Public review):

      Summary:

      The authors observed gene ontologies associated with upregulated KLF2 target genes in HIV-1 RNA+ CD4 T Cells using scRNA-seq and scATAC-seq datasets from the PBMCs of early HIV-1-infected patients, showing immune responses contributing to HIV pathogenesis and novel targets for viral elimination.

      Strengths:

      The authors carried out detailed transcriptomics profiling with scRNA-seq and scATAC-seq datasets to conclude upregulated KLF2 target genes in HIV-1 RNA+ CD4 T Cells.

      We thank the reviewer for highlighting the strengths of our work.

      Weaknesses:

      This key observation of the upregulation of the KLF2-associated gene family might be important in the HIV field for early diagnosis and viral clearance. However, with the limited sample size and in-vivo study model, it will be hard to draw conclusions. I highly recommend increasing the sample size of early HIV-1-infected patients.

      Thank you to Reviewer #2 for this important comment. We acknowledge the limitations of our modest sample size, which reflects the challenges of recruiting well-characterized individuals in early HIV-1 infection (<6 months) and obtaining high-quality PBMCs for single-cell multi-omic profiling. To strengthen our findings, we validated the upregulation of KLF2 target genes using a publicly available scRNA-seq dataset from HIV-1-infected PBMCs [1], which showed similar expression patterns in HIV-1 RNA+ CD4+ T cells (page 8 and Supplementary Fig. 4A).

      Reviewer #3 (Public review):

      Summary:

      This manuscript studies intracellular changes and immune processes during early HIV-1 infection with an additional focus on the small CD4+ T cell subsets. The authors used single-cell omics to achieve high resolution of transcriptomic and epigenomic data on the infected cells which were verified by viral RNA expression. The results add to understanding of transcriptional regulation which may allow progression or HIV latency later in infected cells. The biosamples were derived from early HIV infection cases, providing particularly valuable data for the HIV research field.

      Strengths:

      The authors examined the heterogeneity of infected cells within CD4 T cell populations, identified a significant and unexpected difference between naive and effector CD4 T cells, and highlighted the differences in Th2 and Th17 cells. Multiple methods were used to show the role of the increased KLF2 factor in infected cells. This is a valuable finding of a new role for the major transcription factor in further disease progression and/or persistence.

      The methods employed by the authors are robust. Single-cell RNA-Seq from PBMC samples was followed by a comprehensive annotation of immune cell subsets, 16 in total. This manuscript presents to the scientific community a valuable multi-omics dataset of good quality, which could be further analyzed in the context of larger studies.

      We sincerely thank the reviewer for the insightful and concise summary of our work.

      Weaknesses:

      Methods and Supplementary materials

      Some technical aspects could be described in more detail. For example, it is unclear how the authors filtered out cells that did not pass quality control, such as doublets and cells with low transcript/UMI content. Next, in cell annotation, what is the variability in cell types between donors? This information is important to include in the supplementary materials, especially with such a small sample size. Without this, it is difficult to determine whether the differences between subsets at the transcriptomic level, viral RNA expression level, and chromatin assessment are observed due to cell type variations or individual patient-specific variations. For the DEG analysis, did the authors exclude the most variable genes?

      Thank you to Reviewer #3 for these detailed comments and observations. In the revised Methods section (page 16), we have added information on our quality control filtering process. Specifically, we excluded cells with fewer than 200 detected genes, high mitochondrial content (>30%), or low UMI counts. Doublets were identified and removed using DoubletFinder.
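
      As an illustration only, the filtering criteria described above can be sketched in a few lines of Python. This is a minimal sketch and not the authors' pipeline (the response mentions Seurat and DoubletFinder, which are R tools). The `CellQC` record, the `filter_cells` helper, and the `min_umis` value used in the usage line are hypothetical; the response does not state the UMI cutoff, so it is left as a required argument. Doublet removal is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CellQC:
    """Per-cell QC summary (hypothetical record, for illustration only)."""
    barcode: str
    n_genes: int      # number of detected genes
    n_umis: int       # total UMI count
    pct_mito: float   # percentage of counts from mitochondrial genes


def filter_cells(cells: List[CellQC], min_umis: int,
                 min_genes: int = 200,
                 max_pct_mito: float = 30.0) -> List[CellQC]:
    """Keep cells passing all three thresholds.

    min_genes and max_pct_mito follow the values quoted in the response;
    min_umis is NOT given there and must be chosen by the caller.
    """
    return [
        c for c in cells
        if c.n_genes >= min_genes        # drop cells with < 200 detected genes
        and c.pct_mito <= max_pct_mito   # drop cells with > 30% mitochondrial content
        and c.n_umis >= min_umis         # drop cells with low UMI counts
    ]


# Toy input: only cell "C" passes every threshold.
example_cells = [
    CellQC("A", n_genes=150, n_umis=1000, pct_mito=5.0),   # too few genes
    CellQC("B", n_genes=800, n_umis=3000, pct_mito=45.0),  # high mitochondrial content
    CellQC("C", n_genes=800, n_umis=3000, pct_mito=10.0),  # passes
    CellQC("D", n_genes=800, n_umis=100, pct_mito=10.0),   # low UMI count
]
kept = filter_cells(example_cells, min_umis=500)  # 500 is an assumed placeholder
kept_barcodes = [c.barcode for c in kept]
```

      Keeping the UMI cutoff as an explicit argument makes the one unstated threshold visible rather than burying an invented default.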

      To address inter-donor variability, we included a new supplementary figure (Supplementary Fig. 1B) showing the distribution of major immune cell types across individual donors. While we observed some variation in cell-type composition between individuals, this likely reflects natural biological heterogeneity in early HIV-1 infection. Additionally, we applied fastMNN batch correction to mitigate donor-specific technical variation. After correction, the overall patterns of gene expression within each major CD4+ T cell subset were consistent across individuals (Supplementary Fig. 1C).

      Regarding the DEG analysis, we used the ‘FindMarkers’ function in Seurat (v.3.2.1), which does not exclude highly variable genes. These details have been clarified in the updated Methods section (page 18).

      The annotation of 16 cell types from PBMC samples is impressive and of good quality, however, not all cell types get attention for further analysis. It’s natural to focus primarily on the CD4 T cells according to the research objectives. The authors also study potential interactions between CD4 and CD8 T cells by cell communication inference. It would be interesting to ask additional questions for other underexplored immune cell subsets, such as: 1) Could viral RNA be detected in monocytes or macrophages during early infection? 2) What are the inferred interactions between NK cells and infected CD4 T cells, are interactions similar to CD4-CD8 results? 3) What are the inferred interactions between monocytes or macrophages and infected CD4 T cells?

      In line with our study objectives, we initially focused on CD4+ T cells as primary HIV-1 targets. However, in response to the reviewer’s comment, we examined the inferred communications between HIV-1-infected CD4+ T cells and other immune cells.

      (1) With regard to the presence of viral RNA in monocytes or macrophages, we observed negligible HIV-1 RNA signal in these cell types in our dataset, consistent with their low permissiveness in early-stage infection [2]. However, we acknowledge the limitations of detecting rare infected cells at the single-cell level.

      (2) We identified increased MIF and ICAM2 signaling between NK cells and HIV-1-infected CD4+ T cells, which are associated with KLF2-mediated immune modulation. These patterns are consistent with the CD4–CD8 interaction results observed in our dataset (Supplementary Fig. 5A).

      (3) By combining cell-cell interaction analysis with differential expression analysis, we inferred reduced CCL5 and CD55 signaling between monocytes and HIV-1-infected CD4+ T cells (Supplementary Fig. 5B). These reductions may impair immune responses and antiviral defense.

      We appreciate the reviewer’s suggestions and believe that the analysis of underexplored immune subsets strengthens the relevance of our findings. These results have been incorporated into the revised Results (page 9).

      Discussion

      It would be interesting to see more discussion of the observation of how naïve T cells produce more viral RNA compared to effector T cells. It seems counterintuitive according to general levels of transcriptional and translational activity in subsets.

      Another discussion block could be added regarding the results and conclusion comparison with Ashokkumar et al. paper published earlier in 2024 (10.1093/gpbjnl/qzae003). This earlier publication used both a cell line-based HIV infection model and primary infected CD4 T cells and identified certain transcription factors correlated with viral RNA expression.

      Thank you to Reviewer #3 for the insightful suggestions. We observed that the proportion of HIV-1-infected naïve CD4 T cells is higher compared to effector T cells. Although effector CD4 T cells are generally more active, previous studies have suggested that naïve CD4 T cells are susceptible to HIV-1 infection during the early stage, which may be associated with initial expansion and rapid progression [3, 4]. This may be due to less restriction by antiviral signaling or more accessible chromatin states in resting cells. We have added this context and cited relevant papers to address this observation (page 11).

      In addition, we have incorporated a comparative discussion with the recent study [5], which identified FOXP1 and GATA3 as transcriptional regulators associated with HIV-1 RNA expression. While these TFs were not significantly differentially expressed in our dataset, we discuss potential reasons for this discrepancy—including differences in infection model (in vitro vs. ex vivo), infection stage (latency vs. acute), and T cell subset composition—and emphasize that both studies highlight the importance of transcriptional regulation in HIV-1 persistence (page 12 and Supplementary Fig. 4B).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The study has several notable limitations.

      First, it was restricted to early-stage HIV-1 infection (<6 months) without longitudinal data, preventing the authors from capturing temporal changes in immune cell populations, gene expression profiles, and epigenetic landscapes throughout disease progression.

      Thank you to Reviewer #1 for raising this important limitation. As noted, our study focused exclusively on early-stage HIV-1 infection (<6 months) to capture the initial immune dysregulation and epigenetic alterations. We agree that longitudinal analysis would provide valuable insights into disease progression. However, due to the limited availability of early-infection patient samples suitable for performing multi-omics profiling, we prioritized capturing a detailed snapshot at this early stage. To address this limitation, future studies incorporating longitudinal sampling, including chronic infection and long-term non-progressors, will be essential to fully elucidate the temporal dynamics of HIV-1 pathogenesis.

      Second, while the bioinformatic analysis compared "Uninfected" and "HIV-1-infected" cells from patients, the authors could have strengthened their findings by incorporating publicly available single-cell data from healthy donors and chronically infected HIV-1 patients to validate their arguments across all figures.

      To support the robustness of our findings, we incorporated a publicly available single-cell RNA-seq dataset [1], which includes both healthy donors and individuals with chronic HIV-1 infection. In this dataset, we validated the upregulation of KLF2 and its target genes in HIV-1-infected CD4+ T cells and observed generally consistent expression patterns with those in our early-infection cohort (page 8, page 12 and Supplementary Fig. S4). While not all gene-level trends were identical, likely reflecting differences in infection stage and immune activation status, this external comparison reinforces the reproducibility of key observations and highlights the unique transcriptional features associated with early HIV-1 infection.

      Third, although the study focused on CD4+ T cells as primary HIV-1 targets, it overlooked other important immune cells such as CD8+ T cells, monocytes, and NK cells, which may contribute to viral persistence and immune dysfunction through cell-cell interactions.

      In the revised manuscript, we expanded our analysis to include predicted ligand–receptor interactions between HIV-1-infected and uninfected CD4+ T cells with innate and cytotoxic immune cells using CellChat v.2.1.1. Specifically, we evaluated interactions with NK cells and monocytes and identified altered signaling pathways such as MIF, ICAM2, CCL5, and CLEC2B, which are associated with immune modulation (Supplementary Fig. 5A). We have added these results to the revised Results (page 9).

      Lastly, comparing these findings with other chronic viral infections (e.g., HBV, HCV) would have positioned this work more effectively within the broader field of viral immunology and enhanced its impact.

      We agree that broader comparisons with other chronic viral infections could enhance the impact of our findings. In the current discussion, we noted similarities in interferon signaling disruption with viruses such as HCV and HSV (page 11). Our observation that HIV-1-infected CD4+ T cells exhibit impaired interferon responses is consistent with immune evasion mechanisms reported in HCV and HSV infections. These results underscore both the shared and specific features of immune modulation and persistence during HIV-1 early infection.

      Reviewer #3 (Recommendations for the authors):

      Supplementary Table S1 should indicate which technique was used for sequencing. However, the current version of the table marks no protocol applied to the majority of the samples, which is confusing and needs to be corrected.

      Thank you to Reviewer #3 for pointing out this important oversight. We have revised Supplementary Table S1 to clearly indicate the sequencing method used for each sample. Separate columns for scRNA-seq, scATAC-seq, and sc-Multiome now specify whether each technique was applied (“Yes” or “No”) to improve clarity and transparency.

      (1) Wang, S., et al., An atlas of immune cell exhaustion in HIV-infected individuals revealed by single-cell transcriptomics. Emerg Microbes Infect, 2020. 9(1): p. 2333-2347.

      (2) Arfi, V., et al., Characterization of the early steps of infection of primary blood monocytes by human immunodeficiency virus type 1. J Virol, 2008. 82(13): p. 6557-65.

      (3) Douek, D.C., et al., HIV preferentially infects HIV-specific CD4+ T cells. Nature, 2002. 417(6884): p. 95-8.

      (4) Jiao, Y., et al., Higher HIV DNA in CD4+ naive T-cells during acute HIV-1 infection in rapid progressors. Viral Immunol, 2014. 27(6): p. 316-8.

      (5) Ashokkumar, M., et al., Integrated Single-cell Multiomic Analysis of HIV Latency Reversal Reveals Novel Regulators of Viral Reactivation. Genomics Proteomics Bioinformatics, 2024. 22(1).

    1. eLife Assessment

      This study presents valuable findings on the relationship between nutrient availability and NAD/NADH levels, which in turn regulate biomass production in cancer cells. The authors provide solid evidence to support their claims, offering insight into why it is difficult to predict which nutrients limit cancer cell growth: both cell type and nutrient availability together determine the oxidative capacity that constrains the synthesis of various metabolic intermediates. The manuscript will be of interest to researchers working in cancer and cell metabolism.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript investigates how cellular NAD/NADH ratios are controlled in cancer cell lines in vitro. The authors build on previous work, which shows that serine synthesis is sensitive to NAD/NADH ratios and PHGDH expression. Here, the authors demonstrate that serine synthesis is variable across a panel of cell lines, even when controlling for expression of serine synthesis enzymes such as PHGDH. The authors show that cellular NAD/NADH ratios correlate with the ability to synthesize serine and grow in serine-deprived environments when PHGDH levels remain constant. Investigating this variability in NAD/NADH ratios, the authors find that the cells that can positively respond to serine deprivation are able to increase oxygen consumption and cellular NAD/NADH ratios. Cells that do not increase oxygen consumption in response to serine deprivation do not increase NAD/NADH ratios and cannot grow well without serine. The authors go on to show that in cells with the ability to increase oxygen consumption upon serine deprivation, PHGDH expression alone is sufficient to fully restore growth without serine; in cells that cannot increase oxygen consumption, both PHGDH expression and interventions to increase NAD/NADH ratios are required to increase growth. Thus, cells need both PHGDH and NAD/NADH increases to maximize serine synthesis in response to serine deprivation. The authors previously showed that lipid synthesis likewise requires NAD regeneration. Interestingly, one cell line that does not increase oxygen consumption in response to serine limitation tends to increase oxygen consumption in response to lipid deprivation; accordingly, depriving this cell line of lipids increases the synthesis of serine.

      Together, these findings show that how cells respond to nutrient deprivation is highly variable and that the response to nutrient deprivation (for example, whether or not oxygen consumption is increased) will determine how well cells tolerate depletion of nutrients with related biosynthetic constraints. This work sheds light on the complexity of cancer cell metabolism and helps to explain why it is difficult to predict which nutrients will be limiting to any cancer cell type or environment.

      Strengths:

      (1) The authors use multiple interventions to manipulate NAD/NADH ratios in cells.

      (2) Experiments are well controlled and appropriately interpreted.

      Weaknesses:

      Overall, the data support the conclusions of the manuscript. I have only two minor comments and suggestions:

      (1) Figure 2B/C: data are presented as relative to +serine, which shows how some cells respond to -serine, but it may also be of interest to see how absolute (not relative) NAD/NADH levels correlate with serine synthesis and serine-independent proliferation. In other words, is it the dynamic increase in the ratio that is most important, or the absolute level of the ratio?

      (2) Line 177-178: the authors write, "We hypothesized that the elevated NAD+/NADH ratio represented a cellular response to make the NAD+/NADH ratio more oxidized to enable serine synthesis". I recommend modest edits to avoid anthropomorphizing. It is possible that the ratio responds for reasons yet to be determined and not necessarily because the cell is deliberately trying to enable serine synthesis.

    3. Reviewer #2 (Public review):

      In the manuscript "Cancer cells differentially modulate mitochondrial respiration to alter redox state and enable biomass synthesis in nutrient-limited environments", Chang et al investigate how cancer cells respond to the limitation of certain environmental nutrients by regulating the cellular NAD+/NADH ratio. They focus on serine and lipid metabolism, pathways known to be controlled by the NAD+/NADH ratio, and propose that changes in mitochondrial respiration in response to deprivation of these nutrients can influence the NAD+/NADH ratio, thereby impacting biomass synthesis.

      While the study is descriptive in nature and does not investigate specific molecular mechanisms that explain the crosstalk between nutrient availability and mitochondrial redox changes, the experimental component is robust, and the conclusions are well supported by the results. Some suggestions could further refine the conclusions and enhance the quality of the manuscript.

      Main critiques:

      (1) Throughout the manuscript, the authors utilise the number of cell doublings per day as an endpoint readout of cell proliferation. It would be advisable to include a quantification of the cell number and to display the proliferation rate over time. This would provide valuable insights into the timeline of cellular responses and avoid potential confounding effects associated with the use of Sulforhodamine B dye, an indirect measure of cell proliferation based on protein content, which may be influenced by some of the interventions. Furthermore, it will help determine whether specific treatments reduce cellular doublings resulting from cell death. This concern is particularly evident in treatments with rotenone, e.g., Fig. 1G, where the increase in doublings could be attributed to cell death.

      (2) The authors propose a model in which the deprivation of extracellular nutrients impacts mitochondrial respiration, which in turn increases the NAD+/NADH ratio and ultimately affects metabolic biosynthetic pathways that occur in the cytosol, such as serine biosynthesis. The mechanism by which nutrient availability is sensed and transmitted across different cellular compartments to regulate mitochondrial redox status remains unclear. This concern is particularly relevant for serine metabolism, as its synthesis occurs in the cytosol, but the authors connect it to mitochondrial respiration. Compartment-specific measurements of NAD+/NADH ratio would help to understand to what extent the redox state is affected by nutrients in the mitochondria and in the cytoplasm (see also minor critiques point 2). Moreover, the use of the genetic tool LbNox could be employed to manipulate the NAD+/NADH ratio in a compartment-specific manner, while also avoiding the toxicity of certain compounds, such as rotenone. This set of experiments would add depth to the investigation, which might otherwise appear too descriptive.
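
      For reference, the endpoint discussed in critique (1), doublings per day, follows from cell counts as log2(N_end / N_start) divided by the elapsed days, and the same formula applied to consecutive time points gives a proliferation rate over time. The sketch below is a minimal illustration assuming direct cell counts are available; the function names and the example counts are hypothetical and are not taken from the manuscript.

```python
import math
from typing import List, Tuple


def doublings_per_day(n_start: float, n_end: float, days: float) -> float:
    """Average population doublings per day between two cell counts."""
    if n_start <= 0 or n_end <= 0 or days <= 0:
        raise ValueError("counts and elapsed days must be positive")
    return math.log2(n_end / n_start) / days


def interval_rates(timepoints: List[Tuple[float, float]]) -> List[float]:
    """Doublings per day for each consecutive pair of (day, count) points."""
    return [
        doublings_per_day(n0, n1, t1 - t0)
        for (t0, n0), (t1, n1) in zip(timepoints, timepoints[1:])
    ]


# Hypothetical counts: 1,000 cells growing to 8,000 over three days.
endpoint_rate = doublings_per_day(1000, 8000, 3)
rates_over_time = interval_rates([(0, 1000), (1, 2000), (3, 8000)])
```

      A negative value from `doublings_per_day` indicates net cell loss over the interval, which an endpoint-only readout would not distinguish from slow growth.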

    4. Reviewer #3 (Public review):

      Summary:

      The manuscript by Chang and colleagues provides new insights into how cancer cells adapt their metabolism under nutrient-deprived conditions. They find cells respond differentially to serine and lipid deprivation via oxidising the cell redox state, which enables biomass synthesis and cell proliferation. They identified mitochondrial respiration as the major mechanism that dictates the endogenous NAD+/NADH ratio. By incorporating a dual stress paradigm, serine and lipid deprivation, the study further suggests that the NAD+/NADH ratio can serve as a link to orchestrate the complex interplay between multiple nutrient changes in the tumour microenvironment.

      Strengths:

      A novel aspect of this study is the idea that cancer cells are not uniformly passive victims of nutrient limitation; some can actively invoke endogenous NAD+ regeneration to combat nutrient stress. The conclusion is well-supported by comparing multiple cell lines from different tissues and genetic backgrounds, which improves generalizability. While most of the smaller conclusions align with common reasoning and expectations, the step-by-step deduction that leads to a novel 'big picture' is commendable. Another notable strength is the integration of dual stress (lipid and serine deprivation), which better mimics the complex tumor microenvironment with multiple nutrient fluctuations, raising the translational potential of these findings. The observation that lipid-deprived cells can stimulate serine synthesis and support proliferation in a subset of cancer cell lines offers a novel perspective on metabolic plasticity under starvation conditions.

      Weaknesses:

      Although the authors derive a novel and valuable overarching concept, the presentation of this "big picture" is not clearly articulated, making it less accessible to readers outside the immediate field. It would greatly enhance the manuscript to include a clearer summary of the overarching model and its implications. Additionally, discussing the potential clinical significance and applications of the findings would increase the relevance and broader impact of the work. Finally, the manuscript's clarity and credibility are undermined by inconsistent figure labeling and the lack of statistical analysis, particularly for the Western blot data.

      While this study identifies changes in serine synthesis, mitochondrial respiration, PHGDH protein levels, and NAD+/NADH ratio in different cell lines, some of these relationships appear correlative rather than causally established (Figure 2; Figure 5; Figure 6). Some claims are thus overinterpreted. For example, the co-occurrence of increased NAD+/NADH ratio and citrate levels under lipid deprivation in A549 cells does not establish causality (Figure 5). Direct perturbation experiments that manipulate NAD+/NADH and assess downstream effects on citrate synthesis would substantially strengthen the conclusions.

      The study focuses predominantly on mitochondrial respiration as a source of NAD+ regeneration. However, it will also be interesting to check other significant pathways, such as NAD+ salvage, which have been implicated in supporting serine biosynthesis. In addition, the subcellular distribution of NAD+ may distinguish whether some cells are truly redox-unresponsive. Mitochondrial NAD+ regeneration might counteract the cytosolic NAD+ consumption, rendering a relatively stable intracellular NAD+/NADH ratio. The malate-aspartate shuttle can be an interesting aspect.

      The authors should acknowledge the limitations of short-term isotope tracing in their experimental design. Differences in metabolic rates across cell lines can affect the kinetics of metabolite labeling, limiting the direct comparability of metabolic fluxes between them. As a result, observed changes may reflect transient adaptations rather than stable metabolic reprogramming. It is important to clarify that the study primarily captures short-term responses, and the conclusions may not extrapolate to longer-term adaptations or protein-level changes under sustained nutrient stress.

    1. eLife Assessment

      Weiss et al. provide important new insights and convincing evidence to further our mechanistic understanding of how antigen presentation shapes skin persistence of CD8+ TRM. Using a mouse model for inducible genetic ablation of transforming growth factor beta receptor 3 (TGFBR3) in CD8+ T cells, they demonstrate TGFBR3's role in regulating CD8+ TRM persistence in skin. Furthermore, they show that the strength of T cell receptor (TCR) engagement upon initial CD8+ TRM skin seeding has a positive influence on subsequent TRM expansion following a secondary antigen-reencounter. Together, these mechanisms add to our understanding of how the skin CD8+ T cell repertoire is dynamically responsive to topical antigen.

    2. Reviewer #1 (Public review):

      Summary:

      Weiss et al. seek to delineate the mechanisms by which antigen-specific CD8+ T cells outcompete bystanders in the epidermis when active TGF-b is limiting, resulting in selective retention of these cells and more complete differentiation into the TRM phenotype.

      Strengths:

      They begin by demonstrating that at tissue sites where cognate antigen was expressed, CD8+ T cells adopt a more mature TRM transcriptome than cells at tissue sites where cognate antigen was never expressed. By integrating their scRNA-Seq data on TRM with the much more comprehensive ImmGenT atlas, the authors provide a very useful resource for future studies in the field. Furthermore, they conclusively show that these "local antigen-experienced" TRM have increased proliferative capacity and that TCR avidity during TRM formation positively correlates with their future fitness. Finally, using an elegant experimental strategy, they establish that TCR signaling in CD8+ T cells in epidermis induces TGFBRIII expression, which likely contributes to endowing them with a competitive advantage over antigen-inexperienced TRM.

      Weaknesses:

      The main weakness in this paper lies in the authors' reliance on a single model to derive conclusions on the role of local antigen during the acute phase of the response by comparing T cells in model antigen-vaccinia virus (VV-OVA) exposed skin to T cells in contralateral skin exposed to DNFB 5 days after the VV-OVA exposure. In this setting, antigen-independent factors may contribute to the difference in CD8+ T cell number and phenotype at the two sites. For example, it was recently shown that very early memory precursors (formed 2 days after exposure) are more efficient at seeding the epithelial TRM compartment than those recruited to skin at later times (Silva et al, Sci Immunol, 2023). DNFB-treated skin may therefore recruit precursors with reduced TRM potential. In addition, TRM-skewed circulating memory precursors have been identified (Kok et al, JEM, 2020), and perhaps VV-OVA exposed skin more readily recruits this subset compared to DNFB-exposed skin. Therefore, when the DNFB challenge is performed 5 days after vaccinia virus, the DNFB site may already be at a disadvantage in the recruitment of CD8+ T cells that can efficiently form TRM. In addition, CD8+ T cell-extrinsic mechanisms may be at play, such as differences in myeloid cell recruitment and differentiation or local cytokine and chemokine levels in VV-infected and DNFB-treated skin that could account for differences seen in TRM phenotype and function between these two sites. Although the authors do show that providing exogenous peptide antigen at the DNFB-site rescues their phenotype in relation to the VV-OVA site, the potential antigen-independent factors distinguishing these two sites remain unaddressed. 
      In addition, there is a possibility that peptide treatment of DNFB-treated skin initiates a second phase of priming of new circulatory effectors in the local draining lymph nodes that are then recruited to form TRM at the DNFB site, and that the effect does not solely rely on TRM precursors at the DNFB-treated skin site at the time of peptide treatment.

      Secondly, although the authors conclusively demonstrate that TGFBRIII is induced by TCR signals and required for conferring increased fitness to local-antigen-experienced CD8+ TRM compared to local antigen-inexperienced cells, this is done in only one experiment, albeit repeated 3 times. The data suggest that antigen encounter during TRM formation induces sustained TGFBRIII expression that persists during the antigen-independent memory phase. It remains unclear why only the antigen encounter in skin, but not already in the draining lymph nodes, induces sustained TGFBRIII expression. Further characterizing the dynamics of TGFBRIII expression on CD8+ T cells during priming in draining lymph nodes and over the course of TRM formation and persistence may shed more light on this question. Probing the role of this mechanism at other sites of TRM formation would also further strengthen their conclusions and enhance the significance of this finding.

    3. Reviewer #2 (Public review):

      Summary:

      The authors set out to dissect the mechanistic basis of their previously published finding that encountering cutaneous antigen augments the persistence of CD8+ memory T cells that enter skin (TRM) (Hirai et al., 2021, Immunity). Here they use the same murine model to study the fate of CD8+ T cells after antigen-priming in the lymph nodes, comparing (1) those that re-encounter antigen in the skin via vaccinia virus (VV) with (2) those that do not encounter antigen in skin but rather are recruited via topical dinitrofluorobenzene (DNFB) (so-called "bystander TRM"). The authors' previous publication establishes that this first group of CD8+ TRM has a persistence advantage over bystander TRM under TGFb-limiting conditions. The current paper advances this finding by elucidating the role of TGFBR3 in regulating CD8+ TRM skin persistence upon topical antigen exposure. A key novelty of the work lies in the generation and use of the CD8+ T cell-specific TGFBR3 knockout model, which allows them to demonstrate the role of TGFBR3 in fine-tuning the degree of CD8+ T cell skin persistence and that TGFBR3 expression is promoted by CD8+ TRM encountering their cognate antigen upon initial skin entry. Future work directly measuring active TGFb in the skin under different conditions would help identify physiologic scenarios that yield active TGFb-limiting conditions, thus establishing physiologic relevance.

      Strengths:

      Technical strengths of the paper include (1) complementary imaging and flow cytometry analyses, (2) integration of their scRNA-seq data with the existing CD8+ TRM literature via pathway analysis, and (3) use of orthogonal models where possible. Using a vaccinia virus (VV) model, with and without ovalbumin (OVA), the authors investigate how topical antigen exposure and TCR strength regulate CD8+ TRM skin recruitment and retention. The authors use both FTY720 and a Thy1.1 depleting antibody to demonstrate that skin CD8+ TRM expand locally following both a primary and secondary recall response to topical OVA application.

      A conceptual strength of the paper is the authors' observation that TCR signal strength upon initial TRM tissue entry helps regulate the extent of their local re-expansion on subsequent antigen re-exposure. They achieved this by applying peptides of varying affinity for the OT-I TCR on the DNFB-exposed flank in tandem with initial VV-OVA + DNFB treatment. They then measured TRM expansion after OVA peptide rechallenge, revealing that encountering a higher-affinity peptide upon skin entry leads to greater subsequent re-expansion. Additionally, by generating an OT-I Thy1.1+ E8i-creERT2 huNGFR Tgfbr3fl/fl (Tgfbr3∆CD8) mouse, the authors were able to elucidate a unique role for TGFBR3 in CD8+ TRM persistence when active TGFb in skin is limited.

      Weaknesses:

      Overall, the authors' conclusions are well supported, although there are some instances where additional controls, experiments, or clarifications would add rigor. The conclusions regarding skin-localized TCR signaling leading to increased skin CD8+ TRM proliferation in-situ and increased TGFBR3 expression would be strengthened by assessing skin CD8+ TRM proliferation and TGFBR3 expression in models of high versus low avidity topical OVA-peptide exposure. The authors could further increase the novelty of the paper by exploring whether TGFBR3 is regulated at the RNA or protein level. To this end, they could perform analysis of their single-cell RNA sequencing data (Figure 1), comparing Tgfbr3 mRNA in DNFB versus VV-treated skin.

      For clarity, when discussing antigen exposure throughout the paper, it would be helpful for the authors to be more precise that they are referring to the antigen in the skin rather than in the draining lymph node. A more explicit summary of some of the lab's previous work focused on CD8+ TRM and the role of TGFb would also help readers better contextualize this work within the existing literature on which it builds.

      For rigor, it would be helpful where possible to pair flow cytometry quantification with the existing imaging data. Additional controls, namely enumerating TRM in the opposite, untreated flank skin of VV-only-treated mice and the treated flank skin of DNFB-only-treated mice, would help contextualize the results seen in dually-treated mice in Figure 1. In figure legends, we suggest clearly reporting unpaired t-tests comparing relevant metrics within VV or DNFB-treated groups (for example, VV-OVA PBS vs VV-OVA FTY720 in Figure 3F). Finally, quantifying right and left skin draining lymph node CD8+ T cell numbers would clarify the skin specificity and cell trafficking dynamics of the authors' model.

    1. eLife Assessment

      This study presents a useful framework to extract the individuality index to predict subjects' behavior in the target tasks. However, the evidence supporting such a framework is somewhat incomplete and would benefit from overall framing and clarity on its approaches. Overall, this study would be of interest to cognitive and AI researchers who work on cognitive models in general.

    2. Reviewer #1 (Public review):

      Summary

      The manuscript presents EIDT, a framework that extracts an "individuality index" from a source task to predict a participant's behaviour in a related target task under different conditions. However, the evidence that it truly enables cross-task individuality transfer is not convincing.

      Strengths

      The EIDT framework is clearly explained, and the experimental design and results are generally well-described. The performance of the proposed method is tested on two distinct paradigms: a Markov Decision Process (MDP) task (comparing 2-step and 3-step versions) and a handwritten digit recognition (MNIST) task under various conditions of difficulty and speed pressure. The results indicate that the EIDT framework generally achieved lower prediction error compared to baseline models and that it was better at predicting a specific individual's behaviour when using their own individuality index compared to using indices from others.

      Furthermore, the individuality index appeared to form distinct clusters for different individuals, and the framework was better at predicting a specific individual's behaviour when using their own derived index compared to using indices from other individuals.

      Weaknesses

      (1) Because the "source" and "target" tasks are merely parameter variations of the same paradigm, it is unclear whether EIDT achieves true cross-task transfer. The manuscript provides no measure of how consistent each participant's behaviour is across these variants (e.g., two- vs three-step MDP; easy vs difficult MNIST). Without this measure, the transfer results are hard to interpret. In fact, Figure 5 shows a notable drop in accuracy when transferring between the easy and difficult MNIST conditions, compared to transfers between accuracy-focused and speed-focused conditions. Does this discrepancy simply reflect larger within-participant behavioural differences between the easy and difficult settings? A direct analysis of intra-individual similarity for each task pair - and how that similarity is related to EIDT's transfer performance - is needed.

      (2) Related to the previous comment, the individuality index is central to the framework, yet remains hard to interpret. It shows much greater within-participant variability in the MNIST experiment (Figure S1) than in the MDP experiment (Figure 3). Is such a difference meaningful? It is hard to know whether it reflects noisier data, greater behavioural flexibility, or limitations of the model.

      (3) The authors suggest that the model's ability to generalize to new participants "likely relies on the fact that individuality indices form clusters and individuals similar to new participants exist in the training participant pool". It would be helpful to directly test this hypothesis by quantifying the similarity (or distance) of each test participant's individuality index to the individuals or identified clusters within the training set, and assessing whether greater similarity (or closer proximity) to the clusters in the training set is associated with higher prediction accuracy for those individuals in the test set.
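
      The analysis requested in point (3) could be sketched roughly as follows. This is only an illustration under stated assumptions: the index dimensionality (2), the participant counts, and the synthetic prediction errors are hypothetical placeholders, and the function names are not taken from the manuscript.

```python
import numpy as np

def nearest_train_distance(test_idx, train_idx):
    """Euclidean distance from each test index vector to its closest training vector."""
    diff = test_idx[:, None, :] - train_idx[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def spearman(a, b):
    """Spearman rank correlation (no tie correction; adequate for continuous data)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Hypothetical placeholder data; replace with real indices and per-participant errors.
rng = np.random.default_rng(0)
train_idx = rng.normal(size=(60, 2))  # individuality indices, training participants
test_idx = rng.normal(size=(20, 2))   # individuality indices, held-out participants

dist = nearest_train_distance(test_idx, train_idx)

# Synthetic errors built to grow with distance, purely to exercise the code.
pred_error = 0.5 + 0.3 * dist + rng.normal(scale=0.01, size=dist.shape)

rho = spearman(dist, pred_error)
print(f"Spearman rho, distance to training set vs prediction error: {rho:.2f}")
```

      A clearly positive rank correlation on the real data would support the authors' explanation; a correlation near zero would suggest that generalization does not depend on having close neighbours in the training pool.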

    3. Reviewer #2 (Public review):

      This paper introduces a framework for modeling individual differences in decision-making by learning a low-dimensional representation (the "individuality index") from one task and using it to predict behaviour in a different task. The approach is evaluated on two types of tasks: a sequential value-based decision-making task and a perceptual decision task (MNIST). The model shows improved prediction accuracy when incorporating this learned representation compared to baseline models.

      The motivation is solid, and the modelling approach is interesting, especially the use of individual embeddings to enable cross-task generalization. That said, several aspects of the evaluation and analysis could be strengthened.

      (1) The MNIST SX baseline appears weak. RTNet isn't directly comparable in structure or training. A stronger baseline would involve training the GRU directly on the task without using the individuality index, e.g., by fixing the decoder head. This would provide a clearer picture of what the index contributes.

      (2) Although the focus is on prediction, the framework could offer more insight into how behaviour in one task generalizes to another. For example, simulating predicted behaviours while varying the individuality index might help reveal what behavioural traits it encodes.

      (3) It's not clear whether the model can reproduce human behaviour when acting on-policy. Simulating behaviour using the trained task solver and comparing it with actual participant data would help assess how well the model captures individual decision tendencies.

      (4) Figures 3 and S1 aim to show that individuality indices from the same participant are closer together than those from different participants. However, this isn't fully convincing from the visualizations alone. Including a quantitative presentation would help support the claim.

      (5) The transfer scenarios are often between very similar task conditions (e.g., different versions of MNIST or two-step vs three-step MDP). This limits the strength of the generalization claims. In particular, the effects in the MNIST experiment appear relatively modest, and the transfer is between experimental conditions within the same perceptual task. To better support the idea of generalizing behavioural traits across tasks, it would be valuable to include transfers across more structurally distinct tasks.

      (6) For both experiments, it would help to show basic summaries of participants' behavioural performance. For example, in the MDP task, first-stage choice proportions based on transition types are commonly reported. These kinds of benchmarks provide useful context.

      (7) For the MDP task, consider reporting the number or proportion of correct choices in addition to negative log-likelihood. This would make the results more interpretable.

      (8) In Figure 5, what is the difference between "% correct" and "% match to behaviour"? If these are distinct measures, it would help to clarify the distinction in the text or figure captions.

      (9) For the cognitive model, it would be useful to report the fitted parameters (e.g., learning rate, inverse temperature) per individual. This can offer insight into what kinds of behavioural variability the individuality index might be capturing.

      (10) A few of the terms and labels in the paper could be made more intuitive. For example, the name "individuality index" might give the impression of a scalar value rather than a latent vector, and the labels "SX" and "SY" are somewhat arbitrary. You might consider whether clearer or more descriptive alternatives would help readers follow the paper more easily.

      (11) Please consider including training and validation curves for your models. These would help readers assess convergence, overfitting, and general training stability, especially given the complexity of the encoder-decoder architecture.
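
      For point (4), one possible quantitative summary is the gap between mean between-participant and mean within-participant distances, with a label-permutation test. The sketch below uses hypothetical data (10 participants, 4 index estimates each, 2-dimensional indices); none of these numbers or function names come from the manuscript.

```python
import numpy as np

def within_between_gap(indices, participant_ids):
    """Mean between-participant distance minus mean within-participant distance."""
    diff = indices[:, None, :] - indices[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = participant_ids[:, None] == participant_ids[None, :]
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)  # count each pair once
    return dist[upper & ~same].mean() - dist[upper & same].mean()

def permutation_p_value(indices, participant_ids, n_perm=1000, seed=0):
    """One-sided p-value: how often shuffled labels give a gap at least as large."""
    rng = np.random.default_rng(seed)
    observed = within_between_gap(indices, participant_ids)
    null = np.array([
        within_between_gap(indices, rng.permutation(participant_ids))
        for _ in range(n_perm)
    ])
    p_value = (1 + (null >= observed).sum()) / (1 + n_perm)
    return observed, p_value

# Hypothetical data: each participant has a centre plus small estimation noise.
rng = np.random.default_rng(1)
participant_ids = np.repeat(np.arange(10), 4)
centres = rng.normal(scale=2.0, size=(10, 2))
indices = centres[participant_ids] + rng.normal(scale=0.3, size=(40, 2))

gap, p = permutation_p_value(indices, participant_ids)
print(f"between-minus-within gap = {gap:.2f}, permutation p = {p:.4f}")
```

      Shuffling participant labels keeps the overall point cloud fixed, so the test isolates whether estimates from the same person are closer than expected by chance.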

    4. Reviewer #3 (Public review):

      Summary:

      This work presents a novel neural network-based framework for parameterizing individual differences in human behavior. Using two distinct decision-making experiments, the authors demonstrate the approach's potential and claim it can predict individual behavior (1) within the same task, (2) across different tasks, and (3) across individuals. While the goal of capturing individual variability is compelling and the potential applications are promising, the claims are weakly supported, and I find that the underlying problem is conceptually ill-defined.

      Strengths:

      The idea of using neural networks for parameterizing individual differences in human behavior is novel, and the potential applications can be impactful.

      Weaknesses:

      (1) To demonstrate the effectiveness of the approach, the authors compare a Q-learning cognitive model (for the MDP task) and RTNet (for the MNIST task) against the proposed framework. However, as I understand it, neither the cognitive model nor RTNet is designed to fit or account for individual variability. If that is the case, it is unclear why these models serve as appropriate baselines. Isn't it expected that a model explicitly fitted to individual data would outperform models that do not? If so, does the observed superiority of the proposed framework simply reflect the unsurprising benefit of fitting individual variability? I think the authors should either clarify why these models constitute fair control or validate the proposed approach against stronger and more appropriate baselines.

      (2) It's not very clear in the results section what is meant by having a shorter within-individual distance than between-individual distances. Related to the comment above, is there any control analysis performed for this? Also, this analysis appears to have nothing to do with predicting individual behavior. Is this evidence toward successfully parameterizing individual differences? Could this be task-dependent, especially since the transfer is evaluated on exceedingly similar tasks in both experiments? I think a bit more discussion of the motivation and implications of these results will help the reader in making sense of this analysis.

      (3) The authors have to better define what exactly they mean by transferring across different "tasks" and testing the framework in "more distinctive tasks". All presented evidence, taken at face value, demonstrates transfer across different "conditions" of the same task within the same experiment. It is unclear to me how generalizable the framework will be when applied to different tasks.

      (4) Conceptually, it is also unclear to me how plausible it is that the framework could generalize across tasks spanning multiple cognitive domains (if that's what is meant by more distinctive). For instance, how can an individual's task performance on a Posner task predict task performance on the Cambridge face memory test? Which part of the framework could have enabled such a cross-domain prediction of task performance? I think these have to be at least discussed to some extent, since without it the future direction is meaningless.

      (5) How is the negative log-likelihood, which seems to be the main metric for comparison, computed? Is this based on trial-by-trial response prediction or probability of responses, as is usually done in cognitive modelling?

      (6) None of the presented evidence is cross-validated. The authors should consider performing K-fold cross-validation on the train, test, and evaluation split of subjects to ensure robustness of the findings.

      (7) The authors excluded 25 subjects (20% of the data) for different reasons. This is a substantial proportion, especially by the standards of what is typically observed in behavioral experiments. The authors should provide a clear justification for these exclusion criteria and, if possible, cite relevant studies that support the use of such stringent thresholds.

      (8) The authors should do a better job of creating the figures and writing the figure captions. It is unclear which specific claim the authors are addressing with each figure. For example, what is the key message of Figure 2C regarding transfer within and across participants? Why is the presentation of statistics different between the Cognitive model and the EIDT framework plots? In Figure 3, it's unclear what these dots and clusters represent and how they support the authors' claim that the same individual forms clusters. Also, doesn't this experiment have 98 subjects after exclusion? The plot has far fewer than 98 dots as far as I can tell. Furthermore, I find Figure 5 particularly confusing, as the underlying claim it is meant to illustrate is unclear. Clearer figures and more informative captions are needed to guide the reader effectively.

      (9) I also find the writing somewhat difficult to follow. The subheadings are confusing, and it's often unclear which specific claim the authors are addressing. The presentation of results feels disorganized, making it hard to trace the evidence supporting each claim. Also, the excessive use of acronyms (e.g., SX, SY, CG, EA, ES, DA, DS) makes the text harder to parse. I recommend restructuring the results section to be clearer and significantly reducing the use of unnecessary acronyms.
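
      To make points (5) and (6) concrete, the sketch below shows the conventional trial-by-trial negative log-likelihood and a subject-level K-fold split. Whether the manuscript computes its metric this way is not known; the function names, the toy probabilities, and the choice of five folds are assumptions for illustration. The count of 98 subjects is the post-exclusion number mentioned in point (8).

```python
import numpy as np

def mean_trial_nll(choice_probs, choices, eps=1e-12):
    """Average negative log-likelihood of the observed choices.

    choice_probs: (n_trials, n_actions) model-predicted probabilities per trial.
    choices:      (n_trials,) integer index of the action actually taken.
    """
    p_chosen = choice_probs[np.arange(len(choices)), choices]
    return -np.log(np.clip(p_chosen, eps, 1.0)).mean()

def subject_kfold(n_subjects, k, seed=0):
    """Yield (train_subjects, test_subjects); each subject is tested exactly once."""
    order = np.random.default_rng(seed).permutation(n_subjects)
    for fold in np.array_split(order, k):
        yield np.setdiff1d(order, fold), fold

# Toy example: three trials, two possible actions.
probs = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
choices = np.array([0, 0, 1])
nll = mean_trial_nll(probs, choices)  # equals -(ln 0.5 + ln 0.9 + ln 0.8) / 3
print(f"mean NLL per trial: {nll:.4f}")

# Subject-level folds: no participant appears in both train and test of a fold.
folds = list(subject_kfold(n_subjects=98, k=5))
for train_subjects, test_subjects in folds:
    print(len(train_subjects), len(test_subjects))
```

      Splitting by subject rather than by trial matters here because the claim under test is generalization to unseen participants.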

    1. eLife Assessment

      This manuscript makes important contributions to the methodology commonly used to assess representational structures in human and animal brain activity recorded using various techniques (especially fMRI). The evidence in the form of mathematical analysis and simulations is solid. The impact of this contribution could be improved by extending the simulations to assess the effects of violations of explicit and implicit assumptions.

    2. Reviewer #1 (Public review):

      Summary:

      This work presents a formalism for the relationship between neural signals and pooled signals (e.g., voxel estimates in fMRI) and explores why correlation-based and mean-removed Euclidean RDMs perform well in practice. The key assumption is that the pooled estimates are weighted averages, with i.i.d. non-negative weights. Two sets of simulations are used to support the theoretical findings: one based on fully simulated neural data and another that reverse-engineers neural data from an RDM estimated from real macaque data. The authors also discuss limitations of their simulations, particularly concerning the i.i.d. assumption of the weights.

      Strengths:

      The strengths of this work include its mathematical rigor and the clear connection that is drawn between the derivations and empirical observations. The simulations were well-designed and easy to follow. One small suggestion: a brief explanation of what is meant by "sparse" in Figure 3 would help orient the reader without requiring them to jump ahead to the methods. Overall, I found the work engaging and insightful.

      Weaknesses:

      Although I appreciate the effort to explore *why* certain dissimilarity measures perform well, it wasn't clear how these findings would inform the practical choices of researchers conducting RDM-based analyses. Many researchers likely already use correlation-based or mean-removed Euclidean distance measures, given their popularity. In that case, how do these results provide additional value or guidance beyond current practice?

      Another aspect that could benefit from further clarification is the core assumption underlying the work - that channel-based activity reflects a non-negative weighted average of neural activity. Is this widely accepted as the most plausible model, or are there alternative relationships that researchers should consider? While this may seem intuitive, it's not something I would expect all readers to be familiar with, and only a single reference was provided to support it (which I unfortunately didn't have time to read). That said, I did appreciate the discussion of the i.i.d. assumption in the discussion section. Can more be said to educate researchers as to when the i.i.d. assumption might be violated?

      I didn't find the "Simulations based on neural data" section added much, and it risks being misinterpreted. The main difference here is that neural data were reverse-engineered from a macaque RDM and then used in simulations similar to those in the previous section. What is the added value of using a real RDM to generate simulated data? Were the earlier simulations lacking in some way? There's also a risk of readers mistakenly inferring that human dissimilarities have been reconstructed from macaque data, an assumption that goes beyond the paper's core message, which focuses on linking neural and channel-based signals from the *same* source. If this section is retained, the motivation should be clarified, and the implied parallel in Figure 6, between the human data and simulated data, should be reconsidered.

    3. Reviewer #2 (Public review):

      Summary:

      The paper is a methodological contribution to multivariate pattern analysis and, in particular, the analysis of representational geometry via pairwise representational distances, sometimes called representational dissimilarity analysis (RDA). The authors investigate through theoretical analysis and simulations how true representational distances (defined on the neural level) give rise to representational distances estimated from neurophysiological data, including fMRI and cell recordings. They demonstrate that, due to the way measurements sample neural activity, the activity common to all sampled neurons can be amplified in the representational geometry derived from these measurements, and therefore, an empirical representational geometry may deviate substantially from the true representational geometry. The authors propose to modify the obtained representational structure by removing the dimension corresponding to that common activity, and argue that such a removal of a single dimension does not relevantly affect the representational structure, again underpinned by mathematical analysis and simulation.
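
      The effect described above can be reproduced in a toy simulation. The following is a minimal sketch under assumed settings, not the authors' simulation: the sizes, the uniform weight distribution, and the scale of the common component are all arbitrary choices made for illustration.

```python
import numpy as np

def rdm(patterns):
    """Upper-triangle vector of pairwise Euclidean distances between rows."""
    diff = patterns[:, None, :] - patterns[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist[np.triu_indices(len(patterns), k=1)]

rng = np.random.default_rng(0)
n_stimuli, n_latent, n_neurons, n_channels = 8, 3, 2000, 200

# Neural patterns: low-dimensional stimulus structure plus a small
# stimulus-specific offset shared by every neuron (the "common activity").
latent = rng.normal(size=(n_stimuli, n_latent))
loadings = rng.normal(size=(n_latent, n_neurons))
common = rng.normal(scale=0.3, size=(n_stimuli, 1))
neural = latent @ loadings + common

# Each channel is a weighted average of neurons with i.i.d. non-negative weights.
weights = rng.uniform(0.0, 1.0, size=(n_neurons, n_channels))
channels = neural @ weights / n_neurons

# Mean removal: subtract the across-channel mean of each stimulus pattern.
channels_centred = channels - channels.mean(axis=1, keepdims=True)

r_raw = np.corrcoef(rdm(neural), rdm(channels))[0, 1]
r_centred = np.corrcoef(rdm(neural), rdm(channels_centred))[0, 1]
print(f"RDM correlation with neural geometry, raw channels:      {r_raw:.2f}")
print(f"RDM correlation with neural geometry, mean-removed data: {r_centred:.2f}")
```

      Because the weights have a non-zero mean, every channel carries a copy of the population-average response, so that single dimension dominates the raw channel distances; subtracting the across-channel mean leaves an approximately zero-mean projection of the neural patterns.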

      Importance:

      The paper may at first sight be tackling a specific problem within a specific subfield of cognitive neuroscience methods. However, understanding the structure of representations is a fundamental goal of cognitive psychology and cognitive neuroscience, and the fact that methods of representational geometry are not yet routinely used by the wider community may at least partially be due to uncertainty regarding the reliability of these methods. This paper is an important step towards clarifying and improving reliability, and therefore towards more widespread adoption of representational geometry methods.

      Strengths:

      The paper makes its argument generally well, relying on previous work by the authors as well as others to support assumptions about neural sampling by neurophysiological measurements. Their main points are underpinned by both detailed mathematical analysis and simulations, and the latter also produces intuitively accessible illustrations of the authors' argument. The authors discuss in detail under which exact circumstances common neural activity distorts the representational geometry, and therefore, when exactly the removal of the common dimension is necessary to minimize that distortion.

      Weaknesses:

      (1) The argument around the Johnson-Lindenstrauss lemma on pages 5 & 6 is somewhat confused, and also not really convincing.

      First, the correct reference for the lemma seems to be not [20] = Johnson et al. (1986), but Johnson & Lindenstrauss (1984). Moreover, as far as I can tell, Johnson et al. (1986) do not discuss random projections, and while they play a role in Johnson & Lindenstrauss (1984), that is only as a proof device. The paper text suggests that the lemma itself is probabilistic, while actually it is a statement of existence.

      Second, the authors correctly state that the lemma implies that "the number of measurement channels required for a good approximation does not depend on the number of neurons and grows only logarithmically with the number of stimuli", but it is not clear what the relevance of this statement for this paper is, considering that distances between N points can be exactly preserved within an N − 1 dimensional subspace, irrespective of the number of dimensions of the original space, and since in cognitive neuroscience the number of measurement channels is usually (much) larger than the number of experimental stimuli.
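
      The point about N points fitting into an N − 1 dimensional subspace can be checked numerically. The following is a minimal sketch, not taken from the paper: the array sizes, the random data, and the helper name `pairwise_sq_dists` are illustrative assumptions. It centres the points, uses an SVD to find an orthonormal basis of the subspace they span, and confirms that pairwise distances are unchanged in the reduced coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_dims = 6, 500          # hypothetical: 6 stimuli, 500 "neurons"
X = rng.standard_normal((n_points, n_dims))


def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    G = A @ A.T
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2 * G


# Centring does not change distances; the centred points span at most N - 1 dims.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Coordinates of each point in the (N - 1)-dimensional subspace.
Z = Xc @ Vt[: n_points - 1].T

max_err = np.abs(pairwise_sq_dists(X) - pairwise_sq_dists(Z)).max()
print(Z.shape, max_err)
```

      The reconstruction error is at the level of floating-point rounding, regardless of how large `n_dims` is, which is why the logarithmic bound of the lemma adds little when channels outnumber stimuli.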

      The actually centrally important statement is not the Johnson-Lindenstrauss lemma, but one about the metric-preserving properties of random projections with zero-mean weights. It is this statement that needs to be backed up by the correct references, which, as far as I can tell, are neither the cited Johnson et al. (1986) nor even Johnson & Lindenstrauss (1984) for the lemma.
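
      The statement about zero-mean weights can be made explicit with a short expectation calculation. This is a sketch under assumed notation that does not appear in the review: $d = a - b$ is the difference between two neural patterns over $D$ neurons, $y_j = \sum_i w_{ji} d_i$ is the difference as seen by channel $j$ of $M$, and the weights $w_{ji}$ are i.i.d. with mean $\mu$ and variance $\sigma^2$.

```latex
\begin{aligned}
\mathbb{E}\!\left[y_j^2\right]
  &= \operatorname{Var}(y_j) + \left(\mathbb{E}[y_j]\right)^2
   = \sigma^2 \lVert d \rVert^2 + \mu^2 \left(\mathbf{1}^\top d\right)^2, \\
\frac{1}{M}\,\mathbb{E}\!\left[\lVert W d \rVert^2\right]
  &= \sigma^2 \lVert d \rVert^2 + \mu^2 \left(\mathbf{1}^\top d\right)^2 .
\end{aligned}
```

      With $\mu = 0$ the expected measured squared distance is proportional to the true squared distance $\lVert d \rVert^2$. With $\mu \neq 0$ an extra term appears that depends only on the summed (population-mean) difference $\mathbf{1}^\top d$, which is the distortion along the common-activity dimension.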

      (2) The detailed mathematical analyses and simulations focus on the effect of non-zero-mean sampling weights, and that is justified by the result that such sampling leads to a distorted representational geometry. However, there is another assumption which seems to be used almost everywhere in both mathematical analyses and simulations, and which I suspect may have a relevant effect on the observed representational geometry: statistical independence between weights. In particular, in fMRI, the existence of a naturally limited spatial resolution (due to MRI technology or vasculature) makes it unlikely that the weights with which a given neuron affects different voxels are independent.

    4. Reviewer #3 (Public review):

      Summary:

      This manuscript investigates the conditions under which representational distances estimated from brain-activity measurements accurately mirror the true geometry of the underlying neural representations. Using a theoretical framework and simulations, the authors show that (i) random weighted sampling of individual neurons preserves representational distances; (ii) the non-negative pooling characteristic of fMRI stretches the geometry along the population-mean dimension; and (iii) subtracting the across-channel mean from each activity pattern removes this distortion, explaining the well-known success of correlation-based RSA. They further argue that a mean-centred, squared Euclidean (or Mahalanobis) distance retains this corrective benefit while avoiding some pitfalls of variance normalisation.
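
      Points (ii) and (iii) can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the authors' code: the weights are drawn i.i.d. from a uniform distribution on [0, 1], the population and channel counts are arbitrary, and the names `d_mean`, `d_orth`, and `sq_dist` are hypothetical. It compares two pattern differences of equal true length, one along the population-mean dimension and one orthogonal to it, before and after subtracting the across-channel mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_channels = 200, 2000   # hypothetical sizes

# Two pattern differences with the same true squared length:
# one purely along the population-mean dimension, one orthogonal to it.
d_mean = np.ones(n_neurons)
d_orth = rng.standard_normal(n_neurons)
d_orth -= d_orth.mean()
d_orth *= np.linalg.norm(d_mean) / np.linalg.norm(d_orth)

# Non-negative i.i.d. sampling weights (an assumed stand-in for voxel pooling).
W = rng.uniform(0.0, 1.0, size=(n_channels, n_neurons))


def sq_dist(d, center):
    """Measured squared distance, optionally after across-channel mean removal."""
    y = W @ d
    if center:
        y = y - y.mean()
    return float(y @ y)


raw_ratio = sq_dist(d_mean, center=False) / sq_dist(d_orth, center=False)
centered_ratio = sq_dist(d_mean, center=True) / sq_dist(d_orth, center=True)
print(raw_ratio, centered_ratio)
```

      In the raw measurements the mean-dimension difference appears far larger than the orthogonal one even though both have the same true length; after mean removal the ratio returns to roughly one. Because the measurement is linear, centring the difference is equivalent to centring each pattern before taking the difference.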

      Strengths:

      (1) Theoretical clarity and novelty: The paper offers an elegant and convincing proof of how linear measurement models affect representational geometry and pinpoints the specific condition (non-zero-mean sampling weights) under which voxel pooling introduces a systematic bias. This quantitative explanation of why mean removal is effective in RSA is new and valuable.

      (2) Simulations: Experiments on both synthetic high-dimensional fMRI data and macaque-IT-inspired embeddings corroborate the mathematics, providing practical insights into the theoretical reasoning outlined by the authors.

      (3) Actionable recommendations: The work summarises the results into clear guidelines: random single-unit sampling is "safe" (the estimated geometry is undistorted); fMRI voxel data with unstructured or single-scale codes should be mean-centred; and multi-scale cortical maps require explicit forward modelling. These guidelines are clear, and useful for future research.

      Weaknesses:

      (1) Simplistic assumptions: The assumption that measurement-channel weights are drawn independently and identically distributed (i.i.d.) from a univariate distribution is a significant idealisation for fMRI data. Voxels have spatially structured responses (and noise), meaning they do not sample neurons with i.i.d. weights. The extent to which the conclusions (especially the "exact recovery" with mean centring) hold when this assumption is violated needs more discussion. While the paper states that the non-negative IWLCS model is a best-case scenario, the implications of deviations from this best case could be elaborated.

      (2) Random-subpopulation model for electrophysiology: Similarly, the "random subpopulation model" is presented as an idealisation of single-cell recordings. In reality, electrophysiological sampling is often biased (e.g., towards larger, more active neurons or neurons in accessible locations). The paper acknowledges biased sampling as a challenge that requires separate modelling, but the gap between this idealised model and actual practice should be highlighted more strongly when interpreting the optimistic results.

      (3) Noise as an "orthogonal issue": The theoretical derivations largely ignore measurement noise, treating it as an orthogonal problem solvable by cross-validation. Although bias from noise is a well-known problem, interactions between noise and sampling-induced distortions (especially the down-scaling of orthogonal dimensions) could complicate the picture. For instance, if a dimension is already heavily down-scaled by averaging, it might become more susceptible to being obscured by noise. Addressing or highlighting these points more explicitly would make the limitations of this theoretical framework more transparent.

      (4) Simulation parameters and generalizability: The random ground-truth geometries were generated from a Gaussian mixture in 5-D and then embedded into 1,024-D, with ≈25 % of the variance coming from the mean dimension. The sensitivity of the findings to these specific parameters (initial dimensionality, geometry complexity, proportion of mean variance, and sample size) could be discussed. How would the results change if the true neural geometry had a much higher or lower intrinsic dimensionality, or if the population-mean component were substantially smaller or larger? If the authors' claims are to generalise, more scenarios should be considered.

      (5) Mean addition to the neural-data simulation: In simulations based on neural data from Kiani et al., a random mean was added to each pattern to introduce variation along the mean dimension. This was necessary because the original patterns had identical mean activation. However, the procedure might oversimplify how population means vary naturally and could influence the conclusions, particularly regarding the impact of the population-mean dimension. While precisely modelling how the mean varies across conditions is beyond the manuscript's scope, this point should be stated and discussed more clearly.

      (6) Effect of mean removal on representational geometry: As noted, the benefits of mean removal hold "under ideal conditions". Real data often violates these assumptions. A critical reader might ask: What if conditions differ in overall activation and in more complex ways (e.g., differing correlation structures across neurons)? Is it always desirable to remove population-mean differences? For example, if a stimulus truly causes a global increase in firing across the entire population (perhaps reflecting arousal or salience), subtracting the mean would treat this genuine effect as a nuisance and eliminate it from the geometry. Prior literature has cautioned that one should interpret RSA results after demeaning carefully. For instance, Ramírez (2017) dubbed this problem "representational confusion", showing that subtracting the mean pattern can change the relationships between conditions in non-intuitive ways. These potential issues and previous results should be discussed and properly referenced by the authors.

      Appraisal, Impact, and Utility:

      The authors set out to identify principled conditions under which measured representational distances faithfully reflect the underlying neural geometry and to provide practical guidance for RSA across modalities. Overall, I believe they achieved their goals. Theoretical derivations identify the bias-inducing factors in linear measurement models, and the simulations verify the analytic claims, demonstrating that mean-pattern subtraction can indeed correct some mean-related geometric distortions. These conclusions strongly rely on idealised assumptions (e.g., i.i.d. sampling weights and negligible noise), but the manuscript is explicit about them, and the reasoning from evidence to claim is sound. A deeper exploration of how robust each conclusion is to violations of these assumptions, particularly correlated voxel weights and realistic noise, would make the argument even stronger.

      Beyond their immediate aims, the authors offer contributions likely to shape future work. The manuscript is likely to influence both analysis decisions and the design of future studies exploring the geometry of brain representations. By clarifying why correlation-based RSA seems to work so robustly, they help demystify a practice that has so far been adopted heuristically. Their proposal to adopt mean-centred Euclidean or Mahalanobis distances promises a straightforward alternative that better aligns representational geometry with decoding-based interpretations.

      In sum, I see this manuscript as a significant and insightful contribution to the field. The theoretical work clarifying the impact of sampling schemes and the role of mean removal is highly valuable. However, the identified concerns, primarily regarding the idealized nature of the models (especially for fMRI), the treatment of noise, and the need for more nuanced claims, suggest that some revisions are necessary. Addressing these points would substantially strengthen the paper's conclusions and enhance its impact on the neuroscience community by ensuring the proposed methods are robustly understood and appropriately applied in real-world research settings.

    1. eLife Assessment

      This study makes an important contribution by showing that humans adapt learning rates rationally to environmental volatility yet systematically misattribute noise as volatility, demonstrating approximate rationality with simplified internal models. The evidence is compelling, encompassing a cleverly designed volatility-versus-noise paradigm, innovative lesion-based comparisons between reinforcement-learning and degraded Bayesian Observer Models, and convergent behavioural and pupillometric data. Expanding formal model comparisons (e.g., BIC/AIC) and directly contrasting RL and Bayesian fits to physiological markers would further enhance the work, but these are minor limitations that do not detract from the core findings.

    2. Reviewer #1 (Public review):

      Summary:

      The authors present an interesting study using RL and Bayesian modelling to examine differences in learning rate adaptation in conditions of high and low volatility and noise respectively. Through "lesioning" an optimal Bayesian model, they reveal that apparently suboptimal adaptation of learning rates results from incorrectly detecting volatility in the environment when it is not in fact present.

      Strengths:

      The experimental task used is cleverly designed and does a good job of manipulating both volatility and noise. The modelling approach takes an interesting and creative approach to understanding the source of apparently suboptimal adaptation of learning rates to noise, through carefully "lesioning" an optimal Bayesian model to determine which components are responsible for this behaviour.

      Weaknesses:

      The model space could be more extensive, although the authors have covered the most relevant models for the question at hand.

      Comments on revisions: I have no further recommendations for the authors; they have addressed my previous comments very well.

    3. Reviewer #2 (Public review):

      Summary:

      In this study, the authors aimed to investigate how humans learn and adapt their behavior in dynamic environments characterized by two distinct types of uncertainty: volatility (systematic changes in outcomes) and noise (random variability in outcomes). Specifically, they sought to understand how participants adjust their learning rates in response to changes in these forms of uncertainty.

      To achieve this, the authors employed a two-step approach:

      Reinforcement Learning (RL) Model: They first used an RL model to fit participants' behavior, revealing that the learning rate was context-dependent: it varied based on the levels of volatility and noise. However, the RL model showed that participants misattributed noise as volatility, leading to higher learning rates in noisy conditions, where the optimal strategy would be to be less sensitive to random fluctuations.

      Bayesian Observer Model (BOM): To better account for this context dependency, they introduced a Bayesian Observer Model (BOM), which models how an ideal Bayesian learner would update their beliefs about environmental uncertainty. They found that a degraded version of the BOM, where the agent had a coarser representation of noise compared to volatility, best fit the participants' behavior. This suggested that participants were not fully distinguishing between noise and volatility, instead treating noise as volatility and adjusting their learning rates accordingly.
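
      The role of the learning rate can be illustrated with a generic delta-rule learner. This is a minimal sketch, not the model fitted in the study: the function name `delta_rule`, the outcome distribution, and the two learning rates are illustrative assumptions. It shows why a high learning rate is costly when outcomes are noisy but the underlying mean is stable, since the estimate then chases random fluctuations.

```python
import numpy as np


def delta_rule(outcomes, alpha, v0=0.0):
    """Track outcomes with a fixed learning rate: v <- v + alpha * (outcome - v)."""
    v = v0
    estimates = np.empty(len(outcomes))
    for t, outcome in enumerate(outcomes):
        v += alpha * (outcome - v)
        estimates[t] = v
    return estimates


rng = np.random.default_rng(1)
# Hypothetical stable environment: constant mean 0.5, noisy outcomes.
outcomes = 0.5 + 0.2 * rng.standard_normal(5000)

low = delta_rule(outcomes, alpha=0.1, v0=0.5)
high = delta_rule(outcomes, alpha=0.8, v0=0.5)
print(low.var(), high.var())
```

      The estimate produced with the higher learning rate fluctuates much more around the true mean, which is the sense in which treating noise as volatility is suboptimal.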

      The authors also aimed to use pupillometry data (measuring pupil dilation) as a physiological marker to arbitrate between models and understand how participants' internal representations of uncertainty influenced both their behavior and physiological responses. Their objective was to explore whether the BOM could explain not just behavioral choices but also these physiological responses, thereby providing stronger evidence for the model's validity.

      Overall, the study sought to reconcile approximate rationality in human learning by showing that participants still follow a Bayesian-like learning process, but with simplified internal models that lead to suboptimal decisions in noisy environments.

      Strengths:

      The generative model presented in the study is both innovative and insightful. The authors first employ a Reinforcement Learning (RL) model to fit participants' behavior, revealing that the learning rate is context-dependent; specifically, it varies based on the levels of volatility and noise in the task. They then introduce a Bayesian Observer Model (BOM) to account for this context dependency, ultimately finding that a degraded BOM, in which the agent has a coarser representation of noise compared to volatility, provides the best fit to the participants' behavior. This suggests that participants are not fully distinguishing between noise and volatility, leading to misattribution of noise as volatility. Consequently, participants adopt higher learning rates even in noisy contexts, where an optimal strategy would involve being less sensitive to new information (i.e., using lower learning rates). This finding highlights a rational but approximate learning process, as described in the paper.

      Weaknesses:

      While the RL and Bayesian models both successfully predict behavior, it remains unclear how to fully reconcile the two approaches. The RL model captures behavior in terms of a fixed or context-dependent learning rate, while the BOM provides a more nuanced account with dynamic updates based on volatility and noise. Both models can predict actions when fit appropriately, but the pupillometry data offers a promising avenue to arbitrate between the models. However, the current study does not provide a direct comparison between the RL framework and the Bayesian model in terms of how well they explain the pupillometry data. It would be valuable to see whether the RL model can also account for physiological markers of learning, such as pupil responses, or if the BOM offers a unique advantage in this regard. A comparison of the two models using pupillometry data could strengthen the argument for the BOM's superiority, as currently, the possibility that RL models could explain the physiological data remains unexplored.

      The model comparison between the Bayesian Observer Model and the self-defined degraded internal model could be further enhanced. Since different assumptions about the internal model's structure lead to varying levels of model complexity, using a formal criterion such as Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC) would allow for a more rigorous comparison of model fit. Including such comparisons would ensure that the degraded BOM is not simply favored due to its flexibility or higher complexity, but rather because it genuinely captures the participants' behavioral and physiological data better than alternative models. This would also help address concerns about overfitting and provide a clearer justification for using the degraded BOM over other potential models.
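
      The suggested criteria are simple to compute once each model's maximised log-likelihood is available. The following is a sketch of the standard AIC and BIC formulas; the log-likelihoods, parameter counts, and trial count are made-up placeholders, not values from the study, and `model_a` and `model_b` are hypothetical labels.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Placeholder fits for two candidate models (illustrative numbers only).
model_a = {"log_likelihood": -410.0, "n_params": 5}
model_b = {"log_likelihood": -411.0, "n_params": 3}
n_trials = 600

aic_a, aic_b = aic(**model_a), aic(**model_b)
bic_a = bic(n_obs=n_trials, **model_a)
bic_b = bic(n_obs=n_trials, **model_b)
print(aic_a, aic_b, bic_a, bic_b)
```

      In this made-up case the simpler model is preferred by both criteria despite its slightly lower likelihood, because the extra parameters of the more complex model are not repaid by a sufficiently better fit.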

      Comments on revisions:

      The authors have addressed all my questions. Congratulations on the impressive work accomplished by the authors!

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors present an interesting study using RL and Bayesian modelling to examine differences in learning rate adaptation in conditions of high and low volatility and noise respectively. Through "lesioning" an optimal Bayesian model, they reveal that an apparently suboptimal adaptation of learning rates results from incorrectly detecting volatility in the environment when it is not in fact present.

      Strengths:

      The experimental task used is cleverly designed and does a good job of manipulating both volatility and noise. The modelling approach takes an interesting and creative approach to understanding the source of apparently suboptimal adaptation of learning rates to noise, through carefully "lesioning" an optimal Bayesian model to determine which components are responsible for this behaviour.

      We thank the reviewer for this assessment.

      Weaknesses:

      The study has few substantial weaknesses; the data and modelling both appear robust and informative, and it tackles an interesting question. The model space could potentially have been expanded, particularly with regard to the inclusion of alternative strategies such as those that estimate latent states and adapt learning accordingly.

      We thank the reviewer for this suggestion. We agree that it would be interesting to assess the ability of alternative models to reproduce the sub-optimal choices of participants in this study. The Bayesian Observer Model described in the paper is a form of Hierarchical Gaussian Filter, so we will assess the performance of a different class of models that are able to track uncertainty: RL-based models that are able to capture changes of uncertainty (the Kalman filter, and the model described by Cochran and Cisler, PLoS Comput Biol 2019). We will assess the ability of the models to recapitulate the core behaviour of participants (in terms of learning rate adaptation) and, if possible, assess their ability to account for the pupillometry response.
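
      For readers unfamiliar with the Kalman filter as a learning model, its gain plays the role of a learning rate that depends on both forms of uncertainty. The sketch below is a textbook scalar random-walk filter, not the authors' implementation: the function name `steady_state_gain` and the variance values are assumptions chosen for illustration, with `q` standing for the variance of changes in the tracked quantity (volatility) and `r` for the observation variance (noise).

```python
def steady_state_gain(q, r, n_iter=200):
    """Steady-state Kalman gain for x_t = x_{t-1} + N(0, q), y_t = x_t + N(0, r)."""
    p_post = 1.0   # posterior variance of the estimate (arbitrary starting value)
    gain = 0.0
    for _ in range(n_iter):
        p_prior = p_post + q              # uncertainty grows with volatility
        gain = p_prior / (p_prior + r)    # effective learning rate
        p_post = (1.0 - gain) * p_prior   # uncertainty shrinks after observing
    return gain


baseline = steady_state_gain(q=1.0, r=1.0)
more_volatile = steady_state_gain(q=4.0, r=1.0)
more_noisy = steady_state_gain(q=1.0, r=4.0)
print(baseline, more_volatile, more_noisy)
```

      The gain rises when volatility increases and falls when noise increases, which is the normative pattern of learning rate adaptation against which participants' relative insensitivity to noise is being compared.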

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors aimed to investigate how humans learn and adapt their behavior in dynamic environments characterized by two distinct types of uncertainty: volatility (systematic changes in outcomes) and noise (random variability in outcomes). Specifically, they sought to understand how participants adjust their learning rates in response to changes in these forms of uncertainty.

      To achieve this, the authors employed a two-step approach:

      (1) Reinforcement Learning (RL) Model: They first used an RL model to fit participants' behavior, revealing that the learning rate was context-dependent. In other words, it varied based on the levels of volatility and noise. However, the RL model showed that participants misattributed noise as volatility, leading to higher learning rates in noisy conditions, where the optimal strategy would be to be less sensitive to random fluctuations.

      (2) Bayesian Observer Model (BOM): To better account for this context dependency, they introduced a Bayesian Observer Model (BOM), which models how an ideal Bayesian learner would update their beliefs about environmental uncertainty. They found that a degraded version of the BOM, where the agent had a coarser representation of noise compared to volatility, best fit the participants' behavior. This suggested that participants were not fully distinguishing between noise and volatility, instead treating noise as volatility and adjusting their learning rates accordingly.

      The authors also aimed to use pupillometry data (measuring pupil dilation) as a physiological marker to arbitrate between models and understand how participants' internal representations of uncertainty influenced both their behavior and physiological responses. Their objective was to explore whether the BOM could explain not just behavioral choices but also these physiological responses, thereby providing stronger evidence for the model's validity.

      Overall, the study sought to reconcile approximate rationality in human learning by showing that participants still follow a Bayesian-like learning process, but with simplified internal models that lead to suboptimal decisions in noisy environments.

      Strengths:

      The generative model presented in the study is both innovative and insightful. The authors first employ a Reinforcement Learning (RL) model to fit participants' behavior, revealing that the learning rate is context-dependent; specifically, it varies based on the levels of volatility and noise in the task. They then introduce a Bayesian Observer Model (BOM) to account for this context dependency, ultimately finding that a degraded BOM, in which the agent has a coarser representation of noise compared to volatility, provides the best fit for the participants' behavior. This suggests that participants do not fully distinguish between noise and volatility, leading to the misattribution of noise as volatility. Consequently, participants adopt higher learning rates even in noisy contexts, where an optimal strategy would involve being less sensitive to new information (i.e., using lower learning rates). This finding highlights a rational but approximate learning process, as described in the paper.

      We thank the reviewer for their assessment of the paper.

      Weaknesses:

      While the RL and Bayesian models both successfully predict behavior, it remains unclear how to fully reconcile the two approaches. The RL model captures behavior in terms of a fixed or context-dependent learning rate, while the BOM provides a more nuanced account with dynamic updates based on volatility and noise. Both models can predict actions when fit appropriately, but the pupillometry data offers a promising avenue to arbitrate between the models. However, the current study does not provide a direct comparison between the RL framework and the Bayesian model in terms of how well they explain the pupillometry data. It would be valuable to see whether the RL model can also account for physiological markers of learning, such as pupil responses, or if the BOM offers a unique advantage in this regard. A comparison of the two models using pupillometry data could strengthen the argument for the BOM's superiority, as currently, the possibility that RL models could explain the physiological data remains unexplored.

      We thank the reviewer for this suggestion. In the current version of the paper, we use an extremely simple reinforcement learning model solely to measure the learning rate in each task block (as this is the key behavioural metric we are interested in). As the reviewer highlights, this simple model doesn’t estimate uncertainty or adapt to it. Given this, we don’t think we can directly compare this model to the Bayesian Observer Model. For example, in the current analysis of the pupillometry data we classify individual trials based on the BOM’s estimate of uncertainty and show that participants adapt their learning rate as expected to the reclassified trials; this analysis would not be possible with our current RL model. However, there are more complex RL-based models that do estimate uncertainty (as discussed above in response to Reviewer #1) and so may be compared more directly to the BOM. We will attempt to apply these models to our task data and describe their ability to account for participant behaviour and physiological response as suggested by the Reviewer.

      The model comparison between the Bayesian Observer Model and the self-defined degraded internal model could be further enhanced. Since different assumptions about the internal model's structure lead to varying levels of model complexity, using a formal criterion such as Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC) would allow for a more rigorous comparison of model fit. Including such comparisons would ensure that the degraded BOM is not simply favored due to its flexibility or higher complexity, but rather because it genuinely captures the participants' behavioral and physiological data better than alternative models. This would also help address concerns about overfitting and provide a clearer justification for using the degraded BOM over other potential models.

      Thank you, we will add this.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      For clarity, the methods would benefit from further detail of task framing to participants. I.e. were there explicit instructions regarding volatility/task contingencies? Or were participants told nothing?

      We have added in the following explanatory text to the methods section (page 20), clarifying the limited instructions provided to participants:

      “Participants were informed that the task would be split into 6 blocks, that they had to learn which was the best option to choose, and that this option may change over time. They were not informed about the different forms of uncertainty we were investigating or of the underlying structure of the task (that uncertainty varied between blocks).”

      In the results, it would be useful to report the general task behavior of participants to get a sense of how they performed across different parts of the task. Also, were participants excluded if they didn't show evidence of learning adaptation to volatility?

      We have added the following text reporting overall performance to the results (page 6):

      “Participants were able to learn the best option to choose in the task, selecting the most highly rewarded option on an average of 71% of trials (range 65% - 74%).”

      And the following text to the methods, confirming that participants were not excluded if they didn’t respond to volatility/noise (the failure in this adaptation is the focus of the current study) (page 19):

      “No exclusion criteria related to task performance were used.”

      The results would benefit from a more intuitive explanation of what the lesioning is trying to recapitulate; this can get quite technical and the objective is not necessarily clear, especially for the less computationally-minded reader.

      We have amended the relevant section of the results to clarify this point (page 9):

      “Having shown that an optimal learner adjusts its learning rate to changes in volatility and noise as expected, we next sought to understand the relative noise insensitivity of participants. In these analyses we “lesion” the BOM, to reduce its performance in some way, and then assess whether doing so recapitulates the pattern of learning rate adaptation observed for participants (Fig 3e). In other words, we damage the model so it performs less well and then assess whether this damage makes the behaviour of the BOM (shown in Fig 3f) more closely resemble that seen in participants (Fig 3e).”

      The modelling might be improved by the inclusion of another class of model. Specifically, models that adapt learning rates in response to the estimation of latent states underlying the current task outcomes would be very interesting to see. In a sense, these are also estimating volatility through changeability of latent states, and it would be interesting to explore whether the findings could also be explained by an incorrect assumption that the latent state has changed when outcomes are noisy.

      Thank you for this suggestion. We have added additional sections to the supplementary materials in which we use a general latent state model and a simple RL model to try to recapitulate the behaviour of participants (and to compare with the BOM). These additional sections are extensive, so are not reproduced here. We have also added in a section to the discussion in the main paper covering this interesting question in which we confirm that we were unable to reproduce participant behaviour (or the normative effect of the lesioned BOMs) using these models but suggest that alternative latent state formulations would be interesting to explore in future work (page 18):

      “A related question is whether other, non-Bayesian model formulations may be able to account for participants’ learning adaptation in response to volatility and noise. Of note, the reinforcement learning model used to measure learning rates in separate blocks does not achieve this goal, as this model is fitted separately to each block rather than adapting between blocks (NB the simple reinforcement learning model that is fitted across all blocks does not capture participant behaviour, see supplementary information). One candidate class of model that has potential here is latent-state models (Cochran & Cisler, 2019), in which the variance and unexpected changes in the process being learned (which have a degree of similarity with noise and volatility respectively) are estimated and used to alter the model’s rates of updating as well as the estimated number of states being considered. Using the model described by Cochran and Cisler, we were unable to replicate the learning rate adaptation demonstrated by participants in the current study (see supplementary information), although it remains possible that other latent state formulations may be more successful.”

      The discussion may benefit from a little more discussion of where this work leads us - what is the next step?

      As above, we have added in a suggestion about future modelling work. We have also added in a section about the outstanding interesting questions concerning the neural representation of these quantities, reproduced in response to the suggestion by reviewer #2 below.

      Reviewer #2 (Recommendations for the authors):

      The study presents an opportunity to explore potential neural coding models that could account for the cognitive processes underlying the task. In the field of neural coding, noise correlation is often measured to understand how a population of neurons responds to the same stimulus, which could be related to the noise signal in this task. Since the brain likely treats the stimulus as the same, with noise representing minor changes, this aspect could be linked to the participants' difficulty distinguishing noise from volatility. On the other hand, signal correlation is used to understand how neurons respond to different stimuli, which can be mapped to the volatility signal in the task. It would be highly beneficial if the authors could discuss how these established concepts from neural population coding might relate to the Bayesian behavior model used in the study. For instance, how might neurons encode the distinction between noise and volatility at a population level? Could noise correlation lead to the misattribution of noise as volatility at a neural level, mirroring the behavioral findings? Discussing possible neural models that could explain the observed behavior and relating it to the existing literature on neural population coding would significantly enrich the discussion. It would also open up avenues for future research, linking these behavioral findings to potential neural mechanisms.

      We thank the reviewer for this interesting suggestion. We have added the following paragraph to the discussion section, which we hope does justice to this interesting question (page 18):

      “Previous work examining the neural representations of uncertainty have tended to report correlations between brain activity and some task-based estimate of one form of uncertainty at a time (Behrens et al., 2007; Walker et al., 2020, 2023). We are not aware of work that has, for example, systematically varied volatility and noise and reported distinct correlations for each. An interesting possibility as to how different forms of uncertainty may be encoded is suggested by parallels with the neuronal decoding literature. One question addressed by this literature is how the brain decodes changes in the world from the distributed, noisy neural responses to those changes, with a particular focus on the influence of different forms of between-neuron correlation (Averbeck et al., 2006; Kohn et al., 2016). Specifically, signal-correlation, the degree to which different neurons represent similar external quantities (required to track volatility) is distinguished from, and often limited by, noise-correlation, the degree to which the activity of different neurons covaries independently of these external quantities. One possibility relevant to the current study, which resembles the underlying logic of the BOM, is that a population of neurons represents the estimated mean of the generative process that produces task outcomes. In this case, volatility would be tracked as the signal-correlation across this population, whereas noise would be analogous to the noise-correlation and, crucially, misestimation of noise as volatility might arise as misestimation of these two forms of correlation. While the current study clearly cannot adjudicate on the neural representation of these processes, our finding of distinct behavioural and physiological responses to the two forms of uncertainty, does suggest that separable neural representations of uncertainty are maintained.”

    1. eLife Assessment

      The authors provide compelling evidence that a chloride ion stabilizes the protonated Schiff base chromophore linkage in the animal rhodopsin Antho2a. This important finding is novel and of major interest to a broad audience, including optogenetics researchers, protein engineers, spectroscopists, and environmental biologists. The study combines state-of-the-art research methods, such as spectroscopic and mutational analyses, which are complemented by QM/MM calculations, and was further improved based on the comments from the reviewers.

    2. Reviewer #1 (Public review):

      The chromophore molecule of animal and microbial rhodopsins is retinal which forms a Schiff base linkage with a lysine in the 7-th transmembrane helix. In most cases, the chromophore is positively charged by protonation of the Schiff base, which is stabilized by a negatively charged counterion. In animal opsins, three sites have been experimentally identified, Glu94 in helix 2, Glu113 in helix 3, and Glu181 in extracellular loop 2, where a glutamate acts as the counterion by deprotonation. In this paper, Sakai et al. investigated molecular properties of anthozoan-specific opsin II (ASO-II opsins), as they lack these glutamates. They found an alternative candidate, Glu292 in helix 7, from the sequences. Interestingly, the experimental data suggested that Glu292 is not the direct counterion in ASO-II opsins. Instead, they found that ASO-II opsins employ a chloride ion as the counterion. In case of microbial rhodopsin, a chloride ion serves as the counterion of light-driven chloride pumps. This paper reports the first observation of a chloride ion as the counterion in animal rhodopsin. Theoretical calculation using a QM/MM method supports their experimental data. The authors also revealed the role of Glu292, which serves as the counterion in the photoproduct and is involved in G protein activation.

      The conclusions of this paper are well supported by data.

    3. Reviewer #2 (Public review):

      Summary:

      This work reports the discovery of a new rhodopsin from reef-building corals that is characterized experimentally, spectroscopically, and by simulation. This rhodopsin lacks a carboxylate-based counterion, which is typical for this family of proteins. Instead, the authors find that a chloride ion stabilizes the protonated Schiff base and thus serves as a counterion.

      Strengths:

      This work focuses on the rhodopsin Antho2a, which absorbs in the visible spectrum with a maximum at 503 nm. Spectroscopic studies under different pH conditions, including the mutant E292A and different chloride concentrations, indicate that chloride acts as a counterion in the dark. In the photoproduct, however, the counterion is identified as E292.

      These results lead to a computational model of Antho2a in which the chloride is modeled in addition to the Schiff base. This model is improved using the hybrid QM/MM simulations. As a validation, the absorption maximum is calculated using the QM/MM approach for the protonated and deprotonated E292 residue as well as the E292A mutant. The results are in good agreement with the experiment. However, there is a larger deviation for ADC(2) than for sTD-DFT. Nevertheless, the trend is robust since the wt and E292A mutant models have similar excitation energies. The calculations are performed at a high level of theory that includes a large QM region.

    4. Reviewer #3 (Public review):

      Summary:

      The paper by Saito et al. studies the properties of anthozoan-specific opsins (ASO-II) from organisms found in reef-building coral. Their goal was to test if ASO-II opsins can absorb visible light, and if so, what the key factors involved are.

      The most exciting aspect of this work is their discovery that ASO-II opsins do not have a counterion residue (Asp or Glu) located at any of the previously known sites found in other animal opsins.

      This is very surprising. Opsins are only able to absorb visible (long wavelength light) if the retinal Schiff base is protonated, and the latter requires (as the name implies) a "counter ion". However, the authors clearly show that some ASO-II opsins do absorb visible light.

      To address this conundrum, they tested if the counterion could be provided by exogenous chloride ions (Cl-). Their results provide compelling evidence supporting this idea, and their studies of the ASO-II mutant E292A suggest that E292 also plays a role in G protein activation and is a counterion for a protonated Schiff base in the light-activated form.

      Strengths:

      Overall, the methods are well described and carefully executed, and the results are very compelling.

      Their analysis of seven ASO-II opsin sequences undoubtedly shows they all lack a Glu or Asp residue at "normal" (previously established) counter-ion sites in mammalian opsins (typically found at positions 94, 113 or 181). The experimental studies clearly demonstrate the necessity of Cl- for visible light absorbance, as do their studies of the effect of altering the pH.

      Importantly, the authors also carried out careful QM/MM computational analysis (and corresponding calculation of the expected absorbance effects), thus providing compelling support for the Cl- acting directly as a counterion to the protonated retinal Schiff base, and thus limiting the possibility that the Cl- is simply altering the absorbance of ASO-II opsins through some indirect effect on the protein.

      Altogether, the authors clearly achieved their aims, and the results support their conclusions. The manuscript is carefully written, and refreshingly, the results and conclusions are not overstated.

      This study is impactful for several reasons. There is increasing interest in optogenetic tools, especially those that leverage G protein coupled receptor systems. Thus, the authors' demonstration that ASO-II opsins could be useful for such studies is of interest.

      Moreover, the finding that visible light absorbance by an opsin does not absolutely require a negatively charged amino acid be placed at one of the expected sites (94, 113 or 181) typically found in animal opsins is very intriguing and will help future protein engineering efforts. The argument that the Cl- counterion system they discover here might have been a preliminary step in the evolution of amino acid based counterions used in animal opsins is also interesting.

      Finally, given the ongoing degradation of coral reefs worldwide, the focus on these curious opsins is very timely, as is the authors' proposal that the lower Schiff base pKa they discovered here for ASO-II opsins may cause them to change their spectral sensitivity and G protein activation due to changes in their environmental pH.

    1. eLife Assessment

      This valuable study employs transition-metal FRET (tmFRET) and time-correlated single-photon counting to investigate allosteric conformational changes in both isolated cyclic nucleotide-binding domains (CNBDs) and full-length bacterial CNG channels, demonstrating that transmembrane domains stabilize CNBDs in their active state. By comparing isolated CNBD constructs with full-length channels, the authors reveal how allosteric networks couple domain movements to gating energetics, providing insights into ion channel regulation mechanisms. The rigorous methodology and compelling quantitative analysis establish a framework for applying tmFRET to study conformational dynamics in diverse protein systems.

    2. Reviewer #1 (Public review):

      Summary:

      This useful work extends a prior study from the authors to observe distance changes within the CNBD domains of a full length CNG channel based on changes in single photon lifetimes due to tmFRET between a metal at an introduced chelator site and a fluorescent non canonical amino acid at another site. The data are excellent and convincingly support the authors' conclusions. In addition to the methodology being of general use for other proteins, the authors show that coupling of the CNBDs to the rest of the channel stabilizes the CNBDs in their active state relative to an isolated CNBD construct.

      Strengths:

      The manuscript is very well written and clear.

    3. Reviewer #2 (Public review):

      The manuscript by Eggan et al. investigates the energetics of conformational transitions in the cyclic nucleotide-gated (CNG) channel SthK. This lab pioneered transition metal FRET (tmFRET), which has previously provided detailed insights into ion channel conformational changes. Here, the authors analyze tmFRET fluorescence lifetime measurements in the time domain, yielding detailed insights into conformational transitions within the cyclic nucleotide binding domains (CNBDs) of the channel. The integration of tmFRET with time-correlated single-photon counting (TCSPC) represents an advancement of this technique.

    4. Reviewer #3 (Public review):

      Summary:

      This is a lucidly written manuscript describing the use of transition-metal FRET to assess distance changes during functional conformational changes in a CNG channel. The experiments were performed on an isolated C-terminal nucleotide binding domain (CNBD) and on a purified full-length channel, with FRET partners placed at two positions in the CNBD.

      The data and quantitative analysis are exemplary, and they provide a roadmap for the use of this powerful approach in other proteins. In particular, the use of the fluorescence-lifetime decay histograms to learn not just the mean distance reported by the FRET, but also the distribution of states with different distances, allows better refinement of hypotheses for the gating motions.

    5. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This useful work extends a prior study from the authors to observe distance changes within the CNBD domains of a full-length CNG channel based on changes in single photon lifetimes due to tmFRET between a metal at an introduced chelator site and a fluorescent non-canonical amino acid at another site. The data are excellent and convincingly support the authors' conclusions. The methodology is of general use for other proteins. The authors also show that coupling of the CNBDs to the rest of the channel stabilizes the CNBDs in their active state, relative to an isolated CNBD construct.

      Strengths:

      The manuscript is very well written and clear.

      Reviewer #2 (Public review):

      The manuscript "Domain Coupling in Allosteric Regulation of SthK Measured Using Time-Resolved Transition Metal Ion FRET" by Eggan et al. investigates the energetics of conformational transitions in the cyclic nucleotide-gated (CNG) channel SthK. This lab pioneered transition metal FRET (tmFRET), which has previously provided detailed insights into ion channel conformational changes. Here, the authors analyze tmFRET fluorescence lifetime measurements in the time domain, yielding detailed insights into conformational transitions within the cyclic nucleotide binding domains (CNBDs) of the channel. The integration of tmFRET with time-correlated single-photon counting (TCSPC) represents an advancement of this technique.

      The results summarize known conformational transitions of the C-helix and provide distance distributions that agree with predicted values based on available structures. The authors first validated their TCSPC approach using the isolated CNBD construct previously employed for similar experiments. They then study the more complex full-length SthK channel protein. The findings agree with earlier results from this group, demonstrating that the C-helix is more mobile in the closed state than static structures reflect. Upon adding the activating ligand cAMP, the C-helix moves closer to the bound ligand, as indicated by a reduced fluorescence lifetime, suggesting a shorter distance between the donor and acceptor. The observed effects depend on the cAMP concentration, with affinities comparable to functional measurements. Interestingly, a substantial amount of CNBDs appear to be in the activated state even in the absence of cAMP (Figure 6E and F, fA2 ~ 0.4).

      This may be attributed to cooperativity among the CNBDs, which the authors could elaborate on further. In this context, the major limitation of this study is that distance distributions are observed only in one domain. While inter-subunit FRET is detected and accounted for, the results focus exclusively on movements within one domain. Thus, the resulting energetic considerations must be assessed with caution. In the absence of the activator, the closed state is favored, while the presence of cAMP favors the open state. This quantifies the standard assumption; otherwise, an activator would not effectively activate the channel. However, the numerical values of approximately 3 kcal/mol are limited by the fact that only one domain is observed in the experiment, and only one distance (C-helix relative to the CNBD) is probed. Additional conformational changes leading to pore opening (including rotation and upward movement of the CNBD, and radial dilation of the tetrameric assembly) are not captured by the current experiments. These limitations should be taken into account when interpreting the results.

      We agree that these are important limitations to consider in interpreting our results. These limitations and future directions are now largely covered in our discussion. We believe measurements in individual domains provide unique insights into the contributions of different parts of the protein and future work will continue to address conformational energetics in other parts of the protein and subunit cooperativity. 
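      The reasoning in the comment above, that a shorter fluorescence lifetime implies a shorter donor-acceptor distance and that state fractions map onto free energies, can be sketched with textbook relations. This is a minimal illustration, not the authors' analysis: it assumes a single donor-acceptor distance rather than the distance distributions the study fits, a hypothetical Förster distance `r0`, and a simple two-state equilibrium at an assumed 298 K. All function names are illustrative.

```python
import math

GAS_CONSTANT_KCAL = 1.987e-3  # kcal/(mol*K)


def fret_efficiency_from_lifetimes(tau_donor_only, tau_with_acceptor):
    """FRET efficiency from donor lifetimes without and with the acceptor."""
    return 1.0 - tau_with_acceptor / tau_donor_only


def distance_from_efficiency(efficiency, r0):
    """Invert the Förster relation E = 1 / (1 + (r / r0)**6) for r."""
    return r0 * (1.0 / efficiency - 1.0) ** (1.0 / 6.0)


def free_energy_from_fraction(fraction_active, temperature_k=298.0):
    """Two-state free energy (resting -> active), in kcal/mol, from the active fraction."""
    k_eq = fraction_active / (1.0 - fraction_active)
    return -GAS_CONSTANT_KCAL * temperature_k * math.log(k_eq)


# Hypothetical lifetimes (ns) and Förster distance (angstroms), not values from the study.
efficiency = fret_efficiency_from_lifetimes(4.0, 2.0)
distance = distance_from_efficiency(efficiency, r0=15.0)

# An active fraction of about 0.4, as quoted above for the ligand-free state.
delta_g_apo = free_energy_from_fraction(0.4)
```

      In this simplified two-state picture, an active fraction below 0.5 gives a positive free energy, meaning the resting state is favored, and a fraction above 0.5 gives a negative one. The sketch is not intended to reproduce the values reported in the manuscript.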

      Reviewer #3 (Public review):

      Summary:

      This is a lucidly written manuscript describing the use of transition-metal FRET to assess distance changes during functional conformational changes in a CNG channel. The experiments were performed on an isolated C-terminal nucleotide binding domain (CNBD) and on a purified full-length channel, with FRET partners placed at two positions in the CNBD.

      Strengths:

      The data and quantitative analysis are exemplary, and they provide a roadmap for use of this powerful approach in other proteins.

      Weaknesses/Comments:

      A ~3x lower Kd for nucleotide is seen for the detergent-solubilized full-length channel, compared to electrophysiological experiments. This is worth a comment in the Discussion, particularly in the context of the effect of the pore domain on the CNBD energetics.

      We are cautious about interpreting our Kd values, given the high affinity for cAMP and the challenges of accurately determining the total protein concentrations in our experiments. We now state this explicitly in the manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The manuscript is very well written and clear. Congrats to the authors.

      Minor comment: In "Measuring tmFRET in Full-Length SthK", 3rd paragraph: "... FRET model with both intersubunit and intersubunit FRET." Should read "intersubunit and intrasubunit".

      Thank you for the comment, this is now corrected.  

      Reviewer #2 (Recommendations for the authors):

      Overall, the manuscript is well-written and clearly explained. However, I recommend that the authors discuss the limitations more critically.

      The revised manuscript now largely addresses these limitations. Additional comments are addressed in short below:  

      A) Only one distance is measured.

      We believe validating a single distance is an important first step in establishing the use of this technique and beginning to quantify the allosteric mechanism in SthK. Future studies aim to make additional measurements.

      B) Measurements are confined to a single domain in the cooperative tetrameric assembly.

      Isolating conformational changes in individual domains allows us to determine how different parts of the protein contribute to activation upon ligand binding.

      C) The change in distance upon activation mirrors what is observed in the closed state, which casts doubt on whether these conformational changes actually lead to channel opening or merely reflect the upward swinging of the C-helix that contributes to coordinating cAMP in the binding pocket.

      Future studies aim to detect conformational changes in the pore and other parts of the protein.

      D) Rigid body movements, rotations, and dilations are not captured by the measurements. 

      Our measurements combine energetic information with some, although more limited, structural information.   

      E) Cooperativity is not considered in the interpretation of the results.

      It is currently unclear where in SthK cooperativity arises upon ligand activation (i.e., at the level of the CNBD, C-linker, or pore). Our results do not provide evidence of cooperativity in the CNBD upon ligand binding.

      Additionally, the authors directly correlate their results with the functional states of SthK previously reported, but it remains open whether the modified protein for tmFRET behaves similarly to WT SthK. Functional experiments with the protein used for tmFRET, which demonstrate comparable open probabilities and cAMP potency, would considerably strengthen the manuscript.

      Further optimization is needed to express the full-length protein used in tmFRET experiments in spheroplasts to enable electrophysiological recordings from these constructs. 

      Reviewer #3 (Recommendations for the authors):

      In the final paragraph of the Discussion, the sentence "In our experiments, we assumed that deleting the pore and transmembrane domains eliminates the coupling of these regions to the CNBD" seems trivial. Perhaps it would help to add "simply" before eliminates?

      We have taken the advice and added ‘simply’ in this sentence.  

      Can a statement be made about the magnitude of the effect in the C-terminal deletion experiments in refs 27-29?

      Due to the different channels used in the C-terminal deletion experiments in refs 27-29 (HCN1 and spHCN), compared to the channel we used (SthK), it is challenging to compare the magnitude of energetic changes between these studies. Additionally, the HCN experiments measured changes in the pore domain, compared to the conformational changes in the CNBD domain measured here.

    1. eLife Assessment

      The authors provide a convincing summary of ten years of Brain Initiative funding including the historical development, the specific funding mechanisms, and examples of grants funded and work produced. It is particularly valuable at this moment in history, given the cataclysmic changes in the US government structure and function occurring in early 2025.

    2. Reviewer #1 (Public review):

      This is a convincing description of approximately ten years of funding from the NIH BRAIN initiative. It is of particular value at this moment in history, given the cataclysmic changes in the US government structure and function occurring in early 2025.

      The paper contains a fair bit of documentation so that the curious reader can actually parse what this BRAIN program funded. The authors are able to draw on a wealth of real-life experience reviewing, funding, and administering large team projects, and assessing how well they achieve their goals. In revision, the paper has been improved with respect to clarity and by bringing together two separate papers into one stronger piece.

    3. Reviewer #2 (Public review):

      Summary:

      The authors provide an important summary of ten years of Brain Initiative funding including a description of the historical development of the initiative, the specific funding mechanisms utilized, and examples of grants funded and work produced. The authors also conduct analyses of the impact on overall funding in Systems and Computational Neuroscience, the raw and field normalized bibliographic impact of the work, the social media impact of the funded work, and the popularity of some tools developed.

      The authors have improved the presentation by integrating the weaker of the two manuscripts with the stronger, by clarifying terminology and by performing additional analyses.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this useful narrative, the authors attempt to capture their experience of the success of team projects for the scientific community.

      Strengths:

      The authors are able to draw on a wealth of real-life experience reviewing, funding, and administering large team projects, and assessing how well they achieve their goals.

      Weaknesses:

      The utility of the RCR as a measure is questionable. I am not sure if this really makes the case for the success of these projects. The conclusions do not depend on Figure 1.

      We respectfully disagree about the utility of the RCR, particularly because it is a metric that is normalized by both year and topical area. We have added a more detailed description of how the RCR is calculated on pages 6-7. Please note that Figure 1 aims to highlight the funding opportunities, investments, and number of awards associated with small lab (exploratory) versus team (elaborated, mature) research, rather than to describe publication metrics.

      Reviewer #2 (Public review):

      Summary:

      The authors review the history of the team projects within the Brain initiative and analyze their success in progression to additional rounds of funding and their bibliographic impact.

      Strengths:

      The history of the team projects and the fact that many had renewed funding and produced impactful papers is well documented.

      Weaknesses:

      The core bibliographic and funding impact results have largely been reported in the companion manuscript and so represent "double dipping". I presume the slight disagreement in the number of grants (by one) represents a single grant that was not deemed to address systems/computational neuroscience. The single figure is relatively uninformative. The domains of study are sufficiently large and overlapping that there seems to be little information gained from the graphic, and the Sankey plot could be simply summarized by rates of competing success.

      While we sincerely appreciate the feedback, we chose to retain these plots on domains and models to provide a sense of the broad spectrum of research topics contained in our TeamBCP awards. Further details on the awards can be derived from the award links provided in the text. Additionally, we retained the Sankey plots because these are a visual depiction of how awards transition from one mechanism to another, evolve in their funding sources, and advance in their research trajectories. The plot is an example of our continuity analysis which is only reported in the text and not visually shown for the remaining BCP programs.

      Recommendations for the authors:

      Editorial note:

      In the discussion, the reviewers agreed that the present manuscript does not make a sufficient independent contribution and so would be more profitably combined with the companion manuscript. Both reviewers noted that there was not much insight that relied on the single figure. Since neither manuscript is long, and they have overlapping authors (including the same first and last authors), this should not be a difficult merger to achieve.

      Thank you for the recommendation to merge. We have combined both manuscripts into one in this version.

      Reviewer #1 (Recommendations for the authors):

      The jargon of the grant programs could be described as a nightmare. Wellcome is spelled wrong.

      We have attempted to limit the use of jargon and to define acronyms in this version. We have corrected the spelling of Wellcome.

      Reviewer #2 (Recommendations for the authors):

      I suggest that the two manuscripts be combined into a single paper. Although the other manuscript could stand on its own, this one does not.

      The idea of culture change surrounding teams is useful but really forms more of a policy-focused opinion piece than a quantitative analysis of funding impact.

      If the authors insist on keeping these separate, it is critical to remove the team data from the other manuscript.

      We have combined both manuscripts and decided to retain the description of culture change but have edited and condensed this section and will use the supplemental report for qualitative assessments.

    1. Reviewer #1 (Public review):

      Summary:

      The study investigated how individuals living in urban slums in Salvador, Brazil, interact with environmental risk factors, particularly focusing on domestic rubbish piles, open sewers, and a central stream. The study makes use of the step selection functions using telemetry data, which is a method to estimate how likely individuals move towards these environmental features, differentiating among groups by gender, age, and leptospirosis serostatus. The results indicated that women tended to stay closer to the central stream while avoiding open sewers more than men. Furthermore, individuals who tested positive for leptospirosis tended to avoid open sewers, suggesting that behavioral patterns might influence exposure to risk factors for leptospirosis, hence ensuring more targeted interventions.

      Strengths:

      (1) The use of step selection functions to analyze human movement represents an innovative adaptation of a method typically used in animal ecology. This provides a robust quantitative framework for evaluating how people interact with environmental risk factors linked to infectious diseases (in this case, leptospirosis).

      (2) Detailed differentiation by gender and serological status allows for nuanced insights, which can help tailor targeted interventions and potentially improve public health measures in urban slum settings.

      (3) The integration of real-world telemetry data with epidemiological risk factors supports the development of predictive models that can be applied in future infectious disease research, helping to bridge the gap between environmental exposure and health outcomes.
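      The step selection approach named in the strengths above compares each observed step with alternative steps the person could have taken, and estimates a selection coefficient by conditional logistic regression. The following is a minimal sketch under stated assumptions: a single covariate (a hypothetical distance in metres to the nearest open sewer at the end of each step), made-up data, and a hand-written Newton fit. It is not the authors' model or software, and all names are illustrative.

```python
import math


def log_likelihood(beta, strata):
    """Conditional-logit log-likelihood; each stratum is (used_value, [available_values])."""
    total = 0.0
    for used, available in strata:
        scores = [beta * x for x in [used] + available]
        peak = max(scores)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += beta * used - log_norm
    return total


def gradient_and_curvature(beta, strata):
    """First and second derivatives of the log-likelihood with respect to beta."""
    grad, curv = 0.0, 0.0
    for used, available in strata:
        values = [used] + available
        scores = [beta * x for x in values]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        mean = sum(w * x for w, x in zip(weights, values)) / norm
        second = sum(w * x * x for w, x in zip(weights, values)) / norm
        grad += used - mean            # observed minus expected covariate
        curv -= second - mean * mean   # minus the within-stratum variance
    return grad, curv


def fit_selection_coefficient(strata, iterations=50, tolerance=1e-10):
    """Newton ascent with step halving so the log-likelihood never decreases."""
    beta = 0.0
    for _ in range(iterations):
        grad, curv = gradient_and_curvature(beta, strata)
        if abs(grad) < tolerance or curv == 0.0:
            break
        step = -grad / curv
        current = log_likelihood(beta, strata)
        while log_likelihood(beta + step, strata) < current and abs(step) > 1e-12:
            step /= 2.0
        beta += step
    return beta


# Hypothetical strata: (distance at the used step, distances at three available steps).
example_strata = [
    (10.0, [20.0, 30.0, 5.0]),
    (15.0, [10.0, 40.0, 25.0]),
    (8.0, [12.0, 3.0, 30.0]),
    (20.0, [25.0, 35.0, 15.0]),
]
beta_hat = fit_selection_coefficient(example_strata)
```

      With distance as the covariate, a negative coefficient means that steps ending closer to the feature are selected more often than the available alternatives, and a positive coefficient indicates avoidance. Real analyses would typically include several covariates, movement terms, and interactions with participant characteristics such as gender or serostatus.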

      Weaknesses:

      (1) The sample size for the study was not calculated, although it was a nested cohort study.

      (2) The step‐selection functions, though a novel method, may face challenges in fully capturing the complexity of human decision-making influenced by socio-cultural and economic factors that were not captured in the study.

      (3) The study's context is limited to a specific urban slum in Salvador, Brazil, which may reduce the generalizability of its findings to other geographical areas or populations that experience different environmental or socio-economic conditions.

      (4) The reliance on self-reported or telemetry-based movement data might include some inaccuracies or biases that could affect the precision of the selection coefficients obtained, potentially limiting the study's predictive power.

      (5) Some participants with less than 50 relocations within the study area were excluded without clear justification, see line 149.

      (6) Some figures are not clear (see Figure 4 A & B).

      (7) No statement on conflict of interest was included, considering sponsorship of the study.

    2. eLife Assessment

      This study makes a novel and valuable contribution by adapting step selection functions, traditionally used in animal ecology, to explore human movement and environmental risk exposure in urban slums, offering a promising framework for spatial epidemiology, particularly regarding leptospirosis. The integration of GPS telemetry with environmental data and the stratification by gender and serostatus are notable strengths that enhance the study's relevance for public health applications. The strength of evidence is compelling.

    3. Reviewer #2 (Public review):

      Summary:

      Pablo Ruiz Cuenca et al. conducted a GPS logger study with 124 adult participants across four different slum areas in Salvador, Brazil, recording GPS locations every 35 seconds for 48 hours. The aim of their study was to investigate step-selection models, a technique widely used in movement ecology to quantify contact with environmental risk factors for exposure to leptospires (open sewers, community streams, and rubbish piles). The authors built two different types of models based on distance and based on buffer areas to model human environmental exposure to risk factors. They show differences in movement/contact with these risk factors based on gender and seropositivity status. This study shows the existence of modest differences in contact with environmental risk factors for leptospirosis at small spatial scales based on socio-demographics and infection status.

      Strengths:

      The authors assembled a rich dataset by collecting human GPS logger data, combined with field-recorded locations of open sewers, community streams, and rubbish piles, and testing individuals for leptospirosis via serology. This study was able to capture fine-scale exposure dynamics within an urban environment and shows differences by gender and seropositive status, using a method novel to epidemiology (step selection).

      Weaknesses:

      Due to environmental data being limited to the study area, exposure elsewhere could not be captured, despite previous research by Owers et al. showing that the extent of movement was associated with infection risk. Limitations of step selection for use in studying human participants in an urban environment would need to be explicitly discussed.

    1. eLife Assessment

      This manuscript provides valuable insights into the heterogeneity of hematopoietic stem cells and age-associated myeloid-biased hematopoiesis. While several aspects of the study are intriguing and merit further investigation, the current results remain incomplete and additional data are necessary to substantiate the conclusions. Some of the methods and data analyses partially support the claims.

    2. Reviewer #1 (Public review):

      In this study, Nishi et al. claim that the ratio of long-term hematopoietic stem cells (LT-HSCs) to short-term HSCs (ST-HSCs) determines the lineage output of HSCs, and that a reduced proportion of ST-HSCs in aged mice causes myeloid-biased hematopoiesis. The authors used Hoxb5 reporter mice to isolate LT-HSCs and ST-HSCs and performed molecular analyses and transplantation assays to support their arguments. How the hematopoietic system becomes myeloid-biased upon aging is an important question with many implications in disease contexts as well. However, this study needs more definitive data.

      (1) The experimental designs have caveats that prevent them from definitively supporting the authors' claims. The authors claimed, based on transplantation assays, that aged LT-HSCs show no myeloid-biased clone expansion. In these experiments, the authors used 10 HSCs and young mice as recipients. Given the huge numerical expansion of old HSCs and the known heterogeneity in immunophenotypically defined HSC populations, it is questionable how 10 out of so many old HSCs (an average of 300,000 up to 500,000 cells per mouse; Mitchell et al., Nature Cell Biology, 2023) can faithfully represent the old HSC population. The data from primary and secondary recipients of Hoxb5+ old HSCs (Fig. 2C and D) support this concern. In addition, the authors only used young recipients. Considering the importance of the inflammatory aged niche in myeloid-biased lineage output, transplanting young vs old LT-HSCs into aged mice would complete the picture.

      In response to the above comments, the authors calculated the required sample size as approximately 384 cells to represent 500,000 HSCs per old mouse. Based on the total of 1260 cells used throughout the whole manuscript (Figures 2, 3, 5, 6, S3, and S6), the authors claimed that the data reflect old HSC behavior. However, 384 cells represent the HSCs from one old mouse. Following the authors' logic, the experiments in the whole manuscript amount to only about 3.3 mice (1260/384). An N of 3 is not enough, especially for experiments with old mice, considering the heterogeneity of aged mice. Also, the authors did not address the comment regarding the effects of the inflammatory aged niche.
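
      The arithmetic behind this point can be reproduced with a short sketch. The source does not state which formula the authors used; the code below assumes the standard Cochran sample-size formula for a proportion (95% confidence, 5% margin of error, p = 0.5) with a finite-population correction, which happens to yield roughly 384 for a population of 500,000. The function name and parameters are illustrative only.

```python
import math

def cochran_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size for estimating a proportion (assumed formula, not confirmed by the source)."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2      # size for an infinite population
    return n0 / (1 + (n0 - 1) / population)        # finite-population correction

# 500,000 HSCs per old mouse and 1260 transplanted cells are taken from the review text.
n_required = math.ceil(cochran_sample_size(500_000))
mouse_equivalents = 1260 / n_required

print(n_required)                   # cells needed to represent one old mouse
print(round(mouse_equivalents, 2))  # how many old mice 1260 cells would represent
```

      Under these assumptions the 1260 cells correspond to roughly three mouse-equivalents, which is the basis of the concern about representativeness.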

      (2) The molecular data analyses need more rigor and unbiased approaches. The authors claimed that neither aged LT-HSCs nor aged ST-HSCs exhibited myeloid or lymphoid gene set enrichment, but that aged bulk HSCs, which are just the sum of LT-HSCs and ST-HSCs by their gating scheme (Fig. 4A), showed a "tendency" toward enrichment of myeloid-related genes based on the selected gene set (Fig. 4D). Although the proportion of ST-HSCs in bulk HSCs is reduced upon aging, ST-HSCs do not exhibit lymphoid gene set enrichment based on the authors' data, so it is hard to understand how aged bulk HSCs can have more myeloid gene set enrichment than young bulk HSCs. These bulk HSC data instead suggest that there could be a trend toward a certain lineage bias (although not significant) in aged LT-HSCs or ST-HSCs. The authors need to verify the molecular lineage priming of LT-HSCs and ST-HSCs using another comprehensive dataset.

      (3) Although the authors could not find any molecular evidence for myeloid-biased hematopoiesis from old HSCs (either LT or ST), they argued, based on young HSC experiments (Fig. 6), that the ratio between LT-HSCs and ST-HSCs causes myeloid-biased hematopoiesis upon aging. However, the functional data for old ST-HSCs showed that, unlike young Hoxb5- HSCs (ST-HSCs), they barely contribute to blood production in the transplantation setting (Fig. 2). Is there any evidence that in unperturbed, native old hematopoiesis, old Hoxb5- HSCs (ST-HSCs) still contribute to blood production? To answer this question, the authors performed additional experiments with an increased cell number (Fig. S6). Although the data in Fig. S6D reach statistical significance, it is questionable how biologically meaningful they are. The more fundamental question again concerns representativeness: can the cell number used in this experiment represent the behavior of old HSCs (either LT or ST)?

    3. Reviewer #2 (Public review):

      Summary:

      Nishi et al. investigate the well-known and previously described phenomenon of age-associated myeloid-biased hematopoiesis. Using a previously established HoxB5mCherry mouse model, they used HoxB5+ and HoxB5- HSCs to discriminate cells with long-term (LT-HSCs) and short-term (ST-HSCs) reconstitution potential and compared these populations to immunophenotypically defined 'bulk HSCs' that consist of a mixture of LT-HSCs and ST-HSCs. They then isolated these HSC populations from young and aged mice to test their function and myeloid bias in non-competitive and competitive transplants into young and aged recipients. Based on quantification of hematopoietic cell frequencies in the bone marrow, peripheral blood, and in some experiments the spleen and thymus, the authors argue against the currently held belief that myeloid-biased HSCs expand with age.

      While aspects of their work are fascinating and might have merit, several issues weaken the overall strength of the arguments and interpretation. Multiple experiments were done with a very low number of recipient mice, showed very large standard deviations, and had no statistically detectable difference between experimental groups. While the authors conclude that these experimental groups are not different, the displayed results seem too variable to conclude anything with certainty. The sensitivity of the performed experiments (e.g. Fig 3; Fig 6C, D) is too low to detect even reasonably strong differences between experimental groups and is thus inadequate to support the authors' claims. This weakness of the study is not acknowledged in the text and is also not discussed. To support their conclusions the authors need to provide higher n-numbers and provide a detailed power analysis of the transplants in the methods section.

      As the authors attempt to challenge the current model of the age-associated expansion of myeloid-biased HSCs (which has been observed and reproduced by many different groups), ideally additional strong evidence in the form of single-cell transplants is provided.

      It is also unclear why the authors believe that the observed reduction of ST-HSCs relative to LT-HSCs explains the myeloid-biased phenotype observed in the peripheral blood. This point seems counterintuitive and requires further explanation.

      Based on my understanding of the presented data, the authors argue that myeloid-biased HSCs do not exist, as:

      a) they detect no difference between young/aged HSCs after transplant (mind low n-numbers and large std);

      b) myeloid progenitors downstream of HSCs only show minor or no changes in frequency; and

      c) aged LT-HSCs do not outperform young LT-HSCs in myeloid output in competitive transplants (again, mind low n-numbers and large std).

      However, given the low n-numbers and high variance of the results, the argument seems weak and the presented data do not support the claims sufficiently. That the number of downstream progenitors does not change could be explained by other mechanisms, for instance, the frequently reported differentiation short-cuts of HSCs and/or changes in the microenvironment.

      Strengths:

      The authors present an interesting observation and offer an alternative explanation of the origins of aged-associated myeloid-biased hematopoiesis. Their data regarding the role of the microenvironment in the spleen and thymus appears to be convincing.

      Weaknesses:

      "Then, we found that the myeloid lineage proportions from young and aged LT-HSCs were nearly comparable during the observation period after transplantation (Fig. 3, B and C)."

      [Comment to the authors]: Given the large standard deviation and low n-numbers, the power of the analysis to detect differences between experimental groups is very low. Experimental groups with too large standard deviations (as displayed here) are difficult to interpret and might be inconclusive. The absence of clearly detectable differences between young and aged transplanted HSCs could thus simply be a false-negative result. The shown experimental results hence do not provide strong evidence for the authors' interpretation of the data. The authors should add additional transplants and include a detailed power analysis to be able to detect differences between experimental groups with reasonable sensitivity.
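
      To give a sense of scale for the requested power analysis, the following is a minimal sketch using the textbook normal approximation for a two-sided, two-sample comparison of means. The effect sizes (Cohen's d), the 5% significance level, and the 80% power target are assumptions chosen for illustration and do not come from the manuscript; an exact t-based calculation would give slightly larger numbers, particularly for small groups.

```python
import math
from statistics import NormalDist

def mice_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate group size for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)

# Hypothetical standardized effect sizes (difference in means / pooled SD).
for d in (0.5, 1.0, 2.0):
    print(f"d = {d}: about {mice_per_group(d)} recipients per group")
```

      Even under this optimistic approximation, only very large standardized differences are detectable with a handful of recipients per group, and large standard deviations shrink the standardized effect further.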

      Line 293: "Based on these findings, we concluded that myeloid-biased hematopoiesis observed following transplantation of aged HSCs was caused by a relative decrease in ST-HSC in the bulk-HSC compartment in aged mice rather than the selective expansion of myeloid-biased HSC clones."

      [Comment to the authors]: Couldn't that also be explained by an increase in myeloid-biased HSCs, as repeatedly reported and seen in the expansion of CD150+ HSCs? It is not intuitively clear why a reduction of ST-HSC clones would lead to a myeloid bias. The authors should try to explain more clearly where they believe the increased number of myeloid cells comes from. What is the source of myeloid cells if the authors believe they are not derived from the expanded population of myeloid-biased HSCs?

      New comment for the authors:

      While the authors provide new evidence, clarify the text, and adjust their interpretation, the presented data remain weak and do not convincingly challenge the current paradigm. As myeloid-biased HSC expansion with age has been observed and published by many different groups, the authors need to provide much stronger evidence to challenge the observations of others. Key experiments that might support their claims had been suggested, but as indicated, the authors plan to provide these much more rigorous experiments in future studies. As it stands, the overall conclusions of this manuscript thus remain weak and preliminary.

      In an attempt to quantify the absolute cell number of HSPC subpopulations, the authors use an unusual readout and quantify "Number of cells per minute of analysis time". This appears to be a quick and dirty reanalysis of already existing flow cytometry data. Unfortunately, this quantification cannot count the absolute number of cells reliably, as the number of cells per minute recorded is heavily influenced by the abundance of other cell populations. Instead, the authors should have counted the absolute number of HSCs, MPPs, GMPs, etc. per femur, which is typically done to address this question.

      At this point, as the authors are seemingly not willing to provide additional hard evidence to support their claims in this study and are instead in the process of preparing additional data for a future manuscript, I believe this study, as it stands (although weak), suggests an interesting alternative model. Despite being highly controversial, this alternative model warrants future investigations and discussions in the field. As always, it will also be important to reproduce these findings independently in other labs. As my concerns and the concerns of the other reviewers are documented and available to read by others, I believe the manuscript should be published in its current form to stimulate critical discussion and future investigations of the current model.

    4. Reviewer #3 (Public review):

      In this manuscript, Nishi et al. propose a new model to explain the previously reported myeloid-biased hematopoiesis associated with aging. Traditionally, this phenotype has been explained by the expansion of myeloid-biased hematopoietic stem cell (HSC) clones during aging. Here, the authors question this idea, show how their Hoxb5 reporter model can discriminate long-term (LT) and short-term (ST) HSCs, and characterize their lineage output after transplant. From these analyses, the authors conclude that changes during aging in the LT/ST HSC proportion explain the myeloid bias observed.

      Comments on revisions:

      I appreciate the authors' reply to some of my comments. However, there are some key aspects that remain unresolved. Please see below.

      - The authors propose a critical change in the way we consider the mechanisms leading to lineage biased hematopoiesis during aging. As Reviewer 2 mentioned, such a strong claim needs to be supported by solid experimental data. Unfortunately, the level of variability in key in vivo experiments (Figure 2 and 3) diminishes the robustness of these results.

      The authors argue that even with the low number of mice used in some of these experiments and the high level of variability, differences still reach (or not) statistical significance according to their analysis. I am not an expert on statistics, but the only test mentioned in their methodology is Welch's t-test, which is only appropriate for data following a normal distribution. A more rigorous statistical analysis should be performed to sustain the claims included in the current manuscript.

      - The chosen irradiation regimen might contribute to the uncertainty of the data and influence their interpretation. As the authors show in their response to my "comment to our #3-4 response", there is a considerable (and variable) amount of "radioresistant" CD45.1+CD45.2- cells in their primary recipients, which become concerningly high in the secondary transplant. This is not found in previous publications focused on this topic and, therefore, it makes it difficult to compare those studies with the present manuscript. The inclusion of this aspect in the text is appreciated but definitely reduces the impact of their claims.
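
      One distribution-free alternative is an exact permutation test on the difference in group means. The sketch below is an illustration only, not the analysis used in the manuscript; the two groups of three values are hypothetical numbers and the function name is invented for this example.

```python
from itertools import combinations

def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    as_extreme = 0
    n_splits = 0
    # Enumerate every way of relabelling n_a of the pooled values as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:   # small tolerance for float comparison
            as_extreme += 1
        n_splits += 1
    return as_extreme / n_splits

# Hypothetical myeloid chimerism values (%), three recipients per group.
young = [10, 20, 30]
aged = [40, 50, 60]
p_value = exact_permutation_p(young, aged)
print(p_value)
```

      With three recipients per group there are only 20 possible relabellings, so even perfectly separated groups cannot give a two-sided p-value below 0.1 with this test, which illustrates why very small groups limit what any test can show.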

      - The correction introduced in the main text as an answer to the original comment #3-6 is still misleading. There is an assumption for GMP, CMP and MEP to increase with age if myeloid-biased HSC clones increase with age ("in contrast to what we anticipated"). Again, the link between these two changes could be more complex than just a direct correlation.

    1. eLife Assessment

      In this valuable study, Taber et al. used a battery of biophysical and structural approaches to characterize the impact of erythrocytosis-related mutations in prolyl hydroxylase domain protein 2 (PHD2). The authors show that PHD2 mutant proteins are destabilized, thus supporting the tenet that dysregulation of the PHD2/hypoxia-inducible factor (HIF) axis underpins erythrocytosis, while providing incomplete evidence that N-terminal ODD prolyl hydroxylation of HIF is indispensable for these phenotypes. Notwithstanding that this study was found to be of broad interest for a variety of fields focusing on oxygen sensing in homeostasis and pathological states, resolving inconsistencies in the biophysical analysis (e.g., NMR, SEC, and BLI/MST) was thought to be warranted to further corroborate the proposed model.

    2. Reviewer #1 (Public review):

      Summary:

      Taber et al. report the biochemical characterization of 7 mutations in PHD2 that induce erythrocytosis. Their goal is to provide a mechanism for how these mutations cause the disease. PHD2 hydroxylates HIF1a in the presence of oxygen at two distinct proline residues (P564 and P402) in the "oxygen degradation domain" (ODD). This leads to the ubiquitylation of HIF1a by the VHL E3 ligase and its subsequent degradation. Multiple mutations have been reported in the EGLN1 gene (coding for PHD2), which are associated with pseudohypoxic diseases that include erythrocytosis. Furthermore, 3 mutations in PHD2 also cause pheochromocytoma and paraganglioma (PPGL), a neuroendocrine tumour. These mutations likely cause elevated levels of HIF1a, but their mechanisms are unclear. Here, the authors analyze mutations from 152 case reports and map them on the crystal structure. They then focus on 7 mutations, which they clone in a plasmid and transfect into PHD2-KO cells to monitor HIF1a transcriptional activity via a luciferase assay. All mutants show impaired activation. Some mutants also show impaired stability in pulse-chase turnover assays (except A228S, P317R, and F366L). In vitro purified PHD2 mutants display a minor loss in thermal stability and some propensity to aggregate. Using MST technology, they show that P317R is strongly impaired in binding to HIF1a and HIF2a, whereas other mutants are only slightly affected. Using NMR, they show that the PHD2 P317R mutation greatly reduces hydroxylation of P402 (HIF1a NODD), as well as P564 (HIF1a CODD), but to a lesser extent. Finally, BLI shows that the P317R mutation reduces affinity for CODD by 3-fold, but not NODD.

      Strengths:

      (1) Simple, easy-to-follow manuscript. Generally well-written.

      (2) Disease-relevant mutations are studied in PHD2 that provide insights into its mechanism of action.

      (3) Good, well-researched background section.

      Weaknesses:

      (1) Poor use of existing structural data on the complexes of PHD2 with HIF1a peptides and various metals and substrates. A quick survey of the impact of these mutations (as well as analysis by Chowdhury et al, 2016) on the structure and on the interactions between PHD2 and HIF1a peptides shows that the P317R mutation interferes with peptide binding. By contrast, F366L will affect the hydrophobic core, and A228S is on the surface, and it's not obvious how it would interfere with the stability of the protein.

      (2) To determine aggregation and monodispersity of the PHD2 mutants using size-exclusion chromatography (SEC), equal quantities of the protein must be loaded on the column. This is not what was done. As an aside, the colors used for the SEC are very similar and nearly indistinguishable.

      (3) The interpretation of some mutants remains incomplete. For A228S, what is the explanation for its reduced activity? It is not substantially less stable than WT and does not seem to affect peptide hydroxylation.

      (4) The interpretation of the NMR prolyl hydroxylation is tainted by the high concentrations used here. First of all, there is likely a typo in the method section; the final concentration of ODD is likely 0.18 mM, and not 0.18 uM (PNAS paper by the same group in 2024 reports using a final concentration of 230 uM). Here, I will assume the concentration is 180 uM. Flashman et al (JBC 2008) showed that the affinity of the NODD site (P402; around 10 uM) for PHD2 is 10-fold weaker than CODD (P564, around 1 uM). This likely explains the much faster kinetics of hydroxylation towards the latter. Now, using the MST data, let's say the P317R mutation reduces the affinity by 40-fold; the affinity becomes 400 uM for NODD (above the protein concentration) and 40 uM for CODD (below the protein concentration). Thus, CODD would still be hydroxylated by the P317R mutant, but not NODD.
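
      The reasoning in this point can be sketched numerically. The snippet below assumes simple 1:1 equilibrium binding with the ODD substrate in large excess over the enzyme, so that fractional occupancy is approximately [S] / ([S] + Kd); this model is an assumption made for illustration and is not stated in the manuscript. The Kd values and the 40-fold loss are the approximate figures quoted above.

```python
def fraction_bound(substrate_um, kd_um):
    """Fractional occupancy for 1:1 binding with substrate in excess (assumed model)."""
    return substrate_um / (substrate_um + kd_um)

ODD_UM = 180.0                             # assumed final ODD concentration (0.18 mM)
WT_KD_UM = {"NODD": 10.0, "CODD": 1.0}     # approximate values cited from Flashman et al.
FOLD_LOSS = 40.0                           # illustrative affinity loss for P317R

occupancy = {}
for site, kd in WT_KD_UM.items():
    occupancy[site] = {
        "WT": fraction_bound(ODD_UM, kd),
        "P317R": fraction_bound(ODD_UM, kd * FOLD_LOSS),
    }
    print(f"{site}: WT {occupancy[site]['WT']:.2f}, P317R {occupancy[site]['P317R']:.2f}")
```

      Under these assumptions the enzyme remains mostly occupied by CODD but drops to roughly one-third occupancy for NODD. Occupancy is not the same as catalytic rate, so this only indicates the direction of the expected effect.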

      (5) The discrepancy between the MST and BLI results does not make sense, especially regarding the P317R mutant. Based on the crystal structures of PHD2 in complex with the ODD peptides, the P317R mutation should have a major impact on the affinity, which is what is reported by MST. This suggests that the MST is more likely to be valid than BLI, and the latter is subject to some kind of artefact. Furthermore, the BLI results are inconsistent with previous results showing that PHD2 has a 10-fold lower affinity for NODD compared to CODD.

      (6) Overall, the study provides some insights into mutants inducing erythrocytosis, but the impact is limited. Most insights are provided on the P317R mutant, but this mutant had already been characterized by Chowdhury et al (2016). Some mutants affect the stability of the protein in cells, but then no mechanism is provided for A228S or F366L, which have stabilities similar to WT, yet have impaired HIF1a activation.

    3. Reviewer #2 (Public review):

      Summary:

      Mutations in the prolyl hydroxylase, PHD2, cause erythrocytosis and, in some cases, can result in tumorigenesis. Taber and colleagues test the structural and functional consequences of seven patient-derived missense mutations in PHD2 using cell-based reporter and stability assays, and multiple biophysical assays, and find that most mutations are destabilizing. Interestingly, they discover a PHD2 mutant that can hydroxylate the C-terminal ODD, but not the N-terminal ODD, which suggests the importance of N-terminal ODD for biology. A major strength of the manuscript is the multidisciplinary approach used by the authors to characterize the functional and structural consequences of the mutations. However, the manuscript has several major weaknesses: an incomplete description of how the NMR was performed, no justification for using neighboring residues as a surrogate for looking at prolyl hydroxylation directly, and no reference to the clinical case studies describing the phenotypes of patient mutations. Additionally, the descriptions of several experiments omit controls or validation, which limits their strength in supporting the claims of the authors.

      Strengths:

      (1) This manuscript is well-written and clear.

      (2) The authors use multiple assays to look at the effects of several disease-associated mutations, which support the claims.

      (3) The identification of P317R as a mutant that loses activity specifically against NODD could provide a useful tool for further studies in cells.

      Weaknesses:

      Major:

      (1) The source data for the patient mutations (Figure 1) in PHD2 is not referenced, and it's not clear where this data came from or if it's publicly available. There is no section describing this in the methods.

      (2) The NMR hydroxylation assay.

      A. The description of these experiments is really confusing. The authors have published a recent paper describing a method using 13C-NMR to directly detect prolyl hydroxylation over time, and they refer to this manuscript multiple times as the method used for the studies under review. However, it appears the current study is using 15N-HSQC-based experiments to track the CSP of neighboring residues to the target prolines, so not the target prolines themselves. The authors should make this clear in the text, especially on page 9, 5th line, where they describe proline cross-peaks and refer to the 15N-HSQC data in Figure 5B.

      B. The authors are using neighboring residues as reporters for proline hydroxylation, without validating this approach. How well do CSPs of A403 and I566 track with proline hydroxylation? Have the authors confirmed this using their 13C-NMR data or mass spec?

      C. Peak intensities. In some cases, the peak intensities of the end point residue look weaker than the peak intensities of the starting residue (5B, PHD2 WT I566, 6 ct lines vs. 4 ct lines). Is this because of sample dilution (i.e., should happen globally)? Can the authors comment on this?

      (3) Data validating the CRISPR KO HEK293A cells is missing.

      (4) The interpretation of the SEC data for the PHD2 mutants is a little problematic. Subtle alterations in the elution profiles may hint at different hydrodynamic radii, but as the samples were not loaded at equal concentrations or volumes, these data seem more anecdotal, rather than definitive. Repeating this multiple times, using matched samples, followed by comparison with standards loaded under identical buffer conditions, would significantly strengthen the conclusions one could make from the data.

      Minor:

      (1) Justification for picking the seven residues is not clearly articulated. The authors say they picked 7 mutants with "distinct residue changes", but no further rationale is provided.

      (2) A major finding of the paper is that a disease-associated mutation, P317R, can differentially affect HIF1 prolyl hydroxylation; however, additional follow-up studies have not been performed to test this in cells or to validate the mutant with another method. Is it the position of the proline within the catalytic core, or the identity of the mutation that accounts for the selectivity?

    4. Reviewer #3 (Public review):

      Summary:

      This is an interesting and clinically relevant in vitro study by Taber et al., exploring how mutations in PHD2 contribute to erythrocytosis and/or neuroendocrine tumors. PHD2 regulates HIFα degradation through prolyl-hydroxylation, a key step in the cellular oxygen-sensing pathway.

      Using a time-resolved NMR-based assay, the authors systematically analyze seven patient-derived PHD2 mutants and demonstrate that all exhibit structural and/or catalytic defects. Strikingly, the P317R variant retains normal activity toward the C-terminal proline but fails to hydroxylate the N-terminal site. This provides the first direct evidence that N-terminal prolyl-hydroxylation is not dispensable, as previously thought.

      The findings offer valuable mechanistic insight into PHD2-driven effects and refine our understanding of HIF regulation in hypoxia-related diseases.

      Strengths:

      The manuscript has several notable strengths. By applying a novel time-resolved NMR approach, the authors directly assess hydroxylation at both HIF1α ODD sites, offering a clear functional readout. This method allows them to identify the P317R variant as uniquely defective in NODD hydroxylation, despite retaining normal activity toward CODD, thereby challenging the long-held view that the N-terminal proline is biologically dispensable. The work significantly advances our understanding of PHD2 function and its role in oxygen sensing, and might help in the future interpretation and clinical management of associated erythrocytosis.

      Weaknesses:

      There is a lack of in vivo/ex vivo validation. This is actually required to confirm whether the observed defects in hydroxylation, especially the selective NODD impairment in P317R, are sufficient to drive disease phenotypes such as erythrocytosis.

      The reliance on HRE-luciferase reporter assays may not reliably reflect the PHD2 function and highlights a limitation in the assessment of downstream hypoxic signaling.

      The study clearly documents the selective defect of the P317R mutant, but the structural basis for this selectivity is not addressed through high-resolution structural analysis (e.g., cryo-EM).

      Given the proposed central role of HIF2α in erythrocytosis, direct assessment of HIF2α hydroxylation by the mutants would have strengthened the conclusions.

    5. Author response:

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      Taber et al report the biochemical characterization of 7 mutations in PHD2 that induce erythrocytosis.

      Their goal is to provide a mechanism for how these mutations cause the disease. PHD2 hydroxylates HIF1a in the presence of oxygen at two distinct proline residues (P564 and P402) in the "oxygen degradation domain" (ODD). This leads to the ubiquitylation of HIF1a by the VHL E3 ligase and its subsequent degradation. Multiple mutations have been reported in the EGLN1 gene (coding for PHD2), which are associated with pseudohypoxic diseases that include erythrocytosis. Furthermore, 3 mutations in PHD2 also cause pheochromocytoma and paraganglioma (PPGL), a neuroendocrine tumour. These mutations likely cause elevated levels of HIF1a, but their mechanisms are unclear. Here, the authors analyze mutations from 152 case reports and map them on the crystal structure. They then focus on 7 mutations, which they clone in a plasmid and transfect into PHD2-KO cells to monitor HIF1a transcriptional activity via a luciferase assay. All mutants show impaired activation. Some mutants also show impaired stability in pulse-chase turnover assays (except A228S, P317R, and F366L). In vitro purified PHD2 mutants display a minor loss in thermal stability and some propensity to aggregate. Using MST technology, they show that P317R is strongly impaired in binding to HIF1a and HIF2a, whereas other mutants are only slightly affected. Using NMR, they show that the PHD2 P317R mutation greatly reduces hydroxylation of P402 (HIF1a NODD), as well as P564 (HIF1a CODD), but to a lesser extent. Finally, BLI shows that the P317R mutation reduces affinity for CODD by 3-fold, but not NODD.

      Strengths: 

      (1) Simple, easy-to-follow manuscript. Generally well-written. 

      (2) Disease-relevant mutations are studied in PHD2 that provide insights into its mechanism of action. 

      (3) Good, well-researched background section. 

      Weaknesses: 

      (1) Poor use of existing structural data on the complexes of PHD2 with HIF1a peptides and various metals and substrates. A quick survey of the impact of these mutations (as well as analysis by Chowdhury et al, 2016) on the structure and on the interactions between PHD2 and HIF1a peptides shows that the P317R mutation interferes with peptide binding. By contrast, F366L will affect the hydrophobic core, and A228S is on the surface, and it's not obvious how it would interfere with the stability of the protein.

      Thank you for the comment.  We will further analyze the mutations on the available PHD2 crystal structures in complex with HIFa to discern how these substitution mutations may impact PHD2 structure and function.  

      (2) To determine aggregation and monodispersity of the PHD2 mutants using size-exclusion chromatography (SEC), equal quantities of the protein must be loaded on the column. This is not what was done. As an aside, the colors used for the SEC are very similar and nearly indistinguishable. 

      Agreed. We will perform additional experiments as suggested by the reviewer to further assess aggregation and hydrodynamic size. The colors used in the graph will be changed for clearer differentiation between samples.

      (3) The interpretation of some mutants remains incomplete. For A228S, what is the explanation for its reduced activity? It is not substantially less stable than WT and does not seem to affect peptide hydroxylation. 

      We agree with the reviewer that the causal mechanism for some of the tested disease-causing mutants remain unclear.  The negative findings also raise the notion, perhaps considered controversial, that there may be other substrates of PHD2 that are impacted by certain mutations, which contribute to disease pathogenesis.  We will expand our discussion accordingly. 

      (4) The interpretation of the NMR prolyl hydroxylation is tainted by the high concentrations used here. First of all, there is a likely a typo in the method section; the final concentration of ODD is likely 0.18 mM, and not 0.18 uM (PNAS paper by the same group in 2024 reports using a final concentration of 230 uM). Here, I will assume the concentration is 180 uM. Flashman et al (JBC 2008) showed that the affinity of the NODD site (P402; around 10 uM) for PHD2 is 10-fold weaker than CODD (P564, around 1 uM). This likely explains the much faster kinetics of hydroxylation towards the latter. Now, using the MST data, let's say the P317R mutation reduces the affinity by 40-fold; the affinity becomes 400 uM for NODD (above the protein concentration) and 40 uM for CODD (below the protein concentration). Thus, CODD would still be hydroxylated by the P317R mutant, but not NODD. 

      The HIF1α concentration was indeed an oversight, which will be corrected to 0.18 mM.  The study by Flashman et al.[1] showing PHD2 having a lower affinity to the NODD than CODD likely contributes to the differential hydroxylation rates via PHD2 WT.  We showed here via MST that PHD2 P317R had Kd of 320 ± 20 uM for HIF1αCODD, which should have led to a severe enzymatic defect, even at the high concentrations used for NMR (180 uM).  However, we observed only a subtle reduction in hydroxylation efficiency in comparison to PHD2 WT.  Thus, we performed another binding method using BLI that showed a mild binding defect on CODD by PHD2 P317R, consistent with NMR data.  The perplexing result is the WT-like binding to the NODD by PHD2 P317R, which appears inconsistent with the severe defect in NODD hydroxylation via PHD2 P317R as measured via NMR.  These results suggest that there are supporting residues within the PHD2/NODD interface that help maintain binding to NODD but compromise the efficiency of NODD hydroxylation upon PHD2 P317R mutation. We will perform additional binding experiments to further interrogate and validate the binding affinity of PHD2 P317R to NODD and CODD.
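      The affinity argument in point (4) can be made concrete with a one-site occupancy calculation. The sketch below is illustrative only: it assumes simple hyperbolic binding and that the ODD substrate (180 uM) is in large excess over PHD2, so that free substrate is approximately equal to total substrate. The enzyme concentration is not stated in this exchange, so that excess is an assumption, and the function name `fraction_bound` is hypothetical. The Kd values are the ones quoted above (1 and 10 uM for WT, the reviewer's 40-fold scenario, and the 320 uM MST value).

```python
def fraction_bound(kd_um: float, ligand_um: float) -> float:
    """Fraction of enzyme occupied by ligand for one-site binding.

    Assumes the ligand is in excess, so free ligand ~ total ligand.
    """
    return ligand_um / (ligand_um + kd_um)


SUBSTRATE_UM = 180.0  # ODD concentration assumed in the reviewer's argument

# Kd values (uM) quoted in the review and response; labels are descriptive only.
scenarios = {
    "WT / CODD (Flashman et al.)": 1.0,
    "WT / NODD (Flashman et al.)": 10.0,
    "P317R / CODD (40-fold weaker, reviewer's scenario)": 40.0,
    "P317R / NODD (40-fold weaker, reviewer's scenario)": 400.0,
    "P317R / CODD (MST, author response)": 320.0,
}

for label, kd in scenarios.items():
    occupancy = fraction_bound(kd, SUBSTRATE_UM)
    print(f"{label}: Kd = {kd:g} uM, fraction bound = {occupancy:.2f}")
```

      Under these assumptions the calculation only ranks the scenarios by equilibrium occupancy; it does not model turnover, so it cannot by itself settle whether a given mutant would hydroxylate NODD or CODD.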

      (5) The discrepancy between the MST and BLI results does not make sense, especially regarding the P317R mutant. Based on the crystal structures of PHD2 in complex with the ODD peptides, the P317R mutation should have a major impact on the affinity, which is what is reported by MST. This suggests that the MST is more likely to be valid than BLI, and the latter is subject to some kind of artefact. Furthermore, the BLI results are inconsistent with previous results showing that PHD2 has a 10-fold lower affinity for NODD compared to CODD. 

      The reviewer’s structural prediction that P317R mutation should cause a major binding defect, while agreeable with our MST data, is incongruent with our NMR and the data from Chowdhury et al.[2] that showed efficient hydroxylation of CODD via PHD2 P317R.  Moreover, we have attempted to model NODD and CODD on the apo PHD2 P317R structure and found that the mutation had no major impact on CODD while the mutated residue could clash with NODD, causing a shifting of peptide positioning on the protein.  However, these modeling predictions, like any in silico projections, would need experimental validation.  As mentioned in our preceding response, we also performed BLI, which showed that PHD2 P317R had a minor binding defect for CODD, consistent with the NMR results and findings by Chowdhury et al[2].  NODD binding was also measured with BLI as purified NODD peptides were not amenable for soluble-based MST assay, which showed similar Kd values for PHD2 WT and P317R.  Considering the absence of NODD hydroxylation via PHD2 P317R as measured by NMR and modeling on apo PHD2 P317R, we posit that P317R causes deviation of NODD from its original orientation that may not affect binding due to the other interactions from the surrounding elements but unfortunately disallows NODD from turnover.  Further study would be required to validate this notion, which we feel is beyond the scope of this manuscript.  However, we will perform additional binding experiments to further interrogate PHD2 P317R binding to NODD.   

      (6) Overall, the study provides some insights into mutants inducing erythrocytosis, but the impact is limited. Most insights are provided on the P317R mutant, but this mutant had already been characterized by Chowdhury et al (2016). Some mutants affect the stability of the protein in cells, but then no mechanism is provided for A228S or F366L, which have stabilities similar to WT, yet have impaired HIF1a activation. 

      We thank the reviewer for raising these and other limitations.  We will expand on the shortcomings of the present study but would like to underscore that the current work using the recently described NMR assay along with other biophysical analyses suggests a previously under-appreciated role of NODD hydroxylation in the normal oxygen-sensing pathway.  

      Reviewer #2 (Public review): 

      Summary: 

      Mutations in the prolyl hydroxylase, PHD2, cause erythrocytosis and, in some cases, can result in tumorigenesis. Taber and colleagues test the structural and functional consequences of seven patient-derived missense mutations in PHD2 using cell-based reporter and stability assays, and multiple biophysical assays, and find that most mutations are destabilizing. Interestingly, they discover a PHD2 mutant that can hydroxylate the C-terminal ODD, but not the N-terminal ODD, which suggests the importance of N-terminal ODD for biology. A major strength of the manuscript is the multidisciplinary approach used by the authors to characterize the functional and structural consequences of the mutations. However, the manuscript had several major weaknesses, such as an incomplete description of how the NMR was performed, no justification for using neighboring residues as a surrogate for looking at prolyl hydroxylation directly, and no reference to the clinical case studies describing the phenotypes of patient mutations. Additionally, the descriptions of several experiments lack controls or validation, which limits their strength in supporting the claims of the authors. 

      Strengths: 

      (1) This manuscript is well-written and clear. 

      (2) The authors use multiple assays to look at the effects of several disease-associated mutations, which support the claims. 

      (3) The identification of P317R as a mutant that loses activity specifically against NODD, which could be a useful tool for further studies in cells. 

      Weaknesses: 

      Major: 

      (1) The source data for the patient mutations (Figure 1) in PHD2 is not referenced, and it's not clear where this data came from or if it's publicly available. There is no section describing this in the methods.

      Clinical and patient information on disease-causing PHD2 mutants was compiled from various case reports and summarized in an excel sheet found in the Supplementary Information.  The case reports are cited in this excel file.  A reference to the supplementary data will be added to the Figure 1 legend and in the introduction.

      (2) The NMR hydroxylation assay. 

      A. The description of these experiments is really confusing. The authors have published a recent paper describing a method using 13C-NMR to directly detect prolyl-hydroxylation over time, and they refer to this manuscript multiple times as the method used for the studies under review. However, it appears the current study is using 15N-HSQC-based experiments to track the CSP of neighboring residues to the target prolines, so not the target prolines themselves. The authors should make this clear in the text, especially on page 9, 5th line, where they describe proline cross-peaks and refer to the 15N-HSQC data in Figure 5B. 

      As the reviewer mentioned, the assay that we developed directly measures the target proline residues.  This assay is ideal when mutations near the prolines are studied, such as A403, Y565 (He et al[3]).  In this previous work, we observed that the shifting of the target proline cross-peaks due to change in electronegativity on the pyrrolidine ring of proline in turn impacted the neighboring residues[3], which meant that the neighboring residues can be used as reporter residues for certain purposes.  In this study, we focused on investigating the mutations on PHD2 while leaving the sequence of the HIF-1α unchanged by using solely 15N-HSQC-based experiments without the need for double-labeled samples.  Nonetheless, we thank the reviewer for pointing out the confusion in the text and we will correct and clarify our description of this assay.

      B. The authors are using neighboring residues as reporters for proline hydroxylation, without validating this approach. How well do CSPs of A403 and I566 track with proline hydroxylation? Have the authors confirmed this using their 13C-NMR data or mass spec? 

      For previous studies, we performed intercalated 15N-HSQC and 13C-CON experiments for the kinetic measurements of wild-type HIF-1α and mutants.  We observed that the shifting pattern of A403 and I566 in the 15N-HSQC spectra aligned well with the ones of P402 and P564, respectively, in the 13C-CON spectra.  Representative data will be added to Supplemental Data.

      C. Peak intensities. In some cases, the peak intensities of the end point residue look weaker than the peak intensities of the starting residue (5B, PHD2 WT I566, 6 ct lines vs. 4 ct lines). Is this because of sample dilution (i.e., should happen globally)? Can the authors comment on this? 

      This is an astute observation by the reviewer.  We checked and confirmed that for all kinetic datasets, the peak intensities of the end point residue are always slightly lower than those of the starting residue.  This includes the cases for PHD2 A228S and P317R in 5B, although not as obvious as that of PHD2 WT.  We agree with the reviewer that sample dilution is a factor, as a total volume of 16 microliters of reaction components was added to the solution to trigger the reaction after the first spectrum was acquired.  It is also likely that the rate of prolyl hydroxylation becomes extremely slow when only a low amount of substrate remains in the system.  Therefore, the reaction would not be 100% complete, which was detected by the sensitive NMR experimentation.

      (3) Data validating the CRISPR KO HEK293A cells is missing. 

      We thank the reviewer for noting this oversight.  Western blots validating PHD2 KO in HEK293A cells will be added to the Supplementary Data file.

      (4) The interpretation of the SEC data for the PHD2 mutants is a little problematic. Subtle alterations in the elution profiles may hint at different hydrodynamic radii, but as the samples were not loaded at equal concentrations or volumes, these data seem more anecdotal, rather than definitive. Repeating this multiple times, using matched samples, followed by comparison with standards loaded under identical buffer conditions, would significantly strengthen the conclusions one could make from the data. 

      Agreed.  We will perform additional experiments as suggested with equal volume and concentration of each PHD2 construct loaded onto the SEC column for better assessment of aggregation.

      Minor: 

      (1) Justification for picking the seven residues is not clearly articulated. The authors say they picked 7 mutants with "distinct residue changes", but no further rationale is provided. 

      Additional justification for the selection of the mutants will be added to the ‘Mutations across the PHD2 enzyme induce erythrocytosis’ section.  Briefly, some mutants were chosen based on their frequency in the clinical data and their presence in potential mutational hot spots.  Various mutations were noted at W334 and R371, while F366L was identified in multiple individuals.  Additionally, 9 cases of PHD2-driven disease were reported to be caused by mutations located between residues 200 and 210, while 13 cases were reported between residues 369 and 379, so G206C and R371H were chosen to represent potential hot spots.  To examine a potential genotype-phenotype relationship, two of the mutants responsible for neuroendocrine tumor development, A228S and H374R, were also selected.  Finally, mutations located close to or on catalytic core residues (P317R, R371H, and H374R) were chosen to test for suspected defects.   

      (2) A major finding of the paper is that a disease-associated mutation, P317R, can differentially affect HIF1 prolyl hydroxylation; however, additional follow-up studies have not been performed to test this in cells or to validate the mutant in another method. Is it the position of the proline within the catalytic core, or the identity of the mutation that accounts for the selectivity? 

      This is the very question that we are currently addressing, but as part of a follow-up study.  Indeed, one thought is that the preferential defect observed could be the result of the loss of proline, an exceptionally rigid amino acid that makes contact with the backbone twice, or the addition of a specific amino acid, namely arginine, a flexible amino acid with an added charge at this site.  Although beyond the scope of this manuscript, we will investigate whether these and other characteristics in this region of the PHD2/HIF1α interface contribute to the differential hydroxylation. 

      Reviewer #3 (Public review): 

      Summary: 

      This is an interesting and clinically relevant in vitro study by Taber et al., exploring how mutations in PHD2 contribute to erythrocytosis and/or neuroendocrine tumors. PHD2 regulates HIFα degradation through prolyl-hydroxylation, a key step in the cellular oxygen-sensing pathway. 

      Using a time-resolved NMR-based assay, the authors systematically analyze seven patient-derived PHD2 mutants and demonstrate that all exhibit structural and/or catalytic defects. Strikingly, the P317R variant retains normal activity toward the C-terminal proline but fails to hydroxylate the N-terminal site. This provides the first direct evidence that N-terminal prolyl-hydroxylation is not dispensable, as previously thought. 

      The findings offer valuable mechanistic insight into PHD2-driven effects and refine our understanding of HIF regulation in hypoxia-related diseases. 

      Strengths: 

      The manuscript has several notable strengths. By applying a novel time-resolved NMR approach, the authors directly assess hydroxylation at both HIF1α ODD sites, offering a clear functional readout. This method allows them to identify the P317R variant as uniquely defective in NODD hydroxylation, despite retaining normal activity toward CODD, thereby challenging the long-held view that the N-terminal proline is biologically dispensable. The work significantly advances our understanding of PHD2 function and its role in oxygen sensing, and might help in the future interpretation and clinical management of associated erythrocytosis. 

      Weaknesses: 

      (1) There is a lack of in vivo/ex vivo validation. This is actually required to confirm whether the observed defects in hydroxylation-especially the selective NODD impairment in P317R-are sufficient to drive disease phenotypes such as erythrocytosis. 

      We thank the reviewer for this comment, and while we agree with this statement, the objective of this study per se was to elucidate the structural and/or functional defect caused by the various disease-associated mutations on PHD2. The subsequent study would be to validate whether the identified defects, in particular the selective NODD impairment, would lead to erythrocytosis in vivo.  However, we feel that such a study would be beyond the scope of this manuscript.

      (2) The reliance on HRE-luciferase reporter assays may not reliably reflect the PHD2 function and highlights a limitation in the assessment of downstream hypoxic signaling. 

      Agreed.  All experimental assays and systems have limitations. The HRE-luciferase assay used in the present manuscript also has limitations such as the continuous expression of exogenous PHD2 mutants driven via CMV promoter. Thus, we performed several additional biophysical methodologies to interrogate the disease-causing PHD2 mutants. The limitations of the luciferase assay will be expanded in the revised manuscript. 

      (3) The study clearly documents the selective defect of the P317R mutant, but the structural basis for this selectivity is not addressed through high-resolution structural analysis (e.g., cryo-EM). 

      We thank the reviewer for the comment.  While solving the structure of PHD2 P317R in complex with HIFα substrate is beyond the scope of this study, a structure of PHD2 P317R in complex with a clinically used inhibitor has been solved (PDB:5LAT).  In analyzing this structure and that of PHD2 WT in complex with NODD, Chowdhury et al[2] stated that P317 makes hydrophobic contacts with the LXXLAP motif on HIFα and R317 is predicted to interact differently with this motif. While this analysis does not directly elucidate the reason for the preferential NODD defect, it supports the possibility that the P317R substitution may be more detrimental for enzymatic activity on NODD than CODD. We will discuss this notion in the revised manuscript. 

      (4) Given the proposed central role of HIF2α in erythrocytosis, direct assessment of HIF2α hydroxylation by the mutants would have strengthened the conclusions. 

      We thank the reviewer for this comment, but we feel that such a study would be beyond the scope of the present study. We observed that the PHD2 binding patterns to HIF1α and HIF2α were similar, and we have previously assigned >95% of the amino acids in HIF1α ODD for NMR study[3]. Thus, we first focused on the elucidation of possible defects on disease-associated PHD2 mutants using HIF1α as the substrate with the supposition that an identified deregulation on HIF1α could be extended to the HIF2α paralog. 

      However, we agree with the reviewer that future studies should examine the impact of PHD2 mutants directly on HIF2α.  

      References:

      (1) Flashman, E. et al. Kinetic rationale for selectivity toward N- and C-terminal oxygen-dependent degradation domain substrates mediated by a loop region of hypoxia-inducible factor prolyl hydroxylases. J Biol Chem 283, 3808-3815 (2008).

      (2) Chowdhury, R. et al. Structural basis for oxygen degradation domain selectivity of the HIF prolyl hydroxylases. Nat Commun 7, 12673 (2016).

      (3) He, W., Gasmi-Seabrook, G.M.C., Ikura, M., Lee, J.E. & Ohh, M. Time-resolved NMR detection of prolyl-hydroxylation in intrinsically disordered region of HIF-1alpha. Proc Natl Acad Sci U S A 121, e2408104121 (2024).

    1. eLife Assessment

      Based on several lines of interesting data, the authors conclude that FMRP, though associated with stalled ribosomes, does not determine the position on the mRNAs at which ribosomes stall. Although this conclusion would be valuable if clearly established, the current set of data are incomplete and it is unclear if the methodologies applied in this paper are fully adequate to address this gap.

    2. Reviewer #1 (Public review):

      Summary:

      The authors have investigated the role of FMRP in the formation and function of RNA granules in mouse brain/cultured hippocampal neurons. Most of their results indicate that FMRP does not have a role in the formation or function of RNA granules with specific mRNAs, but may have some role in distal RNA granules in neurons and their response to synaptic stimulation. This is an important work (though the results are mostly negative) in understanding the composition and function of neuronal RNA granules. The last part of the work in cultured neurons is disjointed from the rest of the manuscript, and the results are neither convincing nor provide any mechanistic insight.

      Strengths:

      (1) The study is quite thorough, the methods and analysis used are robust, and the conclusion and interpretation are diligent.

      (2) The comparative study of Rat and Mouse RNA granules is very helpful for future studies.

      (3) The conclusion that the absence of FMRP does not affect the RNA granule composition and many of its properties in the system the authors have chosen to study is well supported by the results.

      (4) The difference in the response to DHPG stimulation concerning RNA granules described here is very interesting and could provide a basis for further studies, though it has some serious technical issues.

      Weaknesses:

      (1) The system used for the study (P5 mouse brain or DIV 8-10 cultured neuron) is surprising, as the majority of defects in the absence of FMRP are reported in later stages (P30+ brain and DIV 14+ neurons). It is important to test if the conclusions drawn here hold good at different developmental stages.

      (2) The term 'distal granules' is very vague. Since there is no structural or biochemical characterization of these granules, it is difficult to understand how they are different from the proximal granules and why FMRP has an effect only on these granules.

      (3) Since the manuscript does not find any effect of FMRP on neuronal RNA granules, it does not provide any new molecular insight with respect to the function of FMRP.

    3. Reviewer #2 (Public review):

      In the present manuscript, Li et al. use biochemical fractionation of "RNA granules" from P5 wildtype and FMR1 knock-out mouse brains to analyze their protein/RNA content, determine a single particle cryo-EM structure of contained ribosomes, and perform ribo-seq analysis of ribosome-protected RNA fragments (RPFs). The authors conclude from these that neither the composition of the ribosome granules, nor the state of their contained ribosomes, nor the mRNA positions with high ribosome occupancy change significantly. Besides minor changes in mRNA occupancy, the one change the authors identified is a decrease in puromycylated punctae in distal neurites of cultured primary neurons of the same mice, and their enhanced resistance to different pharmacological treatments. These results directly build on their earlier work (Anadolu et al., 2023) using analogous preparations of rat brains; the authors now perform a very similar study using WT and FMR1-KO mouse brains. This is an important topic, aiming to identify the molecular underpinnings of the FMRP protein, which is the basis of a major neurological disease. Unfortunately, several limitations of this study prevent it from being more convincing in its present form.

      In order to improve this study, our main suggestions are as follows:

      (1) The authors equate their biochemically purified "RG" fraction with their imaging-based detection of puromycin-positive punctae. They claim essentially no differences in RGs, but detect differences in the latter (mostly their abundance and sensitivity to DHPG/HHT/Aniso). In the discussion the authors acknowledge the inconsistency between these two modalities: "An inconsistency in our findings is the loss of distal RPM puncta coupled with an increase in the immunoreactivity for S6 in the RG." and "Thus, it may be that the RG is not simply made up of ribosomes from the large liquid-liquid phase RNA granules."

      How can the authors be sure that they are analysing the same entities in both modalities? A more parsimonious explanation of their results would be that, while there might be some overlap, two different entities are analyzed. Much of the main message rests on this equivalence, and I believe the authors should show its validity.

      (2) The authors show that increased nuclease digestion (and magnesium concentration) led to a reduction of their RPF sizes down to levels also seen by other researchers. Analyzing these now properly digested RPFs, the authors state that the CDS coverage and periodicity drastically improved, and that spurious enrichments of secretory mRNAs, which made up one of the major fractions in their previous work, are now reduced. In my opinion, this would be more appropriately communicated as a correction to their previous work, not as a main Figure in another manuscript.

      (3) The fold changes reported in Figure 7 (ranging between log2(-0.2) and log2(+0.25)) are all extremely small and in my opinion should not be used to derive claims such as "The loss of FMRP significantly affected the abundance and occupancy of FMRP-Clipped mRNAs in WT and FMR1-KO RG (Fig 7A, 7B), but not their enrichment between RG and RCs".

      (4) Figure 8 / S8-1 - The authors show that ~2/3 of their reads stem from PCR duplicates, but that even after removing those, the majority of peaks remain unaltered. At the same time, Figure S8-1 shows the total number of peaks to be 615 compared with 1392 before duplicate removal. Can the authors comment on this discrepancy? In addition, the dataset with properly removed artefacts should be used for their main display item instead of the current Figure 8.

      (5) Figure 9 / S9-1, the density of punctae in both WT and FMR1-KO actually increases after treatment of HHT or Anisomycin (Figure S9-1 B-C). Even if a large fraction would now be "resistant to run-off", there should not be an increase. While this effect is deemed not significant, a much smaller effect in Figure 9C is deemed significant. Can the authors explain this? Given how vastly different the sample sizes are (ranging from 23 neurites in Figures S9-1 to 5,171 neurites in Figure 9), the authors should (randomly) sample to the same size and repeat their statistical analysis again, to improve their credibility.
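      The equal-size resampling suggested in point (5) could be sketched as follows. This is a minimal illustration, not the authors' analysis: the function names are hypothetical, the inputs are assumed to be per-neurite puncta densities supplied as plain lists of numbers, and a two-sided permutation test on the difference in means stands in for whatever test the manuscript actually uses.

```python
import random
from statistics import mean


def subsample_to_equal_size(group_a, group_b, rng):
    """Randomly draw (without replacement) the same number of values from each group."""
    n = min(len(group_a), len(group_b))
    return rng.sample(list(group_a), n), rng.sample(list(group_b), n)


def permutation_p_value(group_a, group_b, rng, n_permutations=2000):
    """Two-sided permutation test on the absolute difference in group means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


def repeated_subsample_p_values(group_a, group_b, n_repeats=100, seed=0):
    """Repeat the subsample-then-test procedure to see how stable the p-value is."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_repeats):
        sub_a, sub_b = subsample_to_equal_size(group_a, group_b, rng)
        p_values.append(permutation_p_value(sub_a, sub_b, rng))
    return p_values
```

      Reporting the distribution of p-values across repeats, rather than a single draw, would show whether an effect seen with 5,171 neurites survives at the sample size of the smaller comparison.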

    4. Reviewer #3 (Public review):

      Summary: Li et al describe a set of experiments to probe the role of FMRP in ribosome stalling and RNA granule composition. The authors are able to recapitulate findings from a previous study performed in rats (this one is in mice).

      Strengths:

      1) The work addresses an important and challenging issue, investigating mechanisms that regulate stalled ribosomes, focusing on the role of FMRP. This is a complicated problem, given the heterogeneity of the granules and the challenges related to their purification. This work is a solid attempt at addressing this issue, which is widely understudied.

      2) The interpretation of the results could be interesting, if supported by solid data. The idea that FMRP could control the formation and release of RNA granules, rather than the elongation by stalled ribosomes is of high importance to the field, offering a fresh perspective into translational regulation by FMRP.

      3) The authors focused on recapitulating previous findings, published elsewhere (Anadolu et al., 2023) by the same group, but using rat tissue, rather than mouse tissue. Overall, they succeeded in doing so, demonstrating, among other findings, that stalled ribosomes are enriched in consensus mRNA motifs that are linked to FMRP. These interesting findings reinforce the role of FMRP in formation and stabilization of RNA granules. It would be nice to see extensive characterization of the mouse granules as performed in Figure 1 of Anadolu and colleagues, 2023.

      4) Some of the techniques incorporated aid in creating novel hypotheses, such as the ribopuromycylation assay and the cryo-EM of granule ribosomes.

      Weaknesses:

      1) The RNA granule characterization needs to be more rigorous. Coomassie is not proper for this type of characterization, simply because protein weight says little about its nature. The enrichment of key proteins is not robust and seems to not reach significance in multiple instances, including S6 and UPF1. Furthermore, S6 is the only proxy used for ribosome quantification. Could the authors include at least 3 other ribosomal proteins (2 from small, 2 from large subunit)?

      2) Page 12-13 - The Gene Ontology analysis is performed incorrectly. First, one should not rank genes by their RPKM levels. It is well known that housekeeping genes such as those related to actin dynamics, molecular transport and translation are highly enriched in sequencing datasets. It is usually more informative when significantly different genes are ranked by adjusted p-value or log2 fold change, then compared against a background to verify enrichment of specific processes. However, the authors found no DEGs. I would suggest the removal of this analysis and the incorporation of a gene set enrichment analysis (ranked by adjusted p-value). I further suggest that the authors incorporate a dimensionality reduction analysis to demonstrate that the lack of significance stems from biology and not experimental artifacts, such as poor reproducibility across biological replicates.

    5. Author response:

      Reviewer #1 (Public review):

      Summary:

      The authors have investigated the role of FMRP in the formation and function of RNA granules in mouse brain/cultured hippocampal neurons. Most of their results indicate that FMRP does not have a role in the formation or function of RNA granules with specific mRNAs, but may have some role in distal RNA granules in neurons and their response to synaptic stimulation. This is an important work (though the results are mostly negative) in understanding the composition and function of neuronal RNA granules. The last part of the work in cultured neurons is disjointed from the rest of the manuscript, and the results are neither convincing nor provide any mechanistic insight.

      Strengths:

      (1) The study is quite thorough, the methods and analysis used are robust, and the conclusion and interpretation are diligent.

      (2) The comparative study of Rat and Mouse RNA granules is very helpful for future studies.

      (3) The conclusion that the absence of FMRP does not affect the RNA granule composition and many of its properties in the system the authors have chosen to study is well supported by the results.

      (4) The difference in the response to DHPG stimulation concerning RNA granules described here is very interesting and could provide a basis for further studies, though it has some serious technical issues.

      Weaknesses:

      (1) The system used for the study (P5 mouse brain or DIV 8-10 cultured neuron) is surprising, as the majority of defects in the absence of FMRP are reported in later stages (P30+ brain and DIV 14+ neurons). It is important to test if the conclusions drawn here hold good at different developmental stages.

      (2) The term 'distal granules' is very vague. Since there is no structural or biochemical characterization of these granules, it is difficult to understand how they are different from the proximal granules and why FMRP has an effect only on these granules.

      (3) Since the manuscript does not find any effect of FMRP on neuronal RNA granules, it does not provide any new molecular insight with respect to the function of FMRP.

      Thank you for your comments and for pointing out the strengths of the manuscript. Unfortunately, we will not be able to respond to point #1. The protocol for purification of the ribosomes from RNA granules does not work in older brains (See Khandjian et al, 2004 PNAS 101:13357), presumably due to the presence of large concentrations of myelin. While it would be possible to repeat our results later in culture, we have no expectation that they would be different, since we do observe DHPG induction of elongation-dependent, initiation-independent mGLUR-LTD in later cultures (Graber et al, 2017 J. Neuroscience 37:9116). We will strengthen the caveat in the discussion that our results represent only a snapshot of development and that it is certainly possible that different results may be seen at different times. We agree with point 2 that ‘distal granules’ is a vague term. We will remove the term and clarify that we only quantified granules more than 50 microns from the cell soma. We do not know if these granules are distinct. We would respectfully disagree with point #3 that the study does not provide molecular insight into the function of FMRP: disproving that FMRP is important for stalling and for determining the position of stalling removes a major hypothesis about the function of FMRP, and showing that something is not true is, at least to me, providing insight.

      Reviewer #2 (Public review):

      In the present manuscript, Li et al. use biochemical fractionation of "RNA granules" from P5 wildtype and FMR1 knock-out mouse brains to analyze their protein/RNA content, determine a single particle cryo-EM structure of contained ribosomes, and perform ribo-seq analysis of ribosome-protected RNA fragments (RPFs). The authors conclude from these that neither the composition of the ribosome granules, nor the state of their contained ribosomes, nor the mRNA positions with high ribosome occupancy change significantly. Besides minor changes in mRNA occupancy, the one change the authors identified is a decrease in puromycylated punctae in distal neurites of cultured primary neurons of the same mice, and their enhanced resistance to different pharmacological treatments. These results directly build on their earlier work (Anadolu et al., 2023) using analogous preparations of rat brains; the authors now perform a very similar study using WT and FMR1-KO mouse brains. This is an important topic, aiming to identify the molecular underpinnings of the FMRP protein, which is the basis of a major neurological disease. Unfortunately, several limitations of this study prevent it from being more convincing in its present form.

      In order to improve this study, our main suggestions are as follows:

      (1) The authors equate their biochemically purified "RG" fraction with their imaging-based detection of puromycin-positive puncta. They claim essentially no differences in RGs, but detect differences in the latter (mostly their abundance and sensitivity to DHPG/HHT/Aniso). In the discussion, the authors acknowledge the inconsistency between these two modalities: "An inconsistency in our findings is the loss of distal RPM puncta coupled with an increase in the immunoreactivity for S6 in the RG." and "Thus, it may be that the RG is not simply made up of ribosomes from the large liquid-liquid phase RNA granules."

      How can the authors be sure that they are analysing the same entities in both modalities? A more parsimonious explanation of their results would be that, while there might be some overlap, two different entities are analyzed. Much of the main message rests on this equivalence, and I believe the authors should show its validity.

      (2) The authors show that increased nuclease digestion (and magnesium concentration) led to a reduction of their RPF sizes down to levels also seen by other researchers. Analyzing these now properly digested RPFs, the authors state that the CDS coverage and periodicity drastically improved, and that spurious enrichments of secretory mRNAs, which made up one of the major fractions in their previous work, are now reduced. In my opinion, this would be more appropriately communicated as a correction to their previous work, not as a main Figure in another manuscript.

      (3) The fold changes reported in Figure 7 (log2 fold changes ranging between -0.2 and +0.25) are all extremely small and in my opinion should not be used to derive claims such as "The loss of FMRP significantly affected the abundance and occupancy of FMRP-Clipped mRNAs in WT and FMR1-KO RG (Fig 7A, 7B), but not their enrichment between RG and RCs".

      (4) Figure 8 / S8-1 - The authors show that ~2/3 of their reads stem from PCR duplicates, but that even after removing those, the majority of peaks remain unaltered. At the same time, Figure S8-1 shows the total number of peaks to be 615 compared with 1392 before duplicate removal. Can the authors comment on this discrepancy? In addition, the dataset with properly removed artefacts should be used for their main display item instead of the current Figure 8.

      (5) Figure 9 / S9-1 - The density of puncta in both WT and FMR1-KO actually increases after treatment with HHT or Anisomycin (Figure S9-1 B-C). Even if a large fraction would now be "resistant to run-off", there should not be an increase. While this effect is deemed not significant, a much smaller effect in Figure 9C is deemed significant. Can the authors explain this? Given how vastly different the sample sizes are (ranging from 23 neurites in Figure S9-1 to 5,171 neurites in Figure 9), the authors should (randomly) sample to the same size and repeat their statistical analysis, to improve their credibility.
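
      The equal-size resampling suggested in point (5) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' analysis: the function names (`downsample`, `permutation_p_value`, `equal_size_test`), the choice of a permutation test on the difference in medians, and all parameter defaults are hypothetical, and real puncta densities would replace the placeholder inputs.

```python
import random
import statistics


def downsample(values, n, rng):
    """Draw n values without replacement from one group (hypothetical helper)."""
    return rng.sample(list(values), n)


def permutation_p_value(a, b, rng, n_perm=2000, stat=statistics.median):
    """Two-sided permutation test on the absolute difference of `stat`
    between groups a and b. The median is an assumed default, chosen
    because it is less sensitive to skew than the mean."""
    observed = abs(stat(a) - stat(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(stat(pooled[:len(a)]) - stat(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def equal_size_test(a, b, rng, n_draws=50, n_perm=200):
    """Repeatedly downsample both groups to the size of the smaller one
    and collect the p-value obtained from each draw."""
    n = min(len(a), len(b))
    return [
        permutation_p_value(downsample(a, n, rng), downsample(b, n, rng),
                            rng, n_perm=n_perm)
        for _ in range(n_draws)
    ]
```

      Looking at the spread of p-values across draws, rather than a single p-value, would indicate whether a reported difference survives when both comparisons are made at the same sample size.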

      Thank you for your comments. We agree with the issue in point #1 that the equivalence of RPM puncta with the RG fraction is an issue, and while we believe that we show in a number of ways that the two are related (anisomycin-resistant puromycylation, puromycylation only at high concentrations consistent with the hybrid state, etc.), we would respectfully disagree that our main message results from the equivalence of the RPM-labeled RNA granules in neurites and the ribosomes isolated by sedimentation. We will make this point clearer in our revision. For point #2, we agree that the changes with increased nuclease are somewhat out of place in a narrative sense, but they are clearly relevant to this work. Whether or not one sees this as a ‘correction’ or an interesting point will depend on a better characterization of the structures of the stalled polysomes. My personal view is that the nuclease resistance of cleavage near the RNA entrance site is quite interesting. Since we reproduce our results with a similar nuclease treatment in mice, as reported in our previous publication, I believe the comparison could be of interest in the future and would like to retain it. We agree with point #3 and will temper these claims in our revised version. For point #4, we will determine more carefully why the number of peaks differs and switch the main and supplemental figures. For point #5, we apologize for the typo in the figure legend of Figure 9: the number of neurites is 171, not 5,171. The box plot line shows the median, not the average, and the data are clearly skewed such that the median and average differ (i.e., there is a two-fold decrease in the average density of distal puncta between WT and FMRP, and the average density is actually slightly decreased with HHT and A, although the median increases slightly). We will now report the results in distinct modalities to clarify this, and we will reexamine the statistics to better address the skewed distribution of values in the revised version.
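
      The point about the median and the average moving in opposite directions in skewed data can be illustrated with a toy calculation. The numbers below are synthetic placeholders chosen only to show the effect; they are not values from the manuscript, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(values):
    """Return the mean and median of a list of densities."""
    return {"mean": statistics.mean(values), "median": statistics.median(values)}


# Synthetic, right-skewed "untreated" densities: most neurites have few
# puncta, while two neurites have very many.
untreated = [1, 1, 2, 2, 2, 3, 3, 20, 30]

# Synthetic "treated" densities: the bulk shifts up slightly, but the
# extreme values are gone.
treated = [2, 2, 3, 3, 3, 4, 4, 5, 6]

before = summarize(untreated)
after = summarize(treated)

# In this toy data the median rises while the mean falls.
median_rises = after["median"] > before["median"]
mean_falls = after["mean"] < before["mean"]
```

      Reporting both summaries, or using a test suited to skewed distributions, avoids the apparent contradiction between a box plot and an average.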

      Summary:

      Li et al describe a set of experiments to probe the role of FMRP in ribosome stalling and RNA granule composition. The authors are able to recapitulate findings from a previous study performed in rats (this one is in mice).

      Strengths:

      (1) The work addresses an important and challenging issue, investigating mechanisms that regulate stalled ribosomes, focusing on the role of FMRP. This is a complicated problem, given the heterogeneity of the granules and the challenges related to their purification. This work is a solid attempt at addressing this issue, which is widely understudied.

      (2) The interpretation of the results could be interesting, if supported by solid data. The idea that FMRP could control the formation and release of RNA granules, rather than the elongation by stalled ribosomes, is of high importance to the field, offering a fresh perspective on translational regulation by FMRP.

      (3) The authors focused on recapitulating previous findings, published elsewhere (Anadolu et al., 2023) by the same group, but using rat tissue, rather than mouse tissue. Overall, they succeeded in doing so, demonstrating, among other findings, that stalled ribosomes are enriched in consensus mRNA motifs that are linked to FMRP. These interesting findings reinforce the role of FMRP in formation and stabilization of RNA granules. It would be nice to see extensive characterization of the mouse granules as performed in Figure 1 of Anadolu and colleagues, 2023.

      (4) Some of the techniques incorporated aid in creating novel hypotheses, such as the ribopuromycylation assay and the cryo-EM of granule ribosomes.

      Weaknesses:

      (1) The RNA granule characterization needs to be more rigorous. Coomassie is not proper for this type of characterization, simply because protein weight says little about its nature. The enrichment of key proteins is not robust and seems to not reach significance in multiple instances, including S6 and UPF1. Furthermore, S6 is the only proxy used for ribosome quantification. Could the authors include at least 3 other ribosomal proteins (2 from small, 2 from large subunit)?

      (2) Page 12-13 - The Gene Ontology analysis is performed incorrectly. First, one should not rank genes by their RPKM levels. It is well known that housekeeping genes, such as those related to actin dynamics, molecular transport, and translation, are highly enriched in sequencing datasets. It is usually more informative when significantly different genes are ranked by adjusted p-value or log2 fold change, then compared against a background to verify enrichment of specific processes. However, the authors found no DEGs. I would suggest the removal of this analysis and the incorporation of a gene set enrichment analysis (ranked by adjusted p-value). I further suggest that the authors incorporate a dimensionality reduction analysis to demonstrate that the lack of significance stems from biology and not from experimental artifacts, such as poor reproducibility across biological replicates.
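
      As a rough illustration of the dimensionality reduction check requested above, the first principal component of a samples-by-genes matrix can be computed with a short power iteration. This is a minimal stdlib-only sketch under stated assumptions: `pc1_scores` is a hypothetical function name, the input is assumed to be already normalized (for example log-transformed) expression values, and in practice an established PCA implementation would be used instead.

```python
import math


def pc1_scores(samples, n_iter=200):
    """Return one PC1 score per sample.

    `samples` is a list of equal-length expression vectors (one per
    replicate). The sign of the scores is arbitrary, as in any PCA.
    """
    n, g = len(samples), len(samples[0])
    # Center each gene across samples.
    col_means = [sum(s[j] for s in samples) / n for j in range(g)]
    centered = [[s[j] - col_means[j] for j in range(g)] for s in samples]
    # Sample-by-sample Gram matrix; its top eigenvector gives PC1 scores.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered]
            for r1 in centered]
    v = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(n_iter):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left to explain
            return [0.0] * n
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)
    )
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [x * scale for x in v]
```

      If biological replicates of the same genotype cluster tightly along the leading components, a lack of differentially expressed genes is more plausibly biological; if replicates scatter as widely as genotypes do, poor reproducibility becomes the more likely explanation.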

      Thank you for your comments on the strengths of the manuscript. We agree with point #1 that the mouse RNA granule characterization needs to be more rigorous and we plan to accomplish this in our revised version. Similarly, we will incorporate the additional statistical analysis suggested by the reviewer in a revised version.

    1. eLife Assessment

      In this study, the authors investigate the role of ZMAT3, a p53 target gene, in tumor suppression and RNA splicing regulation. Using quantitative proteomics, the authors uncover that ZMAT3 knockout leads to upregulation of HKDC1, a gene linked to mitochondrial respiration, and that ZMAT3 suppresses HKDC1 expression by inhibiting c-JUN-mediated transcription. This set of convincing evidence reveals a fundamental mechanism by which ZMAT3 contributes to p53-driven tumor suppression by regulating mitochondrial respiration.

    2. Reviewer #1 (Public review):

      Summary:

      ZMAT3 is a p53 target gene that the Lal group and others have shown is important for p53-mediated tumor suppression, and which plays a role in the control of RNA splicing. In this manuscript, Lal and colleagues perform quantitative proteomics of cells with ZMAT3 knockout and show that the enzyme hexokinase HKDC1 is the most upregulated protein. Mechanistically, the authors show that ZMAT3 does not appear to directly regulate the expression of HKDC1; rather, they show that the transcription factor c-JUN was strongly enriched in ZMAT3 pull-downs in IP-mass spec experiments, and they perform IP-western to demonstrate an interaction between c-JUN and ZMAT3. Importantly, the authors demonstrate, using ChIP-qPCR, that JUN is present at the HKDC1 gene (intron 1) in ZMAT3 WT cells and shows markedly enhanced binding in ZMAT3 KO cells. The data best fit a model whereby p53 transactivates ZMAT3, leading to decreased JUN binding to the HKDC1 promoter, and altered mitochondrial respiration.

      Strengths:

      The authors use multiple orthogonal approaches to test the majority of their findings.

      The authors offer a potentially new activity of ZMAT3 in tumor suppression by p53: the control of mitochondrial respiration.

      Weaknesses:

      Some indication as to whether other c-JUN target genes are also regulated by ZMAT3 would improve the broad relevance of the authors' findings.

    3. Reviewer #2 (Public review):

      Summary:

      The study elucidates the role of the recently discovered mediator of p53 tumor suppressive activity, ZMAT3. Specifically, the authors find that ZMAT3 negatively regulates HKDC1, a gene involved in the control of mitochondrial respiration and cell proliferation.

      Strengths:

      Mechanistically, ZMAT3 suppresses HKDC1 transcription by sequestering JUN and preventing its binding to the HKDC1 promoter, resulting in reduced HKDC1 expression. Conversely, p53 mutation leads to ZMAT3 downregulation and HKDC1 overexpression, thereby promoting increased mitochondrial respiration and proliferation. This mechanism is novel; however, the authors should address several points.

      Weaknesses:

      The authors conduct mechanistic experiments (e.g., transcript and protein quantification, luciferase assays) to demonstrate regulatory interactions between p53, ZMAT3, JUN, and HKDC1. These findings should be supported with functional assays, such as proliferation, apoptosis, or mitochondrial respiration analyses.

    4. Reviewer #3 (Public review):

      Summary:

      In their manuscript, Kumar et al. investigate the mechanisms underlying the tumor suppressive function of the RNA binding protein ZMAT3, a previously described tumor suppressor in the p53 pathway. To this end, they use RNA-sequencing and proteomics to characterize changes in ZMAT3-deficient cells, leading them to identify the hexokinase HKDC1 as upregulated with ZMAT3 deficiency first in colorectal cancer cells, then in other cell types of both mouse and human origin. This increase in HKDC1 is associated with increased mitochondrial respiration. As ZMAT3 has been reported as an RNA-binding and DNA-binding protein, the authors investigated this via PAR-CLIP and ChIP-seq but did not observe ZMAT3 binding to HKDC1 pre-mRNA or DNA. Thus, to better understand how ZMAT3 regulates HKDC1, the authors used quantitative proteomics to identify ZMAT3-interacting proteins. They identified the transcription factor JUN as a ZMAT3-interacting protein and showed that JUN promotes the increased HKDC1 RNA expression seen with ZMAT3 inactivation. They propose that ZMAT3 inhibits JUN-mediated transcriptional induction of HKDC1 as a mechanism of tumor suppression. This work uncovers novel aspects of the p53 tumor suppressor pathway.

      Strengths:

      This novel work sheds light on one of the most well-established yet understudied p53 target genes, ZMAT3, and how it contributes to p53's tumor suppressive functions. Overall, this story establishes a p53-ZMAT3-HKDC1 tumor suppressive axis, which has been strongly substantiated using a variety of orthogonal approaches, in different cell lines and with different data sets.

      Weaknesses:

      While the role of p53 and ZMAT3 in repressing HKDC1 is well substantiated, there is a gap in understanding how ZMAT3 acts to repress JUN-driven activation of the HKDC1 locus. How does ZMAT3 inhibit JUN binding to HKDC1? Can targeted ChIP experiments or RIP experiments be used to make a more definitive model? Can ZMAT3 mutants help to understand the mechanisms? Future work can further establish the mechanisms underlying how ZMAT3 represses JUN activity.

    1. eLife Assessment

      In their study, Neiswender et al. provide important insights into how BicD2 variants linked to spinal muscular atrophy alter dynein activity and cargo specificity. While the findings suggest disease-relevant changes in BicD2's binding partners, the evidence connecting these changes to disease mechanisms remains incomplete and would benefit from further experimental validation. The work lays a strong foundation for future research, but could be strengthened by deeper functional analysis of key interactions, such as the BicD2/HOPS complex.

    2. Reviewer #1 (Public review):

      In this work, Neiswender and colleagues test the hypothesis that mutations in BicD2 that are associated with SMALED alter BicD2-cargo interactions. To do this, they first establish the WT BicD2 cargo interactome (using a proximity-dependent biotin ligase screen with Turbo-ID on the BicD2 C-terminus). In addition to known cargo interactors, they also identified many proteins in the HOPS complex. Interestingly, they find that the HOPS complex may interact with BicD2 in a different manner than other known cargos. The authors also show that while BicD2 is required for HOPS complex localization, on average, depletion of BicD2 from HeLa and Cos7 cells causes HOPS and lysosome mislocalization that is consistent with Kinesin-1 trafficking defects, rather than dynein. The authors also use proximity biotin ligase approaches to define the cargo interactome of three BicD2 variants associated with SMALED. One variant (R747C) has the most altered cargo interactome. The authors highlight one protein in particular, GRAMD1A, that is only found in the R747C dataset and mislocalizes specifically when R747C is expressed.

      The work in this manuscript is of a very high quality and contributes important findings to the field. I have a few questions that, if answered, could increase the impact of this work.

      (1) I was surprised at the effect of BicD2 knockdown on LAMP (and VPS41) localization, which really suggests that in HeLa and Cos7 cells, BicD2 regulation of Kinesin-1 (rather than dynein) is the primary driver of lysosome localization. The KIF5B-knockout rescue of the BicD2-overexpression phenotype was a very powerful result that supports this conclusion. Have the authors looked at other cargos, eg, Golgi or centrosomes in G2? Can the authors include more discussion about what this result means or how they imagine dynein and kinesin-1's interaction with BicD2 is regulated?

      (2) Have the authors examined if the SMALED mutants show diminished or increased binding to KIF5B? While the authors are correct that the mutations could hyperactivate dynein because they reduce BicD2 autoinhibition, it is possible that the SMALED mutants hyperactivate dynein because they no longer bind kinesin. This would be particularly interesting, given the complex relationship between BicD2 regulation of dynein and kinesin that the authors show in Figure 3.

      (3) What is already known about the protein GRAMD1A? Did the authors choose to focus on GRAMD1A because it was the only novel interaction found in the SMALED mutant interactomes, or was this protein interesting for a different reason? Does the known function of GRAMD1A explain the potential dysfunction of cells expressing BICD2_R747C or patients who have this mutation? More discussion of this protein and why the authors focused on it would really strengthen the manuscript.

    3. Reviewer #2 (Public review):

      Neiswender et al. investigated the interactomes between wild-type BICD2 and BICD2 mutants that are associated with Spinal Muscular Atrophy with Lower Extremity Predominance (SMALED2). Although BICD2 has previously been implicated in SMALED2, it is unclear how mutations in BICD2 may contribute to disease symptoms. In this study, the authors characterize the interactome of wild-type BICD2 and identify potential new cargos, including the HOPS complex. The authors then chose three SMALED2-associated BICD2 mutants and compared each mutant interactome to that of wild-type BICD2. Each mutant had a change in the interactome, with the most drastic being BICD2_R747C, a mutation in the cargo binding domain of BICD2. This mutant displayed less interaction with a potential new BICD2 cargo, the HOPS complex. Additionally, it displayed more interaction with an ER protein, GRAMD1A.

      The data in the paper is generally strong, but the major conclusions of this paper need more evidence to be better supported.

      (1) The authors use cells that have been engineered to express the different BICD2 constructs. As shown in Figure 4B, the authors see wide expression of BICD2_WT throughout the cell. However, WT BICD2 usually localizes to the TGN. This widespread localization introduces some uncertainty about the interactome data. The authors should either try to verify the interaction data (specifically with the HOPS complex and GRAMD1A) by immunoprecipitating endogenous BICD2 or by repeating their interactome experiment in Figure 1 using BICD2 knockout cells that express the BICD2_WT construct. This should also be done to verify the immunoprecipitation and microscopy data shown in Figure 7.

      (2) The authors conclude that cargo transport defects resulting from BICD2 mutations may contribute to SMALED2 symptoms. However, the authors are unable to determine if BICD2 directly binds to the potential new cargo, the HOPS complex. To address this, the authors could purify full-length WT BICD2 and perform in vitro experiments. Furthermore, the authors were unable to identify the minimal region of BICD2 needed for HOPS interaction. The authors could expand on the experiment attempted with the extended BICD2 C-terminal using a deltaCC1 construct, which could also be used for in vitro experiments.

      (3) Again, the authors conclude that BICD2 mutants cause cargo transport defects that are likely to lead to SMALED2 symptoms. This would be better supported if the authors are able to find a protein relevant to SMALED2 and examine if/how its localization is changed under expression of the BICD2 mutants. The authors currently use the HOPS complex and GRAMD1A as indicators of cargo transport defects, but it is unclear if these are relevant to SMALED2 symptoms.

    4. Reviewer #3 (Public review):

      Summary:

      BicD2 is a motor adapter protein that facilitates cellular transport pathways, which are impacted by human disease mutations of BicD2, causing spinal muscular atrophy with lower extremity predominance (SMALED2). The authors provide evidence that some of these mutations result in interactome changes, which may be the underlying cause of the disease. This is supported by proximity biotin ligation screens, immunoprecipitation, and cell biology assays. The authors identify several novel BicD2 interactions, such as the HOPS complex that participates in the fusion of late endosomes and autophagosomes with lysosomes, which could have important functions. Three BicD2 disease mutants studied had changes in the interactome, which could be an underlying cause for SMALED2. The study extends our understanding of the BicD2 interactome under physiological conditions, as well as of the changes in cellular transport pathways that result in SMALED2. It will be of great interest for the BicD2 and dynein fields.

      Strengths:

      Extensive interactomes are presented for both WT BicD2 and the disease mutants, which will be valuable for the community. The HOPS complex, which is important for fusion of late endosomes and lysosomes, was identified as a novel interactor of BicD2. This is of interest, since some of the BicD2 disease mutations result in Golgi-fragmentation phenotypes. The interaction with the HOPS complex is affected by the R747C mutation, which also results in a gain-of-function interaction with GRAMD1A.

      Weaknesses:

      The manuscript should be strengthened by further evidence of the BicD2/HOPS complex interaction and the functional implications for spinal muscular atrophy by changes in the interactome through mutations. Which functional implications does the loss of the BicD2/HOPS complex interaction and the gain of function interaction with GRAMD1A have in the context of the R747C mutant?

      Major points:

      (1) In the biotin proximity ligation assay, a large number of targets were identified, but it is not clear why only the HOPS complex was chosen for further verification. Immunoprecipitation was used for target verification, but due to the very high number of targets identified in the screen, and the fact that the HOPS complex is a membrane protein that could potentially be immunoprecipitated along with lysosomes or dynein, additional experiments to verify the interaction of BicD2 with the HOPS complex (reconstitution of a complex in vitro, GST-pull down of a complex from cell extracts or other approaches) are needed to strengthen the manuscript.

      (2) In the biotin proximity ligation assay, a large number of BicD2 interactions were identified that are distinct between the mutant and the WT, but it was not clear why GRAMD1A in particular was chosen as a gain-of-function interaction, and what the functional role of a BicD2/GRAMD1A interaction may be. A Western blot shows a strengthened interaction with the R747C mutant, but GRAMD1A also interacts with WT BicD2.

      (3) Furthermore, the functional implications of changed interactions with HOPS and GRAMD1A in the R747C mutant are unclear. Additional experiments are needed to establish the functional implication of the loss of the BicD2/HOPS interaction in the BicD2/R747C mutant. For the GRAMD1A gain of function interaction, according to the authors, a significant amount of the protein localized with BicD2/R747C at the centrosomal region. This changed localization is not very clear from the presented images (no centrosomal or other markers were used, and the changed localization could also be an effect of dynein hyperactivation in the mutant). Furthermore, the functional implication of a changed localization of GRAMD1A is unclear from the presented data.

    1. eLife Assessment

      This valuable study identifies asymmetric dimethylarginine (ADMA) histones as potential determinants of the initial genomic binding of Rhino, a Drosophila-specific chromatin protein essential for piRNA cluster specification. The authors provide correlative genomic and imaging data to support their model, although functional validation of the proposed mechanism remains incomplete. The authors could revise the manuscript to reflect that they have uncovered a small subset of piRNA clusters dependent on ADMA-histones, which may not be the general rule.

    2. Reviewer #1 (Public review):

      Summary:

      In this study, the authors aim to understand how Rhino, a chromatin protein essential for small RNA production in fruit flies, is initially recruited to specific regions of the genome. They propose that asymmetric arginine methylation of histones, particularly mediated by the enzyme DART4, plays a key role in defining the first genomic sites of Rhino localization. Using a combination of inducible expression systems, chromatin immunoprecipitation, and genetic knockdowns, the authors identify a new class of Rhino-bound loci, termed DART4 clusters, that may represent nascent or transitional piRNA clusters.

      Strengths:

      One of the main strengths of this work lies in its comprehensive use of genomic data to reveal a correlation between ADMA histones and Rhino enrichment at the border of known piRNA clusters. The use of both cultured cells and ovaries adds robustness to this observation. The knockdown of DART4 supports a role for H3R17me2a in shaping Rhino binding at a subset of genomic regions.

      Weaknesses:

      However, Rhino binding at, and piRNA production from, canonical piRNA clusters appears largely unaffected by DART4 depletion, and spreading of Rhino from ADMA-rich boundaries was not directly demonstrated. Therefore, while the correlation is clearly documented, further investigation would be needed to determine the functional requirement of these histone marks in piRNA cluster specification.

      The study identifies piRNA cluster-like regions called DART4 clusters. While the model proposes that DART4 clusters represent evolutionary precursors of mature piRNA clusters, the functional output of these clusters remains limited. Additional experiments could help clarify whether low-level piRNA production from these loci is sufficient to guide Piwi-dependent silencing.

      In summary, the authors present a well-executed study that raises intriguing hypotheses about the early chromatin context of piRNA cluster formation. The work will be of interest to researchers studying genome regulation, small RNA pathways, and the chromatin mechanisms of transposon control. It provides useful resources and new candidate loci for follow-up studies, while also highlighting the need for further functional validation to fully support the proposed model.

    3. Reviewer #2 (Public review):

      This study seeks to understand how the Rhino factor knows how to localize to specific transposon loci and to specific piRNA clusters to direct the correct formation of specialized heterochromatin that promotes piRNA biogenesis in the fly germline. In particular, these dual-strand piRNA clusters with names like 42AB, 38C, 80F, and 102F generate the bulk of ovarian piRNAs in the nurse cells of the fly ovary, but the evolutionary significance of these dual-strand piRNA clusters remains mysterious since triple null mutants of these dual-strand piRNA clusters still allow fly ovaries to develop and remain fertile. Nevertheless, mutants of Rhino and its interactors Deadlock, Cutoff, Kipferl, Moonshiner, etc., cause more piRNA loss beyond these dual-strand clusters and exhibit the phenotype of major female infertility, so the impact of proper assembly of Rhino, the RDC, Kipferl, etc. onto proper piRNA chromatin is an important and interesting biological question that is not fully understood.

      This study first tests ectopic expression of Rhino by engineering a Dox-inducible Rhino transgene in the OSC line, which only expresses the primary Piwi pathway. This reflects the natural single-pathway expression in the follicle cells and is quite distinct from the nurse cell germline piRNA pathway that is promoted by Rhino, Moonshiner, etc. The authors present some compelling evidence that this ectopic Rhino expression in OSCs may reveal how Rhino can initiate de novo binding via ADMA histone marks, a feat that would be much more challenging to demonstrate in the germline, where this epigenetically naïve state cannot be modeled since germ cell collapse would likely ensue. In the OSC, the authors have tested the knockdown of four of the 11 known Drosophila PRMTs (DARTs), and by comparing to ectopic Rhino foci that they observe in HP1a knockdown (KD), they conclude that DART1 and DART4 are the prime factors to study further in looking for disruption of ADMA histone marks. The authors also test KD of DART8 and CG17726 in OSCs, but in the fly, the authors test germline KD (GLKD) of DART4 only. They do not explain why these other DARTs are not tested in GLKD; the UAS-RNAi resources in Drosophila strain repositories should be very complete and have reagents that make these knockdowns accessible.

      The authors only characterize some particular ADMA marks of H3R17me2a as showing strong decrease after DART4 GLKD, and then they see some small subset of piRNA clusters go down in piRNA production as shown in Figure 6B and Figure 6F and Supplementary Figure 7. This small subset of DART4-dependent piRNA clusters does lose Rhino and Kipferl recruitment, which is an interesting result.

      However, the biggest issue with this study is the mystery that the set of the most prominent dual-strand piRNA clusters, 42AB, 38C, 80F, and 102F, are the prime genomic loci subjected to Rhino regulation, and they do not show any change in piRNA production in the GLKD of DART4. The authors bury this surprising negative result in Supplementary Figure 5E, but this is also evident in no decrease (actually an n.s. increase) in Rhino association in Figure 5D. Since these main piRNA clusters involve the RDC, Kipferl, Moonshiner, etc., and they show neither a change in ADMA status nor piRNA loss after DART4 GLKD, this poses a problem with the model in Figure 7C. In this study, there is only a GLKD of DART4 and no GLKD of the other DARTs in fly ovaries.

      One way the authors rationalize this peculiar exception is the argument that DART4 is only acting on evolutionarily "young" piRNA clusters like the bx, CG14629, and CG31612, but the lack of any change on the majority of other piRNA clusters in Figure 6F leaves open the unsatisfying concern that there is much functional redundancy remaining with other DARTs not being tested by GLKD in the fly that would have a bigger impact on the other main dual-strand piRNA clusters being regulated by Rhino and ADMA-histone marks.

      Also, the current data do not provide convincing enough support for the model in Figure 7C and the paper title of ADMA-histones being the key determinant in the fly ovary for Rhino recognition of the dual-strand piRNA clusters. Although much of this study's data is well constructed and presented, there remains a large gap in that no other DARTs were tested in GLKD that would show a big loss of piRNAs from the main dual-strand piRNA clusters of 42AB, 38C, 80F, and 102F, where Rhino has prominent spreading in these regions.

      As the manuscript currently stands, I do not think the authors present enough data to conclude that "ADMA-histones [As a Major new histone mark class] does play a crucial role in the initial recognition of dual-strand piRNA cluster regions by Rhino" because the data here mainly just show that a small subset of evolutionarily young piRNA clusters have a strong effect from GLKD of DART4. The authors could extensively revise the study to be much more specific in the title and conclusion that they have uncovered this very unique niche of a small subset of DART4-dependent piRNA clusters, but this niche finding may dampen the impact and significance of this study since other major dual-strand piRNA clusters do not change during DART4 GLKD, and the authors do not show data for GLKD of any other DARTs. The niche finding of just a small subset of DART4-dependent piRNA clusters might make another specialized genetics forum a more appropriate venue.

    1. eLife Assessment

      This is a useful study on the role of CHI3L1 in Kupffer cells, the macrophages of the liver, showing that CHI3L1 alters glucose regulation in obesity. Specifically, Chi3l1 protects glucose-dependent Kupffer cells during metabolic dysfunction-associated steatotic liver disease (MASLD) by inhibiting glucose uptake, preventing metabolic stress and death. These data are compelling, yet require further validation.

    2. Reviewer #1 (Public review):

      The manuscript by Shan et al. seeks to define the role of the CHI3L1 protein in macrophages during the progression of MASH. The authors argue that the Chil1 gene is expressed highly in hepatic macrophages. Subsequently, they use Chil1 flx mice crossed to Clec4F-Cre or LysM-Cre to assess the role of this factor in the progression of MASH using a high-fat, high-fructose diet (HFFC). They found that loss of Chil1 in KCs (Clec4F Cre) leads to enhanced KC death and worsened hepatic steatosis. Using scRNA seq, they also provide evidence that loss of this factor promotes gene programs related to cell death. From a mechanistic perspective, they provide evidence that CHI3L1 serves as a glucose sink and thus loss of this molecule enhances macrophage glucose uptake and susceptibility to cell death. Using a bone marrow macrophage system and KCs, they demonstrate that cell death induced by palmitic acid is attenuated by the addition of rCHI3L1. While the article is well written and potentially highlights a new mechanism of macrophage dysfunction in MASH, there are some concerns about the current data that limit my enthusiasm for the study in its current form. Please see my specific comments below.

      Major:

      (1) The authors' interpretation of the results from the KC (Clec4F) and MdM KO (LysM-Cre) experiments is flawed. For example, in Figure 2 the authors present data that knockout of Chil1 in KCs using Clec4f Cre produces worse liver steatosis and insulin resistance. However, in supplemental Figure 4, they perform the same experiment in LysM-Cre mice and find a somewhat different phenotype. The authors appear to be under the impression that LysM-Cre does not cause recombination in KCs and therefore interpret this data to mean that Chil1 is relevant in KCs and not MdMs. However, LysM-Cre DOES lead to efficient recombination in KCs and therefore Chil1 expression will be decreased in both KCs and MdM (along with PMNs) in this line.

      Therefore, a phenotype observed with KC-KO should also be present in this model unless the authors argue that loss of Chil1 from the MdMs has the opposite phenotype of KCs and therefore attenuates the phenotype. The Cx3Cr1 CreER tamoxifen inducible system is currently the only macrophage Cre strategy that will avoid KC recombination. The authors need to rethink their results with the understanding that Chil1 is deleted from KCs in the LysM-Cre experiment. In addition, it appears that only one experiment was performed, with only 5 mice in each group for both the Clec4f and LysM-Cre data. This is generally not enough to make a firm conclusion for MASH diet experiments.

      (2) The mouse weight gain is missing from Figure 2 and Supplementary Figure 4. This data is critical to interpret the changes in liver pathology, especially since they have worse insulin resistance.

      (3) Figure 4 suggests that KC death is increased with KO of Chil1. However, this conclusion cannot be drawn from the plots shown. In Supplementary Figure 6 the authors provide a more appropriate gating scheme to quantify resident KCs that includes TIM4. The TIM4 data need to be shown and quantified in Figure 4. As shown in Supplementary Figure 6, the F4/80 hi population is predominantly KCs at baseline; however, this is not true with MASH diets. Most of the recruited MoMFs also reside in the F4/80 hi gate, where they can be identified by their lower expression of TIM4. The MoMF gate shown in this figure is incorrect. The CD11b hi population is predominantly PMNs, monocytes, and cDC2, not MoMFs (PMID:33997821). In addition, the authors should stain the tissue for TIM4, which would also be expected to reveal a decrease in the number of resident KCs.

      (4) While the Clec4F-Cre is specific to KCs, there is also less data about the impact of the Cre system on KC biology. Therefore, when looking at cell death, the authors need to include some mice that express Clec4F-Cre without the floxed allele to rule out any effects of the Cre itself. In addition, if the cell death phenotype is real, it should also be present in the LysM-Cre system for the reasons described above. Therefore, the authors should quantify the KC number and dying KCs in this mouse line as well.

      (5) I am somewhat concerned about the conclusion that Chil1 is highly expressed in liver macrophages. Looking at our own data and those from the Liver Atlas it appears that this gene is primarily expressed in neutrophils. At a minimum, the authors should address the expression of Chil1 in macrophage populations from other publicly available datasets in mouse MASH to validate their findings (several options include - PMID: 33440159, 32888418, 32362324). If expression of Chil1 is not present in these other data sets, perhaps an environmental/microbiome difference may account for the distinct expression pattern observed. Either way, it is important to address this issue.

    3. Reviewer #2 (Public review):

      The manuscript from Shan et al. sets out to investigate the role of Chi3l1 in different hepatic macrophage subsets (KCs and moMFs) in MASLD following their identification that KCs highly express this gene. To this end, they utilise Chi3l1KO, Clec4f-CrexChi3l1fl, and Lyz2-CrexChi3l1fl mice and WT controls fed an HFHC diet for different periods of time.

      Firstly, the authors perform scRNA-seq, which led to the identification of Chi3l1 (encoded by Chil1) in macrophages. However, this is on a limited number of cells (especially in the HFHC context), and hence it would also be important to validate this finding in other publicly available MASLD/Fibrosis scRNA-seq datasets. Similarly, it would be important to examine if cells other than monocytes/macrophages also express this gene, given the use of the full KO in the manuscript. Along these lines, utilisation of publicly available human MASLD scRNA-seq datasets would also be important to understand where the increased expression observed in patients comes from and the overall relevance of macrophages in this finding.

      Next, the authors use two different Cre lines (Clec4f-Cre and Lyz2-Cre) to target KCs and moMFs respectively. However, no evidence is provided to demonstrate that Chil1 is only deleted from the respective cells in the two Cre lines. Thus, KCs and moMFs should be sorted from both lines, and a qPCR performed to check the deletion of Chil1. This is especially important for the Lyz2-Cre, which has been routinely used in the literature to target KCs (as well as moMFs) and has (at least partial) penetrance in KCs (depending on the gene to be floxed). Also, while the Clec4f-Cre mice show an exacerbated MASLD phenotype, there is currently no baseline phenotype of these animals (or the Lyz2-Cre) in steady state in relation to the same readouts provided in MASLD and the macrophage compartment. This is critical to understand if the phenotype is MASLD-specific or if loss of Chi3l1 already affects the macrophages under homeostatic conditions.

      Next, the authors suggest that loss of Chi3l1 promotes KC death. However, to examine this, they use Chi3l1 full KO mice instead of the Clec4f-Cre line. The reason for this is not clear, and as a result it is unclear whether the effects are due to loss of Chi3l1 from KCs or from other hepatic cells (see point above). The authors mention that Chi3l1 is a secreted protein, so does this mean other cells are also secreting it, and are these needed for KC death? In that case, this would not explain the phenotype in the CLEC4F-Cre mice. Here, the authors do perform a basic immunophenotyping of the macrophage populations; however, the markers used are outdated, making it difficult to interpret the findings. Instead of F4/80 and CD11b, which do not allow a perfect discrimination of KCs and moMFs, especially in HFHC diet-fed mice, more robust and specific markers of KCs should be used, including CLEC4F, VSIG4, and TIM4.

      Additionally, while the authors report a reduction of KCs in terms of absolute numbers, there are no differences in proportions. This, coupled with a decrease also in moMF numbers at 16 weeks (when one would expect an increase if KCs are decreased, based on previous literature), suggests that the differences in KC numbers may be due to differences in total cell counts obtained from the obese livers compared with controls. To rule this out, total cell counts and total live CD45+ cell counts should be provided. Here, the authors also provide TUNEL staining in situ to demonstrate increased KC death, but as it is notoriously difficult to visualise dying KCs in MASLD models, it would be important to provide more images. Similarly, there appear to be many more TUNEL+ cells in the KO that are not KCs; thus, it would be important to examine this in the CLEC4F-Cre line to ascertain direct versus indirect effects on cell survival.

      Finally, the authors suggest that Chi3l1 exerts its effects through binding glucose and preventing its uptake. They use ex vivo/in vitro models to assess this with rChi3l1; however, the key in vivo experiment is missing: using the CLEC4F-Cre mice to prove that this mechanism in KCs is sufficient for the phenotype. This is critical to confirm the take-home message of the manuscript.

    4. Reviewer #3 (Public review):

      This paper investigates the role of Chi3l1 in regulating the fate of liver macrophages in the context of metabolic dysfunction leading to the development of MASLD. I do see value in this work, but some issues exist that should be addressed as well as possible.

      Here are my comments:

      (1) Chi3l1 has been linked to macrophage functions in MASLD/MASH, acute liver injury, and fibrosis models before (e.g., PMID: 37166517), which limits the novelty of the current work. It has even been linked to macrophage cell death/survival (PMID: 31250532) in the context of fibrosis, which is a main observation from the current study.

      (2) The LysCre experiments differ from experiments conducted by Ariel Feldstein's team (PMID: 37166517). What is the explanation for this difference? The LysCre system is neither specific to macrophages (it also depletes in neutrophils, etc.), nor is this system necessarily efficient in all myeloid cells (e.g., Kupffer cells vs other macrophages). The authors need to show the efficacy and specificity of the conditional KO regarding Chi3l1 in the different myeloid populations in the liver and the circulation.

      (3) The conclusions are exclusively based on one MASLD model. I recommend confirming the key findings in a second, ideally a more fibrotic, MASH model.

      (4) Very few human data are being provided (e.g., no work with own human liver samples, work with primary human cells). Thus, the translational relevance of the observations remains unclear.

    1. eLife Assessment

      This study provides valuable insights into a new toxin-antidote element in C. elegans, the first naturally occurring unlinked toxin-antidote system where endogenous small RNA pathways post-transcriptionally suppress the toxin. The strength of evidence is solid, using a combination of genomic and experimental methods. Enthusiasm, however, is tempered by its reliance on meta-analysis of existing data sets and limited experimental evaluation.

    2. Reviewer #1 (Public review):

      Summary:

      The article by Zdraljevic et al. reports the discovery of a third toxin-antidote (TA) element in C. elegans, composed of the genes mll-1 (toxin) and smll-1 (antidote). Unlike previously characterized TA systems in C. elegans, this element induces larval arrest rather than embryonic lethality. The study identifies three distinct haplotypes at the TA locus, including a hyper-divergent version in the standard laboratory strain N2, which retains a functional toxin but lacks a functional antidote. The authors propose that small RNA-mediated silencing mechanisms, dependent on MUT-16 and PRG-1, suppress the toxicity of the divergent toxin allele. This work provides insights into the evolutionary dynamics of TA elements and their regulation through RNA interference (RNAi).

      Overall, there are many things to like about this paper and only a few small quibbles, which will not require more than a little rewriting or relatively minor analyses.

      Strengths:

      (1) The discovery of a maternally deposited TA element with delayed toxicity due to delayed mRNA translation of the maternally deposited toxin mRNA is a significant addition to the literature on selfish genetic elements in metazoans.

      (2) Identifying three haplotypes at the TA locus provides a snapshot of potential evolutionary trajectories for these elements, which are often inferred but rarely demonstrated in naturally occurring strains. The genomic analysis of 550 wild isolates contextualizes the findings within natural populations, revealing geographic clustering and evolutionary pressures acting on the TA locus.

      (3) The study employs various techniques, including CRISPR/Cas9 knockouts, FISH, long-read RNA sequencing, and population genomics. The use of inducible systems to confirm toxicity and antidote functionality is particularly robust. This multifaceted approach strengthens the validity of the findings.

      (4) The authors provide compelling evidence that small RNA pathways suppress toxin activity in strains lacking a functional antidote. This highlights an alternative mechanism for neutralizing selfish genetic elements.

      Weaknesses:

      (1) The introduction focuses strongly (for good reason) on bacterial TA systems and then jumps to TA systems in C. elegans. It's unclear why TA systems in other eukaryotes are not discussed.

      (2) Similarly, there is a missed opportunity to discuss an analogy between the suppressor mechanism discovered here and the hairpin RNA suppressors of meiotic drive identified by Eric Lai and colleagues. Discussing these will provide a fuller context of the present study's findings and will not affect their novelty.

      (3) While the evidence for RNAi-mediated suppression is strong, the claim that positive selection drove diversification at piRNA binding sites requires further discussion and clarification. The elevated dN and dS are unusual (how unusual relative to other genes in the vicinity? What is hyper-divergent statistically speaking?), but there is no a priori reason that there would be selection on piRNA binding sites within the mll-1 transcript to facilitate its recognition by endogenous RNAi machinery; what is the selective pressure for mll-1 to do so? Most TA systems would like to avoid being suppressed by the host. One cannot make the argument that this was motivated by the loss of the antidote because the loss of the antidote would be instantly suicidal, so the cadence of events described requiring hypermutation of the mll-1 transcript does not work.

    3. Reviewer #2 (Public review):

      Summary:

      In the manuscript by Walter-McNeill, Kruglyak, and team, the authors provide solid evidence of another toxin-antidote (TA) system in C. elegans. Generally, TA systems involve selfish and linked genetic elements, one encoding a toxin that kills progeny inheriting it, unless an antidote (the second element) is also present. Currently, only two TA systems have been characterized in this species, pointing to the importance of identifying new instances of such systems to understand their transmission dynamics, prevalence, and functions in shaping worm populations.

      Strengths:

      This novel TA system (mll-1/smll-1) was identified on LGV in wild C. elegans isolates from the Hawaiian islands, by crossing divergent strains and observing allele frequency distortions by high-throughput genome sequencing after 10 generations. These allele frequency distortions were subsequently confirmed in another set of crosses with a separate divergent strain, and crosses of heterozygous males or hermaphrodites resulted in a pattern of L1 lethality in progeny (with a rod arrest phenotype) that suggested the maternal transmission of this TA system from the XZ1516 genetic background. By elegantly combining the use of near-isogenic lines, CRISPR editing to generate knock-outs, and a transgene rescue of the antidote gene, the authors identified the genes encoding the toxin and the antidote, which they refer to as mll-1 and smll-1. Moreover, the specific mll-1 isoform responsible for the production of the toxin was identified and mll-1 transcripts were observed by FISH in early and late embryos, as well as in larvae. Inducible expression of the toxin in various strains resulted in larval arrest and rod phenotypes. The authors then characterized the genetic variation of 550 wild isolates at the toxin/antidote region on LGV and distinguished three clades: (1) one with the conserved TA system, (2) one having lost the toxin and retaining a mostly functional antidote, and (3) one having lost the antidote and retaining a divergent yet coding toxin (this includes the reference strain Bristol N2, in which the homologous toxin gene has acquired mutations and is known as B0250.8). Further, the authors show that this region is under positive selection. These data are compelling and provide very strong evidence of a new TA system in this species.

      Weaknesses:

      The question remained as to how one clade, including N2, could retain the toxin gene but not possess a functional antidote. In the second part of the manuscript, the authors hypothesized that small RNA targeting (RNAi) of the toxin transcript could provide the necessary repression to allow worms to survive without the antidote. Through a meta-analysis of multiple small RNA datasets from the literature, the authors found evidence to support this idea, in which the toxin transcript is targeted by 22G siRNAs whose biogenesis is dependent on the Mutator foci protein, MUT-16. They note that from previous studies, mut-16 null mutants displayed a varied penetrance of larval arrest. In their own hands, mut-16 mutants displayed 15% varied larval arrest and 2% rod phenotypes. In an attempt to link B0250.8 to mut-16/siRNAs, they made a double mutant and examined body length as a proxy for developmental stage. Here, they observed a partial rescue of the mut-16 size defect by B0250.8 mutation. Finally, the authors also highlight data from further meta-analysis, which predicts the recognition of B0250.8 by several piRNAs. Also based on existing data from the literature, the authors link loss of Piwi (PRG-1), which binds piRNAs, to a depletion of 22G-RNAs targeting B0250.8 and an upregulation of B0250.8 expression in gonads, suggesting that piRNAs are the primary small RNAs that target B0250.8 for downregulation. The data in this portion of the manuscript are intriguing, but somewhat preliminary and incomplete, as they are based on little primary experimentation and a collection of different datasets (which have been acquired by slightly different methods in most cases). This portion of the study would require subsequent experimentation to firmly establish this mechanistic link. 
      For example, to be able to claim that "the N2 toxin allele has acquired mutations that enable piRNA binding to initiate MUT-16-dependent 22G small RNA amplification that targets the transcript for degradation", the identified piRNA sites should be mutated and protein and transcript levels analysed in wild-type and in the strain with mutated piRNA sites. At a minimum, the protein levels in wild-type and mut-16, prg-1, and/or wago-1 mutants should be measured by western blot and/or by live imaging (introducing a GFP or some other tag to the endogenous protein via CRISPR editing) to show that the toxin does not accumulate as a protein in wild-type, but increases in levels in these mutants. mRNA levels in Figure S5A suggest there is still some expression of the B0250.8 transcript in a wild-type situation.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors report a study on how stimulation of receptive-field surround of V1 and LGN neurons affects their firing rates. Specifically, they examine stimuli in which a grey patch covers the classical RF of the cell and a stimulus appears in the surround. Using a number of different stimulus paradigms they find a long latency response in V1 (but not the LGN) which does not depend strongly on the characteristics of the surround grating (drifting vs static, continuous vs discontinuous, predictable grating vs unpredictable pink noise). They find that population responses to simple achromatic stimuli have a different structure that does not distinguish so clearly between the grey patch and other conditions and the latency of the response was similar regardless of whether the center or surround was stimulated by the achromatic surface. Taken together, they propose that the surround-response is related to the representation of the grey surface itself. They relate their findings to previous studies that have put forward the concept of an ‘inverse RF’ based on strong responses to small grey patches on a full-screen grating. They also discuss their results in the context of studies that suggest that surround responses are related to predictions of the RF content or figure-ground segregation.

      Strengths:

      I find the study to be an interesting extension of the work on surround stimulation and the addition of the LGN data is useful showing that the surround-induced responses are not present in the feedforward path. The conclusions appear solid, being based on large numbers of neurons obtained through Neuropixels recordings. The use of many different stimulus combinations provides a rich view of the nature of the surround-induced responses.

      Weaknesses:

      The statistics are pooled across animals, which is less appropriate for hierarchical data. There is no histological confirmation of placement of the electrode in the LGN and there is no analysis of eye or face movements which may have contributed to the surround-induced responses. There are also some missing statistics and methods details which make interpretation more difficult.

      We thank the reviewer for their positive and constructive comments, and have addressed these specific issues in response to the minor comments. For the statistics across animals, we refer to “Reviewer 1 recommendations” point 1. For the histological analysis, we refer to “Reviewer 1 recommendations” point 2. For the eye and facial movements, we refer to “Reviewer 1 recommendations” point 5. Concerning missing statistics and methods details, we refer to various responses to “Reviewer 1 recommendations”. We thoroughly reviewed the manuscript and included all missing statistical and methodological details.

      Reviewer #2 (Public review):

      Cuevas et al. investigate the stimulus selectivity of surround-induced responses in the mouse primary visual cortex (V1). While classical experiments in non-human primates and cats have generally demonstrated that stimuli in the surround receptive field (RF) of V1 neurons only modulate activity to stimuli presented in the center RF, without eliciting responses when presented in isolation, recent studies in mouse V1 have indicated the presence of purely surround-induced responses. These have been linked to prediction error signals. In this study, the authors build on these previous findings by systematically examining the stimulus selectivity of surround-induced responses.

      Using Neuropixels recordings in V1 and the dorsal lateral geniculate nucleus (dLGN) of head-fixed, awake mice, the authors presented various stimulus types (gratings, noise, surfaces) to the center and surround, as well as to the surround only, while also varying the size of the stimuli. Their results confirm the existence of surround-induced responses in mouse V1 neurons, demonstrating that these responses do not require spatial or temporal coherence across the surround, as would be expected if they were linked to prediction error signals. Instead, they suggest that surround-induced responses primarily reflect the representation of the achromatic surface itself.

      The literature on center-surround effects in V1 is extensive and sometimes confusing, likely due to the use of different species, stimulus configurations, contrast levels, and stimulus sizes across different studies. It is plausible that surround modulation serves multiple functions depending on these parameters. Within this context, the study by Cuevas et al. makes a significant contribution by exploring the relationship between surround-induced responses in mouse V1 and stimulus statistics. The research is meticulously conducted and incorporates a wide range of experimental stimulus conditions, providing valuable new insights regarding center-surround interactions.

      However, the current manuscript presents challenges in readability for both non-experts and experts. Some conclusions are difficult to follow or not clearly justified.

      I recommend the following improvements to enhance clarity and comprehension:

      (1) Clearly state the hypotheses being tested at the beginning of the manuscript.

      (2) Always specify the species used in referenced studies to avoid confusion (esp. Introduction and Discussion).

      (3) Briefly summarize the main findings at the beginning of each section to provide context.

      (4) Clearly define important terms such as “surface stimulus” and “early vs. late stimulus period” to ensure understanding.

      (5) Provide a rationale for each result section, explaining the significance of the findings.

      (6) Offer a detailed explanation of why the results do not support the prediction error signal hypothesis but instead suggest an encoding of the achromatic surface.

      These adjustments will help make the manuscript more accessible and its conclusions more compelling.

      We thank the reviewer for their constructive feedback and for highlighting the need for improved clarity regarding the hypotheses and their relation to the experimental findings.

      • We have strongly improved the Introduction and Discussion sections, explaining the different hypotheses and their relation to the performed experiments.

      • In the Introduction, we have clearly outlined each hypothesis and its predictions, providing a structured framework for understanding the rationale behind our experimental design.

      • In the Discussion, we have been more explicit in explaining how the experimental findings inform these hypotheses.

      • We explicitly mentioned the species used in the referenced studies.

      • We provided a clearer rationale for each experiment in the Results section.

      We have also always clearly stated the species that previous studies used, both in the Introduction and Discussion section.

      Reviewer #3 (Public review):

      Summary:

      This paper explores the phenomenon whereby some V1 neurons can respond to stimuli presented far outside their receptive field. It introduces three possible explanations for this phenomenon and it presents experiments that it argues favor the third explanation, based on figure/ground segregation.

      Strengths:

      I found it useful to see that there are three possible interpretations of this finding (prediction error, interpolation, and figure/ground). I also found it useful to see a comparison with LGN responses and to see that the effect there is not only absent but actually the opposite: stimuli presented far outside the receptive field suppress rather than drive the neurons. Other experiments presented here may also be of interest to the field.

      Weaknesses:

      The paper is not particularly clear. I came out of it rather confused as to which hypotheses were still standing and which hypotheses were ruled out. There are numerous ways to make it clearer.

      We thank the reviewer for their constructive feedback and for highlighting the need for improved clarity regarding the hypotheses and their relation to the experimental findings.

      • We have strongly improved the Introduction and Discussion sections, explaining the different hypotheses and their relation to the performed experiments.

      • In the Introduction, we have clearly outlined each hypothesis and its predictions, providing a structured framework for understanding the rationale behind our experimental design.

      • In the Discussion, we have been more explicit in explaining how the experimental findings inform these hypotheses.

      **Recommendations for the Authors:**

      Reviewer #1 (Recommendations for the Authors):

      (1) Given the data is hierarchical with neurons clustered within 6 mice (how many recording sessions per animal?) I would recommend the use of Linear Mixed Effects models. Simply pooling all neurons increases the risk of false alarms.

      To clarify: We used the standard method for analyzing single-unit recordings, by comparing the responses of a population of single neurons between two different conditions. This means that the responses of each single neuron were measured in the different conditions, and the statistics were therefore based on the pairwise differences computed for each neuron separately. This is a common and standard procedure in systems neuroscience, and was also used in the previous studies on this topic (Keller et al., 2020; Kirchberger et al., 2023). We were not concerned with comparing two groups of animals, for which hierarchical analyses are recommended. To address the reviewer’s concern, we did examine whether differences between baseline and the gray/drift condition, as well as the gray/drift compared to the grating condition, were consistent across sessions, which was indeed the case. These findings are presented in Supplementary Figure 6.
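
      The distinction discussed in this exchange, between pooling all neurons and checking consistency across sessions, can be illustrated with a minimal sketch. It is not the analysis used in the study: the function names, the sign-flip permutation test, and the toy numbers are all assumptions chosen for illustration, and a full linear mixed-effects model would need a dedicated statistics package.

```python
import random
from statistics import mean


def paired_differences(rates_a, rates_b):
    """Per-neuron difference in firing rate (condition B minus condition A)."""
    return [b - a for a, b in zip(rates_a, rates_b)]


def session_means(diffs, session_ids):
    """Average the paired differences within each recording session."""
    by_session = {}
    for d, s in zip(diffs, session_ids):
        by_session.setdefault(s, []).append(d)
    return {s: mean(v) for s, v in by_session.items()}


def sign_flip_pvalue(values, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test of mean(values) against zero."""
    rng = random.Random(seed)
    observed = abs(mean(values))
    hits = 0
    for _ in range(n_perm):
        flipped = [v if rng.random() < 0.5 else -v for v in values]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 6 neurons from 3 sessions (spikes/s).
baseline = [2.0, 3.1, 1.5, 4.0, 2.2, 3.3]
gray_patch = [3.0, 3.9, 2.5, 4.6, 3.0, 4.5]
sessions = ["s1", "s1", "s2", "s2", "s3", "s3"]

diffs = paired_differences(baseline, gray_patch)
per_session = session_means(diffs, sessions)
print(per_session)                                   # effect per session
print(sign_flip_pvalue(diffs))                       # neurons as units
print(sign_flip_pvalue(list(per_session.values())))  # sessions as units
```

      Testing on session means treats the session, rather than the neuron, as the independent unit. That is more conservative when neurons from one session share variance, at the cost of power when only a few sessions are available.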

      (2) Line 432: “The study utilized three to eight-month-old mice of both genders”. This is confusing, I assume they mean six mice in total, please restate. What about the LGN recordings, were these done in the same mice? Can the authors please clarify how many animals, how many total units, how many included units, how many recording sessions per animal, and whether the same units were recorded in all experiments?

      We have now clarified the information regarding the animals used in the Methods section.

      • We state that “We included female and male mice (C57BL/6), a total of six animals for V1 recordings between three and eight months old. In two of those animals, we recorded simultaneously from LGN and V1.”

      • We state that “For each animal, we recorded around 2-3 sessions from each hemisphere, and we recorded from both hemispheres.”

      • We noted that the number of neurons was not mentioned for each figure caption. We apologize for this omission. We have now added the number for all of the figures and protocols to the revised manuscript. We note that the same neurons were recorded for the different conditions within each protocol; however, because a few sessions were short, we recorded more units for the grating protocol. Note that we did not make statistical comparisons between protocols.

      (3) I see no histology for confirmation of placement of the electrode in the LGN, how can they be sure they were recording from the LGN? There is also little description of the LGN experiments in the methods.

      For better clarity, we have included a reconstruction of the electrode track from histological sections of one animal post-experiment (Figure S4). The LGN was targeted via stereotactical surgery, and the visual responses in this area are highly distinct. In addition, we used a flash protocol to identify the early-latency responses typical for the LGN, which is described in the Methods section: “A flash stimulus was employed to confirm the locations of LGN at the beginning of the recording sessions, similar to our previous work in which we recorded from LGN and V1 simultaneously (Schneider et al., 2023). This stimulus consisted of a 100 ms white screen and a 2 s gray screen as the inter-stimulus interval, designed to identify visually responsive areas. The responses of multi-unit activity (MUA) to the flash stimulus were extracted and a CSD analysis was then performed on the MUA, sampling every two channels. The resulting CSD profiles were plotted to identify channels corresponding to the LGN. During LGN recordings, simultaneous recordings were made from V1, revealing visually responsive areas interspersed with non-responsive channels.”

      (4) Many statements are not backed up by statistics, for example, each time the authors report that the response at 90° is higher than baseline (Line 121 amongst other places) there is no test to support this. Also Line 140 (negative correlation), Line 145, Line 180.

      For comparison purposes, we only presented statistical analyses across conditions. However, we have now added information to the figure captions stating that all conditions show values higher than the baseline.

      (5) As far as I can see there is no analysis of eye movements or facial movements. This could be an issue, for example, if the onset of the far surround stimuli induces movements this may lead to spurious activations in V1 that would be interpreted as surround-induced responses.

      To address this point, we have included a supplementary figure analyzing facial movements across different sessions and comparing them between conditions (Supplementary Figure 5). A detailed explanation of this analysis has been added to the Methods section. Overall, we observed no significant differences in face movements between trials with gratings, trials with the gray patch, and trials with the gray screen presented during baseline. Animals exhibited similar face movements across all three conditions, supporting the conclusion that the observed neural firing rate increases for the gray-patch condition are not related to face movements.

      (6) The experiments with the rectangular patch (Figure 3) seem to give a slightly different result as the responses for large sizes (75, 90) don’t appear to be above baseline. This condition is also perceptually the least consistent with a grey surface in the RF, the grey patch doesn’t appear to occlude the surface in this condition. I think this is largely consistent with their conclusions and it could merit some discussion in the results/discussion section.

      While the effect may be somewhat weaker, the total stimulated surround also covers a smaller area because of the large rectangular gray patch. Furthermore, the early responses are clearly elevated above baseline, and the responses up to 70 degrees are still higher than baseline. Hence, we think that the data point for 90 degrees does not warrant a strong interpretation.

      Minor points:

      (1) Figure 1h: What is the statistical test reported in the panel (I guess a signed rank based on later figures)? Figure 4d doesn’t appear to be significantly different but is reported as so. Perhaps the median can be indicated on the distribution?

      We explained that we used a signed rank test for Figure 1h, and we have now included the median of the distributions in Figure 4d.

      (2) What was the reason for having the gratings only extend to half the x-axis of the screen, rather than being full-screen? This creates a percept (in humans at least) that is more consistent with the grey patch being a hole in the grating as the grey patch has the same luminance as the background outside the grating.

      We explained in the Methods section that “We presented only half of the x-axis due to the large size of our monitor, in order to avoid over-stimulation of the animals with very large grating stimuli.” Perceptually speaking, the gray patch appears as something occluding the grating, not as a “hole”.

      (3) Line 103: “and, importantly, had less than 10° (absolute) distance to the grating stimulus’ RF center.” Re-phrase, a stimulus doesn’t have an RF center.

      We corrected this to “We included only single units in the analysis that met several criteria in terms of visual responses (see Methods) and, importantly, the RF center had less than 10° (absolute) distance to the grating stimulus’ center.”

      (4) Line 143: “We recorded single neurons LGN” - should be “single LGN neurons”.

      We corrected this to “we recorded single LGN neurons”.

      (5) Line 200: They could spell out here that the latency is consistent with the latency observed for the grey patch conditions in the previous experiments.

      (6) Line 465: This is very brief. What criteria did they use for single-unit assignation? Were all units well-isolated or were multi-units included?

      We clarified in the Methods section that “We isolated single units with Kilosort 2.5 (Steinmetz et al., 2021) and manually curated them with Phy2 (Rossant et al., 2021). We included only single units with a maximum contamination of 10 percent.”

      (7) Line 469: “The experiment was run on a Windows 10”. Typo.

      We corrected this to “The experiment was run on Windows 10”.

      (8) Line 481: “We averaged the response over all trials and positions of the screen”. What do they mean by ’positions of the screen’?

      We changed this to “We computed the response for each position separately, by averaging the response across all the trials where a square was presented at a given position.”

      (9) Line 483: “We fitted an ellipse in the center of the response”. How?

      We now explain how we performed the detection of the RF using ellipse fitting: “A heatmap of the response was computed. This heatmap was then smoothed, and we calculated the location of the peak response. From the heatmap we calculated the centroid of the response using the function regionprops.m, which finds unique objects; we then selected the biggest area detected, using the centroids provided as output. We then fitted an ellipse centered on this peak response location to the smoothed heatmap using the MATLAB function ellipse.m.”

      (10) Line 485 “...and positioned the stimulus at the response peak previously found”. Unclear wording, do you mean the center of the ellipse fit to the MUA response averaged across channels or something else?

      (11) Line 487: “We performed a permutation test of the responses inside the RF detected vs a circle from the same area where the screen was gray for the same trials.”. The wording is a bit unclear here, can they clarify what they mean by the ’same trials’, what is being compared to what here?

      We used a permutation test to compare the neuron’s responses to black and white squares inside the RF to the condition where there was no square in the RF (i.e. the RF was covered by the gray background).
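
      For illustration only, a two-sample permutation test of this kind can be sketched in Python as follows. The test statistic (difference of mean responses), the two-sided p-value, the number of permutations, and all function and variable names are assumptions made for this sketch; the response above does not specify them.

```python
import numpy as np

def permutation_test(resp_square, resp_gray, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of mean responses.

    resp_square: responses on trials with a square inside the RF (hypothetical input)
    resp_gray:   responses on trials where the RF was covered by the gray background
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(resp_square, dtype=float)
    b = np.asarray(resp_gray, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    n_extreme = 0
    for _ in range(n_perm):
        # Shuffle condition labels by permuting the pooled responses.
        perm = rng.permutation(pooled)
        diff = perm[:a.size].mean() - perm[a.size:].mean()
        if abs(diff) >= abs(observed):
            n_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (n_extreme + 1) / (n_perm + 1)
    return observed, p_value
```

      A neuron would then count as visually responsive under this sketch if the returned p-value falls below a chosen threshold.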

      (12) Was the pink noise background regenerated on each trial or as the same noise pattern shown on each trial?

      We explain that “We randomly presented one of two different pink noise images.”

      (13) Line 552: “...used a time window of the Gaussian smoothing kernel from-.05 to .05”. Missing units.

      We explained that “we used a time window of the Gaussian smoothing kernel from -.05 s to .05 s, with a standard deviation of 0.0125 s.”
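
      As an illustration, a kernel with these parameters could be constructed as in the following minimal Python sketch. The window (-0.05 s to 0.05 s) and the standard deviation (0.0125 s) come from the text above; the 1 ms bin width, the unit-area normalization, and the function names are assumptions.

```python
import numpy as np

def gaussian_kernel(dt=0.001, half_width=0.05, sigma=0.0125):
    """Gaussian kernel sampled every dt seconds on [-half_width, half_width].

    The kernel is normalized to sum to 1 so that smoothing preserves
    the total spike count.
    """
    n = int(round(half_width / dt))
    t = np.arange(-n, n + 1) * dt
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return t, k / k.sum()

def smooth_rate(spike_counts, dt=0.001, half_width=0.05, sigma=0.0125):
    """Convert binned spike counts (one bin per dt) to a smoothed rate in spikes/s."""
    _, k = gaussian_kernel(dt=dt, half_width=half_width, sigma=sigma)
    rate = np.asarray(spike_counts, dtype=float) / dt
    return np.convolve(rate, k, mode="same")
```

      With these values the window spans four standard deviations on each side of zero, so truncating the kernel at the window edges discards very little of its mass.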

      (14) Line 565: “Additionally, for the occluded stimulus, we included patch sizes of 70° and larger.”. Not sure what they’re referring to here.

      We changed this to: “For the population analyses, we analyzed the conditions in which the gray patch sizes were 70 degrees and 90 degrees”.

      (15) Line 569: What is perplexity, and how does changing it affect the t-SNE embeddings?

      Note that t-SNE is only used for visualization purposes. In the revised manuscript, we have expanded our explanation regarding the use of t-SNE and the choice of perplexity values. Specifically, we have clarified that we used a perplexity value of 20 for the Gratings with circular and rectangular occluders and 100 for the black-and-white condition. These values were empirically selected to ensure that the groups in the data were clearly separable while maintaining the balance between local and global relationships in the projected space. This choice allowed us to visually distinguish the different groups while preserving the meaningful structure encoded in the dissimilarity matrices. In particular, varying the perplexity values would not alter the conclusions drawn from the visualization, as t-SNE does not affect the underlying analytical steps of our study.

      (16) Line 572: “We trained a C-Support Vector Classifier based on dissimilarity matrices”. This is overly brief, please describe the construction of the dissimilarity matrices and how the training was implemented. Was this binary, multi-class? What conditions were compared exactly?

      In the revised manuscript, we have expanded our explanation regarding the construction of the dissimilarity matrices and the implementation of the C-Support Vector Classification (C-SVC) model (See Methods section).

      The dissimilarity matrices were calculated using the Euclidean distance between firing rate vectors for all pairs of trials (as shown in Figure 6a-b). These matrices were used directly as input for the classifier. It is important to note that t-SNE was not used for classification but only for visualization purposes. The classifier was binary, distinguishing between two classes (e.g., Dr vs St). We trained the model using 60% of the data for training and used 40% for testing. The C-SVC was implemented using sklearn, and the classification score corresponds to the average accuracy across 20 repetitions.
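
      The procedure described above can be sketched in Python with scikit-learn as follows. This is an illustrative sketch, not the authors' code: the Euclidean distance, the binary classifier, the 60/40 split, and the 20 repetitions come from the text, whereas the default kernel and C value, the stratified split, the use of distances to the training trials as features, and all function names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def dissimilarity_matrix(rates):
    """Euclidean distance between firing-rate vectors for all pairs of trials.

    rates: array of shape (n_trials, n_neurons).
    """
    diff = rates[:, None, :] - rates[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def decode(rates, labels, n_rep=20, test_size=0.4, seed=0):
    """Mean test accuracy of a binary C-SVC over repeated 60/40 splits."""
    D = dissimilarity_matrix(rates)
    idx = np.arange(len(labels))
    scores = []
    for rep in range(n_rep):
        train, test = train_test_split(
            idx, test_size=test_size, stratify=labels, random_state=seed + rep
        )
        clf = SVC()  # default kernel and C: an assumption of this sketch
        # Each trial is described by its distances to the training trials only,
        # so no information about test trials enters the fit.
        clf.fit(D[np.ix_(train, train)], labels[train])
        scores.append(clf.score(D[np.ix_(test, train)], labels[test]))
    return float(np.mean(scores))
```

      Restricting the feature columns to the training trials is one way to avoid leakage between the training and test sets; whether the original analysis did this is not stated.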

      Reviewer #2 (Recommendations for the Authors):

      The relationship between the current paper and Keller et al. is challenging to understand. It seems like the study is critiquing the previous study but rather implicitly and not directly. I would suggest either directly stating the criticism or presenting the current study as a follow-up investigation that further explores the observed effect or provides an alternative function. Additionally, defining the inverse RF versus surround-induced responses earlier than in the discussion would be beneficial. Some suggestions:

      (1) The introduction is well-written, but it would be helpful to clearly define the hypotheses regarding the function of surround-induced responses and revisit these hypotheses one by one in the results section.

      Indeed, we have generally improved the Introduction of the manuscript, and stated the hypotheses and their relationships to the Experiments more clearly.

      (2) Explicitly mention how you compare classic grating stimuli of varying sizes with gray patch stimuli. Do the patch stimuli all come with a full-field grating? For the full-field grating, you have one size parameter, while for the patch stimuli, you have two (size of the patch and size of the grating).

      We now clearly describe how we compare grating stimuli of varying sizes with gray patch stimuli.

      (3) The third paragraph in the introduction reads more like a discussion and might be better placed there.

      We have moved content from the third paragraph of the Introduction to the Discussion, where it fits more naturally.

      (4) Include 1-2 sentences explaining how you center RFs and detail the resolution of your method.

      We have added an explanation to the Methods: “To center the visual stimuli during the recording session, we averaged the multiunit activity across the responsive channels and positioned the stimulus at the center of the ellipse fit to the MUA response averaged across channels.”

      (5) Motivate the use of achromatic stimuli. This section is generally quite hard to understand, so try to simplify it.

      We explained better in the Introduction why we performed this particular experiment.

      (6) The decoding analysis is great, but it is somewhat difficult to understand the most important results. Consider summarizing the key findings at the beginning of this section.

      We now provide a clearer motivation at the start of the Decoding section.

      Reviewer #3 (Recommendations for the Authors):

      I have a few suggestions to improve the clarity of the presentation.

      Abstract: it lists a series of observations and it ends with a conclusion (“based on these findings...”). However, it provides little explanation for how this conclusion would arise from the observations. It would be more helpful to introduce the reasoning at the top and show what is consistent with it.

      We have improved the abstract of the paper incorporating this feedback.

      To some extent, this applies to Results too. Sometimes we are shown the results of some experiment just because others have done a similar experiment. Would it be better to tell us which hypotheses it tests and whether the results are consistent with all 3 hypotheses or might rule one or more out? I came out of the paper rather confused as to which hypotheses were still standing and which hypotheses were ruled out.

      We have strongly improved our explanation of the hypotheses and the relationships to the experiments in the Introduction.

      It would be best if the Results section focused on the results of the study, without much emphasis on what previous studies did or did not measure. Here, instead, in the middle of Results we are told multiple times what Keller et al. (2020) did or did not measure, and what they did or did not find. Please focus on the questions and on the results. Where they agree or disagree with previous papers, tell us briefly that this is the case.

      We have revised the Results section in the revised manuscript, and ensured that there is much less focus on what previous studies did in the Results. Differences to previous work are now discussed in the Discussion section.

      The notation is extremely awkward. For instance “Gc” stands for two words (Gray center) but “Gr” stands for a single word (Grating). The double meaning of G is one of many sources of confusion.

      This notation needs to be revised. Here is one way to make it simpler: choose one word for each type of stimulus (e.g. Gray, White, Black, Drift, Stat, Noise) and use it without abbreviations. To indicate the configuration, combine two of those words (e.g. Gray/Drift for Gray in the center and Drift in the surround).

      We have corrected the notation in the figures and text to enhance readability and improve the reader’s understanding.

      Figure 1e and many subsequent ones: it is not clear why the firing rate is shown in a logarithmic scale. Why not show it in a linear scale? Anyway, if the logarithmic scale is preferred for some reason, then please give us ticks at numbers that we can interpret, like 0.1,1,10,100... or 0.5,1,2,4... Also, please use the same y-scale across figures so we can compare.

      To clarify: firing rates must be normalized relative to baseline in order to pool across neurons. However, a purely divisive normalization is problematic on a linear scale, because a doubling (a ratio of 2) and a halving (a ratio of 0.5) are not represented symmetrically around 1. Such a ratio is also highly sensitive to outliers. For these reasons, taking the logarithm (base 10) of the ratio is an appropriate transformation. We changed the tick labels to 1, 2, 4, as the reviewer suggested.
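
      The effect of the log transform can be checked numerically with a short Python sketch. The function name is hypothetical, and the sketch assumes strictly positive rates; the handling of zero firing rates is not described here and is left out.

```python
import numpy as np

def log_ratio(rate, baseline):
    """Base-10 logarithm of the firing rate relative to baseline.

    Assumes rate > 0 and baseline > 0.
    """
    return np.log10(np.asarray(rate, dtype=float) / baseline)

# A doubling and a halving are equidistant from zero after the transform,
# whereas the raw ratios (2 and 0.5) are not equidistant from 1.
doubling = log_ratio(2.0, 1.0)
halving = log_ratio(0.5, 1.0)

# Axis positions for tick labels 1, 2, 4 on the log-ratio axis.
tick_positions = log_ratio([1.0, 2.0, 4.0], 1.0)
```

      Here the tick positions are evenly spaced on the transformed axis, because each label is twice the previous one.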

      Figure 3: it is not clear what “size” refers to in the stimuli where there is no gray center. Is it the horizontal size of the overall stimulus? Some cartoons might help. Or just some words to explain.

      Figure 3: if my understanding of “size” above is correct, the results are remarkable: there is no effect whatsoever of replacing the center stimulus with a gray rectangle. Shouldn’t this be remarked upon?

      We have added a paragraph under figure 3 and in the Methods section explaining that the sizes represent the varying horizontal dimensions of the rectangular patch. In this protocol, the classical condition (i.e. without gray patch) was shown only as full-field gratings, which is depicted in the plot as size 0, indicating no rectangular patch was present.

      DETAILS The word “achromatic” appears many times in the paper and is essentially uninformative (all stimuli in this study are achromatic, including the gratings). It could be removed in most places except a few, where it is actually used to mean “uniform”. In those cases, it should be replaced by “uniform”.

      Ditto for the word “luminous”, which appears twice and has no apparent meaning. Please replace it with “uniform”.

      We have replaced the words achromatic and luminous with “uniform” stimuli to improve the clarity when we refer to only black or white stimuli.

      Page 3, line 70: “We raise some important factors to consider when describing responses to only surround stimulation.” This sentence might belong in the Discussion but not in the middle of a paragraph of Results.

      We removed this sentence.

      Neuropixel - Neuropixels (plural)

      “area LGN” - LGN

      We corrected these misspellings.

      References

      Keller, A.J., Roth, M.M., Scanziani, M., 2020. Feedback generates a second receptive field in neurons of the visual cortex. Nature 582, 545–549. doi:10.1038/s41586-020-2319-4.

      Kirchberger, L., Mukherjee, S., Self, M.W., Roelfsema, P.R., 2023. Contextual drive of neuronal responses in mouse V1 in the absence of feedforward input. Science Advances 9, eadd2498. doi:10.1126/sciadv.add2498.

      Rossant, C., et al., 2021. phy: Interactive analysis of large-scale electrophysiological data. https://github.com/cortex-lab/phy.

      Schneider, M., Tzanou, A., Uran, C., Vinck, M., 2023. Cell-type-specific propagation of visual flicker. Cell Reports 42.

      Steinmetz, N.A., Aydin, C., Lebedeva, A., Okun, M., Pachitariu, M., Bauza, M., Beau, M., Bhagat, J., Böhm, C., Broux, M., Chen, S., Colonell, J., Gardner, R.J., Karsh, B., Kloosterman, F., Kostadinov, D., Mora-Lopez, C., O’Callaghan, J., Park, J., Putzeys, J., Sauerbrei, B., van Daal, R.J.J., Vollan, A.Z., Wang, S., Welkenhuysen, M., Ye, Z., Dudman, J.T., Dutta, B., Hantman, A.W., Harris, K.D., Lee, A.K., Moser, E.I., O’Keefe, J., Renart, A., Svoboda, K., Häusser, M., Haesler, S., Carandini, M., Harris, T.D., 2021. Neuropixels 2.0: A miniaturized high-density probe for stable, long-term brain recordings. Science 372, eabf4588. doi:10.1126/science.abf4588.

    2. eLife Assessment

      This valuable study investigates the selectivity of neuronal responses in the neocortex and thalamus to visual stimuli presented far outside their receptive fields. The study shows convincing evidence for a long-latency surround-induced response in primary visual cortex that is absent in the dorsal lateral geniculate nucleus and does not depend strongly on the visual characteristics of the surround stimulus. The paper should be of interest to neurophysiologists interested in vision and contextual modulations.

    3. Reviewer #1 (Public review):

      Summary:

      The authors report a study on how stimulation of receptive-field surround of V1 and LGN neurons affects their firing-rates. Specifically, they examine stimuli in which a grey patch covers the classical RF of the cell and a stimulus appears in the surround. Using a number of different stimulus paradigms they find a long latency response in V1 (but not the LGN) which does not depend strongly on the characteristics of the surround grating (drifting vs static, continuous vs discontinuous, predictable grating vs unpredictable pink noise). They find that population responses to simple achromatic stimuli have a different structure that does not distinguish so clearly between the grey patch and other conditions and the latency of the response was similar regardless of whether the center or surround was stimulated by the achromatic surface. Taken together they propose that the surround-response is related to the representation of the grey surface itself. They relate their findings to previous studies which have put forward the concept of an 'inverse RF' based on strong responses to small grey patches on a full-screen grating. They also discuss their results in the context of studies that suggest that surround responses are related to predictions of the RF content or figure-ground segregation.

      Strengths:

      I find the study to be an interesting extension of the work on surround stimulation and the addition of the LGN data is useful showing that the surround-induced responses are not present in the feed-forward path. The conclusions appear solid, being based on large numbers of neurons obtained through Neuropixels recordings. The use of many different stimulus combinations provides a rich view of the nature of the surround-induced responses.

      Weaknesses:

      The LGN data comes from a small number of animals (n=2). Statistics are generally pooled across all recording sessions/animals without taking into account the higher covariance of neurons recorded in the same session. This is not a problem for paired comparisons, but for some statistics in the paper a hierarchical approach would have been more appropriate. The authors do present individual session data and the effects appear to be consistent across sessions.

    4. Reviewer #3 (Public review):

      Summary:

      This paper explores the phenomenon whereby some V1 neurons can respond to stimuli presented far outside their receptive field. It introduces three possible explanations for this phenomenon and it presents experiments that it argues favor the third explanation, which is based on figure/ground segregation.

      Strengths:

      I found it useful to see that there are three possible interpretations of this finding (prediction error, interpolation, and figure/ground). I also found it useful to see a comparison with LGN responses and to see that the effect there is not only absent but actually opposite: stimuli presented far outside the receptive field suppress rather than drive the neurons. Other experiments presented here may also be of interest to the field.

      Weaknesses:

      Though the paper has markedly improved, and now has a clearer statement of the hypotheses, it could be streamlined further, to tighten the relation between hypotheses and analyses, and to draw conclusions from those analyses in terms of the hypotheses.

    1. eLife Assessment

      This important study uses long-term behavioural observations to understand the factors that influence female-on-female aggression in gorilla social groups. The evidence supporting the claims is convincing, as it includes novel methods of assessing aggression and considers other potential factors. The work will be of interest to a broad range of biologists working on the social interactions of animals.

    2. Reviewer #1 (Public review):

      Summary:

      This work aims to improve our understanding of the factors that influence female-on-female aggressive interactions in gorilla social hierarchies, using 25 years of behavioural data from five wild groups of two gorilla species. Researchers analysed aggressive interactions between 31 adult females, using behavioural observations and dominance hierarchies inferred through Elo-rating methods. Aggression intensity (mild, moderate, severe) and direction (measured as the rank difference between aggressor and recipient) were used as key variables. A linear mixed-effects model was applied to evaluate how aggression direction varied with reproductive state (cycling, trimester-specific pregnancy, or lactation) and sex composition of the group. This study highlights the direction of aggressive interactions between females, with most interactions being directed from higher- to lower-ranking adult females close in social rank. However, the results show that 42% of these interactions are directed from lower- to higher-ranking females. Particularly, lactating and pregnant females targeted higher-ranking individuals, which the authors suggest might be due to higher energetic needs, which increase risk-taking in lactating and pregnant females. Sex composition within the group also influenced which individuals were targeted. The authors suggest that male presence buffers female-on-female aggression, allowing females to target higher-ranking females than themselves. In contrast, females targeted lower-ranking females than themselves in groups with a larger ratio of females, which supposes a lower risk for the females since the pool of competitors is larger. 
The findings provide an important insight into aggression heuristics in primate social systems and the social and individual factors that influence these interactions, providing a deeper understanding of the evolutionary pressures that shape risk-taking, dominance maintenance, and the flexibility of social strategies in group-living species.

      The authors achieved their aim by demonstrating that aggression direction in female gorillas is influenced by factors such as reproductive condition and social context, and their results support the broader claim that aggression heuristics are flexible. However, some specific interpretations require further support. Despite this, the study makes a valuable contribution to the field of behavioural ecology by reframing how we think about intra-sexual competition and social rank maintenance in primates.

      Strengths:

      One of the study's major strengths is the use of an extensive dataset that compiles 25 years of behavioural data and 6871 aggressive interactions between 31 adult females in five social groups, which allows for a robust statistical analysis. This study uses a novel approach to the study of aggression in social groups by including factors such as the direction and intensity of aggressive interactions, which offers a comprehensive understanding of these complex social dynamics. In addition, this study incorporates ecological and physiological factors such as the reproductive state of the females and the sex composition of the group, which allows an integrative perspective on aggression within the broader context of body condition and social environment. The authors successfully integrate their results into broader evolutionary and ecological frameworks, enriching discussions around social hierarchies and risk sensitivity in primates and other animals.

      Weaknesses:

      Although the paper has a novel approach by studying the effect of reproductive state and social environment on female-female aggression, the use of observational data without experimental manipulation limits the ability to establish causation. The authors suggest that the difference observed in female aggression direction between groups with different sex composition might be indicative of male presence buffering aggression, which seems speculative, as no direct evidence of male intervention or support was reported. Similarly, the use of reproductive state as a proxy for energetic need is an indirect measure and does not account for actual energy expenditure or caloric intake, which weakens the authors' claims that female energetic need induces risk-taking. Overall, this paper would benefit from stronger justification and empirical support to strengthen the conclusions of the study about the mechanisms driving female aggression in gorillas.

    3. Reviewer #2 (Public review):

      Summary:

      The authors' aim in this study is to assess the factors that can shift competitive incentives against higher- or lower-ranking groupmates in two gorilla species.

      Strengths:

      This is a relevant topic, where important insights could be gained. The authors brought together a substantial dataset: a long-term behavioral dataset representing two gorilla species from five social groups.

      Weaknesses:

      The authors have not fully shown the data used in the model and explored the potential of the model. Therefore, I remain cautious about the current results and conclusions.

      Some specific suggestions that require attention are

      (1) The authors devoted a whole paragraph to describing how group size can affect aggression patterns in some species (line 54), but did not include it as an explanatory variable in their model, even though they stated that overall group size can "conflate opposing effects of females and males" (line 85). I suggest underlining the effects of the numbers of males and/or females here and de-emphasizing the effect of group size in the Introduction.

      (2) There should be more details given about how the authors calculated individual Elo-ratings (line 98). It seems that authors pooled all avoidance/displacement behaviors throughout the study period. But how often was the Elo-rating they included in the model calculated? By the day or by the month? I guess it was by the day, as they "estimate female reproductive state daily" (line 123). If so, it should be made clear in the text.

      In addition, all groups were studied long-term, and the group composition appears to fluctuate, based on Table 1 in Reference 11. When an individual enters or leaves a group with a stable hierarchy, it takes time before the hierarchy becomes stable again. If the avoidance/displacement behaviors used to infer rank relationships were not common, this would take a few days or maybe longer. Also, were the aggressive behaviors more common during rank fluctuations? In other words, if avoidance/displacement behaviors and aggressive behaviors occur simultaneously during rank fluctuations, how did the authors deal with this and take it into consideration in the analysis?

      The authors emphasized several times in the text that gorillas "form highly stable hierarchical relationships". Also, in Reference 25, they found very high stabilities of each group's hierarchy. However, the number of females involved in that analysis was different from that used here. They need to provide more basic info on each group's dominance hierarchy and verify their statement. I strongly suggest that the authors display Elo-rating trajectories and necessary relevant statistics for each group throughout the study period as part of the supplementary materials.

      (3) The authors stated why they differentiated the different stages based on female reproductive status. They also referred to the differences in energetic needs between stages of pregnancy and lactation (lines 127-128). However, in the mixed model, they only compared the interaction score between the female cycling stage and other stages. The model was not well explained, and the results could be expanded. I suggest conducting more pairwise comparisons in the model and presenting the statistics in the text, if there are significant results. If all three pregnancy stages differed significantly from cycling and lactating stages but not from each other, they may be merged as one pregnancy stage. More in-depth analysis would help provide better answers to the research questions.

    4. Reviewer #3 (Public review):

      Smit and Robbins' manuscript investigates the dynamics of aggression among female groupmates across five gorilla groups. The authors utilize longitudinal data to examine how reproductive state, group size, presence of males, and resource availability influence patterns of aggression and overall dominance rankings as measured by Elo scores. The findings underscore the important role of group composition and reproductive status, particularly pregnancy, in shaping dominance relationships in wild gorillas. While the study addresses a compelling and understudied topic, I have several comments and suggestions that may enhance clarity and improve the reader's experience.

      (1) Clarification of longitudinal data - The manuscript states that 25 years of behavioral data were used, but this number appears unclear. Based on my calculations, the maximum duration of behavioral observation for any one group appears to be 18 years. Specifically: - ATA: 6 years - BIT: 8 years - KYA: 18 years - MUK: 6 years - ORU: 8 years I recommend that the authors clarify how the 25-year duration was derived.

      (2) Consideration of group size - The authors mention that group size was excluded from analyses to avoid conflating the opposing effects of female and male group members. While this is understandable, it may still be beneficial to explore group size effects in supplementary analyses. I suggest reporting statistics related to group size and potentially including a supplementary figure. Additionally, given that the study includes both mountain and western gorillas, it would be helpful to examine whether any interspecies differences are apparent.

      (3) Behavioral measures clarification - Lines 112-116 describe the types of aggressive behaviors observed. It would be helpful to clarify how these behaviors differ from those used to calculate Elo scores, or whether they overlap. A brief explanation would improve transparency regarding the methodology.

      (4) Aggression rates versus Elo scores - The manuscript uses aggression rates rather than dominance rank (as measured by Elo scores) as the main outcome variable, but there is no explanation on why. How would the results differ if aggression rates were replaced or supplemented with Elo scores? The current justification for prioritizing aggression rates over dominance rank needs to be more clearly supported.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work aims to improve our understanding of the factors that influence female-on-female aggressive interactions in gorilla social hierarchies, using 25 years of behavioural data from five wild groups of two gorilla species. Researchers analysed aggressive interactions between 31 adult females, using behavioural observations and dominance hierarchies inferred through Elo-rating methods. Aggression intensity (mild, moderate, severe) and direction (measured as the rank difference between aggressor and recipient) were used as key variables. A linear mixed-effects model was applied to evaluate how aggression direction varied with reproductive state (cycling, trimester-specific pregnancy, or lactation) and sex composition of the group. This study highlights the direction of aggressive interactions between females, with most interactions being directed from higher- to lower-ranking adult females close in social rank. However, the results show that 42% of these interactions are directed from lower- to higher-ranking females. Particularly, lactating and pregnant females targeted higher-ranking individuals, which the authors suggest might be due to higher energetic needs, which increase risk-taking in lactating and pregnant females. Sex composition within the group also influenced which individuals were targeted. The authors suggest that male presence buffers female-on-female aggression, allowing females to target higher-ranking females than themselves. In contrast, females targeted lower-ranking females than themselves in groups with a larger ratio of females, which supposes a lower risk for the females since the pool of competitors is larger. 
      The findings provide an important insight into aggression heuristics in primate social systems and the social and individual factors that influence these interactions, providing a deeper understanding of the evolutionary pressures that shape risk-taking, dominance maintenance, and the flexibility of social strategies in group-living species.

      The authors achieved their aim by demonstrating that aggression direction in female gorillas is influenced by factors such as reproductive condition and social context, and their results support the broader claim that aggression heuristics are flexible. However, some specific interpretations require further support. Despite this, the study makes a valuable contribution to the field of behavioural ecology by reframing how we think about intra-sexual competition and social rank maintenance in primates.

      Strengths:

      One of the study's major strengths is the use of an extensive dataset that compiles 25 years of behavioural data and 6871 aggressive interactions between 31 adult females in five social groups, which allows for a robust statistical analysis. This study uses a novel approach to the study of aggression in social groups by including factors such as the direction and intensity of aggressive interactions, which offers a comprehensive understanding of these complex social dynamics. In addition, this study incorporates ecological and physiological factors such as the reproductive state of the females and the sex composition of the group, which allows an integrative perspective on aggression within the broader context of body condition and social environment. The authors successfully integrate their results into broader evolutionary and ecological frameworks, enriching discussions around social hierarchies and risk sensitivity in primates and other animals.

      Thank you for the positive assessment of our work and the nice summary of the manuscript!

      Weaknesses:

      Although the paper has a novel approach by studying the effect of reproductive state and social environment on female-female aggression, the use of observational data without experimental manipulation limits the ability to establish causation. The authors suggest that the difference observed in female aggression direction between groups with different sex composition might be indicative of male presence buffering aggression, which seems speculative, as no direct evidence of male intervention or support was reported. Similarly, the use of reproductive state as a proxy for energetic need is an indirect measure and does not account for actual energy expenditure or caloric intake, which weakens the authors' claims that female energetic need induces risk-taking. Overall, this paper would benefit from stronger justification and empirical support to strengthen the conclusions of the study about the mechanisms driving female aggression in gorillas.

      We agree that experimental manipulation would allow us to extend our work. Unfortunately, this is not possible with wild, endangered gorillas.

      We have now added more references (Watts 1994; Watts 1997) and enriched our arguments regarding male presence buffering aggression. Previous research suggests that male gorillas may support lower-ranking females and may intervene in female-female conflicts (Sicotte 2002). Unfortunately, our dataset did not allow us to test for male protection. We conduct proximity scans every 10 minutes, and these scans are not linked to individual interactions, meaning that we cannot reliably test whether proximity to a male influences the likelihood of receiving aggression.

      We have now clearly stated that reproductive state is an indirect proxy for energetic needs. We agree with your point about energy intake and expenditure, but unfortunately, we do not have data on energy expenditure or caloric intake to allow us to delve into more fine-grained analyses.

      Overall, we have tried to enrich the justification and empirical support to strengthen our conclusions by clarifying the text and adding more examples and references.

      Reviewer #2 (Public review):

      Summary:

      The authors' aim in this study is to assess the factors that can shift competitive incentives against higher- or lower-ranking groupmates in two gorilla species.

      Strengths:

      This is a relevant topic, where important insights could be gained. The authors brought together a substantial dataset: a long-term behavioral dataset representing two gorilla species from five social groups.

      Weaknesses:

      The authors have not fully shown the data used in the model and explored the potential of the model. Therefore, I remain cautious about the current results and conclusions.

      Some specific suggestions that require attention are:

      (1) The authors devoted a whole paragraph to how group size can affect aggression patterns in some species (line 54), but did not include it as an explanatory variable in their model, although they stated that overall group size can "conflate opposing effects of females and males" (line 85). I suggest underlining the effects of the numbers of males and/or females here and de-emphasizing the effect of group size in the Introduction.

      We did not use group size as a main predictor, as has been commonly done in other species, because of potentially conflating opposing effects of males and females. To further stress this point, we have specifically added in the introduction: “group size, the overall number of individuals in the group, might not be a good predictor of aggression heuristics, as it can conflate the effects of different kinds of individuals on aggression (see Smit & Robbins 2024 for an example of opposing effects of the number of females and number of males on female gorilla aggression).”

      We also “ran our analysis testing for group size (number of weaned individuals in the group), instead of the numbers of females and males, [and] its influence on interaction score was not significant (estimate=-0.001, p-value=0.682).”

      (2) There should be more details given about how the authors calculated individual Elo-ratings (line 98). It seems that the authors pooled all avoidance/displacement behaviors throughout the study period. But how often was the Elo-rating they included in the model calculated? By the day or by the month? I guess it was by the day, as they "estimate female reproductive state daily" (line 123). If so, it should be made clear in the text.

      We rephrased accordingly: “We used all avoidance and displacement interactions throughout the study period and we used the function elo.seq from R package EloRating to infer daily individual female Elo-scores”. We also clarified that “This method takes into account the temporal sequence of interactions and updates an individual’s Elo-scores each day the individual interacted with another...”
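      As a rough illustration of how sequential Elo updating can yield day-level scores, here is a minimal Python sketch. It uses the textbook logistic Elo formula and is not meant to reproduce the elo.seq function of the EloRating package; the function names, the starting value of 1000, k = 100, and the convention that the "winner" is the female who displaces or is avoided are all assumptions made for illustration, not details taken from the manuscript.

```python
from collections import defaultdict


def expected_win(r_winner, r_loser):
    # Textbook logistic Elo expectation on a 400-point scale (assumed form).
    return 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))


def daily_elo(interactions, start=1000.0, k=100.0):
    """Sequentially update Elo scores and record end-of-day values.

    interactions: iterable of (day, winner, loser), already sorted in time.
    Returns {day: {individual: score after that individual's last
    interaction on that day}}; individuals that did not interact on a
    given day have no entry for it.
    """
    ratings = defaultdict(lambda: start)  # hypothetical starting value
    by_day = {}
    for day, winner, loser in interactions:
        p = expected_win(ratings[winner], ratings[loser])
        delta = k * (1.0 - p)  # unexpected outcomes move scores more
        ratings[winner] += delta
        ratings[loser] -= delta
        snapshot = by_day.setdefault(day, {})
        snapshot[winner] = ratings[winner]
        snapshot[loser] = ratings[loser]
    return by_day


# Made-up example data: A displaces B and C on day 1, B displaces A on day 2.
example = [
    ("2020-01-01", "A", "B"),
    ("2020-01-01", "A", "C"),
    ("2020-01-02", "B", "A"),
]
scores = daily_elo(example)
```

      In this sketch a score changes only on days when the individual interacted, which mirrors the description quoted above; the temporal order of interactions matters because each update depends on the scores left by the previous one.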

      In addition, all groups were studied long-term, and group composition appears to fluctuate based on Table 1 in Reference 11. When an individual enters or leaves a group with a stable hierarchy, it takes time before the hierarchy becomes stable again. If the avoidance/displacement behaviors used to infer rank relationships were not common, this would take a few days or maybe longer. Also, were aggressive behaviors more common during rank fluctuations? In other words, if avoidance/displacement behaviors and aggressive behaviors occur simultaneously during rank fluctuations, how did the authors deal with this and take it into consideration in the analysis?

      We have shown in Reference 25 (Smit & Robbins 2025), following Reference 11 (Smit & Robbins 2024), that females form highly stable hierarchies and that dyadic dominance relationships are not influenced by the dispersal or death of third individuals. Notably, new immigrant females usually start at and remain low ranking, without large fluctuations in rank. Therefore, any fluctuation periods have limited influence on the aggressive interactions in our study system.

      The authors emphasized several times in the text that gorillas "form highly stable hierarchical relationships". Also, in Reference 25, they found very high stabilities of each group's hierarchy. However, the number of females involved in that analysis was different from that used here. They need to provide more basic info on each group's dominance hierarchy and verify their statement. I strongly suggest that the authors display Elo-rating trajectories and necessary relevant statistics for each group throughout the study period as part of the supplementary materials.

      In fact, the females involved in the present analysis and the analysis of Smit & Robbins 2025 are the same. Our present analysis is based on the hierarchies of Smit & Robbins 2025. Note that female gorillas disperse and occasionally immigrate to another study group. This is why some females may appear in the hierarchies of more than one group, giving the impression that there are more females involved in the analysis of Smit & Robbins 2025 (e.g. by counting the lines in the Elo-rating plots). We now specifically state that “We present these interactions and hierarchies in detail in Smit & Robbins 2025”, to clarify that the hierarchies are the same.

      (3) The authors stated why they differentiated the different stages based on female reproductive status. They also referred to the differences in energetic needs between stages of pregnancy and lactation (lines 127-128). However, in the mixed model, they only compared the interaction score between the female cycling stage and other stages. The model was not well explained, and the results could be expanded. I suggest conducting more pairwise comparisons in the model and presenting the statistics in the text, if there are significant results. If all three pregnancy stages differed significantly from cycling and lactating stages but not from each other, they may be merged as one pregnancy stage. More in-depth analysis would help provide better answers to the research questions.

      Thank you for pointing this out. First, when we considered a single pregnancy stage, pregnant females indeed showed a significantly greater interaction score than females in other reproductive stages. We have now included that in the manuscript. However, we still find it relevant to test for the different stages of pregnancy, given the differences in energetic needs among these stages. We have now included the pairwise comparisons in a new table (Table 2).

      Reviewer #3 (Public review):

      Smit and Robbins' manuscript investigates the dynamics of aggression among female groupmates across five gorilla groups. The authors utilize longitudinal data to examine how reproductive state, group size, presence of males, and resource availability influence patterns of aggression and overall dominance rankings as measured by Elo scores. The findings underscore the important role of group composition and reproductive status, particularly pregnancy, in shaping dominance relationships in wild gorillas. While the study addresses a compelling and understudied topic, I have several comments and suggestions that may enhance clarity and improve the reader's experience.

      (1) Clarification of longitudinal data - The manuscript states that 25 years of behavioral data were used, but this number appears unclear. Based on my calculations, the maximum duration of behavioral observation for any one group appears to be 18 years. Specifically:

      • ATA: 6 years

      • BIT: 8 years

      • KYA: 18 years

      • MUK: 6 years

      • ORU: 8 years

      I recommend that the authors clarify how the 25-year duration was derived.

      Indeed, none of the five study “groups” has been studied for 25 years in a row. However, MUK emerged from a fission of group KYA in early 2016. So, from the start of group KYA in October 1998 to the end of group MUK in December 2023, there are 25 years and 2 months. We have now rephrased to “...starting in 1998 in one of the mountain gorilla groups” in the introduction, and to “We use a long-term behavioural dataset on five wild groups of the two gorilla species, starting in 1998” in the abstract.

      (2) Consideration of group size - The authors mention that group size was excluded from analyses to avoid conflating the opposing effects of female and male group members. While this is understandable, it may still be beneficial to explore group size effects in supplementary analyses. I suggest reporting statistics related to group size and potentially including a supplementary figure. Additionally, given that the study includes both mountain and western gorillas, it would be helpful to examine whether any interspecies differences are apparent.

      We have now added the suggested extra test: “When we ran our analysis testing for group size (number of weaned individuals in the group), instead of the numbers of females and males, its influence on interaction score was not significant (estimate=-0.001, p-value=0.682).”

      Regarding species differences: In our analysis, we test for species (mountain vs western) and we find no significant differences between the two. This is stated in the results.

      (3) Behavioral measures clarification - Lines 112-116 describe the types of aggressive behaviors observed. It would be helpful to clarify how these behaviors differ from those used to calculate Elo scores, or whether they overlap. A brief explanation would improve transparency regarding the methodology.

      We have now added short explanations in brackets for behaviours that are not obvious. We also added a sentence in the text to clarify the difference with the behaviours used to calculate Elo scores: “These two behaviours [avoidance and displacement] are ritualized, occurring in absence of aggression, they are considered a more reliable proxy of power relationships over aggression, and they are typically used to infer gorilla hierarchical relationships”.

      (4) Aggression rates versus Elo scores - The manuscript uses aggression rates rather than dominance rank (as measured by Elo scores) as the main outcome variable, but there is no explanation on why. How would the results differ if aggression rates were replaced or supplemented with Elo scores? The current justification for prioritizing aggression rates over dominance rank needs to be more clearly supported.

      The sentence we added above (“These two behaviours [avoidance and displacement] are ritualized, occurring in absence of aggression, they are considered a more reliable proxy of power relationships over aggression, and they are typically used to infer gorilla hierarchical relationships”) and the first paragraph of the results hopefully clarify that ritualized agonistic interactions are generally directionally consistent and more reliably capture the highly stable dominance relationships of female gorillas. This approach has been used to calculate dominance rank in gorillas in all studies that have considered it, dating back to the 1970s (namely in studies by Harcourt and Watts). On the other hand, aggression can be context dependent (we now clearly note that at the beginning of the Methods paragraph on aggressive interactions). Therefore, we use Elo-scores inferred from ritualized interactions as a base and a reliable proxy of power relationships; we then test whether the direction of aggression within these relationships is also driven by energetic needs or the social environment.

    1. eLife Assessment

      This important work by Malita et al. describes a mechanism by which an intestinal infection causes an increase in daytime sleep through signaling from the gut to the blood-brain barrier. Their findings suggest that cytokines upd3 and upd2 produced by the intestine following infection act on glia of the blood brain barrier to regulate sleep by modulating Allatostatin A signaling. The evidence is compelling and elegantly performed using the ample Drosophila genetic toolbox, making this work appealing for a broad group of neuroscience researchers interested in sleep and gut-brain interactions.

    2. Joint Public Review:

      Summary:

      Malita and colleagues investigated the mechanism by which infections increase sleep in Drosophila. Their work is important because it further supports the idea that the blood-brain barrier is involved in brain-body communication, and because it advances the field of sleep research. Using knock-down and knock-out of cytokines and cytokine receptors specifically in the endocrine cells of the gut (cytokines) as well as in the glia forming the blood-brain barrier (BBB) (cytokine receptors), the authors show that the cytokines upd2 and upd3, secreted by entero-endocrine cells in response to infections, increase sleep through the Dome receptor in the BBB. They also show that gut-derived Allatostatin (Alst) A promotes wakefulness by inhibiting the Alst A signaling that is mediated by Alst receptors expressed in BBB glia. Their results suggest there may be additional mechanisms that promote elevated sleep during gut inflammation. The evidence supporting most of their claims is compelling. Nevertheless, the activation of the sleep-promoting pathway by infection should be accomplished through bacterial infection of the gut.

      Strengths:

      The work is, in general, supported by well-designed and well-performed experiments, especially those that show that the endocrine cells from the gut are the sources of the Upd cytokines, the effects of these cytokines on daytime sleep, and that the glial cells of the BBB are the target cell for the Upds action. In addition, the evidence associating the downregulation of Alst receptors in the BBB by Upd and Jak/Stat pathways is compelling.

      Weaknesses:

      (1) The model of gut inflammation that is used is based on the increase in reactive oxygen species (ROS) that is caused by adding 1% H2O2 to the food. The use of the model is supported rather weakly by two papers (refs. 26 and 27). The paper by Jiang et al. (26) shows that the infection by Pseudomonas entomophila induces cytokine responses Upd2 and 3, which are also induced by the Jnk pathway; there is no mention of ROS. Buchon et al. (27) is a review that refers to results that indicate that as part of the immune response to pathogens in the gut, there is production of ROS by the NADPH oxidase DUOX. Thus, there is no strong support for the use of this model.

      (2) There is no support for the use of ROS in the food instead of a direct infection by pathogenic bacteria. It is known that ROS causes damage in the gut epithelium, which in turn induces the expression of the cytokines studied, which might be independent of infection and confound the results.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Joint Public Review:

      Summary:

      The authors sought to elucidate the mechanism by which infections increase sleep in Drosophila. Their work is important because it further supports the idea that the blood-brain barrier is involved in brain-body communication, and because it advances the field of sleep research. Using knock-down and knock-out of cytokines and cytokine receptors specifically in the endocrine cells of the gut (cytokines) as well as in the glia forming the blood-brain barrier (BBB) (cytokine receptors), the authors show that the cytokines upd2 and upd3, secreted by entero-endocrine cells in response to infections, increase sleep through the Dome receptor in the BBB. They also show that gut-derived Allatostatin (Alst) A promotes wakefulness by inhibiting Alst A signaling that is mediated by Alst receptors expressed in BBB glia. Their results suggest there may be additional mechanisms that promote elevated sleep during gut inflammation.

      The authors suggest that upd3 is more critical than upd2, a claim that is not sufficiently addressed or explained. In addition, the study uses the gut's response to reactive oxygen molecules as a proxy for infection, which is not sufficiently justified. Finally, further verification of some fundamental tools used in this paper would solidify these findings, making them more convincing.

      Strengths:

      (1) The work addresses an important topic and proposes an intriguing mechanism that involves several interconnected tissues. The authors place their research in the appropriate context and reference related work, such as literature about sickness-induced sleep, ROS, the effect of nutritional deprivation on sleep, sleep deprivation and sleep rebound, upregulated receptor expression as a compensatory mechanism in response to low levels of a ligand, and information about Alst A.

      (2) The work is, in general, supported by well-performed experiments that use a variety of different tools, including multiple RNAi lines, CRISPR, and mutants, to dissect both signal-sending and receiving sides of the signaling pathway.

      (3) The authors provide compelling evidence that shows that endocrine cells from the gut are the source of the upd cytokines that increase daytime sleep, that the glial cells of the BBB are the targets of these upds, and that upd action causes the downregulation of Alst receptors in the BBB via the Jak/Stat pathways.

      We are pleased that the reviewers recognized the strength and significance of our findings describing a gut-to-brain cytokine signaling mechanism involving the blood-brain barrier (BBB) and its role in regulating sleep, and we thank them for their comments.

      Weaknesses:

      (1) There is a limited characterization of cell types in the midgut which are classically associated with upd cytokine production.

      We thank the reviewer for raising this point. Although several midgut cell types (including the absorptive enterocytes) may indeed produce Unpaired (Upd) cytokines, our study specifically focused on enteroendocrine cells (EECs), which are well-characterized as secretory endocrine cells capable of exerting systemic effects. As detailed in our response to Results point #2 (please see below), we show that EEC-specific manipulation of Upd signaling is both necessary and sufficient to regulate sleep in response to intestinal oxidative stress. These findings support the role of EECs as a primary source of gut-derived cytokine signaling to the brain. To acknowledge the possible involvement of other sources, we have also added a statement to the Discussion in the revised manuscript noting that other, non-endocrine gut cell types may contribute to systemic Unpaired signaling that modulates sleep.

      (2) Some of the main tools used in this manuscript to manipulate the gut while not influencing the brain (e.g., Voilà and Voilà + R57C10-GAL80), are not directly shown to not affect gene expression in the brain. This is critical for a manuscript delving into intra-organ communication, as even limited expression in the brain may lead to wrong conclusions.

      We agree with the reviewer that this is an important point. To address it, we performed additional validation experiments to assess whether the voilà-GAL4 driver in combination with R57C10-GAL80 (EEC>) influences upd2 or upd3 expression in the brain. Our results show that manipulation using EEC> alters upd2 and upd3 expression in the gut (Fig. 1a,b), with new data showing that this does not affect their expression levels in neuronal tissues (Fig. S1a), supporting the specificity of our approach. These new data are now included in the revised manuscript and described in the Results section. This additional validation strengthens our conclusion that the observed sleep phenotypes result from gut-specific cytokine signaling, rather than from effects on Unpaired cytokines produced in the brain.

      (3) The model of gut inflammation used by the authors is based on the increase in reactive oxygen species (ROS) obtained by feeding flies food containing 1% H2O2. The use of this model is supported by the authors rather weakly in two papers (refs. 26 and 27): The paper by Jiang et al. (ref. 26) shows that the infection by Pseudomonas entomophila induces cytokine responses upd2 and 3, which are also induced by the Jnk pathway. In addition, no mention of ROS could be found in Buchon et al. (ref. 27); this is a review that refers to results showing that ROS are produced by the NADPH oxidase DUOX as part of the immune response to pathogens in the gut. Thus, there is no strong support for the use of this model.

      We thank the reviewer for raising this point. We agree that the references originally cited did not sufficiently justify the use of H<sub>2</sub>O<sub>2</sub> feeding as a model of gut inflammation. To address this, we have revised the Results section to clarify that we use H<sub>2</sub>O<sub>2</sub> feeding as a controlled method to elevate intestinal ROS levels, rather than as a general model of inflammation. This approach allows us to investigate the specific effects of ROS-induced cytokine signaling in the gut. We have also added additional citations to support the physiological relevance of this model. For instance, Tamamouna et al. (2021) demonstrated that H<sub>2</sub>O<sub>2</sub> feeding induces intestinal stem-cell proliferation – a response also observed during bacterial infection – and Jiang et al. (2009) showed that enteric infections increase upd2 and upd3 expression, which we similarly observe following H<sub>2</sub>O<sub>2</sub> feeding (Fig. 3a). These findings support the use of H<sub>2</sub>O<sub>2</sub> as a tool to mimic specific ROS-linked responses in the gut. We believe this targeted and tractable model is a strength of our study, enabling us to dissect how intestinal ROS modulates systemic physiology through cytokine signaling.

      Additionally, we have included a statement in the Discussion acknowledging that ROS generated during infection may activate signaling mechanisms distinct from those triggered by chemically induced oxidative stress, and that exploring these differences in future studies may yield important insights into gut–brain communication. These revisions provide a stronger justification for our model while more accurately conveying both its relevance and its limitations.

      (4) Likewise, there is no support for the use of ROS in the food instead of a direct infection by pathogenic bacteria. Furthermore, it is known that ROS damages the gut epithelium, which in turn induces the expression of the cytokines studied. Thus, the effects observed may not reflect the response to infection. In addition, Majcin Dorcikova et al. (2023; Circadian clock disruption promotes the degeneration of dopaminergic neurons in male Drosophila. Nat Commun. 14(1):5908. doi: 10.1038/s41467-023-41540-y) report that feeding adult flies with H2O2 results in neurodegeneration if associated with circadian clock defects. Thus, it would be important to discuss or present controls that show that the feeding of H2O2 does not cause neuronal damage.

      We thank the reviewer for this thoughtful follow-up point. We would like to clarify that we do not claim that the effects observed in our study directly reflect the full response to enteric infection. As outlined in our revised response to comment 3, we have updated the manuscript to more precisely describe the H<sub>2</sub>O<sub>2</sub>-feeding paradigm as a model that induces local intestinal ROS responses comparable to, but not equivalent to, those observed during pathogenic challenges. This revised framing highlights both the potential similarities and differences between chemically induced oxidative stress and infection-induced responses. Indeed, in the revised Discussion, we now explicitly acknowledge that ROS generated during infection may engage distinct signaling mechanisms compared to exogenous H<sub>2</sub>O<sub>2</sub> and emphasize the value of future studies in delineating these pathways. We are currently pursuing this direction in an independent ongoing study investigating the effects of enteric infections. However, for the present work, we chose to focus on the effects of ROS-induced responses in isolation, as this provides a clean and well-controlled context to dissect the specific contribution of oxidative stress to cytokine signaling and sleep regulation.

      To further address the reviewer’s concern, we have also included new data (a TUNEL stain for apoptotic DNA fragmentation) in the revised manuscript showing that H<sub>2</sub>O<sub>2</sub> feeding does not damage neuronal tissues under our experimental conditions (Fig. S3f,g). This addresses the point raised regarding the potential neurotoxicity of H<sub>2</sub>O<sub>2</sub>, as described by Majcin Dorcikova et al. (2023), and supports the specificity of the sleep phenotypes observed in our study. We believe these revisions and clarifications strengthen the manuscript and make our interpretation more precise.

      (3) >(5) The novelty of the work is difficult to evaluate because of the numerous publications on sleep in Drosophila. Thus, it would be very helpful to read from the authors how this work is different and novel from other closely related works such as: Li et al. (2023) Gut AstA mediates sleep deprivation-induced energy wasting in Drosophila. Cell Discov. 23;9(1):49. doi: 10.1038/s41421-023-00541-3.

      Our work highlights a distinct role for gut-derived AstA in sleep regulation compared to findings by Lin et al. (Cell Discovery, 2023)[1], who showed that gut AstA mediates energy wasting during sleep deprivation. Their study focused on the metabolic consequences of sleep loss, proposing that sleep deprivation increases ROS in the gut, which then promotes the release of the glucagon-like hormone adipokinetic hormone (AKH) through gut AstA signaling, thereby triggering energy expenditure.

      In contrast, our study addresses the inverse question – how ROS in the gut influences sleep. In our model, intestinal ROS promotes sleep, raising the intriguing possibility – cleverly pointed out by the reviewers – that ROS generated during sleep deprivation might promote sleep by inducing Unpaired cytokine signaling in the gut. According to our findings, this suppresses wake-promoting AstA signaling in the BBB, providing a mechanism to promote sleep as a restorative response to gut-derived oxidative stress and potentially limiting further ROS accumulation. Importantly, our findings support a wake-promoting role for EEC-derived AstA, demonstrated by several lines of evidence. First, EEC-specific knockdown of AstA increases sleep. Second, activation of AstA<sup>+</sup> EECs using the heat-sensitive cation channel Transient Receptor Potential A1 (TrpA1) reduces sleep, and this effect is abolished by simultaneous knockdown of AstA, indicating that the sleep-suppressing effect is mediated by AstA and not by other peptides or secreted factors released by these cells. Third, downregulation of AstA receptor expression in BBB glial cells increases sleep, further supporting the existence of a functional gut AstA–glia arousal pathway. We have now included new data in the revised manuscript showing that AstA release from EECs is downregulated during intestinal oxidative stress (Fig. 7k,l,m). This suggests that this wake-promoting signal is suppressed both at its source (the gut endocrine cells), by unknown means, and at its target, the BBB, via Unpaired cytokine signaling that downregulates AstA receptor expression. This coordinated downregulation may serve to efficiently silence this arousal-promoting pathway and facilitate sleep during intestinal stress. These new data, along with an expanded discussion, provide further mechanistic insight into gut-derived AstA signaling and strengthen our proposed model.

      This contrasts with the interpretation by Lin et al., who observed increased AstA peptide levels in EECs after antioxidant treatment and interpreted this as peptide retention. However, peptide accumulation may result from either increased production or decreased release, and peptide levels alone are insufficient to distinguish between these possibilities. To resolve this, we examined AstA transcript levels, which can serve as a proxy for production. Following oxidative stress (24 h of 1% H<sub>2</sub>O<sub>2</sub> feeding and the following day), when animals show increased sleep (Fig. 7e), we observed a decrease in AstA transcript levels followed by an increase in peptide levels (Fig. 7k,l,m), suggesting that oxidative stress leads to reduced gut AstA production and release. Furthermore, we recently found that a class of EECs that produce the hormone Tachykinin (Tk) and are distinct from the AstA<sup>+</sup> EECs express the ROS-sensitive cation channel TrpA1 (Ahrentløv et al., 2025, Nature Metabolism)[2]. In these Tk<sup>+</sup> EECs, TrpA1 mediates ROS-induced Tk hormone release. In contrast, single-cell RNA-seq data[3] do not support TrpA1 expression in AstA<sup>+</sup> EECs, consistent with our findings that ROS does not promote AstA release – an effect that would be expected if TrpA1 were functionally expressed in AstA<sup>+</sup> EECs. This contradicts the findings of Lin et al., who reported TrpA1 expression in AstA<sup>+</sup> EECs. We have now included relevant single-cell data in the revised manuscript (Fig. S6f) showing that TrpA1 is specifically expressed in Tk<sup>+</sup> EECs, but not in AstA<sup>+</sup> EECs, and we have expanded the discussion to address discrepancies in TrpA1 expression and AstA regulation.

      Taken together, our results reveal a dual-site regulatory mechanism in which Unpaired cytokines released from the gut act at the BBB to downregulate AstA receptor expression, while AstA release from EECs is simultaneously suppressed. We thank the reviewers for raising this important point. We have also included a discussion of the other point raised by the reviewers – the possibility that ROS generated during sleep deprivation may engage the same signaling pathways described here, providing a mechanistic link between sleep deprivation, intestinal stress, and sleep regulation.

      Recommendations for the authors:

      A- Material and Methods:

      (1) Feeding Assay: The cited publication (doi.org:10.1371/journal.pone.0006063) states: "For the amount of label in the fly to reflect feeding, measurements must therefore be confined to the time period before label egestion commences, about 40 minutes in Drosophila, a time period during which disturbance of the flies affects their feeding behavior. There is thus a requirement for a method of measuring feeding in undisturbed conditions." Was blue fecal matter already present on the tube when flies were homogenized at 1 hour? If so, the assay may reflect gut capacity rather than food passage (as a proxy for food intake). In addition, was the variability of food intake among flies in the same tube tested (to make sure that 1-2 flies are a good proxy for the whole population)?

      We agree that this is an important point for feeding experiments. We are aware of the methodological considerations highlighted in the cited study and have extensive experience using a range of feeding assays in Drosophila, including both short- and long-term consumption assays (e.g., dye-based and CAFE assays), as well as automated platforms such as FLIC and FlyPAD (Nature Communications, 2022; Nature Metabolism, 2022; and Nature Metabolism, 2025)[2,4,5].

      For the dye-based assay, we carefully selected a 1-hour feeding window based on prior optimization. Since animals were not starved prior to the assay, shorter time points (e.g., 30 minutes) typically result in insufficient ingestion for reliable quantification. A 1-hour period provides a robust readout while remaining within the timeframe before significant label excretion occurs under our experimental conditions. To support the robustness of our findings, we complemented the dye-based assay with data from FLIC, which enables automated, high-resolution monitoring of feeding behavior in undisturbed animals over extended periods. The FLIC results were consistent with the dye-based data, strengthening our confidence in the conclusions. To minimize variability and ensure consistency across experiments, all feeding assays were performed at the same circadian time – Zeitgeber Time 0 (ZT0), corresponding to 10:00 AM when lights are turned on in our incubators. This time point coincides with the animals' natural morning feeding peak, allowing for reproducible comparisons across conditions. Regarding variability among flies within tubes, each biological replicate in the dye assay consisted of 1–2 flies, and results were averaged across multiple replicates. We observed good consistency across samples, suggesting that these small groups reliably reflect group-level feeding behavior under our conditions.

      (2) Biological replicates: whereas the number of samples is clearly reported in each figure, the number of biological replicates is not indicated. Please include this information either in Material and methods or in the relevant figure legends. Please also include a description of what was considered a biological replicate.

      We have now clarified in the Materials and Methods section under Statistics that all replicates represent independent biological samples, as suggested by the reviewers.

      (3) Control Lines: please indicate which control lines were used instead of citing another publication. If preferred, this information could be supplied as a supplementary table.

      We now provide a clear description of the control lines used in the Materials and Methods section. Specifically, all GAL4 and GAL80 lines used in this study were backcrossed for several generations into a shared w<sup>1118</sup> background and then crossed to the same w<sup>1118</sup> strain used as the genetic background for the UAS-RNAi, CRISPR, or overexpression lines. This approach ensures, to a strong approximation, that the only difference between control and experimental animals is the presence or absence of the UAS transgene.

      (4) Statistical analyses: for some results (e.g., those shown in Figure 3d), it could be useful to test the interaction between genotype and treatment.

      We thank the reviewer for this helpful suggestion. In response, we have now performed two-way ANOVA analyses to assess genotype × treatment (diet) interaction effects for the relevant data, including those shown in Figure 3d as well as additional panels where animals were exposed to oxidative stress and sleep phenotypes were measured. We have added the corresponding interaction p-values in the updated figure legends for Figures 3d, 3k, 5a–c, 5f, 5h, 5i, 6c, 6e, and 7e. All of these tests revealed significant interaction effects, supporting the conclusion that the observed differences in sleep phenotypes are specifically dependent on the interaction between genetic manipulation (e.g., cytokine or receptor knockdown) and oxidative stress. These additions reinforce the interpretation that Unpaired cytokine signaling, glial JAK-STAT pathway activity, and AstA receptor regulation functionally interact with intestinal ROS exposure to modulate sleep. We thank the reviewer for suggesting this improvement.
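
      As a rough illustration of what a genotype × treatment interaction test measures, the sketch below computes the interaction F statistic for a balanced two-way design in plain Python. It is not the analysis pipeline used for the manuscript (those tests were run in Prism); the function name `two_way_interaction_f` and the sleep values are hypothetical and chosen only to show the arithmetic.

```python
from itertools import product


def two_way_interaction_f(data):
    """Interaction F statistic for a balanced, fully crossed two-way design.

    data maps (genotype, treatment) -> list of observations.
    Returns (F, df_interaction, df_error).
    """
    genotypes = sorted({g for g, _ in data})
    treatments = sorted({t for _, t in data})
    n = len(next(iter(data.values())))
    if len(data) != len(genotypes) * len(treatments) or any(
        len(v) != n for v in data.values()
    ):
        raise ValueError("balanced, fully crossed design required")

    def mean(xs):
        return sum(xs) / len(xs)

    # Cell, marginal, and grand means (marginals are means of cell means,
    # which is valid because every cell has the same n).
    cell = {k: mean(v) for k, v in data.items()}
    g_mean = {g: mean([cell[(g, t)] for t in treatments]) for g in genotypes}
    t_mean = {t: mean([cell[(g, t)] for g in genotypes]) for t in treatments}
    grand = mean(list(cell.values()))

    # Interaction: what remains of each cell mean after removing both main effects.
    ss_inter = n * sum(
        (cell[(g, t)] - g_mean[g] - t_mean[t] + grand) ** 2
        for g, t in product(genotypes, treatments)
    )
    # Error: within-cell scatter around each cell mean.
    ss_error = sum((x - cell[k]) ** 2 for k, v in data.items() for x in v)

    df_inter = (len(genotypes) - 1) * (len(treatments) - 1)
    df_error = len(genotypes) * len(treatments) * (n - 1)
    return (ss_inter / df_inter) / (ss_error / df_error), df_inter, df_error


# Hypothetical daily sleep totals (minutes); not data from the study.
example = {
    ("control", "standard"): [400, 420, 410, 430],
    ("control", "h2o2"): [520, 540, 530, 550],
    ("knockdown", "standard"): [500, 520, 510, 530],
    ("knockdown", "h2o2"): [480, 500, 490, 510],
}
f_stat, df1, df2 = two_way_interaction_f(example)
```

      A large interaction F relative to the F distribution with (df1, df2) degrees of freedom indicates that the treatment effect differs between genotypes, which is the pattern expected when a knockdown abolishes the response seen in controls.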

      (5) Reporting of p values. Some are reported as specific values whereas others are reported as less than a specific value. Please make this reporting consistent across different figures.

      All p-values reported in the manuscript are exact, except where the value falls below 0.0001. In those instances, we report p < 0.0001 because the Prism software package (GraphPad, version 10), which was used for all statistical analyses, does not report more precise values. We believe this reporting approach reflects standard practice in the field.
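
      The reporting rule described here can be written as a small helper. This is a minimal sketch of the convention only; the function name `format_p` and the four-decimal display are assumptions for illustration and do not reproduce Prism's own output formatting.

```python
def format_p(p, floor=1e-4):
    """Report an exact p-value, or an inequality when it falls below the floor."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p < floor:
        # Below the reporting floor, only the bound is stated.
        return f"p < {floor:.4f}"
    return f"p = {p:.4f}"


reported = [format_p(p) for p in (0.0312, 0.00003)]
```

      Applying one rule of this kind to every panel keeps exact values and inequalities from being mixed arbitrarily across figure legends.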

      (6) Please include the color code used in each figure, either in the figure itself or in the legend.

      We have now clarified the color coding in all relevant figures. In particular, we acknowledge that the meaning of the half-colored circles used to indicate H<sub>2</sub>O<sub>2</sub> treatment was not previously explained. These have now been clearly labeled in each figure to indicate treatment conditions.

      (7) The scheme describing the experimental conditions and the associated chart is confusing. Please improve.

      We have improved the schematic by replacing “ROS” with “H<sub>2</sub>O<sub>2</sub>” to more clearly indicate the experimental condition used. Additionally, we have added the corresponding circle annotations so that they now also appear consistently above the relevant charts. This revised layout enhances clarity and helps readers more easily interpret the experimental conditions. We believe these changes address the reviewer’s concern and make the figure significantly more intuitive.

      (8) Please indicate which line was used for upd-Gal4 and the evidence that it faithfully reflects upd3 expression.

      We have now clarified in the Materials and Methods section that the upd3-GAL4 line used in our study is Bloomington stock #98420, which drives GAL4 expression under the control of approximately 2 kb of sequence upstream of the upd3 start codon. This line has previously been used as a transcriptional reporter for upd3 activity. The only use of this line was to illustrate reporter expression in the EECs. To support this aspect of Upd3 expression, we now include new data in the revised manuscript using fluorescent in situ hybridization (FISH) against upd3, which confirms the presence of upd3 transcripts in prospero-positive EECs of the adult midgut (Fig. S1b). Additionally, we show that upd3 transcript levels are significantly reduced in dissected midguts following EEC-specific knockdown using multiple independent RNAi lines driven by voilà-GAL4, both alone and in combination with R57C10-GAL80, consistent with endogenous expression in these cells (Fig. 1a,b).

      To further address the reviewer’s concern and provide additional support for the endogenous expression of upd3 in EECs, we performed targeted knockdown experiments focusing on molecularly defined EEC subpopulations. The adult Drosophila midgut contains two major EEC subtypes characterized by their expression of Allatostatin C (AstC) or Tachykinin (Tk), which together encompass the vast majority of EECs. To selectively manipulate these populations, we used AstC-GAL4 and Tk-GAL4 drivers – both knock-in lines in which GAL4 is inserted at the respective endogenous hormone loci. This design enables precise GAL4 expression in AstC- or Tk-expressing EECs based on their native transcriptional profile. To eliminate confounding neuronal expression, we combined these drivers with R57C10-GAL80, restricting GAL4 activity to the gut and generating AstC<sup>Gut</sup>> and Tk<sup>Gut</sup>> drivers. Using these tools, we knocked down upd2 and upd3 selectively in the AstC- or Tk-positive EECs. Knockdown of either cytokine in AstC-positive EECs significantly increased sleep under homeostatic conditions, recapitulating the phenotype observed with knockdown in all EECs (Fig. 1m-o). In contrast, knockdown of upd2 or upd3 in Tk-positive EECs had no effect on sleep (Fig. 1p-r). Furthermore, we show in the revised manuscript that selective knockdown of upd2 or upd3 in AstC-positive EECs abolishes the H<sub>2</sub>O<sub>2</sub>-induced increase in sleep (Fig. 3f–h). These findings demonstrate that Unpaired cytokine signaling from AstC-positive EECs is essential for mediating the sleep response to intestinal oxidative stress, highlighting this specific EEC subtype as a key source of cytokine-driven regulation in this context. These new results indicate that AstC-positive EECs are a primary source of the Unpaired cytokines that regulate sleep, while Tk-positive EECs do not appear to contribute to this function. Importantly, upd3 transcript levels were significantly reduced in dissected midguts following AstC<sup>Gut</sup>-driven knockdown (Fig. S1r), further confirming that upd3 is endogenously expressed in AstC-positive EECs. Thus, through several means, we have bolstered our confidence that upd3 is indeed expressed in EECs, as illustrated by the reporter line.

      (9) Please indicate which GFP line was used with upd-Gal4 (CD8, NLS, un-tagged, etc). The Material and Methods section states that it was "UAS-mCD8::GFP (#5137);", however, the stain does not seem to match a cell membrane pattern but rather a nuclear or cytoplasmic pattern. This information would help the interpretation of Figure 1C.

      We confirm that the GFP reporter line used with upd3-GAL4 was obtained from Bloomington stock #98420. As noted by the Bloomington Drosophila Stock Center, “the identity of the UAS-GFP transgene is a guess,” and the subcellular localization of the GFP fusion is therefore uncertain. We agree with the reviewer that the signal observed in Figure 1c does not display clear membrane localization and instead appears diffuse, consistent with cytoplasmic or partially nuclear localization. In any case, what we find most salient is the reporter’s labeling of Prospero-positive EECs in the adult midgut, consistent with upd3 expression in these cells. This conclusion is further supported by multiple lines of evidence presented in the revised manuscript, as mentioned above in response to question #8: (1) fluorescent in situ hybridization (FISH) for upd3 confirms expression in EECs (Fig. S1b), (2) EEC-specific RNAi knockdown of upd3 reduces transcript levels in dissected midguts, and (3) publicly available single-cell RNA sequencing datasets[3] also indicate that upd3 is expressed at low levels in a subset of adult midgut EECs under normal conditions. We have also clarified in the revised Materials and Methods section that GFP localization is undefined in the upd3-GAL4 line, to guide interpretation of the reporter signal.

      B- Results

      (1) Figure 1: According to previous work (10.1016/j.celrep.2015.06.009, http://flygutseq.buchonlab.com/data?gene=upd3%0D%0A), in basal conditions upd3 is expressed as following: ISC (35 RPKM), EB (98 RPKM), EC (57 RPKM), and EEC (8 RPKM). Accordingly, even complete KO in EECs should eliminate only a small fraction of upd3 from whole guts, even less considering the greater abundance of other cell types such as ECs compared to EECs. It would be useful to understand where this discrepancy comes from, in case it is affecting the conclusion of the manuscript. While this point per se does not affect the main conclusions of the manuscript, it makes the interpretation of the results more difficult.

      We acknowledge the previously reported low expression of upd3 in EECs. However, the FlyGut-seq site appears to be no longer available, so we could not directly compare other related genes. Nonetheless, our data – based on in situ hybridization, reporter expression, and multiple RNAi knockdowns – consistently support upd3 expression in EECs. These complementary approaches strengthen the conclusion that EECs are an important source of systemic upd3 under the conditions tested.

      (2) Figure 1: The upd2-3 mutants show sleep defects very similar to those of EEC>RNAi and >Cas9. It would thus be helpful to try to KO upd3 with other midgut drivers (An EC driver like Myo1A or 5966GS and a progenitor driver like Esg or 5961GS) to validate these results. Such experiments might identify precisely which cells are involved in the gut-brain signaling reported here.

      We appreciate the reviewer’s suggestion and agree that exploring other potential sources of Upd3 in the gut is an interesting direction. In this study, we have focused on EECs, which are the primary hormone-secreting cells in the intestine and thus the most likely candidates for mediating systemic effects such as gut-to-brain signaling. While it is possible that other gut cell types – such as enterocytes (e.g., Myo1A<sup>+</sup>) or intestinal progenitors (e.g., Esg<sup>+</sup>) – also contribute to Upd3 production, these cells are not typically endocrine in nature. Demonstrating their involvement in gut-to-brain communication would therefore require additional, extensive validation beyond the scope of the current study. Importantly, our data show that manipulating Upd3 specifically in EECs is both necessary and sufficient to modulate sleep in response to intestinal ROS, strongly supporting the conclusion that EEC-derived cytokine signaling underlies the observed phenotype. In contrast, manipulating cytokines in other gut cells could produce indirect effects – such as altered proliferation, epithelial integrity, or immune responses – that complicate the interpretation of behavioral outcomes like sleep. For these reasons, we chose to focus on EECs as the source of endocrine signals mediating gut-to-brain communication. However, to address this point raised by the reviewer, we have now included a statement in the Discussion acknowledging that other non-endocrine gut cell types may also contribute to the systemic Unpaired signaling that modulates sleep in response to intestinal oxidative stress.

      (3) Figure 3: "This effect mirrored the upregulation observed with EEC-specific overexpression of upd3, indicating that it reflects physiologically relevant production of upd3 by the gut in response to oxidative stress." Please add (Figure 3a) at the end of this sentence.

      We have now added “(Figure 3a)” at the end of the sentence to clearly reference the relevant data.

      (4) For Figure 3b, do you have data showing that the increased amount of sleep was due to the addition of H2O2 per se, rather than the procedure of adding it?

      We have added new data to address this point. To ensure that the observed sleep increase was specifically due to the presence of H<sub>2</sub>O<sub>2</sub> and not an effect of the food replacement procedure, we performed a control experiment in which animals were fed standard food prepared using the same protocol and replaced daily, but without H<sub>2</sub>O<sub>2</sub>. These animals did not exhibit increased sleep, confirming that the sleep effect is attributable to intestinal ROS rather than the supplementation procedure itself (Fig. S3a). Thanks for the suggestion.

      (5) In the text it is stated that "Since 1% H2O2 feeding induced robust responses both in upd3 expression and in sleep behavior, we asked whether gut-derived Unpaired signaling might be essential for the observed ROS-induced sleep modulation. Indeed, EEC-specific RNAi targeting upd2 or upd3 abolished the sleep response to 1% H2O2 feeding." While it is indeed true that there is no additional increase in sleep time due to EEC>upd3 RNAi, it is also true that EEC>upd3 RNAi flies, without any treatment, have already increased their sleep in the first place. It is then possible that rather than unpaired signaling being essential, an upper threshold for maximum sleep allowed by manipulation of these processes was reached. It would be useful to discuss this point.

      Several findings argue against a ceiling effect and instead support a requirement for Unpaired signaling in mediating ROS-induced sleep. Animals with EEC-specific upd2 or upd3 knockdown or null mutation not only fail to increase sleep following H<sub>2</sub>O<sub>2</sub> treatment but actually exhibit reduced sleep during oxidative stress (Fig. 3e, k, l; Fig. 5e, f), suggesting that Unpaired signaling is required to sustain sleep under these conditions. Similarly, animals with glial dome knockdown also show reduced sleep under oxidative stress, closely mirroring the phenotype of EEC-specific upd3 RNAi animals (Fig. 5a–c, g–i). These results support the conclusion that gut-to-glia Unpaired cytokine signaling is necessary for maintaining elevated sleep during oxidative stress. In the absence of this signaling, animals exhibit increased wakefulness. We identify AstA as one such wake-promoting signal that is suppressed during intestinal stress. In the revised manuscript, we present new data showing that this pathway is downregulated not only via Unpaired-JAK/STAT signaling in glial cells but also through reduced AstA release from the gut. This model, in which Unpaired cytokines promote sleep during intestinal stress by suppressing arousal pathways, is discussed throughout the manuscript to address the reviewer’s point.

      (6) In Figure 3k, the dots highlighting the experiment show an empty profile, a full one, and a half one. Please define what the half dots represent.

      We have now clarified the color coding in all relevant figures. Specifically, we acknowledge that the meaning of the half-colored circles indicating H<sub>2</sub>O<sub>2</sub> treatment was not previously defined – it indicates washout or recovery time. In the revised version, these symbols are now clearly labeled in each figure to indicate the treatment condition, ensuring consistent and intuitive interpretation across all panels.

      (7) The authors used appropriate GAL4 and RNAi lines to knock down dome, a upd2/3 JAK-STAT-linked receptor, specifically in neurons and glia, respectively, in order to identify the CNS targets of upd2/3 cytokines produced by enteroendocrine cells (EECs). Pan-neuronal dome knockdown did not alter daytime sleep in adult females, yet pan-glial dome knockdown phenocopied effects of upd2/3 knockdown in EECs. They also observed that EEC-specific knockdown of upd2 and upd3 led to a decrease in JAK-STAT reporter activity in repo-positive glial cells. This supports the authors' conclusion that glial cells, not neurons, are the targets by which unpaired cytokines regulate sleep via JAK-STAT signaling. However, they do not show nighttime sleep data of pan-neuronal and pan-glial dome knockdowns. It would strengthen their conclusion if the nighttime sleep of pan-glial dome knockdown phenocopied the upd2/3 knockdowns as well, provided the pan-neuronal dome knockdown did not alter nighttime sleep.

      We have now added nighttime sleep data for both pan-glial and pan-neuronal domeless knockdowns in the revised manuscript (Fig. 2a). Glial knockdown increased nighttime sleep, similar to EEC-specific upd2/3 knockdown, while neuronal knockdown had no effect. These results further support glial cells as the relevant target of gut-derived Unpaired signaling.

      (8) The authors only used one method to induce oxidative stress (hydrogen peroxide feeding). It would strengthen their argument to test multiple methods of inducing oxidative stress, such as lipopolysaccharide (LPS) feeding. In addition, it would be useful to use a direct bacterial infection to confirm that in flies, the infection promotes sleep. Additionally, flies deficient in Dome in the BBB and infected should not be affected in their sleep by the infection. These experiments would provide direct support for the mechanism proposed. Finally, the authors should add a primary reference for using ROS as a model of bacterial infection and justify their choice better.

      We agree that directly comparing different models of intestinal stress, such as bacterial infection or LPS feeding, would provide valuable insight into how gut-derived signals influence sleep in response to infection. As noted in our detailed responses above, we now include an expanded rationale for our use of H<sub>2</sub>O<sub>2</sub> feeding as a controlled and well-established method for inducing intestinal ROS – one of the key physiological responses to enteric infection and inflammation. In the revised Discussion, we explicitly acknowledge that pathogenic infections – which trigger both intestinal ROS and additional immune pathways – may engage distinct or complementary mechanisms compared to chemically induced oxidative stress. We emphasize the importance of future studies aimed at dissecting these differences. In fact, we are actively pursuing this direction in ongoing work examining sleep responses to enteric infection. For the purposes of the present study, however, we chose to focus on a tractable and specific model of ROS-induced stress to define the contribution of Unpaired cytokine signaling to gut-brain communication and sleep regulation. This approach allowed us to isolate the effect of oxidative stress from other confounding immune stimuli and identify a glia-mediated signaling mechanism linking gut epithelial stress to changes in sleep behavior.

      (9) To confirm that animals lacking EEC Unpaired signaling are not more susceptible to ROS-induced damage, the authors assessed the survival of upd2 and upd3 knockdowns on 1% H2O2 and concluded they display no additional sensitivity to oxidative stress compared to controls. It may be useful to include other tests of sensitivity to oxidative stress, in addition to survival.

      We appreciate the reviewer’s suggestion. In our view, survival is a highly informative and stringent readout, as it reflects the overall physiological capacity of the animal to withstand oxidative stress. Importantly, our data show that animals lacking EEC-derived Unpaired signaling do not exhibit reduced survival following H<sub>2</sub>O<sub>2</sub> exposure, indicating that their oxidative stress resistance is not compromised. Furthermore, we previously confirmed that feeding behavior is unaffected in these animals, suggesting that their ability to ingest food (and thus the stressor) is not impaired. As a molecular complement to these assays in response to this point and others, we have also performed an assessment of neuronal apoptosis (a TUNEL assay, Fig. S3f,g). This assay did not identify an increase in cell death in the brains of animals fed peroxide-containing medium. Thus, gross neurological health, behavior, and overall survival appear to be resilient to the environmental treatment regime we apply here, suggesting that the outcomes we observe arise from signaling per se.

      (10) The authors confirmed that animals lacking EEC-derived upd3 displayed sleep suppression similar to controls in response to starvation. These results led the authors to conclude that there is a specific requirement for EEC-derived Unpaired signaling in responding to intestinal oxidative stress. However, they previously showed that EEC-specific knockdown of upd3 and upd2 led to increased daytime sleep under normal feeding conditions. Their interpretations of their data are inconsistent.

      We appreciate the reviewer’s comment. While animals lacking EEC-derived Unpaired signaling show increased baseline sleep under normal feeding conditions, they still exhibit a robust reduction in sleep when subjected to starvation – comparable to that of control animals (Fig. S3h–j). This demonstrates that they retain the capacity to appropriately modulate sleep in response to metabolic stress. Thus, the sleep-promoting phenotype under normal conditions does not reflect a generalized inability to adjust sleep behavior. Rather, it highlights a specific role for Unpaired signaling in mediating sleep responses to intestinal oxidative stress, not in broadly regulating all sleep-modulating stimuli.

      (11) The authors report a significant increase in JAK-STAT activity in surface glial cells at ZT0 in animals fed 1% H2O2-containing food for 20 hours. This response was abolished in animals with EEC-specific knockdown of upd2 or upd3. The authors confirmed there were no unintended neuronal effects on upd2 or upd3 expression in the heads. They also observed an upregulation of dome transcript levels in the heads of animals with EEC-specific knockdown of upd3 fed 1% H2O2-containing food for 15 hours, which they interpret to be a compensatory mechanism in response to low levels of the ligand. This assay is inconsistent with previous experiments in which animals were fed hydrogen peroxide for 20 hours.

      We thank the reviewer for identifying this discrepancy. The inconsistency arose from a labeling error in the manuscript. Both the JAK-STAT reporter assays in glial cells and the dome expression measurements were performed following 15 hours of H<sub>2</sub>O<sub>2</sub> feeding, not 20 hours as previously stated. We have now corrected this in the revised manuscript.

      (12) The authors show that animals with glia-specific dome knockdown did not have decreased survival on H2O2-containing food, and displayed normal rebound sleep in the morning following sleep deprivation. These results potentially undermine the significance of the paper. If the normal sleep response to oxidative stress is an important protective mechanism, why would oxidative stress not decrease survival in dome knockdown flies (that don't have the normal sleep response to oxidative stress)? This suggests that the proposed mechanism is not important for survival. The authors conclude that Dome-mediated JAK-STAT signaling in the glial cells specifically regulates ROS-induced sleep responses, which their results support.

      We agree that our survival data show that glial dome knockdown does not reduce survival under continuous oxidative stress. However, we believe this does not undermine the importance of the sleep response as an adaptive mechanism. In our survival assay, animals were continuously exposed to 1% H<sub>2</sub>O<sub>2</sub> without the opportunity to recover. In contrast, under natural conditions, oxidative stress is likely to be intermittent, and the ability to mount a sleep response may be particularly important for promoting recovery and maintaining homeostasis during or after transient stress episodes. Thus, while the JAK-STAT-mediated sleep response may not directly enhance survival under constant oxidative challenge, it likely plays a critical role in adaptive recovery under natural conditions.

      (13) Altogether, the authors conclude that enteric oxidative stress induces the release of Unpaired cytokines which activate the JAK-STAT pathway in subperineurial glia of the BBB, which leads to the glial downregulation of receptors for AstA, which is a wake-promoting factor also released by EECs. This mechanism is supported by their results, however, this research raises some intriguing questions, such as the role of upd2 versus upd3, the role of AstA-R1 versus AstA-R2, the importance of this mechanism in terms of survival, the sex-specific nature of this mechanism, and the role that nutritional availability plays in the dual functionality of Unpaired cytokine signaling in regards to sleep.

      We thank the reviewer for highlighting these important questions. Our data suggest that Upd2 and Upd3, while often considered partially redundant, both contribute to sleep regulation, with stronger effects observed for Upd3. This is consistent with prior studies indicating overlapping but non-identical roles for these cytokines. Similarly, although AstA-R1 and AstA-R2 can both be activated by AstA, knockdown of AstA-R2 consistently produces more robust sleep phenotypes, suggesting a predominant role in mediating this effect. The possibility of sex-specific regulation is indeed compelling. While our study focused on females, many gut hormones show sex-dependent activity, and we recognize this as an important avenue for future research. Finally, we have included new data in the revised manuscript showing that gut-derived AstA is downregulated under oxidative stress, further supporting our model in which Unpaired signaling suppresses arousal pathways during intestinal stress.

      (14) Data Availability: It is indicated that: "Reasonable data requests will be fulfilled by the lead author". However, eLife's guidelines for data sharing require that all data associated with an article be made freely and widely available.

      We thank the reviewer for pointing this out. We have revised the Data Availability section of the manuscript to clarify that all data will be made freely available from the lead contact without restriction, in accordance with eLife’s open data policy.

      References

      (1) Li, Y., Zhou, X., Cheng, C., Ding, G., Zhao, P., Tan, K., Chen, L., Perrimon, N., Veenstra, J.A., Zhang, L., and Song, W. (2023). Gut AstA mediates sleep deprivation-induced energy wasting in Drosophila. Cell Discov 9, 49. 10.1038/s41421-023-00541-3.

      (2) Ahrentlov, N., Kubrak, O., Lassen, M., Malita, A., Koyama, T., Frederiksen, A.S., Sigvardsen, C.M., John, A., Madsen, P., Halberg, K.A., et al. (2025). Protein-responsive gut hormone Tachykinin directs food choice and impacts lifespan. Nature Metabolism. 10.1038/s42255-025-01267-0.

      (3) Li, H., Janssens, J., De Waegeneer, M., Kolluru, S.S., Davie, K., Gardeux, V., Saelens, W., David, F.P.A., Brbic, M., Spanier, K., et al. (2022). Fly Cell Atlas: A single-nucleus transcriptomic atlas of the adult fruit fly. Science 375, eabk2432. 10.1126/science.abk2432.

      (4) Kubrak, O., Koyama, T., Ahrentlov, N., Jensen, L., Malita, A., Naseem, M.T., Lassen, M., Nagy, S., Texada, M.J., Halberg, K.V., and Rewitz, K. (2022). The gut hormone Allatostatin C/Somatostatin regulates food intake and metabolic homeostasis under nutrient stress. Nature Communications 13, 692. 10.1038/s41467-022-28268-x.

      (5) Malita, A., Kubrak, O., Koyama, T., Ahrentlov, N., Texada, M.J., Nagy, S., Halberg, K.V., and Rewitz, K. (2022). A gut-derived hormone suppresses sugar appetite and regulates food choice in Drosophila. Nature Metabolism 4, 1532-1550. 10.1038/s42255-022-00672-z.

    1. eLife Assessment

      This important study addresses how wing morphology and kinematics change across hoverflies of different body sizes. The authors provide convincing evidence that there is no significant correlation between body size and wing kinematics across 28 species and instead argue that non-trivial changes in wing size and shape evolved to support flight across the size range. Overall, this paper illustrates the power and beauty of an integrative approach to animal biomechanics and will be of broad interest to biologists, physicists and engineers.

    2. Reviewer #1 (Public review):

      The paper is well written and the figures well laid out. The methods are easy to follow, and the rationale and logic for each experiment are easy to follow. The introduction sets the scene well, and the discussion is appropriate. The summary sentences throughout the text help the reader.

      The authors have done a lot of work addressing my previous concerns and those of the other Reviewers.

    3. Reviewer #2 (Public review):

      Summary

      Le Roy et al. quantify wing morphology and wing kinematics across twenty-eight and eight hoverfly species, respectively; the aim is to identify how weight support during hovering is ensured across body sizes. Wing shape and relative wing size vary non-trivially with body mass, but wing kinematics are reported to be size-invariant. On the basis of these results, it is concluded that weight support is achieved solely through size-specific variations in wing morphology, and that these changes enabled hoverflies to decrease in size. Adjusting wing morphology may be preferable compared to the alternative strategy of altering wing kinematics, because kinematics may be subject to stronger evolutionary and ecological constraints, dictated by the highly specialised flight and ecology of the hoverflies.

      Strengths

      The study deploys a vast array of challenging techniques, including flight experiments, morphometrics, phylogenetic analyses, and numerical simulations; it so illustrates both the power and beauty of an integrative approach to animal biomechanics. The question is well motivated, the methods appropriately designed, and the discussion elegantly places the results in broad biomechanical, ecological, and evolutionary context.

      Weaknesses

      (1) In assessing evolutionary allometry, it is key to pinpoint the variation expected from changes in size alone. The null hypothesis for wing morphology is well-defined (isometry), but the equivalent predictions for kinematic parameters, although specified, are insufficiently justified, and directly contradict classic scaling theory. A detailed justification of the "kinematic similarity" assumption, or a change in the null hypothesis, would substantially strengthen the paper, and clarify its evolutionary implications.

      (2) By relating the aerodynamic output force to wing morphology and kinematics, it is concluded that smaller hoverflies will find it more challenging to support their body mass--a scaling argument that provides the framework for this work. This hypothesis appears to stand in direct contrast to classic scaling theory, where the gravitational force is thought to present a bigger challenge for larger animals, due to their disadvantageous surface-to-volume ratios. The same problem ought to occur in hoverflies, for wing kinematics must ultimately be the result of the energy injected by the flight engine: muscle. Much like in terrestrial animals, equivalent weight support in flying animals thus requires a positive allometry of muscle force output. In other words, if a large hoverfly is able to generate the wing kinematics that suffice to support body weight, an isometrically smaller hoverfly should be, too (but not vice versa). Clarifying the relation between the scaling of muscle mechanical input, wing kinematics, and weight support would help resolve the conflict between these two contrasting hypotheses, and considerably strengthen the biomechanical motivation and evolutionary interpretation.

      (3) One main conclusion--that miniaturization is enabled by changes in wing morphology--is insufficiently supported by the evidence. Is it miniaturization or "gigantism" that is enabled by (or drives) the non-trivial changes in wing morphology? To clarify this question, the isolated treatment of constraints on the musculoskeletal system vs the "flapping-wing based propulsion" system needs to be replaced by an integrated analysis: the propulsion of the wings is, after all, due to muscle action. Revisiting the scaling predictions by assessing what the engine (muscle) can impart onto the system (wings) will clarify whether non-trivial adaptations in wing shape or kinematics are necessary for smaller or larger hovering insects (if at all!).
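
      The scaling argument at issue in points (1)–(3) can be written out as a short sketch. It assumes a quasi-steady force model in which the stroke-averaged aerodynamic force scales with the square of the wingbeat angular speed and with the second moment of wing area; the symbols F, W, ω, S_2, L and m are introduced here for illustration only and are not taken from the manuscript.

      ```latex
      % Assumed quasi-steady model (illustrative, not the authors' derivation):
      %   F   : stroke-averaged aerodynamic force
      %   W   : body weight, m g
      %   S_2 : second moment of wing area, L : body length scale, m : body mass
      \begin{align}
        F &\propto \rho\,\omega^{2} S_{2}, &
        S_{2} &\propto L^{4} \propto m^{4/3}, &
        W &= m g \propto L^{3}, \\
        \frac{F}{W} &\propto \omega^{2}\, m^{1/3}.
      \end{align}
      % Isometry with size-invariant kinematics (constant \omega):
      %   F/W \propto m^{1/3}, so relative weight support falls as mass decreases.
      % Isometry with full weight support (F/W constant):
      %   \omega \propto m^{-1/6}.
      ```

      Under these assumptions, an isometrically smaller animal with unchanged kinematics generates less force relative to its weight. Whether the flight muscle can deliver the kinematics needed at each size is the separate question raised in the points above, and this sketch does not address it.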

      In many ways, this work provides a blueprint for work in evolutionary biomechanics; the breadth of both the methods and the discussion reflects outstanding scholarship.

    4. Reviewer #3 (Public review):

      This paper addresses an important question about how changes in wing morphology vs. wing kinematics change with body size across an important group of high-performance insects, the hoverflies. The biomechanics and morphology convincingly support the conclusions that there is no significant correlation between wing kinematics and size across the eight specific species analyzed in depth and that instead wing morphology changes allometrically. The morphological analysis is enhanced with phylogenetically appropriate tests across a larger data set incorporating museum specimens.

      The authors have made very extensive revisions that have significantly improved the manuscript and brought the strength of conclusions in line with the excellent data. Most significantly, they have expanded their morphological analysis to include museum specimens and removed the conclusions about evolutionary drivers of miniaturization. As a result, the conclusion about morphological changes scaling with body size rather than kinematic properties is strongly supported and very nicely presented with a strong complementary set of data. I only have minor textual edits for them to consider.

    1. eLife Assessment

      This is an overall valuable set of findings on the role of centrally produced estrogens in the control of behaviors in male and female medaka. The significance of the findings rests on the revealed potential mechanism by which brain-derived estrogens modulate social behaviors in males as well as females. The results are supported by the analysis of multiple transgenic lines, although the evidence is incomplete, and further validation would be necessary to fully validate the conclusions on the role of brain-derived estrogens. Nonetheless, the findings have led to helpful hypotheses on the hormonal control of behaviors in teleosts that can be tested further.

    2. Reviewer #1 (Public review):

      Summary:

      This research group has consistently performed cutting-edge research aiming to understand the role of hormones in the control of social behaviors, specifically by utilizing the genetically-tractable teleost fish, medaka, and the current work is no exception. The overall claim they make, that estrogens modulate social behaviors in males and females is supported, with important caveats. For one, there is no evidence these estrogens are generated by "neurons" as would be assumed by their main claim that it is NEUROestrogens that drive this effect. While indeed the aromatase they have investigated is expressed solely in the brain, in most teleosts, brain aromatase is only present in glial cells (astrocytes, radial glia). The authors should change this description so as not to mislead the reader. Below I detail more specific strengths and weaknesses of this manuscript.

      Strengths:

      • Excellent use of the medaka model to disentangle the control of social behavior by sex steroid hormones

      • The findings are strong for the most part because deficits in the mutants are restored by the molecule (estrogens) that was no longer present due to the mutation

      • Presentation of the approach and findings are clear, allowing the reader to make their own inferences and compare them with the authors'

      • Includes multiple follow-up experiments, which leads to tests of internal replication and an impactful mechanistic proposal

      • Findings are provocative not just for teleost researchers, but for other species since, as the authors point out, the data suggest mechanisms of estrogenic control of social behaviors may be evolutionary ancient

      Weaknesses:

      • As stated in the summary, the authors are attributing the estrogen source to neurons and there isn't evidence this is the case. The impact of the findings doesn't rest on this either

      • The d4 versus d8 esr2a mutants showed different results for aggression. The meaning and implications of this finding are not discussed, leaving the reader wondering

      • Lack of attribution of previous published work from other research groups that would provide the proper context of the present study

      • There are a surprising number of citations not included; some of the ones not included argue against the authors' claims that their findings were "contrary to expectation"

      • The experimental design for studying aggression in males has flaws. A standard test like a resident-intruder test should be used.

      • While they investigate males and females, there are fewer experiments and explanations for the female results, making it feel like a small addition or an aside

      • The statistics comparing "experimental to experimental" and "control to experimental" aren't appropriate

    3. Reviewer #3 (Public review):

      Summary:

      Taking advantage of the existence in fish of two genes coding for estrogen synthase, the enzyme aromatase, one mostly expressed in the brain (Cyp19a1b) and the other mostly found in the gonads (Cyp19a1a), this study investigates the role of brain-derived estrogens in the control of sexual and aggressive behavior in medaka. The constitutive deletion of Cyp19a1b markedly reduced brain estrogen content in males and to a lesser extent in females. These effects are accompanied by reduced sexual and aggressive behavior in males and reduced preference for males in females. These effects are reversed by adult treatment with estradiol, supporting a role for estrogens. The deletion of Cyp19a1b is associated with a reduced expression of the genes coding for the two androgen receptors, ara and arb, in brain regions involved in the regulation of social behavior. The analysis of the gene expression and behavior of mutants of estrogen receptors indicates that these effects are likely mediated by the activation of the esr1 and esr2a isoforms. These results provide valuable insight into the role of estrogens in social behavior in the most abundant vertebrate taxon; however, the conclusion of brain-derived estrogens awaits definitive confirmation.

      Strengths:

      • Evaluation of the role of brain "specific" Cyp19a1 in male teleost fish, which as a taxon are more abundant and yet proportionally less studied than the most common birds and rodents. Therefore, evaluating the generalizability of results from higher vertebrates is important. This approach also offers great potential to study the role of brain estrogen production in females, an understudied question in all taxa.

      • Results obtained from multiple mutant lines converge to show that estrogen signaling, likely synthesized in the brain, drives aspects of male sexual behavior.

      • The comparative discussion of the age-dependent abundance of brain aromatase in fish vs mammals and its role in organization vs activation is important beyond the study of the targeted species.

      • The authors have made important corrections to tone down some of the conclusions which are more in line with the results.

      Weaknesses:

      • No evaluation of the mRNA and protein products of Cyp19a1b and ESR2a is presented, such that there is no proper demonstration that the mutation indeed leads to aromatase reduction. The conclusion that these effects depend on brain-derived estrogens is therefore only supported by measures of E2 with an EIA kit that is not validated. No discussion of these shortcomings is provided in the discussion, thus further weakening the conclusions of the manuscript.

      • Most experiments are weakly powered (low sample size).

      • The variability of the mRNA content for a same target gene between experiments (genotype comparison vs E2 treatment comparison) raises questions about the reproducibility of the data (apparent disappearance of genotype effect).

      Conclusions:

      Overall, the claim regarding the role of estrogens originating in the brain in male sexual behavior is supported by converging evidence from multiple mutant lines. The evidence for a role of brain-derived estrogens in gene expression in the brain is weaker, as are the results in females.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This research group has consistently performed cutting-edge research aiming to understand the role of hormones in the control of social behaviors, specifically by utilizing the genetically tractable teleost fish, medaka, and the current work is no exception. The overall claim they make, that estrogens modulate social behaviors in males and females is supported, with important caveats. For one, there is no evidence these estrogens are generated by "neurons" as would be assumed by their main claim that it is NEUROestrogens that drive this effect. While indeed the aromatase they have investigated is expressed solely in the brain, in most teleosts, brain aromatase is only present in glial cells (astrocytes, radial glia). The authors should change this description so as not to mislead the reader. Below I detail more specific strengths and weaknesses of this manuscript.

      We thank the reviewer for this very positive evaluation of our work and greatly appreciate their helpful comments and suggestions for improving the manuscript. We agree with the comment that the term “neuroestrogens” is misleading. Therefore, we have replaced “neuroestrogens” with “brain-derived estrogens” or “brain estrogens” throughout the manuscript, including the title.

      In the following sections, “neuroestrogens” has been revised to align with the surrounding context.

      Line 21: “in the brain, also known as neuroestrogens,” → “in the brain.”

      Line 28: “neuroestrogens” → “these estrogens.”

      Line 30: “mechanism of action of neuroestrogens” → “mode of action of brain-derived estrogens.”

      Line 43: “brain-derived estrogens, also called neuroestrogens,” → “estrogens.”

      Line 74: “neuroestrogen synthesis is selectively impaired while gonadal estrogen synthesis remains intact” → “estrogen synthesis in the brain is selectively impaired while that in the gonads remains intact.”

      Line 77: “neuroestrogens” → “these estrogens.”

      Line 335: “levels of neuroestrogens” → “brain estrogen levels.”

      Line 338: “neuroestrogens” → “these estrogens.”

      Line 351: “neuroestrogens” → “these estrogens.”

      Line 357: “neuroestrogen action” → “the action of brain-derived estrogens.”

      Line 359: “neuroestrogens” → “estrogen synthesis in the brain.”

      Line 390: “active synthesis of neuroestrogens” → “active estrogen synthesis in the brain.”

      Line 431: “neuroestrogens” → “estrogens in the brain.”

      Line 431: “neuroestrogen action” → “the action of brain-derived estrogens.”

      Line 433: “neuroestrogen action” → “their action.”

      Strengths:

      Excellent use of the medaka model to disentangle the control of social behavior by sex steroid hormones.

      The findings are strong for the most part because deficits in the mutants are restored by the molecule (estrogens) that was no longer present due to the mutation.

      Presentation of the approach and findings are clear, allowing the reader to make their own inferences and compare them with the authors'.

      Includes multiple follow-up experiments, which lead to tests of internal replication and an impactful mechanistic proposal.

      Findings are provocative not just for teleost researchers, but for other species since, as the authors point out, the data suggest mechanisms of estrogenic control of social behaviors may be evolutionarily ancient.

      We again thank the reviewer for their positive evaluation of our work.

      Weaknesses:

      (1) As stated in the summary, the authors attribute the estrogen source to neurons and there isn't evidence this is the case. The impact of the findings doesn't rest on this either.

      As noted in Response to reviewer #1’s summary comment, we have replaced “neuroestrogens” with “brain-derived estrogens” or “brain estrogens” throughout the manuscript.

      Line 63: We have also added the text “In teleost brains, including those of medaka, aromatase is exclusively localized in radial glial cells, in contrast to its neuronal localization in rodent brains (18–20).” Following this addition, “This observation suggests” in the subsequent sentence has been replaced with “These observations suggest.”

      The following references (#18–20), cited in the newly added text above, have been included in the reference list, with other references renumbered accordingly:

      P. M. Forlano, D. L. Deitcher, D. A. Myers, A. H. Bass, Anatomical distribution and cellular basis for high levels of aromatase activity in the brain of teleost fish: aromatase enzyme and mRNA expression identify glia as source. J. Neurosci. 21, 8943–8955 (2001).

      N. Diotel, Y. Le Page, K. Mouriec, S. K. Tong, E. Pellegrini, C. Vaillant, I. Anglade, F. Brion, F. Pakdel, B. C. Chung, O. Kah, Aromatase in the brain of teleost fish: expression, regulation and putative functions. Front. Neuroendocrinol. 31, 172–192 (2010).

      A. Takeuchi, K. Okubo, Post-proliferative immature radial glial cells female-specifically express aromatase in the medaka optic tectum. PLoS One 8, e73663 (2013).

      (2) The d4 versus d8 esr2a mutants showed different results for aggression. The meaning and implications of this finding are not discussed, leaving the reader wondering.

      Line 282: As the reviewer correctly noted, circles were significantly reduced in mutant males of the Δ8 line, whereas no significant reduction was observed in those of the Δ4 line. However, a tendency toward reduction was evident in the Δ4 line (P = 0.1512), and both lines showed significant differences in fin displays. Based on these findings, we believe our conclusion that esr2a<sup>−/−</sup> males exhibit reduced aggression remains valid. To clarify this point and address potential reader concerns, we have revised the text as follows: “esr2a<sup>−/−</sup> males from both the Δ8 and Δ4 lines exhibited significantly fewer fin displays than their wildtype siblings (P = 0.0461 and 0.0293, respectively). Circles followed a similar pattern, with a significant reduction in the Δ8 line (P = 0.0446) and a comparable but non-significant decrease in the Δ4 line (P = 0.1512) (Fig. 5L; Fig. S8E), showing less aggression.”

      (3) Lack of attribution of previously published work from other research groups that would provide the proper context of the present study.

      In response to this and other comments from this reviewer, we have revised the Introduction and Discussion sections as follows.

      Line 56: “solely responsible” in the Introduction has been modified to “largely responsible”.

      Line 57: “This is consistent with the recent finding in medaka fish (Oryzias latipes) that estrogens act through the ESR subtype Esr2b to prevent females from engaging in male-typical courtship (10)” has been revised to “This is consistent with recent observations in a few teleost species that genetic ablation of AR severely impairs male-typical behaviors (13–16) and with findings in medaka fish (Oryzias latipes) that estrogens act through the ESR subtype Esr2b to prevent females from engaging in male-typical courtship (12)” to include previous studies on the behavior of AR mutant fish (Yong et al., 2017; Alward et al., 2020; Ogino et al., 2023; Nishiike and Okubo, 2024) in the Introduction.

      Line 65: “It is worth mentioning that systemic administration of estrogens and an aromatase inhibitor increased and decreased male aggression, respectively, in several teleost species, potentially reflecting the behavioral effects of brain-derived estrogens (21–24)” has been added to the Introduction. This addition provides an overview of previous studies on the effects of estrogens and aromatase on male fish aggression (Hallgren et al., 2006; O’Connell and Hofmann, 2012; Huffman et al., 2013; Jalabert et al., 2015).

      Line 367: “treatment of males with an aromatase inhibitor reduces their male-typical behaviors (31–33)” has been edited to read “treatment of males with an aromatase inhibitor reduces their male-typical behaviors, while estrogens exert the opposite effect (21–24).”

      After the revisions described above, the following references (#13, 14, and 22) have been added to the reference list, with other references renumbered accordingly:

      L. Yong, Z. Thet, Y. Zhu, Genetic editing of the androgen receptor contributes to impaired male courtship behavior in zebrafish. J. Exp. Biol. 220, 3017–3021 (2017).

      B. A. Alward, V. A. Laud, C. J. Skalnik, R. A. York, S. A. Juntti, R. D. Fernald, Modular genetic control of social status in a cichlid fish. Proc. Natl. Acad. Sci. U.S.A. 117, 28167–28174 (2020).

      L. A. O’Connell, H. A. Hofmann, Social status predicts how sex steroid receptors regulate complex behavior across levels of biological organization. Endocrinology 153, 1341–1351 (2012).

      (4) There are a surprising number of citations not included; some of the ones not included argue against the authors' claims that their findings were "contrary to expectation".

      Line 68: As detailed in Response to reviewer #1’s comment 3 on weaknesses, we have cited previous studies on the effects of estrogens and aromatase on male fish aggression (Hallgren et al., 2006; O’Connell and Hofmann, 2012; Huffman et al., 2013; Jalabert et al., 2015) in the Introduction.

      The following revisions have also been made to avoid phrases such as “contrary to expectation” and “unexpected.”

      Line 76: “Contrary to our expectations” → “Remarkably.”

      Line 109: “Contrary to this expectation, however” → “Nevertheless.”

      Line 135: “Again, contrary to our expectation, cyp19a1b<sup>−/−</sup> males” → “cyp19a1b<sup>−/−</sup> males.”

      Line 333: “unexpected” → “noteworthy.”

      Line 337: “unexpected” → “notable.”

      (5) The experimental design for studying aggression in males has flaws. A standard test like a resident intruder test should be used.

      We agree that the resident-intruder test is the most commonly used method for assessing aggression. However, medaka form shoals and lack strong territoriality, and even slight dominance differences between the resident and the intruder can increase variability in the results, compromising data consistency. Therefore, in this study, we adopted an alternative approach: placing four unfamiliar males together in a tank and quantifying aggressive interactions in total. This method allows for the assessment of aggression regardless of territorial tendencies, making it more appropriate for our investigation.

      (6) While they investigate males and females, there are fewer experiments and explanations for the female results, making it feel like a small addition or an aside.

      We agree that the data and discussion for females are less extensive than for males. However, we have previously elucidated the mechanism by which estrogen/Esr2b signaling promotes female mating behavior (Nishiike et al., 2021, Curr Biol, 1699–1710). Accordingly, it follows that the new insights into female behavior gained from the cyp19a1b knockout model are more limited than those for males. Nevertheless, when combined with our prior findings, the female data in this study offer valuable insights, and the overall mechanism through which estrogens promote female mating behavior is becoming clearer. Therefore, we do not consider the female data in this study to be incomplete or merely supplementary.

      (7) The statistics comparing "experimental to experimental" and "control to experimental" aren't appropriate.

      The reviewer raises concerns about the statistical analysis used for Figures 4C and 4E, suggesting that Bonferroni’s test should be used instead of Dunnett’s test. However, Dunnett’s test is commonly used to compare treatment groups to a reference group that receives no treatment, as in our study. Since we do not compare the treated groups with each other, we believe Dunnett’s test is the most appropriate choice.

      Line 619: The reviewer’s concern may have arisen from the phrase “comparisons between control and experimental groups” in the Materials and Methods. We have revised it to “comparisons between untreated and E2-treated groups in Fig. 4, C and D” for clarity.
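
      The distinction drawn in this response is about which family of comparisons is tested, and a small sketch may make that concrete. The Python below only enumerates the two families for a set of hypothetical group labels (the labels and function names are assumptions for illustration, not the authors' analysis); it does not compute Dunnett's statistic itself, which relies on a multivariate t distribution rather than a simple division of alpha.

      ```python
      from itertools import combinations


      def comparison_families(groups, control):
          """Return (many_to_one, all_pairs) lists of comparisons for the group labels."""
          treated = [g for g in groups if g != control]
          # Many-to-one family: each treated group against the single reference group.
          many_to_one = [(control, g) for g in treated]
          # All-pairwise family: every group against every other group.
          all_pairs = list(combinations(groups, 2))
          return many_to_one, all_pairs


      def bonferroni_threshold(alpha, n_comparisons):
          """Per-comparison significance threshold under a Bonferroni correction."""
          return alpha / n_comparisons


      # Hypothetical labels: one untreated reference group and three treated groups.
      groups = ["untreated", "E2_dose_1", "E2_dose_2", "E2_dose_3"]
      many_to_one, all_pairs = comparison_families(groups, control="untreated")

      print("many-to-one comparisons:", len(many_to_one))
      print("all-pairwise comparisons:", len(all_pairs))
      print("Bonferroni threshold over all pairs:", bonferroni_threshold(0.05, len(all_pairs)))
      ```

      With one reference group and three treated groups, the many-to-one family contains three comparisons while the all-pairwise family contains six, which is why a procedure designed for comparisons against a single reference is less conservative when treated groups are never compared with each other.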

      Reviewer #2 (Public Review):

      Summary:

      The novelty of this study stems from the observations that neuro-estrogens appear to interact with brain androgen receptors to support male-typical behaviors. The study provides a step forward in clarifying the somewhat contradictory findings that, in teleosts and unlike other vertebrates, androgens regulate male-typical behaviors without requiring aromatization, but at the same time estrogens appear to also be involved in regulating male-typical behaviors. They manipulate the expression of one aromatase isoform, cyp19a1b, that is purported to be brain-specific in teleosts. Their findings are important in that brain estrogen content is sensitive to the brain-specific cyp19a1b deficiency, leading to alterations in both sexual behavior and aggressive behavior. Interestingly, these males have relatively intact fertility rates, despite the effects on the brain.

      We thank this reviewer for their positive evaluation of our work and constructive comments, which we found very helpful in improving the manuscript.

      That said, the framing of the study, the relevant context, and several aspects of the methods and results raise concerns. Two interpretations need to be addressed/tempered:

      (1) that the rescue of cyp19a1b deficiency by tank-applied estradiol is not necessarily a brain/neuroestrogen mode of action, and

      Line 155: cyp19a1b-deficient males exhibited a severe reduction in brain E2 levels, yet their peripheral E2 levels remained comparable to those in wild-type males. Given this hormonal milieu and the lack of behavioral change in wild-type males following E2 treatment, the observed recovery of mating behavior in cyp19a1b-deficient males following E2 treatment can be best explained by the restoration of brain E2 levels. However, as the reviewer pointed out, we cannot rule out the possibility that bath-immersed E2 influenced behavior through an indirect peripheral mechanism. To address this concern, we have modified the text as follows: “These results suggest that reduced E2 in the brain is the primary cause of the mating defects, highlighting a pivotal role of brain-derived estrogens in male mating behavior. However, caution is warranted, as an indirect peripheral effect of bath-immersed E2 on behavior cannot be ruled out, although this is unlikely given the comparable peripheral E2 levels in cyp19a1b-deficient and wild-type males. In contrast to mating.”

      (2) the large increases in peripheral and brain androgen levels in the cyp19a1b deficient animals imply some indirect/compensatory effects of lifelong cyp19a1b deficiency.

      As stated in line 151, androgen/AR signaling has a strong facilitative effect on male-typical behaviors in teleosts. If increased androgen levels in the periphery and brain affected behavior, the expected effect would be facilitative. However, cyp19a1b-deficient males exhibited impaired male-typical behaviors, suggesting that elevated androgen levels were unlikely to be responsible. Although chronic androgen elevation could cause androgen receptor desensitization, which could lead to behavioral suppression, our long-term androgen treatments have consistently promoted, rather than inhibited, male-typical behaviors (e.g., Nishiike et al., Proc Natl Acad Sci USA 121:e2316459121). Hence, this possibility is also highly unlikely.

      Reviewer #3 (Public Review):

      Summary:

      Taking advantage of the existence in fish of two genes coding for estrogen synthase, the enzyme aromatase, one mostly expressed in the brain (Cyp19a1b) and the other mostly found in the gonads (Cyp19a1a), this study investigates the role of neuro-estrogens in the control of sexual and aggressive behavior in teleost fish. The constitutive deletion of Cyp19a1b reduced brain estrogen content by 87% in males and about 50% in females. It led to reduced sexual and aggressive behavior in males and reduced sexual behavior in females. These effects are reversed by adult treatment with estradiol thus indicating that they are activational in nature. The deletion of Cyp19a1b is associated with a reduced expression of the genes coding for the two androgen receptors, ara, and arb, in brain regions involved in the regulation of social behavior. The analysis of the gene expression and behavior of mutants of estrogen receptors indicates that these effects are likely mediated by the activation of the esr1 and esr2a isoforms. These results provide valuable insight into the role of neuro-estrogens in social behavior in the most abundant vertebrate taxa. While estrogens are involved in the organization of the brain and behavior of some birds and rodents, neuro-estrogens appear to play an activational role in fish through a facilitatory action of androgen signaling.

      We thank this reviewer for their positive evaluation of our work and comments that have improved the manuscript.

      Strengths:

      Evaluation of the role of brain "specific" Cyp19a1 in male teleost fish, which as a taxa are more abundant and yet proportionally less studied than the most common birds and rodents. Therefore, evaluating the generalizability of results from higher vertebrates is important. This approach also offers great potential to study the role of brain estrogen production in females, an understudied question in all taxa.

      Results obtained from multiple mutant lines converge to show that estrogen signaling drives aspects of male sexual behavior.

      The comparative discussion of the age-dependent abundance of brain aromatase in fish vs mammals and its role in organization vs activation is important beyond the study of the targeted species.

      We again thank the reviewer for their positive evaluation of our work.

      Weaknesses:

      (1) The new transgenic lines are under-characterized. There is no evaluation of the mRNA and protein products of Cyp19a1b and ESR2a.

      We did not directly assess the function of cyp19a1b and esr2a in our mutant fish. However, the observed reduction in brain E2 levels, with no change in peripheral E2 levels, in cyp19a1b-deficient fish strongly supports the loss of cyp19a1b function. This is stated in the Results section (line 97) as follows: “These results show that cyp19a1b-deficient fish have reduced estrogen levels coupled with increased androgen levels in the brain, confirming the loss of cyp19a1b function.”

      Line 473: A previous study reported that female medaka lacking esr2a fail to release eggs due to oviduct atresia (Kayo et al., 2019, Sci Rep 9:8868). Similarly, in this study, some esr2a-deficient females exhibited spawning behavior but were unable to release eggs, although the sample size was limited (Δ8 line: 2/3; Δ4 line: 1/1). In contrast, this was not observed in wild-type females (Δ8 line: 0/12; Δ4 line: 0/11). These results support the effective loss of esr2a function. To incorporate this information into the manuscript, the following text has been added to the Materials and Methods: “A previous study reported that esr2a-deficient female medaka cannot release eggs due to oviduct atresia (59). Likewise, some esr2a-deficient females generated in this study, despite the limited sample size, exhibited spawning behavior but were unable to release eggs (Δ8 line: 2/3; Δ4 line: 1/1), while such failure was not observed in wild-type females (Δ8 line: 0/12; Δ4 line: 0/11). These results support the effective loss of esr2a function.”

      The following reference (#59), cited in the newly added text above, has been included in the reference list:

      D. Kayo, B. Zempo, S. Tomihara, Y. Oka, S. Kanda, Gene knockout analysis reveals essentiality of estrogen receptor β1 (Esr2a) for female reproduction in medaka. Sci. Rep. 9, 8868 (2019).

      (2) The stereotypic sequence of sexual behavior is poorly described, in particular, the part played by the two sexual partners, such that the conclusions are not easily understandable, notably with regards to the distinction between motivation and performance.

      Line 103: To provide a more detailed description of medaka mating behavior, we have revised the text from “The mating behavior of medaka follows a stereotypical pattern, wherein a series of followings, courtship displays, and wrappings by the male leads to spawning” to “The mating behavior of medaka follows a stereotypical sequence. It begins with the male approaching and closely following the female (following). The male then performs a courtship display, rapidly swimming in a circular pattern in front of the female. If the female is receptive, the male grasps her with his fins (wrapping), culminating in the simultaneous release of eggs and sperm (spawning).”

      (3) The behavior of females is only assessed from the perspective of the male, which raises questions about the interpretation of the reduced behavior of the males.

      In medaka, female mating behavior is largely passive, except for rejecting courtship attempts and releasing eggs. Therefore, its analysis relies on measuring the latency to receive following, courtship displays, or wrappings from the male and the frequency of courtship rejection or wrapping refusal. We understand the reviewer’s perspective that cyp19a1b-deficient females might not be less receptive but instead less attractive to males, potentially leading to reduced male mating efforts. However, since these females are approached and followed by males at levels comparable to wild-type females, this possibility appears unlikely. Moreover, cyp19a1b-deficient females tend to avoid males and exhibit a slightly female-oriented sexual preference. While these traits are closely associated with reduced sexual receptivity, they do not readily align with reduced sexual attractiveness. Therefore, it is more plausible to conclude that these females have decreased receptivity rather than being less attractive to males.

      (4) At no point do the authors seem to consider that a reduced behavior of one sex could result from a reduced sensory perception from this sex or a reduced attractivity or sensory communication from the other sex.

      Line 112: As noted above, the impaired mating behavior of cyp19a1b-deficient females is unlikely to be due to reduced attractiveness to males. Similarly, mating behavior tests using esr2b-deficient females as stimulus females suggest that the impaired mating behavior of cyp19a1b-deficient males cannot be attributed to reduced attractiveness to females. However, the possibility that their impaired mating behavior could be attributed to altered cognition or sexual preference cannot be ruled out. To reflect this in the manuscript, we have revised the text “, suggesting that they are less motivated to mate” to “. These results suggest that they are less motivated to mate, though an alternative interpretation that their cognition or sexual preference may be altered cannot be dismissed.”

      (5) Aspects of the methods are not detailed enough to allow proper evaluation of their quality or replication of the data.

      In response to this and other specific comments from this reviewer, we have revised the Materials and Methods section to include more detailed descriptions of the methods.

      Line 469: The following text has been added to describe the method for domain identification in medaka Esr2a: “The DNA- and ligand-binding domains of medaka Esr2a were identified by sequence alignment with yellow perch (Perca flavescens) Esr2a, for which these domain locations have been reported (58).”

      The following reference (#58), cited in the newly added text above, has been included in the reference list:

      S. G. Lynn, W. J. Birge, B. S. Shepherd, Molecular characterization and sex-specific tissue expression of estrogen receptor α (esr1), estrogen receptor βa (esr2a) and ovarian aromatase (cyp19a1a) in yellow perch (Perca flavescens). Comp. Biochem. Physiol. B Biochem. Mol. Biol. 149, 126–147 (2008).

      Line 540: The text “, and the total area of signal in each brain nucleus was calculated using Olyvia software (Olympus)” has been revised to include additional details on the single ISH method as follows: “. The total area of signal across all relevant sections, including both hemispheres, was calculated for each brain nucleus using Olyvia software (Olympus). Images were converted to a 256-level intensity scale, and pixels with intensities from 161 to 256 were considered signals. All sections used for comparison were processed in the same batch, without corrections between samples.”
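
      The quantification rule quoted above (a fixed intensity cutoff, with signal area summed across sections) can be sketched as follows. This is only an illustration under stated assumptions, not the Olyvia implementation: the 1–256 intensity scale, the function names, and the `pixel_area` calibration factor are hypothetical, and the toy pixel values are made up.

```python
def signal_area(image, threshold=161, pixel_area=1.0):
    """Return the area covered by pixels at or above `threshold`.

    `image` is a list of rows of integer intensities, assumed here to lie
    on a 1-256 scale (the quoted text gives the signal range as 161 to 256).
    `pixel_area` is a hypothetical calibration factor (area per pixel).
    """
    n_signal = sum(1 for row in image for value in row if value >= threshold)
    return n_signal * pixel_area


def total_signal_area(sections, threshold=161, pixel_area=1.0):
    """Sum the signal area over all sections of one brain nucleus."""
    return sum(signal_area(s, threshold, pixel_area) for s in sections)


# Two toy 2x3 "sections"; the intensities are invented for illustration.
section_a = [[10, 161, 200], [160, 256, 90]]
section_b = [[161, 161, 1], [255, 30, 40]]
nucleus_total = total_signal_area([section_a, section_b])
print(nucleus_total)
```

      Because the cutoff is fixed and no between-sample correction is applied, the comparison is only meaningful for sections processed in the same batch, as the quoted text states.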

      Line 596: The following text has been added to include additional details on the double ISH method: “Cells were identified as coexpressing the two genes when Alexa Fluor 555 and fluorescein signals were clearly observed in the cytoplasm surrounding DAPI-stained nuclei, with intensities markedly stronger than the background noise.”

      (6) It seems very dangerous to use the response to a mutant abnormal behavior (ESR2-KO females) as a test, given that it is not clear what is the cause of the disrupted behavior.

      esr2b-deficient females have fully developed ovaries, a normal sex steroid milieu, and sexual attractiveness to males comparable to wild-type females, yet they are completely unreceptive to male courtship (Nishiike et al., 2021, Curr Biol, 1699–1710). Although, as the reviewer noted, the detailed mechanisms underlying this phenotype remain unclear, it is evident that the loss of estrogen/Esr2b signaling in the brain severely impairs sexual receptivity. Therefore, using esr2b-deficient females as stimulus females in the mating behavior test eliminates the influence of female sexual receptivity and male attractiveness to females, enabling the exclusive assessment of male mating motivation. This rationale is already presented in the Results section (lines 116–120), and we believe this experimental design offers a robust framework for assessing male mating motivation.

      Additionally, the mating behavior test with esr2b-deficient females complemented the test with wild-type females, and its results were not the sole basis for our discussion of the male mating behavior phenotype. The results of both tests were largely concordant, and we believe that the conclusions drawn from them are highly reliable.

      Meanwhile, in the test with esr2b-deficient females, cyp19a1b-deficient males were courted more frequently by these females than wild-type males. As the reviewer noted, this may suggest an anomaly in the test. Accordingly, we have confined our discussion to the possibility that “Perhaps cyp19a1b<sup>−/−</sup> males are misidentified as females by esr2b-deficient females because they are reluctant to court or they exhibit some female-like behavior” (line 131).

      (7) Most experiments are weakly powered (low sample size) and analyzed by multiple T-tests while 2 way ANOVA could have been used in several instances. No mention of T or F values, or degrees of freedom.

      Histological analysis was conducted with a relatively small sample size, as our previous experience suggested that interindividual variability in the results would not be substantial. As significant differences were detected in many analyses, further increasing the sample size is unnecessary.

      Although two-way ANOVA could be used instead of multiple T-tests for analyzing the data in Figures 4D, 4F, 6D, S4A, and S4B, we applied the Bonferroni–Dunn correction to control for multiple pairwise comparisons in multiple T-tests. As this comparison method is equivalent to the post hoc test following two-way ANOVA, the statistical results are identical regardless of whether T-tests or two-way ANOVA are used.

      For the data in Figures 4D, 4F, S4A, and S4B, the primary focus is on whether relative luciferase activity differs between E2-treated and untreated conditions for each mutant construct. Therefore, two-way ANOVA is not particularly relevant, as assessing the main effect of construct type or its interaction with E2 treatment does not provide meaningful insights. Similarly, in Figure 6D, the focus is solely on whether wild-type and mutant females differ in time spent at each distance. Given this, two-way ANOVA is unnecessary, as analyzing the main effect of distance is not meaningful.

      Accordingly, two-way ANOVA was not employed in this study, and therefore, its corresponding F values were not included. As the figure legends specify the sample sizes for all analyses, specifying degrees of freedom separately was deemed unnecessary.
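
      For readers unfamiliar with the correction mentioned in this response, the adjustment step can be sketched as below. This is a minimal sketch, assuming the Bonferroni–Dunn procedure amounts to multiplying each raw P value by the number of pairwise comparisons and capping the result at 1; the function name and the raw P values are hypothetical and are not taken from the study.

```python
def bonferroni_adjust(p_values):
    """Multiply each raw P value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up raw P values from three pairwise t-tests (illustration only).
raw = [0.004, 0.020, 0.300]
adjusted = bonferroni_adjust(raw)

# A comparison is called significant if its adjusted P value is below alpha.
alpha = 0.05
significant = [p < alpha for p in adjusted]
print(adjusted, significant)
```

      The sketch covers only the correction step; it does not compute the t statistics themselves.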

      (8) The variability of the mRNA content for the same target gene between experiments (genotype comparison vs E2 treatment comparison) raises questions about the reproducibility of the data (apparent disappearance of genotype effect).

      As the reviewer pointed out, the overall area of ara expression is larger in Figure 2J than in Figure 2F. However, the relative area ratios of ara expression among brain nuclei are consistent between the two figures, indicating the reproducibility of the results. Thus, this difference is unlikely to affect the conclusions of this study.

      Additionally, the differences in ara expression in pPPp and arb expression in aPPp between wild-type and cyp19a1b-deficient males appear less pronounced in Figures 2J and 2K than in Figures 2F and 2H. This is likely attributable to the smaller sample size used in the experiments for Figures 2J and 2K, resulting in less distinct differences. However, as the same genotype-dependent trends are observed in both sets of figures, the conclusion that ara and arb expression is reduced in cyp19a1b-deficient male brains remains valid.

      (9) The discussion confuses the effects of estrogens on sexual differentiation (developmental programming = permanent) and activation (= reversible activation of brain circuits in adulthood) of the brain and behavior. Whether sex differences in the circuits underlying social behaviors exist is not clear.

      We recognize that the effects of adult steroids are sometimes not considered to be sexual differentiation, as they do not differentiate the neural substrate, but rather transiently activate the already masculinized or feminized substrate. Arnold (2017, J Neurosci Res 95:291–300) contends that all factors that cause sex differences, including the transient effects of adult steroids, should be incorporated into a theory of sexual differentiation, and indeed, these effects may be the most potent proximate factors that make males and females different. We concur with this perspective and have adopted it as a foundation for our manuscript.

      In teleosts, early developmental exposure to steroids has minimal impact, and sexual differentiation relies primarily on steroid action in adulthood (Okubo et al., 2022, Spectrum of Sex, pp. 111–133). This is evidenced by the effective reversal of sex-typical behaviors through experimental hormonal manipulation in adult teleosts and the absence of transient early-life steroid surges observed in mammals and birds. Accordingly, our discussion on brain sexual differentiation, including the statement in line 347, “This variation among species may represent the activation of neuroestrogen synthesis at life stages critical for sexual differentiation of behavior that are unique to each species”, remains well-supported. Additionally, given these considerations, while sex differences in neural circuit activation are evident in teleosts, substantial structural differences in these circuits are unlikely.

      (10) Overall, the claims regarding the activational role of neuro-estrogens on male sexual behavior are supported by converging evidence from multiple mutant lines. The role of neuroestrogens on gene expression in the brain is mostly solid too. The data for females are comparatively weaker. Conclusions regarding sexual differentiation should be considered carefully.

      We agree that the data for females are less extensive than for males. However, we have previously elucidated the mechanism by which estrogen/Esr2b signaling promotes female mating behavior (Nishiike et al., 2021). Accordingly, it follows that the new insights into female behavior gained from the cyp19a1b knockout model are more limited than those for males. Nevertheless, when integrated with our prior findings, the data on females in this study provide significant insights, and the overall mechanism through which estrogens promote female mating behavior is becoming clearer. Therefore, we do not consider the female data in this study to be incomplete or merely supplementary.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      The authors set out to answer an intriguing question regarding the hormonal control of innate social behaviors in medaka. Specifically, they wanted to test the effects of cyp19a1b mutation on mating and aggression in males. They also test these effects in females. Their approach takes them down several distinct experimental pathways, including one investigating how cyp19a1a function is related to androgen receptor expression and how estrogens themselves may act on the androgen receptor to modulate its expression, as well as how different esr genes may be involved. The study and its results are valuable and a clear, general conclusion of a pathway from brain aromatase>estrogens>esr genes>androgen receptor can be made. This is important, novel, and impactful. However, there are issues with how the study logic is set up, the approach for assessing certain behaviors, the statistics used, the interpretation of findings, and placing the findings in the proper context based on previous work, which manifests as a general issue where previous work is not properly attributed.

      Thank you for your thoughtful review. We have carefully addressed each specific comment, as detailed below.

      Major comments:

      (1) The background for the rationale of the current study is misleading and lacks proper context. The authors root the logic of their experiment in determining whether estrogens regulate male-typical behaviors because the current assumption is androgens are "solely responsible" for male-typical behaviors in teleosts. This is not the case. Previous studies have shown aromatase/estrogens are involved in male-typical aggression in teleosts. For example, to name a couple:

      Huffman, L. S., O'Connell, L. A., & Hofmann, H. A. (2013). Aromatase regulates aggression in the African cichlid fish Astatotilapia burtoni. Physiology & behavior, 112, 77-83.

      O'Connell, L. A., & Hofmann, H. A. (2012). Social status predicts how sex steroid receptors regulate complex behavior across levels of biological organization. Endocrinology, 153(3), 1341-1351.

      And even a recent paper sheds light on a possible AR>aromatase>estradiol hypothesis of male-typical behaviors:

      Lopez, M. S., & Alward, B. A. (2024). Androgen receptor deficiency is associated with reduced aromatase expression in the ventromedial hypothalamus of male cichlids. Annals of the New York Academy of Sciences.

      Interestingly, the authors cite Huffman et al. in the discussion, so I don't understand why they make the claims they do about estrogens and male-typical behavior.

      Related to this, is an issue of proper attribution to published work. Indeed, missing are key references from lab groups using AR mutant teleosts. Here are a couple:

      Yong, L., Thet, Z., & Zhu, Y. (2017). Genetic editing of the androgen receptor contributes to impaired male courtship behavior in zebrafish. Journal of Experimental Biology, 220(17), 3017-3021.

      Alward, B. A., Laud, V. A., Skalnik, C. J., York, R. A., Juntti, S. A., & Fernald, R. D. (2020). Modular genetic control of social status in a cichlid fish. Proceedings of the National Academy of Sciences, 117(45), 28167-28174.

      Ogino, Y., Ansai, S., Watanabe, E., Yasugi, M., Katayama, Y., Sakamoto, H., ... & Iguchi, T. (2023). Evolutionary differentiation of androgen receptor is responsible for sexual characteristic development in a teleost fish. Nature communications, 14(1), 1428.

      As noted in Response to reviewer #1’s comment 3 on weaknesses, we have revised the Introduction and Discussion sections as follows.

      Line 56: “solely responsible” in the Introduction has been modified to “largely responsible”.

      Line 57: The text “This is consistent with the recent finding in medaka fish (Oryzias latipes) that estrogens act through the ESR subtype Esr2b to prevent females from engaging in male-typical courtship (10)” has been revised to “This is consistent with recent observations in a few teleost species that genetic ablation of AR severely impairs male-typical behaviors (13–16) and with findings in medaka fish (Oryzias latipes) that estrogens act through the ESR subtype Esr2b to prevent females from engaging in male-typical courtship (12)” to include previous studies on the behavior of AR mutant fish (Yong et al., 2017; Alward et al., 2020; Ogino et al., 2023; Nishiike and Okubo, 2024) in the Introduction.

      Line 65: “It is worth mentioning that systemic administration of estrogens and an aromatase inhibitor increased and decreased male aggression, respectively, in several teleost species, potentially reflecting the behavioral effects of brain-derived estrogens (21–24)” has been added to the Introduction, providing an overview of previous studies on the effects of estrogens and aromatase on male fish aggression (Hallgren et al., 2006; O’Connell and Hofmann, 2012; Huffman et al., 2013; Jalabert et al., 2015).

      Line 367: “treatment of males with an aromatase inhibitor reduces their male-typical behaviors (31–33)” has been edited to read “treatment of males with an aromatase inhibitor reduces their male-typical behaviors, while estrogens exert the opposite effect (21–24).”

      After the revisions described above, the following references (#13, 14, and 22) have been added to the reference list:

      L. Yong, Z. Thet, Y. Zhu, Genetic editing of the androgen receptor contributes to impaired male courtship behavior in zebrafish. J. Exp. Biol. 220, 3017–3021 (2017).

      B. A. Alward, V. A. Laud, C. J. Skalnik, R. A. York, S. A. Juntti, R. D. Fernald, Modular genetic control of social status in a cichlid fish. Proc. Natl. Acad. Sci. U.S.A. 117, 28167–28174 (2020).

      L. A. O’Connell, H. A. Hofmann, Social status predicts how sex steroid receptors regulate complex behavior across levels of biological organization. Endocrinology 153, 1341–1351 (2012).

      While Lopez and Alward (2024) provide valuable insights into the regulation of cyp19a1b expression by androgens, our study focuses specifically on the functional aspects of cyp19a1b. Expanding the discussion to include expression regulation would divert from the primary focus of our manuscript. For this reason, we have opted not to cite this reference.

      (2) As it is now, the authors are only citing a book chapter/review from their own group. This is a serious issue as it does not provide the proper context for the work. The authors need to fix their issues of attribution to previously published work and the proper interpretation of the work that they are aware of as it pertains to ideas proposed on the roles of androgens and estrogens in the control of male-typical behaviors. This is also important to get the citations right because the common use of "contrary to expectations" when describing their results is actually not correct. Many of the observations are expected to a degree. However, this doesn't take away from a generally stellar experimental design and mostly clear results. The authors do not need to rely on enhancing the impact of their paper by making false claims of unexpected findings. The depth and clarity of your findings are where the impact of your work is.

      As detailed in Response to reviewer #1’s comment 3 on weaknesses, we have cited previous studies on the effects of estrogens and aromatase on male fish aggression (Hallgren et al., 2006; O’Connell and Hofmann, 2012; Huffman et al., 2013; Jalabert et al., 2015) in the Introduction.

      Additionally, as noted in Response to reviewer #1’s comment 4 on weaknesses, we have made the following revisions to avoid phrases such as “contrary to expectation” and “unexpected.”

      Line 76: “Contrary to our expectations” → “Remarkably.”

      Line 109: “Contrary to this expectation, however” → “Nevertheless.”

      Line 135: “Again, contrary to our expectation, cyp19a1b<sup>−/−</sup> males” → “cyp19a1b<sup>−/−</sup> males.”

      Line 333: “unexpected” → “noteworthy.”

      Line 337: “unexpected” → “notable.”

      (3) The experimental design for studying aggression in males has flaws. A standard test like a resident-intruder test should be used. An assay in which only male mutants are housed together? I do not understand the logic there and the logic for the approach isn't even explained. Too many confounds that are not controlled for. It makes it seem like an aspect of the study that was thrown in as an aside.

      As noted in Response to reviewer #1’s comment 5 on weaknesses, medaka form shoals and lack strong territoriality. As a result, even slight differences in dominance between the resident and intruder can substantially impact the outcomes of the resident-intruder test. Therefore, we adopted an alternative approach in this study.

      (4) Hormonal differences in the mutants seem to vary based on sex, and the meaning of these differences, or how they affect interpreting the findings, wasn't discussed. There was no acknowledgment of the fact that female central E2 was still at 50%, meaning the "rescue" experiments using peripheral injections are not given the proper context. For example, this is different than giving a fish with only 16% of their normal central E2 an E2 injection. Missing as well is a clear hypothesis for why E2 injections did not rescue aggression deficits in cyp19a1b mutant males.

      Line 385: As the reviewer pointed out, the degree of brain estrogen reduction in cyp19a1b-deficient fish differs greatly between males and females. This is likely because females receive a large supply of estrogens from the ovaries. Given that estrogen levels in cyp19a1b-deficient females were 50% of those in wild-type females, it can be inferred that half of their brain estrogens are synthesized locally, while the other half originates from the ovaries. This is an important finding, and we have already noted in the Discussion that “females have higher brain levels of estrogens, half of which are synthesized locally in the brain (i.e., neuroestrogens).” However, as this explanation was not sufficiently clear, we have revised it to “females have higher brain levels of estrogens, with half being synthesized locally and the other half supplied by the ovaries.”
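
      The inference in this response can be written out as a one-line calculation. The sketch below is only an illustration under assumptions that the response implies but does not state in this form: that the E2 remaining in the mutant brain is entirely of peripheral origin and that the peripheral supply is unchanged by the mutation. The function name and the arbitrary units are hypothetical.

```python
def local_fraction(mutant_level, wild_type_level):
    """Fraction of wild-type brain E2 attributed to local synthesis.

    Assumes the E2 remaining in the mutant brain is entirely peripheral
    in origin and that peripheral supply is unchanged by the mutation.
    """
    if wild_type_level <= 0:
        raise ValueError("wild_type_level must be positive")
    return 1.0 - mutant_level / wild_type_level


# Females: mutant brain E2 at 50% of wild type (arbitrary units).
female_local = local_fraction(50.0, 100.0)
print(female_local)
```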

      The reviewer raised a concern that conducting the estrogen rescue experiment in females, where 50% of brain estrogens remain, might be inappropriate. However, as this experiment was conducted exclusively in males, this concern is not applicable.

      Line 377: As noted in the reviewer’s subsequent comment, the failure of aggression recovery in E2-treated cyp19a1b-deficient males could be due to insufficient induction of ara/arb expression in aggression-relevant brain regions. To address this concern, we have inserted the following statement into the Discussion after “the development of male behaviors may require moderate neuroestrogen levels that are sufficient to induce the expression of ara and arb, but not esr2b, in the underlying neural circuitry”: “This may account for the lack of aggression recovery in E2-treated cyp19a1b-deficient males in this study.”

      (5) In relation to that, the "null" results may have some of the most interesting implications, but they are barely discussed. For example, what does it mean that E2 didn't restore aggression in male cyp19 mutants? Is this a brain region factor? Could this relate to findings from Lopez et al NYAS, where male and female Ara mutants show different effects on brain-region-specific aromatase expression? And maybe this relates to the different impact of estrogens on ar expression. Were the different effects impacted in aggression areas? Maybe this is why E2 injection didn't restore aggression in males. You could make the argument that: (1) E2 doesn't restore ar expression in aggression regions and that's why there was no rescue. Or (2) that the circuits in adulthood that regulate aggression are NOT dependent on aggression but in early development they are. Another null finding not expanded on is why the two esr2a mutant lines showed differences. There is no reason to trust one line over the other, meaning we still don't know whether esr2a is required for latency to follow.

      As stated in our response to the previous comment, we have added the following text to the Discussion (line 377): “This may account for the lack of aggression recovery in E2-treated cyp19a1b-deficient males in this study.” Meanwhile, as discussed in lines 341–342, it is highly unlikely that the neural circuits regulating aggression are primarily influenced by early-life estrogen exposure, because androgen administration in adulthood alone is sufficient to induce high levels of aggression in both sexes. This notion is further supported by previous observations that cyp19a1b expression in the brain is minimal during embryonic development (Okubo et al., 2011, J Neuroendocrinol, 23:412–423).

      The findings of Lopez and Alward (2024) pertain to the regulation of cyp19a1b expression by androgen receptors. While this represents an important aspect of neuroendocrine regulation, it does not appear to be directly relevant to our discussion on cyp19a1b-mediated regulation of androgen receptor expression.

      To ensure the reliability of behavioral analyses in mutant fish, we consider a phenotype valid only when it is consistently observed in two independent mutant lines. In the mating behavior test examining esr2a-deficient males using esr2b-deficient females as stimulus females, Δ8 line males exhibited a shorter latency to initiate following than wild-type males, whereas Δ4 line males did not. This discrepancy led us to refrain from drawing conclusions about the role of esr2a in mating behavior, even though the mating behavior test using wild-type females as stimulus females yielded consistent results in the Δ8 and Δ4 lines. Therefore, we do not consider the reviewer’s concern to be a significant issue.

      (6) Not sure what's going on with the statistics, but it is not appropriate here to treat a "control" group as special. All groups are "experimental" groups. There is nothing special about the control group in this context. All should be Bonferroni post-hoc tests.

      Line 619: As detailed in Response to reviewer #1’s comment 7 on weaknesses, we consider Dunnett’s test the most appropriate choice for the experiments presented in Figures 4C and 4E. We acknowledge that the reviewer’s concern may stem from the phrase “comparisons between control and experimental groups” in the Materials and Methods section. To clarify this point, we have revised it to “comparisons between untreated and E2-treated groups in Fig. 4, C and D.”
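The disagreement above is partly about how many comparisons each approach makes. A minimal sketch, assuming a hypothetical design with four groups (this group count is illustrative and not taken from the manuscript): a many-to-one design, the setting Dunnett's test is built for, compares each treated group with a single reference group, whereas an all-pairwise design with Bonferroni correction compares every pair and divides the significance level by the number of comparisons. The function names are hypothetical, and Dunnett's test itself is not implemented here.

```python
from math import comb


def n_comparisons(n_groups: int, design: str) -> int:
    """Count the comparisons implied by a post-hoc design.

    "many_to_one": each non-reference group vs. one reference group.
    "all_pairwise": every pair of groups.
    """
    if design == "many_to_one":
        return n_groups - 1
    if design == "all_pairwise":
        return comb(n_groups, 2)
    raise ValueError(f"unknown design: {design}")


def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / m


# Hypothetical example: 4 groups, family-wise alpha of 0.05.
n_groups = 4
for design in ("many_to_one", "all_pairwise"):
    m = n_comparisons(n_groups, design)
    print(design, m, round(bonferroni_alpha(0.05, m), 4))
```

The all-pairwise design spends part of its error budget on treated-versus-treated comparisons; whether those comparisons are of interest is the question the reviewer and the authors disagree on.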

      Minor comments:

      Line 47: then how can you say the aromatization hypothesis is "correct"? it only applies to a few species so far. Need to change the framing, not state so strongly such a vague thing as a hypothesis being "correct".

      Line 45: To address this concern, we have modified “widely accepted as correct” to “widely acknowledged”, ensuring a more precise characterization.

      Figure 1: looks like a dosage effect in males but not females. this should be discussed at some point, even if just to mention a dosage effect exists and put it in context.

      Line 91: We have revised the sentence “In males, brain E2 in heterozygotes (cyp19a1b+/−) was also reduced to 45% of the level in wild-type siblings (P = 0.0284) (Fig. 1A)” by adding “, indicating a dosage effect of cyp19a1b mutation” to make this point explicit.

      Were male cyp19 KO aggressive towards females?

      We have not observed cyp19a1b-deficient males exhibiting aggressive behavior towards females in our experiments. Therefore, we do not consider them aggressive toward females.

      Please explain how infertility would lead to reduced mating.

      Line 142: As the reviewer has questioned, even if cyp19a1b-deficient males exhibit infertility due to efferent duct obstruction, it is difficult to imagine that this directly leads to reduced mating. However, the inability to release sperm could indirectly affect behavior. To address this, we have added “, possibly due to the perception of impaired sperm release” after “If this is also the case in medaka, the observed behavioral defects might be secondary to infertility.”

      Describe something about the timing of the treatment here. How can peripheral E2 injections restore it when peripheral levels are normal? Did these injections restore central levels? This needs to be shown experimentally.

      Line 517: As described in the Materials and Methods, E2 treatment was conducted by immersing fish in E2-containing water for 4 days. However, we had not explicitly stated that the water was changed daily to maintain the nominal concentration. To clarify this and address reviewer #2’s comment 9, we have revised “males were treated with 1 ng/ml of E2 (Fujifilm Wako Pure Chemical, Osaka, Japan) or vehicle (ethanol) alone by immersion in water for 4 days” to “males were treated with 1 ng/ml of E2 (Fujifilm Wako Pure Chemical, Osaka, Japan), which was first dissolved in 100% ethanol (vehicle), or with the vehicle alone by immersion in water for 4 days, with daily water changes to maintain the nominal concentration.”

      Line 522: The treatment effectively restored mating activity and ara/arb expression in the brain, suggesting a sufficient increase in brain E2 levels. However, we did not measure the actual increase, and its extent remains uncertain. To reflect this in the manuscript, we have now added the following sentence: “Although the exact increase in brain E2 levels following E2 treatment was not quantified, the observed positive effects on behavior and gene expression suggest that it was sufficient.”

      I know the nomenclature differs among those who study teleosts, but it's ARa and the gene is ar1 (as an example; arb would be ar2). The following citation is recommended to remain consistent:

      Munley, K. M., Hoadley, A. P., & Alward, B. A. (2023). A phylogenetics-based nomenclature system for steroid receptors in teleost fishes. General and Comparative Endocrinology, 114436.

      Paralogous genes resulting from the third round of whole-genome duplication in teleosts are typically designated by adding the suffixes “a” and “b” to their gene symbols. This convention also applies to the two androgen receptor genes, commonly referred to as ara and arb. While the alternative names ar1 and ar2 may gain broader acceptance in the future, ara and arb remain more widely used at present. Therefore, we have chosen to retain ara and arb in this manuscript.

      Line 268: how is this "suggesting" less aggression? They literally showed fewer aggressive displays, so it doesn't suggest it - it literally shows it.

      Line 285: Following this thoughtful suggestion, we have changed “suggesting less aggression” to “showing less aggression.”

      Line 317: how can you still call it the primary driver?

      The stimulatory effects of aromatase/estrogens on male-typical behaviors are exerted through the potentiation of androgen/AR signaling. Thus, we still believe that androgens—specifically 11KT in teleosts—serve as the primary drivers of these behaviors.

      Line 318: not all deficits, like aggression, were rescued.

      Line 334: To address this comment, “These behavioral deficits were rescued by estrogen administration, indicating that reduced levels of neuroestrogens are the primary cause of the observed phenotypes: in other words, neuroestrogens are pivotal for male-typical behaviors in teleosts” has been modified and now reads “Deficits in mating were rescued by estrogen administration, indicating that reduced brain estrogen levels are the primary cause of the observed mating impairment; in other words, brain-derived estrogens are pivotal at least for male-typical mating behaviors in teleosts.”

      Line 324: what do you mean by "sufficient"? To show that, you'd have to castrate the male and only give estrogen back. the authors continue to overstate virtually every aspect of their study, seemingly in an unnecessary manner.

      Line 341: Our intention was to convey that brain-derived estrogens early in life are not essential for the expression of male-typical behaviors in teleosts. However, we recognize that the term “sufficient” could be misinterpreted as implying that estrogens alone are adequate, without contributions from other factors such as androgens. To clarify this, we have revised the text from “neuroestrogen activity in adulthood is sufficient for the execution of male-typical behaviors, while that in early in life is not requisite. Thus, while” to “brain-derived estrogens early in life is not essential for the execution of male-typical behaviors. While.”

      Line 329: so? in adult mice, amygdala aromatase neurons still regulate aggression. The amount in adulthood seems less important compared to site-specific functions.

      Line 346: We do not intend to suggest that brain aromatase activity in adulthood plays a negligible role in male behaviors in rodents, as we have already acknowledged its necessity in the Introduction (lines 42–43). To enhance clarity and prevent misinterpretation, we have added “, although it remains important for male behavior in adulthood” to the end of the sentence: “brain aromatase activity in rodents reaches its peak during the perinatal period and thereafter declines with age.”

      Line 351: This contradicts what you all have been saying.

      Line 65: As mentioned in Response to reviewer #1’s comment 3 on weaknesses, the following text has been added to the Introduction: “It is worth mentioning that systemic administration of estrogens and an aromatase inhibitor increased and decreased male aggression, respectively, in several teleost species, potentially reflecting the behavioral effects of brain-derived estrogens (21–24)”, providing an overview of previous studies on the effects of estrogens and aromatase on male fish aggression (Hallgren et al., 2006; O’Connell and Hofmann, 2012; Huffman et al., 2013; Jalabert et al., 2015). With this revision, we believe the inconsistency has been addressed.

      Line 367: Additionally, we have revised the sentence from “treatment of males with an aromatase inhibitor reduces their male-typical behaviors (31–33)” to “treatment of males with an aromatase inhibitor reduces their male-typical behaviors, while estrogens exert the opposite effect (21–24).”

      Line 360: change to "...possibility that is not mutually exclusive,"

      Line 378: We have revised the phrase as suggested from “Another possibility, not mutually exclusive,” to “Another possibility that is not mutually exclusive.”

      Line 363: but it didn't rescue aggression

      Line 381: In response, we have revised the sentence from “This possibility is supported by the present observation that estrogen treatment facilitated mating behavior in cyp19a1b-deficient males but not in their wild-type siblings” to “This possibility is at least likely for mating behavior, as estrogen treatment facilitated mating behavior in cyp19a1b-deficient males but not in their wild-type siblings.”

      Line 367: on average

      To explain the sex differences in the role of aromatase, what about the downstream molecular or neural targets? In mammals, hodology is related to sex differences. there could be convergent sex differences in regulating the same type of behaviors as well.

      Our findings demonstrate that brain-derived estrogens promote the expression of ara, arb, and their downstream target genes vt and gal in males, while enhancing the expression of npba, a downstream target of Esr2b signaling, in females. The identity of additional target genes and their roles in specific neural circuits remain to be elucidated, and we aim to address these in future research.

      Lines 378-382: this doesn't logically follow. pgf2a could be the target of estrogens which in the intact animal do regulate female sexual receptivity. And how can you say this given that your lab has shown in esr2b mutants females don't mate?

      We agree that PGF2α signaling may be activated by estrogen signaling, as stated in lines 404–407: “the present finding provides a likely explanation for this apparent contradiction, namely, that neuroestrogens, rather than or in addition to ovarian-derived circulating estrogens, may function upstream of PGF2α signaling to mediate female receptivity.” The observation that esr2b-deficient females do not accept male courtship is also stated in lines 401–403: “we recently challenged it by showing that female medaka deficient for esr2b are completely unreceptive to males, and thus estrogens play a critical role in female receptivity.”

      Line 396-397: or the remaining estrogens are enough to activate esr2b-dependent female-typical mating behaviors.

      We agree that cyp19a1b deficiency did not completely preclude female mating behavior, most likely because residual estrogens in the brains of cyp19a1b-deficient females enable weak activation of Esr2b signaling. However, the relevant section in the Discussion is not focused on examining why mating behavior persisted, but rather on considering the implications of this finding for the neural circuits regulating mating behavior. Therefore, incorporating the suggested explanation here would shift the focus and would not be appropriate.

      Line 420-421: this is a lot of variation. Was age controlled for?

      The time required for medaka to reach sexual maturity varies with rearing density and food availability. Due to space constraints, we adjust these parameters as needed, which led to variation in the ages of the experimental fish. However, since all experiments were conducted using sibling fish of the same age that had just reached sexual maturity, we believe this does not affect our conclusions.

      Line 457: have these kits been validated in medaka?

      Although we have not directly validated its applicability in medaka, its extensive use in this species suggests that it is unlikely to pose any issues (e.g., Ussery et al., 2018, Aquat Toxicol, 205:58–65; Lee et al., 2019, Ecotoxicol Environ Saf, 173:174–181; Kayo et al., 2020, Gen Comp Endocrinol, 285:113272; Fischer et al., 2021, Aquat Toxicol, 236:105873; Royan et al., 2023, Endocrinology, 164:bqad030).

      Line 589, re fish that spawned: how many times did this happen? Please note it is based on genotype and experiment. This could be important.

      Line 627: In response to this comment, we have added the following details: “Specifically, 7/18 cyp19a1b+/+, 11/18 cyp19a1b+/−, and 6/18 cyp19a1b−/− males were excluded in Fig. 1D; 6/10 cyp19a1b+/+, 3/10 cyp19a1b+/−, and 6/10 cyp19a1b−/− females were excluded in Fig. 6B; 2/23 esr1+/+ and 5/24 esr1−/− males were excluded in Fig. S7; 2/24 esr2a+/+ and 3/23 esr2a−/− males were excluded in Fig. S8A; 0/23 esr2a+/+ and 0/23 esr2a−/− males were excluded in Fig. S8B.”

      Reviewer #2 (Recommendations For The Authors):

      Abstract:

      (A1) The framing of neuroestrogens being important for male-typical rodents, and not for other vertebrate lineages, does not account for other groups (birds) in which this is true (the authors can consult their cited work by Balthazart (Reference 6) for extensive accounting of this). This makes the novelty clause in the abstract "indicating that neuro-estrogens are pivotal for male-typical behaviors even in nonrodents" less surprising and should be acknowledged by the authors by amending or omitting this novelty clause. The findings regarding androgen receptor transcription (next sentence) are more important and pertinent.

      Line 27: We recognize that the aromatization hypothesis applies to some birds, including zebra finches, as stated in the Introduction (lines 48–49) and Discussion (lines 432–433). However, this was not reflected in the Abstract. Following the reviewer’s suggestion, we have changed “in non-rodents” to “in teleosts.”

      (A2) The medaka line that has been engineered to have aromatase absent in the brain is presented briefly in the abstract, but can be misinterpreted as naturally occurring. This should be amended, by including something like "engineered" or "directed mutant" before 'male medaka fish'.

      Line 24: We have added “mutagenesis-derived” before “male medaka fish” in response to this comment.

      Introduction:

      (I1) The paragraph on teleost brain aromatase should acknowledge that while the capacity for estrogen synthesis in the brain is 100-1000 fold higher in teleosts as compared to rodents and other vertebrates, the majority of this derives from glial and not neural sources. This can be confusing for readers since the term 'neuroestrogens' often refers to the neuronal origin and signalling. And this observation includes the exclusive radial glial expression of cyp19a1b in medaka (Diotel et al., 2010), and first discovered in midshipman (Forlano et al., 2001), each of which should also be cited here. In addition, the authors expend much text comparing teleosts and rodents, but it is worth expanding these kinds of comparisons, especially by pointing out that parts of the primate brain are found to densely express aromatase (see work by Ei Terasawa and others).

      In response to this comment and a similar comment from reviewer #1, we have replaced “neuroestrogens” with “brain-derived estrogens” or “brain estrogens” throughout the manuscript.

      Line 63: We have also added the text “In teleost brains, including those of medaka, aromatase is exclusively localized in radial glial cells, in contrast to its neuronal localization in rodent brains (18–20).” As a result of this addition, we have changed “This observation suggests” to “These observations suggest” in the subsequent sentence.

      Line 51: Additionally, to include information on aromatase in the primate brain, we have added the following text: “In primates, the hypothalamic aromatization of androgens to estrogens plays a central role in female gametogenesis (10) but is not essential for male behaviors (7, 8).”

      The following references (#10 and 18–20), cited in the newly added text above, have been included in the reference list, with other references renumbered accordingly:

      E. Terasawa, Neuroestradiol in regulation of GnRH release. Horm. Behav. 104, 138–145 (2018).

      P. M. Forlano, D. L. Deitcher, D. A. Myers, A. H. Bass, Anatomical distribution and cellular basis for high levels of aromatase activity in the brain of teleost fish: aromatase enzyme and mRNA expression identify glia as source. J. Neurosci. 21, 8943–8955 (2001).

      N. Diotel, Y. Le Page, K. Mouriec, S. K. Tong, E. Pellegrini, C. Vaillant, I. Anglade, F. Brion, F. Pakdel, B. C. Chung, O. Kah, Aromatase in the brain of teleost fish: expression, regulation and putative functions. Front. Neuroendocrinol. 31, 172–192 (2010).

      A. Takeuchi, K. Okubo, Post-proliferative immature radial glial cells female-specifically express aromatase in the medaka optic tectum. PLoS One 8, e73663 (2013).

      (I2) It is difficult to resolve from the introduction and work cited how restricted cyp19a1b is to the medaka brain. Important for the results of this study, it is not clear whether it is more of a bias in the brain vs other tissues, or if the cyp19a1b deficiency is restricted to the brain, and gonadal/peripheral cyp19 expression persists. The authors need to improve their consideration of the alternatives, i.e., that this manipulation is not somehow affecting: 1) peripheral aromatase expression (either cyp19a1a or cyp19a1b) in the gonad or elsewhere, 2) compensatory processes, such as other steroidogenic genes (are androgen synthesizing enzymes increasing?).

      Our previous study demonstrated that cyp19a1b is expressed in the gonads, but at levels tens to hundreds of times lower than those in the brain (Okubo et al., 2011, J Neuroendocrinol 23:412–423). Additionally, a separate study in medaka reported that cyp19a1b expression in the ovary is considerably lower than that of cyp19a1a (Nakamoto et al., 2018, Mol Cell Endocrinol 460:104–122). Given these observations, any potential effect of cyp19a1b knockout on peripheral estrogen synthesis is likely negligible. Indeed, Figures S1C and S1D confirm that cyp19a1b knockout does not alter peripheral E2 levels.

      Line 72: To incorporate this information into the Introduction and address the following comment, we have added the following text: “In medaka, cyp19a1b is also expressed in the gonads, but only at a level tens to hundreds of times lower than in the brain and substantially lower than that of cyp19a1a (26, 27).”

      The following references (#26 and 27), cited in the newly added text above, have been included in the reference list, with other references renumbered accordingly:

      K. Okubo, A. Takeuchi, R. Chaube, B. Paul-Prasanth, S. Kanda, Y. Oka, Y. Nagahama, Sex differences in aromatase gene expression in the medaka brain. J. Neuroendocrinol. 23, 412–423 (2011).

      M. Nakamoto, Y. Shibata, K. Ohno, T. Usami, Y. Kamei, Y. Taniguchi, T. Todo, T. Sakamoto, G. Young, P. Swanson, K. Naruse, Y. Nagahama, Ovarian aromatase loss-of-function mutant medaka undergo ovary degeneration and partial female-to-male sex reversal after puberty. Mol. Cell. Endocrinol. 460, 104–122 (2018).

      We have not assessed whether the expression of other steroidogenic enzymes is altered in cyp19a1b-deficient fish, and this may be investigated in future studies.

      (I3) Related, there are documented sex differences in the brain expression of cyp19a1b especially in adulthood (Okubo et al 2011) and this study should be cited here for context.

      Line 72: As stated in our previous response, we have cited Okubo et al. (2011) by adding the following sentence: “In medaka, cyp19a1b is also expressed in the gonads, but only at a level tens to hundreds of times lower than in the brain and substantially lower than that of cyp19a1a (26, 27).”

      Methods

      (M1) The rationale is unclear as presented for using mutagen screening for cyp19a1b while using CRISPR for esr2a. Are there methodological/biochemical reasons why the authors chose to not use the same method for both?

      At the time we generated the cyp19a1b knockouts, genome editing was not yet available, and the TILLING-based screening was the only method for obtaining mutants in medaka. In contrast, by the time we generated the esr2a knockouts, CRISPR/Cas9 had become available, enabling a more efficient and convenient generation of knockout lines. This is why the two knockout lines were generated using different methods.

      (M2) Measurement of steroids in biological matrices is not straightforward, and it is good that the authors use multiple extraction steps (organic followed by C18 columns) before loading samples on the ELISA plates, which are notoriously sensitive. Even though these methods have been published previously by this group of authors, the quality control and ELISA performance values (recovery, parallelism, etc.) should be presented for readers to evaluate.

      Thank you for appreciating our sample purification method. Unfortunately, we have not evaluated the recovery rate or parallelism, but we recognize this as a subject for future studies.

      (M3) Mating behavior - E2 treated males were not co-housed with social partners for the full 24 hr before testing, but instead a few hours (?) prior to testing. The rationale for this should be spelled out explicitly.

      Line 494: In response to this comment, we have added “to ensure the efficacy of E2 treatment” to the end of the sentence “The set-up was modified for E2-treated males, which were kept on E2 treatment and not introduced to the test tanks until the day of testing.”

      (M4) The E2 treatment is listed as 1ng/ml vs. vehicle (ethanol). Is the E2 dissolved in 100% ethanol for administration to the tank water? Clarification is needed.

      Line 517: As the reviewer correctly assumed, E2 was first dissolved in 100% ethanol before being added to the tank water. To provide this information and address reviewer #1’s minor comment 5, we have revised “males were treated with 1 ng/ml of E2 (Fujifilm Wako Pure Chemical, Osaka, Japan) or vehicle (ethanol) alone by immersion in water for 4 days” to “males were treated with 1 ng/ml of E2 (Fujifilm Wako Pure Chemical, Osaka, Japan), which was first dissolved in 100% ethanol (vehicle), or with the vehicle alone by immersion in water for 4 days, with daily water changes to maintain the nominal concentration.”

      (M5) The authors exclude fish from the analysis of courtship display behavior for those individuals that spawned immediately at the start of the testing (and therefore it was impossible to register courtship display behaviors). How often did fish in the various treatment groups exhibit this "fast spawning" behavior? Was the occurrence rate different by treatment group? It is unlikely that these omissions from the data set drove large-scale patterns, but an indication of how often this occurred would be reassuring.

      Line 627: In response to this comment, we have included the following details: “Specifically, 7/18 cyp19a1b+/+, 11/18 cyp19a1b+/−, and 6/18 cyp19a1b−/− males were excluded in Fig. 1D; 6/10 cyp19a1b+/+, 3/10 cyp19a1b+/−, and 6/10 cyp19a1b−/− females were excluded in Fig. 6B; 2/23 esr1+/+ and 5/24 esr1−/− males were excluded in Fig. S7; 2/24 esr2a+/+ and 3/23 esr2a−/− males were excluded in Fig. S8A; 0/23 esr2a+/+ and 0/23 esr2a−/− males were excluded in Fig. S8B.” These data indicate that the proportion of excluded males is nearly constant within each trial and is independent of the genotype of the focal fish.
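For readers who want to compare the exclusion rates across genotypes directly, the counts quoted above can be tabulated as proportions. This is a minimal sketch that only restates those counts; the dictionary layout and the function name `exclusion_rates` are illustrative choices, genotype labels use an ASCII hyphen in place of the minus sign, and no significance test is performed.

```python
# Exclusion counts as (excluded, total), copied from the response above.
excluded = {
    "Fig. 1D (males)": {
        "cyp19a1b+/+": (7, 18),
        "cyp19a1b+/-": (11, 18),
        "cyp19a1b-/-": (6, 18),
    },
    "Fig. 6B (females)": {
        "cyp19a1b+/+": (6, 10),
        "cyp19a1b+/-": (3, 10),
        "cyp19a1b-/-": (6, 10),
    },
    "Fig. S7 (males)": {"esr1+/+": (2, 23), "esr1-/-": (5, 24)},
    "Fig. S8A (males)": {"esr2a+/+": (2, 24), "esr2a-/-": (3, 23)},
    "Fig. S8B (males)": {"esr2a+/+": (0, 23), "esr2a-/-": (0, 23)},
}


def exclusion_rates(groups):
    """Return the excluded fraction for each genotype in one experiment."""
    return {genotype: n / total for genotype, (n, total) in groups.items()}


for figure, groups in excluded.items():
    rates = exclusion_rates(groups)
    summary = ", ".join(f"{g}: {r:.1%}" for g, r in rates.items())
    print(f"{figure}: {summary}")
```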

      Results

      (R1) It is striking to see the genetic-'dose' dependent suppression of brain E2 content by heterozygous and homozygous cyp19a1b deficiency, indicating that, as the authors point out, the majority of E2 in the male medaka brain (and 1/2 in the female brain) has a brain-derived origin. It is important also for the interpretation that there are large compensatory increases in brain levels of androgens, when E2 levels drop in the cyp19a1b mutant homozygotes. This latter point should receive more attention.

      Also, there are large increases in peripheral androgen levels in the homozygote mutants for cyp19a1b in both males and females. This indicates a peripheral effect in addition to the clear brain knockdown of E2 synthesis. These nuances need to be addressed.

      In response to this comment, we have revised the Results section as follows:

      Line 91: “, indicating a dosage effect of cyp19a1b mutation” has been added to the end of the sentence “In males, brain E2 in heterozygotes (cyp19a1b+/−) was also reduced to 45% of the level in wild-type siblings (P = 0.0284) (Fig. 1A).”

      Line 94: To draw more attention to the increase in brain androgen levels caused by cyp19a1b deficiency, “Brain levels of testosterone” has been modified to “Strikingly, brain levels of testosterone.”

      Line 100: “Their peripheral 11KT levels also increased 3.7- and 1.8-fold, respectively (P = 0.0789, males; P = 0.0118, females) (Fig. S1, C and D)” has been modified and now reads “In addition, peripheral 11KT levels in cyp19a1b−/− males and females increased 3.7- and 1.8-fold, respectively (P = 0.0789, males; P = 0.0118, females) (Fig. S1, C and D), indicating peripheral influence in addition to central effects.”

      (R2) The interpretation on page 4 that cyp19a1b deficient males are 'less motivated' to mate is premature, given the behavioral measures used in this study. There are several competing explanations for these findings (e.g., alterations in motivation, sensory discrimination, preference, etc.) that could be followed up in future work, but the current results are not able to distinguish among these possibilities.

      Line 112: We agree that the possibility of altered cognition or sexual preference cannot be dismissed. To incorporate this perspective, we have revised the text “, suggesting that they are less motivated to mate” to “These results suggest that they are less motivated to mate, though an alternative interpretation that their cognition or sexual preference may be altered cannot be dismissed.”

      (R3) On page 5, the authors present that peripheral E2 manipulation (delivery to the fish tank) restores courtship behavior in males, and then go on to erroneously conclude that this demonstrates "that reduced E2 in the brain was the primary cause of the mating defects, indicating a pivotal role of neuroestrogens in male mating behavior." Because this is a peripheral E2 treatment, there can be manifold effects on gonadal physiology or other endocrine events that can have indirect effects on the brain and behavior. Without manipulation of E2 directly to the brain to 'rescue' the cyp19a1b deficiency, the authors cannot conclude that these effects are directly on the central nervous system. Tellingly, the tank E2 treatment did not rescue aggressive behavior, suggestive of the potential for indirect effects.

      Line 155: As detailed in Response to reviewer #2’s specific comment 1, we have revised the text from “These results demonstrated that reduced E2 in the brain was the primary cause of the mating defects, indicating a pivotal role of neuroestrogens in male mating behavior. In contrast” to “These results suggest that reduced E2 in the brain is the primary cause of the mating defects, highlighting a pivotal role of brain-derived estrogens in male mating behavior. However, caution is warranted, as an indirect peripheral effect of bath-immersed E2 on behavior cannot be ruled out, although this is unlikely given the comparable peripheral E2 levels in cyp19a1b-deficient and wild-type males. In contrast to mating.”

      (R4) The downregulation of androgen-dependent gene expression (vasotocin in pNVT and galanin in pPMp) in the cyp19a1b deficient males (Figure 3) could be due to exceedingly high levels of brain androgens in the cyp19a1b deficient males. The best way to test the idea that estrogens can restore the expression to be more wild-type directly (like what is happening for ara and arb) is to look at these same markers (vasotocin and galanin) in these same brain areas in the brains of E2-treated males. The authors should have these brains from Figure 2. Unless I missed something, those experiments were not performed/reported here. It is clear that the ara and arb receptors have EREs and are 'rescued' by E2 treatment, but in principle, there could be indirect actions for reasons stated above for the behavior due to the peripheral E2 tank application.

      Thank you for your insightful comment. We agree that the current results cannot exclude the possibility that excessive androgen levels caused the downregulation of vt and gal. However, our previous studies showed that excessive 11KT administration to gonadectomized males and females increased the expression of these genes to levels comparable to wild-type males (Yamashita et al., 2020, eLife, 9:e59470; Kawabata-Sakata et al., 2024, Mol Cell Endocrinol 580:112101), making this scenario unlikely. That said, testing whether estrogen treatment restores vt and gal expression in cyp19a1b-deficient males would be informative, and we see this as an important direction for future research.

      Discussion

      (D1) The authors need to clarify whether EREs are found in other vertebrate AR introns, or is this unique to the teleost genome duplication?

      We have identified multiple ERE-like sequences within intron 1 of the mouse AR gene. However, sequence data alone do not provide sufficient evidence of their functionality, rendering this information of limited relevance. Therefore, we have chosen not to include this discussion in the current paper.

      Reviewer #3 (Recommendations For The Authors):

      (1) The authors are strongly encouraged to report information regarding the effect of Cyp19a1b deletion on the brain content of aromatase protein (ideally both isoforms investigated separately) as the two isoforms are mostly but not completely brain vs gonad specific. The analysis of other tissues would also strengthen the characterization of this model.

      We agree that measuring aromatase protein levels in the brain of our fish would be valuable for confirming the loss of cyp19a1b function. However, as no suitable method is currently available, this issue will need to be addressed in future studies. While this constitutes indirect evidence, the observed reduction in brain E2 levels, with no change in peripheral E2 levels, in cyp19a1b-deficient fish strongly suggests the loss of cyp19a1b function, as noted in Response to reviewer #3’s comment 1 on weaknesses.

      (2) As presented, this study reads as niche work. A better description of the behavior and reproductive significance of the different aspects of the behavioral sequence would allow a better understanding of the results and would thus allow the non-specialist to appreciate the significance of the observations.

      Line 103: In response to this comment and Reviewer #3’s comment 2 on weaknesses, we have revised the sentence from “The mating behavior of medaka follows a stereotypical pattern, wherein a series of followings, courtship displays, and wrappings by the male leads to spawning” to “The mating behavior of medaka follows a stereotypical sequence. It begins with the male approaching and closely following the female (following). The male then performs a courtship display, rapidly swimming in a circular pattern in front of the female. If the female is receptive, the male grasps her with his fins (wrapping), culminating in the simultaneous release of eggs and sperm (spawning)” in order to provide a more detailed description of medaka mating behavior.

      (3) The data regarding female behavior are limited and incomplete. It is suggested to keep this for another manuscript unless data on the behavior of the female herself is added. Indeed, analyzing female's behavior from the male's perspective complicates the interpretation of the results while a description of what the females do would provide valuable and interpretable information.

      We thank the reviewer for this thoughtful suggestion and agree that the data and discussion for females are less extensive than for males. However, we have previously elucidated the mechanism by which estrogen/Esr2b signaling promotes female mating behavior (Nishiike et al., 2021). Accordingly, it follows that the new insights into female behavior gained from the cyp19a1b knockout model are more limited than those for males. Nevertheless, when combined with our prior findings, the female data in this study offer valuable insights, and the overall mechanism through which estrogens promote female mating behavior is becoming clearer. Therefore, we do not consider the female data in this study to be incomplete or merely supplementary.

      (4) In Figure 2, the validity of running multiple t-tests rather than a two-way ANOVA comparing TRT and genotype is questionable. Moreover, why are the absolute values in CTL higher than in the initial experiment comparing genotypes for ara in PPa, pPPp, and NVT as well as for arb in aPPp? More importantly, these graphs do not seem to reproduce the genotype effects for ara in pPPp and NVT and for arb in aPPp.

      The data in Figures 2J and 2K were analyzed with an exclusive focus on the difference between vehicle-treated and E2-treated males, without considering genotype differences. Therefore, the use of t-tests for significance testing is appropriate.

      As the reviewer noted, the overall ara expression area is larger in Figure 2J than in Figure 2F. However, as detailed in Response to reviewer #3’s comment 8 on weaknesses, the relative area ratios of ara expression among brain nuclei are consistent between the two figures, indicating the reproducibility of the results. Thus, we consider this difference unlikely to affect the conclusions of this study.

      Additionally, the differences in ara expression in pPPp and arb expression in aPPp between wild-type and cyp19a1b-deficient males appear smaller in Figures 2J and 2K compared to Figures 2F and 2H. This is likely due to the smaller sample size used in the experiments for Figures 2J and 2K, which makes the differences less distinct. However, since the same genotype-dependent trends are observed in both sets of figures, the conclusion that ara and arb expression is reduced in cyp19a1b-deficient male brains remains valid.

      (5) More information is required regarding the analysis of single ISH - How was the positive signal selected from the background in the single ISH analyses? How was this measure standardized across animals? How many sections were imaged per region? Do the values represent unilateral or bilateral analysis?

      Line 540: Following this comment, we have provided additional details on the single ISH method in the manuscript. Specifically, “, and the total area of signal in each brain nucleus was calculated using Olyvia software (Olympus)” has been revised to “The total area of signal across all relevant sections, including both hemispheres, was calculated for each brain nucleus using Olyvia software (Olympus). Images were converted to a 256-level intensity scale, and pixels with intensities from 161 to 256 were considered signals. All sections used for comparison were processed in the same batch, without corrections between samples.”
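
      The thresholding rule quoted above can be illustrated with a minimal sketch. It is not the Olyvia workflow itself: it assumes each section is available as a 2D grid of 8-bit grayscale values (0-255) that map onto intensity levels 1-256, and the `pixel_area_um2` parameter is a hypothetical calibration factor introduced only for illustration.

```python
# Sketch of threshold-based signal-area quantification.
# Assumption: each section is a 2D list of 8-bit grayscale values (0-255),
# which correspond to intensity levels 1-256 on the 256-level scale.

SIGNAL_MIN_LEVEL = 161  # lowest level counted as signal (from the text)
SIGNAL_MAX_LEVEL = 256  # highest level on the 256-level scale


def signal_pixels(section):
    """Count pixels in one section whose level falls in the signal range."""
    count = 0
    for row in section:
        for value in row:
            level = value + 1  # map stored 0-255 value to level 1-256
            if SIGNAL_MIN_LEVEL <= level <= SIGNAL_MAX_LEVEL:
                count += 1
    return count


def total_signal_area(sections, pixel_area_um2=1.0):
    """Sum signal area over all sections (both hemispheres) of one nucleus.

    pixel_area_um2 is a hypothetical calibration factor (area per pixel).
    """
    return sum(signal_pixels(s) for s in sections) * pixel_area_um2


# Two toy "sections" standing in for real images.
sections = [
    [[0, 159, 160], [200, 255, 10]],
    [[161, 90, 30], [12, 180, 45]],
]
print(total_signal_area(sections))
```

      Because the same fixed threshold is applied to every section with no per-sample correction, comparisons are only meaningful between sections processed in the same batch, as the revised text states.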

      (6) More information should be provided in the methods regarding the image analysis of double ISH. In particular, what were the criteria to consider a cell as labeled are not clear. This is not clear either from the representative images.

      Line 596: To provide additional details on the double ISH method in the manuscript, we have added the following sentence: “Cells were identified as coexpressing the two genes when Alexa Fluor 555 and fluorescein signals were clearly observed in the cytoplasm surrounding DAPI-stained nuclei, with intensities markedly stronger than the background noise.”

      (7) There is no description of the in silico analyses run on ESR2a in the methods.

      The method for identifying estrogen-responsive element-like sequences in the esr2a locus is described in line 549: “Each nucleotide sequence of the 5′-flanking region of ara and arb was retrieved from the Ensembl medaka genome assembly and analyzed for potential canonical ERE-like sequences using Jaspar (version 5.0_alpha) and Match (public version 1.0) with default settings.”

      However, the method for domain identification in Esr2a was not described. Therefore, we have added the following text in line 469: “The DNA- and ligand-binding domains of medaka Esr2a were identified by sequence alignment with yellow perch (Perca flavescens) Esr2a, for which these domain locations have been reported (58).”

      The following reference (#58), cited in the newly added text above, has been included in the reference list: S. G. Lynn, W. J. Birge, B. S. Shepherd, Molecular characterization and sex-specific tissue expression of estrogen receptor α (esr1), estrogen receptor βa (esr2a) and ovarian aromatase (cyp19a1a) in yellow perch (Perca flavescens). Comp. Biochem. Physiol. B Biochem. Mol. Biol. 149, 126–147 (2008).

      (8) Information about the validation steps of the EIA that were carried out as well as the specificity of the antibody the steroids and the extraction efficacy should be provided.

      We have not directly validated the applicability of the EIA kit, but its extensive use in medaka suggests that it is unlikely to pose any issues (e.g., Ussery et al., 2018, Aquat Toxicol, 205:58–65; Lee et al., 2019, Ecotoxicol Environ Saf, 173:174–181; Kayo et al., 2020, Gen Comp Endocrinol, 285:113272; Fischer et al., 2021, Aquat Toxicol, 236:105873; Royan et al., 2023, Endocrinology, 164:bqad030).

      The specificity (cross-reactivity) of the antibodies is detailed as follows.

      (1) Estradiol ELISA kits: estradiol, 100%; estrone, 1.38%; estriol, 1.0%; 5α-dihydrotestosterone, 0.04%; androstenediol, 0.03%; testosterone, 0.03%; aldosterone, <0.01%; cortisol, <0.01%; progesterone, <0.01%.

      (2) Testosterone ELISA kits: testosterone, 100%; 5α-dihydrotestosterone, 27.4%; androstenedione, 3.7%; 11-ketotestosterone, 2.2%; androstenediol, 0.51%; progesterone, 0.14%; androsterone, 0.05%; estradiol, <0.01%.

      (3) 11-Keto Testosterone ELISA kits: 11-ketotestosterone, 100%; adrenosterone, 2.9%; testosterone, <0.01%.

      As this information is publicly available on the manufacturer’s website, we deemed it unnecessary to include it in the manuscript.

      Unfortunately, we have not evaluated the extraction efficacy of the samples, but we recognize this as a subject for future studies.

      (9) I wonder whether the evaluation of the impact of the mutation by comparing the behavior of a group of wild-type males to a group of mutated males is the most appropriate. Justifying this approach against testing the behavior of one mutated male facing one or several wild-type males would be appreciated.

      We agree that the resident-intruder test, in which a single focal resident is confronted with one or more stimulus intruders, is the most commonly used method for assessing aggression. However, medaka form shoals and lack strong territoriality, and even slight dominance differences between the resident and the intruder can increase variability in the results, compromising data consistency. Therefore, in this study, we adopted an alternative approach: placing four unfamiliar males together in a tank and quantifying aggressive interactions in total. This method allows for the assessment of aggression regardless of territorial tendencies, making it more appropriate for our investigation.

      (10) Lines 329-331: this sentence should be rephrased as it contributes to the confusion between sexual differentiation and activation of circuits. The restoration of sexual behavior by adult estrogen treatment pleads in favor of an activational role of neuro-estrogens on behavior rather than an organizational role. Therefore, referring to sexual differentiation is misleading, even more so that the study never compares sexes.

      As detailed in Response to reviewer #3’s comment 9 on weaknesses, we consider that all factors that cause sex differences, including the transient effects of adult steroids, need to be incorporated into a theory of sexual differentiation. In teleosts, since steroids during early development have little effect and sexual differentiation primarily relies on steroid action in adulthood, our discussion on brain sexual differentiation remains valid, including the statement in line 347: “This variation among species may represent the activation of neuroestrogen synthesis at life stages critical for sexual differentiation of behavior that are unique to each species.”

      (11) Lines 384-386: I may have missed something but I do not see data supporting the notion that neuroestrogens may function upstream of PGF2a signaling to mediate female receptivity.

      Line 403: We acknowledge that our explanation was insufficient and apologize for any confusion. To clarify this point, “Given that estrogen/Esr2b signaling feminizes the neural substrates that mediate mating behavior, while PGF2α signaling triggers female sexual receptivity,” has been added before the sentence “The present finding provides a likely explanation for this apparent contradiction, namely, that neuroestrogens, rather than or in addition to ovarian-derived circulating estrogens, may function upstream of PGF2α signaling to mediate female receptivity.”

      Additional alteration

      Reference list (line 682): a preprint article has now been published in a peer-reviewed journal, and the information has been updated accordingly as follows: “bioRxiv doi: 10.1101/2024.01.10.574747 (2024)” to “Proc. Natl. Acad. Sci. U.S.A. 121, e2316459121 (2024).”

    1. eLife Assessment

      This important study combines imaginative experiments to demonstrate the relevance of poroelasticity in the mechanical properties of cells across physiologically relevant time and length scales. Through innovative experiments and a finite element model, the authors present solid evidence that cytosolic flows and pressure gradients can persist in cells with permeable membranes, generating spatially segregated influx and outflux zones. These findings will be of interest to the cell biology and biophysics communities. Nevertheless, a more in-depth discussion of why other possible explanations for the long time scales associated with mechanical propagation are less effective could further strengthen their message.

    2. Reviewer #1 (Public review):

      Summary:

      This work investigated whether cytoplasmic poroelastic properties play an important role in cellular mechanical response over length scales and time scales relevant to cell physiology. Overall, the manuscript concludes that intracellular cytosolic flows and pressure gradients are important for cell physiology and that they act on time- and length-scales relevant to mechanotransduction and cell migration.

      Strengths:

      Their approach integrates both computational and experimental methods. The AFM deformation experiments combined with measuring z-position of beads is a challenging yet compelling method to determine poroelastic contributions to mechanical realization.

      The work is quite interesting and will be of high value to the field of cell mechanics and mechanotransduction.

      Weaknesses:

      However, there are several issues related to the lack of description of theoretical equations, experimental details, and data transparency that should be addressed, including the following:

      (1) Some details are not described for experimental procedures. For example, what were the pharmacological drugs dissolved in, and what vehicle control was used in experiments? How long were pharmacological drugs added to cells?

      (2) Details are missing from the Methods section and Figure captions about the number of biological and technical replicates performed for experiments. Figure 1C states the data are from 12 beads on 7 cells. Are those same 12 beads used in Figure 2C? If so, that information is missing from the Figure 2C caption. Similarly, this information should be provided in every figure caption so the reader can assess the rigor of the experiments. Furthermore, how heterogeneous would the bead displacements be across different cells? The low number of beads and cells assessed makes this information difficult to determine.

      (3) The full equation for displacement vs. time for a poroelastic material is not provided. Scaling laws are shown, but the full equation derived from the stress response of an elastic solid and viscous fluid is not shown or described.

    3. Reviewer #2 (Public review):

      Summary:

      Malboubi et al. present a novel experimental framework to investigate the rheological properties of the cell cytoplasm. Their findings support a model where the cytoplasm behaves as a poroelastic material governed by Darcy's law - a property overlooked in previous literature. They demonstrate that this poroelastic behavior delays the equilibration of hydrostatic pressure gradients within the cytoplasm over timescales of 1 to 10 seconds following a perturbation, likely due to fluid-solid friction within the cytoplasmic matrix. Furthermore, under sustained perturbations such as depressurization, they reveal that pressure gradients can persist for minutes, which they propose might potentially influence physiological processes like mechanotransduction or cell migration typically happening on these timescales.

      Strengths:

      This article holds significant value within the ongoing efforts of the cell biology and biophysics communities to quantitatively characterize the mechanical properties of cells. The experiments are innovative and thoughtfully contextualized with quantitative estimates and a finite element model that supports the authors' hypotheses.

      Comments & Questions:

      While the hypothesis of a poroelastic cytoplasm is insightful and supported by the results, certain parts of the paper (detailed below) rely on qualitative arguments. Given the experimental approaches and accompanying modeling, the study has the potential for more in-depth discussions and stronger quantitative evidence. Placing greater emphasis on quantifications and direct comparisons between the model and experimental data would enhance the work. Additionally, exploring the limitations of the proposed model would add valuable depth to the paper.

      The authors state, "Next, we sought to quantitatively understand how the global cellular response to local indentation might arise from cellular poroelasticity." However, the evidence presented in the following paragraph appears more qualitative than strictly quantitative. For instance, the length scale estimate of ~7 μm is only qualitatively consistent with the observed ~10 μm, and the timescale 𝜏𝑧 ≈ 500 ms is similarly described as "qualitatively consistent" with experimental observations. Strengthening this point would benefit from more direct evidence linking the short timescale to cell surface tension. Have you tried perturbing surface tension and examining its impact on this short-timescale relaxation by modulating acto-myosin contractility with Y-27632, depolymerizing actin with Latrunculin, or applying hypo/hyperosmotic shocks?

      The authors demonstrate that the second relaxation timescale increases (Figure 1, Panel D) following a hyperosmotic shock, consistent with cytoplasmic matrix shrinkage, increased friction, and consequently a longer relaxation timescale. While this result aligns with expectations, is a seven-fold increase in the relaxation timescale realistic based on quantitative estimates given the extent of volume loss?

      If the authors' hypothesis is correct, an essential physiological parameter for the cytoplasm could be the permeability k and how it is modulated by perturbations, such as volume loss or gain. Have you explored whether the data supports the expected square dependency of permeability on hydraulic pore size, as predicted by simple homogeneity assumptions? Additionally, do you think that the observed decrease in k in mitotic cells compared to interphase cells is significant? I would have expected the opposite naively as mitotic cells tend to swell by 10-20 percent due to the mitotic overshoot at mitotic entry (see Son Journal of Cell Biology 2015 or Zlotek Journal of Cell Biology 2015).

      Based on your results, can you estimate the pore size of the poroelastic cytoplasmic matrix? Is this estimate realistic? I wonder whether this pore size might define a threshold above which the diffusion of freely diffusing species is significantly reduced. Is your estimate consistent with nanobead diffusion experiments reported in the literature?

      Do you have any insights into the polymer structures that define this pore size? For example, have you investigated whether depolymerizing actin or other cytoskeletal components significantly alters the relaxation timescale?

      There are no quantifications in Figure 6, nor is there a direct comparison with the model. Based on your model, would you expect the velocity of bleb growth to vary depending on the distance of the bleb from the pipette due to the local depressurization? Specifically, do blebs closer to the pipette grow more slowly?

      I find it interesting that during depressurization of the interphase cells, there is no observed volume change, whereas in pressurization of metaphase cells, there is a volume increase. I assume this might be a matter of timescale, as the microinjection experiments occur on short timescales, not allowing sufficient time for water to escape the cell. Do you observe the radius of the metaphase cells decreasing later on? This relaxation could potentially be used to characterize the permeability of the cell surface.

      I am curious about the saturation of the time lag at 30 microns from the pipette in Figure 4, Panel E for the model's prediction, a saturation that is not clearly observed in the experimental data. Could you comment on the origin of this saturation and the observed discrepancy with the experiments (Figure E panel 2)? Naively, I would have expected the time lag to scale quadratically with the distance from the pipette, as predicted by a poroelastic model and the diffusion of displacement. It seems odd to me that the beads start to move together at some distance from the pipette; otherwise I would expect that they simply stop moving. What model parameters influence this saturation? Does membrane permeability contribute to this saturation?
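
      The quadratic expectation mentioned above follows from the usual diffusive scaling t ~ L^2 / D_p. The sketch below only illustrates that scaling: the value of `D_P` is a hypothetical placeholder rather than a number from the manuscript, and prefactors, cell geometry, and membrane permeability are all ignored.

```python
# Sketch of the diffusive scaling expected for a poroelastic medium:
# time lag ~ distance^2 / D_p, with order-one prefactors ignored.

def poroelastic_time_lag(distance_um, diffusion_um2_per_s):
    """Return the scaling estimate t ~ L^2 / D_p in seconds."""
    if diffusion_um2_per_s <= 0:
        raise ValueError("poroelastic diffusion constant must be positive")
    return distance_um ** 2 / diffusion_um2_per_s


D_P = 100.0  # hypothetical poroelastic diffusion constant, in um^2/s

for distance in (5, 10, 20, 30):
    lag = poroelastic_time_lag(distance, D_P)
    print(f"{distance:>3} um -> {lag:.2f} s")
```

      Under this scaling, doubling the distance quadruples the lag, so a plateau at large distances would have to come from something outside pure poroelastic diffusion, which is the point of the question.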

    4. Reviewer #3 (Public review):

      Summary:

      In this delightful study, the authors use local indentation of the cell surface combined with out-of-focus microscopy to measure the rates of pressure spread in the cell and to argue that the results can be explained with the poroelastic model. Osmotic shock that decreases cytoskeletal mesh size supports this notion. Experiments with water injection and water suction further support it, and also, together with a mechanical model and elegant measurements of decreasing fluorescence in the cell 'flashed' by external flow, demonstrate that the membrane is permeable, and that steady flow and pressure gradient can exist in a cell with water source/sink in different locations. Use of blebs as indicators of the internal pressure further supports the notion of differential cytoplasmic pressure.

      Strengths:

      The study is very imaginative, interesting, novel and important.

      Weaknesses: I have two broad critical comments:

      (1) I sense that the authors are correct that the best explanation of their results is the passive poroelastic model. Yet, to be thorough, they have to try to explain the experiments with other models and show why their explanation is parsimonious. For example, one potential explanation could be some mechanosensitive mechanism that does not involve cytoplasmic flow; another could be viscoelastic cytoskeletal mesh, again not involving poroelasticity. I can imagine more possibilities. Basically, be more thorough in the critical evaluation of your results. Besides, discuss the potential effect of significant heterogeneity of the cell.

      (2) The study is rich in biophysics but a bit light on chemical/genetic perturbations. It could be good to use low levels of chemical inhibitors for, for example, Arp2/3, PI3K, myosin etc, and see the effect and try to interpret it. Another interesting question - how adhesive strength affects the results. A different interesting avenue - one can perturb aquaporins. Etc. At least one perturbation experiment would be good.

    1. eLife Assessment

      Alignment and sequencing errors are a major concern in molecular evolution, and this valuable study represents a welcome improvement for genome-wide scans of positive selection. This new method seems to perform well and is generally convincing, although the evidence could be made more direct and more complete through additional simulations to determine the extent to which alignment errors are being properly captured.

    2. Reviewer #1 (Public review):

      Summary:

      Selberg et al. present a small but apparently very relevant modification to the existing BUSTED model. The new model allows for a fraction of codons to be assigned to an error class characterized by a very high dN/dS value. This "omega_e" category is constrained to represent no more than 1% of the alignment. The analyses convincingly show that the method performs well and represents a real improvement for genome-wide scans of positive selection. Alignment and sequencing errors are a major concern in molecular evolution. This new method, which shows strong performance, is therefore a very welcome contribution.
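
      The idea of an error class can be sketched as one extra component in the omega mixture. This is an illustration only, not the HyPhy implementation: the function name, the example rate classes, and the `MIN_ERROR_OMEGA` bound are assumptions, while the 1% cap on the error weight comes from the summary above.

```python
# Sketch: extend a dN/dS (omega) mixture with a capped "error" class.

MAX_ERROR_WEIGHT = 0.01  # error class limited to 1% of the alignment (from the text)
MIN_ERROR_OMEGA = 100.0  # hypothetical lower bound for a "very high" dN/dS


def add_error_class(omegas, weights, omega_e, weight_e):
    """Return (omegas, weights) for the mixture extended with an error class.

    Existing weights are rescaled so that the full mixture still sums to 1.
    """
    if len(omegas) != len(weights):
        raise ValueError("omegas and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if not 0.0 <= weight_e <= MAX_ERROR_WEIGHT:
        raise ValueError("error-class weight must lie in [0, 0.01]")
    if omega_e < MIN_ERROR_OMEGA:
        raise ValueError("error-class omega must be very large")
    scaled = [w * (1.0 - weight_e) for w in weights]
    return list(omegas) + [omega_e], scaled + [weight_e]


# Hypothetical three-class mixture: purifying, neutral, positive selection.
omegas, weights = add_error_class(
    [0.1, 1.0, 3.0], [0.80, 0.15, 0.05], omega_e=1000.0, weight_e=0.005
)
print(omegas)
print(weights)
```

      Capping the weight is what keeps the extra class from absorbing genuine, widespread positive selection; it cannot by itself separate a misaligned codon from a rare episode of very strong selection, which is the limitation the review notes.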

      Strengths:

      By thoroughly reanalyzing four datasets, the manuscript convincingly demonstrates that omega_e effectively identifies genuine alignment errors. Next, the authors evaluate the reduction in power to detect true selection through simulations. This new model is simple, efficient, and computationally fast. It is already implemented and available in HYPHY software.

      As a side note, I found it particularly interesting how the authors tested the statistical support for the new method compared to the simpler version without the error class. In many cases, the simpler model could not be statistically rejected in favor of the more complex model, despite producing biologically incorrect results in terms of parameter inference. This highlights a broader issue in molecular evolution and phylogenomics, where model selection often relies too heavily on statistical tests, potentially at the expense of biological realism. The analyses also reveal a trade-off between statistical power and the false positive rate. As with other methods, BUSTED-E cannot distinguish between alignment/sequencing errors and episodes of very strong positive selection. The authors are transparent about this limitation in the discussion.

      Weaknesses:

      Regarding the structure of the manuscript, the text could be clearer and more precise. Clear, practical recommendations for users could also be provided in the Results section. Additionally, the simulation analyses could be further developed to include scenarios with both alignment errors and positive selection, in order to better assess the method's performance. Finally, the model is evaluated only in the context of site models, whereas the widely used branch-site model is mentioned as possible but not assessed.

    3. Reviewer #2 (Public review):

      Summary:

      In this paper, Selberg et al present an extension of their widely used BUSTED family of codon models for the detection of episodic ("site-branch") positive selection from coding gene sequences. The extension adds an "error component" to ω (dN/dS) to capture misaligned codons. This ω component is set to an arbitrarily high value to distinguish it from positive selection, which is characterised by ω > 1 but assumed not to be so high.

      The new method is tested on several datasets of comparative genomes, characterised by their size and the fact that the authors scanned for positive selection and/or provided filtering of alignment quality. It is also tested on simple simulations.

      Overall, the new method appears to capture relatively little of the ω variability in the alignments, although it is often significant. Given the complexity of codon evolution, adding a new parameter is more or less significant, and the question is whether it captures the signal that is intended, preferably in an unbiased manner.

      Strengths:

      This is an important issue, and I am enthusiastic to see it explicitly modeled within the codon modeling framework, rather than externalised to ad hoc filtering methods. The promise of quantifying the divergence signal from alignment error vs selection is exciting.

      The BUSTED family of models is widely used and very powerful for capturing many aspects of codon evolution, and it is thus an excellent choice for this extension.

      Weaknesses:

      (1) The definition of alignment error by a very large ω is not justified anywhere in the paper. There are known cases of bona fide positive selection with many non-synonymous and 0 synonymous substitutions over branches. How would they be classified here? E.g., lysozyme evolution, bacterial experimental evolution.

      Using the power of the model family that the authors develop, I would suggest characterising a more specific error model. E.g., radical amino-acid "changes" clustered close together in the sequence, proximity to gaps in the alignment, correlation of apparent ω with genome quality.

      Also concerning this high ω, how sensitive is its detection to computational convergence issues?

      (2) The authors should clarify the relation between the "primary filter for gross or large-scale errors" and the "secondary filter" (this method). Which sources of error are expected to be captured by the two scales of filters? What is their respective contribution to false positives of positive selection?

      Sources of error in the alignment of coding genes include:

      a) Errors in gene models, which may differ between species but also propagate among close species (i.e., when one species is used as a reference to annotate others).

      b) Inconsistent choice of alternative transcripts/isoforms.

      Both of these lead to asking an alignment algorithm to align non-homologous sequences, which violates the assumptions of the algorithms, yet both are common issues in phylogenomics.

      c) Sequencing errors, but I doubt they affect results much here.

      d) Low complexity regions of proteins.

      e) Approximations by alignment heuristics, sometimes non-deterministic or dependent on input order.

      f) Failure to capture aspects of protein or gene evolution in the optimality criteria used.

      For example, Figure 1 seems to correspond to a wrong or inconsistent definition of the final exon of the gene in one species, which I would expect to be classified as "gross or large-scale error".

      (3) The benchmarking of the method could be improved both for real and simulated data.

      For real data, the authors only analysed sequences from land vertebrates with relatively low Ne and thus relatively low true positive selection. I suggest comparing results with e.g. Drosophila genomes, where it has been reported that 50% of all substitutions are fixed by positive selection, or with viral evolution.

      For simulations, the authors should present simulations with or without alignment errors (e.g., introduce non-homologous sequences, or just disturb the alignments) and with or without positive selection, to measure how much the new method correctly captures alignment errors and incorrect positive selection.

      I also recommend simulating under more complex models, such as multinucleotide mutations or strong GC bias, and investigating whether these other features are captured by the alignment error component.

      Finally, I suggest taking true alignments and perturbing them (e.g., add non-homologous segments or random gaps which shift the alignment locally), to verify how the method catches this. It would be interesting to apply such perturbations to genes which have been reported as strong examples of positive selection, as well as to genes with no such evidence.

      (4) It would be interesting to compare to results from the widely used filtering tool GUIDANCE, as well as to the Selectome database pipeline (https://doi.org/10.1093/nar/gkt1065). Moreover, the inconsistency between BUSTED-E and HMMCleaner, and BMGE is worrying and should be better explained.

      (5) For a new method such as this, I would like to see p-value distributions and q-q plots, to verify how unbiased the method is, and how well the chi-2 distribution captures the statistical value.
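
      The q-q check requested in point (5) can be sketched as follows. This is a generic diagnostic, not part of the authors' pipeline: it assumes a list of p-values from null simulations is available and compares them with uniform quantiles on a -log10 scale, using (i - 0.5) / n as one common choice of plotting position.

```python
import math


def uniform_qq_points(p_values):
    """Return (expected, observed) -log10 p-value pairs for a q-q plot.

    Under a well-calibrated test, null p-values are uniform on (0, 1],
    so the points should fall close to the diagonal.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    n = len(p_values)
    points = []
    for i, p in enumerate(sorted(p_values), start=1):
        expected = (i - 0.5) / n  # plotting position for the i-th smallest value
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Toy input that sits exactly on the diagonal.
for expected, observed in uniform_qq_points([0.625, 0.125, 0.875, 0.375]):
    print(f"{expected:.3f}  {observed:.3f}")
```

      Points that fall systematically above the diagonal would indicate an anti-conservative test, and points below it a conservative one.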

      (6) I disagree with the motivation expressed at the beginning of the Discussion: "The imprimatur of "positive selection" has lost its luster. Researchers must further refine prolific candidate lists of selected genes to confirm that the findings are robust and meaningful." Our goal should not be to find a few impressive results, but to measure accurately natural selection, whether it is frequent or rare.

    4. Author response:

      eLife Assessment

      Alignment and sequencing errors are a major concern in molecular evolution, and this valuable study represents a welcome improvement for genome-wide scans of positive selection. This new method seems to perform well and is generally convincing, although the evidence could be made more direct and more complete through additional simulations to determine the extent to which alignment errors are being properly captured.

      We thank the editors for their positive assessment and for highlighting the core strength and a key area for improvement. The main request (also echoed by both reviewers) is for us to conduct additional simulation studies where true alignment errors are known and assess the performance of BUSTED-E. We plan to conduct several simulations (on the order of 100,000 individual alignments in total) in response to that request, with the caveat that we are not aware of any tools that simulate realistic alignment errors, so these simulations are likely only a pale reflection of biological reality.

      (1) Ad hoc small local edits of alignments similar to what was implemented in the HMMCleaner paper. These local edits would include operations like replacement of codons or small stretches of sequences with random data, local transposition, inversion.

      (a) Using parametrically simulated alignments (under BUSTED models).

      (b) Using empirical alignments.

      (2) Simulations under model misspecification, specifically to address the point of reviewer 2. For example, we would simulate under models that allow for multi-nucleotide substitutions, and then apply error filtering under models which do not.
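
      One of the local edits listed in item (1), replacing a short stretch of codons with random data, might look like the sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, the standard genetic code is assumed when excluding stop codons, gap codons are left untouched, and it is not the procedure used in the HMMCleaner paper.

```python
import random

NUCLEOTIDES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def random_sense_codon(rng):
    """Draw a random codon, rejecting stop codons."""
    while True:
        codon = "".join(rng.choice(NUCLEOTIDES) for _ in range(3))
        if codon not in STOP_CODONS:
            return codon


def replace_codon_stretch(sequence, start_codon, n_codons, rng):
    """Replace n_codons codons, starting at start_codon, with random codons.

    The sequence is one aligned, in-frame row of a codon alignment;
    gap codons ("---") are kept so the alignment length is unchanged.
    """
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    if start_codon < 0 or n_codons < 0 or start_codon + n_codons > len(codons):
        raise ValueError("stretch falls outside the sequence")
    for i in range(start_codon, start_codon + n_codons):
        if codons[i] != "---":
            codons[i] = random_sense_codon(rng)
    return "".join(codons)


rng = random.Random(0)  # fixed seed so the perturbation is reproducible
print(replace_codon_stretch("ATGGCC---AAATTT", 1, 3, rng))
```

      Recording the perturbed positions alongside each alignment gives the ground truth needed to score whether the error class flags the right codons.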

      We will also run several new large-scale screens of existing alignments, to directly and indirectly address the reviewers' comments. These will include:

      (a) A Drosophila dataset (from https://academic.oup.com/mbe/article/42/4/msaf068/8092905)

      (b) Current Selectome data (https://selectome.org/), both filtered and unfiltered. Here the filtering procedure refers to what Selectome does to obtain what its authors think are high quality alignments.

      (c) Current OrthoMam data (https://orthomam.mbb.cnrs.fr/), both filtered and unfiltered. Here the filtering procedure refers to what OrthoMam does to obtain what its authors think are high quality alignments.

      Reviewer #1:

      We are grateful to Reviewer #1 for their positive and encouraging review. We are pleased they found our analyses convincing and recognized BUSTED-E as a "simple, efficient, and computationally fast" improvement for evolutionary scans.

      Strengths:

      As a side note, I found it particularly interesting how the authors tested the statistical support for the new method compared to the simpler version without the error class. In many cases, the simpler model could not be statistically rejected in favor of the more complex model, despite producing biologically incorrect results in terms of parameter inference. This highlights a broader issue in molecular evolution and phylogenomics, where model selection often relies too heavily on statistical tests, potentially at the expense of biological realism.

      We agree that this observation touches upon a critical issue in phylogenomics. A statistically "good" fit does not always equate to a biologically accurate model. We believe our work serves as a useful case study in this regard. We will add discussion of the importance of considering biological realism alongside statistical adequacy in model selection.

      Weaknesses:

      Regarding the structure of the manuscript, the text could be clearer and more precise.

      We appreciate this feedback. We will perform a thorough revision of the entire manuscript to improve its clarity, flow, and precision. We will focus on streamlining the language and ensuring that our methodological descriptions and results are as unambiguous as possible.

      Clear, practical recommendations for users could also be provided in the Results section.

      To make our method more accessible and its application more straightforward, we will add a new section that provides clear, practical recommendations for users. This includes guidance on when to apply BUSTED-E, how to interpret its output, and best practices for distinguishing potential errors from strong selection.

      Additionally, the simulation analyses could be further developed to include scenarios with both alignment errors and positive selection, in order to better assess the method's performance.

      We will conduct additional simulations (see above).

      Finally, the model is evaluated only in the context of site models, whereas the widely used branch-site model is mentioned as possible but not assessed.

      BUSTED class models support branch-site variation in dN/dS, so technically all of our analyses are already branch-site. However, we interpret the reviewer’s comment as describing use cases when a method is used to test for selection on a subset of tree branches (as opposed to the entire tree). BUSTED-E already supports this ability, and we will add a section in the manuscript describing how this type of testing can be done, including examples. However, we do not plan to conduct additional extensive data analyses or simulations, as this would probably bloat the manuscript too much.

      Reviewer #2:

      We thank Reviewer #2 for their detailed and thought-provoking comments, and for their enthusiasm for modeling alignment issues directly within the codon modeling framework. The criticisms raised are challenging and we will work on improving the justification, testing, and contextualization of our method.

      Weaknesses:

      The definition of alignment error by a very large ω is not justified anywhere in the paper... I would suggest characterising a more specific error model. E.g., radical amino-acid "changes" clustered close together in the sequence, proximity to gaps in the alignment, correlation of apparent ω with genome quality... Also concerning this high ω, how sensitive is its detection to computational convergence issues?

      This is a fundamental point that we are grateful to have the opportunity to clarify. Our intention with the high ω category is not to provide a mechanistic or biological definition of an alignment error. Rather, its purpose is to serve as a statistical "sink" for codons exhibiting patterns of divergence so extreme that they are unlikely to have resulted from a typical selective process. It is phenomenological and ad hoc. The reviewer makes sensible suggestions for other ad hoc/empirical approaches to alignment quality filtering, but most of those have already been implemented in existing (excellent) alignment filtering tools. BUSTED-E was never meant to replace them, but rather to catch what is left over. Importantly, error detection is not even the primary goal of BUSTED-E; errors are treated as a statistical nuisance. With all due respect, all of the reviewer's suggestions are similarly ad hoc -- there is no rigorous quantitative justification for any of them, but they are all sensible and plausible, and usually work in practice.

      Computational convergence issues can never be fully dismissed, but we do not consider this to be a major issue. Our approach already pays careful attention to proper initialization, performs convergence checks, and considers multiple initial starting points. We also don’t need to estimate large ω with any degree of precision; it just needs to be “large”.

      The authors should clarify the relation between the "primary filter for gross or large-scale errors" and the "secondary filter" (this method). Which sources of error are expected to be captured by the two scales of filters?

      We will add discussion and examples to explicitly define the distinct and complementary roles of these filtering stages.

      The benchmarking of the method could be improved both for real and simulated data... I suggest comparing results with e.g. Drosophila genomes... For simulations, the authors should present simulations with or without alignment errors... and with or without positive selection... I also recommend simulating under more complex models, such as multinucleotide mutations or strong GC bias...

      We will add more simulations as suggested (see above). We will also analyze a Drosophila gene alignment from previously published papers.

      It would be interesting to compare to results from the widely used filtering tool GUIDANCE, as well as to the Selectome database pipeline... Moreover, the inconsistency between BUSTED-E and HMMCleaner, and BMGE is worrying and should be better explained.

      Some of the alignments we have analyzed had already been filtered by GUIDANCE. We’ll also run the Selectome data through BUSTED-E: both filtered and unfiltered. We consider it beyond the scope of this manuscript to conduct detailed filtering pipeline instrumentation and side-by-side comparison.

      For a new method such as this, I would like to see p-value distributions and q-q plots, to verify how unbiased the method is, and how well the chi-2 distribution captures the statistical value.

      We will report these values for new null simulations.
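      The calibration check the reviewer asks for can be sketched as below. This is an illustration only: it assumes a χ² reference distribution with 2 degrees of freedom (chosen here because its upper tail has a closed form, not because the source states that BUSTED-E uses it), and the null test statistics are drawn directly from that distribution rather than from fitted alignments.

```python
import math
import random

def lrt_p_value(stat):
    """Upper-tail p-value under a chi-square with 2 degrees of freedom (assumed reference)."""
    return math.exp(-max(stat, 0.0) / 2.0)

def qq_points(p_values):
    """Pair each sorted p-value with its expected quantile under Uniform(0, 1)."""
    n = len(p_values)
    observed = sorted(p_values)
    expected = [(i + 0.5) / n for i in range(n)]
    return list(zip(expected, observed))

def max_deviation(points):
    """Largest vertical distance between the q-q points and the diagonal."""
    return max(abs(e - o) for e, o in points)

# Stand-in for likelihood-ratio statistics from null simulations:
# a chi-square with 2 degrees of freedom is an exponential with mean 2.
rng = random.Random(1)
null_stats = [rng.expovariate(0.5) for _ in range(5000)]
points = qq_points([lrt_p_value(s) for s in null_stats])
deviation = max_deviation(points)
```

      For a well-calibrated test the q-q points lie close to the diagonal, so `deviation` stays small; systematic departures below or above the diagonal would indicate an anti-conservative or conservative test, respectively.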

      I disagree with the motivation expressed at the beginning of the Discussion... Our goal should not be to find a few impressive results, but to measure accurately natural selection, whether it is frequent or rare.

      That’s a philosophical point; at some level, given enough time, every single gene likely experiences some positive selection at some point in the evolutionary past. The practically important question is how to improve the sensitivity of the methods while controlling for ubiquitous noise. We do agree with the sentiment that the ultimate goal is to “measure accurately natural selection, whether it is frequent or rare”. However, we also must be pragmatic about what is possible with dN/dS methods on available genomic data.

    1. eLife Assessment

      In this valuable study, the authors provide a simple yet elegant approach to identifying therapeutic targets that synergize to prevent therapeutic resistance in ovarian cancer using cell lines, data-independent acquisition proteomics, and bioinformatic analysis. The authors convincingly identify several combinations of pharmaceuticals that were able to overcome or prevent therapeutic resistance in culture models of ovarian cancer, a disease with an unmet diagnostic and therapeutic need. However, the extent to which these findings may extend to more complex models of ovarian cancer remains unclear.

    2. Reviewer #1 (Public review):

      Summary:

      The authors provide a simple yet elegant approach to identifying therapeutic targets that synergize to prevent therapeutic resistance using cell lines, data-independent acquisition proteomics, and bioinformatic analysis. The authors identify several combinations of pharmaceuticals that were able to overcome or prevent therapeutic resistance in culture models of ovarian cancer, a disease with an unmet diagnostic and therapeutic need.

      Strengths:

      The manuscript utilizes state-of-the-art proteomic analysis, entailing data-independent acquisition methods, an approach that maximizes the robustness of identified proteins across cell lines. The authors focus their analysis on several drugs under development for the treatment of ovarian cancer and utilize straightforward thresholds for identifying proteomic adaptations across several drugs on the OVSAHO cell line. The authors utilized three independent and complementary approaches to predicting drug synergy (NetBox, GSEA, and Manual Curation). The drug combination with the most robust synergy across multiple cell lines was the inhibition of MEK and CDK4/6 using PD-0325901+Palbociclib, respectively. Additional combinations were also identified, including PARPi (rucaparib) and the fatty acid synthase inhibitor (TVB-2640). Collectively, this study provides important insight and exemplifies a solid approach to identifying drug synergy without large drug library screens.

      Weaknesses:

      The authors support their findings by describing the biological function(s) of targets using referenced literature. While this is valuable, the number of downstream targets for each initial target is extensive; thus, the current work does not attempt to elucidate the mechanism of their drug synergy. Responses to drugs are quantified 72 hours after treatment and exclusively focused on cell viability and protein expression levels. The discovery phase of experimentation was solely performed on the OVSAHO cell line. An additional cell line(s) would increase the impact of how the authors went about identifying synergistic targets using bioinformatics. Ovarian cancer is elusive to treatment as primary cancer will form spheroids within ascites/peritoneal fluids in a state of pseudo-senescence to overcome environmental stress. The current manuscript is executed in 2D culture, which has been demonstrated to deviate from 3D, PDX, and primary tumours in terms of therapeutic resistance (DOI: 10.3390/cancers13164208). Collectively, the manuscript is insufficient in providing additional mechanistic insight beyond the literature, and its interpretation of data is limited to 2D culture until further validated.

    3. Reviewer #2 (Public review):

      Summary:

      Franz and colleagues combined proteomics analysis of OVSAHO cell lines treated with 6 individual drugs. The quantitative proteomics data were then used for computational analysis to identify candidates/modules that could be used to predict combination treatments for specific drugs.

      Strengths:

      The authors present solid proteomics data and computational analysis to effectively repeat at the proteomics level analyses that have previously been done predominantly with transcriptional profiling. Since most drugs either target proteins and/or proteins are the functional units of cells, this makes intuitive sense.

      Weaknesses:

      Considering the available resources of the involved teams, performing the initial analysis in a single HGSC cell line is certainly a weakness/limitation.

      The data also shows how challenging it is to correctly predict drug combinations. In Table 2 (if I read it correctly), the majority of the drug combinations predicted for the initial cell line OVSAHO did not result in the predicted effect. It also shows how variable the response was in the different HGSC cell lines used for the combination treatment. The success rate will most likely continue to drop as more sophisticated models are being used (i.e., PDX). Human patients are even more challenging.

      It would most likely be useful to more directly mention/discuss these caveats in the manuscript.

    1. eLife Assessment

      This is a valuable study that suggests that HPV-human DNA junctions can be identified from cfDNA in women with cervical cancer and that detection of these junctions is indicative of recurrence. The evidence supporting the conclusions is incomplete, in part because the numbers of reads identifying breakpoints in tumor samples or in circulating cell-free serum samples are not provided. More quantitative analysis will be required to confirm that the breakpoints represented in cell-free DNA can be used as a surrogate to monitor the recurrence of cervical cancer cells, and additional patient studies would also be needed to strengthen the study. This work will be of interest to those who study and treat cervical cancer as well as other HPV-related malignancies.

    2. Reviewer #1 (Public review):

      Van Arsdale and colleagues evaluated whether human-HPV DNA junctions could be detected in serum cell-free DNA from 16 patients with cervical cancer by hybrid capture and Illumina sequencing. Junctions were identified in seven patients, and these junctions were concordant with junctions identified in tumor DNA except for one patient, suggesting that, in most cases, the cfDNA originates from a clone of the primary tumor. Junction detection at 6 months was found to be a statistically significant prognostic indicator of recurrence. The study further validates that type-specific E7 DNA, which is essential for tumorigenesis, was detectable by PCR in most patient sera, but had no association with recurrence. Furthermore, the study provides additional evidence, from the study's patients as well as a reanalysis of TCGA data, that tumors harboring non-alpha-9 clade HPVs had shorter recurrence-free survival and overall worse outcome. However, these findings need to be more extensively discussed in the context of previous publications. One identified limitation of this approach is the detection of non-tumor HPVs, but this was only seen in one patient. The major shortcoming of this study is the limited number of patients that were evaluated, but for a retrospective study, this is a reasonable number of patients evaluated, and the findings are appropriately not overstated. The design, execution, and detailed analysis of the sequencing data are a major strength. This study provides important foundational evidence for further evaluating the clinical utility of HPV DNA detection from cfDNA and specifically assessing for integration junctions.

    3. Reviewer #2 (Public review):

      Summary:

      The authors set out to identify cell-free HPV breakpoint junctions and assess their utility in identifying cervical cancer recurrence as a surrogate, tumor-specific assay. They added unrelated findings about a potential relationship between various viral types and cancer recurrence frequencies, concluding that clade alpha 9 types recurred at a lower rate than did non-alpha 9 viral types.

      Strengths:

      The authors analyzed 16 cervical cancer samples and matched serum samples collected initially or upon clinical treatments. An association between virus types and cancer recurrence frequencies is a novel finding that will likely induce further insights into HPV pathogenic mechanisms.

      Weaknesses:

      The main claims of this manuscript are only partially supported by the data as presented, because the sequencing data are not quantified and were not analyzed in a statistically adequate way. First, only one or at most two breakpoints are presented per tumor (Table 1). This finding is discrepant from many extensive, published genomics studies of HPV-positive cancers, in which many unique breakpoints are found frequently in individual cancers, ranging from 1 or 2 up to more than 100. Second, no information is provided about likely correlations between genomic DNA copy number at rearranged loci and breakpoint-identifying sequencing read counts. Third, no direct comparison is presented between supporting read counts from cancer samples and read counts from circulating cell-free DNA samples. Fourth, many of the initial cancer samples harbored no insertional breakpoints, so no correlation with breakpoints in the serum samples would be possible. Fifth, no mention was made about tumor heterogeneity, where a given breakpoint may not be present in every cell of the tumor. Previous literature about the general topic of using cell-free DNA breakpoints as a surrogate for cancer cells is not cited adequately. Findings about potential correlations between various viral types and variable recurrence rates are not well-supported by the authors' own data, because of the limited sample numbers studied. This section of the paper is relatively unrelated to the main thrust, which is about breakpoint detection.

    1. eLife Assessment

      This study presents important findings on increased ground beetle diversity in strip cropping compared with crop monocultures. Solid methods are used to analyze data from multiple sites with heterogeneous systems of mixed crops, allowing broad conclusions, albeit at the expense of lacking taxonomic specificity. The work will be of interest to all those applying plant diversity treatments to improve the diversity of associated animals in agricultural fields.

    2. Reviewer #3 (Public review):

      Summary: In this paper the authors examined the effects of strip cropping, a relatively new agricultural technique of alternating crops in small strips of several meters wide, on ground beetle diversity. The results show an increase in species diversity (i.e. abundance and species richness) of the ground beetle communities compared to monocultures.

      Strengths: The article is well written; it has an easily readable tone of voice without too much jargon or overly complicated sentence structure. Moreover, as far as reviewing the models in depth without raw data and R scripts allows, the statistical work done by the authors looks good. They have thought carefully about how to handle heterogeneous, unbalanced and taxonomically unspecific yet spatially and temporally correlated field data. The models applied and the model checks performed are appropriate for the data at hand. Combining RDA and PCA axes together is a nice touch. Moreover, after the first round of reviews, the authors have done a great job at rewriting the paper to make it less overstated, more relevant to the data at hand and more solid in the findings. Many of the weaknesses noted in the first review have been dealt with. The overall structure of the paper is good, with a clear introduction, hypotheses, results section and discussion.

      Weaknesses: The weaknesses that remain are mainly due to a difficult dataset and choices that could have stressed certain aspects more, like the relationship between strip cropping and intercropping. The mechanistic understanding of strip cropping is what is at stake here. Does strip cropping behave similarly to intercropping, a technique which has been proven to be beneficial to biodiversity because of added effects due to increased resource efficiency and greater plant species richness?

      Unfortunately, the authors do not go into this in the introduction or otherwise and simply state that they consider strip cropping a form of intercropping.

      I also do not like the exclusive focus on percentages, as these are dimensionless. I think more could have been done to show underlying structure in the data, even after rarefaction.

      A further weakness is a limited embedding into the larger scientific discourses other than providing references. But this may be a matter of style and/or taste.

    3. Author response:

      The following is the authors’ response to the original reviews.

      We thank all reviewers for the highly detailed review and the time and effort which has been invested in this review. It is clear from the reviews that we’ve had the privilege to have our work extensively and thoroughly checked by knowledgeable experts, for which we are very grateful. We have read their perspectives, questions and suggested improvements with great interest. We have reflected on the public review in detail and have included detailed responses below. First, we would like to respond to four main issues pointed out by the editor and reviewers:

      (1) Lack of yield data in the manuscript: Yield data has been collected in most of the sites and years of our study, and these have already been published and cited in our manuscript. In the appendix of our manuscript, we included a table with yield data for the sites and years in which the beetle diversity was studied. These data show that strip cropping does not cause a systematic yield reduction.

      (2) Sampling design clarification: Our paper combines data from trials conducted at different locations and years. On the one hand this allows an analysis of a comprehensive dataset, but on the other hand in some cases this resulted in variations in how data were collected or processed (e.g. taxonomic level of species identification). We have added more details to the sections on sampling design and data analysis to increase clarity and transparency.

      (3) Additional data analysis: In the revised manuscript we present an analysis on the responses of abundances of the 12 most common ground beetle genera to strip cropping. This gives better insight in the variation of responses among ground beetle taxa.

      (4) Restrict findings to our system: We nuanced our findings further and focused more on the implications of our data on ground beetle communities, rather than on agrobiodiversity in a broader sense.

      Below we also respond to the editor and reviewers in more detail.

      Reviewing Editor Comments:

      (1) You only have analyzed ground beetle diversity, it would be important to add data on crop yields, which certainly must be available (note that in normal intercropping these would likely be enhanced as well).

      Most yield data have been published in three previous papers, which we already cited or cite now (one was not yet published at the time of submission). Our argumentation is based on these studies. We had also already included a table in the appendix that showed the yield data that relates specifically to our locations and years of measurement. The finding that strip cropping does not majorly affect yield is based on these findings. We revised the title of our manuscript to remove the explicit focus on yield.

      (2) Considering the heterogeneous data involving different experiments it is particularly important to describe the sampling design in detail and explain how various hierarchical levels were accounted for in the analysis.

      We agree that some important details to our analysis were not described in sufficient detail. Especially reviewer 2 pointed out several relevant points that we did account for in our analyses, but which were not clear from the text in the methods section. We are convinced that our data analyses are robust and that our conclusions are supported by the data. We revised the methods section to make our approach clearer and more transparent.

      (3) In addition to relative changes in richness and density of ground beetles you should also present the data from which these have been derived. Furthermore, you could also analyze and interpret the response of the different individual taxa to strip cropping.

      With our heterogeneous dataset it was quite complicated to show overall patterns of absolute changes in ground beetle abundance and richness, especially for the field-level analyses. As the sampling design was not always the same and occasionally samples were missing, the number of year series that made up a datapoint were different among locations and years. However, we always made sure that for the comparison of a paired monoculture and strip cropping field, the number of year series was always made equal through rarefaction. That is, the number of ground beetle(s) (species) are always expressed as the number per 2 to 6 samples. Therefore, we prefer to stick to relative changes as we are convinced that this gives a fairer representation of our complex dataset.
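      The idea of equalizing sampling effort through rarefaction before comparing paired fields can be illustrated with a short sketch. This is not the authors' analysis code: the function name, the representation of each sample as a set of taxon names, and the number of random draws are assumptions made for the example.

```python
import random
from statistics import mean

def rarefied_richness(samples, k, draws=200, seed=0):
    """Mean number of distinct taxa found in k samples drawn without replacement.

    `samples` is a list of sets, one set of taxon names per pitfall sample
    (a hypothetical data layout chosen for this illustration).
    """
    if not 0 < k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    rng = random.Random(seed)
    counts = []
    for _ in range(draws):
        subset = rng.sample(samples, k)
        counts.append(len(set().union(*subset)))
    return mean(counts)

# Toy paired comparison: the monoculture has 4 samples, the strip-cropped field only 2,
# so both are rarefied to the smaller effort before computing a relative change.
monoculture = [{"A", "B"}, {"B"}, {"C"}, {"A"}]
strip = [{"A", "B", "C"}, {"D"}]
effort = min(len(monoculture), len(strip))
mono_richness = rarefied_richness(monoculture, effort)
strip_richness = rarefied_richness(strip, effort)
relative_change = (strip_richness - mono_richness) / mono_richness
```

      Expressing the result as a relative change per equalized effort is what allows pairs with different numbers of available samples to be combined in one analysis.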

      We agree with the second point that both the editor and several reviewers pointed out. The indicator species analyses that we used were biased by rare species, and we now omit this analysis. Instead, we included a GLM analysis on the responses of abundances of the 12 most common ground beetle genera to strip cropping. We chose genera here (and not species) as we could then include all locations and years within the analyses, and in most cases a genus was dominated by a single species (but notable exceptions were Amara and Harpalus, which were often made up of several species). We still illustrate these analyses in a similar fashion as we did for the indicator species analysis.

      (4) Keep to your findings and don't overstate them but try to better connect them to basic ecological hypotheses potentially explaining them.

      After careful consideration of the important points that reviewers point out, we decided to nuance our reasoning about biodiversity conservation along two key lines: (1) the extent to which ground beetles can be indicators of wider biodiversity changes; and (2) our findings that are not as straightforwardly positive as our narrative suggests. We still believe that strip cropping contributes positively to carabid communities, and have carefully checked the text to avoid overstatements.

      Reviewer #1 (Public review):

      Summary:

      This study demonstrates that strip cropping enhances the taxonomic diversity of ground beetles across organically-managed crop systems in the Netherlands. In particular, strip cropping supported 15% more ground beetle species and 30% more individuals compared to monocultures.

      Strengths:

      A well-written study with well-analyzed data of a complex design. The data could have been analyzed differently e.g. by not pooling samples, but there are pros and cons for each type of analysis and I am convinced this will not affect the main findings. A strong point is that data were collected for 4 years. This is especially strong as most data on biodiversity in cropping systems are only collected for one or two seasons. Another strong point is that several crops were included.

      We thank reviewer 1 for their kind words and agree with this strength of the paper. The paper combines data from trials conducted at different locations and years. On the one hand this allows an analysis of a comprehensive dataset, but on the other hand in some cases there were slight variations in how data were collected or processed (e.g. taxonomic level of species identification).

      Weaknesses:

      This study focused on the biodiversity of ground beetles and did not examine crop productivity. Therefore, I disagree with the claim that this study demonstrates biodiversity enhancement without compromising yield. The authors should present results on yield or, at the very least, provide a stronger justification for this statement.

      We acknowledge that we indeed did not formally analyze yield in our study, but we have good reason for this. The claim that strip cropping does not compromise yield comes from several extensive studies (Juventia & van Apeldoorn, 2024; Ditzler et al., 2023; Carillo-Reche et al., 2023) that were conducted in nearly all the sites and years that we included in our study. We chose not to include formal analyses of productivity for two key reasons: (1) a yield analysis would duplicate already published analyses, and (2) we prefer to focus more on the ecology of ground beetles and the effect of strip cropping on biodiversity, rather than diverging our focus also towards crop productivity. Nevertheless, we have shown the results on yield in Table S6 and refer extensively to the studies that have previously analyzed this data (line 203-207, 217-221).

      Reviewer #1 (Recommendations for the authors):

      This is a well-written study on the effects of strip cropping on ground-beetle diversity. As stated above the study is well analyzed, presented, and written but you should not pretend that you analyzed yield e.g. lines 25-27 "We show that strip cropping...enhance ground beetle biodiversity without incurring major yield loss."

      We understand the confusion caused by this sentence, and it was never our intention to give the impression that we analyzed yield losses. These findings were based on previous research by ourselves and colleagues, and we have now changed the sentence to reflect this (line 25-27).

      I think you assume that yield does not differ between strip cropping and monoculture. I am not sure this is correct as one crop might attract pests or predators spilling over to the other crop. I am also not sure if the sowing and harvest of the crop will come with the same costs. So if you assume this, you should only do it in the main manuscript and not the abstract, to justify this better.

      With three peer-reviewed papers on the same fields as we studied, we can convincingly state that strip cropping in organic agriculture generally does not result in major yield loss, although exceptions exist, which we refer to in the discussion.

      In the introduction lines 28-43, you refer to insect biomass decline. I wonder if you would like to add the study of Loboda et al. 2017 in Ecography. It seems not fitting as it is from the Arctic, but also the other studies you cite are not only coming from agricultural landscapes, and this study is from the same time as the Hallmann et al. 2017 study and shows a decline in flies of 80%.

      We have removed the sentence that this comment refers to, to streamline the introduction more.

      Lines 50-51. You only have one citation for biodiversity strategies in agricultural systems. I suggest citing Mupepele et al. 2021 in TREE. This study refers to management but also the policies and societal pressures behind it.

      We have added this citation and a recent paper by Cozim-Melges et al. (2024) here (line 49-52).

      In the methods, I am missing a section on species identifications. This would help to understand why you used "taxonomic richness".

      Thanks for pointing this out. We have now included a new section on ground beetle identification (line 304-309 in methods).

      Figure 1 is great and I like that you separated the field and crop-level data, although there is no statistical power for the crop-specific data. I personally would move k to the supplements. It is very detailed and small and therefore hard to read.

      We chose to keep figure 1k, as in our view it gives a good impression of the scale of the experiment, the number of crops included and the absolute numbers of caught species.

      Reviewer #2 (Public review):

      Summary:

      The authors aimed to investigate the effects of organic strip cropping on carabid richness and density as well as on crop yields. They find on average higher carabid richness and density in strip cropping and organic farming, but not in all cases.

      We did not intend to investigate the effect of strip cropping on crop yields, but rather place our work in the framework of earlier studies that already studied yield. All the monocultures and strip cropping fields were organic farms. Our findings thus compare crop diversity effects within the context of organic farming.

      Strengths:

      Based on highly resolved species-level carabid data, the authors present estimates for many different crop types, some of them rarely studied, at the same time. The authors did a great job investigating different aspects of the assemblages (although some questions remain concerning the analyses) and they present their results in a visually pleasing and intuitive way.

      We appreciate the kind words of reviewer 2 and their acknowledgement of the extensiveness of our dataset. In our opinion, the inclusion of many different crops is indeed a strength, rarely seen in similar studies; and we are happy that the figures are appreciated.

      Weaknesses:

      The authors used data from four different strip cropping experiments and there is no real replication in space as all of these differed in many aspects (different crops, different areas between years, different combinations, design of the strip cropping (orientation and width), sampling effort and sample sizes of beetles (differing more than 35 fold between sites; L 100f); for more differences see L 237ff). The reader gets the impression that the authors stitched data from various places together that were not made to fit together. This may not be a problem per se but it surely limits the strength of the data as results for various crops may only be based on small samples from one or two sites (it is generally unclear how many samples were used for each crop/crop combination).

      The paper indeed combines data from trials conducted at different locations and years. On the one hand this allows an analysis of a comprehensive dataset, but on the other hand in some cases there were slight differences in the experimental design. At the time that we did our research, there were only a handful of farmers that were employing strip cropping within the Netherlands, which greatly reduced the number of fields for our study. Therefore, we worked in the sites that were available and studied as many crops as possible on these sites. Since there was variation in the crops grown in the sites, for some crops we have limited replication. In the revision we have explained this more clearly (line 297-300).

      One of my major concerns is that it is completely unclear where carabids were collected. As some strips were 3m wide, some others were 6m and the monoculture plots large, it can be expected that carabids were collected at different distances from the plot edge. This alone, however, was conclusively shown to affect carabid assemblages dramatically and could easily outweigh the differences shown here if not accounted for in the models (see e.g. Boetzl et al. (2024) or Knapp et al. (2019) among many other studies on within field-distributions of carabids).

      Point well taken. Samples were always taken at least 10 meters into the field, and always in the middle of the strip. This would indeed mean that there is a small difference between the 3-m and 6-m wide strips regarding distance from another strip, but this was only a difference of 1.5 to 3 meters from the edge, a difference that, based on our own extensive experience with ground beetle communities, will not have a large impact on the ground beetle findings. The distance from field/plot edges was similar between monocultures and strip cropped fields. We present a more detailed description of the sampling design in the methods of the revised manuscript (line 294-297).

      The authors hint at a related but somewhat different problem in L 137ff - carabid assemblages sampled in strips were sampled in closer proximity to each other than assemblages in monoculture fields which is very likely a problem. The authors did not check whether their results are spatially autocorrelated and this shortcoming is hard to account for as it would have required a much bigger, spatially replicated design in which distances are maintained from the beginning. This limitation needs to be stated more clearly in the manuscript.

      To be clear, this limitation relates to the comparison that we did for the community compositions of ground beetles in two crops either in strip cropping or monocultures. In this case, it was impossible to avoid potential autocorrelation due to our field design. We also acknowledge this limitation in the results section (line 130-133). However, for our other analyses we corrected for spatial autocorrelation by including variables per location, year and crop. This grouped samples that were spatially autocorrelated. Therefore, we don’t see this as a shortcoming of our other analyses.

      Similarly, we know that carabid richness and density depend strongly on crop type (see e.g. Toivonen et al. (2022)) which could have biased results if the design is not balanced (this information is missing but it seems to be the case, see e.g. Celeriac in Almere in 2022).

      We agree and acknowledge that crop type can influence carabid richness and density, which is why we have included variables to account for differences caused by crops. However, we did not observe consistent differences between crops in how strip cropping affected ground beetle richness and density. Therefore, we don’t think that crop types would have influenced our conclusions on the overall effect of strip cropping.

      A more basic problem is that the reader neither learns where traps were located, how missing traps were treated for analyses, how many samples there were per crop or crop combination (in a simple way, not through Table S7 - there has to have been a logic in each of these field trials) nor why there are differences in the number of samples from the same location and year (see Table S7). This information needs to be added to the methods section.

      Point well taken. We have clarified this further in the revised manuscript (line 294-301, 318-322). As we combined data from several experimental designs that originally had slightly different research questions, this in part caused differences between numbers of rounds or samples per crop, location or year.

      As carabid assemblages undergo rapid phenological changes across the year, assemblages that are collected at different phenological points within and across years cannot easily be compared. The authors would need to standardize for this and make sure that the assemblages they analyze are comparable prior to analyses. Otherwise, I see the possibility that the reported differences might simply be biased by phenology.

      We agree and we dealt with this issue by using year series instead of using individual samples of different rounds. This approach allowed us to get a good impression of the entire ground beetle community across seasons. For our analyses we had the choice to only include data from sampling rounds that were conducted at the same time, or to include all available data. We chose to analyze all data, and made sure that the number of samples between strip cropping and monoculture fields per location, year and crop was always the same by pooling and rarefaction.

      Surrounding landscape structure is known to affect carabid richness and density and could thus also bias observed differences between treatments at the same locations (lower overall richness => lower differences between treatments). Landscape structure has not been taken into account in any way.

      We did not include landscape structure as there are only 4 sites, which does not allow a meaningful analysis of potential effects of landscape structure. Studying how landscape interacts with strip cropping to influence insect biodiversity would require at least, say, 15 to 20 sites, which was not feasible for this study. However, such an analysis may be possible in an ongoing project (CropMix) which includes many farms that work with strip cropping.

      In the statistical analyses, it is unclear whether the authors used estimated marginal means (as they should) - this needs to be clarified.

      In the revised manuscript we further clarified this point (line 365-366, 373-374).

      In addition, and as mentioned by Dr. Rasmann in the previous round (comment 1), the manuscript, in its current form, still suffers from simplified generalizations that 'oversell' the impact of the study and should be avoided. The authors restricted their analyses to ground beetles and based their conclusions on a design with many 'heterogeneities' - they should not draw conclusions for farmland biodiversity but stick to their system and report what they found. Although I understand the authors have previously stated that this is 'not practically feasible', the reason for this comment is simply to say that the authors should not oversell their findings.

      In the revised manuscript, we nuanced our findings by explaining that strip cropping is a potentially useful tool to support ground beetle biodiversity in agricultural fields (line 33-35).

      Reviewer #2 (Recommendations for the authors):

      In addition to the points stated under 'Weaknesses' above, I provide smaller comments and recommendations:

      Overall comments:

      (i) The carabid images used in the figures were created by Ortwin Bleich and are copyrighted. I could not find him accredited in the acknowledgements; the figure legends simply state that the images were taken from his webpage. Was his permission obtained? This should be stated.

      We have received written permission from Ortwin Bleich for using his pictures in our figures, and have accredited him for this in the acknowledgements (line 455-456).

      (ii) There is a great confusion in the field concerning terminology. The authors here use intercropping and strip cropping, a specific form of intercropping, interchangeably. I advise the authors to stick to strip cropping as it is more precise and avoids confusion with other forms of intercropping.

      We agree with the definitions given by reviewer 2 and had already used them as such in the text. We defined strip cropping in the first paragraph of the introduction and do not use the term “intercropping” after this definition to avoid confusion.

      Comments to specific lines:

      Line 19: While this is likely true, there is so far not enough compelling evidence for such a strong statement blaming agriculture. Please rephrase.

      Changed the sentence to indicate more clearly that it is one of the major drivers, but that the “blame” is not solely on agriculture (line 18-19).

      Line 22: Is this the case? I am aware of strip cropping being used in other countries, many of them in Europe. Why the focus on 'Dutch'?

      Indeed, strip cropping is now being pioneered by farmers throughout Europe. However, in the Netherlands some farmers have been pioneering strip cropping since 2014. We have added this information to indicate that our setting is in the Netherlands and because, in our opinion, it gives a bit more context to our manuscript.

      Line 24: I would argue that carabids are actually not good indicators for overall biodiversity in crop fields as they respond in a very specific way, contrasting with other taxa. It is commonly observed that carabids prefer more disturbed habitats and richness often increases with management intensity and in more agriculturally dominated landscapes - in stark contrast to other taxa like wild bees or butterflies.

      We have reworded this sentence to reflect that they are not necessarily indicators of wide agricultural biodiversity, but that they do hold keystone positions within food webs in agricultural systems (line 23-25).

      Line 31: This statement here is also too strong - carabids are not overall biodiversity and patterns found for carabids likely differ strongly from patterns that would be observed in other taxa. This study is on carabids and the conclusion should thus also refer to these in order to avoid such over-simplified generalizations.

      We agree and have nuanced this sentence to indicate that our findings are only on ground beetles (line 33-35). However, we would like to point out that the statement that “patterns found for carabids likely differ strongly from patterns that would be observed in other taxa” assumes a disassociation between carabids and other taxa.

      Line 41: I am sure the authors are aware of the various methodological shortcomings of the dataset used in Hallmann et al. (2017) which likely led to an overestimation of the actual decline. Analysing the same data, Müller et al. (2023) found that weather can explain fluctuations in biomass just as well as time. I thus advise not putting too much focus on these results here as they seem questionable.

      We have removed this sentence to streamline the introduction, thus no longer mentioning the percentages given by Hallmann et al. (2017).

      Line 46: Surely likely but to my knowledge this is actually remarkably hard to prove. Instead of using the IPBES report here that simply states this as a fact, it would be better to see some actual evidence referenced.

      We removed IPBES as a source and changed this for Dirzo et al. (2014), a review that shows the consequences of biodiversity decline on a range of different ecosystem services and ecological functions (line 45-47).

      Line 52ff: I am not sure whether this old land-sparing vs. land-sharing debate is necessary here. The authors could simply skip it and directly refer to the need of agricultural areas, the dominating land-use in many regions, to become more biodiversity-friendly. It can be linked directly to Line 61 in my opinion which would result in a more concise and arguably stronger introduction.

      After reconsidering, we agree with reviewer 2 that this section was redundant and we have removed the lines on land-sparing vs land-sharing.

      Line 59: Just a note here: this argument is not meaningful when talking about strip cropping in the Netherlands as there is virtually no land left that could be converted (if anything, agricultural land is lost to construction). The debate on land-use change towards agriculture is nowadays mostly focused on the tropics and the Global South.

      We argue that strip cropping could play an important role as a measure that does not necessarily follow the trade-off between biodiversity and agriculture for a context beyond the Netherlands (line 52-58).

      Line 69: Does this statement really need 8 references?

      Line 71: ... and this one 5 additional ones?

      We have removed excess references in these two lines (line 62-66).

      Line 74: But also likely provides the necessary crop continuity for many crop pests - the authors should keep in mind that when practitioners read agricultural biodiversity, they predominantly think of weeds and insect pests.

      We agree with reviewer 2 that agricultural biodiversity is still a controversial topic. However, as the focus in this manuscript is more on biodiversity conservation, rather than pest management, we prefer to keep this sentence as is. In other published papers and future work we focus more on the role of strip cropping for pest management.

      Line 83: Consider replacing 'moments' maybe - phenological stages or development stages?

      Although we understand the point of reviewer 2, we prefer to keep it at moments, as we did not focus on phenological stages and we only wanted to say that we set pitfall traps at several moments throughout the year. However, by placing the pitfall traps at several moments throughout the year, we did capture several phenological stages.

      Line 86: Not only farming practices - there are also massive fluctuations between years in the same crop with the same management due to effects of the weather in the previous reproductive season. Interpreting carabid assemblage changes is therefore not straightforward.

      We absolutely agree that interpreting carabid assemblage is not straightforward, but as we did not study year or crop legacy effects we chose to keep this sentence to maintain focus on our research goals.

      Line 88: 'ecolocal'?

      Typo, should have been ecological. Changed (line 81).

      Line 90: 'As such, they are often used as indicator group for wider insect diversity in agroecosystems' - this is the third repetition of this statement and the second one in this paragraph - please remove. Having worked on carabids extensively myself, I also think that this is not the true reason - they are simply easy to collect passively.

      We agree with the reviewer and have removed this sentence.

      Line 141: I have doubts about the value of the ISA looking at the results. Anchomenus dorsalis is a species extremely common in cereal monoculture fields in large parts of Europe, especially in warmer and drier conditions (H. griseus was likely only returned as it is generally rare and likely only occurred in few plots that, by chance, were strip-cropped). It can hardly be considered an indicator for diverse cropping systems but it was returned as one here (which I do not doubt). This often happens with ISA in my experience as they are very sensitive to the specific context of the data they are run on. The returned species are, however, often not really useable as indicators in other contexts. I thus believe they actually have very limited value. Apart from this, we see here that both monocultures and strip cropping have their indicators, as would likely all crop types. I wonder what message we would draw from this ...

      On close reconsideration, we agree with the reviewer that the ISAs might have been too sensitive to rare species that by chance occur in one of two crop configurations. To still get an idea on what happens with specific ground beetle groups, we chose to replace the ISAs with analyses on the 12 most common ground beetle genera. For this purpose we have added new sections to the methods (line 368-374) and results (line 135-143), replaced figure 2 and table S5, and updated the discussion (line 182-200).

      Line 165: Carabid activity is high when carabids are more active. Carabids can be more active either when (i) there are simply more carabid individuals and/or (ii) when they are starved and need to search more for prey. More carabid activity does thus not necessarily indicate more individuals, it can indicate that there is less prey. This aspect is missing here and should be discussed. It is also not true that crop diversification always increases prey biomass - especially strip cropping has previously been shown to decrease pest densities (Alarcón-Segura et al., 2022). Of course, this is a chicken-egg problem (less pests => less carabids or more carabids => less pests ?) ... this should at least be discussed.

      We have rewritten this paragraph to further discuss activity density in relation to food availability (line 175-185).

      Line 178: These species are not exclusively granivorous - this speculation may be too strong here.

      Line 185: true for all but C. melanocephalus - this species is usually more associated with hedgerows, forests etc.

      After removing the ISAs, we also chose to remove this paragraph and replace it with a paragraph that is linked to the analyses on the 12 most common genera (line 182-200).

      Line 202: These statements are too strong for my taste - the authors should add an 'on average' here. The data show that they likely do not always enhance richness by 15 % and as the authors state, some monocultures still had higher richness and densities.

      “on average” added (line 211)

      Line 203: 'can lead' - the authors cannot tell based on their results if this is always true for all taxa.

      Changed to “can lead” (line 213)

      Line 205: What is 'diversification' here?

      This concerns measures like hedgerows or flower strips. We altered the sentence to make this clearer (line 215-216).

      Line 208: Does this statement need 5 references? (as in the introduction, the reader gets the impression the authors aimed to increase the citation count of other articles here).

      We have removed excess references (line 219-221).

      Line 222: How many are 'a few'? Maybe state a proportion.

      We found only two species; we have changed the sentence accordingly (line 232-233).

      Line 224: As stated above, I would not overstress the results of the ISAs - the authors stated themselves that the result for A. dorsalis is likely only based on one site ...

      We removed this sentence after removing the ISAs.

      Line 305: I think there is an additional nested random level missing - the transect or individual plot the traps were located in (or was there only one replicate for each crop/strip in each experiment)? Hard to tell as the authors provide no information on the actual sample sizes.

      Indeed, there was one field or plot per cropping system per crop per location per year from which all the samples were taken. Therefore the analysis does not miss a nested random level. We provided information on sample sizes in Table S7.

      Line 314ff: The authors describe that they basically followed a (slightly extended) Chao-Hill approach (species richness, Shannon entropy & inverse Simpson) without the sampling effort / sample completeness standardization implemented in this approach and as a reader I wonder why they did not simply just use the customary Chao-Hill approach.

      We were not aware of the Chao-Hill approach, and we see it as a compliment that we independently came up with an approach similar to a now accepted approach.
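
      For readers unfamiliar with the three measures named in the comment above, they correspond to Hill numbers of order 0 (species richness), 1 (the exponential of Shannon entropy) and 2 (inverse Simpson). The following is a minimal sketch, not the code used in the study; the function name and the example counts are hypothetical, and it omits the sample-completeness standardization that the reviewer mentions.

```python
import math

def hill_number(counts, q):
    """Hill number (effective number of species) of order q.

    counts: abundances of each species in one pooled sample.
    q = 0 gives richness, q = 1 exp(Shannon entropy), q = 2 inverse Simpson.
    """
    counts = [c for c in counts if c > 0]
    if not counts:
        return 0.0
    total = sum(counts)
    p = [c / total for c in counts]
    if q == 1:
        # Order 1 is defined as the limit q -> 1: exp of Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1 / (1 - q))

# Hypothetical catches: one even and one strongly dominated community.
even = [10, 10, 10, 10]
dominated = [97, 1, 1, 1]
for q in (0, 1, 2):
    print(q, hill_number(even, q), hill_number(dominated, q))
```

      In the even community all three orders give the same value (four species), whereas in the dominated community the higher orders fall well below the richness, because they down-weight rare species.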

      Line 329: Unclear what was nested in what here - location / year / crop or year / location / crop ?

      For the crop-level analyses, the nested structure was location > year > crop. This nested structure was chosen as every location was sampled across different years and (for some locations) the crops differed among years. However, as we pooled the samples from the same field in the field-level analyses, using the same random structure would have resulted in each individual sampling unit being distinguished as a group. Therefore, the random structure here was only location > year. We explain this now more clearly in lines 329 and 355-357.

      Line 334: I can see why the authors used these distributions but it is presented here without any justification. As a side note: Gamma (with log link) would likely be better for the Shannon model as well (I guess it cannot be 0 or negative ...).

      We explain this now better in lines 360-364.

      Line 341: Why Hellinger and not simply proportions?

      We used Hellinger transformation to give more weight to rarer species. Our pitfall traps were often dominated by large numbers of a few very abundant / active species. If we had used proportions, these species would have dominated the community analyses. We clarified this in the text (line 379-381).
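
      The difference between the two options can be sketched as follows. This is only an illustration with hypothetical counts, not the code used in the study: the Hellinger transformation takes the square root of the row-wise proportions, which compresses the contribution of very abundant species relative to rare ones.

```python
import math

def proportions(counts):
    """Row-wise relative abundances for a samples x species count table."""
    result = []
    for row in counts:
        total = sum(row)
        result.append([c / total if total else 0.0 for c in row])
    return result

def hellinger(counts):
    """Hellinger transformation: square root of the row-wise proportions."""
    return [[math.sqrt(p) for p in row] for row in proportions(counts)]

# Hypothetical pitfall sample: one dominant species and two rare ones.
sample = [[98, 1, 1]]
prop = proportions(sample)[0]
hell = hellinger(sample)[0]

# Ratio of the dominant species to a rare species under each scaling.
print(prop[0] / prop[1])  # ratio of raw proportions
print(hell[0] / hell[1])  # ratio after Hellinger transformation (much smaller)
```

      With plain proportions the dominant species outweighs a rare one by the full ratio of their counts; after the square root this ratio shrinks to its square root, so rare species contribute more to distance-based community analyses.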

      Line 348: An RDA is constrained by the assumptions / model the authors proposed and "forces" the data into a spatial ordination that resembles this model best. As the authors previously used an unconstrained PERMANOVA, it would be better to also use an NMDS that goes along with the PERMANOVA.

      The initial goal of the RDA was not to directly visualize the results of the PERMANOVA, but to show whether an overall crop configuration effect occurred, both for the whole dataset and per location. We have now added NMDS figures to link them to the PERMANOVA and added these to the supplementary figures (fig S6-S8). We also mention this approach in the methods section (line 387-390).

      Line 355f: This is also a clear indication of the strong annual fluctuations in carabid assemblages as mentioned above.

      Indeed.

      Line 361: 'pairwise'.

      Typo, we changed this.

      Line 362: reference missing.

      Reference added (line 405)

      References

      Alarcón-Segura, V., Grass, I., Breustedt, G., Rohlfs, M., Tscharntke, T., 2022. Strip intercropping of wheat and oilseed rape enhances biodiversity and biological pest control in a conventionally managed farm scenario. J. Appl. Ecol. 59, 1513-1523.

      Boetzl, F.A., Sponsler, D., Albrecht, M., Batáry, P., Birkhofer, K., Knapp, M., Krauss, J., Maas, B., Martin, E.A., Sirami, C., Sutter, L., Bertrand, C., Baillod, A.B., Bota, G., Bretagnolle, V., Brotons, L., Frank, T., Fusser, M., Giralt, D., González, E., Hof, A.R., Luka, H., Marrec, R., Nash, M.A., Ng, K., Plantegenest, M., Poulin, B., Siriwardena, G.M., Tscharntke, T., Tschumi, M., Vialatte, A., Van Vooren, L., Zubair-Anjum, M., Entling, M.H., Steffan-Dewenter, I., Schirmel, J., 2024. Distance functions of carabids in crop fields depend on functional traits, crop type and adjacent habitat: a synthesis. Proceedings of the Royal Society B: Biological Sciences 291, 20232383.

      Hallmann, C.A., Sorg, M., Jongejans, E., Siepel, H., Hofland, N., Schwan, H., Stenmans, W., Müller, A., Sumser, H., Hörren, T., Goulson, D., de Kroon, H., 2017. More than 75 percent decline over 27 years in total flying insect biomass in protected areas. PLoS One 12, e0185809.

      Knapp, M., Seidl, M., Knappová, J., Macek, M., Saska, P., 2019. Temporal changes in the spatial distribution of carabid beetles around arable field-woodlot boundaries. Scientific Reports 9, 8967.

      Müller, J., Hothorn, T., Yuan, Y., Seibold, S., Mitesser, O., Rothacher, J., Freund, J., Wild, C., Wolz, M., Menzel, A., 2023. Weather explains the decline and rise of insect biomass over 34 years. Nature.

      Toivonen, M., Huusela, E., Hyvönen, T., Marjamäki, P., Järvinen, A., Kuussaari, M., 2022. Effects of crop type and production method on arable biodiversity in boreal farmland. Agriculture, Ecosystems & Environment 337, 108061.

      Reviewer #3 (Public review):

      Summary:

      In this paper, the authors made a sincere effort to show the effects of strip cropping, a technique of alternating crops in small strips of several meters wide, on ground beetle diversity. They state that strip cropping can be a useful tool for bending the curve of biodiversity loss in agricultural systems as strip cropping shows a relative increase in species diversity (i.e. abundance and species richness) of the ground beetle communities compared to monocultures. Moreover, strip cropping has the added advantage of not having to compromise on agricultural yields.

      Strengths:

      The article is well written; it has an easily readable tone of voice without too much jargon or overly complicated sentence structure. Moreover, as far as reviewing the models in depth without raw data and R scripts allows, the statistical work done by the authors looks good. They have well thought out how to handle heterogeneous, yet spatially and temporally correlated field data. The models applied and the model checks performed are appropriate for the data at hand. Combining RDA and PCA axes together is a nice touch.

      We thank reviewer 3 for their kind words and appreciation for the simple language and analysis that we used.

      Weaknesses:

      The evidence for strip cropping bringing added value for biodiversity is mixed at best. Yes, there is an increase in relative abundance and species richness at the field level, but it is not convincingly shown this difference is robust or can be linked to clear structural and hypothesised advantages of the strip cropping system. The same results could have been used to conclude that there are only very limited signs of real added value of strip cropping compared to monocultures.

      Point well taken. We agree that the effects of strip cropping on carabid beetle communities are subtle and we nuanced the text in the revised version to reflect this. See below for more details on how we revised the manuscript to reflect this point.

      There are a number of reasons for this:

      (1) Significant differences disappear at crop level, as the authors themselves clearly acknowledge, meaning that there are no differences between pairs of similar crops in the strip cropping fields and their respective monoculture. This would mean the strips effectively function as "mini-monocultures".

      This is indeed in line with our conclusions. Based on our data and results, the advantages of strip cropping seem mostly to occur because crops with different communities are now on the same field, rather than that within the strips you get mixtures of communities related to different crops. We discussed this in the first paragraph of the discussion in the original submission (line 161-164).

      The significant relative differences at the field level could be an artifact of aggregation instead of structural differences between strip cropping and monocultures; with enough data points things tend to get significant despite large variance. This should have been elaborated further upon by the authors with additional analyses, designed to find out where differences originate and what it tells about the functioning of the system. Or it should have provided ample reason for cautioning in drawing conclusions about the supposed effectiveness of strip cropping based on these findings.

      We believe that this is a misunderstanding of our approach. In the field-level analyses we pooled samples from the same field (i.e. pseudo-replicates were pooled), resulting in a relatively small sample size of 50 samples. We revised the methods section to better explain this (line 318-322). Therefore, the statement “with enough data points things tend to get significant” is not applicable here.

      (2) The authors report percentages calculated as relative change of species richness and abundance in strip cropping compared to monocultures after rarefaction. This is in itself correct, however, it can be rather tricky to interpret because the perspective on actual species richness and abundance in the fields and treatments is completely lost; the reported percentages are dimensionless. The authors could have provided the average cumulative number of species and abundance after rarefaction. Also, range and/or standard error would have been useful to provide information as to the scale of differences between treatments. This could provide a new perspective on the magnitude of differences between the two treatments which a dimensionless percentage cannot.

      We agree that this would be the preferred approach if we had had a perfectly balanced dataset. However, this approach is not feasible with our unbalanced design and differences in sampling effort. While we acknowledge the limitation of the interpretation of percentages, it does allow reporting relative changes for each combination of location, year and crop. The number of samples on which the percentages were based was always kept equal (through rarefaction) between the cropping systems (for each combination of location, year and crop), but not among crops, years and locations. This approach allowed us to make a better estimation whenever more samples were available, as we did not always have an equal number of samples available between both cropping systems. For example, sometimes we had 2 samples from a strip cropped field and 6 from the monoculture; here we would use rarefaction up to 2 samples (where we would just have a better estimation from the monoculture). In other cases, we had 4 samples in both strip cropped and monoculture fields, and we chose to use rarefaction to 4 samples to get a better estimation altogether. Adding a value for actual richness or abundance to the figures would have distorted these findings, as the variation would be huge (as it would represent the number of ground beetle species per 2 to 6 pitfall samples). Furthermore, the dimension that reviewer 3 describes would thus be “The number of ground beetle species / individuals per 2 to 6 samples”, not a very informative unit either.
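
      The logic of equalizing sample numbers before computing a percentage can be sketched as follows. This is a simplified illustration under our own assumptions rather than the code used for the analyses: each pitfall sample is represented as a set of species, rarefied richness is taken as the mean richness over all ways of pooling k samples, and k is the smaller of the two sample counts. All names and data are hypothetical.

```python
from itertools import combinations

def rarefied_richness(samples, k):
    """Mean number of species over all ways of pooling k of the samples."""
    subsets = list(combinations(samples, k))
    return sum(len(set().union(*subset)) for subset in subsets) / len(subsets)

def relative_change(strip_samples, mono_samples):
    """Percent change in rarefied richness, strip cropping vs. monoculture."""
    # Rarefy both systems to the smaller number of available samples.
    k = min(len(strip_samples), len(mono_samples))
    strip = rarefied_richness(strip_samples, k)
    mono = rarefied_richness(mono_samples, k)
    return 100.0 * (strip - mono) / mono

# Hypothetical data for one location-year-crop combination:
# 2 samples from the strip cropped field, 3 from the monoculture.
strip_samples = [{"a", "b", "c"}, {"b", "d"}]
mono_samples = [{"a", "b"}, {"a", "c"}, {"b"}]

print(relative_change(strip_samples, mono_samples))  # roughly 50 (percent)
```

      In this example the monoculture estimate is averaged over three possible pairs of samples, whereas the strip cropping estimate rests on the single available pair, which mirrors the point that extra samples improve the estimate for one system without changing the unit of comparison.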

      (3) The authors appear to not have modelled the abundance of any of the dominant ground beetle species themselves. Therefore it becomes impossible to assess which important species are responsible (if any) for the differences found in activity density between strip cropping and monocultures and the possible life history traits related reasons for the differences, or lack thereof, that are found. A big advantage of using ground beetles is that many life history traits are well studied and these should be used whenever there is reason, as there clearly is in this case. Moreover, it is unclear which species are responsible for the difference in species richness found at the field level. Are these dominant species or singletons? Do the strip cropping fields contain species that are absent in the monoculture fields and are not the cause of random variation or sampling? Unfortunately, the authors do not report on any of these details of the communities that were found, which makes the results much less robust.

      Thank you for raising this point. We have reconsidered our indicator species analysis and found that it is rather sensitive to rare species and insensitive to changes in common species. Therefore, we have replaced the indicator species analyses with a GLM analysis for the 12 most common genera of ground beetles in the revised manuscript. This allows us to go into more depth on specific traits of the genera whose abundances change depending on the cropping system. In the revised manuscript, we also discuss these common genera in more depth, rather than focusing on rarer species (line 135-143, 182-200 in discussion). Furthermore, we have added information on rarity and habitat preference to the table that shows species abundances per location (Table S2), and mention these aspects briefly in the results (line 145-153).

      (4) In the discussion they conclude that there is only a limited amount of interstrip movement by ground beetles. Otherwise, the results of the crop-level statistical tests would have shown significant deviation from corresponding monocultures. This is a clear indication that the strips function more like mini-monocultures instead of being more than the sum of its parts.

      This is in line with our point in the first paragraph of the discussion and an important message of our manuscript.

      (5) The RDA results show a modelled variable of differences in community composition between strip cropping and monoculture. Percentages of explained variation of the first RDA axis are extremely low, and even then, the effects of location and/or year appear to peek through (Figure S3), even though these are not part of the modelling. Moreover, there is no indication of clustering of strip cropping on the RDA axis, or in fact on the first principal component axis in the larger RDA models. This means the explanatory power of different treatments is also extremely low. The crop-level RDAs show some clustering, but hardly any consistent pattern in either communities of crops or species correlations, indicating that differences between strip cropping and monocultures are very small.

      We agree and we make a similar point in the first paragraph of the discussion (line 160-162).

      Furthermore, there are a number of additional weaknesses in the paper that should be addressed:

      The introduction lacks focus on the issues at hand. Too much space is taken up by facts on insect decline and land sharing vs. land sparing and not enough attention is spent on the scientific discussion underlying the statements made about crop diversification as a restoration strategy. They are simply stated as facts or as hypotheses with many references that are not mentioned or linked to in the text. An explicit link to the results found in the large number of references should be provided.

      We revised the introduction by omitting the land sharing vs. land sparing topic and better linking references to our research findings.

      The mechanistic understanding of strip cropping is what is at stake here. Does strip cropping behave similarly to intercropping, a technique that has been proven to be beneficial to biodiversity because of added effects due to increased resource efficiency and greater plant species richness? This should be the main testing point and agenda of strip cropping. Do the biodiversity benefits that have been shown for intercropping also work in strip cropping fields? The ground beetles are one way to test this. Hypotheses should originate from this and should be stated clearly and mechanistically.

      We agree with the reviewer and have clarified this research direction in the introduction of the revised manuscript (line 66-72).

      One could question how useful indicator species analysis (ISA) is for a study in which predominantly highly eurytopic species are found. These are by definition uncritical of their habitat. Is there any mechanistic hypothesis underlying a suspected difference to be found in preferences for either strip cropping or monocultures of the species that were expected to be caught? In other words, did the authors have any a priori reasons to suspect differences, or has this been an exploratory exercise from which unexplained significant results should be used with great caution?

      Point well taken. We agree that the indicator species analysis has limitations and have therefore replaced it with a GLM analysis for the 12 most common ground beetle genera.

      However, setting these objections aside there are in fact significant results with strong species associations both with monocultures and strip cropping. Unfortunately, the authors do not dig deeper into the patterns found a posteriori either. Why would some species associate so strongly with strip cropping? Do these species show a pattern of pitfall catches that deviate from other species, in that they are found in a wide range of strips with different crops in one strip cropping field and therefore may benefit from an increased abundance of food or shelter? Also, why would so many species associate with monocultures? Is this in any way logical? Could it be an artifact of the data instead of a meaningful pattern? Unfortunately, the authors do not progress along these lines in the methods and discussion at all.

      We thank reviewer 3 for these valuable perspectives. In the revised manuscript, we further explored the species/genera that respond to cropping systems and discuss these findings in more detail (line 182-200 in discussion).

      A second question raised in the introduction is whether the arable fields that form part of this study contain rare species. Unfortunately, the authors do not elaborate further on this. Do they expect rare species to be more prevalent in the strip cropping fields? Why? Has it been shown elsewhere that intercropping provides room for additional rare species?

      The answer is simply no, we did not find more rare species in strip cropping. In the revised manuscript, we added a column for rarity (according to waarneming.nl) in the table showing abundances of species per location (Table S2). We only found two rare species: for one we found only a single individual, and the other was more related to the open habitat created by a failed wheat field. We discuss this in more depth in the revised results (line 145-153).

      Considering the implications the results of this research can have on the wider discussion of bending the curve and the effects of agroecological measures, bold claims should be made with extreme restraint and be based on extensive proof and robust findings. I am not convinced by the evidence provided in this article that the claim made by the authors that strip cropping is a useful tool for bending the curve of biodiversity loss is warranted.

      We believe that strip cropping can be a useful tool because farmers readily adopt it and it can result in modest biodiversity gains without yield loss. However, strip cropping is indeed not a silver bullet (which we also don’t claim). We nuanced the implications of our study in the revised manuscript (line 30-35, 232-237).

      Reviewer #3 (Recommendations for the authors):

      General comments:

      (1) I am missing the R script and data files in the manuscript. This is a serious drawback in assessing the quality of the work.

      Datasets and R scripts will be made available upon completion of the manuscript.

      (2) I have doubts about the clarity of the title. It more or less states that strip cropping is designed in order to maintain productivity. However, the main objective of strip cropping is to achieve ecological goals without losing productivity. I suggest a rethink of the title and what it is the authors want to convey.

      As the title led to false expectations for multiple reviewers regarding analyses on yield, we chose to alter the title and removed any mention of yield.

      (3) Line 22: I would add something along the lines of: "As an alternative to intercropping, strip cropping is pioneered by Dutch farmers... " This makes the distinction and the connection between the two clearer.

      In our opinion, strip cropping is a form of intercropping. We have changed this sentence to reflect this point better. (line 21-22)

      (4) Line 24: "these" should read "they"

      After changing this sentence, this typo is no longer there (line 24).

      (5) Line 34-48. I think this introduction is too long. The paper is not directly about insect decline, so the authors could consider starting with line 43 and summarising 34-42 in one or two sentences.

      Removed a sentence on insect declines here to make the introduction more streamlined.

      (6) Line 51-59. I am not convinced the land sparing - land sharing idea adds anything to the paper. It is not used in the discussion and solicits much discussion in and of itself unnecessary in this paper. The point the authors want to make is not arable fields compared to natural biodiversity, but with increases in biodiversity in an already heavily degraded ecosystem; intensive agriculture. I think the introduction should focus on that narrative, instead of the land sparing-sharing dichotomy, especially because too little attention is spent on this narrative.

      We removed the section on land-sparing vs land-sharing as it was indeed off-topic.

      (7) Line 85. Dynamics is not correctly used here. It should read Ground beetle communities are sensitive.

      Changed accordingly (line 78-79).

      (8) Line 90-91. Here, it should be added that ground beetles are used as indicators for ground-dwelling insect diversity, not wider insect diversity in agricultural systems. In fact, Gerlach et al., the reference included, clearly warn against using indicator groups in a context that is too wide for a single indicator group to cover and Van Klink (2022) has recently shown in a meta-analysis that the correlation between trends in insect groups is often rather poor.

      We removed the sentence that claimed ground beetles to be indicators of general biodiversity, and have focused the text in general more on ground beetle biodiversity, rather than general biodiversity.

      (9) Line 178: was there a high weed abundance measured in the stripcropping fields? Or has there been reports on higher weed abundance in general? The references provided do not appear to support this claim.

      To our knowledge, there is only one paper on the effect of strip cropping on weeds (Ditzler et al., 2023). This paper shows that strip cropping (and more diverse cropping systems) reduces weed cover but increases weed richness and diversity. We mistakenly mentioned that crop diversification increases weed seed biomass, and have changed this to weed seed richness. The paper by Carbonne et al. (2022) indeed doesn’t show an effect of crop diversification on weeds. However, it does show a positive relation between weed seed richness and ground beetle activity density. We have moved this citation to the right place in the sentence (line 172-175).

      (10) Line 279-288. The description of sampling with pitfalls is inadequate. Please follow the guidelines for properly incorporating sufficient detail on pitfall sampling protocols as described in Brown & Matthews 2016.

      We were sadly not aware of this paper prior to the experiments, but have at least added information on all characteristics of the pitfall traps as mentioned in the paper (line 290-294).

      (11) Lines 307-310. What reasoning lies behind the choice to focus on the most beetle-rich monocultures? Do the authors have references for this way of comparing treatments? Is there much variation in the monocultures that solicits this approach? It would be preferable if the authors could elaborate on why this method is used, provide references that it is a generally accepted statistical technique and provide additional assessments of the variation in the data so it can be properly related to more familiar exploratory data analysis techniques.

      We ran two analyses for the field-level richness and abundance. First, we used all combinations of monocultures and strip cropping. However, as strip cropping is made up of (at least) 2 crops, each strip-cropped field had (at least) 2 constituent monocultures. Because including both monocultures would count the comparison with the same strip-cropped field twice, we also ran the analyses again with only those monocultures that had the highest richness and abundance. This choice was made to get a conservative estimate of ground beetle richness increases through strip cropping. We explained this methodology further in the statistical analysis section (line 329-335).
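
      The conservative pairing described above can be sketched as follows. This is an illustrative assumption about how such a selection might be coded, not the authors' script; the record layout, field names, crop names and richness values are all hypothetical.

```python
def conservative_comparisons(fields):
    """Pair each strip-cropped field with its richest constituent monoculture.

    `fields` is a list of dicts with keys:
      "name"         - field identifier
      "strip"        - richness (or abundance) in the strip-cropped field
      "monocultures" - dict mapping crop name to the monoculture value
    Choosing the highest monoculture value gives the smallest, and hence
    most conservative, estimate of any gain from strip cropping.
    """
    pairs = []
    for field in fields:
        crop, mono_value = max(field["monocultures"].items(), key=lambda kv: kv[1])
        pairs.append({
            "field": field["name"],
            "reference_crop": crop,
            "strip": field["strip"],
            "monoculture": mono_value,
            "difference": field["strip"] - mono_value,
        })
    return pairs


# Hypothetical values for two strip-cropped fields.
example = [
    {"name": "F1", "strip": 14, "monocultures": {"wheat": 10, "cabbage": 12}},
    {"name": "F2", "strip": 9, "monocultures": {"potato": 11, "barley": 7}},
]

for pair in conservative_comparisons(example):
    print(pair["field"], pair["reference_crop"], pair["difference"])
```

      Each strip-cropped field then enters the comparison once, which avoids the double counting mentioned in the response.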

      In Figure S6 the order of crop combinations is altered between 2021 on the left and 2022 on the right. This is not helpful to discover any possible patterns.

      We originally chose this order as it represented also the crop rotations, but it is indeed not helpful without that context. Therefore, we chose to change the order to have the same crop combinations within the rows.

    1. eLife Assessment

      This important study investigates how hummingbird hawkmoths integrate stimuli from across their visual field to guide flight behavior. Cue conflict experiments provide solid evidence for an integration hierarchy within the visual field: hawkmoths prioritize the avoidance of dorsal visual stimuli, potentially to avoid crashing into foliage, while they use ventrolateral optic flow to guide flight control. These findings will be of broad interest to enthusiasts of visual neuroscience and flight behavior.

    2. Reviewer #1 (Public review):

      Summary:

      Recent work has demonstrated that the hummingbird hawkmoth, Macroglossum stellatarum, like many other flying insects, uses ventrolateral optic flow cues for flight control. However, unlike in other flying insects, the same stimulus presented in the dorsal visual field elicits a directional response. Bigge et al. use behavioral flight experiments to set these two pathways in conflict in order to understand whether these two pathways (ventrolateral and dorsal) work together to direct flight and, if so, how. The authors characterize the visual environment (the amount of contrast and translational optic flow) of the hawkmoth and find that different regions of the visual field are matched to relevant visual cues in their natural environment, and that the integration of the two pathways reflects a prioritization for generating behavior that supports hawkmoth safety rather than a preference for the visual cue that is more prevalent in the environment.

      Strengths:

      This study creatively utilizes previous findings that the hawkmoth partitions their visual field as a way to examine parallel processing. The behavioral assay is well-established and the authors take the extra steps to characterize the visual ecology of the hawkmoth habitat to draw exciting conclusions about the hierarchy of each pathway as it contributes to flight control.

    3. Reviewer #2 (Public review):

      Summary

      Bigge and colleagues use a sophisticated free-flight setup to study visuo-motor responses elicited in different parts of the visual field in the hummingbird hawkmoth. Hawkmoths have been previously shown to rely on translational optic flow information for flight control exclusively in the ventral and lateral parts of their visual field. Dorsally presented patterns elicit a formerly completely unknown response: instead of using dorsal patterns to maintain straight flight paths, hawkmoths fly, more often, in a direction aligned with the main axis of the pattern presented (Bigge et al, 2021). Here, the authors go further and put ventral/lateral and dorsal visual cues into conflict. They found that the different visuomotor pathways act in parallel, and they identified a 'hierarchy': the avoidance of dorsal patterns had the strongest weight and optic flow-based speed regulation the lowest weight. The authors linked their behavioral results to visual scene statistics in the hawkmoths' natural environment. The partition of ventral and dorsal visuomotor pathways is well in line with differences in visual cue frequencies. The response hierarchy, however, seems to be dominated by dorsal features that are less frequent but presumably highly relevant for the animals' flight safety.

      Strengths

      The data are very interesting and unique. The manuscript provides a thorough analysis of free-flight behavior in a non-model organism that is extremely interesting for comparative reasons (and on its own). These data are both difficult to obtain and very valuable to the field.

      Weaknesses

      While the present manuscript clearly goes beyond Bigge et al, 2021, the advance could have perhaps been even stronger with a more fine-grained investigation of the visual responses in the dorsal visual field. Do hawkmoths, for example, show optomotor responses to rotational optic flow in the dorsal visual field?

      I find the majority of the data, which are also the data supporting the main claims of the paper, compelling. However, the measurements of flight height are less solid than the rest and I think these data should be interpreted more carefully.

    4. Reviewer #3 (Public review):

      The authors have significantly improved the paper in revising to make its contributions distinct from their prior paper. They have also responded to my concerns about quantification and parameter dependency of the integration conclusion. While I think there is still more that could be done in this capacity, especially in terms of the temporal statistics and quantification of the conflict responses, they have made a case for the conclusions as stated. The paper still stands as an important paper with solid evidence a bit limited by these concerns.

      [Editors' note: Due to very minor revisions, the paper was not sent to reviewers for an additional round of review.]

    5. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Recent work has demonstrated that the hummingbird hawkmoth, Macroglossum stellatarum, like many other flying insects, uses ventrolateral optic flow cues for flight control. However, unlike in other flying insects, the same stimulus presented in the dorsal visual field elicits a directional response. Bigge et al. use behavioral flight experiments to set these two pathways in conflict in order to understand whether these two pathways (ventrolateral and dorsal) work together to direct flight and, if so, how. The authors characterize the visual environment (the amount of contrast and translational optic flow) of the hawkmoth and find that different regions of the visual field are matched to relevant visual cues in their natural environment, and that the integration of the two pathways reflects a prioritization for generating behavior that supports hawkmoth safety rather than a preference for the visual cue that is more prevalent in the environment.

      Strengths:

      This study creatively utilizes previous findings that the hawkmoth partitions their visual field as a way to examine parallel processing. The behavioral assay is well-established and the authors take the extra steps to characterize the visual ecology of the hawkmoth habitat to draw exciting conclusions about the hierarchy of each pathway as it contributes to flight control.

      Reviewer #2 (Public review):

      Summary

      Bigge and colleagues use a sophisticated free-flight setup to study visuo-motor responses elicited in different parts of the visual field in the hummingbird hawkmoth. Hawkmoths have been previously shown to rely on translational optic flow information for flight control exclusively in the ventral and lateral parts of their visual field. Dorsally presented patterns elicit a formerly completely unknown response: instead of using dorsal patterns to maintain straight flight paths, hawkmoths fly, more often, in a direction aligned with the main axis of the pattern presented (Bigge et al, 2021). Here, the authors go further and put ventral/lateral and dorsal visual cues into conflict. They found that the different visuomotor pathways act in parallel, and they identified a 'hierarchy': the avoidance of dorsal patterns had the strongest weight and optic flow-based speed regulation the lowest weight. The authors linked their behavioral results to visual scene statistics in the hawkmoths' natural environment. The partition of ventral and dorsal visuomotor pathways is well in line with differences in visual cue frequencies. The response hierarchy, however, seems to be dominated by dorsal features that are less frequent but presumably highly relevant for the animals' flight safety.

      Strengths

      The data are very interesting and unique. The manuscript provides a thorough analysis of free-flight behavior in a non-model organism that is extremely interesting for comparative reasons (and on its own). These data are both difficult to obtain and very valuable to the field.

      Weaknesses

      While the present manuscript clearly goes beyond Bigge et al, 2021, the advance could have perhaps been even stronger with a more fine-grained investigation of the visual responses in the dorsal visual field. Do hawkmoths, for example, show optomotor responses to rotational optic flow in the dorsal visual field?

      I find the majority of the data, which are also the data supporting the main claims of the paper, compelling. However, the measurements of flight height are less solid than the rest and I think these data should be interpreted more carefully.

      Reviewer #3 (Public review):

      The authors have significantly improved the paper in revising to make its contributions distinct from their prior paper. They have also responded to my concerns about quantification and parameter dependency of the integration conclusion. While I think there is still more that could be done in this capacity, especially in terms of the temporal statistics and quantification of the conflict responses, they have made a case for the conclusions as stated. The paper still stands as an important paper with solid evidence a bit limited by these concerns.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The edits have significantly improved the clarity of the manuscript. A few small notes:

      Figure 2B legend - describe what the orange dashed line represents

      We added a description.

      Figure 2B legend - references Table 1 but I believe this should reference Table S1. There are other places in the manuscript where Table 1 is referenced and it should reference S1

      We changed this for all instances in the main paper and supplement, where the reference was wrong.

      Figure S1 legend - some figure panel letters are in parentheses while others are not

      We unified the notation to not use parentheses for any of the panel letters.

      Reviewer #2 (Recommendations for the authors):

      I couldn't find the l, r, d, v indications in Fig. 1a. This was just a suggestion, but since you wrote you added them, I was wondering if this is the old figure version.

      We added them to what is now Fig. 2, which was originally part of Fig. 1. After restructuring, we did indeed not add an additional set to Fig. 1, which we have now adjusted.

      Fig. 2: Adding 'optic flow' and 'edges' to the y-axis in panels E and F, would make it faster for me to parse the figure. Maybe also add the units for the magnitudes? Same for Figure 6B

      We added 'optic flow' and 'edges' to the panels E and F in Fig. 2 and Fig. 6.

      Fig. 2: Very minor - could you use the same pictograms in D and E&F (i.e. all circles for example, instead of switching to "tunnels" in EF)?

      We used the tunnel pictograms, because we associated those with the short notations for the different conditions summarised in Table S1. Because we wanted to keep this consistent across the paper, we used the “tunnel” pictograms here too.

      In the manuscript, you still draw lots of conclusions based on these area measurements (L132-142, L204-209 etc). This does not fully reflect what you wrote in your reply to the reviewers. If you think of these measurements as qualitative rather than quantitative, I would say so in the manuscript and not use quantitative statistics etc. My suggestion would be to be more specific about potential issues that can influence the measurement (you mentioned body size, image contrast, motion blur, pitch across conditions etc) and give that data not the same weight as the rest of the measurements.

      We do express explicit caution with this measure in the methods section (l. 657-659) and the results section (l. 135-137). Nevertheless, as the trends in the data are consistent with optic flow responses in the other planes, and with responses reported in the literature, we felt that it is valuable to report the data, as well as the statistics, for all readers, who can, given our cautionary statement, assess the data accordingly.

      The area measurements suggest that moths fly lower with unilateral vertical gratings (Fig. S1, G1 and G2 versus the rest). If you leave the data in can you speculate why that would be? (Sorry if I missed that)

      We agree, this seems quite consistent, but we do not have a good explanation for this observation. It would certainly require some additional experiments and variable conditions to understand what causes this phenomenon.

      Fig.4 - is panel B somehow flipped? Shouldn't the flight paths start out further away from the grating and then be moved closer to midline (as in A). That plot shows the opposite.

      Absolutely right, thank you for spotting this, it was indeed an intermediate and not the final figure which was uploaded to the manuscript. It also had outdated letter-number identifiers, which we now updated.

      L198 - should be "they avoided"

      Corrected.