  1. Aug 2022
    1. Author Response

      Reviewer #2 (Public Review):

      The time-dependency of the model simulations was not analyzed, and the nature of the observed biphasic time-dependent APAP response remains elusive. It would be interesting to see how the model can explain the time course of the APAP stimulation experiment.

      The alternative model in its current state can only describe steady-state conditions. We understand that the reviewer is interested in the dynamic behavior of the model; however, our approach provides a proof of principle that the alternative model can phenomenologically explain the changes in YAP localization in response to APAP treatment. The question of how to model the Hippo pathway in a time-dependent manner in response to APAP treatment is very challenging and would require further investigation and, most notably, further development of the PDE simulation algorithms and the SME software. Such a technical update of the software algorithms is beyond the scope of this manuscript.

      Nevertheless, we decided to share our first and preliminary analyses on dynamic processes caused by APAP with the reviewer. For this, we simulated the steady state model in an arbitrary manner, where APAP initiates (early time-point) and slows down (late time-points) YAP phosphorylation in the nucleus (see Figure below).

      The simulated alternative model shows that an increase in YAP phosphorylation of about 50% leads to the cytoplasmic localization of YAP (Rebuttal Figure R5A/B). However, this shuttling is not detectable in our protein fractionation and live-cell imaging experiments (see also Rebuttal Figure R7C/D). At late time points, decreasing YAP phosphorylation (by about 60%) led to a clear nuclear enrichment, and dephosphorylation of YAP was observed in our experiments. Thus, our mathematical model nicely describes cellular events of Hippo pathway dynamics observed at later stages after APAP treatment (nuclear enrichment). However, early events cannot be completely explained (the suggested nuclear YAP exclusion is not detectable).

      We suggest two explanations for this observation. First, other molecular mechanisms (not yet identified and therefore not part of the model topology) oppose the nuclear exclusion of YAP that is expected at early time points. Second, the detection methods used in this study (Western blotting and live-cell imaging) cannot capture minimal changes and cellular heterogeneity in the chosen experimental setup. We clarify this aspect/limitation of our study in the discussion chapter of the manuscript (page 12, lines 436-440).

      Time-dependency of YAP (orange) localization based on the simulated APAP treatment. (A): Simulated control (ctrl) and APAP treatment for 2 and 48h. The treatment was simulated by changing the phosphorylation coefficient of YAP in the nucleus. (B): Simulated pYAP/YAP ratio during control and APAP treatment for 2 and 48 hours at the steady state of the model. (C): Simulated NCR of the total YAP during control and APAP treatment for 2 and 48 hours at the steady state.

    1. Author Response

      Reviewer #1 (Public Review):

      This study is a follow-up to the previous work by the authors in establishing a surprising role for the presynaptic adhesion molecules, neurexin (Nrxn) variants containing the SS4+ splice site, in differentially controlling postsynaptic NMDA and AMPA receptors by forming links through a shared system of extracellular cerebellins (Cbln) and postsynaptic GluD1. Here the authors show at CA1 to subiculum synapses, that the role for Cbln2 in mediating the effects of Nrxn1-SS4+ and Nrxn3-SS4+ in enhancing NMDAR and suppressing AMPAR, respectively, is redundant with that of Cbln1. Moreover, Cblns do not appear to play a role in synapse formation. Dai and colleagues extend their previous work also by highlighting the common function for the Nrxn-Cbln signaling system across different synapses, albeit with subtle differences, and point to a lack of a role for Nrxn-Cbln signaling in morphological synapse development. Overall the data are solid, while the key findings are mostly incremental, and the basis for the selectivity in the observed differential regulation of AMPARs and NMDARs via the same trans-synaptic link through Cblns at various types of synapses remains to be clarified. Importantly, the authors make a definitive conclusion concerning the lack of a role for Nrxn-Cbln signaling complexes in synapse formation during development. Nevertheless, this is a contentious issue, and as such, the conclusions could be more compellingly supported with further experiments.

      We appreciate the reviewer’s positive assessment of our study.

      Reviewer #2 (Public Review):

      In this manuscript Dai et al. investigated the role of Nrxn-Cbln complexes in regulating AMPA- and NMDA-receptor function in different brain regions. Using a combination of genetic manipulations, together with electrophysiological and biochemical assays, the authors showed that, at CA1-subiculum synapses, Cbln2 regulates NMDA- and AMPA-receptors via Nrxn1SS4+-Cbln2 and Nrxn3SS4+-Cbln2 signaling complexes, respectively. In the prefrontal cortex, only Nrxn1SS4+-Cbln2 signaling-dependent regulation of NMDA receptors occurs, while in the cerebellum, only Nrxn3SS4+-Cbln1 signaling-dependent regulation of AMPA receptors occurs. This systematic investigation of the function of different Neurexin-Cerebellin signaling complexes contributes to our understanding of how different members of the same family, in combination pairs, regulate synaptic transmission with circuit specificity. This work adds to the authors' systemic investigation of molecular mechanisms regulating synaptogenesis, synaptic transmission and synaptic plasticity.

      We thank the reviewer for the positive and astute comments.

      Some suggestions for clarifications:

      1) Regarding expression of Cbln1 in the subiculum, in lines 271-273, the authors stated that "in these and earlier experiments we only studied Cbln2, but quantifications show that Cbln1 is also expressed in the subiculum, albeit at lower levels (Figure S3)." However, Figure S3 does not include any quantifications, and the example image does not show visible Cbln1 expression. Thus, the above-mentioned statement is inconsistent with the data presented. Please revise. If the authors would like to keep the statement about quantifications of Cbln1, then quantification should be provided for all panels of this Figure, in order to give the readers some idea of relative expression levels.

      We agree, and have addressed this issue as described above (introductory point 4).

      2) Does Cbln4, which is also broadly expressed in the brain, play a role in regulating AMPA- and NMDA-receptors at the synapses investigated? Does Cbln3 contribute to regulation of synaptic transmission in the cerebellum? Please discuss.

      Cbln4 is not expressed in the subiculum, but is expressed in the PFC. Specifically, Cbln1, Cbln2, and Cbln4 are broadly expressed in brain, whereas Cbln3 is restricted to cerebellar granule cells and requires Cbln1 or Cbln2 for secretion (Bao et al., 2006; Miura et al., 2006). Remarkably, Cbln1, Cbln2, and Cbln4 are not uniformly expressed in all neurons, but synthesized in restricted subsets of neurons (Seigneur and Südhof, 2017). For example, cerebellar granule cells express high levels of Cbln1 but only modest levels of Cbln2, excitatory entorhinal cortex (EC) neurons express predominantly Cbln4, and neurons in the medial habenula (mHb) express Cbln2 or Cbln4 (Seigneur and Südhof, 2017).

      Cbln4 is poorly studied, and Cbln3 has not been functionally studied at all. To the best of our knowledge, there are only four studies on Cbln4 function, three of which are from our lab. The Seigneur & Südhof (2018) paper showed that the deletion of Cbln4 in a large number of brain regions caused no change in excitatory or inhibitory synapse numbers. Subsequently, the Seigneur et al. (2018) paper demonstrated that genetic deletion of Cbln4 in the mHb had no major effect on synapse numbers, but because of the limits of this preparation (synaptic transmission is hard to monitor in the mHb), no detailed synaptic studies were done. The Fossati et al. (2019) paper in Neuron shows that Cbln4 regulates inhibitory synapse numbers in the cortex by binding to GluD1, but this study depended on RNAi, not genetic manipulations. Its results are puzzling because structural biology studies have shown that Cbln4 does not bind to GluD2, which is highly homologous to GluD1 and has the same function as GluD1. Instead of binding to GluD’s, Cbln4 binds to another class of receptors, Neogenin-1 and DCC, making the Fossati et al. (2019) paper difficult to interpret. The Liakath-Ali et al. (2022) paper, finally, demonstrated that deletion of Cbln4 in the EC or deletion of Neo1 in the dentate gyrus (DG) blocks long-term potentiation at EC→DG synapses but does not change basal synaptic transmission or synapse numbers, again consistent with the notion that Cbln4 regulates synapse properties similar to Cbln1 and Cbln2.

      We have now described these studies in the introduction to the paper. Many synaptic proteins are associated with contentious studies in the literature, and we completely concur that it is essential to evenly discuss the issues in detail, even if this expands the size of a paper.

      Reviewer #3 (Public Review):

      In this study, Dai and colleagues used genetic models combined with electrophysiological recordings and behavior as well as immunostaining and immunoblotting to investigate the role of trans-synaptic complexes involving presynaptic neurexins and cerebellins in shaping the function of central synapses. The study extends previous findings from the same authors as well as other groups showing an important role of these complexes in regulating the function of central synapses. Here, the authors sought to achieve two main objectives: (1) investigating whether their previous findings obtained at mature CA1→subiculum synapses (Aoto et al., 2013; Dai et al., Neuron 2019; Dai et al., Nature 2021) extend to different synapse subtypes in the subiculum as well as to other central synapses including cortical and cerebellar synapses and (2) investigating whether Nrxn-Cbln-GluD trans-synaptic complexes play a role in synapse formation as previously proposed by other groups.

      Overall, the study provides interesting and solid electrophysiological data showing that different Nrxns and Cblns assemble trans-synaptic complexes that differently regulate AMPAR- and NMDAR-mediated synaptic transmission across distinct synaptic circuits (most likely through binding to postsynaptic GluD receptors).

      We appreciate the reviewer’s accurate and positive assessment of our study.

      However, the study has several important weaknesses:

      1) The novelty of the findings appears limited. Indeed, previous studies from the same authors with similar experimental paradigms and readouts already demonstrated the role of Nrxn-Cbln-GluD complexes in regulating AMPARs versus NMDARs in mature neurons (Aoto et al., Cell 2013; Dai et al., Neuron 2019; Dai et al., Nature 2021). Moreover, the absence of a role of Cblns and GluD receptors in synapse formation was already suggested in previous studies from the same authors (Seigneur and Südhof, J Neurosci 2018; Seigneur et al., PNAS 2018; Dai et al., Nature 2021).

      Not surprisingly, we disagree with this comment. We do concur that our data are consistent with previous studies, but believe that this reproducibility is a strength since so many data in the literature are irreproducible.

      We do not agree, however, that our findings lack novelty. The novelty is admittedly limited (after all, we like to be consistent), but our paper is the first to demonstrate that the Nrxn1/Cbln/GluD and Nrxn3/Cbln/GluD complexes are differentially active in different synapses, with the subiculum synapses having both, the mPFC synapses only the former, and the cerebellum only the latter. This is a very important innovation that illustrates the power of the Nrxn/Cbln/GluD signaling complex in shaping synapses. In addition, our paper is the first to analyze a possible developmental function of Cbln2 in depth, to analyze its differential role at the two dominant types of pyramidal neurons in the subiculum, regular- and burst-spiking neurons, to analyze conditional deletions of Cbln1 in the adult cerebellum, and to directly measure the effect of Cbln2 deletions in the PFC. Especially in view of the recent Nature paper that concluded that Cbln2 regulates spine numbers in the PFC, these findings are highly relevant.

      2) The conclusion made by the authors that the Nrxn-Cbln-GluD trans-synaptic complexes do not play a role in synapse formation/development is not sufficiently supported by their data, while previous studies suggest the opposite. Actually, this conclusion is essentially based on the two following measurements taken as a 'proxy' for synapse density: (1) 'the average vGluT1 intensity calculated from the entire area of subiculum' and (2) the 'synaptic proteins levels' assessed by immunoblotting. Neither of these measurements (only performed in the subiculum) allows synapse density on the neurons of interest to be assessed precisely. While the average vGluT1 intensity over large fields of view does not directly reflect the density of synapses and does not take into account the postsynaptic compartment, the immunoblotting data only reflect the overall expression of synaptic proteins without discriminating between intracellular, surface and synaptic pools and between cell types. In the subiculum from Cbln1+2 KO mice, the authors performed mEPSC recordings and found an increase in frequency. However, this increase may reflect the unsilencing and/or potentiation of AMPAR-EPSCs above the detection threshold, irrespective of the actual synapse number. Finally, the decrease in NMDAR-EPSCs is not discussed by the authors while it could actually reflect a decrease in synapse number.

      We agree that additional data on synapse numbers are helpful for our paper. We have now performed these studies as described in detail in our response to introductory point 1 above. However, we would also like to refer to the already existing body of evidence on the role of neurexin-based complexes in synapse numbers. We have shown in papers published over the last two decades that deletions of individual neurexins or of multiple neurexins, as well as blocking cerebellin binding to neurexins by ablating splicing site #4 (SS4) in neurexins, have NO effect on synapse numbers. The most important of these papers are:

      1. Missler, M., Zhang, W., Rohlmann, A., Kattenstroth, G., Hammer, R.E., Gottmann, K., and Südhof, T.C. (2003) α-Neurexins Couple Ca2+-Channels to Synaptic Vesicle Exocytosis. Nature 423, 939-948.
      2. Kattenstroth, G., Tantalaki, E., Südhof, T.C., Gottmann, K., and Missler, M. (2004) Postsynaptic N-methyl-D-aspartate receptor function requires α-neurexins. Proc. Natl. Acad. Sci. U.S.A. 101, 2607-2612.
      3. Dudanova, I., Tabuchi, K., Rohlmann, A., Südhof, T.C., and Missler, M. (2007) Deletion of α-Neurexins Does Not Cause a Major Impairment of Axonal Pathfinding or Synapse Formation. J. Comp. Neurol. 502, 261-274.
      4. Etherton, M.R., Blaiss, C., Powell, C.M., and Südhof, T.C. (2009) Mouse neurexin-1α deletion causes correlated electrophysiological and behavioral changes consistent with cognitive impairments. Proc. Natl. Acad. Sci. U.S.A. 106, 17998-18003.
      5. Soler-Llavina, G.J., Fuccillo, M.V., Ko, J., Südhof, T.C., and Malenka, R.C. (2011) The neurexin ligands, neuroligins and LRRTMs, perform convergent and divergent synaptic functions in vivo. Proc. Natl. Acad. Sci. U.S.A. 108, 16502-16509.
      6. Aoto, J., Martinelli, D.C., Malenka, R.C., Tabuchi, K., and Südhof, T.C. (2013) Presynaptic Neurexin-3 Alternative Splicing Trans-Synaptically Controls Postsynaptic AMPA-Receptor Trafficking. Cell 154, 75-88. PMCID: PMC3756801.
      7. Aoto, J., Földy, C., Ilcus, S.M., Tabuchi, K., and Südhof, T.C. (2015) Distinct circuit-dependent functions of presynaptic neurexin-3 at GABAergic and glutamatergic synapses. Nat Neurosci. 18, 997-1007.
      8. Anderson, G.R., Aoto, J., Tabuchi, K., Földy, C., Covy, J., Yee, A.X., Wu, D., Lee, S.-J., Chen, L., Malenka, R.C., and Südhof, T.C. (2015) α-Neurexins Control Neural Circuit Dynamics by Regulating Endocannabinoid Signaling at Excitatory Synapses. Cell 162, 593-606. PMCID: PMC4709013
      9. Chen, L.Y., Jiang, M., Zhang, B., Gokce, O., and Südhof, T.C. (2017) Conditional Deletion of All Neurexins Defines Diversity of Essential Synaptic Organizer Functions for Neurexins. Neuron 94, 611-625. PMCID: PMC5501922
      10. Dai, J., Aoto, J., and Südhof, T.C. (2019) Alternative Splicing of Presynaptic Neurexins Differentially Controls Postsynaptic NMDA- and AMPA-Receptor Responses. Neuron 102, 993-1008. PMCID: PMC6554035
      11. Luo, F., Sclip, A., Jiang, M., and Südhof, T.C. (2020) Neurexins Cluster Ca2+ Channels within presynaptic Active Zone. EMBO J. 39, e103208. PMCID: PMC7110102
      12. Khalaj, A.J., Sterky, F.H., Sclip, A., Schwenk, J., Brunger, A.T., Fakler, B., and Südhof, T.C. (2020) Deorphanizing FAM19A Proteins as Pan-Neurexin Ligands with an Unusual Biosynthetic Binding Mechanism. J. Cell Biol. 219, e202004164
      13. Luo, F., Sclip, A., and Südhof, T.C. (2021) Universal role of neurexins in regulating presynaptic GABAB-receptors. Nature Comm. 12, 2380. PMCID: PMC8062527
      14. Wang, C.Y., Trotter, J.H., Liakath-Ali, K., Lee, S.J., Liu, X., and Südhof, T.C. (2021) Molecular Self-Avoidance in Synaptic Neurexin Complexes. Science Advances 7, eabk1924. PMCID: PMC8682996
      15. Dai, J., Patzke, C., Liakath-Ali, K., Seigneur, E., and Südhof, T.C. (2021) GluD1, A signal transduction machine disguised as an ionotropic receptor. Nature 595, 261-265. PMCID: PMC8776294

      Individual papers may not convince the reviewer, but the cumulative evidence is, we hope, persuasive. We have published less evidence on the lack of a role of cerebellins and GluD’s in synapse numbers than on neurexins, but the only in-depth studies of these molecules by others are in the cerebellum. Here, deletions of Cbln1 and GluD2 indeed cause a significant, albeit partial, loss of synapses. However, this loss may not be due to a lack of synapse formation, but to an elimination of synapses that have been formed, as demonstrated by many beautiful papers from leading investigators. It is regrettable that reviews and textbooks continue to state as an established fact that cerebellins mediate synapse formation, because as far as we can see there is limited evidence for that conclusion, yet it keeps coming back again and again.

      3) The authors do not provide sufficient data in order to interpret the increase in AMPAR-EPSCs and decrease in NMDAR-EPSCs amplitudes. Are the changes in AMPARs and NMDARs occurring at pre-existing synapses or do they result from alterations in the number of physical synapses and/or active synapses (see point#2)? In particular, the increase in AMPAR/NMDAR ratio accompanied by the increase in mEPSCs frequency might be well explained by the unsilencing of some synapses and/or by the fact that the available pool of AMPARs is distributed over a smaller number of synapses, resulting in higher quantal size. These effects could explain the blockade of LTP, i.e., through an occlusion mechanism.

      We addressed these points in previous studies (Aoto et al., 2013; Dai et al., 2019 and 2021), as discussed and cited in the present paper, and expanded on these points in the present paper.

      In a nutshell, we showed by surface AMPAR staining that presynaptic Nrxn3-SS4+ decreases postsynaptic AMPAR levels, and by direct application of AMPA that it decreases the functional surface levels of AMPARs, whereas presynaptic Nrxn1-SS4+ increases the functional surface levels of NMDARs. We also demonstrated the opposite effects for the GluD1 KO, and furthermore showed by minimal stimulation experiments that the Cbln2 deletion does not alter the number of silent synapses. In the present manuscript, we performed a detailed analysis of the miniature quantal size for AMPAR- and NMDAR-EPSCs.

      Finally, we have demonstrated in a large number of papers, including this one, that genetic manipulations of neurexins, cerebellins, and GluD’s do not alter synapse numbers with a few exceptions in which synapses are secondarily eliminated, like in the cerebellum. Together, these data show that the observed changes are mediated by a regulation of postsynaptic functional AMPARs and NMDARs, not by alterations in synapse numbers or by synapse silencing/unsilencing.

      4) The authors did not demonstrate (or did not cite relevant studies) that the deletion of Cbln1 and/or Cbln2 does not affect the expression of the remaining Cbln isoforms (Cbln2 and/or Cbln4) or Nrxn1/3 and GluD1/2. This verification is important to preclude the emergence of any compensatory effect.

      To address this point, we have now measured the mRNA expression levels of Nrxns, Cblns, and GluDs in both the subiculum and the prefrontal cortex in littermate P35-42 Cbln2 WT and KO mice. The results show that the constitutive Cbln2 deletion causes no compensatory expression effects (new suppl. Fig. S5). Please note that compensatory expression effects are often raised as a possibility for explaining genetically induced changes (or the lack thereof), but such effects are virtually never found.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors try to shed light on how plant stem cells located in a ring‐like structure (the procambial cells or cambium) can generate two distinct differentiated tissues, one filling the interior of the ring (the xylem) and the other one surrounding the ring (the phloem). To achieve this goal, the authors propose different models increasing in complexity, and perform a series of comparisons between the model outcomes and experimental data in the Arabidopsis hypocotyl. This work seems to provide for the first time a computational framework to model the radial formation of the cambium, xylem and phloem in the hypocotyl. Some of the features of the wild type and mutants could be qualitatively recapitulated, such as the radial organization of the xylem, cambium and phloem in wild type, and a striking phenotype upon the overexpression of the CLE41 transgene.

      We thank the reviewer for appreciating the novelty of this work.

      Although this work is very well written and understandable at the introduction, when paying careful attention to the presented results, there are different aspects that would require further work and investigation, on both experimental and modelling sides: The authors chose to study different models increasing in complexity, reaching a more complete model (Model 3, Figure 5A‐D) that the authors claim recapitulates the experimental data and the explored experimental perturbations (Figure 5E‐F). This model is substantially more complex than Model 1 and Model 2, and it is difficult to understand all the claims by the authors, and its radial pattern formation capabilities. Yet, a feature that is clear to the eye, both in the pictures and in the movies, is that this model seems more likely to present a front instability of the cambium front progression, disrupting the radial organization of the different tissues (see Figure 5B), which does not seem to happen in the wild type hypocotyl from Arabidopsis. This effect is even more extreme when looking at the pxy mutant (Figure 5F) and when the xylem cell wall thickness is explored through the simulations (Figure 6). The authors claim this model is able to recapitulate a basic feature of the pxy mutant, which is the fact that the distal cambium appears in patches. Although these patches appear in the simulations, this effect in the model might be produced by the instability of the cambium front progression itself, which might be fundamentally different from what happens in the experimental data. In the experimental data, the PXYpro:CFP cambium does not seem to present such front instability, but rather it is the xylem that gets fragmented. To make a link between Model 3 and the pxy mutant, a careful study of the different stages of this phenotype could be useful, both on the modelling and experimental side.

      Thanks for this valuable comment and for appreciating our writing style. Front stability was not part of our considerations but certainly adds a very interesting aspect to our study. The reviewer is correct in noticing that the front of domains observed in planta is very stable but that this is not the case for our computational simulations. We believe that instability in the computational models is due to local noise in the cellular pattern leading to differential diffusion of chemicals with respect to their radial position and to a progressive deviation of the domains from a perfect circle. Such a deviation seems to be corrected by an unknown mechanism in planta, but such a corrective mechanism is, due to the absence of a good idea of its nature, not implemented in our models. In order to investigate this point and the contribution of front instability to phenotypes of perturbed lines, we performed a time course analysis of anatomies of wt, IRX3pro:CLE41 and pxy lines with the help of the PXYpro:CFP/SMXL5pro:YFP markers, now shown in Fig. S1, and compared their dynamics to the respective movies 4A, 5A, and 6A. For pxy mutants, we observed ‘gaps’ in the cambium domain already at early stages of development (Fig. S1I, J), arguing against the idea that the pxy anatomy is caused by increased front instability and rather for differential signaling within a circular domain leading to a breakdown of cambium patterning and cell fate determination. Although a corrective mechanism ensuring front stability in planta is difficult to predict, we believe that our model now allows testing of respective ideas like directional movement of chemicals or stabilizing communication between cells within a particular circular domain. This aspect is now discussed in the discussion.

      The authors have a parameter search strategy based on matching the proportion of cell types in Model 3. I am wondering how effective is this strategy in a system where these features are evolving in time, especially in Model 3, which seems to present a front instability. Moreover, this strategy does not tell anything about the model robustness for recapitulating the different features of the pattern.

      We thank the reviewer for pointing out these aspects regarding the parameter search. We agree that there are some limitations to estimating dynamic parameters based on the proportion of cell types. As a consequence, we have focused our parameter search on those parameters that directly impact tissue formation: cell division thresholds, cell differentiation thresholds and maximal cell sizes. We have further expanded our parameter search until we obtained five distinct parameter sets that recapitulate central features of cambium activity. This increases the likelihood that the behavior we saw in the subsequent analyses was actually a feature of the system and not a characteristic of that particular parameter set. This strategy did not solve the front instability of model 3, which suggests that there are factors at play ‐ beyond the CLE41‐PXY module and cell wall stability – which are currently beyond the scope of our model.
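      As a minimal sketch of what scoring parameter sets by cell-type proportions could look like, the snippet below compares simulated cell counts against target proportions using a sum of squared differences. The function names, the target proportions and the cell counts are all hypothetical illustrations; the response does not state which objective function or values were actually used.

```python
def proportions(counts):
    """Convert raw cell counts per tissue type into fractions of the total."""
    total = sum(counts.values())
    return {tissue: n / total for tissue, n in counts.items()}


def mismatch(simulated_counts, target_props):
    """Sum of squared differences between simulated and target proportions."""
    sim = proportions(simulated_counts)
    return sum((sim.get(t, 0.0) - p) ** 2 for t, p in target_props.items())


# Hypothetical target proportions (illustrative numbers, not measured data).
target = {"xylem": 0.5, "cambium": 0.2, "phloem": 0.3}

# Hypothetical end-of-simulation cell counts for two candidate parameter sets.
candidates = {
    "set_a": {"xylem": 50, "cambium": 20, "phloem": 30},
    "set_b": {"xylem": 80, "cambium": 10, "phloem": 10},
}

scores = {name: mismatch(counts, target) for name, counts in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

      A score of this kind only looks at the final tissue composition, which is why it cannot distinguish a stable front from an unstable one that happens to yield the same proportions.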

      In the last model, the authors try to link the cell wall thickness with the radiality of the divisions. Although the idea of looking at the division trajectories seems interesting, more clarity is needed to see how helpful the radiality measure is, and perhaps a better measure is needed ‐ note that the proliferation trajectory in Figure 6C might have the same number of ramifications as in Figure 6B, and therefore, effectively speaking, the number of periclinal divisions might be the same in both cases. The authors claim that the increase of xylem thickness contributes to a more radial growth, but this could be related to the cambium front instability, which seems to be more pronounced as well for higher xylem thickness.

      We agree with the reviewer that this is a critical point, as a robust measurement of ‘radiality’ of cell lineages is central for assessing the degree of periclinality of cell divisions with the computational model. After extensively considering different measurement methods, we indeed think that calculating R² of cell connectors is the most appropriate and quantitative one in the context of our computational model. In fact, this method does not consider the number of ramifications but rather the geometry of ‘cell connectors’, which clearly shows a more ‘radial’ pattern of cell lineages when xylem cells are ‘stiffer’ (Fig. 6D). Ramifications would be a measurement of the number of cell divisions, which we did not want to target in this case. We also did not claim that increased xylem thickness leads to more radial growth. In fact, Fig. S4 shows that the opposite is rather the case. We expect that increased front instability when ‘xylem stiffness’ is increased would rather decrease radiality of cell lineages and mask respective positive effects. The fact that we still see increased ‘radiality’ argues against the assumption that front instability is causative.
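      To illustrate the kind of quantity an R2-based radiality score could be, the sketch below computes the squared Pearson correlation of cell centroid coordinates along one lineage, which equals the R2 of a simple linear fit. This is an assumption about the general idea only: the function name and the example coordinates are hypothetical, and the authors' exact definition of ‘cell connectors’ and their fitting procedure may differ.

```python
from statistics import fmean


def connector_r2(points):
    """R^2 of a straight-line fit through (x, y) points (squared Pearson r)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        # All points share one x or one y value, so they are exactly collinear.
        return 1.0
    return sxy ** 2 / (sxx * syy)


# Hypothetical centroids of successive cells in two lineages.
straight_lineage = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
bent_lineage = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0), (4.0, 1.0)]

print(connector_r2(straight_lineage))
print(connector_r2(bent_lineage))
```

      Under this reading, a lineage whose cells line up along one direction scores close to 1 regardless of how many times it branches, which is consistent with the statement that ramifications are not what the measure targets. Note that an ordinary least-squares R2 is sensitive to orientation for noisy, nearly vertical lineages, so a rotation-invariant fit may be preferable in practice.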

      On the experimental side, the claims about the proximal and distal cambium, together with the cell proliferation data, are not very well supported by the data presented in Figures 2, 3A and S1. Moreover, these different figures seem to show different behaviors ‐ are these sections at different stages of the hypocotyl? Also, seeing more of the H4 marker in a region of the tissue does not necessarily indicate a higher proliferation rate (it could also simply be that cells are more synchronized in the S phase in that region of the cambium, and/or the cell cycle lasts for longer in that part of the tissue). A quantification and the proper repeats to support these claims are lacking. A quantitative and more extensive study of the pxy mutant would enable a better comparison with the simulated model. Is there PXYpro:CFP expression between the fragmented xylem?

      We agree with these concerns toward the H4 marker used in the initial submission. Because H4 expression is not specifically associated with cell division but with DNA synthesis in general and, thus, with endoreduplication, H4 expression does not report faithfully on cell division. As a response, we removed related figures and now reference our previous study characterizing cell division levels in different cambium domains based on cell lineage analyses (Shi et al., 2019). Because this is a far more reliable analysis and convincingly supports our claims, we believe that we thereby addressed this concern. As mentioned above, we also added a more extensive analysis of the pxy mutant (Fig. S1) showing that there is no PXY expression between the fragmented xylem domains.

      This work might help progress in the field of understanding radial patterning in plants. The introduction and the first models could attract a more general plant audience, but once the models increase in complexity, the narrative and presented results are more relevant to those scientists more specialized in xylem and phloem formation.

      We thank the reviewer for appreciating the general relevance of our models for a larger audience.

      Reviewer #2 (Public Review):

      The paper uses computer modeling and simulations to show how a radially growing circular plant organ, such as a hypocotyl, can develop and maintain its organization into tissues including, in particular, cambium, xylem and phloem. The results are illustrated with useful movies representing the simulations. The paper is organized as a sequence of models, which has some rationale ‐ it presumably depicts the path of refinements through which the authors arrived at the final model ‐ but the intermediate steps are of limited interest. At the same time, mathematical details of the models are not presented to the full extent. Fortunately, the models can be downloaded over the Internet, and the supplementary materials include detailed instructions for executing them (using the VirtualLeaf framework). Consequently, the paper and its results can potentially serve as a stepping stone for further model‐assisted studies of radial tissue organization and growth.

      Again, we thank the reviewer for appreciating the usefulness of our model and its general implications. In the revised version of the manuscript, we substantially expanded the explanations of the mathematical details in the main text and the supplemental methods. We would still argue that the intermediate steps are of general interest, as they illustrate why certain assumptions that are extensively discussed within the field were rejected, providing important justification for the final model.

    1. Author response

      Reviewer #3 (Public Review):

      Sensory preconditioning (SPC) refers to a conceptually important, higher-order form of Pavlovian conditioning. It involves two training phases and a final test. In the first, pre-conditioning training phase two 'neutral' stimuli are presented together (S1, S2). In the second training phase, one of them is paired with for example a punishment (S1+). In the final test conditioned response to the respective other stimulus is assessed (S2).

      The conclusion that sensory preconditioning does indeed occur requires showing that i) conditioned responding is observed for S2 but not for other, not pre-conditioned stimuli (S3); ii) that conditioned responding to S2 depends on the jointness of presentation of S1 and S2; iii) that conditioned responding to S2 depends on S1 indeed being paired with punishment. It is a strength of the current paper that these requirements are met and that this is the case both at the behavioural level and for a plausible stand-in at the physiological level.

      A weakness is that key data belonging together are not shown and analysed together.

      We have rearranged the data.

    1. Author Response

      Reviewer #1 (Public Review):

      Mikelov et al. investigated IgH repertoires of memory B cells, plasmablasts, and plasma cells from peripheral blood collected at three time-points over the course of a year. In order to obtain deep and unbiased repertoire sequences, authors adopted uniquely developed IgH repertoire profiling technology. Based on collected peripheral blood data, authors claim that:

      1) A high degree of clonal persistence in individual memory B cell subsets with inter-individual convergence in memory and ASCs.

      2) ASC clonotypes are transient over time and related to memory B cells.

      3) Reactivation of persisting memory B cells with new rounds of affinity maturation during proliferation and differentiation into ASCs.

      4) Both positive and negative selection contribute to persisting and reactivated lineages preserving the functionality and specificity of BCRs.

      The present study provides useful technical application for the analysis of longitudinal B cell repertoires, and bioinformatics and statistical data analysis are impressive. Regarding point 1), clonal persistence of memory B cells is already well known. On the other hand, inter-individual convergence between memory B cells and plasma cells might not be shown in healthy individuals even though the biological significance of circulating plasma cells is questionable.

      We thank the reviewer for careful analysis of our manuscript and are grateful for the positive view and all the criticism of our study.

      To the best of our knowledge, the clonal persistence of memory B cells has previously been studied mostly in the context of an active immune response after natural challenge or immunization. Here we used the full set of modern experimental and analytical repertoire sequencing approaches to characterize the connection and dynamics of memory and the two antibody-secreting B cell subpopulations over a long period in healthy donors, i.e. donors without severe inflammatory diseases who had not experienced an intensive response against a natural antigen close to the sample collection time points. In other words, we carefully dissected the repertoire of peripheral blood antigen-experienced B cells in the normal state. We therefore believe that our study adds a number of essentially new details to the overall picture of B cell immunity.

      By assessing the intra- and inter-individual repertoire overlaps, we found high reproducibility of B cell memory clones between time points, which was only slightly lower than the overlap between replicates. About 5% of the largest clonotypes were identical (Fig. 2B left), while the V usage distribution changed more substantially over time (Fig. 2A left), suggesting an impact of non-persistent memory IGH clonotypes. Compared to the intra-individual reproducibility, the number of clonotypes shared between unrelated donors was extremely low but still detectable, showing that convergent clonotypes contribute to the repertoire overlap of antigen-experienced B cells between unrelated donors. Together, our findings show a high level of individuality of the IGH repertoire of antigen-experienced B cells, while common challenges make it converge to some extent at the level of the most expanded clones, which are extremely stable (persistent) over time. On the way from naive to antigen-experienced B cells, the germline-encoded sequences of CDR1 and CDR2 have an impact that is similar between individuals with a similar genetic and environmental context. The latter further supports the previously reported findings on the role of germline-encoded parts of IGH in the response against specific antigens (Collins et al. DOI: 10.1016/j.coisb.2020.10.011).

      Regarding 2), temporal stability of plasma cell clonotypes has been demonstrated already in the bone marrow with serial biopsies over time (Wu et al. DOI: 10.1038/ncomms13838). The association of clonotypes between memory and plasma cells in the blood of healthy donors might be new; however, again, its biological significance is questionable.

      Long-term stability of plasma cells was previously shown by a number of studies demonstrating the presence of antigen-specific clones or even cells over months and years in human bone marrow and other sites, as well as in mice and primates (Wu et al. DOI: 10.1038/ncomms13838; Landsverk et al. DOI: 10.1084/jem.20161590; Manz et al. DOI: 10.1038/40540; Hammarlund et al. DOI: 10.1038/s41467-017-01901-w; Xu et al. DOI: 10.7554/eLife.59850; Davis et al. DOI: 10.1126/science.aaz8432). We agree that BM samples would add an additional layer to our investigation by describing the interconnection of the B cell memory pool with BM PCs. We also agree that the nature of circulating plasma cells is not fully clear at the moment and that the relation of such cells/clones to BM PCs remains to be detailed. However, we cannot agree with the reviewer’s remark about the low (or absent) biological significance of circulating ASCs. According to the current view, arising from a large number of studies conducted over the past several decades in mice, humans and other organisms, the differentiation events in the GC after antigen priming lead to the formation of cells switched to an antibody-secreting program, and some of them subsequently reach the bone marrow as their site of residence. The bone marrow niches provide the signals required for further differentiation of newly migrated ASCs into long-lived or short-lived plasma cells and for their survival in the BM. However, the ASCs migrating to the BM can be sampled from blood during their migration. The presence of an apoptosis-resistant subset of PCs expressing high-affinity Abs in circulation early after booster immunization in humans was previously shown (Inés González-García et al. DOI: 10.1182/blood-2007-08-108118). A similar in vitro survival ability for transcriptomically different blood ASC subsets was demonstrated by other authors (Garmilla et al. DOI: 10.1172/jci.insight.126732). A recent study using an artificial system modeling the BM niche in vitro shows that peripheral blood ASCs are able to differentiate into LLPCs (Joyner et al. DOI: 10.26508/lsa.202101285). In addition, a number of other studies have shown an increase of plasmablasts and plasma cells in PB during an intensive immune response after primary or secondary immunization/natural challenge (Blink et al. DOI: 10.1084/jem.20042060; Odendahl et al. DOI: 10.1182/blood-2004-07-2507; Lee et al. DOI: 10.4049/jimmunol.1002932) or in active autoimmune conditions (Szabo et al. DOI: 10.1111/cei.12703; Jacobi et al. DOI: 10.1002/art.10949). We therefore considered the ASC subsets in our work as a source of ASCs enriched in recently differentiated antibody producers that differ in expression of CD138, which is the marker of LLPCs among BM plasma cells and seemingly marks differently differentiated ASCs in circulation. Thus, these ASC subsets complement antigen-primed peripheral blood B cells, playing an important role in the ongoing immune response and influencing the plasma cell population in the BM. The connection at the clonal lineage level between persisting memory B cells and the ASC subsets shown in our study, together with findings recently published by Antonio Lanzavecchia’s lab (Phad et al. DOI: 10.1038/s41590-022-01230-1), supports the idea that the circulating CD19-/lowCD20-CD27+CD138+/- B cells in PB represent the antibody-producing progeny of reactivated memory.

      Regarding 3) and 4), it is hard to generalize observations from the presented data because the analysis was based on just four donor cases with different health conditions, i.e. a combination of healthy and allergic. The cell number of plasmablasts and plasma cells isolated from peripheral blood is extremely low compared to memory B cells, and in fact, the vast majority of ASCs reside in the tissues such as lymphoid organs, bone marrow, and mucosal tissues rather than in circulating blood (Mandric et al. DOI: 10.1038/s41467-020-16857-7). As the most critical problem, direct pieces of evidence to claim points, 3) and 4) are missing.

      We fully agree that our study has a set of limitations and added more detailed discussion of them to the revised version (lines 582-600). We agree that our cohort group is not large, nevertheless our observations demonstrate reproducibility among different donors and hold statistical significance for detected differences. To justify our generalization of this cohort group, combined from healthy and allergic donors, we added more detailed analysis as a Supplementary Note, showing that within our study design we observe no difference between healthy and allergic donors both on the level of the clonal repertoire and the level of clonal lineages.

      The number of sampled plasmablasts and plasma cells relative to memory B cells in our study reflects the ratio between these subpopulations in the peripheral blood of middle-aged donors and corresponds to previous estimates published by others. Given that on average about 15% of the most abundant clonotypes were reproducible between parallel samples (replicates), the sampled numbers of PBL and PL allowed us to reach a relatively high reproducibility of clone sampling at the level of cells. This, as well as the diversity estimates, indicates that we sequenced a representative number of ASCs in peripheral blood to characterize their clonal repertoire and their connection with the B cell memory pool. Indeed, the vast majority of plasma cells reside in different tissues, mostly in the bone marrow, but we believe that the ASCs in circulation represent the pool of newly generated ASCs and/or ASCs migrating between sites at different stages of differentiation. However, further studies showing the clonal relationship between memory B cells and ASCs in circulation and tissue-resident ASCs are still required to provide a more detailed view of this aspect.

      We agree that we cannot provide much direct evidence to support points 3) and 4); however, we found several pieces of indirect evidence that are highly consistent with each other and support the claims on memory reactivation and clonal selection:

      1. Biologically, the rapid increase in frequency of LBmem lineages and its perfect reproducibility between replicates (Supplementary Figure S7E) indicate an increase in the number of sampled cells, i.e. lineage expansion, that occurred due to proliferation after antigen challenge or migration between tissues of residence due to some other signals. The predominance of the ASC phenotype indicates their involvement in an ongoing immune response.

      2. The large G-MRCA distance in LBmem lineages, together with low inter-lineage genetic divergence, indicates that the observed clonotypes of LBmem lineages diverged recently, originate from some mature clonotype, and represent only a single clade of the full lineage phylogeny.

      3. Most LBmem lineages (47 out of 52) include Bmem clonotypes, showing the interconnection of the LBmem cluster with the Bmem subset. For 38 out of 52 LBmem lineages we detected a Bmem clonotype at the time point prior to lineage expansion.

      4. The significant difference in SHM patterns between HBmem and LBmem lineages reflects a difference in the selection forces affecting their evolution. In evolutionary genomics, it is rarely possible to study evolution directly, and most often changes in genetic sequences are the only type of data available. Therefore, we are inclined to trust the conclusions drawn from the use of tools designed for this type of problem. While negative selection is expected in the evolution of any protein, positive selection is much harder to detect. Thus, the presence of its signs suggests new rounds of affinity maturation or the presence of some mechanism leading to reactivation of the best-fitted representatives of the lineage.

      In addition to the indirect evidence, we found a direct and clear example of memory reactivation inside a clonal lineage (Fig. 4F). We added an alignment of the CDR3 region of this lineage as Supplementary Figure S7 to confirm that both its HBmem-like and LBmem-like parts originate from the same recombination event.

      These findings lead to the conclusion that most of the LBmem lineages in the analysis originated from some pre-existing memory. However, we cannot say for sure that in all cases this memory is similar in properties to the persistent memory of the HBmem cluster. The one exemplary clonal lineage shows that at least some LBmem lineages represent reactivation of persistent HBmem lineages. The most recent study in the field, published by Phad et al. (DOI: 10.1038/s41590-022-01230-1), has also demonstrated clonal relatedness of peripheral blood plasmablasts to persistent memory. It should also be noted that in the present study we focused on the most expanded clones and clonal lineages, while the mechanisms determining the extent of expansion are not well defined, and thus the behavior of smaller clones may be different. To conclude, we believe that our findings can be generalized, while probably representing only a part of the whole complex picture describing the behavior of B cell memory in the normal state.

      Reviewer #2 (Public Review):

      The findings in this manuscript have been properly hypothesized and adequately demonstrated, and have some levels of practical guidance. The authors performed a detailed longitudinal analysis of a subset of immune-experienced B cells from donors without severe pathology. They selected a comprehensive analytical framework for BCR clonal lineage from these data and suggested interconnected B-cell clone-level subsets, B-cell memory fusion in donor-independent, and long-term persistent peripheral blood memory-enriched clonal lineages. Lastly, their evolutionary results analyzing the B-cell clonal lineage plus annotation suggest that activating B-cell subsets of preexisting memory-B cells is accompanied by the maturation of new rounds of affinity.

      We thank the Reviewer for careful analysis and positive view on our study.

    1. Author Response

      Reviewer #1 (Public Review):

      Overall, the science is sound and interesting, and the results are clearly presented. However, the paper falls in-between describing a novel method and studying biology. As a consequence, it is a bit difficult to grasp the general flow, central story and focus point. The study does uncover several interesting phenomena, but none are really studied in much detail and the novel biological insight is therefore a bit limited and lost in the abundance of observations. Several interesting novel interactions are uncovered, in particular for the SPS sensor and GAPDH paralogs, but these are not followed up on in much detail. The same can be said for the more general observations, eg the fact that different types of mutations (missense vs nonsense) in different types of genes (essential vs non-essential, housekeeping vs. stress-regulated...) cause different effects.

      This is not to say that the paper has no merit - far from it even. But, in its current form, it is a bit chaotic. Maybe there is simply too much in the paper? To me, it would already help if the authors would explicitly state that the paper is a "methods" paper that describes a novel technique for studying the effects of mutations on protein abundance, and then goes on to demonstrate the possibilities of the technology by giving a few examples of the phenomena that can be studied. The discussion section ends in this way, but it may be helpful if this was moved to the end of the introduction.

      We modified the manuscript as suggested.

      Reviewer #2 (Public Review):

      Schubert et al. describe a new pooled screening strategy that combines protein abundance measurements of 11 proteins determined via FACS with genome-wide mutagenesis of stop codons and missense mutations (achieved via a base editor) in yeast. The method identifies genetic perturbations that affect steady state protein levels (vs transcript abundance), and in this way defines regulators of protein abundance. The authors find that perturbation of essential genes more often alters protein abundance than of nonessential genes and proteins with core cellular functions more often decrease in abundance in response to genetic perturbations than stress proteins. Genes whose knockouts affected the level of several of the 11 proteins were enriched in protein biosynthetic processes while genes whose knockouts affected specific proteins were enriched for functions in transcriptional regulation. The authors also leverage the dataset to confirm known and identify new regulatory relationships, such as a link between the SPS amino acid sensor and the stress response gene Yhb1 or between Ras/PKA signalling and GAPDH isoenzymes Tdh1, 2, and 3. In addition, the paper contains a section on benchmarking of the base editor in yeast, where it has not been used before.

      Strengths and weaknesses of the paper

      The authors establish the BE3 base editor as a screening tool in S. cerevisiae and very thoroughly benchmark its functionality for single edits and in different screening formats (fitness and FACS screening). This will be very beneficial for the yeast community.

      The strategy established here allows measuring the effect of genetic perturbations on protein abundances in highly complex libraries. This complements capabilities for measuring effects of genetic perturbations on transcript levels, which is important as for some proteins mRNA and protein levels do not correlate well. The ability to measure proteins directly therefore promises to close an important gap in determining all their regulatory inputs. The strategy is furthermore broadly applicable beyond the current study. All experimental procedures are very well described and plasmids and scripts are openly shared, maximizing utility for the community.

      There is a good balance between global analyses aimed at characterizing properties of the regulatory network and more detailed analyses of interesting new regulatory relationships. Some of the key conclusions are further supported by additional experimental evidence, which includes re-making specific mutations and confirming their effects on protein levels by mass spectrometry.

      The conclusions of the paper are mostly well supported, but I am missing some analyses on reproducibility and potential confounders and some of the data analysis steps should be clarified.

      The paper starts on the premise that measuring protein levels will identify regulators and regulatory principles that would not be found by measuring transcripts, but since the findings are not discussed in light of studies looking at mRNA levels it is unclear how the current study extends knowledge regarding the regulatory inputs of each protein.

      See response to Comment #10.

      Specific comments regarding data analysis, reproducibility, confounders

      1) The authors use the number of unique barcodes per guide RNA rather than barcode counts to determine fold-changes. For reliable fold changes the number of unique barcodes per gRNA should then ideally be in the 100s for each guide, is that the case? It would also be important to show the distribution of the number of barcodes per gRNA and their abundances determined from read counts. I could imagine that if the distribution of barcodes per gRNA or the abundance of these barcodes is highly skewed (particularly if there are many barcodes with only few reads) that could lead to spurious differences in unique barcode number between the high and low fluorescence pool. I imagine some skew is present as is normal in pooled library experiments. The fold-changes in the control pools could show whether spurious differences are a problem, but it is not clear to me if and how these controls are used in the protein screen.

      Because of the large number of screens performed in this study (11 proteins, with 8 replicates for each) we had to trade off sequencing depth and power against cell sorting time and sequencing cost, resulting in lower read and barcode numbers than what might be ideally aimed for. As described further in the response to Comment #5, we added a new figure to the manuscript that shows that the correlation of fold-changes between replicates is high (Figure 3–S1A). The second figure below shows that the correlation between the number of unique barcodes and the number of reads per gRNA is highly significant (p < 2.2e-16).
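
      As an illustration of the quantities discussed in this comment (a minimal sketch, not the analysis code used in the study), the following Python counts unique barcodes per gRNA and converts them into a fold-change between two sorted pools. The function names, the pseudocount, the high-versus-low comparison, and the normalization by the total number of unique barcodes per pool are all assumptions made for the example.

```python
import math
from collections import defaultdict


def unique_barcodes_per_grna(reads):
    """Count distinct barcodes per gRNA.

    `reads` is an iterable of (gRNA id, barcode) pairs, one pair per
    sequencing read, so repeated reads of one barcode count only once.
    """
    barcodes = defaultdict(set)
    for grna, barcode in reads:
        barcodes[grna].add(barcode)
    return {grna: len(found) for grna, found in barcodes.items()}


def log2_fold_change(high_counts, low_counts, pseudocount=1):
    """log2 ratio of normalized unique-barcode counts (high pool / low pool).

    Assumption: each pool is normalized by its total number of unique
    barcodes; the pseudocount avoids division by zero for missing gRNAs.
    """
    total_high = sum(high_counts.values())
    total_low = sum(low_counts.values())
    result = {}
    for grna in set(high_counts) | set(low_counts):
        high = (high_counts.get(grna, 0) + pseudocount) / total_high
        low = (low_counts.get(grna, 0) + pseudocount) / total_low
        result[grna] = math.log2(high / low)
    return result
```

      Counting barcodes rather than reads makes each independently transformed cell lineage contribute once, which is why the depth per barcode matters less than the number of barcodes recovered per gRNA.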

      2) I like the idea of using an additional barcode (plasmid barcode) to distinguish between different cells with the same gRNA - this would directly allow to assess variability and serve as a sort of replicate within replicate. However, this information is not leveraged in the analysis. It would be nice to see an analysis of how well the different plasmid barcodes tagging the same gRNA agree (for fitness and protein abundance), to show how reproducible and reliable the findings are.

      We agree with the reviewer that this would be nice to do in principle, but our sequencing depth for the sorted cell populations was not high enough to compare the same barcode across the low/unsorted/high samples. See also our response to Comment #5 for the replicate analyses.

      3) From Fig 1 and previous research on base editors it is clear that mutation outcomes are often heterogeneous for the same gRNA and comprise a substantial fraction of wild-type alleles, alleles where only part of the Cs in the target window or where Cs outside the target window are edited, and non C-to-T edits. How does this reflect on the variability of phenotypic measurements, given that any barcode represents a genetically heterogeneous population of cells rather than a specific genotype? This would be important information for anyone planning to use the base editor in future.

      We agree with the reviewer that the heterogeneity of editing outcomes is an important point to keep in mind when working with base editors. In genetic screens, like the ones described here, often the individual edit is less important, and the overall effects of the base editor are specific/localized enough to obtain insights into the effects of mutations in the area where the gRNA targets the genome. For example, in our test screens for Canavanine resistance and fitness effects, in which we used gRNAs predicted to introduce stop codons into the CAN1 gene and into essential genes, respectively, we see the expected loss-of-function effect for a majority of the gRNAs (canavanine screen: expected effect for 67% of all gRNAs introducing stop codons into CAN1; fitness screen: expected effect for 59% of all gRNAs introducing stop codons into essential genes) (Figure 2). In the canavanine screen, we also see that gRNAs predicted to introduce missense mutations at highly conserved residues are more likely to lead to a loss-of-function effect than gRNAs predicted to introduce missense mutations at less conserved residues, further highlighting the differentiated results that can be obtained with the base editor despite the heterogeneity in editing outcomes overall. We would certainly advise anyone to confirm by sequencing the base edits in individual mutants whenever a precise mutation is desired, as we did in this study when following up on selected findings with individual mutants.

      4) How common are additional mutations in the genome of these cells and could they confound the measured effects? I can think of several sources of additional mutations, such as off-target editing, edits outside the target window, or when 2 gRNA plasmids are present in the same cell (both target windows obtain edits). Could some of these events explain the discrepancy in phenotype for two gRNAs that should make the same mutation (Fig S4)? Even though BE3 has been described in mammalian cells, an off-target analysis would be desirable as there can be substantial differences in off-target behavior between cell types and organisms.

      Generally, we are not very concerned about random off-target activity of the base editor because we would not expect this to cause a consistent signal that would be picked up in our screen as a significant effect of a particular gRNA. Reproducible off-target editing with a specific gRNA at a site other than the intended target site would be problematic, though. We limited the chance of this happening by not using gRNAs that may target similar sequences to the intended target site in the genome. Specifically, we excluded gRNAs that have more than one target in the genome when the 12 nucleotides in the seed region (directly upstream of the PAM site) are considered (DiCarlo et al., Nucleic Acids Research, 2013).
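
      The seed-based exclusion rule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the design script used for the library: it assumes a 20-nucleotide protospacer whose last 12 nucleotides form the seed, an NGG PAM directly 3' of the protospacer, and a genome supplied as a single uppercase string; all function names are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_seed_sites(seed, genome):
    """Count positions where `seed` is directly followed by an NGG PAM.

    Both strands are searched. The lookahead lets overlapping sites be
    counted; perfectly palindromic sites would be counted on both strands.
    """
    pattern = re.compile("(?=" + re.escape(seed) + "[ACGT]GG)")
    strands = (genome, reverse_complement(genome))
    return sum(len(pattern.findall(strand)) for strand in strands)


def has_unique_seed(protospacer, genome, seed_length=12):
    """True if the PAM-proximal seed of the gRNA has exactly one target."""
    seed = protospacer[-seed_length:]
    return count_seed_sites(seed, genome) == 1
```

      A gRNA library would then be filtered with something like `[g for g in guides if has_unique_seed(g, genome)]`, where `guides` and `genome` are placeholders for the real inputs.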

      We do observe some off-target editing right outside the target window, but generally at much lower frequency than the on-target editing in the target window (Figure 1B and Figure 1–S2). Since for most of our analyses we grouped perturbations per gene, such off-target edits should not affect our findings. In addition, we validated key findings with independent experiments. For our study, we used the Base Editor v3 (Komor et al., Nature, 2016); more recently, additional base editors have been developed that show improved accuracy and efficiency, and we would recommend these base editors when starting a new study (see, e.g., Anzalone et al., Nature Biotechnology, 2020).

      We are not concerned about cases in which one cell gets two gRNAs, since the chance that the same two gRNAs end up in one cell repeatedly is low, and such events would therefore not result in a significant signal in our screens.

      We don’t think that off-target mutations can explain the discrepancy between pairs of gRNAs that should introduce the same mutation (Figure 3–S1). The effects of the two gRNAs are actually well correlated, but, often, one of the two gRNAs doesn’t pass our significance cut-off or simply doesn’t edit efficiently (i.e., most discrepancies arise from false negatives rather than false positives). We may therefore miss the effects of some mutations, but we are unlikely to draw erroneous conclusions from significant signals.

      5) In the protein screen normalization uses the total unique barcode counts. Does this efficiently correct for differences from sequencing (rather than total read counts or other methods)? It would be nice to see some replicate plots for the analysis of the fitness as well as the protein screen to be able to judge that.

      We made a new figure that shows a replicate comparison for the protein screen (see below; in the manuscript it is Figure 3–S1A) and commented on it in the manuscript. For this analysis, the eight replicates for each protein were split into two groups of four replicates each and analyzed the same way as the eight replicates. The correlation between the two groups of replicates is highly significant (p < 2.2e-16). The second figure shows that the total number of reads and the total number of unique barcodes are well correlated.

      For the fitness screen, we used read counts rather than barcode counts for the analysis since read counts better reflect the dropout of cells due to reduced fitness. The figure below shows a replicate comparison for the fitness screen. For this analysis, the four replicates were split into two groups of two replicates each and analyzed the same way as the four replicates. The correlation between the two groups of replicates is highly significant (p < 2.2e-16).
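
      The split-replicate comparison described in the two paragraphs above can be sketched in a few lines. This is a simplified illustration, not the analysis used for the figures: it assumes that each gRNA already has one log2 fold-change per replicate and summarizes each half by a plain mean, whereas the actual analysis re-ran the full pipeline on each half. The function and variable names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_half_correlation(log2fc_by_grna):
    """Correlate per-gRNA effects between two halves of the replicates.

    `log2fc_by_grna` maps a gRNA id to its list of per-replicate log2
    fold-changes (e.g. eight values, split into two groups of four).
    """
    first_half, second_half = [], []
    for values in log2fc_by_grna.values():
        half = len(values) // 2
        first_half.append(sum(values[:half]) / half)
        second_half.append(sum(values[half:]) / (len(values) - half))
    return pearson_r(first_half, second_half)
```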

      6) In the main text the authors mention very high agreement between gRNAs introducing the same mutation but this is only based on 20 or so gRNA pairs; for many more pairs that introduce the same mutation only one reaches significance, and the correlation in their effects is lower (Fig S4). It would be better to reflect this in the text directly rather than exclusively in the supplementary information.

      We clarified this in the manuscript main text: “For 78 of these gRNA pairs, at least one gRNA had a significant effect (FDR < 0.05) on at least one of the eleven proteins; their effects were highly correlated (Pearson’s R² = 0.43, p < 2.2e-16) (Figure 3–S1B). For the 20 gRNA pairs for which both gRNAs had a significant effect, the correlation was even higher (Pearson’s R² = 0.819, p = 8.8e-13) (Figure 3–S1C). These findings show that the significant gRNA effects that we identify have a low false positive rate, but they also suggest that many real gRNA effects are not detected in the screen due to limitations in statistical power.”

      7) When the different gRNAs for a targeted gene are combined, instead of using an averaged measure of their effects the authors use the largest fold-change. This seems not ideal to me as it is sensitive to outliers (experimental error or background mutations present in that strain).

      We agree that the method we used is more sensitive to outliers than averaging per gene. However, because many gRNAs have no effect either because they are not editing efficiently or because the edit doesn’t have a phenotypic consequence, an averaging method across all gRNAs targeting the same gene would be too conservative and not properly capture the effect of a perturbation of that gene.
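
      The trade-off between the two per-gene summaries can be made concrete with a toy example. The numbers below are invented for illustration, and "largest fold-change" is assumed here to mean the value of largest magnitude with its sign kept.

```python
def gene_effect_largest(log2fcs):
    """Summarize a gene by the gRNA log2 fold-change of largest magnitude."""
    return max(log2fcs, key=abs)


def gene_effect_mean(log2fcs):
    """Summarize a gene by the mean log2 fold-change over all its gRNAs."""
    return sum(log2fcs) / len(log2fcs)


# Hypothetical gene targeted by four gRNAs, of which only one edits
# efficiently and produces a phenotype.
example_log2fcs = [0.0, 0.1, -0.05, 1.6]

largest = gene_effect_largest(example_log2fcs)  # keeps the one real effect
average = gene_effect_mean(example_log2fcs)     # diluted by inactive gRNAs
```

      In this toy case the mean is roughly a quarter of the single strong effect, which is the dilution the response describes; the same property makes the maximum more exposed to a single outlier gRNA.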

      8) Phenotyping is performed directly after editing, when the base editor is still present in the cells and could still interact with target sites. I could imagine this could lead to reduced levels of the proteins targeted for mutagenesis as it could act like a CRISPRi transcriptional roadblock. Could this enhance some of the effects or alter them in case of some missense mutations?

      To reduce potential “CRISPRi-like” effects of the base editor on gene expression, we placed the base editor under a galactose-inducible promoter. For both the fitness and protein screens we grew the cultures in media without galactose for another 24 hours (fitness screen) or 8-9 hours (protein screens) before sampling. In the latter case, this recovery time corresponded to more than three cell divisions, after which we assume base editor levels to have strongly decreased, and therefore to no longer interfere with transcription. This is also supported by our ability to detect discordant effects of gRNAs targeting the same gene (e.g., the two mutations leading to loss-of-function and gain-of-function of RAS2), which would otherwise be overshadowed by a CRISPRi effect.

      9) I feel that the main text does not reflect the actual editing efficiency very well (the main numbers I noticed were 95% C to T conversion and 89% of these occurring in a specific window). More informative for interpreting the results would be to know what fraction of the alleles show an edit (vs wild-type) and how many show the 'complete' edit (as the authors assume 100% of the genotypes generated by a gRNA to be conversion of all Cs to Ts in the target window). It would be important to state in the main text how variable this is for different gRNAs and what the typical purity of editing outcomes is.

      We now show the editing efficiency and purity in a new figure (Figure 1B), and discuss it in the main text as follows: “We found that the target window and mutagenesis pattern are very similar to those described in human cells: 95% of edits are C-to-T transitions, and 89% of these occurred in a five-nucleotide window 13 to 17 base pairs upstream of the PAM sequence (Figure 1A; Figure 1–S2) (Komor et al., 2016). Editing efficiency was variable across the eight gRNAs and ranged from 4% to 64% if considering only cases where all Cs in the window are edited; percentages are higher if incomplete edits are considered, too (Figure 1B).”
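      For readers who want to see what the "complete" edit means in practice, the following sketch converts every C within the reported window to T. It assumes a protospacer written 5'→3' on the edited strand with the PAM immediately 3' of it, and counts positions upstream from the PAM; the function name and the example sequence are hypothetical and not taken from the manuscript.

```python
def predicted_complete_edit(protospacer, window=(13, 17)):
    """Convert all Cs to Ts in the window `window` bp upstream of the PAM.

    Assumption: `protospacer` is 5'->3' and its last base sits 1 bp upstream
    of the PAM, so position k upstream corresponds to index len - k.
    """
    nearest, farthest = window
    n = len(protospacer)
    start, end = n - farthest, n - nearest + 1  # half-open slice [start, end)
    if start < 0:
        raise ValueError("protospacer is shorter than the editing window")
    edited_window = protospacer[start:end].replace("C", "T")
    return protospacer[:start] + edited_window + protospacer[end:]


# Hypothetical 20-nt protospacer; Cs outside the window stay unchanged.
print(predicted_complete_edit("GACCCATCGATTTGGACTCA"))
```

      Incomplete edits, which the response notes are also observed, would correspond to converting only a subset of the Cs in that window.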

      Comments regarding findings

      10) It would be nice to see a comparison of the results to the effects of ~1500 yeast gene knockouts on cellular transcriptomes (https://doi.org/10.1016/j.cell.2014.02.054). This would show where the current study extends established knowledge regarding the regulatory inputs of each protein and highlight the importance of directly measuring protein levels. This would be particularly interesting for proteins whose abundance cannot be predicted well from mRNA abundance.

      We agree with the reviewer that it would be very interesting to compare the effect of perturbations on mRNA vs protein levels. We have compared our protein-level data to mRNA-level data from Kemmeren and colleagues (Kemmeren et al., Cell 2014), and we find very good agreement between the effects of gene perturbations on mRNA and protein levels when considering only genes with q < 0.05 and Log2FC > 0.5 in both studies (Pearson’s R = 0.79, p < 5.3e-15).

      Gene perturbations with effects detected only on mRNA but not protein levels are enriched in genes with a role in “chromatin organization” (FDR = 0.01; as a background for the analysis, only the 1098 genes covered in both studies were considered). This suggests that perturbations of genes involved in chromatin organization tend to affect mRNA levels but are then buffered and do not lead to altered protein levels. There was no enrichment of functional annotations among gene perturbations with effects on protein levels but not mRNA levels.

      We did not include these results in the manuscript because there are some limitations to the conclusions that can be drawn from these comparisons, including that our study has a relatively high number of false negatives, and that the genes perturbed in the Kemmeren et al. study were selected to play a role in gene regulation, meaning that differences in mRNA-vs-protein effects of perturbations are limited to this function, and other gene functions cannot be assessed.

      11) The finding that genes that affect only one or two proteins are enriched for roles in transcriptional regulation could be a consequence of 'only' looking at 10 proteins rather than a globally valid conclusion. Particularly as the 10 proteins were selected for diverse functions that are subject to distinct regulatory cascades. ('only' because I appreciate this was a lot of work.)

      We agree with this, and we think it is clear in the abstract and the main text of the manuscript that here we studied 11 proteins. We made this point also more explicit in the discussion, so that it is clear for readers that the findings are based on the 11 proteins and may not extrapolate to the entire yeast proteome.

      Reviewer #3 (Public Review):

      This manuscript presents two main contributions. First, the authors modified a CRISPR base editing system for use in an important model organism: budding yeast. Second, they demonstrate the utility of this system by using it to conduct an extremely high-throughput study of the effects of mutation on protein abundance. This study confirms known protein regulatory relationships and detects several important new ones. It also reveals trends in the type of mutations that influence protein abundances. Overall, the findings are of high significance and the method appears to be extremely useful. I found the conclusions to be justified by the data.

      One potential weakness is that some of the methods are not described in the main body of the paper, so the reader has to really dive into the methods section to understand particular aspects of the study, for example, how the fitness competition was conducted.

      We expanded the first section for better readability.

      Another potential weakness is the comparison of this study (of protein abundances) to previous studies (of transcript abundances) was a little cursory, and left some open questions. For example, is it remarkable that the mutations affecting protein abundance are predominantly in genes involved in translation rather than transcription, or is this an expected result of a study focusing on protein levels?

      We thank the reviewer for pointing out that this paragraph requires more explanation. We expanded it as follows: “Of these 29 genes, 21 (72%) have roles in protein translation—more specifically, in ribosome biogenesis and tRNA metabolism (FDR < 8.0e-4, Figure 5C). In contrast, perturbations that affect the abundance of only one or two of the eleven proteins mostly occur in genes with roles in transcription (e.g., GO:0006351, FDR < 1.3e-5). Protein biosynthesis entails both transcription and translation, and these results suggest that perturbations of translational machinery alter protein abundance broadly, while perturbations of transcriptional machinery can tune the abundance of individual proteins. Thus, genes with post-transcriptional functions are more likely to appear as hubs in protein regulatory networks, whereas genes with transcriptional functions are likely to show fewer connections.”

      Overall, the strengths of this study far outweigh these weaknesses. This manuscript represents a very large amount of work and demonstrates important new insights into protein regulatory networks.

    1. Author Response

      Reviewer #2 (Public Review):

      In this paper, the authors identify topological metrics in gene-regulatory networks that have the potential to predict the sub-types of phenotypic steady states that the network can lead to. The results hold great value for the field of Theoretical Systems Biology.

      The paper becomes too technical too quickly and assumes a lot of knowledge from the reader. Equations and theoretical concepts are not always well defined. In general, I would recommend connecting the results from the simulations/topology metrics to EMP biology earlier in the paper. Alternatively, rather than investigating 5 networks related to EMP, the generalization of the statements could become stronger if the authors explore the trends of the theoretical analysis in networks modeling other biological processes (such as SCLC).

      One of the main findings of the paper is that the distance between the matrix of correlation values between nodes in all steady states obtained from simulation and influence matrix indicates that the mean group strength is a good quantity to identify teams of nodes in the network. However, it remains unclear how to identify groups/teams in the networks based on influence: is it unsupervised (hierarchical?) clustering? How do the authors identify the number of teams of nodes in randomized networks?

      The authors also explore whether team structure correlates with the stability of relevant biological phenotypes. To characterize stability, they define static (e.g., frustration and steady state frequency) and dynamic network metrics (e.g., coherence and higher-order perturbations), and correlate them to the mean group strength in both WT and randomized networks. Results are promising: team structure and group mean strength show interesting correlative trends with both the static and dynamic metrics. However, everything relies on the mean group strength, which as mentioned before is not convincingly defined in randomized networks.

      Taken together, the conclusions of this paper would be better supported if a better explanation of team identification in gene-regulatory networks would be provided, and if networks related to other biological processes would be investigated.

      We thank the referee for their encouraging remarks and valuable suggestions about improving the manuscript. We are excited that the referee finds our results promising and of great value to the field of theoretical systems biology. Following the suggestions given here, we have now included further clarification on various aspects, included results for regulatory networks of melanoma and small cell lung cancer (SCLC, Fig 9, S11), and described in detail the algorithm used to identify teams in a given network (Methods).

    1. Author Response

      Reviewer #3 (Public Review):

      The manuscript by Barr et al., investigates the molecular phenotype, regulation by type 2 immunity, and function, of ectopic tuft cells that appear in the lungs of mice recovering from infection by the mouse-adapted PR8 strain of influenza A virus. They use reporter mice and either bulk or single cell RNA sequencing to reveal the molecular heterogeneity among tuft cells present in lungs of mice 43 days after PR8 infection. Lineage tracing using a Krt5-CreER driver line was used to demonstrate the basal cell origin of ectopic tuft cells and mice harboring homozygous null alleles for either Pou2f3, Trpm5, IL4Ra or IL25, were evaluated to determine roles for tuft cells and type 2 immunity in regulation of dysplastic epithelial remodeling. Their data confirm that ectopic tuft cells are derived from dysplastic Krt5-expressing cells that appear following PR8 infection, that pre-existing tuft cells play no role in basal cell dysplasia, and that ectopic tuft cells derived from dysplastic basal cells play no role in lung remodeling. Furthermore, they show that neither type 2 cytokines nor IL25, an upstream regulator of type 2 immune responses, play roles in regulating the pulmonary response to PR8 infection. Finally, they show that tuft cells are also induced in the lungs of bleomycin-injured mice and that the presence of tuft cells in alveolar regions of PR8-infected mice does not influence the inability of dysplastic basal cells to assume alveolar epithelial cell fates. The manuscript is well written and experiments were performed with rigorous experimental design and data of high quality. However, even though findings have potential importance and could be of interest, results seem preliminary and lack a strong rationale.

      Major concerns:

      1) Studies of tuft cells in the gut and their response to type 2 immunity, which were the basis for this line of investigation into ectopic tuft cells in the PR8-infected lung, have shown that tuft cells are part of a feed-forward loop leading to tuft cell expansion and enhanced type 2 immune responses including increased abundance of goblet cells. Since ectopic pulmonary tuft cells are derived from dysplastic basal cells after PR8 infection, rather than the reverse, this is clearly not the case in lungs of PR8 infected mice. Furthermore, since tuft cells are derived from hyperplastic basal cells in lungs of PR8-infected mice, it would seem unlikely that they impact the extent of basal cell hyperplasia.

      Ultimately the reviewer is correct in that the mechanisms at play in the post-flu lung promoting ectopic tuft cell expansion are clearly distinct from those in the small intestine. However, this was not a foregone conclusion, especially given that similar Type 2-dependent mechanisms clearly have a role in brush cell (now also termed tuft cell) expansion in the trachea. Regarding tuft cell influence on basal cell hyperplasia, we originally hypothesized that tuft cells differentiating from the migrating, proliferating basal cells may act in a feed-forward fashion to promote continued proliferation of the basal cells, akin to what happens upon tuft cell activation in the intestine. Nevertheless the Reviewer is correct in that our results show that basal cell hyperplasia is independent of tuft cell differentiation, and we feel this is valuable information for the field.

      2) Tuft cell expansion following parasitic infection of the gut and associated type 2 inflammation, and basal cell differentiation into tuft cells leading to their increased abundance following lung injury, are distinct processes and likely to be regulated through distinct mechanisms. As such, the rationale for investigating the roles of type 2 cytokines in the regulation of tuft cell appearance is rather weak. In the absence of data demonstrating how basal to tuft cell differentiation is regulated, this component of the study seems preliminary.

      Amplification of tuft cells in the small intestine (Gerbe et al., 2016; Howitt et al., 2016; von Moltke et al., 2016) and upper airways (Ualiyeva et al., 2021, Bankova et al., 2018) is either totally dependent on or highly influenced by Type 2 cytokines, respectively. Accordingly, it was critical to examine whether a similar mechanism was at play in the lung after influenza injury, i.e. promoting tuft cell amplification downstream of Type 2 cytokines. While our findings demonstrate that post-flu tuft cells arise largely independent of Th2 signals, new findings in other tissues published after submission of the current manuscript do indeed demonstrate Th2 / ILC2-independent functions of tuft cells (O’Leary et al., DOI: 10.1126/sciimmunol.abj1080). Our findings support the existence of novel mechanisms regulating tuft cell differentiation, and as the Reviewer suggests, we hope to uncover these mechanisms in future work.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors here follow up on roles for signaling pathways like ERK in epithelial patterning that have been studied in an emerging literature in both, broadly, the cell competition field and, more specifically, in mouse intestinal organoids. They employ timelapse microscopy to study behavior of human colonic organoids in monolayers as the organoids initially self-organize. They then follow maintenance of organization into densely clustered nodes that have increased cells in cell cycle and the remaining more sparsely populated regions with fewer cycling cells. Nodes also show markers of in vivo colonic stem cells (Lgr5 and myc). They follow propagation of ERK waves using a genetic tool (ERK-KTR) and show that they can emerge from single apoptotic cells in between nodes.

      Strengths of the study include novelty of showing self-organization and behavior of human organoids over time, with good resolution, using microscopy, as well as sophisticated analysis techniques to interpret and present cumulative data over many experiments. Additionally, the paper adds important pieces of the puzzle with respect to how cells may compete and respond across an entire monolayer, and the tools and approaches lend themselves to studying many genes and signaling pathways besides simply Wnt vs ERK.

      Weaknesses in the current version of the manuscript:

      1) The manuscript is focused nearly exclusively on ERK and Wnt but not in terms of the broader context of interpretation of the response of a monolayer to apoptosis of single cells. Some of the original work in the field showed that apoptotic cells enacted Rho- and MLCK-dependent actomyosin contractility, which was proposed to signal neighboring cells by initially pulling them inwards via the contraction (PMIDs: 9456322, 10459006, 11283606, 21721944). But a more intestine-specific literature has long been extant following up on the critical role of ROCK and MLCK in maintaining barrier after specifically intestinal-cell apoptosis (15825080, 21237166).

      -- A suggestion would be 1) to cite the relevant literature and 2) to interpret some of the experiments within the cytoskeletal mechanistic context already known. In addition to comments about PMA and ERK activation (see next point), the authors could test whether the ERK waves cause myosin II activation and/or are ROCK/MLCK-dependent. Given ROCK inhibition is frequently used in organoid culture, this would seem an obvious avenue to explore. Does the ERK wave propagate the cytoskeletal changes to close the gap and increase centrifugal motility and/or conversely does the actomyosin tugging of the apoptotic cell trigger ERK activation? (Admittedly, the latter question may be hard to address). In short, there is a lot known about monolayer behavior in terms of dynamic cytoskeletal changes that can be addressed here to integrate with the Wnt/ERK roles.

      We completely agree that contractility and the cytoskeleton play vital roles in this process. We have added a section on this in the discussion and cited the relevant literature you suggested. We have conducted an unbiased screen for Erk wave dynamics and have several novel hits related to the mechanical aspect of this process. We are currently validating these hits and feel it would be too preliminary to include here. We are preparing a separate study that will focus on the role of mechanical signaling during Erk wave propagation.

      2) The authors use only PMA as an ERK activator. PMA is a broadly acting drug, principally known as a PI3K inducer. Obviously, Akt and other downstream action of PI3K means many other pathways are stimulated besides ERK. Indeed, ROCK and Src and other cytoskeleton-modifying pathways are modulated by PMA that may not correlate with the ERK effects. Additionally, the movies showing the effects of PMA treatment show a striking increase in apoptotic cells throughout the field, which would obviously confound the interpretation of what happens after relatively rare, internodal apoptotic cells die.

      -- A strong suggestion would be to increase the routes to ERK activation the authors use. This could be via receptor tyrosine kinase stimulation (again, like ROCK, EGF is a key organoid medium component), though obviously that would not be much more specific than PMA, but the authors use EGFr inhibition to block ERK, so wouldn’t stimulation be an apt converse approach? Genetic constitutively active KRAS might be introduced. Alternatively, there are pharmacological ways to increase pERK dramatically by inhibiting the dual action phosphatase (see eg PMID: 30475204 in a previous eLife paper). At the least, it would seem the authors should not use an approach that increases apoptosis dramatically.

      This is a great suggestion. We have added an additional figure describing a set of experiments that activate Erk through the expression of an oncogenic KRAS allele (G12V) under control of doxycycline. This resulted in increased uncoordinated Erk activity and loss of nodes. Further, we show that the Wnt inhibitor Pyrvinium also increased Erk activity in organoid monolayers and led to node loss. Consequently, we have tested three independent activators of Erk, all of which led to loss of the proliferative/stem cell niche.

      3) The movies clearly show many dividing cells that are between nodes, and they show apoptotic cells within nodes (eg movie 3a towards the end). While it's clear that apoptotic cells in internodal regions can elicit the wave behavior, it would seem that apoptosis does not universally do this, given the counter-examples.

      -- It would help if the authors could speak to this. Namely, in what cases are there no waves after apoptosis and what are the factors that might contribute (nearness to a node? nearness in time and space to another apoptotic cell?). Presumably, the events are relatively stochastic so there would be occasions for non-stereotypical behavior like wave front interference or augmentation in the case of closely located apoptotic cells.

      We agree. As shown in movie 3a, there are occasional cell death events in the proliferative region of organoid monolayers. We observed that these cell death events did induce waves but were less frequent compared to non-proliferative regions, as quantified in figure 3H. Cells within the proliferative compartment also contain elevated Wnt signaling, as shown by Top-GFP signal in figure S6 and LGR5 staining in figure 2B. The margin of the proliferative compartment is also the region where Erk waves tend to die off. Our hypothesis is that Wnt largely suppresses apoptosis and Erk waves.

      Reviewer #2 (Public Review):

      The work by Pond, et al., uses patient derived organoid monolayers to interrogate MAPK signaling in real-time using an ERK reporter. This technology was developed previously to use a target domain of ERK that responds to phosphorylation by altering nuclear-cytoplasmic localization. The active ERK kinase can be inferred by cytoplasmic localization of the reporter. The premise of the paper is that this reporter can be used in human organoid cultures to understand ERK signaling dynamics. Figures 1 and 2 demonstrate the monolayer culture properties and how stem-like and differentiated domains form within the cultures, validated using RNA FISH for MYC, LGR5, and KRT20. Figure 3 describes how an ERK wave radiates out from an apoptotic cell in the cultures, and that the living cells migrate towards the dying cell, presumably to sustain a barrier. In figure 4, data is presented showing that PMA-mediated activation of ERK disrupts the patterning of the monolayers, dispersing the nodes of cells associated with stem/proliferative identity. Finally, in figure 5, the authors show that treating cultures with Wnt3a suppresses ERK activity, while inhibiting ERK may expand WNT/stem cells in the cultures.

      The study is interesting and the model system has a lot of potential.

      However, there are some concerns about the novelty. The reasons for this are:

      1 - the monolayer system has been demonstrated before, very nicely in a 2018 Dev. Cell paper from the Altschuler lab and one of the current manuscript authors.

      2 - ERK-KTR reporters have been used to demonstrate apoptosis induced signaling waves in the epithelium (Gagliardi, 2021, Dev. Cell.)

      3 - ERK activity suppressing stem cell fate has been documented previously (Riemer, 2015; Leach, 2021; Reischmann, 2020; Tong, 2017)

      So while there are exciting aspects of the work, including use of human tissues and live imaging of pathway dynamics, I feel that the novel discoveries using these technologies are somewhat limited.

      Point 1: We agree that the Thorne 2018 paper showed the feasibility of 2D enteroid monolayers using mouse small intestine, yet it was not obvious that this approach would translate to human organoid models. We have demonstrated that this approach can be used for patient derived organoids from human colon, which contributes greatly to the translational potential. Additionally, a major challenge with organoids is tracking cells in space and time in 3D culture condition. We have shown that these primary cultures can be combined with lentiviral live kinase reporters and are amenable to long term culture for the study of single cell dynamics of heterogenous organoid cultures without laborious 3D image analysis.

      Points 2 and 3: We agree KTRs are a well-known and useful tool for studying single cell kinase dynamics. In mammalian cell lines (Gagliardi 2021) and Drosophila epithelium (Valon 2021), Erk waves driven by apoptosis were reported to prevent apoptosis in nearby cells and instruct movement to prevent barrier disruption. Here, we showed that Erk waves affect the patterning of the differentiated and stem cell compartments. Our work establishes that 1) Erk waves are found in human colonic epithelium, 2) these waves affect the patterning of the differentiated and stem cell compartments, and 3) Erk wave signaling is a fundamental part of human colonic epithelial homeostasis. The novelty of this report is connecting apoptosis-driven Erk dynamics to spatial partitioning of cell fates.

    1. Author Response

      Reviewer #1 (Public Review):

      The main result of the paper is a statistical dependence between the evolved size control strategy and the structure of the cell cycle, in that size control that manifests early (later) in the cell cycle tends to give adder- (weakly sizer-) like strategies. Notably, even when the final evolved network shows weak adder or weak sizer-like behaviour, they find strong sizer-like control in the evolutionary transient. Finally, they constrain the evolutionary algorithm to sense cell size only through stochastic fluctuations of protein concentrations and uncover a strategy that exhibits hallmarks of self-organised criticality.

      The questions studied by the authors are both interesting and timely, and their results are intriguing and well documented. On the whole, the conclusions are convincingly argued, and the authors do an excellent job of extracting qualitative features from their evolved networks. However, the manuscript is a little difficult to read, with the figures being crowded and difficult to parse. In addition, while there is a lot of detail in some places (as in the description of one particular feedback control strategy), other results are less fleshed out (such as statistical summaries of the different simulations). The manuscript would benefit from a sharper presentation of the results.

      We have done our best to tighten the writing and better focus on the main results of the paper. We have done this in response to the specific criticisms of the reviewers. However, most of the comments indicated that our manuscript was rather dense, so that important points had been lost. Therefore, in the revision, we have mostly focused on increasing the clarity rather than condensing our prose further.

      A particularly interesting question addressed in the paper is why adders are more commonly found when sizers are believed to be better at controlling cell size. Here, the authors' simulations give two answers: first, that sizers tend to appear when cell size control is exerted later in the cycle (as in S. pombe). Second, that even when adders eventually evolve, the evolutionary transient passes through a strong sizer strategy. As the adder-vs-sizer question is repeatedly raised, it would strengthen the paper to have a longer and sharper discussion on (a) why early cell size control favours adders, and (b) why sizers appear as transients when fluctuations in cell size are large?

      We now clarify these key points and extend our discussion. The question as to why sizers appear as transients when fluctuations in cell size are large is more complex. We see repeatedly that sloppy sizers evolve first, but these sizers are not necessarily that good at giving a low CV. Then, as the system continues to evolve, adders appear that are better at reducing CV than the noisy sizers. This emphasizes that the contribution to reducing the CV comes from two parts: first, the slope contribution defining the relationship between the amount of growth in the cell cycle and the cell size at birth, and second, the amount of noise in this process, i.e., how variable the result will be for two cells born the same size. The system proceeds from a noisy sizer to a less noisy adder while reducing the CV as selected for. Thus, we speculate that in the later stages of evolution, where the system has already significantly reduced cell size variability, the ability to more accurately perform size control with less noise reduces the selection pressure on the slope so that adders tend to emerge. To address the comment, we have extended our discussion as to why early cell size control favors adders. We have broken the penultimate paragraph in the discussion into two parts where we now write:

      “Our evolution simulations gave insight into factors that bias evolution towards sizer or adder type control mechanisms (Fig. 4). First, it is worth noting that our evolution simulations were not deterministic. There was no one-to-one correspondence between a given evolutionary pressure and any one specific cell size control mechanism. Rather, our claims represent an average behavior observed over the course of many simulations. It is first worth noting that size control, as measured by the CV at a particular point in the cell cycle, has contribution both from the slope of the correlation between cell size and the amount of cell growth and from the amount of noise characterizing the differences between cells that are initially the same size (Di Talia et al., 2007). It is therefore possible that a low noise adder can produce a lower CV than a higher noise sizer. This is reflected in the evolutionary paths of some of our simulations, which traverse from a noisy sizer to a less noisy adder (Fig. 5). However, we anticipate even noisy sizers will be better than adders at controlling cell size in response to large deviations away from the steady state distribution. This is because sizers will always return the cell size to be within the steady state distribution within a cell cycle.

      In the selection of a size controlling G1 network followed by a timer in S/G2/M, we observed a prevalence of adders that is consistent with the prevalence of adders reported in the literature. While fewer in number, sizers have also been observed. That the most accurate sizers have been observed in the fission yeast S. pombe (Fantes, 1977; Sveiczer et al., 1996; Wood & Nurse, 2015), and that this organism performs cell size control at G2/M rather than at G1/S led us to explore the effect of cell cycle structure on the evolution of cell size control. We found that controlling cell size later in the cycle in S/G2/M biases evolution away from adders and towards sizers. In retrospect, this result can be rationalized since any size deviations incurred earlier during the timer period can be compensated for by the end of the cycle with the sizer. However, when the order is inverted, any size deviations escaping a G1 control mechanism would only be amplified by exponential volume growth during the S/G2/M timer period. A second recent case exhibiting sizer control was found in mouse epidermal stem cells, which exhibit a greatly elongated G1 phase and a relatively short S/G2/M phase (Mesa et al., 2018; Xie & Skotheim, 2020). We found that if we increased the relative duration of G1 in our simulations by shortening the S/G2/M timer, we also see a bias towards sizer control. In essence, by extending G1 to a larger and larger fraction of the cell cycle the control system is gradually approaching a size control taking place at the end of the cell cycle, i.e., an S/G2/M size control. Taken together, these simulations suggest the principle that having size-dependent transitions later in the cell cycle selects for sizers, while having such transitions earlier selects for adders.”
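      The statement that a low-noise adder can produce a lower CV than a noisier sizer can be checked with a minimal sketch. It is not the evolved network from the manuscript but a textbook-style linear map, assumed here for illustration: division size is slope × birth size plus a constant plus independent additive noise, followed by symmetric division. In that map, slope = 0 corresponds to a sizer and slope = 1 to an adder, and the steady-state variance of birth size is noise_sd² / (4 − slope²). The function name and parameter values are hypothetical.

```python
import math


def birth_size_cv(slope, noise_sd, mean_birth_size=1.0):
    """Steady-state CV of birth size for an assumed linear size-control map.

    Assumed model (not the manuscript's evolved network):
        division_size = slope * birth_size + constant + noise
        next_birth_size = division_size / 2
    Solving Var = (slope**2 * Var + noise_sd**2) / 4 for the stationary
    variance gives Var = noise_sd**2 / (4 - slope**2).
    """
    if not 0 <= slope < 2:
        raise ValueError("the map is only stable for 0 <= slope < 2")
    return noise_sd / (mean_birth_size * math.sqrt(4 - slope**2))


noisy_sizer = birth_size_cv(slope=0.0, noise_sd=0.20)
quiet_adder = birth_size_cv(slope=1.0, noise_sd=0.10)
print(f"noisy sizer CV = {noisy_sizer:.3f}, quiet adder CV = {quiet_adder:.3f}")
```

      At equal noise the sizer gives the lower CV in this map, so the adder only wins when its noise is sufficiently smaller, which is the two-part contribution (slope and noise) described in the response.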

      The final part of the paper, which describes a strategy based on sensing size through concentration fluctuations, is very interesting but brief, which is understandable given the quantity of results presented earlier in the paper. Nonetheless, it provides an excellent example of the power of the authors' approach.

      Overall, the results in this paper are a compelling addition to the recent interest in cell size control.

      We thank the reviewer for their careful reading of our manuscript and their support.

      Reviewer #2 (Public Review):

      The use of evolutionary models to understand the emergence of cell size control is novel and interesting. One strength of the approach is that simulations do not impose any mechanistic model for cell size control, rather the feedback motif for size control emerges from optimisation of chosen fitness functions. This allows the authors to come up with various size control motifs for given evolutionary pressures and model rules. Interestingly, the authors find that there is no one-to-one correspondence between specific size control mechanisms and evolutionary pressures, rather size control mechanisms are dependent on cell cycle structures. The authors also evolve a size control model based on the sensing of protein concentration fluctuations. This model exhibits interesting features such as self-organized criticality and the existence of very large cells that achieve size homeostasis by undergoing rapid cell divisions. The authors' model, however, comes with many arbitrary choices and assumptions that need further justifications and theoretical results should be compared with experimental data to establish the applicability of the model.

      We thank the reviewer for their careful reading of our manuscript and have worked to address its previous shortcomings as described below.

      Major Comments:

      1) Fitness function choices: Two fitness functions are used for the majority of this paper, number of cell divisions and CV_birth. What motivates the choice of these fitness functions and how do they relate to single-cell fitness?

      We added some text describing the choice of fitness function in the Supplement in the S3A - Fitness subsection. Using the number of cell divisions as a fitness makes sense since the higher the number of divisions in a given window of time, the bigger the population, which corresponds to the classical Darwinian fitness. Adding CV as an extra fitness specifically pushes the system towards better size control, which is the problem we aim to study, and also helps the optimization process. This is an effective way to include in our simulated evolution all the detrimental effects observed when cell size is not controlled well. In the methods section we write:

      “We impose two evolutionary selection pressures in the form of two fitness functions. The first fitness function is simply the number of cell divisions during a long period, which we call NDiv . This is consistent with the classical definition of fitness as optimizing the number of offspring and is to be maximized by the algorithm. The second fitness function is the coefficient of variation of the volume distribution at birth for those NDiv generations, which we call CVBirth and is to be minimized by the algorithm. This penalizes broad distributions of volume at birth, which are detrimental to cell size homeostasis, which is what we aim to examine here.”
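
      As a minimal sketch of how these two quantities could be computed from a simulated lineage, the snippet below returns the number of divisions and the CV of volume at birth, and compares two candidate networks by Pareto dominance. The function names (`fitness`, `dominates`) and the use of a plain list of birth volumes are assumptions made for illustration; they are not taken from the manuscript's code.

```python
import statistics


def fitness(birth_volumes):
    """Return (n_div, cv_birth) for one simulated lineage.

    birth_volumes holds one volume at birth per division recorded
    during the simulation window (hypothetical input format).
    """
    n_div = len(birth_volumes)
    mean = statistics.fmean(birth_volumes)
    # Coefficient of variation: population standard deviation over the mean.
    cv_birth = statistics.pstdev(birth_volumes) / mean
    return n_div, cv_birth


def dominates(a, b):
    """True if fitness pair a Pareto-dominates fitness pair b.

    n_div is to be maximized and cv_birth is to be minimized.
    """
    n_a, cv_a = a
    n_b, cv_b = b
    no_worse = n_a >= n_b and cv_a <= cv_b
    strictly_better = n_a > n_b or cv_a < cv_b
    return no_worse and strictly_better
```

      A network that divides more often and has a narrower birth-size distribution dominates one that is worse on both counts; two networks that each win on one criterion are left unordered, which is what allows both pressures to act at once.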

      Since the selection for tight size distribution is enforced via minimization of CV_birth, the model is unlikely to explain the timer control that is observed in some parts of the cell cycle. The authors discuss how a single fitness function results in all-or-nothing selection in the evolutionary algorithm, however, a third simultaneous fitness function is not considered. Are the results of this paper robust with respect to the addition of other selection pressure (for instance, optimization of growth rate)? This is a crucial question that is not addressed in the text.

      While we could always add more fitness functions, we have to start somewhere. The two fitness functions we use make most sense for the problem we are interested in, and allow us to obtain some clear results from the examination of an already complex starting point. Adding more than two fitness functions greatly increases the complexity of the problem. In fact, we are not aware of any work in the field of computational evolution using more than two fitness functions. One reason is that simulated evolution under control of two fitness functions is already not well understood in general (as we discussed previously in Francois & Siggia, Physical Biology 2008; Henry et al Plos Comp Bio 2018). We hope our simulations will inspire other work in this direction.

      2) Cell-cycle structure not considered to be changeable in evolution: Based on the presented details of the evolutionary algorithm, the network topology parameters are varied but not the temporal structure of the cell cycle, i.e. timer in G1/S and sizer S/G2/M or sizer in G1/S and timer in S/G2/M, etc. How do you justify evolution in one part of the cell cycle but not in the other? Do your results hold when the temporal structure is permitted to evolve?

      We are very interested in how the network structure affects the results. To address this point, we did invert size-dependence of the cell cycle phases as suggested by the reviewer, i.e., we considered a fission yeast-like network with a timer in G1 and a sizer in S/G2/M (see Figs. 4, 5, and S10). The possibilities for performing different types of evolution experiments are almost endless. We therefore restricted our examination to cases inspired by naturally occurring networks in well studied model organisms such as budding and fission yeasts. While it is in principle possible that size control could take place in multiple cell cycle phases, we do not yet know of a naturally occurring example and so chose not to explore this possibility at the present time. Nevertheless, the reviewer is raising a very interesting question as to why evolution selecting for cell size control tends to pick one or another cell cycle phase, but possibly not both, in a particular organism. We do not know the answer to this question at present and refrain from attempting to address it since our manuscript is already quite dense. Future work can explore this interesting direction.

      3) Noise sources: The authors consider noise in protein quantity or concentration while neglecting noise in growth rate or division. Can the assumption that growth noise is negligible compared to protein production noise be supported by experimental data? This is a crucial assumption that is not supported by a discussion of physical values or citations. In addition, it is assumed later in the supplement (S132-133) that there is no division noise without presenting justification for why that noise is negligible on the scale of protein production noise.

      As for many other points raised by the referee, there is a necessary balance to achieve between biochemical realism and simplifying assumptions to theoretically study such problems. Of course we fully agree with the reviewer that there are multiple sources of noise in the system. In this study, we chose a hierarchical way of introducing noise in the system, starting with the biggest contributing factor and incrementally adding sources of noise if needed. We chose to first focus on noise in the cell cycle phases themselves whose CV can be as high as 50% (cf Fig. 1 in Di Talia et al 2007 Nature). For this reason, we first introduced noise in the precise timing of the G1/S transition as well as in the timing of the S/G2/M phase duration. Next, we introduced protein production noise because it is larger than the noise associated with cell division and cell growth rate in several cases where it has been measured. For example, the CV of cell growth rate in a diploid budding yeast is ~14% (Di Talia et al 2007 Nature; cf Table S12). The noise in partitioning at cell division is easier to measure in symmetrically dividing cells. For human cells grown in culture, division noise is ~10% (cf Fig. 3G in Zatulovskiy et al 2020 Science). In contrast, noise in protein concentrations is typically higher. This can be seen in the examination of molecular noise across all GFP labeled proteins in budding yeast (Newman et al, Nature 2006, PMID: 16699522). The CV in concentration of regulatory proteins in similarly sized cells is ~20-30% which is larger than noise in division by partitioning or noise in cell growth rate. We therefore next focused our analysis on the effects of protein production noise.

      In revising our manuscript, we now also consider noise in cell growth rate and noise in partitioning of mass at division as suggested by the reviewer. This results in slightly weaker size control and more noise, in line with our intuition. However, broadly speaking, our results are unchanged (see new supporting Figs. S6-S7 shown below). We now describe the logic of our series of simulations of increasing complexity in the methods section, which has two new paragraphs that read as follows: “In this study, we chose a hierarchical way of introducing noise in the system, starting with the biggest contributing factor and incrementally adding additional sources of noise in subsequent analyses. All simulations presented include noise (stochastic control of G1/S transition and timing of S/G2/M, see below) in the cell cycle phases, whose CV has been found to be as high as 50% (Di Talia et al., 2007). Then, we introduced protein production noise via Langevin noise because the CV of regulatory protein concentrations is typically 20-30% (Newman et al., 2006). Importantly, the cell volume also contributes to stochastic effects, which are larger in smaller cells with fewer molecules. Thus, for stochastic simulations, we include a multiplicative 1/√V contribution to the added Gaussian noise term (see more complete description in the Supplement).

      We also checked that our results are largely invariant when adding other sources of noise (see Figs. S5-S7). In these simulations, we also included noise in cell growth rate (CV ~15%; e.g., Di Talia et al., 2007) and in mass partitioning at cytokinesis (CV ~10%; e.g., Zatulovskiy et al., 2020).”
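
      The volume-dependent noise term can be illustrated with a single Euler-Maruyama update for a protein concentration. This is only a sketch under stated assumptions: the drift (constant synthesis, first-order degradation, dilution by growth) and the parameter names `synth`, `deg`, `growth`, and `sigma` are hypothetical, and the exact form used in the simulations is the one given in the Supplement.

```python
import math
import random


def langevin_step(c, volume, dt, rng, synth=1.0, deg=0.5, growth=0.1, sigma=0.2):
    """Advance concentration c by one time step dt.

    The Gaussian noise amplitude is multiplied by 1/sqrt(volume), so
    smaller cells (fewer molecules) experience larger fluctuations.
    All rate parameters are illustrative placeholders.
    """
    # Deterministic part: synthesis minus degradation and dilution.
    drift = synth - (deg + growth) * c
    # Stochastic part: Gaussian increment scaled by sqrt(dt) and 1/sqrt(V).
    noise = sigma / math.sqrt(volume) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    # Concentrations cannot be negative.
    return max(c + drift * dt + noise, 0.0)
```

      Under these assumptions, quadrupling the volume halves the size of the random kick for the same underlying Gaussian draw, which is the sense in which stochastic effects are larger in smaller cells.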

      4) Types of biochemical interactions considered: It is assumed that inhibitor protein production rate scales with cell volume. Is this assumption supported by data? The assumption is contrary to the production rate of the inhibitor protein Whi5 in budding yeast, which does not scale with cell volume.

      In general, most proteins are at relatively constant concentration as cells grow. This means that their production rate (measured in number of proteins per time) has to scale in proportion to cell volume. As noted by the reviewer, Whi5 in the budding yeast is an exception to the general rule where the production rate does not scale with cell volume. This is why Whi5 is diluted by growth, leading to a sizer in G1. However, allowing the network to generate size control with a diluted inhibitor starting point is basically too simple because it would start with a size sensor and does not need to evolve any feedback mechanism. Here, we are focused on exploring how cell size control can be done by a network with multiple feedbacks rather than just the concentration of a single protein. We made those points more explicit in the text, which now includes the following sentences in the methods section: “We note that we are not allowing the cell to employ proteins such as Whi5 in budding yeast whose production is independent of cell size so that its concentration is a direct readout of cell size (Schmoller et al 2015; Swaffer et al 2021). We chose to do this because we want to explore how cell size control can be done by a network with multiple feedbacks rather than just the concentration of a single protein with a special dedicated synthesis mechanism.”

      5) Comparisons to data: Currently no attempt has been made to compare the model predictions quantitatively with experimental data that are easily available. For instance, how does the CV of cell birth size predicted by the model compare with cell size distribution in budding yeast or in the fission yeast? The same goes for the scaling of added volume with initial cell volume in different phases of the cell cycle. Furthermore, the noise parameters should also be calibrated to reproduce the cell size variability seen in experiments.

      To facilitate the comparison of our evolution simulations with model organisms we have included Table S1 in the supporting material, where we show the published results for budding yeast, fission yeast, mammalian cells grown in culture, and mouse epidermal stem cells growing in the animal. In fact, it turns out that the distributions and CVs that we obtained in our simulations are relatively similar in some cases to what is observed experimentally, but can also be much lower and exhibit tighter control when optimized. However, the comparison is not perfectly fair since the model organisms were grown in laboratory conditions rather than their natural environment for which they are likely more optimized.

      Reviewer #3 (Public Review):

      In this paper, Proulx-Giraldeau et al. develop evolutionary simulations to study how size control can emerge. In the first part of the paper, the authors initiate cell cycle simulations with a simple network that does not allow cell size sensing and ask what molecular networks can lead to size control after evolution. Results show that a wide range of network types allows size control, some of which are comparable to experimentally identified networks such as the dilution inhibitor model in budding yeast. In the second part of the paper, the authors use their framework to ask how the structure of the cell cycle, including the duration of G1 vs. S/G2/M and the form of size control in each of these phases (i.e. 'sizer' or 'adder'), affects the overall size control. While this is a very important question and the authors bring comprehensive and interesting answers, it is less clear that framing the findings in the context of evolution is meaningful. Indeed, the solutions for how the combination of strength of size control, noise levels, and respective duration of the phases can be found analytically/with simulations that are not 'evolving' the cell cycle structure. Additionally, the finding that a sizer in G1 can lead to an overall adder if it is followed by a timer in S/G2/M is only true if a significant amount of noise is added during the timer phase. At present, this finding is discussed as a result of 'evolution' which is confusing and the dependency of this conclusion on the level of noise during S/G2 does not appear very clearly.

      With more cautiously formulated conclusions and a better discussion of already established theoretical and experimental work, this paper will become more accessible to experimentalists and will be a very valuable contribution to the field of cell size control.

      We thank the reviewer for their careful reading of the manuscript and their thoughtful comments.

      Major suggestions:

      1) Fig 4-5. While the use of the evolution simulation seems interesting to identify which underlying network(s) can result in size control, the use of the same framework to compare the result of sizer+timer vs. timer+sizer is less easy to interpret. Previous analytical/simulation approaches have explored how noise & duration of the timer phase can alter the 'sizer' or 'adder' signature (see doi.org/10.1016/j.celrep.2020.107992, doi.org/10.3389/fcell.2017.00092, for example) and what evolutionary simulations add to this question is unclear.’

      We thank the reviewer for pointing out this highly relevant work, which we now cite where appropriate at various places in the manuscript. We agree that several of our results could have been derived from non-evolutionary analysis as performed in this work (such as the conclusion that a sizer followed by a timer can yield an adder). However, many of our other results cannot. For example, we are interested in how a network based on constant concentrations of proteins can measure cell size. Our evolution simulations yield highly non-trivial networks which we then proceed to analyze. We now clarify the distinction between our approach using evolution simulations to the more traditional analytical approach in the discussion. We added the following text: “We note that these generic results of how sizers and adders can govern cell size homeostasis can be derived from more traditional analytical methods (Barber et al., 2017; Willis et al., 2020). However, our evolution simulations are particularly useful because the molecular networks that evolved give non-trivial insights into how the observed size homeostasis dynamics can be regulated.”

      – What is the authors' interpretation of why the optimization of Pareto vs. number of divisions yield different size control results (Fig. 4A)? Is it possible that these different fitness parameters allow for the evolution of different levels of noise/duration of the timer phase?

      This relates to what we discuss in section “A two-step evolutionary pathway for cell size control”. We think the effect is intuitive: if there is no selection on CV, there is no reason for the system to evolve good noise control in general. Then, in the absence of secondary effects such as size-dependent growth rates, networks such as the one presented in Fig. 5A are essentially optimal for the number of divisions, and are pure sizers. This is not related to the timer phase as far as we can see. We added a few words at the end of that section to make this more explicit.

      – In the conclusion: 'G1 control is more conducive to the evolution of adders, while G2 control is more conducive to sizers', do the authors really believe that this is an evolutionary acquired trait, or are their observations instead the natural consequence of having a noise-adding phase (timer + multiplicative noise) after a phase with size control?

      We agree with the reviewer: the adder is a consequence of a noise-adding phase following size control. We do not think this is necessarily an evolutionarily acquired trait. As discussed above, and now in our discussion, this result could have been found using traditional analytical approaches. That the result is similar in a computational evolution simulation is interesting because the flexibility of the PhiEvo algorithm might have allowed for different phenomenological results to emerge. That they did not do so further strengthens the intuition built up from the analytical approach.

      – A perfect sizer in G1, followed by a timer (with exponential growth) in S/G2/M would simply give an overall 'noisy sizer' (i.e. the slope of final volume vs. initial volume would still be 0 but with some variability around the slope). Only beyond a certain level of noise added in S/G2/M, would the sizer signature be lost. Would it be possible for the authors to perform simulations with different levels of noise (on the timer in S/G2) to help understand this conclusion better? This conclusion could be one of the most valuable to experimentalists studying different organisms.

      This is an excellent suggestion by the reviewer and we have performed these evolution experiments examining the effect of modulating the noise in the S/G2/M timer. We consider a CV in the timer of 0, 5, and 8%, corresponding to no, medium, and high noise respectively. The average duration of the timer is half the time it takes to double the cell’s volume. Having specified the S/G2/M timer parameters, we then evolved and selected networks as previously, and compared ensembles of 60 networks for each noise level. The results are in line with our and the reviewer’s intuition. Increasing the noise progressively leads to a loss of the sizer signature and increases the CV of cell size at birth. These results are described in a new paragraph in the results section “Modulating cell cycle structural constraints selects for sizers and adders”, which reads as: “We next considered the effect of changing the amount of noise in the timer phase of the cell cycle. To do this, we examined the evolution of networks performing size control in G1 and where the S/G2/M phase is a timer with an increasing amount of noise. Increasing the noise in the timer progressively reduced the amount of size control done by the network (Fig. S5). This is likely because the fixed duration of S/G2/M allows the system to accurately reset protein concentrations for the subsequent cell cycle to promote accurate G1 control (Willis et al., 2020). We also examined the effects of adding noise to the cellular growth rate and to volume partitioning at division and found similar results (Figs. S6-S7).”

      The results are shown in the new supporting Fig. S5.
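
      The effect of timer noise on the birth-size distribution can be illustrated with a toy model that is much simpler than the evolved networks. The sketch below assumes an idealized perfect sizer at G1/S (a fixed threshold volume), exponential growth, symmetric division, and a Gaussian S/G2/M timer whose mean is half the doubling time. The function name `birth_cv` and all parameter names are hypothetical, and the toy model does not capture how timer noise perturbs protein concentrations in the evolved networks.

```python
import random
import statistics


def birth_cv(timer_cv, n_cycles=2000, v_threshold=1.0, seed=0):
    """CV of volume at birth for a perfect G1 sizer followed by a noisy timer.

    Time is measured in units of the volume doubling time, so a cell
    grows by a factor 2**t during a timer of duration t.
    """
    rng = random.Random(seed)
    mean_timer = 0.5  # half the doubling time
    v_birth = 0.7
    births = []
    for _ in range(n_cycles):
        # Perfect sizer: G1/S occurs at the threshold volume
        # (immediately if the cell is already larger).
        v_g1s = max(v_birth, v_threshold)
        # Noisy timer: Gaussian duration, truncated at zero.
        t = max(rng.gauss(mean_timer, timer_cv * mean_timer), 0.0)
        v_div = v_g1s * 2.0 ** t
        # Symmetric division.
        v_birth = v_div / 2.0
        births.append(v_birth)
    return statistics.pstdev(births) / statistics.fmean(births)
```

      Calling `birth_cv` with timer CVs of 0, 0.05, and 0.08 mirrors the three noise levels described above. In this idealized setting the birth-size CV grows with the timer noise while the G1 threshold itself stays fixed, which corresponds to the "noisy sizer" limit raised by the reviewer.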

      2) Some aspects of the mathematical formalism were unclear: - Working with the hypothesis that growth is exponential and at a constant rate is reasonable. However, the description of the scenario where growth modulation contributes to size homeostasis is incorrect. E.g. the statement 'cells further from the optimum size grow slower' is not accurate. If size control occurs via growth regulation, what is expected is a negative correlation between size and growth rate (big cells grow slow, small cells grow fast).

      To clarify this point, we have modified the sentence to read as: “In the first class, it is crucial that the growth rate per unit mass of a cell depends on cell size so that cells that are significantly larger than the optimum cell size grow slower.”

      – 'The quantity I is produced with a rate proportional to volume, degraded at a constant rate, diluted by cell growth': why is I diluted? Concentration should be constant if I increases at the same rate as volume. 'the quantity of I does not initially depend in any way on the volume'. Does the quantity of I not increase with volume (since concentration is constant)?

      The equation for the amount of I does not have a dilution term, but the equation for the concentration of I does. This is easy to see if you consider stopping synthesis of I but continuing cell growth. In the case where I is stable, the concentration of I would decrease in proportion to the growth rate of the cell, which is the dilution term. In the case of constant synthesis of I, the concentration is indeed constant at equilibrium and reflects a balance between protein synthesis and dilution and degradation (e.g., see Eq. S4).
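
      The distinction between the two equations can be written out explicitly. The symbols below are notation assumed here for illustration and may differ from Eq. S4: $N_I$ is the amount of I, $c_I = N_I/V$ its concentration, $k$ the synthesis rate per unit volume, $\delta$ the degradation rate, and $\mu$ the exponential growth rate of the volume $V$.

```latex
\begin{align}
\frac{dN_I}{dt} &= k V - \delta N_I, \qquad \frac{dV}{dt} = \mu V, \\
\frac{dc_I}{dt} &= \frac{1}{V}\frac{dN_I}{dt} - \frac{N_I}{V^{2}}\frac{dV}{dt}
                 = k - (\delta + \mu)\, c_I, \\
c_I^{*} &= \frac{k}{\delta + \mu}.
\end{align}
```

      The amount equation has no dilution term, whereas the concentration equation acquires the term $-\mu c_I$ from the quotient rule. Setting $k = 0$ and $\delta = 0$ gives $dc_I/dt = -\mu c_I$, the case of a stable protein whose synthesis has stopped, and with constant synthesis the concentration settles at $c_I^{*}$.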

      Fig. 3, The rescaling of the variables to tau and Veq was difficult to understand. Fig. 3A: If T_S/G2/M is at ~0.5 of the doubling time tau, how relevant is it to look at the behaviour of T_(Vc) for values of T_(Vc)/tau above 0.5 (and beyond 1)? Fig 3B: for which value of T(Vc) is the prediction made?

      Time is rescaled to the amount of time it takes to double the biomass. Volume was rescaled to the average volume at the G1/S transition for a population of cells at the size distribution's steady state. We realize now that this nomenclature is unclear, and have replaced Veq with <VG1/S>, which we believe is clearer.

      Because of the timer constraint, T_(Vc)/tau has to be at least 0.5, which corresponds to a G1 phase with 0 duration. But, in principle, T_(Vc)/tau could have any value larger than 0.5. The range of T_(Vc)/tau is set by the size control mechanism after we specify the range of Vc that we wish to examine. To clarify this, we now denote what parts of the plot correspond to cells increasing or decreasing in size.

      The prediction is the solid line and is made for a bit more than the range of cell sizes that we see in the steady state simulation. We think there is confusion about our nomenclature for a single point indicated on each line as ‘Added Veq’. This point represents the average amount of volume added at steady state. To clarify this we now label this as <∆V>.

      4) Discussion:

      – Including a discussion of previous theoretical work that explored the consequences of varying the relative duration of the timer and sizer phases would be valuable.’

      As discussed above, we have now cited the previous theoretical work in the introduction, results, and discussion. We thank the reviewer for pointing out this omission.

      – A reason commonly evoked to explain why cells might show sizer vs. adder behaviour is the role of the growth mode: S. pombe is a sizer but is thought to grow linearly, E. coli behaves like a sizer when it grows slower than usual (see Walden et al. 2015). It would be helpful to mention this when discussing S. pombe and remind the reader that the findings of this paper are limited to exponential growth mode.

      As suggested, we clarify that our analysis is restricted to exponential growth rates and that S. pombe growth rates have been reported to deviate from exponential.

      – The paper seems to be focusing on the noise of the size control mechanism (i.e. probability of transitioning through G1/S based on levels of I) but does not address the question of other sources of noise (i.e. asymmetry at division). What do the authors think about the role of such sources of noise as selective pressure on size control mechanisms evolution?

      This point was also raised by referee 2. There is a necessary balance to achieve between biochemical realism and simplifying assumptions to theoretically study such problems. Of course we fully agree with the reviewer that there are multiple sources of noise in the system. In this study, we chose a hierarchical way of introducing noise in the system that starts with the biggest contributing factor and incrementally adds sources of noise if needed.

      In revising our manuscript, we now also consider noise in cell growth rate and noise in partitioning of mass at division as suggested by the reviewer. This results in slightly weaker size control and more noise, in line with our intuition. However, broadly speaking, our results are unchanged (see new supporting Figs. S6-S7). We now describe the logic of our series of simulations of increasing complexity in the methods section, which has a new paragraph that reads as follows: “In this study, we chose a hierarchical way of introducing noise in the system, starting with the biggest contributing factor and incrementally adding additional sources of noise in subsequent analyses. All simulations presented include noise (stochastic control of G1/S transition and timing of S/G2/M, see below) in the cell cycle phases, whose CV has been found to be as high as 50% (Di Talia et al., 2007). Then, we introduced protein production noise via Langevin noise because the CV of regulatory protein concentrations is typically 20-30% (Newman et al., 2006). Importantly, the cell volume also contributes to stochastic effects, which are larger in smaller cells with fewer molecules. Thus, for stochastic simulations, we include a multiplicative 1/√V contribution to the added Gaussian noise term (see more complete description in the Supplement).

      We also checked that our results are largely invariant when adding other sources of noise (see Figs. S5-S7). In these simulations, we also included noise in cell growth rate (CV ~15%; e.g., Di Talia et al., 2007) and in mass partitioning at cytokinesis (CV ~10%; e.g., Zatulovskiy et al., 2020).”

    1. Author Response

      Reviewer #1 (Public Review):

      “The synthesis and metabolism of sphingolipid (SL) are involved in wide range of biological processes. In the present study, the authors investigate the role of SPTLC1, one of the essential subunits of serine palmitoyl transferase complex, in both physiological and pathophysiological angiogenesis, via using inducible endothelial-specific SPTLC1 knockout mice. They found SPTLC1 deficiency in ECs inhibited retinal angiogenesis along with reducing several SL metabolites in plasma, red blood cells, and peripheral organs. In addition, the authors found SPTLC1 EC-KO mice are resistant to APAP-induced liver injury. Overall, the in vivo findings in the present study are of potential interest and the authors have given clear evidence that endothelial SPTLC1 is critical to retinal angiogenesis. However, the underlying mechanisms are completely lacking in the present study. Most of the evidence provided is circumstantial, associative, and indirect.”

      We appreciate the positive comments of the reviewer. We have addressed the reviewer’s concern regarding underlying mechanisms as detailed below.

      “To be specific,

      1. The authors found endothelial SPTLC1 is important to both angiogenesis and the plasma lipid profile. However, the authors did not present the data to demonstrate the relationship between them. The in vivo findings about the phenotype and the plasma lipid profile might be true and unrelated. It would be important to know whether supplementing the reduced lipid induced by SPTLC1 KO could rescue the angiogenesis related phenotype in mice, or, whether the alternative way to inhibit the SL synthesis could mimic the phenotype of KO mice.”

      In the manuscript, we discussed the possibility that S1P is involved, since it is one of the most down-regulated SLs in the plasma and a major regulator of angiogenesis. We think it is unlikely that reduced plasma S1P is responsible for the phenotype. First, the retinal angiogenesis defect in Sptlc1 ECKO mice is the opposite of that in S1pr1 ECKO mice, as we have published previously (PMID: 22975328, PMID: 32059774). Moreover, deletion of sphingosine kinase, the enzyme that produces S1P, in the endothelium does not influence retinal angiogenesis at P6 (Figure 3 Supplement 2 A and B). Loss of the S1P chaperone ApoM (i.e., Apom KO), which causes a 50% reduction of plasma S1P, does not change retinal vascular development (Figure 3 Supplement 2 C and D). Taken together, our results strongly suggest that reduction in plasma S1P is not the cause of the vascular defect in Sptlc1 ECKO retinas.

      Based on our results in the manuscript, loss of SPT enzyme activity in endothelial cells reduced SL species in the endothelial cells and the plasma. Our in vitro and VEGF intraocular injection experiments (new data) suggest that the angiogenic defects seen in Sptlc1 ECKO mice are due to cell-intrinsic defects in VEGF signaling and not due to changes in plasma SL levels. We have edited the discussion section to address this issue.

      “2. A major issue is that the present study did not reveal a real downstream target. It is possible that VEGF signaling might be impaired by SPTLC1 knockout as discussed by the authors. However, the authors did not demonstrate this point with data. Including both in vivo and in vitro data to evaluate the effects of SPTLC1 deficiency on VEGF signaling might further strengthen the hypothesis. Besides, with in vitro experiments, the authors might further find the critical metabolite(s) involved in VEGF signaling and angiogenesis.”

      As discussed above, we agree with the reviewer’s critique and have addressed this essential point with new experiments (both in vitro and in vivo) in Figure 5. Our new data show that the SPT pathway supplies the glycosphingolipid GM1, which is needed for efficient VEGF-induced ERK phosphorylation and tip cell formation.

      Reviewer #2 (Public Review):

      “Andrew Kuo et al. investigated the role of endothelial de novo sphingolipids (SL) synthesis using endothelial cell specific SPTLC1 knockout (ECKO) mice. They showed that these mice exhibited low concentration of various SL species in not only ECs but also RBC, circulation, and other non-EC tissues. They also showed that ECKO mice exhibited impaired angiogenesis in normal and oxygen-induced retinopathy models, consistent with the decrease of endothelial proliferation and tip cell formation. They finally revealed that these mice were resistant to acetaminophen-induced acute liver injury in early phase. The experiments were well-designed, and the results were clear and convincing. The authors concluded that endothelial cells were the major source of SL in circulation and various organs (liver and lung) other than retina (and probably brain). The weakness of the current version of the manuscript is that the authors did not elucidate the mechanisms underlying the observed phenomena.

      1) The authors showed impaired angiogenesis in ECKO mice using neonatal retina model. Based on the fact that this phenotype was similar to that in endothelial VEGFR2 deficient mice, they suggested that VEGF responsiveness is altered in ECKO mice. Although this hypothesis is plausible, the authors would need to prove it by evaluating VEGFR signaling (VEGFR phosphorylation, Akt activation etc.) in ECKO mice.”

      We thank the reviewer for positive comments. As for the weakness identified, we have addressed this point by conducting new in vitro and in vivo experiments (detailed above). The new Figure 5 addresses this issue directly.

      “2) The acetaminophen-induced liver injury was reduced in ECKO mice in early phase. However, it is still unclear whether SL production itself affects liver injury. The authors discussed the possibility that gene deficiency increases unconsumed serine resulting in GSH increase, but it is essentially independent to SL. If possible, it would be good if the authors could investigate the effect of SL administration on the liver injury progression.”

      We appreciate the reviewer’s concern about the liver injury model in the Sptlc1 ECKO mice. Our data suggest that SL species supplied from ECs impact the hepatocyte response to stress. Since acetaminophen-induced liver injury is highly dependent on reactive oxygen species, the increased glutathione levels that we found in the Sptlc1 ECKO mice may be involved in the phenotype. However, we are simply considering them as biochemical markers of liver injury. This has been addressed in the discussion.

      “3) This paper showed the impaired cell proliferation in Sptlc1 KO EC mice, and discussed it. Authors described that this phenotype was similar to that of Nos3 KO mice, but its inconsistency with Sptlc2 ECKO adult mice was only justified by a word "isoform-selective function". Authors could quantify eNOS expressions in Sptlc1 KO mice, compared results and then discuss this matter. “

      In Figure 1C, we used eNOS as an EC marker to show purity during our EC isolation process. In fact, we did not observe a change in eNOS expression in Sptlc1 ECKO. We also did not detect elevated phospho-eNOS in Sptlc1 ECKO, in contrast to Sptlc2 ECKO adult mice (Figure 1 supplement 4). Additionally, our work in the retina was performed in postnatal gene-deletion pups from P6-P17, which is different from the published Sptlc2 ECKO study. The differences in gene deletion strategy (early postnatal vs. adult) could result in differences in eNOS expression. We have added discussion about this issue.

    1. Author Response

      Joint Public Review

      1) The structures of the PDZ domains of PSD95 have been determined and they are well-folded and stable. In addition, the PSG module has been shown to adopt a stable structure after expression and purification. The authors should cite papers, their own and those by Zeng et al. (e.g. J. Mol. Bio, 2018), to reassure readers that the protein is not destabilized by the cysteine mutations. The authors need to state how many purifications of the mutants have been done and how many replicates have been made for the FRET measurements. Did the FRET data change over time?

      We appreciate the importance of selecting labeling sites that do not disrupt protein structure and activity. There are two protein constructs in this work: full-length PSD-95 and the PSG truncation of this same protein, which have been expressed hundreds of times over more than a decade in my lab. The cysteine mutations used in this work have all been validated as non-disruptive to the protein and the dyes in several ways. 1) We selected labeling sites using the available x-ray and NMR structures to ensure surface accessible residues within alpha helices or short loops to minimize tertiary structural disruption; 2) we ensured that the two point mutations don’t affect the expression and purification protocols. Misfolding or changes in conformation would be visible on elution profiles from chromatography as well as proteolytic cleavage patterns, which are sensitive to protein folding; 3) in our previous work, we measured both donor anisotropy and acceptor quantum yield for all of the variants in use here but one, which relied on existing sites in a new combination. Dyes involved in interactions with proteins or changes in dye environment would become apparent through changes in quantum yield and anisotropy. Any problematic labeling sites have been purged from the current work, which uses a small subset of the mutants from our earlier work. The repeatability of the expression and purification of all these constructs has been demonstrated in our published work and is not affected by the specific labeling mutants in use. The stability of these constructs is supported by the numerous other NMR and x-ray crystallography studies published on these robustly expressing proteins. To highlight this important issue, we have added additional discussion of the origin and validation of these mutants in the text on page 4 and in the methods section. We also included references to the tables of photophysical measurements for the library of PSD-95 cysteine mutants adapted for this study.

      We did not explicitly track the number of purifications used in this work, which spanned more than five years. We were not aware of any expectation to provide such records but will be more aware going forward. The measurements for this paper come from one or in some cases two protein expression runs, each of which generates 2 or more cell pellets. Each of these pellets generates a single affinity and ion exchange purified sample. This is then aliquoted and frozen, which may produce more than a dozen samples for fluorescent labeling. Individual labeled samples are given additional rounds of desalting and size exclusion chromatography immediately before measurements to ensure that the full-length proteins are used and that there has been no aggregation or degradation. In terms of repeatability, the data shown in this manuscript involve repeat measurements of the same constructs using different FRET dye pairs, collected on different instruments at different times, and still show excellent agreement. All of the measurements involve as few as one protein expression run and a minimum of two separate labeling and purifications for two independent sets of measurements. Some variants exceeded this standard but this was not tracked during this long study.

      Regarding the agreement of experimental observables across different protein preparations, one of the variants within the existing dataset (P2-S3) was measured on two experimental setups, two years apart, using two different expression runs, each with separate protein purifications and labeling reactions. Comparison of these measurements revealed that the mean FRET efficiency measured at Clemson was 0.70 while that measured at HHU was 0.71, and the mean DDA lifetimes were 2.29 and 2.4, respectively.
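
      To give a rough sense of how close these two efficiencies are in distance terms, the following is a minimal sketch that applies the standard Förster relation, E = 1 / (1 + (R/R0)^6), as if each mean efficiency came from a single fixed donor-acceptor distance. That single-distance assumption is ours, made only for illustration; it ignores the conformational dynamics discussed elsewhere in this response, and the function name `distance_ratio` is hypothetical.

```python
def distance_ratio(efficiency: float) -> float:
    """Return R/R0 implied by a FRET efficiency for one fixed distance.

    Inverts the Foerster relation E = 1 / (1 + (R/R0)**6).
    Illustrative assumption: a single static donor-acceptor distance.
    """
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    return (1.0 / efficiency - 1.0) ** (1.0 / 6.0)


# Mean efficiencies quoted above for the P2-S3 variant on the two setups.
clemson = distance_ratio(0.70)
hhu = distance_ratio(0.71)
relative_difference = abs(clemson - hhu) / clemson

print(f"R/R0 (Clemson) = {clemson:.3f}")
print(f"R/R0 (HHU)     = {hhu:.3f}")
print(f"relative difference = {relative_difference:.2%}")
```

      Under this simplification, a difference of 0.01 in efficiency near E = 0.7 corresponds to a change of less than one percent in the apparent distance, because the sixth-root dependence compresses efficiency differences in this range.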

      2) The authors have not explained how the approach taken in this paper compares to their previous simulated annealing approach of mapping PDZ3 using FRET data in McCann et al., 2012. That study resulted in a model in which PDZ3 binds to a completely different interface, which is not mentioned in this manuscript.

      We apologize for this oversight and thank the reviewers for this reminder. The omission was an error of trimming the manuscript for brevity and we appreciate the opportunity to highlight how much our approach has improved over the intervening time. We have included commentary on our previous modeling in the revised discussion.

      3) The biochemical disulfide (DS) mapping experiments provide a useful check of predictions of the FRET and DMD conclusions. However, in order to interpret these correctly, the authors need to show data from negative controls testing cysteine pairs that are predicted NOT to interact.

      We agree that negative controls are a critical part of the disulfide mapping experiments and thank the reviewers for this suggestion. As a negative control, we selected a cysteine pair that showed low FRET in our 2012 PNAS paper (Q374C-K591C), which was not included in this work nor was the cysteine pair involved in contact interfaces identified from simulations or modeling. This cysteine pair showed no evidence of intramolecular disulfide formation. In the manuscript, we have provided an additional supplemental figure panel to document that this negative control sample does not form disulfides.

      4) The SH3-GUK domain of PSD95 can undergo domain swap dimerization and the dimerization is promoted by binding of the synGAP PDZ-ligand to PDZ3. The authors should mention the existence of domain-swap dimerization (citing McGee [2001] and Zeng et al. [2018]) and indicate whether they tested that the FRET-labeled proteins are monodisperse. This is particularly important in light of the high variation in diffusion time for individual variants - 0.91-10.19 ms (see also #10 below). In particular, the P3-G4 FRET variant has a long diffusion time of 10.19; could it be undergoing domain swap dimerization?

      We are very interested in the prospect of domain swapping as has been suggested previously. However, we have not seen evidence for this at the concentrations used here. As reported in our 2012 PNAS paper, both full-length PSD-95 and the PSG fragment are monodisperse as judged by size exclusion chromatography, which suggests a lack of stably populated oligomeric states under these conditions at 10⁻⁵ molar concentrations. The PSG fragment runs very true to its calculated formula weight, while the full-length protein does migrate faster than expected based on formula weight but not fast enough to be a dimer.

      The DS mapping experiments did reveal some higher molecular weight species. However, these higher order species never accounted for more than 5% of the total input. Thus, any intermolecular interaction is transient and not well occupied under the buffer conditions and concentrations used in these studies. Our size exclusion and disulfide mapping experiments are carried out at protein concentrations that are orders of magnitude higher than used for single molecule imaging. Thus, dimerization is unlikely at the single-molecule concentrations used for the present FRET experiments. If dimerization were to occur, we would expect the appearance of additional static subpopulations in the MFD histograms. If dimerization were significant, we would also expect the appearance of an additional diffusion term in fluorescence correlation curves, which was not the case in these experiments.

      5) On page 4, line 5 the authors state: "the number and occupancy of conformational states were set as global fitting parameters". This assumes that the protein is unbiased by the labeling and that the protein behaviour is independent of the purification batch. Have the authors verified this?

      The reviewers are correct in stressing the importance of quality control in the selection of labeling sites and reproducibility in sample preparation. The PSD-95 purification has been carried out hundreds of times in the Bowen lab using different variants. The cysteine mutations used in this work have all been validated as non-disruptive to the protein and the dyes in several ways. 1) We selected labeling sites using the available x-ray and NMR structures to ensure surface accessible residues within alpha helices or short loops to minimize tertiary structural disruption; 2) we ensured that the two point mutations don’t affect the expression and purification protocols. Misfolding or changes in conformation would be visible on elution profiles from chromatography as well as proteolytic cleavage patterns, which are sensitive to protein folding; 3) in our previous work, we measured both donor anisotropy and acceptor quantum yield for all of the variants in use here but one, which relied on existing sites in a new combination. We have ensured that sites with poor properties are never included in our published work. Indeed, the reproducibility of sample preparation, using chromatography before and after labeling, gives confidence that the attachment of fluorescent dyes is not altering macromolecular properties. For the dyes to change the protein structure, they would have to interact competitively with the protein interfaces or disrupt local structure. These would be expected to change the dye quantum yield or the anisotropy, which were each measured in our previous work. In addition, the multiparameter fluorescence detection includes anisotropy measurements of the current samples. None of these measurements reveal aberrant fluorophore behavior (Supplemental File 3C).

      This alone does not rule out that the dyes affect the conformational ensemble. One can take additional confidence that our protein handling workflow does not affect the results from the cross-methods agreement that we demonstrate in the current work. First, the agreement between confocal and TIRF measurements of both full-length PSD-95 and its PSG truncation boosts confidence. The labeled samples for each experiment were prepared from the same purified proteins but labeled independently with different dye pairs. The different dyes attached to the samples used for confocal and TIRF did not impact the time-averaged distances between these residue pairs save for one slight outlier. Additionally, our cross-validation using disulfide mapping, which is entirely label free, provides additional confidence that the interdomain contact interfaces, observed in the data collected using the labeled proteins, are preserved when the labels are not present. Finally, independent DMD simulations of label-free PSG were in excellent agreement with regard to the predominant states identified from rigid body docking based on experimental FRET distances and the disulfide mapping.

      6) On line 6 the authors state: "Based on fitting statistics, we demonstrate that a two-state model with a small donor-only (or no FRET) population (Supplementary file 1C &D) is sufficient to fit all data.” From the average Χ2 this can be concluded, but for individual datasets sometimes a 1 state model or 3 state model seems more appropriate. The authors should explain why measuring more cys mutants justified using 'one unifying model'? How can the data contain donor-only contributions if pulsed-interleaved excitation (PIE) is used to select only molecules with active donor and acceptor fluorophores?

      We apologize for the lack of clarity as to how we arrived at the determination that two states were present in the conformational ensemble. The fitting statistics show that there is an improvement in global fitting upon increasing the number of states in the model from one state to multiple states. The statistics in the former Supplementary File 1C show significant improvement upon fitting with two states relative to one, while adding a third state marginally improves 3 variants and the remaining 9 remain unchanged or show a slightly worse fit. The former Supplementary File 1D (now 3C) provides a list of the values for each of the constants that arise from fitting the 2-state model to all datasets simultaneously and the individual fit statistic for fitting this model to the specific variant dataset. This table assigns the global population fractions and their associated donor lifetimes but was not used to assign the number of states. That there are two states is based solely on the improvement in fitting statistics with two states shown in the former Supplementary File 1C. Thus, the statistics do not justify us including an additional state. Because this is such a critical point, we have moved the former Supplementary File 1C to the main text as Table 2 and added additional discussion to the manuscript to highlight how we arrived at a 2-state model.

      The reviewer is correct that a global fit of the dataset could result in suboptimal fits for individual FRET pairs to satisfy the global minimum. In this case, most variants were best fit by a two-state model. The reason for using one unifying model is our underlying assumption that the same conformational distribution for PSD-95 is sensed differently by each labeling combination. A primary conclusion from this assumption is that all variants share a population distribution. A secondary assumption is that protein handling is not biasing this conformational ensemble, which we verify as described above. Each measurement provides part of the same story, so we were only interested in models which simultaneously explained all observed FRET data, and as such enforced the single global model. A global fit also proved the best way to uniquely assign each distance to its corresponding state. Furthermore, the FRET Network Robustness analysis explicitly examined how much our model depends on any one labeling variant and found no systematic deviations. This revealed an ensemble of structures that satisfy the data without enforcing a global model for all samples simultaneously.

      We also thank the reviewer for correctly observing that we misapplied the term donor-only (DO) in the manuscript. The population we referred to is more appropriately termed a “No-FRET” or “low-FRET” population. The reviewer is correct that active, FRET-labeled molecules were selected using PIE parameters. We have corrected this in the manuscript.

      7) All variants are shown to be dynamic, but they are positioned differently on the dynamic FRET line (Fig. 1D and S3). Does the same kinetic model underly each variant? If the same state occupancies are implied, then why not the same kinetic constants, especially for distances probing the same two domains?

      While the global population fraction is shared between variants, the transition rates for individual variants are not constrained. As such, the variants do not share a single equilibrium rate constant. While the FRET data is fit to two global states, our DMD simulations showed that there is substantial fuzziness within these global states. Thus, the full kinetic network is more complex than the global 2-state transition. As our screening of DMD snapshots showed, each FRET variant is uniquely sensitive to the underlying conformational transitions. Hence, the system is underdetermined and we are not able to adequately determine forward and backward kinetic rates for each variant individually.

      It is important to recall that the data shown in multiparameter FRET histograms has been binned with millisecond time resolution, which is slower than the local conformational dynamics arising from fuzzy domain rearrangements. The position of the peak will depend on the underlying rate constants. Our Photon Distribution Analysis reveals the kinetic processes that dominate the broadening of the FRET efficiency distributions. This analysis also measures the fractions of the effectively “static” population. Fast transitions, which do not significantly contribute to changes in FRET efficiency (or broadening) on the binning timescale, will appear as static populations. Thus, the simple PDA model captures the broadening that is also present in MFD histograms, but does not adequately describe dynamics at the fastest timescales.

      8) Could the data also be explained by "fuzziness" within domains, without interdomain dynamics? The authors should discuss this given the possibility of domain swap dimerization of the SH3-GuK domain.

      In this work, we use the term fuzziness to refer to alternate residue interactions and domain orientations within a global contact basin. Using this definition, we do not expect significant structural rearrangements within the PDZ, SH3 or GuK domains. These domains are well folded and have been studied individually and in combination using x-ray crystallography and NMR, which did not reveal local distortions of the domain fold (e.g. SH3-GuK interactions). This is not to say that there are not conformational dynamics within loop regions or other small-scale subdomain motions. Our rubric for selection of labeling sites is to avoid large loops to minimize the local dynamics, as this conformational variability compromises the resolving power of the FRET restraint. Our DMD screening provides details as to how each FRET pair senses changes in local and global conformation. In comparison to the global changes extracted from the fluorescence lifetime decays, the intradomain dynamics occur rapidly on a small length scale and are not expected to affect our global positioning of PDZ3. We do not observe a significant population of dimers or other multimers under the concentrations used for these experiments, as discussed above.

      9) Regarding supplemental File 2: The authors should justify that PDA is an appropriate method to quantify relaxation time of Fluorophores. Dynamics being so fast, how do the authors explain that when binned in 2 ms time bins, discrete subpopulations in the PDA histograms are still clearly observed (e.g. Figure 2B, Fig. 2 supp 3)? Why would the protein move through certain very discrete states and not others? Doesn't this imply that the model is oversimplifying the actual mechanism (even though the Chi^2 is alright)? It is strange that for some mutants (fig 2 supp 3B P1G3) PDA displayed discrete states, while for others (e.g. fig 2 supp 3A P2G6) PDA histograms were smooth, implying it cannot be a low-histogram-count artifact. Or can it?

      We apologize for this confusion but the photon distribution analysis was not used to “quantify relaxation times” of the fluorophores, which comes from fitting of the lifetime decays. Rather, PDA was used to estimate the rates of exchange between limiting states (i.e. the inter-fluorophore distances derived from fitting the fluorescence decays). Obtaining the rates is accomplished by fitting time-binned FRET efficiency histograms with a model that accounts for broadening due to exchange between limiting states.

      We agree with the reviewers that the two-state model, which is sufficient to fit the lifetime decays, is too simplified to fully describe the dynamic exchange between limiting states. To address this, we performed the FNR analysis to describe the limiting state basins within which fast dynamics occur. This extends the model beyond two discrete limiting states. Further, DMD screening shows that different FRET variants do report differently on the underlying conformational landscape. Some exhibit a degree of degeneracy showing similar FRET efficiency for different conformations making each variant insensitive to specific subsets of possible transitions.

      Using fluctuation correlation analysis to probe FRET-induced changes in intensities, we observed dynamics on the 10⁻⁵ second timescale, which is much too fast to give rise to broadening in the fluorescence observable histograms. However, these dynamic transitions did not correspond to exchange between states with large differences in FRET efficiency because, if such fast dynamics involved a large change in FRET, this would be associated with a narrow distribution about the mean in MFD histograms. We explain the appearance of distinct peaks for some variants as an increase in the relative contribution of fast dynamics within limiting ensembles compared to the slower processes of exchange between limiting ensembles. This can occur without a relative shift in forward/backward exchange rates and with only a slight shift in the overall relaxation rates on the timescales to which PDA is sensitive (~0.01-1 ms).

      10) Regarding supp file 3A and Table S9: The spread on tdiff, (the average diffusion time through the confocal volume) for individual variants is very broad - 0.91-10.19 ms. Considering that the authors use global fits for many different parameters, it's surprising that they didn't use it for this parameter which should unbiasedly be the same for all the protein mutants, at least if all are well-behaved (i.e. non-aggregating). The high variation in tdiff may be a warning that the model is not accurately accounting for all dynamics. For example, might the P3-G4 variant be undergoing domain swap dimerization?

      We thank the reviewers for their observation and apologize for the confusion as to why there are differences in the diffusion time through the confocal volume for the different variants. We expect that there would be three distinct diffusion times because the samples were measured on two experimental setups using different confocal volumes and pinhole sizes. There are also two distinct protein constructs (full-length and PSG), which differ in molecular weight. The longest timescale processes included in the fFCS fits are attributed to long-timescale photophysical effects, such as blinking. As discussed above, we do not expect a significant population of dimers or other multimers at the pM concentrations used for these single molecule experiments.

      We agree with reviewers that the diffusion time for a given construct on a given instrumental setup should be a constant. In this light, we reanalyzed the filtered fFCS curves with enforced consistency for the diffusion times in measurements involving the same construct measured on the same setup. While this refitting slightly changed the values of fit parameters, none of these differences significantly affected the parameters used for modeling and therefore the conclusions of the paper have not been impacted. We have updated the manuscript to indicate the change in the fit models.

      11) In the results section, the authors state: "Summarizing the dynamics observed for the PDZ3-GuK variants, fFCS depicts three relaxation times." This is an overstatement because the authors imposed these three broad relaxation times. The authors should describe how they made these assignments. Is this common practice? Regarding Supplemental File 2 versus Supplemental File 3A: In principle, the relaxation time implied from fFCS and that from PDA should align. However, the 'Average' of fFCS and the T_R of PDA do not align. Is it possible that the dynamics analysis from PDA should have been constrained in some way by the results from fFCS? It would be useful to add error estimations for PDA here.

      We agree with the reviewer that it is an overstatement to say that the number of relaxation terms arises from the correlation analysis. We have removed this statement and instead focus on the differences in dynamics. The assignment of three relaxation terms was made to probe the extent of dynamics across decades in time as each time regime is typically associated with distinct forms of protein dynamics. We enforced these consistent timescales in order to directly compare amplitudes across different FRET variants. However, we do not enforce any assignment that dynamics arising from a particular type of exchange process occur at the same timescale.

      We also agree that obtaining agreement between PDA and fFCS is desirable. In our experience, such agreement is only obtainable for simple kinetic schemes when dynamics probed by fFCS and PDA all occur within the same relative timescales. Here, the contributions to dynamics occur across several decades in time including those obtainable only through fFCS analysis but too fast to be quantified by PDA. Using the methods we employed, we recover only the effective relaxation times rather than the absolute kinetic rate constants because the system is underdetermined. Differences for individual variants arise because the variants differ in sensitivity to specific transitions (Figure 8-Figure Supplement 1) while fFCS and PDA differentially report on the underlying kinetic scheme.

      12) Regarding the DS bond formation data, the authors state, "The α-basin variant showed slightly more DS formation than the beta-basin variant in full-length PSD-95 but the rates of DS formation were similar". It isn't clear what this means physically. It seems to suggest that there is static heterogeneity in the population, i.e. some proteins can and some proteins cannot form DS bonds. The presence of this effect may contradict the assumption that every state at some point interconverts to any other state, which underlies the FRET PDA analysis. The authors should discuss this possible inconsistency.

      We agree with the reviewer that this statement was not clear. It was never our intention that the DS formation kinetics be directly related to FRET data in this way. The goal of DS mapping experiments was to provide qualitative confirmation that supertertiary structures suggested by DMD and FRET experiments occur in solution. We meant to focus on the DS formation kinetics, which are an indication of structural proximity. The extent of DS formation comes from the fitting as a matter of course. The reactions progress to near completion (Figure 7-Figure Supplement 1). The differences in extent of disulfide formation, while real, are very small and we did not intend to highlight them. We have removed any discussion of the extent of DS formation in the manuscript.

      13) In the discussion of the DS experiments, the authors state, "We also observe significant kinetic differences when PSD-95 is truncated in agreement with FRET studies." This sentence is vague. The authors need to state more completely what they mean here. Exactly what is in agreement with the FRET studies?

      We agree with the reviewers that the claim was vague. We intended to communicate that the DS mapping is generally consistent with FRET experiments in that they confirm the proposed limiting conformational states. The formation of disulfides at these points confirms the accessibility and proximity of these sites with respect to one another within the supertertiary structure. Also, both DS mapping and fFCS observed changes when PSD-95 was truncated to the PSG fragment. However, the rates of DS formation are not directly comparable to the rates of conformational dynamics. We have removed this statement from the paper to avoid directly linking these two unrelated kinetic measurements.

      14) The text in the section on "Structural Modeling with Experimental FRET Restraints" is often unclear. The authors appear to have equated States A and B, formerly used only in the seTCSPC analysis to the alpha and beta basins extracted from the DMD snapshots. The authors should discuss whether there might be other conformations in the DMD results that would be consistent with the FRET-derived distances from seTCSPC? It seems possible that there could be, given that in Fig 6 sup 1, large discrepancies exist between simulated distances and FRET-measured distances for some of the FRET pairs. The authors should discuss explanations for the discrepancies that do not compromise the actual model.

      We apologize for the lack of clarity in our description of structural modeling with FRET restraints. We thank the reviewer for the suggestions as to how we can improve this discussion. In the course of this study, we do reach the conclusion that states A and B, obtained from modeling solely based on FRET data, are equivalent to conformations within the alpha and beta basins from DMD, respectively. Because the representative structures were obtained independently via distinct techniques, we felt that it would be premature to use the same terminology when we are introducing the FRET results.

      We agree that more than a single snapshot from DMD per basin appropriately satisfies the FRET restraints and that no one model satisfies all restraints equally. Our goal with the later FNR analysis, which explicitly incorporates FRET-derived restraints, was to identify ensembles of structural snapshots from DMD that are compatible with experimental data. Instead of finding the single best model for the full set of FRET-derived distances, each snapshot in the ensembles from FNR satisfies all distance thresholds independently. Thus, the ensembles from FNR do refer to both experiment and DMD.

      Further, the vertical lines shown in Figure 8 Figure Supplement 1 represent the distances from the initial global fit of all samples simultaneously. For some variants, this likely includes biases in certain distances due to the enforcement of this global model, which FNR seeks to alleviate. For SH3-GK FRET pairs, these deviations are most likely the result of the restraints placed on the motions of the GK domain in the DMD simulations.

      15) A weakness of the modeling approaches in this manuscript is that they are difficult to validate. Could the authors include a test of the modeling in which they show how small changes of the input FRET data would influence the final FRET-restrained model? Could they quantify their confidence in the final model, given all the limitations of the FRET data?

      We agree with the reviewers as to the importance of validating structural models regardless of the data modality used in their determination. We respectfully disagree that this study is lacking in model validation. In this work, we generated models based on confocal FRET data and validated the FRET models using independent DMD simulations and disulfide mapping. We also employed smTIRF measurements using a different dye pair to independently validate the time-averaged FRET from confocal measurements. While this may fall short of complete validation of the associated dynamic information, we feel that this represents the state of the art in model validation regardless of the experimental approach. While it is difficult to validate novel methods for deriving structural models, we feel that we have done so through cross-validation against other established techniques.

      As suggested, we tested the dependency of the experimental models on small changes in the input FRET data. To accomplish this, we used the same analysis framework described for FRET Network Robustness Analysis. Instead of removing datasets as in FNR, we introduced a random, artificial error into each of the FRET distances for each variant and repeated the classification of structures from DMD into the two basin ensembles using the altered distances. To check the dependence on the magnitude of the error, we drew the random error for each variant from either -5% to +5% or -15% to +15% of the original distance. Each condition was repeated 3 times with different random errors. To compare conditions, we measured the change in the center of mass of the surface distribution composed from the individual PDZ3 centers of mass identified by that screen (Figure 8-Figure supplement 2). We found that increasing the distance error did not significantly impact the classification of structures into the two ensembles. The variance in the mean ensemble positions over three repeats increased with increasing error, along with small shifts in the mean positions. Notably, +/-15% is greater than the uncertainties in distances obtained via global fitting of fluorescence decays, suggesting that the intrinsic uncertainty in the FRET-derived distances from a single fit (Supplemental file 3D) does not significantly impact the ensemble assignment or their fuzziness.
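
      The perturbation test described above can be illustrated with a minimal sketch. It is not the analysis code used in the study: the snapshot layout, the `tolerance` criterion for assigning a structure to an ensemble, and all function names are assumptions made for illustration. Only the overall procedure (random error of up to 5% or 15% on each distance, three repeats, comparison of the ensemble center of mass) follows the description.

```python
import math
import random
from statistics import mean


def perturb_distances(fret_distances, max_frac, rng):
    """Scale each FRET distance by a random factor in [1 - max_frac, 1 + max_frac]."""
    return {pair: d * (1.0 + rng.uniform(-max_frac, max_frac))
            for pair, d in fret_distances.items()}


def classify(snapshots, fret_distances, tolerance):
    """Keep snapshots whose model distances all lie within `tolerance` of the
    corresponding FRET distance (hypothetical acceptance criterion)."""
    return [s for s in snapshots
            if all(abs(s["distances"][pair] - d) <= tolerance
                   for pair, d in fret_distances.items())]


def center_of_mass(snapshots):
    """Mean PDZ3 position over an ensemble, or None if the ensemble is empty."""
    if not snapshots:
        return None
    return tuple(mean(s["pdz3_position"][axis] for s in snapshots)
                 for axis in range(3))


def centroid_shifts(snapshots, fret_distances, max_frac, tolerance,
                    repeats=3, seed=0):
    """Distance between the unperturbed ensemble centroid and the centroid
    obtained after each random perturbation of the FRET distances."""
    rng = random.Random(seed)
    reference = center_of_mass(classify(snapshots, fret_distances, tolerance))
    shifts = []
    for _ in range(repeats):
        noisy = perturb_distances(fret_distances, max_frac, rng)
        trial = center_of_mass(classify(snapshots, noisy, tolerance))
        if reference is None or trial is None:
            shifts.append(None)  # no structures satisfied the restraints
        else:
            shifts.append(math.dist(reference, trial))
    return shifts
```

      Calling `centroid_shifts(snapshots, distances, 0.05, tolerance)` and `centroid_shifts(snapshots, distances, 0.15, tolerance)` would give the two error conditions; small shifts under both would correspond to the robustness reported above.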

    1. Author Response

      We thank the reviewers for their thoughtful and constructive comments which have helped us improve our manuscript. In our revised manuscript, we will respond to three main weaknesses:

      1. We will address the inconsistency in the experimental design across the behavior and the transcription experiments by repeating the behavior with an experimental timeline that more exactly matches that of the animals used in transcriptional studies;

      2. We will further validate and justify our use of TRAP and our focus on the NAc as the sole brain region of investigation;

      3. We will revise the language throughout the manuscript, especially in the discussion, to reduce anthropomorphizing of our results and interpretations.

      Below we have provided responses to specific concerns articulated by each reviewer.

      Reviewer #1 (Public Review):

      The monogamous vole provides unique opportunities to study the neural basis of pair bonding and this study exploits that opportunity in a novel way. Focusing on the nucleus accumbens, the authors conduct RNA-Seq to characterize the transcriptome in same-sex and opposite-sex pairs when bonded, when separated for a short time and when separated for a long time at which point the literature has in the past demonstrated the willingness to form a new bond. They determine that the transcriptome of pair bonding includes a preponderance of glial-associated gene changes and that it degrades with long-term separation. To the latter point, they then conduct a neuron enriching trap schema to find those genes subject to change specifically in neurons.

      The strength of the report is the clever experimental design, the unusual animal model, and the comparisons of same-sex and opposite-sex pairs and long-term and short-term separations.

      The weakness is that the behavioral changes observed are not what was expected based on prior work and are relatively modest, providing a disconnect between the outcome and the more dramatic transcriptional changes. A second weakness is the focus on the nucleus accumbens which is a brain region most closely associated with reward. While pair bonding may be rewarding, that component may be independent of the memory of a partner or the willingness to partner anew. Lastly, there is no clear connection between the identified transcriptome and either the formation or degradation of the pair bond.

      We thank the reviewer for noting the unique strengths of using prairie voles to investigate this specific question and for praising our experimental design, which compares opposite-sex and same-sex paired males at each time point to disentangle the effects of pair bonding from general social affiliation and isolation.

      Reviewers #1 and #3 noted the mismatch between the behavioral and transcriptional responses. Specifically, we found little evidence of bond dissolution following long-term separation despite substantial erosion of the pair bond transcriptional signature. They further note that the experimental designs employed to assess behavior and transcription differed, which may have contributed to the apparent mismatch. Importantly, our initial behavioral assessment as presented in Figure 1 of the manuscript had two strengths. It measured intra-animal changes in behavior over time and minimized the number of animals required. However, we agree with the reviewers, and we are currently repeating the behavior experiments to match the transcription experiments. Specifically, separated partners will be kept in separate colony rooms to ensure no possible access to partner-associated sensory cues (visual, auditory, olfactory), and we will use separate cohorts of animals for short- and long-term separation. This design avoids partner re-introduction during the short-term partner preference test. The results of this work will be informative regardless of outcome. If we observe a dissolution of pair bond behaviors, it would indicate that re-exposure to a partner after a short, 48-hour separation has a powerful effect on bond duration following separation. If we do not observe any change in pair bond behaviors following separation, it would confirm that pair bond behaviors are more resistant to erosion than are transcriptional signatures of pair bonding.

      We have focused on the NAc because it is a critical hub that is engaged upon attachment formation and is implicated in loss processing. Specifically, studies have shown that blockade of neuromodulatory signaling (i.e. oxytocin and dopamine) in this region impairs bond formation and can lead to bond dissolution. Our group and others have demonstrated that plasticity within this region, in patterns of neuronal activity and in synaptic response to oxytocin, is associated with bond formation and maturation (1, 2). Literature on drugs of abuse has also demonstrated an important role for the NAc in encoding reward associations (3), which ultimately underlies partner preference. Additionally, in human neuroimaging studies, Prolonged Grief Disorder is associated with an enhanced signal in the NAc when viewing images of the lost loved one, suggesting that normal resolution of grief corresponds with a decrease in NAc activity elicited by reminders of the lost loved one (4). Thus, our focus on this region is well supported. Nonetheless, we recognize that the NAc does not act in a vacuum, and the efferent and afferent connectivity of different NAc cell types is well delineated, paving the way for future work (5, 6).

      Additionally, we agree with the reviewer that pair bonding behavior is multifaceted and comprised of several discrete behaviors that are not dissociable in the partner preference test. Partner-associated reward and partner memory may be independently encoded, and disruption of either process would manifest as a decrease or lack of partner preference. In our complete response to reviewers and revision of the manuscript, we will address this point more thoroughly. Finally, we interpret the reviewer’s last comment to be a request for functional manipulations to validate that the predicted transcriptional changes have a behavioral effect. This is beyond the scope of this manuscript but an active area of future research.

      Reviewer #2 (Public Review):

      The goal of this study is to understand the molecular mechanisms by which pair bonded animals recover following the loss of a partner.

      Strengths of this work include: (1) The organism - a novel model for studying pair bonding and loss; (2) The integrative nature of the study; it integrates behavior and brain gene expression RNASeq data and vTRAP; (3) The important and understudied question about how pair bonded animals recover from loss; (4) The thorough and careful analysis of highly multidimensional and complex datasets

      Weaknesses include: (1) the major comparison is between same vs opposite sex housed pairs. This design controls for social effects somewhat, but the two treatment groups differ not just with respect to whether or not they are pair bonded, but also in whether or not they had associated with a male or female. Differences between the treatments could reflect pair bonding, or perhaps something about the sex of the partner. It would be useful to have an additional control group, or data on the behavior of individuals within both types of pairs while they are co-housed. Were transcriptomic effects more detectable in pairs that were more bonded together behaviorally? That would suggest that the gene expression signatures really reflect something about the bond rather than other confounds, for example; (2) The vTRAP method is fancy but what is it really adding? (3) The authors interpret the transcriptomic differences as promoting the ability to form a new bond but there are probably other processes that are contributing to the differences in gene expression. Some of the differentially expressed genes could be involved in promoting a new pair bond, but there could also be a signature of the memory of the identity of the partner, the signature of the bond itself, etc. (4) Some of the interpretations go a little too far, especially in terms of anthropomorphism. The impact of the work includes further development of voles as an important model for studying social behavior and insights into the molecular processes important for recovering from the loss of a partner.

      We thank the reviewer for recognizing the strength of our study organism and experimental techniques as well as rigorous analyses to answer an important question about adapting to partner loss.

      Regarding the noted weaknesses:

      (1) We chose to compare opposite-sex pair bonds to same-sex affiliative relationships as this is the standard within the field, and we note that reviewers 1 and 3 found this to be a strength of our study design (7–11). Peer relationships in prairie voles are difficult to distinguish behaviorally from those of opposite-sex pairs (Fig 1) because both same- and opposite-sex paired voles show selective preference for their pairmate and selective aggression towards other voles (7). As such, the critical feature that makes pair bonding different is mating, which requires an opposite-sex partner in voles, and our experiments are optimally designed to identify the longitudinal transcriptional changes that result from mating and cohabitating with an opposite-sex partner. In order to best match our two groups, only animals with a preference score >50% were included in the transcriptional experiment, ensuring that we were comparing animals with an affiliative preference for their partner, whether that individual was the same or opposite sex.

      We interpret the reviewer’s comment to be that they want us to compare opposite-sex-paired animals with and without bonds. This can be achieved in two ways. First, we can compare to a promiscuous species, such as meadow voles, which will mate and cohabitate without forming bonds, but this is confounded by species differences in transcription that may exist independent of bonding. Second, we can compare bonded voles to the small subset that do not form bonds. While intriguing, this is experimentally challenging as only ~10-20% of males fail to form a bond when paired with a sexually receptive female (in the current study, 16% had a preference < 50% after two weeks of pairing, which is consistent with prior reports (9–11)). Put simply, we would need to pair hundreds of voles to opportunistically collect a sufficient number of non-bonders for transcriptional assessment across our experimental conditions. While we hope to eventually be able to do such an experiment, litter sizes, consideration of animal welfare, and other constraints make this largely untenable at present.
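
      The scale of this constraint can be made concrete with a short calculation. The non-bonder rates (10-20%, and 16% in the current study) come from the text above; the target of 32 non-bonders is a purely hypothetical number chosen for illustration, since the required sample size is not specified here, and the function name is likewise an assumption.

```python
from fractions import Fraction
from math import ceil


def expected_pairs_needed(n_nonbonders, nonbonder_rate):
    """Number of pairings at which the expected count of non-bonders
    reaches `n_nonbonders`, given the fraction of males that fail to bond.
    This is an expectation, not a guarantee for any single cohort."""
    return ceil(Fraction(n_nonbonders) / Fraction(nonbonder_rate))


# Hypothetical target: 32 non-bonders in total across all conditions.
target = 32
for rate in (Fraction(10, 100), Fraction(16, 100), Fraction(20, 100)):
    print(f"{float(rate):.0%} non-bonders: "
          f"{expected_pairs_needed(target, rate)} pairings expected")
```

      Under these assumed numbers, the expected number of pairings falls between 160 and 320, which is consistent with the statement that hundreds of voles would need to be paired.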

      Data on the behavior of individuals within both types of pairs while they are co-housed are already provided via results of a partner preference test performed after 2 weeks of co-housing and prior to re-housing or separation (Fig 2B and 3B). We find the reviewer’s suggestion of looking for a relationship between the transcriptional signature and pair bond strength interesting, and we undertook a preliminary analysis examining whether animals with different pair bond strengths aggregate in a PCA of gene expression. There was no apparent relationship, although we are performing additional analyses such as exploratory factor analysis. The fact that we have not found a relationship between the baseline partner preference and transcription in these initial analyses is perhaps unsurprising. First, bonding may require some threshold change in gene expression, with bond strength reflected in non-genomic information, such as synapse formation or strengthening, or axonal ensheathment. Second, we only performed transcriptional analyses on animals with a baseline partner preference >50%; we would not necessarily expect a dissociation given the uniformly strong bonds across these animals.

      (2) We feel that inclusion of TRAP adds substantially to this manuscript and to our understanding of the neuromolecular underpinnings of bonding and loss in the NAc. The value of this experiment is twofold. First, as noted by Reviewer 3, “the TRAP approach in prairie voles is novel and will provide a great resource to the research community.” The prairie vole community has just developed its first transgenic Cre lines, which could be paired with vTRAP to query bond-associated gene expression changes exclusively in Cre-expressing neurons (15). Second, we noticed a puzzle in our tissue-level data. The majority of cells in the NAc are neurons (16, 17), and the vast majority of pair bonding studies of this region have focused on neuronal phenotypes, but our transcriptional signatures were linked to changes in glial populations. Ultimately, changes in glia are likely to act via their interactions with neurons, and vTRAP enables us to query the neuronal transcriptional changes within our data. In support of the idea that this provides novel insights into our datasets, when we cluster transcripts based on their expression profiles following short- and long-term separation, we predict different GO terms from the tissue-level and neuronally enriched gene sets. For instance, the GO terms resulting from cluster 2 for neuronal genes (Fig 4) include “response to amphetamine” within the top 10 results, but the same cluster of genes from tissue-level sequencing predicts this GO term as the 34th result.

      (3) We agree with the reviewer that adapting to partner loss is a multifaceted process that likely engages numerous biological and emotional systems in voles. The explanation we offer for the transcriptional changes during loss is based on previous work in the field and is one possible interpretation. We will expand on this point during revision of the manuscript.

      (4) We thank the reviewer for encouraging us to be objective with our interpretations. We will address this comment during revision of the manuscript.

      Finally, we thank the reviewer for recognizing the value of our study for not only the field of voles but the bereavement field more broadly.

      Reviewer #3 (Public Review):

      In this manuscript, the authors investigate the behavioral and brain transcriptional alterations associated with short- and long-term partner separation in the monogamous male prairie vole. Male prairie voles continue to show affiliative behavior after short- (2 days) and long-term (4-weeks) partner separation, with similar effects for same and opposite-sex pairs. However, the transcriptional signature in the nucleus accumbens exhibits marked alterations after long-term separation.

      Strengths:

      1) A key strength of this manuscript is its use of the monogamous prairie vole to investigate transcriptional alterations associated with pair bonding and subsequent pair separation. This sort of behavior cannot be investigated in typical rodent model systems (e.g., mice, rats), and the choice of using prairie voles allows for dissection of potential mechanisms of social bonding with relevance to partner loss in humans.

      2) Investigation of behavioral measures and transcriptional alterations at both short- and long-term time points after pairing and separation is a strength of the manuscript. These time points were selected based on previous studies in laboratory and wild prairie voles related to the time it takes to form a pair bond and for the male prairie vole to leave the nest after the loss of the female pair. The datasets generated will be of great use to the scientific community.

      3) The authors investigate the behavior and transcriptional profiles after same-sex as well as opposite-sex pairing. This is considered a thoughtful decision on the authors' part which allows them to tease apart the effects of same vs. opposite sex.

      4) The use of numerous behavioral measures to assess both affiliative and aggressive behaviors is a strength of the approach.

      5) The authors use many biostatistical approaches (e.g., RRHO, WGCNA, Enrichr) to probe the transcriptomics data. These approaches allow the authors to move beyond simply assessing transcriptional profiles separately, but to look for patterns that are similar or different across datasets.

      6) The authors use rigorous statistical methods to assess behavioral measures.

      7) The TRAP approach in prairie voles is novel and will provide a great resource to the research community.

      Weaknesses:

      1) The methods state that prairie voles were treated differently in the behavioral and transcriptomics studies. Specifically, for the separation in the behavioral studies, prairie voles were separated by sight, but not necessarily by the smell from partners (i.e., partners were kept ~1 foot apart). However, prairie voles in the transcriptomics studies were separated by both sight and smell (i.e., partners were sacrificed after separation). Thus, it is possible that the lack of degradation of pair bond-related behavior after long-term separation might be due to these prairie voles being able to smell their partners after separation. This is considered a moderate flaw in the design of the studies which limits the integration of results between behavior and transcriptomics. This might be why the authors do not see a strong behavioral degradation of pair bond-related behavior after long-term separation but do see a strong transcriptional signature.

      2) While RRHO is helpful to assess overall patterns of transcriptional signatures across datasets, its utility for determining the exact transcripts is limited. This is because of how RRHO determines the overlapping transcripts for its Venn diagram feature (by taking the point where the p-value is most significant and taking the list to the outside corner of that quadrant).

      3) TRAP expression was verified in only one animal. Thus, the approach has not been appropriately confirmed.

      We thank the reviewer for their thoughtful comments on the innovative strengths and advantages of our manuscript.

      Regarding the noted weaknesses:

      (1) Please see our response to Reviewer #1, who shares your concerns.

      (2) We agree that RRHO is particularly useful for assessment of overall patterns. We interpret the Reviewer’s comment to mean that when extracting the overlapping gene lists from an RRHO quadrant for downstream analyses, we should filter that list for genes whose differential expression passes a nominal p-value cutoff to reduce the number of biologically insignificant conclusions we are drawing from the RRHO data. Our initial analyses used just such a threshold-based approach by identifying GO terms via differentially expressed genes of the combined pair bond (Figure 2) using both p-value and log2 fold-change cutoffs. This analysis revealed a number of terms associated with glial cell proliferation, differentiation, and function (Fig 2H). Such processes occur over a time frame of days to weeks, with different phases of differentiation characterized by different gene expression profiles. To explore this further, we used the genes in the UU and DD RRHO quadrants without implementing a p-value cutoff to see if additional genes associated with these GO-identified pathways may be showing subtle but consistent directional changes (Fig 3). Importantly, we only use the overlapping RRHO gene lists to determine how biological processes previously defined via DEG-predicted GO terms change across conditions; we are not using the RRHO gene lists to generate new GO terms. This allowed us to look for patterns within the identified pathways that may give insight into how transcription might be affecting gliogenesis. This analysis was similarly suggested to us by other experienced users of RRHO plots (see Acknowledgements). There are also several published studies that use RRHO UU and DD quadrant overlap (18–22).
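
      To make the quadrant-overlap step concrete, the following is a simplified sketch of how a rank-rank hypergeometric overlap scan can yield an up-up (UU) gene list: two signed gene rankings are scanned over pairs of rank thresholds, and the overlap at the most significant threshold pair is returned, as the reviewer describes. This is an illustration only, not the implementation in the RRHO software used for the manuscript; the function names, the `step` parameter, and the signed-score input format are assumptions.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for a hypergeometric variable (overlap of two gene sets)."""
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(comb(successes, x) * comb(population - successes, draws - x)
               for x in range(k, upper + 1))
    return tail / total


def best_up_up_overlap(scores_a, scores_b, step=1):
    """Scan rank thresholds within the up-regulated part of both lists and
    return (p_value, i, j, overlapping_genes) at the most significant point.
    Scores are signed: positive means up-regulated in that comparison."""
    genes = sorted(set(scores_a) & set(scores_b))
    n_genes = len(genes)
    rank_a = sorted(genes, key=lambda g: scores_a[g], reverse=True)
    rank_b = sorted(genes, key=lambda g: scores_b[g], reverse=True)
    n_up_a = sum(scores_a[g] > 0 for g in genes)
    n_up_b = sum(scores_b[g] > 0 for g in genes)

    best = (1.0, 0, 0, set())
    for i in range(step, n_up_a + 1, step):
        top_a = set(rank_a[:i])
        for j in range(step, n_up_b + 1, step):
            overlap = top_a & set(rank_b[:j])
            p = hypergeom_sf(len(overlap), n_genes, i, j)
            if p < best[0]:
                best = (p, i, j, overlap)
    return best
```

      The DD list would be obtained the same way after negating both score dictionaries. Because the returned list extends from the most significant threshold pair to the corner of the quadrant, it can include genes that would not individually pass a differential-expression cutoff, which is the limitation raised by the reviewer.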

      (3) Most labs rarely confirm Cre-dependence of vectors in more than one or two animals as the results, including those shown in Fig S9A, are typically definitive (i.e. no expression in the absence of Cre, expression in the presence of Cre). In addition to the images shown in Fig S9A, we used fluorescence-guided dissection to harvest tissue/mRNA, serving as an additional visual confirmation of RPL10-GFP expression in the animals used to generate Figure 4. Since submission, we have also confirmed that this vector expresses in rats when Cre-recombinase is present. However, prior to resubmission, we will perform additional surgeries to confirm that TRAP is only expressed in the presence of Cre-recombinase.

      References

      1. J. L. Scribner, E. A. Vance, D. S. W. Protter, W. M. Sheeran, E. Saslow, R. T. Cameron, E. M. Klein, J. C. Jimenez, M. A. Kheirbek, Z. R. Donaldson, A neuronal signature for monogamous reunion. Proceedings of the National Academy of Sciences. 117, 11076–11084 (2020).
      2. A. M. Borie, S. Agezo, P. Lunsford, A. J. Boender, J.-D. Guo, H. Zhu, G. J. Berman, L. J. Young, R. C. Liu, Social experience alters oxytocinergic modulation in the nucleus accumbens of female prairie voles. Current Biology. 32, 1026-1037.e4 (2022).
      3. E. S. Calipari, R. C. Bagot, I. Purushothaman, T. J. Davidson, J. T. Yorgason, C. J. Peña, D. M. Walker, S. T. Pirpinias, K. G. Guise, C. Ramakrishnan, K. Deisseroth, E. J. Nestler, In vivo imaging identifies temporal signature of D1 and D2 medium spiny neurons in cocaine reward. Proc. Natl. Acad. Sci. U.S.A. 113, 2726–2731 (2016).
      4. M.-F. O’Connor, D. K. Wellisch, A. L. Stanton, N. I. Eisenberger, M. R. Irwin, M. D. Lieberman, Craving love? Enduring grief activates brain’s reward center. NeuroImage. 42, 969–972 (2008).
      5. T. Hikida, S. Yao, T. Macpherson, A. Fukakusa, M. Morita, H. Kimura, K. Hirai, T. Ando, H. Toyoshiba, A. Sawa, Nucleus accumbens pathways control cell-specific gene expression in the medial prefrontal cortex. Sci Rep. 10, 1838 (2020).
      6. C. Baimel, L. M. McGarry, A. G. Carter, The Projection Targets of Medium Spiny Neurons Govern Cocaine-Evoked Synaptic Plasticity in the Nucleus Accumbens. Cell Reports. 28, 2256-2263.e3 (2019).
      7. N. S. Lee, N. L. Goodwin, K. E. Freitas, A. K. Beery, Affiliation, aggression, and selectivity of peer relationships in meadow and prairie voles. Frontiers in Behavioral Neuroscience. 13 (2019), doi:10.3389/fnbeh.2019.00052.
      8. O. J. Bosch, H. P. Nair, T. H. Ahern, I. D. Neumann, L. J. Young, The CRF System Mediates Increased Passive Stress-Coping Behavior Following the Loss of a Bonded Partner in a Monogamous Rodent. Neuropsychopharmacology. 34, 1406–1415 (2009).
      9. O. J. Bosch, J. Dabrowska, M. E. Modi, Z. V. Johnson, A. C. Keebaugh, C. E. Barrett, T. H. Ahern, J. Guo, V. Grinevich, D. G. Rainnie, I. D. Neumann, L. J. Young, Oxytocin in the nucleus accumbens shell reverses CRFR2-evoked passive stress-coping after partner loss in monogamous male prairie voles. Psychoneuroendocrinology. 64, 66–78 (2016).
      10. A. J. Grippo, B. S. Cushing, C. S. Carter, Depression-like behavior and stressor-induced neuroendocrine activation in female prairie voles exposed to chronic social isolation. Psychosomatic Medicine. 69, 149–157 (2007).
      11. A. J. Grippo, D. Gerena, J. Huang, N. Kumar, M. Shah, R. Ughreja, C. Sue Carter, Social isolation induces behavioral and neuroendocrine disturbances relevant to depression in female and male prairie voles. Psychoneuroendocrinology (2007), doi:10.1016/j.psyneuen.2007.07.004.
      12. J. R. Williams, C. S. Carter, T. Insel, Partner Preference Development in Female Prairie Voles Is Facilitated by Mating or the Central Infusion of Oxytocin. Annals of the New York Academy of Sciences. 652, 487–489 (1992).
      13. C. Sue Carter, A. Courtney Devries, L. L. Getz, Physiological substrates of mammalian monogamy: The prairie vole model. Neuroscience and Biobehavioral Reviews. 19, 303–314 (1995).
      14. L. L. Getz, C. S. Carter, L. Gavish, The mating system of the prairie vole, Microtus ochrogaster: Field and laboratory evidence for pair-bonding. Behavioral Ecology and Sociobiology. 8, 189–194 (1981).
      15. K. Horie, K. Inoue, S. Suzuki, S. Adachi, S. Yada, T. Hirayama, S. Hidema, L. J. Young, K. Nishimori, Oxytocin receptor knockout prairie voles generated by CRISPR/Cas9 editing show reduced preference for social novelty and exaggerated repetitive behaviors. Horm Behav. 111, 60–69 (2019).
      16. K. E. Savell, J. J. Tuscher, M. E. Zipperly, C. G. Duke, R. A. Phillips, A. J. Bauman, S. Thukral, F. A. Sultan, N. A. Goska, L. Ianov, J. J. Day, A dopamine-induced gene expression signature regulates neuronal function and cocaine response. Sci Adv. 6, eaba4221 (2020).
      17. D. Avey, S. Sankararaman, A. K. Y. Yim, R. Barve, J. Milbrandt, R. D. Mitra, Single-Cell RNA-Seq Uncovers a Robust Transcriptional Response to Morphine by Glia. Cell Reports. 24, 3619-3629.e4 (2018).
      18. S. L. Fulton, S. Mitra, A. E. Lepack, J. A. Martin, A. F. Stewart, J. Converse, M. Hochstetler, D. M. Dietz, I. Maze, Histone H3 dopaminylation in ventral tegmental area underlies heroin-induced transcriptional and behavioral plasticity in male rats. Neuropsychopharmacology. 47, 1776 (2022).
      19. S. G. Caradonna, T.-Y. Zhang, N. O’Toole, M.-J. Shen, H. Khalil, N. R. Einhorn, X. Wen, C. Parent, F. S. Lee, H. Akil, M. J. Meaney, B. S. McEwen, J. Marrocco, Genomic modules and intramodular network concordance in susceptible and resilient male mice across models of stress. Neuropsychopharmacol. 47, 987–999 (2022).
      20. J. S. Wang, T. Kamath, C. M. Mazur, F. Mirzamohammadi, D. Rotter, H. Hojo, C. D. Castro, N. Tokavanich, R. Patel, N. Govea, T. Enishi, Y. Wu, J. da Silva Martins, M. Bruce, D. J. Brooks, M. L. Bouxsein, D. Tokarz, C. P. Lin, A. Abdul, E. Z. Macosko, M. Fiscaletti, C. F. Munns, P. Ryder, M. Kost-Alimova, P. Byrne, B. Cimini, M. Fujiwara, H. M. Kronenberg, M. N. Wein, Control of osteocyte dendrite formation by Sp7 and its target gene osteocrin. Nat Commun. 12, 6271 (2021).
      21. D. A. Gallegos, M. Minto, F. Liu, M. F. Hazlett, S. Aryana Yousefzadeh, L. C. Bartelt, A. E. West, Cell-type specific transcriptional adaptations of nucleus accumbens interneurons to amphetamine. Mol Psychiatry, 1–15 (2022).
      22. B. J. Hilton, A. Husch, B. Schaffran, T. Lin, E. R. Burnside, S. Dupraz, M. Schelski, J. Kim, J. A. Müller, S. Schoch, C. Imig, N. Brose, F. Bradke, An active vesicle priming machinery suppresses axon regeneration upon adult CNS injury. Neuron. 110, 51-69.e7 (2022).
    1. Author Response

      Reviewer #1 (Public Review):

      In this paper the authors present variations in carbon oxidation state and hydration state in proteomes available in RefSeq. Then they use this information to predict community level proteomes, and their corresponding carbon oxidation states and hydration states, based on available 16S rRNA gene sequences from selected previously published datasets. When combining this with information about the environmental setting of the individual samples analyzed, the authors are able to demonstrate connections between redox conditions and proteomic carbon oxidation state and hydration state. Furthermore, they explore how individual taxonomic groups at different taxonomic levels contribute to forming these connections.

      A weakness with the study is that the described environmental proteomes are inferred from 16S rRNA gene sequence data and not observed directly. However, there is good reason to believe that the conclusions drawn in the paper are valid.

      The study sheds light on microbial adaptations on the genome level that so far have received relatively little attention. The paper is also interesting from an ecological perspective regarding the general question of how microbial communities are shaped by environmental settings.

      To attempt to bring more attention to environmental constraints, a plot (Figure 4E in the published paper) was redrawn to more clearly show how carbon oxidation state of estimated community proteomes not only is lower in more reducing conditions for a variety of environments but also shows the largest differences for hydrothermal systems and shale-gas wells. This finding is discussed in terms of geological sources of reductants and provides new evidence that the chemical makeup of microbial communities reflects their geological context.

      Reviewer #2 (Public Review):

      This manuscript mainly investigated the carbon oxidation and stoichiometric hydration states of the inferred community proteomes according to 16S rRNA gene compositions from the published datasets and explored their potential associations with environmental parameters such as redox gradients, oxygen concentrations and salinity.

      Predictions of the carbon oxidation and stoichiometric hydration states on the basis of microbial proteomes can provide some meaningful information for disentangling microbial response to environmental changes. As we know, some genes in microbial genomes are not expressed and translated into proteins. Therefore, such gene redundancy in genomes may lead to bias in predicting the carbon oxidation and stoichiometric hydration states.

      Our study uses available data sources to identify informative differences of elemental compositions of proteomes predicted from genomes. There are numerous examples in the literature of using protein sequences predicted from genomes to make comparisons of amino acid composition (for example, in eLife: https://doi.org/10.7554/eLife.57347), so it would appear to be acceptable with some level of uncertainty to use genomic data to make comparisons between (amino acid or elemental) compositions of predicted proteomes.

      Furthermore, this study compiled many 16S rRNA gene datasets from previous studies. Different primer sets were applied in those studies, and such differences will result in distinct 16S rRNA gene compositions. Accordingly, it is essential to deal with the influence of different primer sets on the 16S rRNA gene compositions among samples. Unfortunately, such information is missing in the Methods section.

      Primer sets used in the source studies have been added to Table 1 in the published paper. The Discussion was modified to acknowledge limitations in making comparisons *between* datasets obtained using different primers. However, the main results of this study are based on differences of carbon oxidation state (Zc) *within* individual datasets (for instance, along the vertical redox gradients shown in Figure 3).

      The intra-dataset differences of Zc themselves are compared across datasets in Figure 4E. However, it can be expected that the effects of technical variability – including not only primer pairs but also DNA extraction methods, etc. – would tend to be reduced in these inter-dataset comparisons of intra-dataset differences, in contrast to direct inter-dataset comparisons. The index plot at the center of Figure 2 does make a direct inter-dataset comparison, but the outcome is consistent with trends identified in previous analyses of shotgun metagenomic datasets, 16S primers and other technical differences between studies notwithstanding.
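
      For readers unfamiliar with the metric, the within-dataset comparison described above can be sketched in a few lines. The sketch assumes the usual definition of the average carbon oxidation state for a species with formula C_c H_h N_n O_o S_s and charge z, namely Zc = (z - h + 3n + 2o + 2s) / c. The function name and the two example compositions are hypothetical and are not taken from the paper's data.

```python
def carbon_oxidation_state(c, h, n, o, s, charge=0):
    """Average oxidation state of carbon (Zc) from an elemental formula.

    Assumes H = +1, N = -3, O = -2, S = -2, so that
    Zc = (charge - h + 3n + 2o + 2s) / c.
    """
    if c <= 0:
        raise ValueError("formula must contain carbon")
    return (charge - h + 3 * n + 2 * o + 2 * s) / c


# Sanity checks on small molecules.
zc_methane = carbon_oxidation_state(c=1, h=4, n=0, o=0, s=0)   # CH4
zc_co2 = carbon_oxidation_state(c=1, h=0, n=0, o=2, s=0)       # CO2
zc_glycine = carbon_oxidation_state(c=2, h=5, n=1, o=2, s=0)   # C2H5NO2

# Hypothetical elemental compositions of two estimated community proteomes
# from the same dataset (illustrative numbers only).
oxic = dict(c=1000, h=1570, n=270, o=300, s=6)
anoxic = dict(c=1000, h=1590, n=265, o=290, s=6)

# Intra-dataset difference of Zc between the two samples.
delta_zc = carbon_oxidation_state(**oxic) - carbon_oxidation_state(**anoxic)
```

      Because both samples in such a comparison come from the same study, they share primers and extraction methods, which is the reason given above for relying on within-dataset differences.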

      Additionally, the community proteomes in this study were inferred from 16S rRNA genes. The 16S rRNA marker gene cannot reliably predict the corresponding genomes, possibly leading to biased proteome predictions. Therefore, using 16S rRNA genes to predict microbial genomes and proteomes should be avoided.

      Despite the various sources of uncertainty in making estimates of elemental composition of communities from 16S rRNA genes and reference proteomes, comparisons with shotgun metagenomic data support the reliable identification of trends within datasets (Figure 5 in the published paper).

      It seems that the relationships between carbon oxidation state/stoichiometric hydration state and redox/salinity gradients have been reported in previous studies (e.g., Dick et al 2019, 2020, 2021). The findings of this study are not new compared with those previously reported.

      The explorations in previous studies of chemical links between communities and environments were based on analysis of shotgun metagenomic data. The ability to reproduce those findings by analyzing 16S rRNA gene sequence data is a new advance in this study.

      Other new results in the published paper are the different magnitudes of Zc differences in various environments (which were not previously documented from shotgun metagenomes; Figure 4E) and the comparison of shotgun metagenome and 16S-based estimates of Zc for the time series of injected fluids in the Marcellus Shale (Figure 5B). The latter results are particularly interesting; the close correspondence for Days 0, 7, and 13 supports the basic reliability of the 16S-based estimates, while the increasing divergence at Days 82 and 328 suggests the onset of some interfering mechanisms (the speculation is made that this could be related to viral lysis and heterotrophic degradation of the released DNA). Also, the published paper presents the first analysis of carbon oxidation state of proteins – from either shotgun metagenome sequences or 16S rRNA-based estimates – for microbial communities in various body sites using data from the Human Microbiome Project (Figure 5D).

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript by de la Vega and colleagues describes Neuroscout, a powerful and easy-to-use online software platform for analyzing data from naturalistic fMRI studies using forward models of stimulus features. Overall, the paper is interesting, clearly written, and describes a tool that will no doubt be of great use to the neuroimaging community. I have just a few suggestions that, if addressed, I believe would strengthen the paper.

      Major comments

      1) How does Neuroscout handle collinearity among predictors for a given stimulus? Does it check for this and/or throw any warnings? In media stimuli that have been adopted for neuroimaging experiments, low-level audiovisual features are not infrequently correlated with mid-level features such as the presence of faces on screen (see Grall & Finn, 2022 for an example involving the Human Connectome Project video clips). How to disentangle correlated features is a frequent concern among researchers working with naturalistic data.

      We agree with the reviewer that collinearity between predictors is one of the biggest challenges for naturalistic data analysis. However, absent consensus on how to best model these data, we find that it is out of scope of the present report to make strong recommendations. Instead, our goal was to design an agnostic platform that would enable users to thoughtfully design statistical models for their particular goal. Papers such as Grall & Finn (2022) will be critical in advancing the debate on how to best analyze and interpret such data.

      We explicitly address this challenge in a new paragraph in the discussion under “Challenges and future directions”:

      “A major challenge in the analysis of naturalistic stimuli is the high degree of collinearity between features, as the interpretation of individual features is dependent on co-occurring features. In many cases, controlling for confounding variables is critical for the interpretation of the primary feature— as is evident in our investigation of the relationship between FFA and face perception. However, it can also be argued that in dynamic narrative driven media (i.e. films and movies), the so-called confounds themselves encode information of interest that cannot or should not be cleanly regressed out (Grall & Finn, 2022).[…] Absent a consensus on how to model naturalistic data, we designed Neuroscout to be agnostic to the goals of the user and empower them to construct sensibly designed models through comprehensive model reports. An ongoing goal of the platform—especially as the number of features continues to increase—will be to expand the visualizations and quality control reports to enable users to better understand the predictors and their relationship. For instance, we are developing an interactive visualization of the covariance between all features in Neuroscout that may help users discover relationships between a predictor of interest and potential confounds.” (pg. 11)

      Note that we shortened the second paragraph of the discussion by two sentences, as it had touched on this subject, which is better addressed separately.

      In addition, we now highlight the covariance structure visualization in the Results section:

      “At this point, users can inspect the model through quality-control reports and interactive visualizations of the design matrix and predictor covariance matrix, iteratively refining models if necessary.” (pg. 3)
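
      The kind of check that a predictor covariance view supports can be sketched as follows. This is a minimal illustration on synthetic data, not Neuroscout's implementation: the variable names are hypothetical, and the variance inflation factor (VIF) is a standard collinearity diagnostic added here as an assumption, not something the platform is stated to compute.

```python
import numpy as np


def predictor_correlations(X):
    """Pearson correlation matrix between the columns of a design matrix."""
    return np.corrcoef(X, rowvar=False)


def variance_inflation_factors(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


# Synthetic predictors: "faces" is built to covary with "brightness".
rng = np.random.default_rng(0)
n = 500
brightness = rng.normal(size=n)
faces = 0.8 * brightness + 0.6 * rng.normal(size=n)
loudness = rng.normal(size=n)  # roughly independent of the other two

X = np.column_stack([brightness, faces, loudness])
corr = predictor_correlations(X)
vif = variance_inflation_factors(X)
```

      A high off-diagonal correlation or a VIF well above 1 flags a predictor whose estimate will depend strongly on which co-occurring features are included in the model.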

      2) On a related note, do the authors and/or software have opinions about whether it is more appropriate to run several regressions each with a single predictor of interest or to combine all predictors of interest into a single regression? (Or potentially a third, more sophisticated solution involving variance partitioning or another technique to [attempt to] isolate variance attributable to each unique predictor?) Does the answer to this depend on the degree of collinearity among the predictors? Some discussion of this would be helpful, as it is a frequent issue encountered when analyzing naturalistic data.

      This is a very sensitive methodological point, but one for which it is hard to find a univocal answer in the literature. While on the one hand it can be deceptive to model a single feature in isolation (as illustrated by our face perception analyses), more complex models pose different challenges in terms of robust parameter estimation and variance attribution. Resolving these challenges goes beyond the scope of our work, and it is ultimately our goal to provide a flexible tool which will enable these types of investigations, and enable users to take responsibility and provide motivations for methodological choices made using the platform. We touch on Neuroscout’s agnostic philosophy on this issue under “Challenges and future directions” (pg. 11; quoted above).

      However, we also agree that in part the solution to this problem will be methodological. This is particularly true for modeling deep learning based embeddings, which can have hundreds of features in a single model. We are currently working on expanding beyond traditional GLM models in Neuroscout, opening the door to more sophisticated variance partitioning techniques, and more robust parameter estimation in complex models. We highlight current and future efforts to expand Neuroscout’s statistical models in the following paragraph:

      “However, as the number of features continues to grow, a critical future direction for Neuroscout will be to implement statistical models which are optimized to estimate a large number of covarying targets. Of note are regularized encoding models, such as the banded-ridge regression as implemented by the Himalaya package. These models have the additional advantage of implementing feature-space selection and variance partitioning methods, which can deal with the difficult problem of model selection in highly complex feature spaces such as naturalistic stimuli. Such models are particularly useful for modeling high-dimensional embeddings, such as those produced by deep learning models. Many such extractors are already implemented in pliers and we have begun to extract and analyze these data in a prototype workflow that will soon be made widely available.” (pg. 11)
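
      The idea behind banded ridge regression can be illustrated with a closed-form sketch in plain NumPy. This is not the Himalaya API and not Neuroscout code: the function names are hypothetical, the penalties are fixed by hand (in practice they would be selected by cross-validation), and splitting the prediction by band is only a simple stand-in for a full variance partitioning analysis.

```python
import numpy as np


def banded_ridge(X_bands, y, alphas):
    """Closed-form ridge regression with one penalty per feature band.

    X_bands: list of (n_samples, n_features_b) arrays, one per feature space.
    alphas:  list of non-negative penalties, one per band.
    Returns a list of weight vectors, one per band.
    """
    X = np.hstack(X_bands)
    # Each feature inherits the penalty of the band it belongs to.
    penalties = np.concatenate(
        [np.full(Xb.shape[1], a) for Xb, a in zip(X_bands, alphas)]
    )
    w = np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ y)
    splits = np.cumsum([Xb.shape[1] for Xb in X_bands])[:-1]
    return np.split(w, splits)


def band_contributions(X_bands, weights):
    """Predicted signal attributable to each band; these sum to the full prediction."""
    return [Xb @ wb for Xb, wb in zip(X_bands, weights)]


# Synthetic example: 3 low-level features drive the signal,
# a 50-dimensional "embedding" band carries no signal.
rng = np.random.default_rng(1)
n = 200
X_low = rng.normal(size=(n, 3))
X_emb = rng.normal(size=(n, 50))
y = X_low @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

weights = banded_ridge([X_low, X_emb], y, alphas=[1.0, 1000.0])
parts = band_contributions([X_low, X_emb], weights)
```

      Giving the high-dimensional band its own, larger penalty keeps it from absorbing variance that the small band explains, which is the practical appeal of this family of models for embedding features.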

      3) What the authors refer to as "high-level features" - i.e., visual categories such as buildings, faces, and tools - I would argue are better described as "mid-level features", reserving the term "high-level" for features that are present only in continuous, engaging, narrative or narrative-like stimuli. Examples: emotional tone or valence, suspense, schema for real-world situations, other operationalizations of a narrative arc, etc. After all, as the authors point out, one doesn't need naturalistic paradigms to study brain responses to visual categories or single-word properties. Much of the work that has been done so far with forward models of naturalistic stimuli has been largely confirmatory (e.g., places/scenes still activate PPA even during a rich film as opposed to a serial visual presentation paradigm). This is a good first step, but the promise of naturalistic paradigms is ultimately to go beyond these isolated features toward more holistic models of cognitive and affective processes in context. One challenge is that extracting true high-level features is not easily automated, although the ability to crowdsource human ratings using online data collection has made it feasible to create manual annotations. However, there are still technical challenges associated with collecting continuous-response measurement (CRM) data during a relatively long stimulus from a large number of individuals online. Does Neuroscout have any plans to develop support for collecting CRM data, perhaps through integration with Amazon MTurk and/or Prolific? Just a thought and I am sure there are a number of features under consideration for future development, but it would be fabulous if users could quickly and easily collect CRM data for high-level features on a stimulus that has been uploaded to Neuroscout (and share these data with other end users).

      The reviewer makes a very good point regarding the fact that many so-called “high-level” features are best called “mid-level”. As such, we have changed our use of “high-level” to “mid-level perceptual features” throughout the manuscript.

      “Currently available features include hundreds of predictors coding for both low-level (e.g., brightness, loudness) and mid-level (e.g., object recognition indicators) properties of audiovisual stimuli…” (pg. 3)

      That said, we do believe that as machine learning (and in particular deep learning) models evolve, it will become more feasible to extract higher-level features automatically. This has already been shown with transformer language models, which are able to extract higher-level semantic information from natural text. To this end, we have designed our underlying feature extraction platform, pliers, to be easily extensible, so that the platform can continue to grow as algorithms evolve. We highlight this in the Results section ‘Automated annotation of stimuli’:

      “The set of available predictors can be easily expanded through community-driven implementation of new pliers extractors, as well as public repositories of deep learning models, such as HuggingFace and TensorFlowHub. We expect that as machine learning models continue to evolve, it will be possible to automatically extract higher-level features from naturalistic stimuli.” (pg. 3)

      We also highlight the extensibility of pliers to increasingly powerful deep learning models in the Discussion by revising this sentence:

      “As a result, we have designed Neuroscout and its underlying feature extraction framework pliers to facilitate community-led expansion to novel extractors— made possible by the rapid increase in public repositories of pre-trained deep learning models such as HuggingFace and TensorFlow Hub” (pg. 10)

      As to the point of a potential extension to Neuroscout for easily collecting crowdsourced stimulus annotations, we are in full agreement that this would be very useful. In fact, this feature was part of the original plan for Neuroscout, but fell out of scope as other features took priority. Although we are unsure whether this extension is a short-term priority for the Neuroscout team (as it would likely take substantial effort to develop a general-purpose extension), the ability to submit user-generated features to the Neuroscout API should make it possible to design a modular extension to Neuroscout to collect such features.

      We mention this possibility briefly in the future directions section:

      “Other important expansions include facilitating analysis execution by directly integrating with cloud-based neuroscience analysis platforms (such as Brainlife.io) and facilitating the collection of higher-level stimulus features by integrating with crowdsourcing platforms such as MechanicalTurk or Prolific.” (pg. 11)

      4) Can the authors talk a bit more about the choice to demean and rescale certain predictors, namely the word-level features for speech analysis? This makes sense as a default step, but I wonder if there are situations in which the authors would not recommend normalizing features prior to computing the GLM (e.g., if sign is meaningful, if the distribution of values is highly skewed, if the units reflect absolute real-world measurements, etc). Does Neuroscout do any normalization automatically under the hood for features computed using the software itself and/or features that have been calculated offline and uploaded by the user?

      In keeping with Neuroscout’s philosophy to be a general purpose platform, we have not performed any standardization of features. Instead, users can choose to modify raw predictor values by applying transformations on a model-by-model basis. Currently available transformations through the web interface include: scale, orthogonalize and threshold. Note that there is a wider range of transformations available in the BIDS Stats Model, but we are hesitant to advertise these yet, as they are more difficult to use.

      We revised our description of transformations in the Result section to clarify these transformations are model specific:

      “Raw predictor values can be modified by applying model-specific transformations such as scaling, thresholding, orthogonalization, and hemodynamic convolution.” (pg. 3)

      We also clarify that variables are ingested without any in-place modifications in the Methods section. The only exception is that we down-sample highly dense variables (such as those from auditory files, which can result in thousands of values per second), to save disk space:

      “Feature values are ingested directly with no in place modifications, with the exception of down sampling of temporally dense variables to 3hz to reduce storage on the server.” (pg. 13)

      With respect to the word frequency analysis, the primary reason we scaled variables was to facilitate imputing missing values for words not found in the look-up dictionary. By scaling the variable, we were able to replace missing values with zero, effectively assigning them the average word frequency value. We clarified this strategy in the Methods section:

      “In all analyses, this variable was demeaned and rescaled prior to HRF convolution. For a small percentage of words not found in the dictionary, a value of zero was applied after rescaling, effectively imputing the value as the mean word frequency.” (pg. 17)
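
      The imputation strategy described above can be sketched as follows. This is a minimal illustration, assuming missing words are represented as NaN and that rescaling means dividing by the standard deviation of the observed values; the function name and the example values are hypothetical.

```python
import numpy as np


def standardize_and_impute(values):
    """Z-score the observed values; missing entries (NaN) become 0, i.e. the mean."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    mean = values[~missing].mean()
    std = values[~missing].std()
    scaled = (values - mean) / std
    # After demeaning, 0 corresponds to the average observed word frequency.
    scaled[missing] = 0.0
    return scaled


# Hypothetical word-frequency values; NaN marks words absent from the dictionary.
word_freq = [5.2, 3.1, np.nan, 4.4, 6.0, np.nan, 2.9]
scaled = standardize_and_impute(word_freq)
```

      Filling with zero after rescaling leaves the mean of the variable at zero, so the imputed words neither raise nor lower the predictor relative to its average.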

      On a more general note, when interpreting a single variable with a dummy coded contrast (i.e. 1 for the predictor of interest, and 0 for all other variables), it’s not necessary to normalize features prior to modeling, as fMRI t-stat maps are scale-invariant (although the parameter estimates will be affected).
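
      The scale invariance of the t statistic can be checked numerically with an ordinary least squares sketch. The example below uses synthetic data and hypothetical variable names; it assumes the model contains an intercept, which is what makes the t statistic insensitive to demeaning as well as to rescaling.

```python
import numpy as np


def ols_t_stat(X, y, contrast):
    """OLS effect estimate and t statistic for a contrast over the columns of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
    c = np.asarray(contrast, dtype=float)
    effect = c @ beta
    return effect, effect / np.sqrt(c @ cov @ c)


rng = np.random.default_rng(2)
n = 300
x = rng.normal(loc=3.0, scale=2.0, size=n)    # predictor of interest, raw units
nuisance = rng.normal(size=n)
y = 0.5 * x + 0.3 * nuisance + rng.normal(size=n)

# Same model with the predictor in raw units and after demeaning + rescaling.
X_raw = np.column_stack([np.ones(n), x, nuisance])
X_scaled = np.column_stack([np.ones(n), (x - x.mean()) / x.std(), nuisance])

contrast = [0, 1, 0]  # 1 for the predictor of interest, 0 elsewhere
beta_raw, t_raw = ols_t_stat(X_raw, y, contrast)
beta_scaled, t_scaled = ols_t_stat(X_scaled, y, contrast)
```

      The two t statistics agree to numerical precision, while the parameter estimate changes by exactly the scaling factor (here the standard deviation of the raw predictor).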

      We added a note with our recommendations in the Neuroscout Documentation: https://neuroscout.github.io/neuroscout-docs//web/builder/transformations.html#scale

      Reviewer #2 (Public Review):

      The authors present a new platform for constructing and sharing fMRI analyses, specifically geared toward analyzing publicly-available naturalistic datasets using automatically-extracted features. Using a web interface, users can design their analysis and produce an executable package, which they can then execute on their local hardware. After execution, the results are automatically uploaded to NeuroVault. The paper also describes several examples of analyses that can be run using this system, showing how some classical feature-sensitive ROIs can be derived from a meta-analysis of naturalistic datasets.

      The Neuroscout system is impressive in a number of ways. It provides easy access to a number of publicly-available datasets (though I would like to see the current set of 13 datasets increase in the future), has a wide variety of machine-learning features precomputed on the video and audio features of these stimuli, and builds on top of established software for creating and sandboxing analysis workflows. Performing meta-analyses across multiple datasets are challenging both practically and statistically, but this kind of multi-dataset analysis is easy to specify using Neuroscout. It also allows researchers to easily share a reproducible version of their pipeline simply by pointing to the publicly-available analysis package hosted on Neuroscout. The platform also provides a way for researchers to upload their own custom models/predictors to extend those available by default.

      The case studies described in the paper are also quite interesting, showing that traditional functional ROIs such as PPA and VWFA can be defined without using controlled stimuli. They also show that, running a contrast for faces does not produce FFA until speech (and optionally adaptation) is properly controlled for, and that VWFA shows relationships to lexical processing even for speech stimuli.

      I have some questions about the intended workflow for this tool: is Neuroscout meant to be used for analysis development in addition to sharing a final pipeline? The fact that the whole analysis is packaged into a single command is excellent for reproducibility but seems challenging to use when iterating on a project. For example, if we wanted to add another contrast to a model, it appears that this would require cloning the analysis and re-starting the process from scratch.

      An important principle of Neuroscout from the onset of the project was to minimize undocumented researcher degrees of freedom, and maximize transparency in order to reduce the file drawer effect which can contribute to biased results in the published literature. As such, we require analyses to be registered and locked as the modal usage of our application. In the case of adding a contrast, it is true that this would require a user to clone the analysis. Although all of the information from the previous model would be encoded in the new model, this would require re-estimating the design matrix which could be time consuming. However, in our experience, users almost always add new variables to the design-matrix when a study is cloned, which would in any case require re-estimating the design matrix for all runs and subjects. We believe this trade-off is worthwhile to ensure maximal reproducibility, but also point out that since Neuroscout’s data is freely available via our API, power users could directly access the data if they need to use it in a less constrained manner.

      We believe that these important distinctions are best addressed in the newly developed Neuroscout documentation which we now reference throughout the text (https://neuroscout.org/docs/web/browse/clone.html).

      I'm also unsure about how versioning of the input datasets and the predictors is planned to be handled by the platform; if datasets have been processed with multiple versions of fmriprep, will all of those options be available to choose from? If the software used to compute features is updated, will there be multiple versions of the features to choose from?

      The reviewer makes an astute observation regarding the versions of input data (predictors & datasets). Currently we have only pre-processed the imaging data once per dataset, and as such this has not been an issue. However, in the long run we certainly agree it would be important to give users the ability to choose which pre-processed version of the raw dataset they want to use, as there could certainly be differing but equally valid versions. We have opened an issue in Neuroscout’s repository to track this, and plan to incorporate this ability in a future version (https://github.com/neuroscout/neuroscout/issues/1076).

      With respect to feature versions, every time a feature is re-extracted, a new predictor_id is generated, and the accompanying meta-data such as time of extraction is tracked for that specific version. As such, if a feature is updated and re-extracted, this will not change existing analyses. By default, we have chosen to obscure this from the user to make the user experience simpler. However, there is an open issue to expand the frontend’s ability to explicitly display different versions, and allow users to update older analyses with newer versions of features. Advanced users already have access to this functionality by using the Python API (PyNS) to directly access all features, and create analyses with more precision.

      We have made a note regarding this behavior in the Neuroscout Documentation: https://neuroscout.github.io/neuroscout-docs/web/builder/predictors.html

      I also had some difficulty attempting to test out the platform, so additional user testing may be necessary to ensure that novice users are able to successfully run analyses.

      We thank the reviewer for this bug report, which allowed us to fix a previously unnoticed issue with a subset of Neuroscout datasets. We have been in contact with the reviewer to ensure that this issue was successfully addressed.

    1. Author Response

      Reviewer #1 (Public Review):

      1) While the authors identify the suppressors in known genetic interactors (GIs) of the yeast SEC53, it is worth testing if the compensatory mutations are rewiring the GIs, thereby explaining the lack of comparable compensations observed in reconstituted strains. If altered GIs explain the suppression, then while yeast serves as an excellent tool to perform these assays, the human context of the disease may require a different set of genetic suppressors and, therefore, a different target than the yeast PGM1 ortholog.

      Our data show that pgm1 mutations alone greatly improve growth of sec53-V238M strains. Our data also indicate other pathways of compensation. Whether each of these compensatory mechanisms translate to humans is unknown. However, the observed enrichment of compensatory mutations in genes whose human homologs are associated with Type 1 CDG, suggests that many of these genetic interactions are likely to be conserved.

      Also, do the Sec53 and Pgm1 proteins interact directly in yeast, and are these mutations on the interaction interface?

      As we mention above, there is no support for a direct physical interaction between Sec53 and Pgm1.

      2) Based on the data obtained between pACT1 and pSEC53-driven expression of the SEC53 mutant alleles, the pattern of suppressors appears to be different. Authors report that the variants expressed from strong pACT1 promoters show more suppressors than those driven by native promoters. Is this a general trend in experimental evolution that slower-growing strains tend to show fewer suppressors? For example, on Page 6, line 154, "compensating for Sec53-F126L dimerization defects are rare or not easily accessible". The statement suggests that the authors did obtain suppressors that compensate for the dimerization defect. At the same time, while rare (also, are the authors suggesting suppression of the dimerization defect, as in better dimerization?), the rate of obtaining suppressors seems to be linked to the severity of the fitness defects of the strains. The lack of suppressors may be a limitation of the evolution experiments. Indeed later in the manuscript, the authors noticed that while PGM1 suppressors obtained in V238M can also suppress F126L alleles, the suppression was not as efficient. Could it be that evolution experiments in slower-growing strains predominantly enrich suppressors in other pathways (i.e., not in the CDG orthologs) that restore the growth better and compete out the relatively weaker suppressors in PGM1? In fact, the authors report similar effects on Page 7, lines 204-210. These two paragraphs are contradictory and should be explained further.

      All of our sequencing was performed on strains with sec53 under the control of the pACT1 promoter. While we did not identify unique sec53-F126L suppressors, we cannot exclude that sec53-F126L suppressors exist, so we describe them as “rare or not easily accessible”. While it is possible that the slower growth rate of the sec53-F126L allele could impact the likelihood of observing suppressors, we think it is more likely due to the nature of the variant (dimerization defect versus stability defect) rather than growth rate. In other laboratory evolution experiments the same beneficial mutation typically has a greater effect in slower-growing backgrounds (for example: doi.org/10.1126/science.1250939).

      3) Authors report that the LOF of PGM1 compensates for the SEC53 mutations. However, the evolution experiments did not capture any LOFs in PGM1. The fitness comparisons in evolution experiments are different as many different genotypes compete in a mix. Therefore, the fitness assays in a clonal population may not represent these differences well. To test this argument, authors can try to mimic the evolution experiments by mixing two genotypes to check competitive fitness, like the co-culture of pgm1 suppressor obtained via evolution experiments with pgm1Δ.

      Though we did not perform a direct head-to-head competition between a pgm1 suppressor and a pgm1Δ, our data suggest that the pgm1 deletion would outcompete some of the lower-fitness suppressors. In the Discussion we speculate as to why we do not see deletion mutations: “Given that most of the evolved clones containing pgm1 mutations are more fit than the reconstructed strains, it is possible that other evolved mutations interact epistatically only with non-loss-of-function pgm1 mutations.” Though it is beyond the scope of the present manuscript, it would be possible to rerun the evolution experiment in sec53-V238M strains carrying either a pgm1 missense suppressor or a pgm1Δ. Under the hypothesis of additional interacting loci, only the pgm1 missense suppressors would be more likely to acquire additional compensatory mutations.

      Reviewer #3 (Public Review):

      Vignogna et al. used yeast genetics, experimental evolution and biochemistry to tackle human congenital disorders of glycosylation (CDG), a disease mostly caused by mutations in PMM2. They took advantage of the observation that the budding yeast gene SEC53 is almost identical to human PMM2, and used experimental evolution to find interactors of SEC53/PMM2. They found an overrepresentation of mutations in genes corresponding to other human CDG genes, including PGM1. Genetic and biochemical characterizations of the pgm1 mutations were carried out. This work is solid, although the authors did not reveal why reduction of pgm1 activity could compensate for defects of a particular mutant allele of sec53.

      Out of curiosity, if the authors were to simply focus on the preexisting mutations, would they have gotten the materials for most of the experiments in this article? In other words, how important is the experimental evolution?

      The evolution experiment was crucial as the specific pgm1 mutations we identified here have not been reported elsewhere, nor have the orthologous mutations been identified in human PGM1.

      A strain table with full genotypes is needed.

      We added a strain genotype table (Supplemental Dataset 2).

    1. Author Response

      Reviewer #2 (Public Review):

      In this MEG work employing two types of bistable perception test and unique regression analyses, the authors linked different neural frequencies to different components of visual perception: its content and its stability.

      Strengths:

      This study has a nice set of three different experiments to clarify neural differences between content, memory and stability of visual perception.

      The state space analysis appears to be powerful to identify such different neural signatures for different cognitive components as well.

      Weaknesses:

      Despite such strengths, this work may have the somewhat critical weakness specified in the recommendations for the authors.

      First, in the analysis to identify content-specific neural frequency, the authors concluded that the SCP is more relevant to the visual perceptual content compared to the neural activity in the alpha and beta-band frequencies. In my impression, to claim this, it would be necessary to show statistically significant differences in the prediction accuracy between the SCP and the other frequencies. Given the not-so-high prediction accuracy seen in the SCP-based analysis, such statistical supports appear essential.

      We have now directly compared decoding accuracy for SCP and alpha/beta oscillations, which showed statistically significant differences in both the ambiguous and unambiguous conditions for both ambiguous images. We have added these results as a supplementary figure (new Figure 2—figure supplement 1).

      Second, two behavioural metrics in the neural state space analysis-i.e., Switch and Direction-may be too arbitrary. As suggested by the power-law distribution of the percept duration, the neural dynamics during seemingly stable percept may not be able to be described in linear functions. Instead, the brain may go back and forth between several neural states even when we are thinking we're experiencing stable visual consciousness. If so, the current definition of the Switch metric and Direction index, which seems to be based on the behaviour of the Switch index, may be arbitrary. In other words, I feel the authors may have to elaborate the rationale for the definitions of such metrics.

      First, we note it is generally accepted in the field that the distribution of percept durations follows a gamma distribution instead of a power-law distribution (e.g., Sterzer et al., TiCS 2009; Blake & Logothetis Nature Rev. Neurosci 2002; Kleinschmidt et al., 1998; Leopold et al., TiCS 1999), and microswitches have not been reported either using the more classic task as that employed here or the more recently developed ‘no-report’ task of using eye-tracking statistics to deduce perceptual switches without overt report (e.g., Frassle et al., J Neurosci 2014).

      Second, while brain activity may fluctuate during these time periods, it never crosses the threshold of evoking a conscious report, and thus we would expect that such fluctuations, if they do occur, would be of a lower magnitude than those that do produce a conscious report.

      Most importantly, our goal here is to define behavioral metrics in order to identify components of neural dynamics underpinning the relevant aspect of behavior. As such, our definition of the behavioral metric should not be directly informed by observed spontaneous dynamics of brain activity (especially those that may be observed in the data but are of unclear relevance to perceptual switching); otherwise the analysis would be prone to circularity and spurious correlations (i.e., using observed brain dynamics to inform construction of behavioral metrics might pick up aspect of brain dynamics not really relevant to behavior in the analysis results).

      Finally, the timing characteristics of ‘Switch’ and ‘Direction’ behavioral metrics are not arbitrary; instead they are the simplest behavioral functions that allow a comparison of pre- and post-switching periods (or when the percepts might be in the ‘stabilizing’ phase vs. the ‘destabilizing’ phase). Nevertheless, the regression analysis can pick up on other temporal patterns of changes not exactly the same as our defined behavioral metric. This can be seen for SCP and beta activity projected onto the Direction axis, where it has the lowest value at ~20th percentile of the trial (not 50th percentile as assumed by the behavioral metric). To confirm that the analysis is not highly dependent on the precise timing definition of the behavioral metrics, we ran a control analysis, where the switching point was set at the 30th percentile (rather than the 50th percentile, as in the original analysis). This control analysis resulted in a similar pattern of neural results (Figure R1).

      Figure R1: Changing temporal behavior definition (switching point moved from 50th percentile to 30th percentile of percept duration) does not significantly alter the neural results. Compare to Figure 4—figure supplement 1, ‘Switch’ and ‘Direction’ columns.

    1. Author Response

      Reviewer #1 (Public Review):

      This paper shows that a principled, interpretable model of auditory stimulus classification can not only capture behavioural data on which the model was trained but somewhat accurately predict behaviour for manipulated stimuli. This is a real achievement and gives an opportunity to use the model to probe potential underlying mechanisms. There are two main weaknesses. Firstly, the task is very simple: distinguishing between just two classes of stimuli. Both model and animals may be using shortcuts to solve the task, for example (this is suggested somewhat by Figure 8 which shows the guinea pig and model can both handle time-reversed stimuli).

      The task structure is indeed simple. In the context of categorization tasks that are typically used in animal experiments, however, we would argue that we are at the higher end of stimulus complexity. Auditory categories used in most animal experiments typically employ a category boundary along a single stimulus parameter (for example, tone frequency or modulation frequency of AM noise). Only a few recent studies (for example, Yin et al., 2020; Town et al., 2018) have explored animal behavior with “non-compact” stimulus categories. Thus, we consider our task a significant step towards more naturalistic tasks.

      We were also faced with the practical factor of the trainability of guinea pigs (GPs). Prior to this study, guinea pigs have been trained using classical conditioning and aversive reinforcement on detecting tone frequency (e.g., Heffner et al., 1971; Edeline et al., 1993). More recently, competitive training paradigms have been developed for appetitive conditioning, using a single “footstep” sound as a target stimulus and manipulated sounds as non-target stimuli (Ojima and Horikawa, 2016). But as GPs had never been trained on more complex tasks before our study, we started with a conservative one vs. one categorization task. We mention this in the Discussion section of the revised manuscript (page 27, line 665).

      To determine whether these results hold for more complex tasks as well, after receiving the reviews of the original manuscript, we trained two GPs (that were originally trained and tested on the wheeks vs. whines task) further on a wheeks vs. many (whines, purrs, chuts) task. As earlier, we tested these GPs with new exemplars and verified that they generalized. In the figure below, the average performance of the two GPs on the regular (training) stimuli and novel (generalization) stimuli are shown in gray bars, and individual animal performances are shown as colored discs. The GPs achieved high performance for the novel stimuli, demonstrating generalization. We also implemented a 4-way WTA stage for a wheek vs. many model and verified that the model generalized to new stimuli as well.

      For frequency-shifted calls, these two GPs performed better for wheeks vs. many compared to the average for wheeks vs. whines shown in the main manuscript. The 4-way WTA model closely tracked GP behavioral trends.

      The psychometric curves for wheeks vs. many categorization in noise (different SNRs) did not differ substantially from the wheeks vs. whines task.

      We focused our one vs. many training on the two conditions that showed the greatest modulation in the one vs. one tasks. However, these preliminary results suggest that the one vs. one results presented in the manuscript are likely to extend to more complex classification tasks as well. We chose not to include these new data in the revised manuscript because we performed these experiments on only 2 animals, which were previously trained on a wheeks vs. whines task. In future studies, we plan to directly train animals on one vs. many tasks.

      Secondly, the predictions of the model do not appear to be quite as strong as the abstract and text suggest.

      We now replace subjective descriptors with actual effect size numbers to avoid overstating results. We also include additional modeling (classification based on the long-term spectrum) and discuss alternative possibilities to provide readers with points of comparison. Thus, readers can form their own opinions of the strengths of the observed effects.

      The model uses "maximally informative features" found by randomly initialising 1500 possible features and selecting the 20 most informative (in an information-theoretic sense). This is a really interesting approach to take compared to directly optimising some function to maximise performance at a task, or training a deep neural network. It is suggestive of a plausible biological approach and may serve to avoid overfitting the data. In a machine learning sense, it may be acting as a sort of regulariser to avoid overfitting and improve generalisation. The 'features' used are basically spectro-temporal patterns that are matched by sliding a crosscorrelator over the signal and thresholding, which is straightforward and interpretable.
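
      The feature-matching step described above (sliding a cross-correlator over the signal and thresholding) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy spectrogram, and the threshold value are all hypothetical, and the template is assumed to span the same frequency channels as the spectrogram rows passed in.

```python
import math


def normalized_correlation(a, b):
    """Normalized (Pearson-style) correlation between two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    # A flat window or template carries no pattern, so treat it as no match.
    return 0.0 if denom == 0 else sum(p * q for p, q in zip(da, db)) / denom


def feature_detected(spectrogram, template, threshold):
    """Slide `template` along the time axis of `spectrogram`.

    Both are lists of rows (frequency channels) with equal row counts.
    Returns (detected, peak), where `detected` is True when the peak
    correlation reaches `threshold` (a hypothetical per-feature value).
    """
    n_time = len(spectrogram[0])
    width = len(template[0])
    flat_template = [v for row in template for v in row]
    peak = -1.0
    for start in range(n_time - width + 1):
        window = [v for row in spectrogram for v in row[start:start + width]]
        peak = max(peak, normalized_correlation(window, flat_template))
    return peak >= threshold, peak


# Toy example: the template pattern is embedded at time index 2.
template = [[0, 1, 0], [1, 0, 1]]
spectrogram = [[0, 0, 0, 1, 0, 0], [0, 0, 1, 0, 1, 0]]
detected, peak = feature_detected(spectrogram, template, threshold=0.9)
```

      Because detection depends only on the peak correlation, the feature can fire wherever the pattern occurs in the call, which is what makes the approach straightforward to interpret.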

      This intuition is indeed accurate – the greedy search algorithm (described in the original vision paper by Ullman et al., 2002) sequentially adds features that add the most hits and the least false alarms compared to existing members of the MIF set to the final MIF set. The latter criterion (least false alarms) essentially guards against over-fitting for hits alone. A second factor is the intermediate size and complexity of MIFs. When MIFs are too large, there is certainly overfitting to the training exemplars, and the model does not generalize well (Liu et al., 2019).
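
      The greedy selection logic can be illustrated with a simplified sketch. It assumes each candidate feature is summarized by the set of within-category calls it detects (hits) and the set of out-of-category calls it detects (false alarms), and it scores a candidate by new hits minus new false alarms. That score is a stand-in chosen for illustration; the published algorithm uses an information-theoretic merit, and all names below are hypothetical.

```python
def greedy_feature_selection(candidates, max_features):
    """Greedily pick features that add hits while adding few false alarms.

    candidates: dict mapping feature name -> (hit_ids, false_alarm_ids),
                each a set of call identifiers.
    Returns the list of selected feature names in selection order.
    """
    selected = []
    covered_hits, covered_fas = set(), set()
    remaining = dict(candidates)

    def gain(name):
        hits, fas = remaining[name]
        # Reward calls not yet detected; penalize false alarms not yet incurred.
        return len(hits - covered_hits) - len(fas - covered_fas)

    while remaining and len(selected) < max_features:
        # Sorting gives deterministic tie-breaking by feature name.
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break  # no remaining feature improves the set
        hits, fas = remaining.pop(best)
        selected.append(best)
        covered_hits |= hits
        covered_fas |= fas
    return selected


# Toy candidates (made-up identifiers, not data from the study).
toy_candidates = {
    "A": ({1, 2, 3}, {10}),
    "B": ({1, 2}, set()),
    "C": ({4, 5}, {10, 11}),
    "D": ({3}, {12, 13}),
}
chosen = greedy_feature_selection(toy_candidates, max_features=3)
```

      In this toy case feature B is never chosen because it adds no hits beyond those already covered by A, and D is rejected because it adds only false alarms.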

      It is surprising and impressive that the model is able to classify the manipulated stimuli at all. However, I would slightly take issue with the statement that they match behaviour "to a remarkable degree". R^2 values between model and behaviour are 0.444, 0.674, 0.028, 0.011, 0.723, 0.468. For example, in figure 5 the lower R^2 value comes out because the model is not able to use as short segments as the guinea pigs (which the authors comment on in the results and discussion). In figure 6A (speeding up and slowing down the stimuli), the model does worse than the guinea pigs for faster stimuli and better for slower stimuli, which doesn't qualitatively match (not commented on by the authors). The authors state that the poor match is "likely because of random fluctuations in behavior (e.g., motivation) across conditions that are unrelated to stimulus parameters" but it's not clear why that would be the case for this experiment and not for others, and there is no evidence shown for it.

      Thank you for this feedback. There are two levels at which we addressed these comments in the revised manuscript.

      First, regarding the language – we have now replaced subjective descriptors with the statement that the model captures ~50% of the overall variance in behavioral data. The ~50% number is the average overall R2 between the model and data (0.6 and 0.37 for the chuts vs. purrs and wheeks vs. whines tasks, respectively). We leave it to readers to interpret this number.

      Second, our original manuscript lacked clarity on exactly what aspects of the categorization behavior we were attempting to model. As recent studies have suggested, categorization behavior can be decomposed into two steps – the acquisition of the knowledge of auditory categories, and the expression of this knowledge in an operant task (Kuchibhotla et al., 2019; Moore and Kuchibhotla, 2022). Our model solely addresses how knowledge regarding categories is acquired (through the detection of maximally informative features). Other than setting a 10% error in our winner-take-all stage, we did not attempt to systematically model any other cognitive-behavioral effects such as the effect of motivation and arousal. Thus, in the revised manuscript, we have included a paragraph at the top of the Results section that defines our intent more clearly (page 5, line 117). We conclude the initial description of the behavior by stating that these factors are not intended to be captured by the model (page 6, line 171). We also edited a paragraph in the Discussion section for clarity on this point (page 26, line 629).

      In figure 11, the authors compare the results of training their model with all classes, versus training only with the classes used in the task, and show that with the latter performance is worse and matches the experiment less well. This is a very interesting point, but it could just be the case that there is insufficient training data.

      This could indeed be the case, and we acknowledge this as a potential explanation in the revised manuscript (page 22, line 537; page 27, line 653). Our original thinking was that if GPs were also learning discriminative features only using our training exemplars, they would face a similar training data constraint as well. But despite this constraint, the model’s performance is above d’=1 for natural calls – both training and novel calls; it is only the similarity with behavior on the manipulated stimuli that is lower than the one vs. many model. This phenomenon warrants further investigation.

      Reviewer #2 (Public Review):

      Kar et al aim to further elucidate the main features representing call type categorization in guinea pigs. This paper presents a behavioral paradigm in which 8 guinea pigs (GPs) were trained in a call categorization task between pairs of call types (chuts vs purrs; wheek vs whines). The GPs successfully learned the task and are able to generalize to new exemplars. GPs were tested across pitch-shifted stimuli and stimuli with various temporal manipulations. Complementing this data is multivariate classifier data from a model trained to perform the same task. The classifier model is trained on auditory nerve outputs (not behavioral data) and reaches an accuracy metric comparable to that of the GPs. The authors argue that the model performance is similar to that of the GPs in the manipulated stimuli, therefore, suggesting that the 'mid-level features' that the model uses may be similar to those exploited by the GPs. The behavioral data is impressive: to my knowledge, there is scant previous behavioral data from GPs performing an auditory task beyond audiograms measured using aversive conditioning by Heffner et al. in 1970. [One exception that is notably omitted from the manuscript is Ojima and Horikawa 2016 (Frontiers)]. Given the popularity of GPs as a model of auditory neurophysiology these data open new avenues for investigation. This paper would be useful for neuroscientists using classifier models to simulate behavioral choice data in similar Go/No-Go experiments, especially in guinea pigs. The significance of the findings rests on the similarity (or not) of the model and GP performance as a validation of the 'intermediary features' approach for categorization.

      At the moment the study is underpowered for the statistical analysis the authors attempt to employ which frequently relies on non-significant p values for its conclusions; using a more sophisticated approach (a mixed effects model utilizing single trial responses) would provide a more rigorous test of the manipulations on behavior and allow a more complete assessment of the authors' conclusions.

      We thank the reviewer for their feedback and the suggestion for a more robust statistical approach. We have now replaced the repeated measures ANOVA based statistics for the behavior and model where more than 2 test conditions were presented (SNR, segment length, tempo shift, and frequency shift) with generalized linear models with a logit link function (logistic activation function). In these models, we predict the trial-by-trial behavioral or model outcome from predictors including stimulus type (Go or Nogo), parameter value (e.g., SNR value), parameter sign (e.g., positive or negative freq. shift), and animal ID as a random effect. To evaluate whether parameter value and sign had a significant contribution to the model, we compare this ‘full’ model against a null model that only has stimulus type as a predictor and animal ID as a random effect. These analyses are described in detail in the Materials and Methods section of the revised manuscript (page 36, line 930).
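
      One common way to compare a 'full' model against a nested null model is a likelihood-ratio test; the sketch below assumes that approach, which may differ from the comparison actually used in the revised manuscript. The log-likelihood values are made up, and the function names are hypothetical. With parameter value and parameter sign as the two extra predictors, the test has 2 degrees of freedom, for which the chi-square tail has a closed form.

```python
import math


def chi2_sf_even_df(stat, df):
    """Upper-tail probability of a chi-square variable for even `df`.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form implemented for positive even df only")
    half = stat / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(loglik_null, loglik_full, extra_params):
    """Return (test statistic, p-value) for nested models.

    extra_params: number of predictors in the full model that are
                  absent from the null model (must be even here).
    """
    stat = max(0.0, 2.0 * (loglik_full - loglik_null))
    return stat, chi2_sf_even_df(stat, extra_params)


# Illustrative (made-up) log-likelihoods for the null and full models.
stat, p_value = likelihood_ratio_test(-120.0, -115.0, extra_params=2)
```

      A small p-value here would indicate that adding parameter value and sign improves the fit beyond what stimulus type and animal ID alone provide.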

      These analyses reveal significant effects of segment length changes, and weak effects of tempo changes on behavior (as expected by the reviewer). Both the behavior and model showed similar statistical significance (except tempo shift for wheeks vs. whines) for whether performance was significantly affected by a given parameter.

      The behavioral data presented here are descriptive. The central conceptual conclusions of the manuscript are derived from the comparison between the model and behavioral data. For these comparisons, the p-value of statistical tests is not used. We realized that a description of how we compared model and behavioral data was not clear in the original manuscript. To compare behavioral data with the model, we fit a line to the d’ values obtained from the model plotted against the d’ values obtained from behavior, and computed the R2 value. We used the mean absolute error (MAE) to quantify the absolute deviation between model and behavior d’ values. Thus, high R2 values would signify a close correspondence between the model and behavior regardless of statistical significance of individual data points. We now clarify this in page 12, line 289. We derive R2 values for individual stimulus manipulations, as well as an overall R2 by pooling across all manipulations (presented in Fig. 11). This is now clarified in page 21, line 494.
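
      The model-versus-behavior comparison described above can be sketched as follows: fit a least-squares line to model d’ values plotted against behavioral d’ values, report R2, and separately report the mean absolute error. The d’ values below are invented for illustration, and the function names are hypothetical.

```python
def line_fit_r2(behavior_d, model_d):
    """R^2 of a least-squares line fit of model d' (y) on behavior d' (x)."""
    n = len(behavior_d)
    mean_x = sum(behavior_d) / n
    mean_y = sum(model_d) / n
    sxx = sum((x - mean_x) ** 2 for x in behavior_d)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(behavior_d, model_d))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(behavior_d, model_d))
    ss_tot = sum((y - mean_y) ** 2 for y in model_d)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(behavior_d, model_d):
    """Average absolute deviation between paired d' values."""
    return sum(abs(x - y) for x, y in zip(behavior_d, model_d)) / len(behavior_d)


# Made-up d' values: the model tracks behavior but is offset by 0.5.
behavior = [0.5, 1.0, 1.5, 2.0]
model = [1.0, 1.5, 2.0, 2.5]
r2 = line_fit_r2(behavior, model)
mae = mean_absolute_error(behavior, model)
```

      The toy numbers show why the two measures are complementary: a constant offset leaves R2 at its maximum while the MAE still registers the 0.5 deviation.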

      Reviewer #3 (Public Review):

      The authors designed a behavioral experiment based on a Go/ No-Go paradigm, to train guinea pigs on call categorization. They used two different pairs of call categories: chuts vs. purrs and wheeks vs. whines. During the training of the animals, it turned out that they change their behavioral strategies. Initially, they do not associate the auditory stimuli with rewards, and hence they overweight the No-Go behavior (low hit and false alarm rate). Subsequently, they learned the association between auditory stimuli and reward, leading to overweighting the Go behavior (high hit and false alarm rates). Finally, they learn to discriminate between the two call categories and show the corresponding behaviors, i.e. suppress the Go behavior for No-go stimuli (improved discrimination performance due to stable hit rates but lower false alarm rates).

      In order to derive a mechanistic explanation of the observed behaviors, the authors implemented a computational feature-based model, with which they mirrored all animal experiments, and subsequently compared the resulting performances.

      Strengths:

      In order to construct their model, the authors identified several different sets of so-called MIFs (most informative features) for each call category, that were best suited to accomplish the categorization task. Overall, model performance was in general agreement with behavioral performance for both the chuts vs. purrs and wheeks vs. whines tasks, in a wide range of different scenarios.

      Different instances of their model, i.e. models using different of those sets of MIFs, performed equally well. In addition, the authors could show that guinea pigs and models can generalize to categorize new call exemplars very rapidly.

      The authors also tested the categorization performance of guinea pigs and models in a more realistic scenario, i.e. communication in noisy environments. They find that both, guinea pigs and the model exhibit similar categorization-in-noise thresholds.

      Additionally, the authors also investigated the effect of temporal stretching/compression of calls on categorization performance. Remarkably, this had virtually no negative effect on both, models and animals. And both performed equally well, even for time reversal. Finally, the authors tested the effect of pitch change on categorization performance, and found very similar effects in guinea pigs and models: discrimination performance crucially depends on pitch change, i.e. systematically decreases with the percentage of change.

      Weaknesses:

      While their computational model can explain certain aspects of call categorization after training, it cannot explain the time course of different behavioral strategies shown by the guinea pigs during learning/training.

      Thank you for bringing this up – in hindsight the original manuscript lacked clarity on exactly what aspects of the behavior we were trying to model. As recent studies have suggested, categorization behavior can be decomposed into two steps – the acquisition of the knowledge of auditory categories, and the expression of this knowledge in an operant task (Kuchibhotla et al., 2019; Moore and Kuchibhotla, 2022). Our model solely addresses how knowledge regarding categories is acquired (through the detection of maximally informative features). Other than setting a 10% error in our winner-take-all stage, we did not attempt to systematically model any other cognitive-behavioral effects such as the effect of motivation and arousal, or behavioral strategies. Thus, in the revised manuscript, we have included a paragraph at the top of the Results section that defines our intent more clearly (page 5, line 117). We conclude the initial description of the behavior by stating that these factors are not intended to be captured by the model (page 6, line 171). We also edited a paragraph in the Discussion section for clarity on this point (page 26, line 629).

      Furthermore, the model cannot account for the fact that short-duration segments of calls (50ms) already carry sufficient information for call categorization in the guinea pig experiment. Model performance, however, only plateaued after a 200 ms duration, which might be due to the fact that the MIFs were on average about 110 ms long.

      The segment-length data indeed demonstrate a deviation between the data and the model. As we had acknowledged in the original manuscript, this observation suggests further constraints (perhaps on feature length and/or bandwidth) that need to be imposed on the model to better match GP behavior. We originally did not perform this analysis because we wanted to demonstrate that a model with minimal assumptions and parameter tuning could capture aspects of GP behavior.

      We have now repeated the modeling by constraining the features to a duration of 75 ms (the lowest duration for which GPs show above-threshold performance). We found that the constrained MIF model better matched GP behavior on the segment-length task (R2 of 0.62 and 0.58 for the chuts vs. purrs and wheeks vs. whines tasks; with the model crossing d’=1 for 75 ms segments for most tested cases). The constrained MIF model maintained similarity to behavior for the other manipulations as well, and yielded higher overall R2 values (0.66 for chuts vs. purrs, 0.51 for wheeks vs. whines), thereby explaining an additional 10% of variance in GP behavior.
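
      One way to read the "additional 10%" figure is as the difference between the task-averaged overall R2 values of the constrained and original models. This averaging is an assumption made for illustration; the original overall values (0.6 and 0.37) are those quoted earlier in this response.

```latex
\bar{R}^2_{\text{original}} = \frac{0.60 + 0.37}{2} = 0.485, \qquad
\bar{R}^2_{\text{constrained}} = \frac{0.66 + 0.51}{2} = 0.585, \qquad
\bar{R}^2_{\text{constrained}} - \bar{R}^2_{\text{original}} = 0.10
```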

      In the revised manuscript, we included these results (page 28, line 699), and present results from the new analyses as Figure 11 – Figure Supplement 2.

      In the temporal stretching/compressing experiment, it remains unclear, if the corresponding MIF kernels used by the models were just stretched/compressed in a temporal direction to compensate for the changed auditory input. If so, the modelling results are trivial. Furthermore, in this case, the model provides no mechanistic explanation of the underlying neural processes. Similarly, in the pitch change experiment, if MIF kernels have been stretched/compressed in the pitch direction, the same drawback applies.

      We did not alter the MIFs in any way for the tests – the MIFs were purely derived by training the animal on natural calls. In learning to generalize over the variability in natural calls, the model also achieved the ability to generalize over some manipulated stimuli. The fact that the model tracks GP behavior is a key observation supporting our argument that GPs also learn MIF-like features to accomplish call categorization.

      We had mentioned in a few places that the model was only trained on natural calls. To add clarity, we have now included sentences in the time-compression and frequency-shifting results affirming that we did not manipulate the MIFs to match test stimuli. We also include a couple of sentences in the Discussion section’s first paragraph stating the above argument (page 26, line 615).

    1. Author Response

      Reviewer #1 (Public Review):

      The actual description of the methods does not allow the reader to evaluate the precision of two important processing steps. First, rCBF measures are supposed to be restricted to the cortex, but given the pCASL image spatial resolution, partial volume effects with white matter probably exist, especially in younger infants. Furthermore, segmenting tissues on the basis of anatomical images (especially T1-weighted) is complicated in the first postnatal year. As rCBF measurements are very different between grey and white matter, the performed procedure might impact the measures at each age, or even lead to a systematic bias on age-dependent changes. Second, the methodology and accuracy of the brain registration across infants are little detailed whereas it is a challenging aspect given the intense brain growth and folding, the changing contrast in T1w images at these ages, and the importance of this step to perform reliable voxelwise comparison across ages.

      We thank the reviewer for this comment. We have added more descriptions in the methods to address this comment. Briefly, each individual rCBF map was generated in individual space and calibrated by phase-contrast MRI to minimize individual variations in processing parameters such as the T1 of arterial blood (Aslan et al., 2010). Cortical segmentation was also conducted in individual space. Then the different types of images, including the rCBF map and the gray matter segmentation probability map in individual space, were normalized into the template space. An averaged gray matter probability map was generated after inter-subject normalization. After carefully testing multiple thresholds on the averaged gray matter probability map, we used a threshold of 40% probability, which minimized the contamination of white matter and CSF while keeping the continuity of the cortical gray matter mask across the cerebral cortex, to generate the binary gray matter mask shown in the left panel of Figure R1 below. Although contrast and cortical segmentation in the T1-weighted images of younger infants are poor, as rightfully pointed out by the reviewer, this was compensated for by the averaged cortical mask and by measuring rCBF in the template space. As demonstrated in the right three panels of Figure R1, the rCBF measure in the cortical mask in the template space is consistent across ages, allowing accurate and reliable voxelwise comparison across ages.
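
      The masking step can be sketched in a few lines: average the normalized gray matter probability maps across subjects, then binarize at 40%. The sketch treats each map as a flat list of voxel probabilities already in template space; the function names and toy values are hypothetical, and whether the threshold is inclusive is an assumption.

```python
def average_probability_map(subject_maps):
    """Voxelwise mean across subjects; each map is a flat list of probabilities."""
    n_subjects = len(subject_maps)
    return [sum(voxel_values) / n_subjects for voxel_values in zip(*subject_maps)]


def binary_mask(probability_map, threshold=0.4):
    """1 where the averaged gray matter probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probability_map]


# Toy data: three subjects, three voxels (values invented for illustration).
subject_maps = [
    [0.9, 0.2, 0.6],
    [0.7, 0.1, 0.5],
    [0.8, 0.3, 0.7],
]
averaged = average_probability_map(subject_maps)
mask = binary_mask(averaged, threshold=0.4)
```

      Because the mask is derived from the group average rather than from any single infant's segmentation, a poor segmentation in one young infant has limited influence on which voxels are included.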

      Figure R1. The gray matter mask and segmented cortical mask overlaid on the rCBF map of three representative infants aged 3, 6, and 20 months in the template space. The gray matter mask in the left panel was created to minimize the contamination of white matter and CSF while keeping the continuity of the cortical gray matter mask across the cerebral cortex. The contour of the gray matter mask is highlighted with a blue line.

      The authors achieved their aim in showing that the rCBF increase differs across brain regions (the DMN showing intense changes compared to the visual and sensorimotor networks). Nevertheless, an analysis of covariance (instead of an ANOVA) including the infants' age as covariate (in addition to the brain region) would have allowed them to evaluate the interaction between age and region (i.e. different slopes of age-related changes across regions) in a more rigorous manner. Regarding the evaluation of the coupling between physiological (rCBF) and functional connectivity measures, the results only partly support the authors' conclusion. Actually, both measures strongly depend on the infants' age, as the authors highlight in the first parts of the study. Thus, considering this common age dependency would be required to show that the physiological and connectivity measurements are specifically related and that there is indeed a coupling.

      We thank the reviewer for this comment. Following the reviewer’s suggestion, we conducted an analysis of covariance (ANCOVA) and found significant interaction between regions and age (F(6, 322) = 2.45, p < 0.05) with age as a covariate. This ANCOVA result is consistent with Figure 3c showing differential rCBF increase rates across brain regions. The ANCOVA result was added in the last paragraph in the Results section “Faster rCBF increases in the DMN hub regions during infant brain development”.

      Regarding the evaluation of the coupling between physiological (rCBF) and functional connectivity (FC) measures, Figure 5 and Figure 5–figure supplements 1 and 2 were generated specifically to test whether the FC-rCBF coupling localized in the DMN is due to mutual age dependency. Briefly, Figure 5B demonstrates significant correlation clustered only in the DMN regions, using the correlation method shown in Figure 5–figure supplement 1. Furthermore, nonparametric permutation tests with 10,000 permutations were conducted. Such permutation tests are sensitive and effective, with Figure 5c revealing significant coupling only in the DMN regions. If the coupling were due to mutual age dependency, Figure 5c would show significant coupling in the Vis and SM network regions too.
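
      A nonparametric permutation test for a correlation can be sketched as below. The exact statistic and shuffling scheme used in the manuscript are not specified in this response, so this is a generic illustration: one variable is shuffled relative to the other, and the p-value is the fraction of shuffles whose absolute correlation is at least as large as the observed one. Function names and data are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, n_permutations=10000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Toy data with a perfect linear relationship (invented for illustration).
x_values = list(range(10))
y_values = [2 * v + 1 for v in x_values]
p = permutation_pvalue(x_values, y_values, n_permutations=999, seed=1)
```

      With 10,000 permutations the smallest attainable p-value under this estimator is 1/10,001, which sets the resolution of the test.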

    1. Author Response

      Reviewer #1 (Public Review):

      In this work, Maxime R. and co-authors intended to investigate the consequence of dystrophin absence/alteration in myoblasts, the effector cells of muscle growth and regeneration, and the early role of such cells in the pathogenesis of the disease. They carried out a transcriptomic analysis, comparing transcripts expressed by dystrophic myoblasts isolated from two murine models of DMD (Dmdmdx and Dmdmdx-βgeo) and control healthy mice. The expression of a large number of genes, comprising key regulator of myogenic differentiation (Myod1, Myog, Pax3 etc.) resulted affected in comparison to control in both mouse lines.

      We believe that the novelty and importance of these results lie in demonstrating for the first time that the loss of full-length dystrophin expression is both necessary and sufficient to trigger molecular and functional abnormalities in myoblasts. The fundamental point is that, contrary to the prevailing belief, the dystrophin function may not be just to provide sarcolemma stability in myofibers but rather that there is a disease continuum: DMD defects in satellite cells (Dumont et al., 2015, Ref 45), cause myoblast dysfunctions diminishing muscle regeneration (this work), and also impairing myofiber differentiation (Shoji et al., Ref 4), with the resulting fibre being unstable and therefore degenerating. These data can better explain all the symptoms of dystrophic muscle pathology, where abnormalities in satellite cells, myoblasts and myofibers form the pathological vicious cycle. Moreover, we identify the key trigger behind these abnormalities in dystrophic myoblasts, which is MyoD downregulation. Furthermore, we demonstrate that the additional loss of short dystrophin isoforms, although these are expressed in myoblasts, does not exacerbate the phenotype. This latter finding is very important given the near complete lack of understanding of the pathology in dystrophin-null patients.

      Authors highlighted similar gene expression modifications also in a myoblast cell line previously established from the mdx mouse.

      Analogous alterations found in both primary myoblasts and in the established myoblast cell line demonstrate that this change is cell-autonomous and not evoked by the external factors in the dystrophic niche, e.g. inflammatory mediators. This also shows that the dystrophic phenotype resists the transcriptomic drift as it is maintained through numerous passages. This approach was praised later on in the review.

      To assess the outcomes of the gene ontology analysis, which pointed to the alteration of muscle system and regulation of muscle system processes, the authors evaluated the proliferative, chemotactic and differentiative capacities of dystrophic myoblasts. Myoblasts presented increased proliferation, reduced chemotaxis and, quite surprisingly in light of the transcriptomic data, improved differentiation capacity.

      The key pathways (proliferation, migration and differentiation) that are essential for myoblasts to evoke muscle regeneration were confirmed to be altered in functional analyses, thus proving these transcriptomic alterations to be functional and biologically relevant. Our data showing accelerated differentiation in mdx myoblasts fully agree with findings by others, both in primary cultures and in isolated myofibers (Yablonka-Reuveni & Anderson, 2005, Ref 22).

      Finally, Maxime R. and co-authors carried out a transcriptomic analysis in myoblasts from human DMD subjects. Even though the profile of altered gene expression was similar and the GO studies seemed to point to the same pathway categories, a significant divergence was observed, particularly at the level of individual gene expression.

      Given that myoblasts from individual DMD patients present heterogeneous phenotypes (Choi et al., 2016), such divergence at the level of individual gene expression between mouse and human is to be expected. Nevertheless, these changes become convergent in altered GO categories and pathways. In the revised manuscript we have included additional genome-scale metabolic analysis in human DMD myoblasts. This revealed significant alteration in specific metabolic pathways. These are consistent with the metabolic alterations found previously in dystrophic muscle and brain, thus confirming the commonality of dystrophic defects found here in myoblasts and described before in dystrophic tissues. Moreover, this analysis is an additional proof that DMD myoblasts are significantly altered when compared to healthy cells.

      Authors link transcriptomic abnormalities and functional changes in proliferation, chemotaxis and differentiation of the dystrophic myoblasts with the alterations (probably epigenetic changes) occurring in satellite cells of dystrophic mice, consequent to the absence of the dystrophin protein. Such modifications in gene expression are supposed to be inherited by pathological myoblasts due to the division of the SC that is no longer asymmetric as occurring in healthy tissue.

      Strengths

      Transcriptomic data from samples of different sources are solid and rigorous statistical analyses have been carried out.

      Transcriptomic and functional data from primary proliferating myoblasts of the two mouse models and from the myoblast cell line are similar. This is convincing evidence that the transcriptomic alterations observed in primary myoblasts are not influenced by exposure to the niche environment present in the dystrophic muscle, but are cell-autonomous.

      Authors adopted a 3D culture for the functional analysis concerning myoblasts differentiations, in this way better mimicking the process occurring in vivo.

      Weaknesses

      The mdx mouse phenotype is mild in comparison to the severe symptoms and the rapid disease progression experienced by most human DMD subjects. Mdx mice are characterized by cycles of degeneration/regeneration initiating around the age of 6 weeks and continuing for several weeks. It was expected that the authors would discuss this point in detail, also considering that the animals used in this study were 8 weeks old.

      The mdx mouse has a mutation resulting in the loss of full-length dystrophin expression, which reflects the molecular defect affecting the majority of DMD patients. Therefore, mdx is the most commonly used pre-clinical model in DMD studies. The intensity of myonecrosis during this active degeneration and regeneration period (starting at 12 days and not at 6 weeks) is as aggressive as in patients. In fact, it has been suggested that the intensity of myonecrosis seen in mdx mice would be lethal to DMD patients (Duddy et al., 2015). The difference between human and mdx mouse pathology is that, starting at 10 weeks of age, the fibre replacement in mdx leg muscles reduces gradually, due to an unknown mechanism. Therefore, we isolated myoblasts at 8 weeks, when mdx replicates the human pathology. To emphasise the relevance of our findings for the human pathology, we discuss this point in detail in the revised manuscript.

      Furthermore, transcriptomic analysis of the human DMD myoblasts highlighted many differences as well as similarities when compared to mouse samples. Why not focus more on this aspect? According to the authors, dystrophic abnormalities in myoblasts manifest irrespective of differences in genetic backgrounds and across species. The latter is a strong statement that should have been supported at least by functional data regarding chemotaxis, proliferation and differentiation of human DMD myoblasts.

      What we meant by “dystrophic abnormalities in myoblasts manifest irrespective of differences in genetic backgrounds and across species” is that the lack of full-length dystrophin expression results in identical molecular defects in mouse and human primary myoblasts and also in the dystrophic cell line, despite numerous gene expression alterations triggered by the long-term culture in the latter. We agree that linking the functional alterations in human dystrophic myoblasts to the transcriptomic alterations that we identified is important. Indeed, altered proliferation, migration and differentiation of human DMD myoblasts have been described before (Witkowski and Dubovitz, 1985; Nesmith et al., 2016; Sun et al., 2020). In fact, these previous findings, which were never fully investigated, prompted us to undertake this study. Thus, our data provide a molecular underpinning for these abnormalities. In the revised manuscript we have elaborated on the existing functional data supporting alterations in human myoblasts.

      Further functional analyses will be needed to understand their consequences. It would require investigation of numerous parameters, including significant alterations in metabolic pathways, which we identified and described in the revised version of this manuscript. Given the aforementioned individual variability in patients’ population demonstrated by heterogeneous phenotypes in myoblasts, such functional analyses would need to involve a significant number of probands.

      Therefore, a detailed study in a sufficiently large cohort of DMD myoblasts is a logical next step from the identification of specific pathway alterations described here. But it is an extensive new project beyond our immediate capability.

      In the discussion, the authors suggest two possible mechanisms as responsible for alterations in the behavior of the SC that ultimately affect the functionality of myoblasts, an RNA-mediated pathological process or an alteration in epigenetic regulation. They consider the latter mechanism more likely. This is based in particular on transcriptomic data showing the downregulation of important genes involved in histone modifications, normally linked to transcriptional activation. They also reported from the literature that HDAC inhibitors upregulate MyoD, a gene that is effectively downregulated in this study. Since the authors postulate that the epigenetic dysregulation of Myod1 expression is responsible for the pathological cascade of gene downregulation, ultimately leading to the pathological phenotype, it would have been interesting to evaluate the impact of HDACi on this pathways or the overexpression of enzymes responsible for H3K4 methylation as Smid1 (downregulated in this study).

      We have presented several hypotheses regarding the mechanism by which loss of full-length dystrophin expression could affect myoblasts, including a restricted spatio-temporal requirement for small amounts of full-length dystrophin and an RNA-based mechanism. The notion that epigenetic dysregulation of Myod1 expression causes a pathological cascade of transcriptional downregulation of genes controlled by MyoD was based on our finding that transcripts downregulated in dystrophic myoblasts exhibit overrepresentation of MyoD binding sites. We discussed this as a likely mechanism, supported by a body of literature on the known alterations of epigenetic regulation found in DMD (fifteen papers in total). We also offered a hypothesis that, since treatment of mdx mice with histone deacetylase inhibitors (HDACi) promoted myogenesis (Saccone et al., 2014) and HDACi upregulate Myod1 (Mal et al., 2001), HDACi could increase myogenesis by counteracting the changes we found in dystrophic myoblasts. However, while evaluating the impact of HDACi or of the overexpression of enzymes responsible for H3K4 methylation would prove or disprove this one working hypothesis from the Discussion, it would in no way alter the key discovery of this study, which is that loss of full-length dystrophin expression results in major cell-autonomous abnormalities in proliferating myoblasts. Thus, if preferred, this Discussion paragraph could be shortened so as not to distract the reader from the main findings of this manuscript.

      Reviewer #2 (Public Review):

      This study is one of many that explore various abnormalities in the mononuclear myogenic cell compartments in DMD. Although the aim has been extensively investigated in the last several decades, it is still relevant.

      It is correct that abnormalities of proliferation, migration and differentiation in dystrophic myogenic cells have been reported over decades, but these were not followed up and often disregarded. Certainly, their causative link to DMD mutations and their consequences for the pathology were never investigated. Our study is the first to provide the comprehensive molecular underpinning for these abnormalities, demonstrating that the loss of full-length dystrophin expression directly and significantly affects myoblasts.

      The biggest limitation of this study is that it relies on the RNAseq analyses of extensively cultured myoblasts. While the computation analyses are profound, the study lacks any mechanistic explanation for the relevance of the transcriptional differences seen in the DMD myoblasts.

      We are not sure where this opinion originated. In fact, we used freshly isolated primary myoblasts in the RNAseq experiments and then confirmed the key alterations functionally in primary myoblasts freshly isolated from two strains of DMD mice. Furthermore, we performed mechanistic analyses in which we linked process alterations to functional defects, focussing on proliferation, migration and differentiation as processes known to impact the DMD pathology.

      In an approach considered one of the strengths of our work by the other Reviewer, these findings in primary myoblasts were then reproduced in a myoblast cell line, to demonstrate that the alterations observed are not evoked by exposure to the niche environment present in the dystrophic muscle, but are cell-autonomous. Importantly, DMD mutant cells show these alterations despite being extensively cultured in vitro, demonstrating the expressivity of this mutation. Finally, the alterations were confirmed in human primary myoblasts.

      Cell purity, the myogenic status of the cells, passage number, and the period that cells were in culture are not well described. This study's cell isolation method allows contamination with non-myogenic cells that can significantly influence the RNAseq analyses. Immunostaining for myogenic markers, for example MyoD, would indicate the purity of the cell culture. Extensive culturing of primary myoblasts promotes clonal selection and introduces numerous molecular alterations; thus, the passage number and duration of the culture are significant factors. It appears that some assays were conducted with cells at high passage, for example the myogenic differentiation assay, which required one million cells for each pellet. Maybe that is the reason for the low differentiation rate presented in Sup. Fig 2.

      Cell homogeneity across genotypes was fully confirmed by sample-based hierarchical clustering, which clearly segregated transcripts into groups corresponding to genotypes. Furthermore, the same alterations were found in the corresponding myoblast cell lines, whose purity and myogenic potential were demonstrated previously (Onopiuk et al., 2015). Therefore, varying contamination with non-myogenic cells could not have significantly influenced these results. However, for completeness, in the revised manuscript (Supplementary Figure 8) we describe cell characterisation using MyoD as a marker, showing that the well-established myoblast isolation procedure we used produces pure myoblast cultures.

      As for the differentiation assay, isolated myoblasts were never passaged extensively (one passage only) but sufficient numbers were obtained through the efficient isolation. Moreover, cells from every genotype were maintained and treated identically. Therefore, under these given conditions, any differences were the result of the DMD gene mutation and not culturing.

      It is hard to explain how DMD myoblasts differentiate better than the WT controls if they have a suppressed myogenic program in the proliferation stage. Even at day 0 of differentiation, DMD myoblasts differentiated better according to the RT-qPCR presented in Figure 5c. Additionally, it is unusual that the differentiation markers Myog and Myh1 peaked at day 2 of differentiation in the WT myoblasts.

      In fact, our data fully agree with findings by others that mdx cells display accelerated differentiation both in primary cultures and in isolated myofibers (Yablonka-Reuveni & Anderson, 2005). Our team recently demonstrated that DMD mutations evoke marked transcriptome and miRNome dysregulations early in human muscle cell development (Mournetas et al., 2021). Expression of key coordinators of muscle differentiation was dysregulated in proliferating dystrophic myoblasts, the differentiation of which was subsequently found to be altered, in line with the mouse cells studied here. Clearly, further studies into the mechanisms of this and the numerous other alterations we describe here are urgently needed, as these may uncover new therapeutic targets.

      As to whether it is unusual for these differentiation markers to peak at that time, we cannot comment, as no reference for this statement was given and the expressions can vary depending on the experimental conditions used – in our case the 3D culture could make the difference. Yet, again, cells from every genotype were maintained and treated identically and so any differences reflect the impact of the DMD mutation.

    1. Author Response

      Reviewer #1 (Public Review):

      The current manuscript examined patients with inborn errors of immunity (IEI) using whole exome sequencing (WES) and identified de novo variants (DNVs) associated with the disease. They found 14 genes associated with DNVs, including four novel genes - PSMB10, DDX1, KMT2C, and FBXW11, and conducted a systematic assessment of affected genes.

      Given the level of heterogeneity underlying IEI, the sample size is limited. Although the authors clearly stated this, the analysis of the current manuscript does not add much value to describing genes affected by DNVs. The sample size is small to perform exome-wide evaluation (authors described they did "exome-wide evaluation" in Abstract - line 10 but there is no statistical evaluation to prioritize effect genes). They could go with systems biology approaches, explaining the biological pathway of affected genes or underlying cell types from immune single-cell datasets. As the authors stated that IEI constitutes a large group of heterogeneous disorders, there should be some analysis to explain the functional convergence of affected genes in disease development.

      We believe the term ‘exome-wide evaluation’ might have led to misinterpretation. We used it in the context of reviewing each DNV found in a single patient’s exome outside the diagnostic IEI gene panel (i.e. ‘exome-wide’), instead of reviewing DNVs across all exomes. We have rephrased the sentences containing this term. The main purpose of this manuscript was to identify ‘all’ coding DNVs in each case, and explore whether they include any pathogenic or novel candidate DNVs. Our main purpose was to urge the IEI field to apply trio-based WES more systematically, and share candidate DNVs with the field for further validation.

      As the reviewer points out, our sample size is too limited to apply systems biology approaches for variant prioritization. The signal-to-noise ratio would be very low, because many genes causing inborn errors of immunity remain to be discovered and the studied group of patients with inborn errors of immunity is very heterogeneous. This means that we would not have the power to investigate potential enrichment or burden of DNVs in specific genes, nor the functional convergence of affected genes or pathways in specific phenotypes. In this study, we aimed to show the additional value of systematic DNV analysis as a method to identify and prioritize candidate variants in individual cases, but ideally we would like to answer other important research questions using computational/statistical approaches in a larger cohort in the future, as has been done in other rare disease fields. The suggestion of the reviewer is helpful, and this approach has been shown to implicate novel pathways enriched in disease for various forms of neurodevelopmental diseases for which tens of thousands of trio-based WES analyses have been performed [9, 10].

      For DNV identification, the authors filtered out variants with ExAC & gnomAD AF > 0.1% or GoNL AF > 0.5%. I think this is too lenient a cutoff for filtering for DNVs. For example, a gnomAD AF of 0.1% corresponds to roughly 200 individuals in the population. Given the filtering parameters (<5 variation reads, <20% variant allele frequency, or low coverage DNVs), they did not use specific filtering metrics to find DNVs and there might be false-positive variants in the final DNV set. As far as I can find in the manuscript, they used the GATK pipeline from the previous study (REF 29). The GATK UnifiedGenotyper generates a range of filtering metrics to increase specificity in variant filtering. It is very surprising that the authors seem to use only three parameters (variation reads → FORMAT:AD[1]; variant allele frequency → FORMAT:AB? and low coverage → FORMAT:DP? but the authors did not state the cutoff) to filter de novo variants, which leaves the call set vulnerable to false-positive variant calls.
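
      The "roughly 200 individuals" figure can be checked with a short calculation. The sketch below assumes Hardy-Weinberg proportions and a hypothetical reference panel of 100,000 individuals; neither the panel size nor the function name comes from the manuscript or the review.

```python
def expected_carriers(n_individuals, allele_frequency):
    """Expected number of individuals carrying at least one copy of an allele.

    Assumes Hardy-Weinberg proportions: a diploid individual is a non-carrier
    with probability (1 - p)^2.
    """
    non_carrier_prob = (1.0 - allele_frequency) ** 2
    return n_individuals * (1.0 - non_carrier_prob)


# Hypothetical panel size, chosen only to illustrate the order of magnitude.
PANEL_SIZE = 100_000
AF_CUTOFF = 0.001  # 0.1%, the ExAC/gnomAD cut-off quoted in the review

carriers = expected_carriers(PANEL_SIZE, AF_CUTOFF)
print(f"Expected carriers at AF {AF_CUTOFF:.1%}: {carriers:.0f}")
```

      Under these assumptions a variant sitting exactly at the cut-off would be carried by about 200 people in the panel, which is the scale the reviewer refers to.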

      The chosen population-database allele-frequency cut-offs align with DNV filtering strategies in the literature. We have not chosen a stricter cut-off, in order to avoid missing true positives, since patients with IEI can exhibit late-onset disease, variable penetrance and postzygotic mutations; at the same time we have limited the chance of false-positive findings. For instance, we have reduced local false positives by filtering on allele frequencies in our in-house database and the Dutch population database. Moreover, automated DNV calling required >2% alternate reads in either parent and variants were prioritized based on prediction scores and annotated immune function. Additionally, and in accordance with this expert reviewer, we have now put a stricter cut-off in place for variation reads (from 5 to 10) to further minimize false-positive findings. Lastly, we visually inspected the final 14 candidate DNVs in IGV and/or Alamut, which supports the validity of the findings. The DNVs reported in our final DNV list (Table 2B) are therefore unlikely to contain false-positive findings.
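
      The thresholds discussed in this exchange can be written down as a simple per-variant filter. This is a minimal sketch rather than the authors' pipeline: the `Variant` record and its field names are hypothetical, and the coverage, in-house database and parental-read criteria are left out because their exact cut-offs are not given here.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative only.
    gnomad_af: float   # ExAC/gnomAD allele frequency, as a fraction
    gonl_af: float     # GoNL allele frequency, as a fraction
    alt_reads: int     # reads supporting the variant allele in the child
    total_reads: int   # total reads at the position in the child


MAX_GNOMAD_AF = 0.001  # 0.1%
MAX_GONL_AF = 0.005    # 0.5%
MIN_ALT_READS = 10     # revised cut-off (previously 5)
MIN_VAF = 0.20         # minimum variant allele fraction


def passes_dnv_filters(v):
    """Return True if the candidate survives the frequency and read filters."""
    if v.gnomad_af > MAX_GNOMAD_AF or v.gonl_af > MAX_GONL_AF:
        return False  # too common in a reference population
    if v.alt_reads < MIN_ALT_READS:
        return False  # too few variant reads
    if v.total_reads <= 0:
        return False  # no usable coverage
    return v.alt_reads / v.total_reads >= MIN_VAF


candidates = [
    Variant(0.0, 0.0, 12, 40),     # rare, well supported
    Variant(0.002, 0.0, 12, 40),   # above the gnomAD cut-off
    Variant(0.0, 0.0, 8, 40),      # fails the revised read cut-off
]
print([passes_dnv_filters(v) for v in candidates])
```

      Keeping each threshold as a named constant makes it easy to see how tightening a single cut-off, such as the move from 5 to 10 variant reads, changes which candidates survive.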

      Reviewer #2 (Public Review):

      The manuscript by Hebert et al., reports on the utility of TRIO-based whole-exome sequencing (WES) in patients who presented as sporadic cases and are suspected of having inborn errors of immunity (IEIs). The authors developed an in-house pipeline for data analysis and used a set of known algorithms to prioritize the impact of genetic variants located mostly in the coding region of proteins. The data analysis was done in two steps; the first step involved the routine WES diagnostic analysis that led to the identification of pathogenic (P) and likely pathogenic variants (LP) in genes already associated with IEIs. The authors claim that this analysis resulted in a likely molecular diagnosis in 19 (~15%) of patients, while an additional 14% of cases were carriers for VUSs or other risk factors in the disease causal genes. As many of these variants are either inherited from one parent or are present as heterozygous (monoallelic) variants in genes associated with recessive diseases, their clinical significance is unclear.

      In the second step, the authors focused on the identification of de novo variants (DNVs), including SNVs, CNVs, and small indels, since these variants are more likely to have a deleterious effect on protein function. The authors identified 136 non-synonymous DNVs, which were then filtered down to the 14 best candidate variants using various in silico tools and database searches. These 14 variants included DNVs in genes previously associated with autoinflammatory diseases, such as CAPS and RELA haploinsufficiency. Three patients were found to carry de novo copy number variants (CNVs) of unknown clinical significance. Finally, several de novo loss-of-function (LoF) variants were identified in genes that are not yet associated with any IEIs but are good functional candidates. Their potential pathogenicity is further supported by the observation that they are found in genes intolerant to loss of function. Functional validation has been performed only for the patient carrying the novel FBXW11 splice variant. The authors state that the maximum solve rate (i.e., probable molecular diagnosis) in this cohort might be as high as 23%, which is comparable to similar reports of patients with IEIs; however, the reported results do not yet support this conclusion.

      The main conclusion of this study is that TRIO-based WES analysis for DNVs could improve the diagnostic rate and can result in the identification of novel disease-causing genes. TRIO-based sequencing is also preferable when analyzing patients from populations underrepresented in gnomAD and ExAC. As the cost of WES has come down, WES has been increasingly used in the clinical diagnosis of many human disorders. Despite the major progress in the development of novel sequencing technologies and new in silico tools, the diagnostic rate is still below 50%. In summary, this study suggests that despite the identification of over 400 genes associated with IEIs, there are many more genes to be identified and that the heritability of these diseases is very complex.

      We thank the reviewer for the elaborate summary of our study and the suggestions that have helped to further improve the manuscript.

    1. Author Response

      Reviewer #1 (Public Review):

      This is a study that is aimed at understanding the binding mechanism of D-serine to the two different binding lobes of the NMDA receptor. D-serine is a known agonist and binder of the GluN1 ligand-binding domain, but its interaction with the GluN2A is unknown. Using long time-scale conventional molecular dynamics simulations, the researchers observe that D-serine interacts and associates readily with both binding domains, often via protein surface pathways referred to as a guided-diffusion mechanism. As observed previously, free-energy calculations show that D-serine stabilizes the closure of both binding domains. Finally, analysis of the effect of glycans shows that these modifications play a role in further stabilizing the closed state of the ligand-binding domains.

      Amongst this broad and careful analysis, the major finding from this work is that D-serine surprisingly associates with GluN2A, which has been known to bind glutamate to enable activation of the channel. Since the binding of D-serine to GluN2A had not been observed previously, they proposed that D-serine acts as an inhibitor for glutamate at high concentrations. This hypothesis was investigated and supported by electrophysiological experiments, yielding a novel result that presents new interpretations for the field. However, the guided-diffusion mechanism still remains hypothetical and is unclear as to whether this is in fact a driving force, or requirement, for the binding. Specifically, the following questions warrant further investigation:

      1) Specific or non-specific association? It is possible that non-specific association events of ligands to the protein could be an intrinsic artifact of the MD simulations. To investigate this, it would be informative to compare the current results with a negative control simulation where the ligand was replaced with a similar amino acid or molecule that has been verified as a non-binder for NMDAR.

      To address this, we quantified the non-specific association signal by comparing the number of successful binding events to random association (see response to Essential Revisions #4). In theory, any appropriately small amino acid could associate with the conserved arginine of each LBD through its C-terminus (as evidenced by our PMF of glycine bound to GluN2A). However, an amino acid’s ability to remain bound long enough to induce LBD closure is largely dependent on the presence of interactions with the LBD bottom lobe.

      2) Dissociation events? Further clarification is required to understand whether any dissociation events are observed in these simulations to the non-specific sites or the final binding site. If dissociation is not observed, how does this impact the interpretation of the binding mechanisms that characterize only the association events?

      Association and dissociation are both observed and documented in Datasets S2-S4. We added clarification to the text on page 5 about the nature of both processes and how pathways are defined by residues that allow the agonist to enter and leave the binding site. As illustrated in the clustering dendrograms, association (even-numbered events) and dissociation (odd-numbered events) pathways are present in all clusters.

      3) Testing the hypothesis of guided diffusion. It is proposed that guided diffusion drives serine binding to its site. This would imply that the residues on this path are important, and if mutated, would decrease the association rate and the ability to compete with glutamate. Additional electrophysiological experiments or direct binding experiments would be useful in understanding the relevance of guided diffusion in the ligand-binding mechanism of NMDARs.

      To address this point, we performed additional TEVC experiments generating D-serine dose-response curves for GluN1a Arg694Ala and Arg695Ala, and GluN2A Arg692Ala and Arg695Ala. The curves for both GluN2A mutants support our guided diffusion mechanism, as they lowered the D-serine inhibition potency. (These mutants likely also alter glutamate binding, but since D-serine and glutamate bind through the same residues, it is not possible to separate out individual contributions.) The GluN1a mutants did not show altered behavior, supporting the increased diffusiveness of D-serine binding to GluN1 compared to GluN2A. These additional findings are included in the main text on page 12 and in Fig. 4D.

      Reviewer #2 (Public Review):

      In this manuscript, Yovanno et al. did a comprehensive mechanistic study of D-serine binding to NMDAR ligand-binding domains (LBDs). The framework of the current investigation is built upon this research group's previous studies of the binding of the NMDAR agonists glutamate and glycine. Using an aggregated 51 microseconds of all-atom MD simulations of spontaneous binding, the authors applied rigorous pathway similarity analysis to cluster the paths through which D-serine enters the LBDs from the bulk solution. The most interesting and unexpected result from this study is the spontaneous binding of D-serine to the GluN2A LBD, which was previously known to be the glutamate binding site.

      By computing the overlap coefficient for all binding pathways, the authors concluded that D-serine binds to the GluN2A LBD through "guided" diffusion, but to GluN1 through random diffusion (the clustered pathways comprise random contacts rather than specific, conserved residue contacts). A "guided" binding pathway further suggests that agonist binding could be sensitive to conformational change within and around the binding pocket, and vice versa.
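
      To make the overlap coefficient concrete, the sketch below applies the standard Szymkiewicz-Simpson definition, |A ∩ B| / min(|A|, |B|), to two pathways. Treating each pathway as a set of contacted residues is an assumption made for illustration, and the residue labels are hypothetical; the manuscript's actual pathway representation may differ.

```python
def overlap_coefficient(pathway_a, pathway_b):
    """Szymkiewicz-Simpson overlap: shared elements over the smaller set size."""
    a, b = set(pathway_a), set(pathway_b)
    if not a or not b:
        return 0.0  # an empty pathway shares nothing by convention here
    return len(a & b) / min(len(a), len(b))


# Hypothetical residue contacts for three binding events.
event_1 = {"R1", "K2", "E3", "S4"}
event_2 = {"R1", "K2", "E3"}
event_3 = {"D7", "T8", "S4"}

print(overlap_coefficient(event_1, event_2))  # every contact of event_2 is shared
print(overlap_coefficient(event_1, event_3))  # one shared contact out of three
```

      A value near 1 means the shorter pathway is almost entirely contained in the longer one, which is the pattern expected when binding events reuse the same conserved contacts; values near 0 correspond to largely unrelated contact sets.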

      To investigate whether D-serine binding events are able to modulate the GluN2A LBD conformation, the authors then computed a series of LBD conformational free energy landscapes (2D-PMF) using 2D-umbrella sampling simulations. The 2D-PMF profiles confirmed that D-serine stabilizes the closed LBD conformation, just like glutamate. Because the D-serine 2D-PMF shows a metastable state that was absent in glutamate 2D-PMF, the authors argue that D-serine may not stabilize the closed conformation to the same extent as glutamate. Likewise, based on the 2D-PMF of GluN1 LBD, the authors suggest that D-serine has a higher potency than glycine, in part due to its ability to more strongly stabilize a closed LBD conformation.

      The simulations above generated the hypothesis that D-serine could function as a competitive antagonist of glutamate at high concentrations. This computationally derived hypothesis is beautifully tested by the authors' dose-response curves and the Schild plot.

      One question that would merit further clarification is whether the binding affinity of D-serine to the two LBDs is stronger or weaker in comparison with glutamate and glycine. The difference in agonist potency could be due to the difference in binding affinity and/or efficacy. Stabilizing the closed LBD conformation may indicate the efficacy of the agonist, but affinity (Kd) will still play a role in the final potency.

      Indeed, as Reviewer 2 pointed out, affinity should play a role, since the D-serine inhibition here is attributed to competitive binding of D-serine against glutamate, as we showed with our Schild plot. The bona fide binding site for D-serine is the GluN1 LBD, where D-serine binds more strongly than glycine (Furukawa/Gouaux 2003). In the GluN1 LBD, D-serine is a full agonist. D-serine binding to the GluN2A LBD (the finding here) is substantially weaker (mM range) than that of glutamate (~1 µM).
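
      For readers less familiar with Schild analysis, the sketch below shows the relationship it relies on. For a purely competitive antagonist, the agonist dose ratio is DR = 1 + [B]/K_B, so a plot of log10(DR - 1) against log10([B]) is a straight line with slope 1 that crosses zero at [B] = K_B. The K_B value and the concentrations are hypothetical, chosen only to sit in the millimolar range mentioned above; they are not the measured values.

```python
import math


def dose_ratio(antagonist_conc, k_b):
    """Gaddum relation for a competitive antagonist: DR = 1 + [B] / K_B."""
    return 1.0 + antagonist_conc / k_b


def schild_points(concentrations, k_b):
    """Return (log10[B], log10(DR - 1)) pairs for a Schild plot."""
    return [
        (math.log10(c), math.log10(dose_ratio(c, k_b) - 1.0))
        for c in concentrations
    ]


def least_squares_slope(points):
    """Ordinary least-squares slope of y on x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


K_B = 1e-3                     # hypothetical equilibrium constant, in molar
CONCS = [1e-4, 1e-3, 1e-2]     # hypothetical antagonist concentrations, in molar

points = schild_points(CONCS, K_B)
print(f"Schild slope: {least_squares_slope(points):.3f}")
```

      A fitted slope close to 1 in experimental data is the usual indication that inhibition is competitive, while systematic deviation from 1 suggests a more complex mechanism.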

      While a glycosylated GluN1/GluN2A dimer was used for the majority of MD simulations, the authors also checked the "reality" by mapping the pathway residues onto the NMDAR heterotetramer structure. The role of glycans in D-serine binding pathways was further investigated by conducting an additional 30 microseconds simulations of the non-glycosylated dimer. It was found that glycans introduced small kinetic "traps" that slow down the binding process. Glycan was also found to stabilize LBD closure from 1D-PMF profiles.

      The detailed mechanistic insight and D-serine's inhibitory effect on NMDAR, unraveled by this study, may play an important role in therapeutic strategies, and thus is likely to have a broad impact in the field.

    1. Author Response

      Reviewer #2 (Public Review):

      Dr Muktupavela et al. present a novel likelihood-based method for inferring the strength of natural selection and basic demographic parameters, such as mobility rates, from time-stamped ancient DNA data in a spatially explicit framework. This is an elegant method that is, in many ways, a natural extension to a spatial setting of previous work in the field that has focussed mainly on inferring natural selection from temporal data. In addition to the simplest scenarios of isotropic dispersal, the authors also consider models with different dispersal rates in longitudinal and latitudinal directions, as well as biased dispersal. Selection strength, dispersal rates and bias are assumed to be constant across space and piecewise constant in time (but it would be very straightforward to relax these assumptions). The bias component of the model is an interesting addition that, in principle, makes it possible to broadly account for the effect of long-range dispersals, such as the spread of agriculture across Europe from the Fertile Crescent and Bronze Age migrations from the Asian steppes, on the spatiotemporal pattern of allele frequencies.

      Although the main idea is clearly communicated, there is room for improvement of the manuscript regarding investigating the properties of the model and presenting the results. Notably, the authors assume that the age of mutation is known and correct in their assessment of the performance of the model on simulated data (which may inflate the reported accuracy of the reconstructions) and use estimates from the literature when the method is applied to empirical data. Although it is necessary to specify the age of the allele, this could easily have been treated as a free parameter in the framework. I would like to see a discussion of why the method may not be suitable for this, and a more systematic test for the sensitivity of the method to misspecification of the age (which could be very substantial, especially if the population history has been complex). In the cases where the model is run for different allele age estimates in the manuscript, such as for the lactase persistence scenario, the authors should present the (approximate maximum) likelihoods for the different scenarios in the text.

      An explanation as to why we do not infer the age of the allele (see text below) has been added to the main text under section “Parameter search” (lines 531-533). Briefly, we chose to construct our method in a way that uses the age of the allele as an input parameter rather than estimating it, since there are multiple equally possible solutions with various combinations of allele age and selection coefficient values. This is demonstrated in Appendix A3.

      We also added a description of log-likelihood values when we vary the allele ages under section “Robustness of parameters to the assumed age of the allele” in lines 324-329, the results of which are presented in supplementary Figure 6–Figure Supplement 9 and Figure 8–Figure Supplement 6.

      Briefly, we assessed the likelihood of the best fitted models by varying the ages of the rs4988235(T) and rs1042602(A) alleles. We can see that in the case of rs4988235(T) allele the allele age used in this study (7,441 years) gives the most likely solution among the explored ages. In the case of the rs1042602(A) allele, we found that there are multiple nearly equally likely ages when looking at ages at least as old as 15,000 years.

      A further weakness of the method is that it uses the Fisher information matrix to estimate uncertainty. While this works well if the posterior distribution is narrow, it can severely underestimate the uncertainty if this is not the case, in particular if the distribution is non-Gaussian in the tails. It would be better, but perhaps computationally prohibitively expensive, to report Bayesian posterior distributions for the parameters as well as Bayes factors that could be used to formally compare the fit of different models to the data.

      We agree with the reviewer that implementing Bayesian parameter fitting would likely provide a more robust understanding of the uncertainty of the estimates as well as an opportunity to formally compare different models using Bayes factors (although at the cost of an increase of computational intensity). Changing the inference engine of our method in this manner (while keeping it computationally feasible) is something we are currently investigating and hope to release as part of a future Bayesian version of our method. In the meantime, we have added a discussion of this caveat in our manuscript (sixth paragraph).
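
      For readers unfamiliar with the approximation under discussion, the standard textbook form of Fisher-information-based uncertainty is sketched below. This is a generic statement, not a transcription of the manuscript's equations; the symbols (log-likelihood, parameter vector, maximum-likelihood estimate) are assumptions introduced for illustration, and the manuscript may use the expected rather than the observed information.

```latex
% Generic sketch: \ell is the log-likelihood, \theta the parameter vector,
% \hat{\theta} the maximum-likelihood estimate (all symbols assumed here).
\[
  I(\hat{\theta})_{ij} = -\left.\frac{\partial^{2} \ell(\theta)}{\partial \theta_i \, \partial \theta_j}\right|_{\theta = \hat{\theta}},
  \qquad
  \widehat{\mathrm{Var}}(\hat{\theta}) \approx I(\hat{\theta})^{-1},
  \qquad
  \hat{\theta}_i \pm 1.96 \sqrt{\left[ I(\hat{\theta})^{-1} \right]_{ii}} .
\]
```

      The interval is symmetric and relies on the likelihood being close to quadratic around the estimate, which is why it can understate uncertainty when the likelihood surface is flat or heavy-tailed.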

      Finally, although the rationale behind the model is clearly described, the detailed descriptions of the model and the numerical implementation have some shortcomings. First, there are typos in the appendix where the continuous model is derived from a discrete approximation (the right-hand side of Eq. (8) should not contain the term p(x,y,t) for it to be consistent with Eqs. (9) and (10)). Second, any differential equation model is incomplete without specifying the boundary conditions. This is especially important here as the assumption of uniform diffusion and advection on the grid is violated by the constraints imposed by the land mask, where the population is assumed to vanish on water areas (suggesting an absorbing boundary condition). Further down in the methods, details are also missing on how Eq. (10) was solved numerically, merely that it was discretized at a certain resolution.

      Looking more closely at Eq. (8), we believe that the term p(x,y,t) should be there, since it is moved to the left-hand side of Eq. (9) by simple algebraic rearrangement of the terms of the equation.

    1. Author Response

      Evaluation Summary:

      1) The paper is well written, and its style/formatting are optimal. The baseline signature moderately predicted outcome, and the data after one cycle further improved the algorithm, though this decreases its utility as a pure predictive tool

      We thank the editor and the reviewers for their positive feedback regarding the style and formatting of the manuscript. We concur that longitudinal sampling of blood, before and after one cycle of treatment, renders the predictive signature marginally more laborious to generate. In an ideal setting, we would be able to solely generate a predictive signature based on baseline characteristics - unfortunately such a test does not yet exist.

      In this study, we propose adding an easily obtainable blood sample after the first cycle of treatment to significantly improve our ability to predict response. Due to the ease of sampling them, we believe that blood biopsies will be key as the search for predictive biomarkers expands. Since the inception of our study, there have been numerous impactful pieces of published literature assessing PBMCs, mainly in response to immune checkpoint blockade 1-6. Given that our risk signature is now validated in an immunotherapy trial (EACH trial NCT03494322), we are even more confident with our unique approach to longitudinal sampling to developing a predictive model to systemic therapy. The trial design of the validation study is now included as supplementary (Figure 2A) in the manuscript.

      2) Signatures were not prospectively validated on an independent cohort; the algorithm was developed around a first-line therapy that is no longer considered to be the standard of care for HNSCC; and, while most of the conclusions are supported by the data, some of the caveats (such as the lack of a validation cohort, key in predictive biomarker development), are not addressed.

      Thank you. We will address this comment in two parts – (a) with regards to the validation cohort part and (b) for the status of the EXTREME treatment regimen in the original cohort. In this revised version, we have validated our risk signature in an independent cohort of patients who received cetuximab and avelumab (anti-PD-L1) in a single-arm, phase 2 clinical trial setting. Beyond serving purely as a validation cohort, it also demonstrates the applicability of our model in predicting response to immune checkpoint blockade-based therapy in keeping with contemporary advances in systemic treatment for HNSCC. The risk signature strongly predicted response in the new independent cohort giving us more confidence in our model’s ability to predict outcome for systemic therapy regimens beyond cytotoxic chemotherapy and cetuximab. Figure 5B shows the strong correlation between the risk signature and disease outcome in the validation cohort (Kendall rank correlation, τ=0.725, p=0.0181).
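
      As an illustration of the statistic quoted above, the following is a minimal sketch of a Kendall rank correlation (tau-b, which corrects for ties). The response does not state which variant or software was used, so the function name, the tie handling, and the toy data are all assumptions made for illustration; none of the numbers below come from the trial.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences of numbers.

    Pairs tied in both variables are ignored; pairs tied in only one
    variable enter the denominator but not the numerator.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue
        elif dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt(
        (concordant + discordant + only_x_tied)
        * (concordant + discordant + only_y_tied)
    )
    if denom == 0:
        return float("nan")
    return (concordant - discordant) / denom


# Hypothetical example: risk scores and an ordinal outcome grade
# (higher = worse response). These values are invented.
risk_scores = [0.2, 0.5, 0.9, 1.4, 1.8]
outcome_grade = [1, 1, 2, 3, 3]
print(kendall_tau_b(risk_scores, outcome_grade))
```

      A value near 1 means that patients ranked higher by the signature also tend to rank higher on the outcome scale; with small cohorts the associated p-value is usually obtained from an exact or permutation test.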

      Secondly, the EXTREME regimen (platinum/5-FU/cetuximab) remains a first-line standard of care treatment in the UK and European countries for HNSCC patients with negative PD-L1 status (CPS score <1) which account for around 15% of all HNSCC patients 7. While the US Food and Drug Administration (FDA) approved pembrolizumab in combination with chemotherapy as first-line treatment regardless of PD-L1 expression and pembrolizumab alone for patients with PD-L1-expressing tumours (CPS ≥1), the European Medicines Agency (EMA) approved pembrolizumab with or without chemotherapy only for patients with a CPS ≥1, and this has been highlighted in the European Society for Medical Oncology (ESMO) and the UK National Institute for Health and Care Excellence (NICE) guidelines 8 and (https://www.nice.org.uk/guidance/ta661/chapter/1-Recommendations).

      Furthermore, chemotherapy with EXTREME regimen is standard of care for patients with contraindications to immune checkpoint inhibitors such as autoimmune disease 8. It can also be considered as second-line treatment in patients who only received pembrolizumab monotherapy in the first line setting.

      3) However the overall impact in the field of this work seems limited by a number of factors, including that the authors focused on immune cell subpopulations and exosomes, which narrows the scope (no cytokines or other biomarkers were included).

      Thank you. We selected a finite number of covariates based on a few factors – (a) published literature, (b) previous data generated by the group and (c) the applicability of the findings to the clinic. Instead of an exploratory article in which we could generate an infinite number of covariates by a technique similar to RNA sequencing, we opted for a select set of covariates. This hypothesis-driven approach generated a strong signature that is now validated across two trials. The focus on immune population is driven by our hypothesis that systemic changes in the PBMCs are indicative and reflective of the status of the intra-tumoral immune response. In the revised manuscript we used a custom immune focused imaging mass cytometry antibody panel to probe tissue sections from 9 patients. We now show that the key populations driving the predictive model in the periphery are not only reflected at the tumoral level, but these disparate immune cell subpopulations also interact. See Figure 6 in which we use a machine learning approach to segment cells and assign them to distinct immunological subpopulations. We found that the peripheral monocyte population strongly correlated with a tumoral macrophage population having a similar marker expression pattern. We found that the peripheral central memory CD8 T cells inversely correlated with tissue resident memory T cells. The tissue presence of both these cells correlated positively with outcome. Most strikingly, these two populations were most likely to co-localize with each other at the tissue level at a frequency of almost double the second highest co-localization. Data on the nature of the interplay between peripheral systemic immunity and intra-tumoral immunity is novel and rarely exists in the literature outside the scope of in-vivo animal models. Here we describe these interactions using human patient samples treated with a clinically relevant therapy.

      Given the limited amount of patient sera collected in the trial, we opted to perform exosome analysis on markers known to impact the response to the anti-EGFR/HER3 treatment/immune responses. This was in line with our lab’s work to use exosome FRET-FLIM as a surrogate for tissue FRET-FLIM, which we originally used to discover a potential dimer-dependent mechanism for anti-EGFR treatment resistance in neoadjuvant breast cancer patients 9, and more recently published on a colorectal patient sample cohort from the COIN study 10. While the exosome EGFR-HER3 heterodimer failed to reach significance in our risk signature, it was close, as depicted in the Kaplan-Meier curve from Figure 3C. We of course acknowledge the potential added benefit of having serum cytokine array analysis. While that was not feasible for this study, our group now aims to ensure that extra patient serum samples are bio-banked for such analysis from ongoing and future trials.

      Reviewer 1 (Public Review):

      1) For this study to be significant, one would want to see a marked improvement over current biomarkers, in a robust and generalizable population. Unfortunately, this study falls short in these respects. First, the authors do not adequately discuss the prior literature. Even a fairly crude and old-fashioned blood-based biomarker such as neutrophil:lymphocyte ratio has quite good predictive and prognostic capability in R/M HNSCC

      Thank you for your suggestion. We have expanded the discussion to include an overview of current biomarkers. We also compared the predictive power of neutrophil:lymphocyte ratio (NLR) from two published meta-analyses to our risk signature 11,12. We used the median risk score to divide our original patient cohort into a high and low risk group. We then calculated the HRs and CI for both signatures at pre-treatment alone (HR = 4.1397 [95% CI: 1.975 - 8.676]) and for the combined signature (HR = 2.574 [95% CI: 1.336 - 4.96]). Both were higher than the published literature whilst only using the median as the cutoff. Mascarella, Mannard et al. published “NLR greater than the cutoff value was associated with poorer OS and DSS (HR 1.69; 95% CI 1.47-1.93; P < .001 and HR 1.88; 95% CI 1.20-2.95”, and Takenaka, Oya et al. published: “The combined hazard ratio for OS in patients with an elevated NLR (range 2.04-5) was 1.78 (confidence interval [CI] 1.53-2.07”. We realize that we are stratifying patients based on PFS and not overall survival, which is an inherent limitation of the study, but we humbly believe that the added predictive value of the signature relative to existing literature is too large not to be impactful.
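
      To make the comparison of hazard ratios above easier to inspect, the sketch below recovers the standard error of log(HR) from each reported 95% CI and checks that each interval is roughly symmetric on the log scale. It assumes the intervals were built as Wald intervals on log(HR), which the response does not state; the helper names are hypothetical, and the inputs are only the figures quoted in the paragraph above.

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile


def log_hr_se(lower, upper, z=Z_95):
    """Standard error of log(HR) implied by a log-symmetric confidence interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def ci_midpoint(lower, upper):
    """Geometric mean of the bounds; matches the HR if the CI is log-symmetric."""
    return math.sqrt(lower * upper)


# (HR, lower bound, upper bound) as quoted in the response above.
reported = {
    "risk signature, pre-treatment (PFS)": (4.1397, 1.975, 8.676),
    "risk signature, combined (PFS)": (2.574, 1.336, 4.96),
    "NLR, Mascarella et al. (OS)": (1.69, 1.47, 1.93),
    "NLR, Takenaka et al. (OS)": (1.78, 1.53, 2.07),
}

for name, (hr, lower, upper) in reported.items():
    se = log_hr_se(lower, upper)
    mid = ci_midpoint(lower, upper)
    print(f"{name}: HR={hr}, SE(log HR)={se:.3f}, CI midpoint={mid:.3f}")
```

      The larger standard errors for the risk signature reflect the much smaller cohort compared with the meta-analyses, and the endpoints differ (PFS versus OS), so the point estimates are not directly interchangeable.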

      2) It is not clear to me that there is a compelling need to do better -- given that existing predictive biomarkers based on clinical nomograms or NLR are actually used in practice.

      We agree that clinical nomograms (based on clinicopathological factors) have been shown to be predictors of outcomes in HNSCC 13. However, whilst these models have been validated as prognostic biomarkers for overall survival and/or disease specific survival, they are not currently recommended in the cancer treatment guidelines nor universally used in the clinic. With the further validation performed on a cohort treated with an immune-checkpoint inhibitor, our multimodal signature describes new data to help understand the range of treatment responses and predict outcomes and could be used to guide treatment intensification, continuation and/or early termination in clinical practice or incorporated into future clinical trials. Moreover, in the resubmission we extend our work from predictive biomarker research to developing a better understanding of the interplay between the peripheral immune response to intra-tumoral immunity which we discuss in this letter as part of our response to the public evaluation summary part 3. Given the recent surge in literature focused on tumor immunity with the increased use of immune checkpoint blockers, we believe our work offers a strong contribution to the few papers in circulation that have attempted to link tumor immunity from the systemic level to the tumor tissue level.

      3) A large number (31 of 87) patients were not included due to lack of biomaterials. No analyses have been performed to examine the characteristics of these patients. It is unlikely that the collection of biomaterials has no correlation with disease characteristics, prognostic features, outcomes, or the analytes in this study. This exclusion -- akin to unequal censoring in clinical trials -- is likely to significantly impact results. Given that the population enrolled in a phase II trial, and that sub-population of patients who survive long enough and are feeling well enough to submit to large volume blood draws on trial, would not necessarily represent the real world population of R/M HNSCC patients, a broader population is needed to justify conclusions about this assay having robust predictive value.

      We appreciate the reviewer’s concern on potential skewness of the data based on patient selection criteria. The median PFS of our 56-patient cohort used in the generation of the risk signature was 5.48 months as shown in supplementary table 1 in the original submission. This is in line with real-world treatment outcomes to the EXTREME Regimen (cetuximab with platinum-based therapy) as first line therapy for Recurrent/Metastatic Squamous Cell Carcinoma of the Head and Neck, which was reported as 5 months by Sano et al in 2019 14. It is also very similar to the median PFS observed in the DIRECT study 15.

      4) It is unclear why OS as a hard endpoint was not analyzed here. No explanation is provided, other than OS was not available, a statement that is difficult to understand, given that PFS was available, and overall survival is a component of PFS.

      Thank you. We admit that the absence of overall survival is an inherent limitation of the study. In the process of submitting this revision, we have once again requested this dataset from the sponsoring pharmaceutical company but were informed that they are unable to provide it. This is because reorganization of funding priorities within the company precludes them opening datasets from an already-published clinical trial. We are equally disappointed to not be able to obtain this data, but firmly believe that the ability of the signature to predict PFS (the primary endpoint of the trial, untainted by subsequent lines of treatment), as well as cross-validation against the contemporary EACH trial, is a testament to the signature’s strength.

      There is no validation set for the biomarker. The biomarker was trained and cross-validated using Bayesian techniques to reduce overfitting. This is a valid approach for training and cross-validation, but for the biomarker to be testable and interpretable, it requires assessment in an independent dataset. There is no statistical technique that I am aware of that generates informative biomarkers without an independent validation dataset

      We completely agree with the reviewer regarding the need to obtain a validation set. Obtaining patient samples from a similar cohort was difficult but we managed to validate the signature on a set of patients treated with an anti-PD-L1 monoclonal antibody in combination with cetuximab. Furthermore, the validation was performed using a limited number of covariates that were identified in the risk signature by the Bayesian model. These immune populations can be obtained by running a limited set of markers on flow cytometry. We were very happy to see that these limited immune-based covariates strongly correlated with a worse disease response in an independent cohort using a different treatment modality. This furthers our hypothesis that changes in the immune populations are key to understanding response to systemic therapy. Fueled with the data from the validation cohort we furthered our analysis of the tissue from a total of 9 patients from the test cohort. Using imaging mass cytometry, we were able to identify how immune populations are mirrored at the tumoral level, opening the horizon for new research. The data for the validation set are copied into this letter in response to point 2 of the public evaluation summary.

    1. Author Response

      Reviewer #1 (Public Review):

      Tarasov and colleagues provide data that extensively phenotypes TGAC8 mice, which exhibit a cAMP-mediated increase in cardiac workload prior to developing heart failure. The authors confirm data from prior studies, showing increased cardiac output mediated by changes in heart rate with similar ejection fraction. 

      The above is slightly incorrect as stated. Our results section stated that HR and EF were increased in TGAC8, but that stroke volume did not differ by genotype. Thus, the 30% increase in cardiac output in TGAC8 was attributable to the increased HR.

      The study is overall well-planned and the amount of data presented by the authors is impressive. The work nicely incorporates animal-level physiology (echocardiography data), tests for known canonical markers of hypertrophy, and then delves into an unbiased analysis of the transcriptome and proteome of LV tissue in bulk. The techniques and analyses in the study are adequately executed and within the realm of expertise of the Lakatta laboratory. This study is a necessary and crucial first step to extensively phenotype this mouse line and generate hypotheses for further work. 

      Reviewer #2 (Public Review): 

      Tarasov et al. present an impressive amount of work in their in-depth assessment of a murine model of chronic stress in a transgenic line with constitutively active AC/cAMP/PKA/Ca2+ signaling that spans cardiac structure, function, cellular architecture, gene and protein expression, mitochondrial function, energetics and more. Exploration of multiple cellular pathways throughout the manuscript and as summarized in Figure 16 helps characterize this murine model and serves as a first step in using this model to understand the effect of chronic stress on the heart. The conclusions of the manuscript are well-supported by the data, and I have the following comments:

      Strengths: 

      1. The authors present echocardiographic, histologic, electrocardiographic, neurohormonal quantification, protein synthesis/degradation, mitochondrial, gene and protein expression profiling, and metabolism data in their assessment of this model. 

      2. The verification of increased transcripts of AC and PKA activation in this transgenic line provided validation for the model. 

      3. The pathway analyses for both gene and protein expression profiling help support the authors' claim of the importance of differences noted in the various pathways between the transgenic line and controls.

      4. The investigators posit that there is decreased wall stress and adequate energy production due to a shift in metabolism. 

      As written, this statement does not exactly reflect what we had intended to communicate in the paper. We did not simply posit that LV wall stress was reduced in TGAC8, but rather that it must be reduced compared to WT on the basis of Laplace’s Law because of a substantial reduction of LV cavity volume. We also did not posit that energy production is due to a shift in metabolism, but rather that adaptations in energy metabolism resulted in adequate energy production to meet what appeared to us to be a marked increase in energy demand in TGAC8 vs WT, based on our observation that transcriptome and proteome gene ontology (GO) terms that differed in TGAC8 vs WT covered nearly all biological processes and molecular functions within nearly all compartments of the LV myocardium.
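
      For reference, the Laplace relation invoked above is commonly written, for a thin-walled spherical chamber, as shown below. This is the generic textbook form with symbols introduced here for illustration; the response does not state which geometric model or formulation the manuscript relies on.

```latex
% Thin-walled sphere approximation (assumed form):
% \sigma = wall stress, P = cavity pressure, r = cavity radius, h = wall thickness.
\[
  \sigma = \frac{P \, r}{2 h}
\]
```

      At a given pressure, a smaller cavity radius and a thicker wall both lower the computed wall stress, which is the reasoning behind inferring reduced wall stress from a reduced LV cavity volume.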

      These findings would suggest that this model would be suitable for that of an athlete's heart, which is characterized by thickened left ventricular walls without a compromise in function. 

      Although the chronic increase in cardiac output in TGAC8 heart simulates that of an athlete’s heart during exercise, LV cavity volume at rest is larger in the endurance trained heart and this is associated with bradycardia. In these aspects, the TGAC8 heart differs from the endurance trained heart (perhaps because it does not have sufficient rest periods between bouts of exercise, as does the endurance trained heart). In the discussion section of the manuscript, we noted several features that differed between the TGAC8 vs the endurance trained heart. 

      However, the mice do develop heart failure after 1 year without a sense of mechanism despite the wealth of data provided. Are the authors able to comment on what changes described in this study of this transgenic line may be deleterious in the long run? 

      Heart failure in the long run was first described in the TGAC8 mouse by Mougenot et al. (ref 10 in our manuscript), who performed numerous biochemical and biophysical measurements in TGAC8 and WT and attributed the heart failure to accelerated heart aging. We are in the midst of conducting a longitudinal study of cardiac structure and function in TGAC8 vs WT as these mice age, along with additional non-biased multi-omics analyses, in order to get an overview of which of the adaptive pathways that are activated in the TGAC8 heart at 3 months of age falter with advancing age, and how changes in these pathways relate to the altered cardiac structure and function of the TGAC8 as age advances. Following that, we will focus on each of these pathways employing detailed mechanistic analyses. Our provisional hypothesis is that while AC8 activity will continue to be increased as age advances, its downstream signaling will begin to fail due to age-associated changes in proteostasis and in the expression of proteins, including those involved in energy metabolism.

      Weaknesses: 

      1.  As acknowledged by the investigators, this is a hypothesis-generating rather than hypothesis-testing study.

      Yes, we used a systems approach at first, in order to “open our eyes” so that we could get an overview of numerous changes that might have occurred in the TGAC8 heart in order to generate hypotheses that could later be tested by others and by us.

      2.  The investigators posit that there is decreased wall stress and adequate energy production due to a shift in metabolism. These findings would suggest that this model would be suitable for that of an athlete's heart, which is characterized by thickened left ventricular walls without a compromise in function. However, the mice do develop heart failure after 1 year without a sense of mechanism despite the wealth of data provided. Are the authors able to comment on what changes described in this study of this transgenic line may be deleterious in the long run? 

      We have addressed these comments above in our response to your comment #4 under strengths.

      3.  Figure 5B is referenced to support the claim regarding beta adrenergic receptor desensitization, but the data show catecholamine levels in tissue. I would have expected receptor expression analysis to suggest up/downregulation of receptors at the membrane to support this claim. 

      Beta adrenergic receptor desensitization can occur due to changes in molecules that inhibit signaling, acting at the receptor or downstream of the receptor, in the absence of changes in receptor number. Here is how we summed this up in our manuscript:  “Numerous molecules that inhibit βAR signaling, (e.g. Grk5 by 2.6 fold in RNASEQ and 30% in proteome; Dab2 by 1.14 fold in RNASEQ and 18% in proteome; and β-arrestin by 1.2 fold in RNASEQ and 14% in proteome) were upregulated in the TGAC8 vs WT LV (Table S.3, S.5 and S.9), suggesting that βAR signaling is downregulated in TGAC8 vs WT, and prior studies indicate that βAR stimulation-induced contractile and HR responses are blunted in TGAC8 vs WT.8,11… A blunted response to βAR stimulation in a prior report was linked to a smaller increase in L-type Ca2+ channel current in response to βAR stimulation in the context of increased PDE activity.13, 14 WB analyses showed that PDE3A and PDE4A expression increased by 94% and 36%, respectively in TGAC8 vs WT, whereas PDE4B and PDE4D did not differ statistically by genotype (Figure 16-supplement 1 A). In addition to mechanisms that limit cAMP signaling, the expression of endogenous PKI-inhibitor protein (PKIA), which limits signaling downstream of PKA, was increased by 93% (p<0.001) in TGAC8 vs WT (Table S.3). Protein phosphatase 1 (PP1) was increased by 50% (Figure 16-supplement 1 A). The Dopamine-DARPP-32 feedback on cAMP signaling pathway was enriched and also activated in TGAC8 vs WT (Figure 15), the LV and plasma levels of dopamine were increased, and DARPP-32 protein was increased in WB by 269% (Figure 16-supplement 1 A).

      Thus, mechanisms that limit signaling downstream of AC-PKA signaling (βAR desensitization, increased PDEs, PKI inhibitor protein, and phosphoprotein phosphatases, and increased DARPP32, cAMP (dopamine- and cAMP-regulated phosphoprotein)) are crucial components of the cardio-protection circuit that emerge in response to chronic and marked increases in AC and PKA activities (Figure 4 C, F).” 

      4. Changes in ion channel (e.g. KCNQ1 and KCNJ2) gene and protein expression were described but not validated by assessment of change in function. 

      Reviewer #3 (Public Review): 

      Tarasov et al have undertaken a very extensive series of studies in a transgenic mouse model (cardiomyocyte-specific overexpression of adenylyl cyclase type 8) that apparently resists the chronic stress of excessive cAMP signaling for around a year or so without overt heart failure. Based on the extensive analyses, including RNAseq and proteomic screening, the authors have hunted for potential "adaptive" or "protective" pathways. There is a wealth of information in this study and the experiments appear to have been carefully performed from a technical viewpoint. Many interesting pathways are identified and there is plenty of information where additional experiments could be designed. 

      General comments 

      1. Ultimately, this is a descriptive and hypothesis-generating study rather than providing directly proven mechanistic insights.

      As noted in response to Reviewer #2: “Yes, we used a systems approach at first, in order to “open our eyes” so that we could get an overview of numerous changes that might have occurred in the TGAC8 heart in order to generate hypotheses that could later be tested by others and by us.”

      -Given several prior studies reporting a detrimental effect of chronically increased cAMP signaling, what is it that is different in this model? Is it something specific about AC8? Is it to do with when (in life) the stress commences? 

      We believe it is, at least in part, due to something specific about the effects of the marked increased activity of AC8 per se, because adenylyl cyclase signaling impacts nearly all aspects of our current knowledge of cell biology. Thus, due to the marked increase of AC and PKA activation in the TGAC8 heart, the transcriptome and proteome gene ontology (GO) terms that differ in TGAC8 vs. WT covered nearly all biological processes and molecular functions within nearly all compartments of the TGAC8 LV myocardium.

      - Is the information herein relevant to stress adaptation in general or is it just something interesting in this specific mouse model?

      In our opinion, the AC8 mouse model is very relevant to stress adaptation in general, but this broad view has hardly ever been realized previously in the literature, because of the reductionist nature (by necessity) of mainstream biomedical research. For example, reports on cardiac-specific overexpression of AC5 and AC6 never provided a broader view of these mice and were focused only on a limited number of traits, i.e., arrhythmogenesis, chronic pressure overload, contraction (Am J Physiol Heart Circ Physiol. 2015 Feb 1;308(3):H240-9; Am J Physiol Heart Circ Physiol. 2010 Sep;299(3):H707-12; Clin Transl Sci. 2008 Dec;1(3):221-7; Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9986-90; Am J Physiol Heart Circ Physiol. 2013 Jul 1;305(1):H1-8).

      None of the pathways that are apparently activated were directly perturbed so their mechanistic role requires further study.

      We agree and have entitled a section of our discussion “Opportunities for Future Scientific Inquiry Afforded by the Present Results” to address this plainly.

      Specific 

      1. The strain of the mice and their sex needs to be stated as well as the exact age at which the various assays were performed.

      All assays were performed on 3-month-old males. This information was inadvertently not directly stated in the original submission.  

      2. The hearts of the Tg mice have more cardiomyocytes but which are smaller. Since there is no observed increase in proliferation of cardiomyocytes, how (or when) did this increase in cell number occur?   

      It is likely that an increase in number of cardiomyocytes may have occurred during the embryonic stage of development (8.5 dpc), when AC8 expression begins. Since submitting our manuscript we have found that the expression level of human AC8 (the type of AC8 employed in this transgenic model) increases markedly during the embryonic period when compared to endogenous AC8 and remains elevated in both the fetal and perinatal periods. 

      3. While the mice do not show an increased mortality up to 12 months of age, HR/CO/EF are poor indices of contractile function. Data on end-systolic elastance or perhaps echo-based LV strain indices which will be relatively load-independent would be useful.

      Numerous comprehensive hemodynamic measurements have been performed previously on this mouse. For example, Mougenot et al. (Ref 10 in our manuscript), based on invasive hemodynamic analysis, concluded that contractile function in the TGAC8 heart was increased at both 2 and 12 months of age. However, Doppler imaging of the heart in conscious mice unmasked myocardial dysfunction, indicated by a reduction in systolic strain rate in both old TGAC8 and WT littermates. This is why they interpreted the heart failure in TGAC8 at 12 months of age as a manifestation of accelerated aging.

      We agree with your comment that end-systolic elastance ought to be measured in the TGAC8, and we would add that end-diastolic elastance and effective arterial elastance should also be measured in order to quantify diastolic function and heart energetic coupling in the TGAC8.

      4.  Quite a lot of conclusions are made relating to metabolism. However, this is entirely based on gene expression or protein levels. Given the substantial role of allosteric regulation in metabolic control, as well as the interconnectedness of metabolic pathways, ultimately any robust conclusions need to be based on an assessment of activity of key pathways. 

      We concur and have described some of the types of metabolic assessments in the last section of our discussion “Opportunities for Future Scientific Inquiry Afforded by the Present Results”: “… precisely defining shifts in metabolism within the cell types that comprise the TGAC8 LV myocardium via metabolomic analyses, including fluxomics.97 It will be also important that future metabolomics studies elucidate post-translational modifications (e.g. phosphorylation, acetylation, ubiquitination and 14-3-3 binding) of specific metabolic enzymes of the TGAC8 LV, and how these modifications affect their enzymatic activity”.

    1. Author Response

      Reviewer #1 (Public Review):

      In their manuscript, these authors present a novel geostatistical framework for modelling the complex animal-environment-human interaction underlying Leptospira infections in a marginalised urban setting in Salvador, Brazil.

      In their work, the authors combine human infection data and the rattiness framework of Eyre et al. (Journal of the Royal Society Interface, 2020). They use seroconversion defined as an MAT titer increase from negative to over 1:50 or a four-fold increase in titer for either serovar between paired samples from cohort subjects. Although this is a commonly used measure of infection, the work would benefit from addressing how robust the results are to this definition of seroconversion.

      Thank you for your comment. We have acknowledged this on line 534 in the discussion by adding the following text: “A possible limitation of this study is the titre rise cut-off values used for classifying seroconversion and reinfection in the cohort that determine the sensitivity and specificity of the infection criteria. However, these criteria were used because they are the standard definitions for serological determination of infection that are commonly applied for leptospirosis and a wide range of other infections, and they enable the comparison of results with other previous leptospirosis studies.”
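
      The infection criteria at issue can be written down compactly, which makes a sensitivity check on the cut-offs straightforward. The following is a minimal Python sketch, not the authors' code: the function name, the dictionary layout (serovar mapped to reciprocal MAT titre, with 0 meaning negative), and the choice to treat a titre of exactly 1:50 as meeting the cut-off are all assumptions made for illustration.

```python
def is_infected(baseline, followup, cutoff=50, fold=4):
    """Classify a participant from paired MAT titres (hypothetical helper).

    baseline, followup: dicts mapping serovar name -> reciprocal titre,
    where 0 denotes a negative sample. The criteria are applied per serovar
    and the participant counts as infected if any serovar meets either one.
    """
    for serovar, before in baseline.items():
        after = followup.get(serovar, 0)
        # Seroconversion: negative at baseline, at or above the cut-off later
        # (inclusive comparison is an assumption of this sketch).
        if before == 0 and after >= cutoff:
            return True
        # Titre rise: at least a `fold`-fold increase from a positive baseline.
        if before > 0 and after >= fold * before:
            return True
    return False


# Example with two hypothetical serovars "A" and "B".
print(is_infected({"A": 0, "B": 0}, {"A": 50, "B": 0}))   # seroconversion
print(is_infected({"A": 50, "B": 0}, {"A": 100, "B": 0})) # two-fold rise only
```

      Re-running a classification of this kind with alternative values of `cutoff` or `fold` is one way to examine how sensitive downstream estimates are to the definition.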

      The model framework relies on the concept of 'rattiness' previously defined by Eyre et al. (JRSI, 2020) and assumes conditional independence in its construction (equation (1)). Although this is a reasonable assumption, it would be good to discuss situations in which this assumption is questionable and what the implications are for applying the modelling framework to other settings.

      We have added the following text immediately after “is shown schematically in Figure 2” following equation (1) on line 225: “The conditional independence assumption in (1) is reasonable for a vector-borne disease or one that is transmitted indirectly, in which context the observed rat indices are to be considered as noisy indicators of the unobservable spatial variation in the extent to which the environment is contaminated with rat-derived pathogen. It would be more questionable for applications in which the disease of interest is spread by direct transmission from rat to human.”

      The authors provide an extensive model building exercise and investigate, in different ways, whether the model captures the necessary complexity (GAM smoothers - testing linearity, spatial correlation, etc). I believe the work would benefit from (1) a formal diagnostic investigation, if feasible; (2) providing guidelines on how model building should be performed.

      We have added a new Appendix 7 with diagnostic plots of randomized quantile residuals to check the rattiness-infection model fit with the human infection data and included the following text in Section 2.4 of the main text: “A formal diagnostic investigation of randomized quantile residuals is included in Appendix 7. We found no evidence in the diagnostic plots to suggest that there were issues with our modelling approach.”
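
      For readers unfamiliar with randomized quantile residuals, the idea for a binary outcome can be sketched as follows. This is an illustrative Python sketch under the assumption of a Bernoulli response with fitted probabilities `p`; it is not the R code behind Appendix 7, and the function name is hypothetical. Each residual is obtained by drawing a uniform value between the fitted cumulative probabilities just below and at the observed outcome, then mapping it through the standard normal quantile function.

```python
import random
from statistics import NormalDist


def randomized_quantile_residuals(y, p, seed=0):
    """Randomized quantile residuals for binary outcomes (illustrative sketch).

    y: observed outcomes (0 or 1); p: fitted probabilities of y == 1.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    residuals = []
    for yi, pi in zip(y, p):
        # Bernoulli CDF: F(-1) = 0, F(0) = 1 - p, F(1) = 1.
        lower = 0.0 if yi == 0 else 1.0 - pi
        upper = 1.0 - pi if yi == 0 else 1.0
        u = rng.uniform(lower, upper)
        # Keep u strictly inside (0, 1) so the normal quantile is finite.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        residuals.append(std_normal.inv_cdf(u))
    return residuals


# Hypothetical outcomes and fitted probabilities.
res = randomized_quantile_residuals([0, 1, 0, 1], [0.2, 0.7, 0.5, 0.5])
print(res)
```

      If the model is well specified, residuals computed this way should look approximately standard normal, which is what diagnostic plots such as Q-Q plots are used to check.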

      To supplement the R code that is publicly available for repeating all of the steps in this analysis, we have now also included a detailed step-by-step explanation of the model building process in Appendix 8 that outlines the key steps for building the rat and infection components of the model (variable selection and evaluation of residual spatial autocorrelation) and fitting and examining the joint rattiness-infection model. We have added the following text in Section 2.6 of the main text: “We also include a step-by-step explanation of the model building process to guide future users of the rattiness-infection framework in Appendix 8.”

      The authors are to be acknowledged for providing an extensive and thorough discussion of the different aspects of their work. Whereas the discussion is complete, I wonder whether the authors can give a brief example about how this model can be applied in a different setting.

      Thank you. We have added the following text on line 551 in the discussion: “The framework may have important applications beyond the study of zoonotic spillover, with the rattiness component replaced by other exposure measures e.g. mosquito density or ecological indices (such as pollution, where there are multiple, related measures of air or groundwater quality) to model associations with human or animal health outcomes.”

      Reviewer #2 (Public Review):

      Eyre et al. developed and applied a novel geostatistical framework for joint spatial modeling of multiple indices of pathogen (Leptospira) reservoir (rats) abundance and human infection risk. This framework enabled evaluation of infection risk at a fine spatial scale and accounted for uncertainty in the pathogen reservoir abundance estimates. The authors used data collected in two different field projects: (1) a rat ecology study in which three different approaches were used to detect rat presence "rattiness", and (2) a prospective community cohort study in which individuals were sampled during two different time periods to detect recent infections via seroconversion or a four-fold increase in anti-Leptospira antibody MAT titer. Univariable and then multivariable analyses were performed on these data to identify (1) the environmental variables that best predicted "rattiness", and (2) the demographic/social, environmental (household), occupational, and behavioral variables that best predicted human risk of infection. Once identified, the best predictors from (1) and (2) were included in a final, joint model to identify the significant predictors of both 'rattiness' and human infection risk. As a result of this study, the authors were able to detect spatial heterogeneity in leptospiral transmission to humans. They found that infection risk associated with increases in reservoir abundance differed by elevation, and that increases in reservoir abundance at high elevation were associated with a much higher odds ratio for infection than at low elevation. The authors suggest that this has to do with differences in how the infectious leptospires (shed by the rat reservoir) are dispersed in the environment. At high elevations, flooding is less frequent and thus rat shed leptospires are likely to stay where the rat deposited them. 
At lower elevations, by contrast, flooding may play a large role in spreading leptospires more evenly across the landscape, reducing the importance of rat presence at smaller spatial scales. The final best model was then used to generate prediction maps of 'rattiness' as well as human infection risk at all locations within the study area (i.e. including those that lacked rat detection data and human infection data). This work represents an important advance in infection risk modeling as it explicitly incorporates estimates of reservoir abundance and the uncertainty surrounding these estimates into the infection risk assessment, and allows for modeling of infection risk at fine spatial scales. Findings from this study have important management implications at the authors' study site as it suggests that interventions directed at high elevations should be different from those designed to address infection risk at lower elevations. However, there are broader implications, as this novel approach may be applied to other systems to enable identification of differences in infection risk for other pathogens at a fine spatial scale, predict infection risk more broadly, and facilitate intervention strategies targeted for the specific epidemiological and ecological conditions experienced by a population.

      This was a well-designed study. The field sampling approach was well balanced, well described and appropriate. Broadly the modeling framework is appropriate for the questions being asked and for the data being used. The variable and model selection approaches were clearly described and appropriate. Evaluation of the more detailed mathematical approach is outside of my area of expertise, so I am unable to comment on the validity of the approach.

      For the most part, the explanatory variables assessed in the different models were well described and justified, however there were some cases for which further explanation would have been helpful. For example, how did the authors determine which occupations to evaluate? Specifically, why traveling salesperson? What is the difference between open sewer within 10 m and unprotected from sewer?

      We have added the following additional text to Section 2.3.2 on line 297 to clarify the definition and reason for inclusion for these variables: “In the household environment domain, two variables were used to capture risk due to sewer flooding close to the household: i) the presence of an open sewer within 10 metres of the household location and ii) a binary `unprotected from open sewer' variable which identified those households within 10 metres of an open sewer that did not have any physical barriers erected to prevent water overflow. Three high-risk occupations were included in the occupational exposures domain as binary variables. Construction workers and refuse collectors have direct contact with potentially contaminated soil, building materials and refuse in areas that provide harbourage and food for rats. Travelling salespeople have regular and high levels of exposure to the environment (particularly during flooding events) as they move from house to house by foot. Two other binary occupational exposure variables were included that measured whether a participant worked in an occupation that involves contact with floodwater or sewer water.”

      I also had some concerns regarding the time-period of the rat ecology study used to determine abundance, potential fluctuations in rat abundance through time, and how this might align with sampling to detect infection in humans. Depending on the time scale of population fluctuation in rats as well as fluctuations in infection prevalence in rats, the abundances calculated from data from the ecology study may not be accurately reflecting true abundance (and therefore shedding and transmission risk) during the time period that a human may have been exposed. However, the authors do a nice job of addressing some of these issues in the discussion. They mention that infection prevalence in rats is consistently around 80% and that there don't appear to be seasonal fluctuations in human exposure risk in the study area.

      Thank you.

      Reviewer #3 (Public Review):

      The goal of the authors was to test how important local rat abundance is as a driver of Leptospira infection in humans.

      The authors approached this using a strong combination of datasets on human infection risk and rat abundance, across a spatial scale that is large enough to allow simultaneous assessment of multiple potentially important drivers of infection risk. This further enables the authors to develop infection prediction maps based on the fitted models.

      This study design is a major advance towards understanding the link between rat abundance and human infection risk.

      Based on the top models tested in the study, the authors conclude that local rat abundance is indeed correlated with infection risk, and that this correlation is strongest at higher elevation.

      This is an impactful finding, but in my opinion it is not yet clear how robust and important this is, because of two reasons:

      (1) The infection risk data: while the actual infection risk data are not shown, the map shown in Figure 5B suggests that there is an infection hotspot that happens to be at high elevation. This raises the question of how strongly this single hotspot is driving the observed correlation between rat abundance and infection risk (which the authors find to be much stronger at high elevation than at lower elevations).

      We have added a new figure (Figure 4) earlier on in the article (we decided to add this here rather than to Figure 6 - formerly Figure 5 - to ensure that the map is large enough that points in Figure 4A are easily visible – please note that it is included as a larger and easier to view image in the main eLife template version) with the raw infection data overlaid on contour lines for the three elevation levels to provide the reader with a better overview of the raw data. This new Figure 4 shows that out of a total of 403 participants in the high elevation region there were 16 infections, of which only 5 (31%) were located in the large hotspot in Valley 3 (valleys are numbered 1 to 3 from west to east, see Figure 1A). In addition to the largest hotspot in the north of Valley 3, there are several other areas in the high elevation region with raised predicted infection risk values relative to their surroundings where there were also rattiness hotspots and infected participants in the raw data: five cases (red and yellow infection risk areas in Figure 5B) on the western side of Valley 2; the two cases on the eastern edge of Valley 2; the two cases on the western edge of Valley 3; and the single case in the southwest of Valley 3. Other variables are also important drivers of infection risk and at several of these locations the contribution of rattiness increases infection risk significantly relative to the low-risk surrounding area (e.g. to 10% in areas where risk is closer to 1% or 2%) without reaching the more obviously visible high infection risk values closer to 20%.
We believe that our statistical model provides a better test of whether there is a statistical association between rattiness and infection at high elevations than a visual examination, but that this is supported by the large number of observations in the high elevation area (403) and the distribution of infected and uninfected households, which demonstrates that the observed association is not only driven by the hotspot in Valley 2.

      (2) The statistical models: if I understand correctly, all tested models of infection risk include the variable rat abundance, and while the individual effect estimates for rat abundance are statistically significant (Table 3), the more important question of how the fit of a model without the rat abundance variables compares with those of the other tested models (shown in Supplementary Table S2) has not been addressed.

      These models were considered but were ranked outside of the top five models and for this reason were not reported in Table S2. We agree that showing the AIC of a model without rattiness in this table can more clearly demonstrate the improved fit of the model with rattiness. To do this we have added the highest ranked model without rattiness (M*) to Table S2 and added a note to the table explaining the reason for its inclusion (“Model M* was ranked outside of the top 5 models but is included here for reference to demonstrate the improvement in model fit when rattiness is included”). The AIC of M* was 532.13. This is substantially higher than the AIC values of the top five models (M1 = 523.14 and M5 = 525.04), justifying the inclusion of rattiness in this model and in the joint rattiness-infection framework.
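
      The size of the improvement can be made explicit from the AIC values quoted above. The short Python calculation below is an illustrative sketch using only those reported numbers; the variable names are hypothetical, and the relative-likelihood formula exp(-ΔAIC/2) is the standard Akaike rule of thumb rather than a quantity reported in the manuscript.

```python
import math

aic_without_rattiness = 532.13  # highest-ranked model without rattiness
aic_top_model = 523.14          # M1, the top-ranked model with rattiness

# Difference in AIC between the model without rattiness and the best model.
delta_aic = aic_without_rattiness - aic_top_model

# Relative likelihood of the model without rattiness versus the best model.
relative_likelihood = math.exp(-delta_aic / 2)

print(f"Delta AIC = {delta_aic:.2f}")
print(f"Relative likelihood = {relative_likelihood:.3f}")
```

      A difference of roughly 9 AIC units is usually read as substantially weaker support for the model without rattiness, which is consistent with the conclusion drawn above.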

      Regardless of whether rat abundance is an important driver of human infection risk, this study is a major step in our understanding of the role of rats in the spread of leptospirosis, due to the strong pairing of a unique combination of datasets with a spatial statistical modeling approach.

      Thank you.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript discusses evolutionary patterns of manipulation of others' allocation of investment in individual reproduction relative to group productivity. Three traits are considered: this investment, manipulation of others' investment, and resistance to this investment. The main result of the manuscript is that the joint evolution of these traits can lead to the maintenance of diversity through, as documented here, cyclic (or noisier) dynamics. Although there are some analytical results, this main conclusion is instead supported by individual-based simulations, which seem correctly performed (but for clonal populations, as emphasized below).

      There could be material for a good paper here but the organization of the manuscript makes it difficult to fully evaluate. The narrative is highly condensed, with the drawbacks that this often entails in terms of accurately conveying the results of a study, as illustrated here by the following issue.

      The population is apparently assumed to be clonal (more than just "haploid"), meaning that there is no recombination between the loci controlling the three traits. In the one case where this assumption is relaxed (quite artificially), the cyclic dynamics disappear (section 4.4 of the appendix). This is crucial information that cannot be appreciated in the main text.

      The paragraph at line 368 offers a simple explanation for the joint dynamics of traits. However, this explanation would hold identically for a sexual population and a clonal population, whereas these two cases seem to have completely different dynamics. Thus, there is something essential to explain these differences, that is missing from the given explanation.

      Yes, our model was asexual with no recombination. To address this comment, we carried out additional simulations in which recombination was allowed (Appendix 1—4.8). We found that recombination does not change our results (predictions), and we describe this on lines 469-475. By assuming additive effects of traits and each trait having the same dispersal property, our haploid asexual model is also equivalent to a diploid sexual model (Taylor 1996; Day & Taylor 1998).

      This is especially important because the finding that the joint evolution of several traits can lead to some form of diversity maintenance is not surprising. As the discussion acknowledges (but the introduction seems to downplay), it is also well understood that manipulation and counter-adaptations to it can occur in many contexts and lead to the maintenance of diversity. For this reason, similar results in the present case are not surprising, and the main outcome of the study should be to provide a deeper understanding of the forces leading to the different outcomes in the current models.

      I do not see clearly what distinguishes "manipulative cheating" from other forms of manipulation that have been previously discussed in the literature (e.g., as cited on line 461). Couldn't this be clarified by some kind of mathematical criterion?

      Thanks for pointing out that there is room to improve the distinction between our model and previous models! We have added more description to explain the conceptual difference on lines 187-193, and a new subsection in the appendix that shows these differences by mathematically examining the fitness formulations in previous models (Appendix 1—1.3).

    1. Author Response

      Reviewer #1 (Public Review):

      This paper addresses an important question: whether the conduction velocity in white matter tracts is related to individual differences in memory performance. The authors use novel MRI techniques to estimate the "g-ratio" in vivo in humans - the ratio of the inner axon relative to the inner axon plus its outer myelin sheath. They find that autobiographical recall is positively related to the g-ratio in a specific white matter tract (the parahippocampal cingulum bundle) in a population of 217 healthy adults. This main finding is extended by showing that better memory is associated with larger inner axon diameters and lower neurite dispersion, which suggests more coherently organised neurites. The authors also argue that their results show that the magnetic resonance (MR) g-ratio can reveal novel insights into individual differences in cognition and how the human brain processes information.

      The study is exploratory in nature and the analyses were not pre-registered. The technique has not been used before to associate cognitive performance with MR estimates of conduction velocity in candidate white matter tracts. It is therefore unknown how strong any associations are likely to be and what sort of sample size might be needed to observe them. Nevertheless, if the technique proves to be reliable, then it certainly offers a valuable new tool to understand individual differences in cognitive abilities. However, brain structure to behavior associations are notoriously variable across studies and have been argued to require very large sample sizes to obtain reproducible results.

      We respectfully disagree that the study was exploratory. We had distinct aims and hypotheses from the outset. Our prime interest is in autobiographical memory, the hippocampus and its connectivity. This motivated our focus on three specific white matter tracts. We also planned from the time of study design to examine the MR g-ratio, and even contributed to refining the pre-processing pipeline for this approach, as reported in a previous paper (Clark et al., 2021, Frontiers in Neuroscience). Moreover, in the current manuscript we outlined well thought through possible outcomes and declared specific predictions.

      Regarding pre-registration, due to the scope of this work, the experiment was planned eight years ago, and data collection commenced seven years ago. At that time, formal pre-registration was not common practice. However, it has been a long-standing feature of our Centre that proposed studies and their analysis plans undergo rigorous internal peer review, including presentation to the whole Centre, before data acquisition can commence. The proposal for the research under consideration here was presented on 26th September, 2014.

      As noted in our response to the Editors’ Public Evaluation Summary above, someone has to be the first to report a novel result, and we believe that the depth and transparency of our approach permits confidence in the findings. Not least, and to reprise, because we employed the most widely-used and best-validated method of testing autobiographical memory recall that is currently available – Levine’s Autobiographical Interview. Our primary analyses were performed using the behavioural outcome measure from this test, the results of which were directly compared to those from a closely-matched control measure to test whether significantly larger effects were observed for our variable of interest. The potential for false positives was further reduced by extracting microstructure data from hypothesised tracts of interest (instead of performing whole brain voxel-wise analyses), with statistical correction performed on all structure-behaviour analyses. Moreover, we performed partial correlations with age, gender, scanner and number of voxels in a region of interest (ROI) as covariates. Complementary investigations were also conducted using other commonly-reported measures, providing supporting evidence. We report all analyses (and provide all the source data), including those finding no relationships. The consistent results throughout were associations between autobiographical memory recall ability and the microstructure of the parahippocampal cingulum bundle only. Moreover, thanks to the excellent suggestions of the Reviewers, the revised version reports additional analyses that allow us to further corroborate and interpret our findings.

      Our sample of 217 participants allowed for sufficient power to identify medium effect sizes when conducting correlation analyses at alpha levels of 0.01 and when comparing correlations at alpha levels of 0.05 (Cohen, 1992, Psychological Bulletin). While it has recently been suggested that thousands of participants are required in order to investigate brain structure-behaviour associations (Marek et al., 2022, Nature), other, more sophisticated, analyses suggest that samples of ~200 participants can be sufficient, in line with our estimates (Cecchetti and Handjaras, https://psyarxiv.com/c8xwe; DeYoung et al., https://psyarxiv.com/sfnmk). Given that our study was principled, well-controlled, analysed appropriately and produced very specific and consistent findings, we are confident that the findings are robust.
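
      The power statement for the correlation analyses can be checked with a standard approximation. The Python sketch below is illustrative only: it assumes that "medium effect size" means r = 0.3 (Cohen's convention), uses the Fisher z approximation for a two-sided test of a simple correlation, and does not adjust for the covariates used in the partial correlations. The function name is hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r, n, alpha):
    """Approximate two-sided power to detect correlation r with n participants."""
    std_normal = NormalDist()
    # Fisher z-transformed effect divided by its standard error 1/sqrt(n - 3).
    z_effect = atanh(r) * sqrt(n - 3)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Probability of exceeding the critical value in either tail.
    return std_normal.cdf(z_effect - z_crit) + std_normal.cdf(-z_effect - z_crit)


power = correlation_power(r=0.3, n=217, alpha=0.01)
print(f"Approximate power: {power:.2f}")
```

      Under these assumptions the approximation gives power well above the conventional 0.80 threshold, in line with the statement above; smaller true effects would of course require larger samples.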

      The authors decided to analyse performance on a single task - the Autobiographical Memory Interview - and identified three candidate white matter tracts that connect the hippocampal region with other brain regions. While it is clear why these three tracts were chosen, it is less obvious why the authors chose to investigate associations with the Autobiographical Memory Interview and not other memory tests that were part of the battery of tests administered to the participants. It is reasonable to assume that something as general as the conduction velocity of a white matter tract would have an effect on memory ability across a range of tasks, so to single out one seems an unnecessarily narrow focus.

      Our main interest over many years, and hence the focus of this study, is autobiographical memory recall because it directly relates to how people function in real life. As noted above, autobiographical experiences occur in dynamic, multisensory, multidimensional, non-linear, ever-changing contexts; they involve actively engaging with the environment and other people; they are embodied; they span milliseconds to decades. Many of these features cannot be captured by laboratory-based episodic memory tests. This issue is increasingly being discussed (for example, see recent reviews by Nastase et al., 2020, NeuroImage; Mobbs et al., 2021, Neuron; Miller et al., 2022, Current Biology). It is further laid bare in McDermott et al.’s (2009, Neuropsychologia) meta-analysis of functional MRI studies which showed that laboratory-based and autobiographical memory retrieval tasks differ substantially in terms of their neural substrates. Consequently, we were not surprised to find that when we analysed laboratory-based memory test performance, there were no correlations with the MR g-ratio. Recall of vivid, detailed, multimodal, autobiographical memories may rely on inter-regional connectivity to a greater degree than simpler, more constrained laboratory-based memory tests. Therefore, as well as speaking to conduction velocity, these findings also contribute to wider discussions about real-world compared to laboratory-based memory tests. We thank the Reviewer for making the excellent suggestion to include these additional data, analyses and discussion points.

      The results of the study are interesting and highlight a key role of the parahippocampal cingulum bundle in autobiographical memory recall. The results are corrected for multiple comparisons across the three fiber tracts of interest and the recall of "external details" provides a nice control compared to the "internal details" which are the measure of interest. The main findings are extended to show that it is likely to be an increase in axon diameter and an increase in neurite coherency that characterize those individuals with better autobiographical recall. Despite these positives, it remains unclear whether memory recall, in general, is better in people with higher g-ratios in this tract (as implied in the Abstract), or if this effect is specific to scores on the Autobiographical Memory Interview.

      Our interest is in autobiographical memory, and so we employed the most widely-used and best-validated method of testing autobiographical memory recall that is currently available – Levine’s Autobiographical Interview. Not only does this test include a control measure, external details (as noted by the Reviewer), but we had independent raters score the autobiographical memory descriptions, and found that the inter-class correlation coefficients were very high (see Materials and Methods). Despite using this current, gold standard approach, at the request of the Reviewer we have now analysed data from eight additional laboratory-based memory tests. These are standard memory tests that are often used in neuropsychological studies: testing recall - the immediate and delayed recall of the Logical Memory subtest of the Wechsler Memory Scale IV, the immediate and delayed recall of the Rey Auditory Verbal Learning Test, the delayed recall of the Rey–Osterrieth Complex Figure; testing recognition memory - the Warrington Recognition Memory Tests for Words and Faces; testing semantic memory - the “Dead or Alive Test”. While these tests can assess some aspects of memory recall, they cannot be regarded simply as proxies for autobiographical memory recall, for the reasons we outlined in our response to the previous point. They do not capture key aspects of autobiographical memories. It is therefore all the more interesting that we found no associations between these laboratory-based memory tasks and the MR g-ratio of the parahippocampal cingulum bundle, in contrast to the relationship identified with autobiographical memory recall ability. Recall of vivid, detailed, multimodal, autobiographical memories may rely on inter-regional connectivity to a greater degree than simpler, more constrained laboratory-based memory tests. 
Therefore, as well as speaking to conduction velocity, these findings also contribute to wider discussions about real-world compared to laboratory-based memory tests. We thank the Reviewer once again for making the excellent suggestion to include these additional data, analyses and discussion points.

      Reviewer #2 (Public Review):

      In this study, Clark and colleagues tackle a very intriguing question: how are differences in autobiographical recall abilities reflected in human brain structure and function? To answer this question, they interviewed a large cohort of subjects and proceeded to acquire MRI data, specifically diffusion-weighted imaging and magnetization transfer data, to estimate the g-ratio, a measure of myelination deeply linked to conduction velocity. Looking at three specific white matter pathways of interest, all interconnecting the hippocampus with other brain structures, they studied the relationship between the g-ratio and the autobiographical recall abilities, together with many more measures from MRI. They found a significant positive association between the g-ratio of the parahippocampal cingulum bundle and the number of internal details from the interviews. These results can provide new potential directions to further study the underlying neural features beyond memory.

      I think that this is a very interesting article; it is well written, the methods are extensively explained, and the appendix provides further details for more expert readers. The authors put an effort into providing a comprehensive context in the introduction and in the discussion, and as a result, the paper seems overall quite suitable for both general and specialist readerships.

      Thank you.

      The main issue I can currently see in the paper is that the mentioned relationship between g-ratio and recall abilities is then used to infer that better recall abilities are associated with higher conduction velocity and larger axons. The authors' line of reasoning is that given the hypothesized association, the increase in the g-ratio implies increases in myelin and axonal diameter. Despite this scenario being indeed possible given the current result, an increased g-ratio may also not indicate higher conduction velocity. In fact, the first potential inference would be that, without having any information on the axon size, the quantity of myelin can indeed be lower and, as a result, the conduction velocity would decrease. I understand that the authors expected higher conduction velocity associated with better autobiographical memory recall, but it is hard to see any experimental outcome that could have disproved this hypothesis: from the possible scenarios depicted in the introduction, any change in the g-ratio (and even no change at all) could indicate higher conduction velocity. What would then be needed to corroborate one of these scenarios is some independent or complementary measure, which unfortunately is missing.

      The mentioned issue does not mean that the paper loses relevance - I think that it should focus on the very practical result, a change in myelination and microstructure, and discuss what the potential implications are, including the one that currently dominates the discussion section.

      Thank you for these comments and the opportunity to provide further clarification.

      First, we have now provided additional background information regarding the relationship between the MR g-ratio and conduction velocity. We explicitly note that while finding a significant relationship between the MR g-ratio and autobiographical memory recall suggests the existence of an association between autobiographical memory recall and parahippocampal cingulum bundle conduction velocity, it cannot speak to the direction of this association.

      Second, we have further noted that interpretation of the parahippocampal cingulum bundle MR g-ratio in relation to the underlying microstructure requires knowledge, or an assumption, about whether the associated change in conduction velocity is faster or slower. Given that faster conduction velocity is thought to promote better cognition (e.g. Brancucci, 2012; Dicke and Roth, 2016; Miller, 1994; Reed and Jensen, 1992), we interpreted our MR g-ratio findings under the assumption of faster conduction velocity, and now explicitly note in several places in the revised manuscript that this is an assumption.

      Third, we thank the Reviewer for the excellent suggestion that a complementary measure could help to further inform the findings. Consequently, we now also include additional analyses examining the relationship between the extent of myelination and autobiographical memory recall ability. This is possible using the magnetisation transfer saturation maps, which are optimised to assess myelination. Given our assumption of faster conduction velocity when interpreting our positive MR g-ratio correlations, no relationship between parahippocampal cingulum bundle magnetisation transfer saturation and autobiographical memory recall would be expected. On the other hand, if conduction velocity is actually decreasing, then a negative correlation between magnetisation transfer saturation values and autobiographical memory recall ability would be observed. In fact, we found no relationship between parahippocampal cingulum bundle magnetisation transfer saturation and autobiographical memory recall. This suggests that myelin was not associated with autobiographical memory recall ability, supporting our assumption that relationships with the MR g-ratio were indicative of faster, rather than slower, conduction velocity.

      We have now added these new data, analyses and discussion points to the revised manuscript.

      It would also be helpful to include some paragraphs on both interpretation and methodological issues when it comes to MRI-based microstructural imaging, which at the moment is lacking. This would provide a better picture of the results for a more general readership.

      We agree, and additional consideration of interpretational and methodological limitations has now been included in the manuscript.

      As one of the first works using an MRI-based microstructural measure of myelin, the g-ratio, to study cognition in a large cohort of subjects, I think this work will be a needed and significant step towards merging the neuroscience and MRI physics communities - the methodology presented here is robust and could be used in many other applications.

      Thank you.

      Reviewer #3 (Public Review):

      The manuscript adds useful information about how structural properties of the brain are related to individual differences in autobiographical memory. A novel metric is used to assess features of white matter in tracts that are important for information exchange between the hippocampus and other brain regions. In one of these, the parahippocampal bundle, a relationship between the MR g-ratio and autobiographical memory recall is identified. This represents new and interesting information. The authors interpret the results in line with the theory that speed of signal transmission is important for cognitive function.

      Thank you for this positive summary.

    1. Author Response

      Reviewer #1 (Public Review):

      Rasicci et al. have developed a FRET biosensor that is designed to light up when cardiac myosin folds. This structure is extremely important to understand, and its link to the super-relaxed (SRX) state has not been fully shown. Their study provides a comprehensive review of the literature and provides compelling data that the 15 heptad+leucine zipper+GFP construct does function well and that the DCM mutant E525K has a similar IVM velocity despite a reduced ATPase compared with HMM. They rely on the ionic strength-dependent changes in the rate of MantATP release to argue that the E525K mutation stabilizes the 'interacting heads motif' (IHM) state, which makes logical sense.

      Strengths:

      Well written and comprehensive.

      Utilizes the appropriate fluorescence-based sensor for measuring the folding of the myosin structure. Provides a detailed range of techniques to support the premise of the study.

      Weaknesses:

      Over-interpretation of the outcomes from this study implies that the IHM and SRX are the same. Similar studies, e.g. Anderson 2018 and Chu 2021, support the opposite view that IHM and SRX are not necessarily the same. Anderson (and Rohde 2018) point out that S1 has some element of a reduced ATPase; this clearly cannot be due to folding of the molecule. Also, mavacamten was used in these studies to show that even S1 is inhibited, suggesting that SRX and IHM are not connected. This is not to say that, with enough supporting evidence, these observations cannot be over-ridden; it is just not clear that there is enough in this study to support this conclusion.

      We have revised our discussion to emphasize that our results support a model in which the SRX state is enhanced by formation of the IHM, but given the S1 and 2HP data the IHM may not be required for populating the SRX biochemical state (see page 8).

      I felt that the authors passed over the recent Chu 2021 paper too quickly. The Thomas group also used a FRET sensor, which provides a direct comparison of technique, but reached opposite conclusions. They also have supporting data in Rohde 2018 that their constructs were less ionic strength sensitive. It would be useful to understand what the authors think about this.

      We have discussed the Rohde and Chu papers in more detail in the discussion (see page 8). In the Rohde paper they used proteolytically prepared HMM and S1. Rohde found 20% SRX at all KCl concentrations in S1, while HMM shifted from 50% to 20% SRX in low and high salt conditions, respectively. Our results are different in terms of the absolute fraction of the SRX state, but the trend is similar in terms of S1 being salt-insensitive and HMM being salt-sensitive. The difference could be due to proteolytic HMM, which is a longer construct, and proteolytic S1, which is prone to internal cleavage that can impact ATPase activity. Another difference could be the mixed isoform of mantATP used in previous studies and the single isoform of mantATP used in our study (see page 5).

      Reviewer #2 (Public Review):

      The paper by Rasicci et al. examines the impact of the DCM mutation E525K in beta-cardiac myosin on its function and regulation by autoinhibition. The role of the auto-inhibited state of beta-cardiac myosin in fine-tuning cardiac contractility is an active and exciting area of current research related to muscle biology and cardiomyopathies. Several studies in the past have linked the destabilization of the autoinhibited, super-relaxed (SRX) state of myosin to the pathogenesis of hypertrophic cardiomyopathy. This timely study provides one of the first few examples where the hypocontractile phenotype of a DCM mutation has been linked to the stabilization of the SRX state.

      One of the strengths here is the utilization of a wide variety of both pre-existing and novel biochemical and biophysical assays for the study. The authors have characterized a new two-headed long-tailed myosin construct containing 15-heptad repeats of the proximal S2 (15HPZ), which they show allows myosin to form the SRX state in vitro using single ATP turnover assays. The authors go on to compare the E525K and WT proteins using the 15HPZ myosin construct in terms of their steady-state actin-activated ATPase activity, in-vitro actin-sliding velocity and single ATP turnover measurements. These assays reveal that the predominant effect of this mutation is the stabilization of the SRX state which is maintained even at 150 mM salt concentration where the WT SRX is largely disrupted. This is an important observation because DCM mutations so far have been believed to only affect the force-generating capacity of myosin.

      One of the biggest strengths of this study is the attempt to develop a FRET-based approach to directly ask if the biochemical SRX state here correlates well with the structural IHM state, which is an important unresolved question in the field. The authors have designed a FRET pair (C-terminal GFP and Cy3ATP bound to the active site) that is sensitive to the relative position of the heads and the tail, allowing them to distinguish between the low-FRET closed IHM conformation and the no-FRET open conformation. Remarkably, the authors show that the salt dependence of the FRET efficiency values closely follows their results from the salt dependence of the percent SRX for both WT and E525K proteins. The authors then attempt to substantiate their FRET results by a direct visual analysis of the conformational states populated by both WT and E525K proteins at low salt using negative staining EM analysis. The authors have optimized conditions to allow the deposition of the IHM state on grids without adding the small molecule mavacamten, which was found to be necessary in an earlier study to visualize the closed state using EM. The authors conclude that the SRX state correlates well with the IHM state and that the E525K mutation indeed stabilizes the folded-back conformation of myosin.

      This study significantly strengthens the previously illustrated correlation between the SRX and IHM states and provides methodological advances (especially visualization of the IHM state by negative EM in the absence of cross-linking agents) that will be very useful to the field going forward. The observation that a DCM mutation can lead to stabilization of the folded back state is a novel insight that should spark interest in the field to test how broadly this applies to other DCM mutations. The conclusions of the paper are mostly supported by the data; however, some clarifications and qualifications are needed.

      Weaknesses:

      The extremely low enzymatic activity of the M2β 15HPZ myosins as compared to the WT S1 control (which is a historical control not assayed in parallel with the 15HPZ proteins), is concerning for the low protein quality of the 15HPZ myosins. The authors attribute the low kcat to the high proportion of SRX population in their ensembles. However, the DRX rates reported for the WT and E525K 15HPZ proteins in the single ATP turnover assay are ~3-4 fold lower than those of their S1 counterparts. These rates reflect basal turnover of ATP in the open state and thus should not be affected by the presence of the S2 tail, which leads to concerns about the 15HPZ protein activity. In addition, the very high percentage of stuck filaments in the in vitro motility assay for the 15HPZ constructs (despite the use of dark actin) is also concerning for significant amounts of enzymatically inactive protein.

      We thank the reviewer for pointing out the differences in the S1 and HMM DRX rates. We performed additional single turnover measurements with S1, adding two sets of measurements from one additional preparation (N=3), and we demonstrate that there is a significant increase in the DRX rates of WT S1 compared to WT HMM (see pages 4-5, Table 3, Figure 3 – figure supplement 3). A faster rate in S1 was also reported in Rohde et al. 2018. Indeed, the DRX rates of E525K S1 are significantly higher than WT in S1, which we also now report in the results (see page 5, Figure 3 – figure supplement 3). We addressed the concerns about 15HPZ activity by performing NH4+ ATPase assays to demonstrate that the number of active heads was similar in S1 and 15HPZ HMM (see page 4). It is possible that the higher percentage of stuck filaments in the HMM motility is due to myosin heads in the IHM state on the motility surface, which generate a drag force by non-specifically interacting with actin, but further study is necessary to examine this question.

      The authors assert that the E525K mutation represents a new mechanism by which DCM-causing mutations lead to decreased contractility - by stabilizing the sequestered state rather than affecting motor function. However, there is no evaluation of the motor function (actin-activated ATPase activity or in vitro motility) of the E525K S1, which would reveal the effects of the mutation without confounding effects due to the sequestering of heads. Interestingly, in the single ATP turnover assay, the DRX rate of the E525K S1 is >2-fold higher than the WT control, suggesting that the mutation may have effects beyond stabilization of the SRX state. The conclusion that the E525K mutation's effect on myosin function is mediated via stabilization of the SRX state would be strengthened if the effects of the mutation on the motor domain alone were also known.

      We thank the reviewer for this suggestion. We performed actin-activated ATPase assays with WT and E525K S1 and found that E525K increases kcat and lowers KATPase, demonstrating enhanced intrinsic motor activity in the mutant S1 construct (see page 4, Figure 2B). This adds an interesting dimension to the manuscript because we report a mutant that enhances the intrinsic motor activity but stabilizes the SRX/IHM (see Discussion page 10). We did not perform in vitro motility, because this assay depends on the surface attachment strategy, and we would like to compare all constructs with the same attachment strategy using a C-terminal GFP tag (mutant and WT S1 and 15HPZ HMM). Therefore, we are making the S1 construct with a C-terminal GFP tag for this purpose, to be examined in a future study.

      While the authors show strong qualitative correlations between the SRX and IHM states using single ATP turnover, FRET, and EM experiments, attempts to quantitatively compare the fraction of heads in the IHM state using the various experimental approaches are problematic. For example, the R0 value of the FRET pair used here doesn't allow precise measurement of the distances being probed, but the distances are reported and compared to predicted distances. The authors report that the R0 for their FRET pair is 63 Å. Surprisingly, the authors go on to use the steady-state FRET efficiency values to determine the average D-A distance (Fig 5B), which is 100 Å when all heads are in the IHM configuration and becomes larger than that when heads open. An R0 of 63 Å allows a precise distance measurement to be made in the 31.5-94.5 Å range, which corresponds to 0.5-1.5 R0. It is therefore technically incorrect to use the steady-state FRET efficiency values to determine the D-A distance here. Besides, there are several unknown factors here, like the orientation factor (κ²), which further complicate these calculations. Similarly, the quantification of IHM state molecules from the negative stain EM experiments is significantly hampered by the disruptive effect of the grid surface on the structure of the IHM state. The authors find that limiting the contact time with the grid to ~5 s is necessary to preserve the IHM state.

      Despite that, only ~15% WT molecules were seen in the IHM state at low salt (Fig. 6B). In contrast, ~56% E525K molecules were in the IHM state. Both these proteins have similar SRX proportions (Fig. 3C) and similar FRET efficiency values (Fig. 5A) at this salt concentration. This mismatch highlights the problem arising due to not having a measure of the populations from the FRET data. It is not clear if the hugely different proportions of the IHM state in EM experiments are indicative of the relative stability of this state in the two proteins or a random difference in the electrostatic interactions of WT vs mutant with the grid. These experiments do not provide a correct idea of the %IHM in the two proteins. In the absence of any IHM population measurement, it is important to proceed with caution when quantitatively correlating the SRX and IHM.

      We thank the reviewer for pointing out that measuring precise distances by FRET can be difficult. We agree that the low FRET efficiency makes precise distance determination even more challenging. However, FRET is quite good at measuring a change in distance given a specific donor-acceptor pair. We feel our FRET biosensor clearly demonstrates FRET efficiencies that are salt-insensitive in E525K but a clear decrease in FRET at higher salt concentrations in WT. In order to compare the trend in the predicted FRET, based on the single turnover measurements, and the actual FRET, we thought it was important to plot the two together on the same graph. We understand that this could have given the misleading impression that we were reporting actual distances. We have now plotted the FRET efficiency instead of distance as a function of KCl concentration (Figure 5B), to prevent any confusion with reporting distances. In addition, we have emphasized that the data are plotted to allow for a comparison of the trend in the single turnover and FRET data (see pages 6 and 10, Figure 5B).
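
      The reviewer's point about the usable distance range can be illustrated with the standard Förster relation, E = 1 / (1 + (r/R0)^6). The following is a minimal sketch, not the analysis used in the manuscript: it assumes a single donor-acceptor pair and a fixed R0 of 63 Å (the value quoted above), and the function names are hypothetical.

```python
# Illustrative sketch only: assumes one donor-acceptor pair and a fixed R0,
# i.e. it ignores uncertainty in the orientation factor and mixed populations.
R0_ANGSTROM = 63.0  # Forster radius quoted in the review for this FRET pair


def fret_efficiency(r, r0=R0_ANGSTROM):
    """Forster relation: E = 1 / (1 + (r / r0)**6)."""
    return 1.0 / (1.0 + (r / r0) ** 6)


def fret_distance(e, r0=R0_ANGSTROM):
    """Inverse of the Forster relation, valid only for 0 < E < 1."""
    if not 0.0 < e < 1.0:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    return r0 * (1.0 / e - 1.0) ** (1.0 / 6.0)


if __name__ == "__main__":
    # 0.5*R0, R0, 1.5*R0, and two distances beyond the 1.5*R0 limit
    for r in (31.5, 63.0, 94.5, 100.0, 120.0):
        print(f"r = {r:6.1f} A  ->  E = {fret_efficiency(r):.3f}")
```

      Under these assumptions the efficiency is exactly 0.5 at r = R0 and falls to roughly 0.06 at 100 Å, where the curve is so shallow that a small error in E corresponds to a large error in r. This is consistent with reporting trends in FRET efficiency rather than absolute distances.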

      We agree that it is important to proceed with caution when comparing the EM to the FRET and single turnover data. The EM does not give a quantitative estimate of the fraction of IHM molecules, due to the disruptive effect of the grid surface on protein conformation. However, it does provide direct (though qualitative) evidence that the conformation underlying SRX and enhanced FRET is the IHM, and it is consistent with our interpretation that the E525K mutation enhances FRET and SRX by stabilizing the IHM. To consolidate this result, we have performed EM experiments now with a total of 3 preparations of WT and mutant (see page 6-7 and Figure 6D). We find that while there is variability from experiment to experiment, likely because the grid surface is slightly different each time the experiment is performed, in all cases there was a ~4-fold higher fraction of folded molecules in the mutant. Since each WT/mutant experimental pair was studied in parallel, using identically prepared grids, the results provide further evidence that the mutant stabilizes the IHM. However, we agree that a quantitative, direct visual correlation of the SRX and IHM is not possible based on the current EM data.

      Finally, the utility of the methods described in the paper to the field would be greatly enhanced if they were described in more detail. As currently written, it would be difficult for others to replicate these experiments.

      Thank you for the comment. We have made significant changes in the methods to clarify the details of the experiments (see pages 11-14). In addition, we have added details to the results and figure legends.

    1. Author Response

      Reviewer #1 (Public Review):

      “This study investigates the dynamics of brain network connectivity during sustained experimental pain in healthy human participants. To this end, capsaicin was applied to the tongues of two cohorts of participants (discovery cohort, N=48; replication cohort, N=74). This procedure resulted in pain for several minutes. During sustained pain, pain avoidance/intensity ratings and fMRI scans were obtained. The analyses (i) compare the pain state with a resting state, (ii) assess the dynamics of brain networks during sustained pain, and (iii) aim to predict pain based on the dynamics of brain networks. To this end, the analyses focus on community structures of time-evolving networks. The results show that sustained pain is associated with the emergence of a brain network including somatomotor, frontoparietal, basal ganglia and thalamic brain areas. The somatomotor area of the tongue is particularly involved in that network while this area is decoupled from other parts of the somatomotor cortex. Moreover, the network configuration changes over time with the frontoparietal network decoupling from the somatomotor network. Frontoparietal-cerebellar connections were predictive of decreases of pain. Together, the findings provide novel and convincing insights into the dynamics of brain networks during sustained pain.

      Strengths

      • The brain mechanisms of sustained pain are a timely and relevant topic with potential clinical implications.

      • Assessing the dynamics of sustained pain and relating it to the dynamics of brain networks is a timely and promising approach to further the understanding of the brain mechanisms of pain.

      • The study includes discovery and replication cohorts and pursues a cutting-edge analysis strategy.

      • The manuscript is very well-written and the results are visualized in an exemplary manner including a graphical outline and summary of the findings.”

      We thank the reviewer for the thoughtful summarization and evaluation of our study.

      “Weaknesses

      • It remains unclear whether the changes of brain networks over time simply reflect the duration of sustained pain or whether they essentially reflect different levels of pain intensity/avoidance.”

      We appreciate the editor and reviewer’s comment on this issue. With the current experimental paradigm, it is difficult to dissociate the pain duration from the level of pain because the delivery of oral capsaicin commonly induces initial bursting and then a gradual decrease of pain over time. That is, the pain duration is correlated with the pain intensity in our task.

      However, when we examined the time-course of the ratings at each individual level (as shown in Figure S2), the time duration explained 53.7% of the rating variance, R² = 0.537 ± 0.315 (mean ± standard deviation). In addition, if we constrain the beta coefficient of the time duration to be negative (i.e., ratings should decrease over time), the explained variance decreases to 48.2%, R² = 0.482 ± 0.457, leaving enough unexplained variance (i.e., greater than 50%) for examining the distinct effects of time duration and ratings on the patterns of functional brain reorganization.
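
      For readers unfamiliar with this kind of variance estimate, the sketch below shows one way a per-participant R² of ratings against time could be computed. It is an assumption-laden illustration, not the authors' code: it fits an ordinary least-squares line to a single hypothetical rating time-course and does not reproduce the sign-constrained fit described above; the function name and example data are invented.

```python
def r_squared_vs_time(times, ratings):
    """Fit ratings = intercept + slope * time by least squares.

    Returns (slope, r_squared) for one participant's time-course.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(ratings) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ratings))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    # Residual and total sums of squares give the explained variance.
    ss_res = sum((y - (intercept + slope * t)) ** 2
                 for t, y in zip(times, ratings))
    ss_tot = sum((y - mean_y) ** 2 for y in ratings)
    return slope, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical rating time-course: an initial rise followed by decay.
    times = [0, 1, 2, 3, 4, 5, 6, 7]
    ratings = [0.6, 0.9, 0.8, 0.7, 0.5, 0.4, 0.3, 0.2]
    slope, r2 = r_squared_vs_time(times, ratings)
    print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")
```

      In such a sketch, 1 - R² is the share of rating variance that a linear time trend leaves unexplained, which is the portion available for separating rating effects from duration effects.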

      Indeed, the two main analyses included in the manuscript—consensus community detection and predictive modeling—were designed to examine those two aspects of the task, i.e., time duration and pain avoidance ratings, respectively. First, through the consensus community detection analysis, we examined the community structure that changes over time, i.e., across the early, middle, and late periods (as shown in Figure 3). We then developed predictive models of pain avoidance ratings in the second main analysis (as shown in Figure 5).

      Though it is still a caveat that we cannot fully dissociate the effects of time duration versus pain ratings, we could interpret the first set of results to be more about time duration, while the second set of results is more about pain ratings.

      We have now added a description of the implications of predictive modeling for isolating the effects of pain ratings, as well as a discussion of the caveat of the current experimental design and relevant future directions.

      Revisions to the main manuscript:

      p. 25: Moreover, developing models to directly predict the pain ratings is helpful to complement the group-level analysis, because the changes in consensus community structure over the early, middle, and late periods only indirectly reflect the different levels of pain.

      p. 27: This study also had some limitations. First, with the current experimental paradigm, it is difficult to dissociate the pain duration from the level of pain because the delivery of oral capsaicin commonly induces initial bursting and then a gradual decrease of pain over time. Though we aimed to model the effects of pain duration and pain avoidance ratings with our two primary analyses, i.e., consensus community detection and predictive modeling, we cannot fully dissociate the impact of time duration versus pain ratings.

      “• Although the manuscript is very well-written it might benefit from an even clearer and simpler explanation of what the consensus community structure and the underlying module allegiance measure assesses.”

      Thank you for the suggestion. We have now added additional (but simple) descriptions of the module allegiance and consensus community detection methods.

      Revisions to the main manuscript:

      pp. 8-9: Here, the consensus community means the group-level representative structures of the distinct community partitions of individuals. To determine the consensus community across different individuals and times, we first obtained the module allegiance (Bassett et al., 2011) from the community assignment of each individual. Module allegiance assesses how much a pair of nodes is likely to be affiliated with the same community label, and is defined as a matrix T whose element Tij is 1 when nodes i and j are assigned to the same community and 0 when assigned to different communities. This conversion of the categorical community assignments to the continuous module allegiance values allows group-level summarization of different community structures of individuals.
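
      The definition of module allegiance given above can be sketched in a few lines. This is an illustrative example under stated assumptions rather than the pipeline used in the study: the matrix T follows the definition in the text, whereas averaging T across individuals as the group-level summary is our assumption, and the function names and toy partitions are hypothetical.

```python
def module_allegiance(labels):
    """Return matrix T with T[i][j] = 1.0 if nodes i and j share a community."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mean_allegiance(partitions):
    """Average the allegiance matrices of several partitions (assumed summary)."""
    n = len(partitions[0])
    total = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        t = module_allegiance(labels)
        for i in range(n):
            for j in range(n):
                total[i][j] += t[i][j]
    return [[value / len(partitions) for value in row] for row in total]


if __name__ == "__main__":
    # Toy community assignments for four nodes in three individuals.
    # Label values are arbitrary; only co-assignment matters.
    partitions = [
        [0, 0, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
    ]
    for row in mean_allegiance(partitions):
        print(["%.2f" % value for value in row])
```

      Because T depends only on whether two nodes share a label, the arbitrary numbering of communities in each individual drops out, which is what makes a group-level summary of categorical assignments possible.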

      p. 14: Here, high module allegiance indicates the voxels of two regions are likely to be in the same community affiliation, and vice versa.

      “• The added value of the assessment of the dynamics of brain networks remains unclear. Specifically, it is unclear whether the current analysis of brain networks dynamics allows for a clearer distinction between and prediction of pain and no-pain states than other measures of static or dynamic brain activity or static measures of brain connectivity.”

      The main goal (and thus, the added value) of the current study was to provide a “mechanistic” understanding of the brain processes of sustained pain, rather than the “prediction.” Even though we included the results from the predictive modeling, as in Figures 4-6, our focus was more on the interpretation of the model to quantitatively examine the functional changes in the brain, not on the maximization of the prediction performance.

      Indeed, maximizing the prediction performance was the main goal of our previous study (Lee et al., 2021), in which we developed a predictive model of sustained pain based on the patterns of dynamic functional connectivity. The model showed better prediction performances compared to the current study, but it was challenging to interpret the model because of the high dimensionality of the model and its features. In addition, functional connectivity itself provides only limited insight into how functional brain networks are structured and reconfigured over time.

      In this sense, the multi-layer community detection method has several advantages for achieving our goal. First, the community detection analysis allows us to summarize the complex, high-dimensional whole-brain connectivity patterns into neurobiologically interpretable subsystems. Second, the multi-layer community detection method allows us to study the temporal changes in community structure by connecting the same nodes across different time points.

      We have now added a description of the rationale behind the choice of the multi-layer community detection analysis over the conventional functional connectivity methods, and the added value of our study.

      Revisions to the main manuscript:

      p. 3: In this study, we examined the reconfiguration of whole-brain functional networks underlying the natural fluctuation in sustained pain to provide a mechanistic understanding of the brain responses to sustained pain.

      p. 7: In this study, we used this approach to examine the temporal changes of brain network structures during sustained pain, which cannot be done with conventional functional connectivity-based analyses (Lee et al., 2021).

      p. 27: However, the previous model provides a limited level of mechanistic understanding because of the high dimensionality of the model and its features. In addition, functional connectivity itself provides only limited insight into how functional brain networks are structured and reconfigured over time.

      Reviewer #2 (Public Review):

      “The Authors J-J Lee et al. investigated cortical and subcortical brain networks and their organization in communities over time during evoked tonic pain. The paper is well-written, and the findings are interesting and relevant for the field. Interestingly, other than confirming well known phenomena (e.g., segregation within the primary somatomotor cortex), the Authors identified an emerging "pain supersystem" during the initial increase of pain, in which subcortical and frontoparietal regions, usually more segregated, showed more interactions with the primary somatomotor cortex. Decrease of pain was instead associated with a reconfiguration of the networks that sees subcortical and frontoparietal regions connected with areas of the cerebellum. The main novelty of the proposed analysis lies in the resulting high performances of the classifier, which shows how this interesting link between frontoparietal network and subcortical regions with the cerebellum is predictive of pain decrease. In summary, the main strengths of the present manuscript are:

      • Inclusion of subcortical regions: most of the recent papers using the Schaefer parcellation in ~200 brain areas1 do not consider subcortical areas, ignoring possible relevant responses and behaviors of those regions. Not only did the Authors smartly address this issue, but most of their results showed how subcortical regions played a key role in the network reconfiguration over time during evoked sustained pain.

      • Robust classification results: high accuracy obtained on training dataset (internal validation), using a leave-one-out approach, and on the available independent test dataset (external validation) of relatively large sample size (N=74).

      • Clarity in the description of aim and sub-aims and exhaustive presentation of the obtained results helped by appropriate illustrations and figures (I suggest less wording in some of them).

      • Availability of continuous behavioral outcome (track ball).”

      We appreciate the reviewer’s summary and positive evaluations.

      “Even though the results are mostly cohesive with previous literature, some of the results need to be discussed in relationship to recently published papers on the same topic as well as justifying some of the non-standard methodological procedures adding appropriate citations (or more detailed comments). The Authors do not touch upon the concept of temporal summation of pain, historically associated with tonic pain, especially when the study is finalized to better understanding brain mechanisms in chronic pain populations (chronic pain patients often exhibit increased temporal summation of pain2). I would suggest starting from the paper recently published by Cheng et al. that also shares most of the methodological pipeline3 to highlight similarities and novelties and deepen the comparison with the associated literature.”

      We thank the reviewer and editor for the comment on this important topic. Temporal summation of pain refers to a progressively increasing sensation of pain during prolonged noxious stimulation (Price, Hu, Dubner, & Gracely, 1977), and has been suggested as a hallmark of chronic pain disorders including fibromyalgia (Cheng et al., 2022; Price et al., 2002). In a recent study by Cheng et al. (2022), the authors induced tonic pain using constantly high cuff pressure and examined whether the participants experienced increased pain in the late period compared to the early period of pain. In contrast, in our experimental paradigm, the capsaicin liquid initially delivered into the oral cavity is gradually washed out by saliva, and thus overall pain intensity decreased over time rather than increased (Figure 1B). Therefore, the temporal summation of pain may occur in a limited period (e.g., the early period of the run), but it is difficult to examine its effect systematically in our study.

      However, it is notable that Cheng et al.’s results overlap with our findings. For example, Cheng et al. reported the intra-network segregation within the somatomotor network and the inter-network integration between the somatomotor and other networks during the temporal summation of pressure pain in patients with fibromyalgia, which were similar to the findings we reported in Figure S9 and Figure 4. Although it is unclear whether these results reflect the temporal summation of pain, these network-level features shared across the two studies are likely to be an essential component of the sustained pain processes in the brain.

      Now we added a comment on the temporal summation of pain in the main manuscript.

      Revisions to the main manuscript (p. 26):

      Interestingly, a recent fMRI study on the temporal summation of pain in fibromyalgia patients reported results similar to ours (Cheng et al., 2022), including the intra-network dissociation within the somatomotor network and the inter-network integration between the somatomotor and other networks during pain. Although we cannot directly examine whether the temporal summation of pain gave rise to these network-level changes due to the limitation of our experimental paradigm, these consistent findings between the two studies may suggest that our findings could be generalized to clinical conditions.

      We thank the reviewer and editor for the information about this recent publication. Cheng et al. (2022) was not published at the time we wrote the manuscript, and we were surprised that Cheng et al. shares many aspects with our study, e.g., both used multilayer community detection and also reported similar findings, as described above.

      However, there were some differences between the two studies as well.

      First, the focus of our study was on the brain dynamics during the natural time-course of sustained pain from its initiation to remission in healthy participants, whereas the focus of Cheng et al. was on the temporal summation phenomenon of pain (TSP) and the enhanced TSP in patients with fibromyalgia. Because of this difference in research focus, our study and Cheng et al. provide many nonoverlapping results and insights. For example, our study paid particular attention to the coping mechanisms of the brain (e.g., the network-level changes in the subcortical and frontoparietal network regions) and the brain systems that are correlated with the natural decrease of pain (e.g., the cerebellum in Figure 5). In contrast, Cheng et al. (2022) identified the brain connectivity and network features important for the increased TSP in fibromyalgia patients.

      Second, our great interest was in identifying and visualizing the fine-grained spatiotemporal patterns of functional brain network changes over the period of sustained pain. To utilize fine-grained brain activity information, we conducted our main analyses at a voxel-level resolution and on the native brain space, such as in Figures 2-3 and Figures S5, S7, and S8. With this fine-grained spatiotemporal mapping, we were able to identify small, but important voxel-level dynamics.

      We now cited Cheng et al. (2022) in multiple places and revised the manuscript accordingly.

      Revisions to the main manuscript (p. 26):

      Interestingly, a recent fMRI study on the temporal summation of pain in fibromyalgia patients reported results similar to ours (Cheng et al., 2022), including the intra-network dissociation within the somatomotor network and the inter-network integration between the somatomotor and other networks during pain. Although we cannot directly examine whether the temporal summation of pain gave rise to these network-level changes due to the limitation of our experimental paradigm, these consistent findings between the two studies may suggest that our findings could be generalized to clinical conditions.

      “Here the main significant weaknesses of the study:

      • The data analysis is entirely conducted on young healthy subjects. This is not a limitation per se, but the conclusion about offering new insights into understanding mechanisms at the basis of chronic pain is too far from the results. Centralization of pain is very different from summation and habituation, especially if all the subjects in the study consistently rated increased and decreased pain in the same way (it never happens in chronic pain patients). A similar pipeline has been actually applied to chronic pain patients (fibromyalgia and chronic back pain)3,4. Discussing the results of the present paper in relationship to those, could offer a more robust way to connect the Authors' results to networks behavior in pathological brains.”

      We are grateful for the opportunity to discuss the clinical implication of our study. First of all, we agree with the reviewer and editor that we cannot make a definitive claim about chronic pain with the current study, and thus, we revised the last sentence of the abstract to tone down our claim.

      Revisions to the main manuscript (p. 2, in the abstract):

      This study provides new insights into how multiple brain systems dynamically interact to construct and modulate pain experience, advancing our mechanistic understanding of sustained pain.

      However, as we noted above in E-4, some of our findings were consistent with the findings from a previous clinical study (Cheng et al., 2022), suggesting the potential to generalize our study to clinical pain conditions. In addition, we previously reported that a predictive model of sustained pain derived from healthy participants performed better at predicting the pain severity of chronic pain patients than the model derived directly from chronic pain patients (Lee et al., 2021), highlighting the advantage of the “component process approach.”

      The component process approach aims to develop brain-based biomarkers for basic component processes first, which can then serve as intermediate features for the modeling of multiple clinical conditions (Woo, Chang, Lindquist, & Wager, 2017). This has been one of the core ideas of the Research Domain Criteria (RDoC) (Insel et al., 2010) and the Hierarchical Taxonomy of Psychopathology (HiTOP) (Kotov et al., 2017). If the clinical pain of a patient group is modeled as a whole, it becomes unclear what is being modeled because of the multidimensional and heterogeneous nature of clinical pain (Melzack, 1999) as well as other co-occurring health conditions (e.g., mental health issues, medication use, etc.). The component process approach, in contrast, can specify which components are being modeled and are relatively free from heterogeneity and comorbidity issues by experimentally manipulating the specific component of interest in healthy participants.

      The current study was conducted on healthy young adults based on the component process approach. We used oral capsaicin to experimentally induce sustained pain, which unfolds over protracted time periods and has been suggested to reflect some of the essential features of clinical pain (Rainville, Feine, Bushnell, & Duncan, 1992; Stohler & Kowalski, 1999). Therefore, the detailed characterization of the brain processes of sustained pain will be able to serve as an intermediate feature of multiple clinical conditions in future studies.

      Now we added the discussion on the clinical generalizability issue in the discussion section.

      Revisions to the main manuscript:

      pp. 25-26: An interesting future direction would be to examine whether the current results can be generalized to clinical pain. Experimental tonic pain has been known to share similar characteristics with clinical pain (Rainville et al., 1992; Stohler & Kowalski, 1999). In addition, in a recent study, we showed that an fMRI connectivity-based signature for capsaicin-induced orofacial tonic pain can be generalized to chronic back pain (Lee et al., 2021). Therefore, a detailed characterization of the brain responses to sustained pain has the potential to provide useful information about clinical pain.

      p. 26: Interestingly, a recent fMRI study on the temporal summation of pain in fibromyalgia patients reported results similar to ours (Cheng et al., 2022), including the intra-network dissociation within the somatomotor network and the inter-network integration between the somatomotor and other networks during pain. Although we cannot directly examine whether the temporal summation of pain gave rise to these network-level changes due to the limitation of our experimental paradigm, these consistent findings between the two studies may suggest that our findings could be generalized to clinical conditions.

      “Vice versa, the behavioral measure used to assess evoked pain perception (avoidance ratings), has been developed for chronic pain patients and never validated on healthy controls5. It might not be an appropriate measure considering the total absence of pain variability in the reported responses over forty-eight subjects6,7.”

      We acknowledge that pain avoidance measures are not fully validated in the healthy population. Nevertheless, we used this measure in this study for the following two main reasons that outweigh the limitations.

      First, a pain avoidance rating provides an integrative measure that can reflect the multi-dimensional aspects of sustained pain. One of the essential functions of pain is to avoid harmful situations and promote survival, and the avoidance motivation induced by pain is composed of not only sensory-discriminative, but also cognitive components including learning, valuation, and contexts (Melzack, 1999). According to the fear-avoidance model (Vlaeyen & Linton, 2012), if the pain-induced avoidance motivation is not resolved for a long time and is maladaptively associated with innocuous environments, chronic pain is likely to develop, suggesting the importance and clinical relevance of pain avoidance measures. In addition, our experimental design is particularly suitable for the use of avoidance ratings because the oral capsaicin stimulation is accompanied by the urge to avoid the painful sensation, but, as in chronic pain, this urge cannot be resolved immediately. Moreover, capsaicin is sometimes experienced as intense but less aversive (or even appetitive), e.g., by spicy food cravers (Stevenson & Yeomans, 1993). In this case, avoidance ratings can provide a more reasonable measure of pain compared to the intensity rating.

      Second, the avoidance measure provides a common scale on which we can compare different types of aversive experiences, allowing us to conduct specificity tests for a predictive model of pain. For example, a recent study successfully compared the brain representations of two types of pain and two types of aversive, but non-painful experiences (e.g., aversive auditory and visual experiences) using the same avoidance measure (Ceko, Kragel, Woo, Lopez-Sola, & Wager, 2022). These comparisons were possible because the avoidance measure provided one common scale for all the aversive experiences regardless of their types of stimuli.

      To provide a better justification for the use of the avoidance measure, we now included the specificity test results of our pain predictive models. More specifically, we tested our module allegiance-based SVM and PCR models of pain on the aversive taste and aversive odor conditions (Figure S13).

      Despite these advantages, the use of avoidance rating without thorough validation is a limitation of the current study, and thus future studies need to examine the psychometric properties of the avoidance rating, e.g., examining the relationship among pain intensity, unpleasantness, and avoidance measures. However, the current study showed that the predictive models derived with pain avoidance rating (Study 1) could be used to predict the pain intensity rating (Study 2). In addition, the overall time-course of pain avoidance ratings in Study 1 was similar to the time-course of pain intensity ratings in Study 2, providing some supporting evidence for the convergent validity of the pain avoidance measure.

      As to the following comment, “It might not be an appropriate measure considering the total absence of pain variability in the reported responses over forty-eight subjects,” there are pieces of evidence supporting that the low between-individual variability of ratings is due to the characteristics of our experimental design, not to the fact that we used the avoidance measure. As we discussed in more detail in our response to E-1, our experimental procedure based on capsaicin liquid commonly induces the initial burst of painful sensation and the subsequent gradual relief for most of the participants (Figure 1B, left). A similar time-course pattern of ratings was observed in Study 2 (Figure 1B, right), which used the pain “intensity” rating, not the pain avoidance rating. In addition, previous studies with a similar experimental design (i.e., intra-oral capsaicin application) (Berry & Simons, 2020; Lu, Baad-Hansen, List, Zhang, & Svensson, 2013; Ngom, Dubray, Woda, & Dallel, 2001) also showed a similar time-course of pain ratings with low between-individual variability regardless of the rating types (e.g., VAS or irritation intensity), confirming that this observation is not unique to the pain avoidance rating.

      Now we added descriptions on the small between-individual variability of pain ratings and the use of avoidance ratings.

      Revisions to the main manuscript:

      pp. 5-7: Note that the overall trend of pain ratings over time was similar across participants because of the characteristics of our experimental design, which has also been observed in the previous studies that used oral capsaicin (Berry & Simons, 2020; Lu et al., 2013; Ngom et al., 2001). However, also note that each individual’s time-course of pain ratings was not entirely the same (Figures S2 and S3).

      p. 26: However, there are also differences between the characteristics of capsaicin-induced tonic pain versus clinical pain. For example, clinical pain continuously fluctuates over time in an idiosyncratic pattern (Apkarian, Krauss, Fredrickson, & Szeverenyi, 2001), whereas capsaicin-induced tonic pain showed a similar time-course pattern across the participants—i.e., increasing rapidly and then decreasing gradually (Figure 1B). This typical time-course of pain ratings has been reported in previous studies that used oral capsaicin (Berry & Simons, 2020; Lu et al., 2013; Ngom et al., 2001).

      pp. 26-27: Note that Study 1 used a pain avoidance measure that is not yet fully validated in healthy participants. However, we chose to use the pain avoidance measure, which can provide integrative information on the multi-dimensional aspects of pain (Melzack, 1999; Waddell, Newton, Henderson, Somerville, & Main, 1993). It also has a clinical implication considering that the maladaptive associations of pain avoidance to innocuous environments have been suggested as a putative mechanism of transition to chronic pain (Vlaeyen & Linton, 2012). Lastly, the avoidance measure can provide a common scale across different modalities of aversive experience, allowing us to compare their distinct brain representations (Ceko et al., 2022) or test the specificity of their predictive models (Lee et al., 2021) (Figure S13). Although the psychometric properties of the pain avoidance measure should be a topic of future investigation, we expect that the pain avoidance measure would have a high level of convergent validity with pain intensity given the observed similarity between pain avoidance (Study 1) and pain intensity (Study 2) in their temporal profiles. The generalizability of our PCR model across Studies 1 and 2 also supports this speculation. However, there would also be situations in which pain avoidance is dissociated from pain intensity. For example, capsaicin can be experienced to be intense but less aversive or even appetitive in some contexts, such as cravings for spicy food (Stevenson & Yeomans, 1993). In addition, the gradual rise of avoidance ratings during the late period of the control condition in Study 1 would not be observed if the intensity measure was used. Future studies need to examine the relationship between pain avoidance and the other pain assessments and the advantage of using the pain avoidance measure.

      “• The dynamic measure employed by the Authors is better described by the term "windowed functional connectivity". It is often considered a measure of dynamic functional connectivity and it gives information about fluctuations of the connectivity patterns over time. Nevertheless, the entire focus of the paper, including the title, is on dynamic networks, which inaccurately leads one to think of time-varying measures with higher temporal resolution (either updating for every acquired time point, as the Authors did in their previous publication on the same dataset4, or sliding windows involving weighting or tapering8,9). This allows one to follow network reorganization over time without averaging 2-min intervals in which several different brain mechanisms might play an important role3,10,11. In summary, the assumption of constant response throughout 2-min periods of tonic pain and the use of Pearson correlations do not mirror the idea of dynamic analysis expressed by the Authors in title and introduction. I would suggest removing "dynamic" from the title, reducing the emphasis on this concept, addressing possible confounds introduced by the choice of long windows and rephrasing the aim of the study in terms of brain network reconfiguration over the main phases of tonic pain experience.”

      Now we removed the word ‘dynamic’ from many places in the manuscript, including the title. In addition, we added a brief discussion on the reason we chose to use the long and non-overlapping windows for connectivity calculation.

      Revisions to the main manuscript (p. 8):

      Although the long duration of the time window without overlaps may obscure the fine-grained temporal dynamics in functional connectivity patterns, we chose to use this long time window based on previous literature (Bassett et al., 2011; Robinson, Atlas, & Wager, 2015), which also used long time windows to obtain more reliable estimates of network structures and their transitions.
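
      The windowing scheme described above can be illustrated with a minimal sketch. It assumes region-level time series stored as a NumPy array and computes one Pearson correlation matrix per consecutive, non-overlapping window. The function name, the `window_len` parameter, and the synthetic data are hypothetical and are not taken from the manuscript's analysis code.

```python
import numpy as np

def windowed_connectivity(ts, window_len):
    """Pearson correlation matrices over consecutive non-overlapping windows.

    ts: array of shape (n_timepoints, n_regions)
    window_len: time points per window (hypothetical parameter; for 2-min
                windows it would be 120 s divided by the repetition time)
    Returns an array of shape (n_windows, n_regions, n_regions).
    """
    ts = np.asarray(ts, dtype=float)
    n_windows = ts.shape[0] // window_len  # a trailing partial window is dropped
    mats = []
    for w in range(n_windows):
        segment = ts[w * window_len:(w + 1) * window_len]
        # rowvar=False treats columns (regions) as the variables
        mats.append(np.corrcoef(segment, rowvar=False))
    return np.stack(mats)

# Synthetic example only; these numbers are not from the study.
rng = np.random.default_rng(0)
toy = rng.standard_normal((1000, 5))
fc = windowed_connectivity(toy, window_len=100)
```

      Longer windows give each correlation more samples and hence more stable estimates, at the cost of temporal resolution, which is the trade-off the passage above refers to.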

      “• Procedure chosen for evoking sustained pain. To the best of my knowledge, capsaicin sauce on the tongue is not a validated tonic pain procedure. In favor of this argument is the absence of inter-subject variability in the behavioral results shown in the paper, very unusual for response to painful stimulations. The procedure is well described by the Authors, and some precautions like letting the liquid dry before the start of the scan, have helped reduce confounds. Despite this, the measures in figure 1B suggest that the intensity of the painful stimulation is not constant as expected for sustained pain (probably the effect washes out with the saliva). In this case, the first six-minute interval requires particular attention because it encapsulates the real tonic pain phase, and the following ones require more appropriate labels. Ideally the Authors should cite previous studies showing that tongue evoked pain elicits a very specific behavioral response (summation, habituation/decrease of pain, absence of pain perception). If those works are missing, this response needs to be treated as a finding rather than an obvious point.”

      We addressed this comment. Moreover, we found previous studies that experimentally induced tonic pain through the application of capsaicin on the tongue (Berry & Simons, 2020; Boudreau, Wang, Svensson, Sessle, & Arendt-Nielsen, 2009; Green, 1991; Ngom et al., 2001), suggesting that our experimental procedure is in line with previous literature.

      Reviewer #3 (Public Review):

      “In their manuscript, Lee and colleagues explore the dynamics of the functional community structure of the brain (as measured with fMRI) during sustained experimental pain and provide several potentially highly valuable insights into, and evaluate the predictive capacity of, the underlying dynamic processes. The applied methodology is novel but, at the same time, straightforward and has solid foundations. The findings are very interesting and, potentially, of high scientific impact as they may significantly push the boundaries of our understanding of the dynamic neural processes during sustained pain, with a (somewhat limited) potential for clinical translation.

      However (Major Issue 1), after reading the current manuscript version, not all of my doubts have been dissolved regarding the specificity of the results to pain. Moreover (Major Issue 2), some of the results (specifically, those related to the group level analysis of community differences) do not seem to be underpinned with a proper statistical inference in the current version of the manuscript and, therefore, their presentation and discussion may not be proportional to the degree of evidence. Next to these Major Issues (detailed below), some other, minor clarifications might also be needed before publication. These are detailed below or in the private part of the review ("Recommendations for the authors").

      Despite these issues, this is, in general, a high quality work with a high level of novelty and - after addressing the issues - it has a very high potential for becoming an important contribution (and a very interesting read) to the pain-research community and beyond.”

      We appreciate the reviewer’s thoughtful comments. We have revised the manuscript to address the Reviewer’s major concerns, as described below.

      “Major Issue 1:

      The main issue with the manuscript is that it remains somewhat unclear, how specific the results are to pain.

      Differences between the control resting state and the capsaicin trials might be - at least partially - driven by other factors, like:

      • motion artifacts

      • saliency, attention, anxiety, etc.

      Differences between stages over the time-course might, additionally, be driven by scanner drifts (to which the applied approach might be less sensitive, but the possibility is still there) or other gradual processes, e.g. shifts in arousal, attention shifts, alertness, etc.

      All the above factors might emerge as confounding bias in both of the predictive models.

      This problem should be thoroughly discussed, and at least the following extra analyses are recommended, in order to attenuate concerns related to the overall specificity and neurobiological validity of the results:

      • reporting of, and testing for motion estimates (mean, max, median framewise displacement or anything similar)

      • examining whether these factors might, at least partially, drive the predictive models.

      • e.g. applying the PCR model on the resting state data and verifying if the predicted timecourse is flat (no inverse U-shape, that is characteristic to all capsaicin trials).

      Not using the additional sessions (bitter taste, aversive odor, phasic heat) feels like a missed opportunity, as they could also be very helpful in addressing this issue.”

      We thank the reviewer for this comment on the important issue regarding the specificity of our results and the potential influences of noise. The effects of head motion and physiological confounds are particularly relevant to pain studies because pain involves substantial physiological changes and often causes head motion. To address the related concerns of specificity, we conducted additional analyses assessing the independence of our predictive models (i.e., SVM and PCR models) from head movement and physiology variables and the specificity of our models to pain versus non-painful aversive conditions (i.e., bitter taste and aversive odor) in Study 1.

      First, we examined the overall changes of framewise displacement (FD) (Power, Barnes, Snyder, Schlaggar, & Petersen, 2012), heart rate (HR), and respiratory rate (RR) in the capsaicin condition (Figure S11). For the univariate comparison between the capsaicin vs. control conditions (Figure S11A), the results showed that, as expected, the capsaicin condition caused significant changes in head motion and autonomic responses. The mean FD and HR were significantly higher, and the RR was lower in the capsaicin condition compared to the control condition (FD: t(47) = 5.30, P = 2.98 × 10^-6; HR: t(43) = 4.98, P = 1.10 × 10^-5; RR: t(43) = -1.91, P = 0.063, paired t-test). In addition, the increased motion and autonomic responses were more prominent in the early period of pain (Figure S11B). The 10-binned (2 mins per time-bin) FD and HR showed a decreasing trend while the RR showed an increasing trend over time in the capsaicin condition. The comparisons between the early (1-3 bins, 0-6 min) vs. late (8-10 bins, 14-20 min) periods of the capsaicin condition showed significant differences both for FD and HR (FD: t(47) = 6.45, P = 8.12 × 10^-8; HR: t(43) = 6.52, P = 6.41 × 10^-8; RR: t(43) = -1.61, P = 0.11, paired t-test). These results suggest that while participants were experiencing capsaicin tonic pain, particularly during the early period, head motion and heart rate were increased, while breathing was slowed down. Note that we needed to exclude 4 participants’ data in this analysis due to technical issues with the physiological data acquisition.
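
      For readers unfamiliar with the motion measure, the following is a minimal sketch of framewise displacement in the sense of Power et al. (2012): the sum of absolute volume-to-volume changes in the six rigid-body parameters, with rotations converted to millimeters as arc length on a 50 mm sphere. The column order, the use of radians, and the helper `bin_means` are assumptions for illustration; the preprocessing actually used in the study may differ.

```python
import numpy as np

def framewise_displacement(motion, radius_mm=50.0):
    """Framewise displacement per volume.

    motion: array of shape (n_volumes, 6); columns are assumed to be three
            translations in mm followed by three rotations in radians.
    Returns an array of length n_volumes (first volume set to 0).
    """
    motion = np.asarray(motion, dtype=float)
    diffs = np.abs(np.diff(motion, axis=0))
    diffs[:, 3:] *= radius_mm  # rotation (rad) -> arc length (mm) on a sphere
    return np.concatenate([[0.0], diffs.sum(axis=1)])

def bin_means(values, n_bins):
    """Average a time series within n_bins consecutive, roughly equal bins."""
    return np.array([chunk.mean() for chunk in np.array_split(values, n_bins)])

# Tiny synthetic example; values are made up for illustration.
motion = np.zeros((4, 6))
motion[1, 0] = 1.0    # 1 mm translation at volume 1
motion[2, 3] = 0.01   # 0.01 rad rotation at volume 2
fd = framewise_displacement(motion)
fd_binned = bin_means(fd, n_bins=2)
```

      Binning the per-volume values, as in `bin_means`, mirrors the idea of summarizing FD within each 2-min time-bin before comparing early and late periods.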

      Next, we examined whether the changes in head motion and physiological responses influenced our predictive model performance (Figure S12). We first regressed out the mean FD, HR, and RR (concatenated across conditions and participants as we trained the SVM model) from the predicted values of the SVM model with leave-one-subject-out cross-validation (2 conditions × 44 participants = 88) and then calculated the classification accuracy again (Figure S12A). The results showed that the SVM model showed a reduced, but still significant classification accuracy for the capsaicin versus control conditions in a forced-choice test (n = 44, accuracy = 89%, P = 1.41 × 10^-7, binomial test, two-tailed). We also did the same analysis for the PCR model (10 time-bins × 44 participants = 440) and the PCR model also showed a significant prediction performance (n = 44, mean prediction-outcome correlation r = 0.20, P = 0.003, bootstrap test, two-tailed, mean squared error = 0.159 ± 0.022 [mean ± s.e.m.]) (Figure S12B). These results suggest that our SVM and PCR models capture unique variance in tonic pain above and beyond the head movement and physiological changes.
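
      The logic of this control analysis can be sketched as follows: remove the linear contribution of the confounds from the model's predicted values, recompute the forced-choice accuracy, and test it against chance with an exact two-tailed binomial test. This is an illustrative sketch only; the function names are hypothetical, and the regression and test details of the actual analysis are not specified beyond what is stated above.

```python
import numpy as np
from math import comb

def residualize(y, confounds):
    """Remove the linear contribution of confounds (plus an intercept) from y."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), confounds])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def forced_choice_accuracy(pain_scores, control_scores):
    """Fraction of participants whose pain score exceeds their control score."""
    return float(np.mean(np.asarray(pain_scores) > np.asarray(control_scores)))

def binomial_two_tailed(k, n, p=0.5):
    """Exact two-tailed binomial P value (sum of outcomes no more likely than k)."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))

# Synthetic check: a score fully explained by a confound leaves no residual.
confound = np.linspace(0.0, 1.0, 20)
residual = residualize(1.0 + 3.0 * confound, confound)
accuracy = forced_choice_accuracy([2.0, 1.5, 0.2], [1.0, 1.0, 1.0])
p_value = binomial_two_tailed(39, 44)
```

      If 89% accuracy corresponds to 39 of 44 participants (an inference from the rounded percentage, not a figure stated in the response), `binomial_two_tailed(39, 44)` gives approximately 1.41 × 10^-7.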

      Lastly, we examined the specificity of our predictive models to pain, by testing the models on the non-painful but aversive conditions including the bitter taste (induced by quinine) and aversive odor (induced by fermented skate) conditions (Figure S13). All the model responses were obtained using leave-one-participant-out cross-validation. The results showed that the overall model responses of the SVM model for the bitter taste and aversive odor conditions were higher than those for the control condition but lower than the capsaicin condition (Figure S13A). Classification accuracies for comparing capsaicin vs. bitter taste and capsaicin vs. aversive odor were all significant (for capsaicin vs. bitter taste, accuracy = 79%, P = 6.17 × 10^-5, binomial test, two-tailed, Figure S13C; for capsaicin vs. aversive odor, accuracy = 83%, P = 3.31 × 10^-6, binomial test, two-tailed, Figure S13E), supporting the specificity of our SVM model of pain. Similarly, the model responses of the PCR model for the bitter taste and aversive odor conditions were lower than the capsaicin condition, and their temporal trajectories were less steep and fluctuating compared to the capsaicin condition (Figure S13B). The time-course of the model responses for the control condition was flatter than all other conditions and did not show the inverted U-shape. Furthermore, the model responses of the bitter taste and aversive odor conditions did not show significant correlations with the actual avoidance ratings (bitter taste: mean prediction-outcome correlation r = 0.05, P = 0.41, bootstrap test, two-tailed, mean squared error = 0.036 ± 0.006 [mean ± s.e.m.], Figure S13D; aversive odor: mean prediction-outcome correlation r = 0.12, P = 0.06, bootstrap test, two-tailed, mean squared error = 0.044 ± 0.004 [mean ± s.e.m.], Figure S13F), suggesting the specificity of the PCR model to pain.
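
      The "mean prediction-outcome correlation" with a bootstrap test can be illustrated with a short sketch: compute a within-participant Pearson correlation between predicted and actual ratings across time-bins, average across participants, and resample participants with replacement to test the mean against zero. The percentile-style P value below is one common convention and is an assumption; the exact bootstrap procedure used in the study is not described here, and all names are hypothetical.

```python
import numpy as np

def within_subject_r(pred, actual):
    """Pearson r between predicted and actual ratings for each participant.

    pred, actual: arrays of shape (n_subjects, n_bins).
    """
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    pred_c = pred - pred.mean(axis=1, keepdims=True)
    actual_c = actual - actual.mean(axis=1, keepdims=True)
    num = (pred_c * actual_c).sum(axis=1)
    den = np.sqrt((pred_c ** 2).sum(axis=1) * (actual_c ** 2).sum(axis=1))
    return num / den

def bootstrap_mean_test(values, n_boot=10000, seed=0):
    """Mean of values and a two-tailed bootstrap P value for mean != 0."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # Twice the smaller fraction of bootstrap means on either side of zero
    p = 2 * min((boot_means <= 0).mean(), (boot_means >= 0).mean())
    return float(values.mean()), float(min(1.0, p))

# Synthetic example: predictions identical to outcomes give r = 1 everywhere.
rng = np.random.default_rng(1)
ratings = rng.standard_normal((8, 10))
r_values = within_subject_r(ratings, ratings)
mean_r, p_boot = bootstrap_mean_test(r_values, n_boot=1000)
```

      Resampling participants rather than time-bins treats the participant as the unit of inference, which matches how the correlations are reported per sample (n = 44) in the text.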

      Overall, we have provided evidence that our models can predict pain ratings above and beyond the head motion and physiological changes and that the models are more responsive to pain compared to non-painful aversive conditions.

      Now we added descriptions on the specificity tests to the main manuscript and also to the Supplementary Information.

      Revisions to the main manuscript (p. 20):

      Specificity of the module allegiance-based predictive models

      To examine whether the predictive models were specific to pain and the prediction performances were not influenced by confounding variables such as head motion and physiological changes, we conducted additional analyses as shown in Figures S11-13. The SVM and PCR models showed significant prediction performances even after controlling for head motion (i.e., framewise displacement) and physiological responses (i.e., heart rate and respiratory rate) (Figures S11 and S12) and did not respond to the non-painful but aversive conditions including the bitter taste and aversive odor conditions (Figure S13), supporting the specificity of our predictive models to pain. For details, please see Supplementary Results.

      Revisions to the Supplementary Information (pp. 2-4):

      Specificity analysis (Figures S11-13). To examine whether the predictive models (i.e., SVM and PCR models) were specific to pain and not influenced by confounding noise, we conducted additional specificity analyses assessing the independence of the models from head movement and physiology variables and the specificity of our models to pain versus non-painful aversive conditions (i.e., bitter taste and aversive odor) in Study 1. First, we examined the overall changes of framewise displacement (FD) (Power et al., 2012), heart rate (HR), and respiratory rate (RR) in sustained pain (Figure S11). For the univariate comparison between the capsaicin and control conditions (Figure S11A), the results showed that, as expected, the capsaicin condition caused significant changes in motion and autonomic responses. The mean FD and HR were significantly higher, and the RR was lower, in the capsaicin condition compared to the control condition (FD: t(47) = 5.30, P = 2.98 × 10⁻⁶; HR: t(43) = 4.98, P = 1.10 × 10⁻⁵; RR: t(43) = -1.91, P = 0.063, paired t-test). For the temporal changes of movement and physiology variables (Figure S11B), the results showed that the increased motion and autonomic responses were more prominent in the early period of pain. The 10-binned (2 min per time-chunk) FD and HR showed a decreasing trend while the RR showed an increasing trend over time in the capsaicin condition. Additional univariate comparisons between the early (bins 1-3, 0-6 min) and late (bins 8-10, 14-20 min) periods of the capsaicin condition showed that the differences were significant for FD and HR (FD: t(47) = 6.45, P = 8.12 × 10⁻⁸; HR: t(43) = 6.52, P = 6.41 × 10⁻⁸; RR: t(43) = -1.61, P = 0.11, paired t-test). This suggests that while participants were experiencing tonic pain, particularly in the early period, motion and heart rate were increased but breathing was slowed. Note that we needed to exclude 4 participants’ data due to technical issues with physiological data acquisition.
Next, we examined whether the head movement and physiological responses were the main drivers of our predictive models (Figure S12). For all the original signature responses from the SVM model (2 conditions × 44 participants = 88), we regressed out the mean FD, HR, and RR (concatenated across conditions and participants, as the SVM model was trained) and calculated the classification accuracy (Figure S12A). Although the signature responses were controlled for movement and physiology variables, the SVM model still showed a high classification accuracy for the capsaicin versus control conditions in a forced-choice test (n = 44, accuracy = 89%, P = 1.41 × 10⁻⁷, binomial test, two-tailed). Similarly, for all the original signature responses from the PCR model (10 time-bins × 44 participants = 440), we regressed out the 10-binned FD, HR, and RR (concatenated across time-bins and participants, as the PCR model was trained) and calculated the within-individual prediction-outcome correlation (Figure S12B). Again, the PCR model showed a significantly high predictive performance (n = 44, mean prediction-outcome correlation r = 0.20, P = 0.003, bootstrap test, two-tailed, mean squared error = 0.159 ± 0.022 [mean ± s.e.m.]) while controlling for movement and physiology variables. These results suggest that our SVM and PCR models capture unique variance in tonic pain above and beyond the head movement and physiological changes. Lastly, we examined the specificity of our predictive models to pain by testing the models on non-painful but tonic aversive conditions, including bitter taste (induced by quinine) and aversive odor (induced by fermented skate) (Figure S13). All the signature responses were obtained using leave-one-participant-out cross-validation. The results showed that the overall signature responses of the SVM model for the bitter taste and aversive odor conditions were higher than those for the control condition, but lower than those for the capsaicin condition (Figure S13A).
Classification accuracies for capsaicin vs. bitter taste and capsaicin vs. aversive odor were both significantly high (capsaicin vs. bitter taste: accuracy = 79%, P = 6.17 × 10⁻⁵, binomial test, two-tailed, Figure S13C; capsaicin vs. aversive odor: accuracy = 83%, P = 3.31 × 10⁻⁶, binomial test, two-tailed, Figure S13E), suggesting the specificity of the SVM model to pain. Similarly, the temporal trajectories of the signature responses of the PCR model for the bitter taste and aversive odor conditions did not overlap with that of the capsaicin condition (Figure S13B). Furthermore, the signature responses of the bitter taste and aversive odor conditions did not show a significant relationship with the actual avoidance ratings (bitter taste: mean prediction-outcome correlation r = 0.05, P = 0.41, bootstrap test, two-tailed, mean squared error = 0.036 ± 0.006 [mean ± s.e.m.], Figure S13D; aversive odor: mean prediction-outcome correlation r = 0.12, P = 0.06, bootstrap test, two-tailed, mean squared error = 0.044 ± 0.004 [mean ± s.e.m.], Figure S13F), suggesting the specificity of the PCR model to pain. Overall, we have provided evidence that the module allegiance-based models can predict pain ratings above and beyond the movement and physiological changes, and are more responsive to pain compared to non-painful aversive conditions, which suggests the specificity of our results to pain.
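
The forced-choice result quoted above for the SVM model after regressing out FD, HR, and RR (n = 44, accuracy = 89%, two-tailed binomial test) can be illustrated with an exact binomial test against chance (0.5). This is a minimal sketch, not the authors' code: the count of 39 correct out of 44 is an assumption inferred from the rounded 89%, and the function name is hypothetical.

```python
from math import comb

def binomial_two_tailed_p(n_correct: int, n_total: int) -> float:
    """Exact two-tailed binomial P-value against a chance level of 0.5.

    Because the null distribution is symmetric at 0.5, the two-tailed
    P-value is twice the probability of the more extreme tail.
    """
    k = max(n_correct, n_total - n_correct)  # distance from chance, upper tail
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1))
    return min(1.0, 2 * tail / 2 ** n_total)

# Assumed count: 39 of 44 participants classified correctly (about 89%).
p_forced_choice = binomial_two_tailed_p(39, 44)
print(f"P = {p_forced_choice:.3g}")
```

Under that assumed count, the P-value is on the order of 10⁻⁷, in line with the value reported in the response.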

      “Major Issue 2:

      Another important issue with the manuscript is the (apparent) lack of statistical inference when analyzing the differences in the group-level consensus community structures (both when comparing capsaicin to control and when analysing changes over the time-course of the capsaicin-challenge).

      Although I agree that the observed changes seem biologically plausible and fit very well to previous results, without proper statistical inference we can't determine, how likely such differences are to emerge just by chance.

      This makes all results on Figs. 2 and 3, and points 1, 4 and 5 in the discussion partially or fully speculative or weakly underpinned, comprising a large proportion of the current version of the manuscript.

      Let me note, that this issue only affects part of the results and the remaining - more solid - results may already provide a substantial scientific contribution (which might already be sufficient to be eligible for publication in eLife, in my opinion).

      Therefore I see two main ways of handling Major Issue 2:

      • enhancing (or clarifying potential misunderstandings regarding) the methodology (see my concrete, and hopefully feasible, suggestions in the "private part" of the review),

      • de-weighting the presentation and the discussion of the related results.

      I believe there are many ways to test the significance of these differences. I highlight two possible, permutation testing-based ideas.

      Idea 1: permuting the labels ctr-capsaicin, or early-mid-late, repeating the analysis, constructing the proper null distribution of e.g. the community size changes and obtain the p-values. Idea 2: "trace back" communities to the individual level and do (nonparametric) statistical inference there.”

      We appreciate this important comment. We did not conduct statistical inference when comparing the group-level consensus community affiliations of the different conditions (Figure 2) or different phases (Figure 3) because of the difficulty in matching the community affiliation values of the networks to be compared.

      For example, let us assume that 800 out of 1,000 voxels of community #1 and 1,000 out of 4,000 voxels of community #2 in the control condition are commonly affiliated with the same community #3 in the capsaicin condition. To compare the community affiliation between the two conditions, we should first match the community label of the capsaicin condition (i.e., #3) to that of the control condition (i.e., #1 or #2), and here a dilemma occurs: if we prioritize the proportion of overlapping voxels for the matching, the common community should be labeled as #1, whereas if we prioritize the number of overlapping voxels, the label of the common community should be #2. Although both choices look reasonable, neither is a perfect solution.
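
The dilemma in this example can be made concrete with a few lines of arithmetic. The sketch below uses only the voxel counts given in the example; the variable names are illustrative.

```python
# Voxel counts from the example: control communities #1 and #2 both
# overlap with capsaicin community #3.
control_size = {1: 1000, 2: 4000}   # total voxels per control community
overlap_with_3 = {1: 800, 2: 1000}  # voxels shared with capsaicin community #3

# Criterion A: match by the proportion of the control community that overlaps.
proportion = {c: overlap_with_3[c] / control_size[c] for c in control_size}
match_by_proportion = max(proportion, key=proportion.get)

# Criterion B: match by the absolute number of overlapping voxels.
match_by_count = max(overlap_with_3, key=overlap_with_3.get)

print(proportion)            # {1: 0.8, 2: 0.25}
print(match_by_proportion)   # 1
print(match_by_count)        # 2
```

The two criteria pick different labels for the same community, which is why any single matching rule is somewhat arbitrary.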

      As the example above shows, it is impossible to exactly match the community affiliations of different networks. We must choose an imperfect criterion for the matching procedure, which inevitably affects the comparison of network structure. This was the main reason that we limited our results of Figures 2-3 to a qualitative description based on visual inspection. Moreover, the group-level consensus community structures in Figures 2-3 are not a simple group statistic like a sample mean; they were obtained from multiple steps of analysis, including permutation-based thresholding and unsupervised clustering, which could further complicate the interpretation of statistical tests.

      Alternatively, there is a slightly different but more rigorous approach to comparing community structures, the Phi-test (Alexander-Bloch et al., 2012; Lerman-Sinkoff & Barch, 2016). Instead of using the community labels directly, this method converts the community label of each voxel into a list of module allegiance values between the seed voxel and all the voxels of the brain (i.e., 1 if the seed and target voxels have the same community label and 0 otherwise). This allows quantitative comparisons of voxel-level community profiles between different conditions without arbitrary matching of the community labels. We adopted this Phi-test for our analyses to examine whether the regional community affiliation pattern is significantly different between (i) the capsaicin vs. control conditions and (ii) the early vs. late periods of pain (Figure S6), which correspond to the main findings of Figures 2 and 3 in our manuscript, respectively.

      More specifically, to compare the group-level consensus community structures between the capsaicin vs. control conditions and the early vs. late periods, we first obtained a seed-based module allegiance map for each voxel (i.e., using each voxel as a seed). Then, we calculated a correlation coefficient of the module allegiance values between two different conditions for each voxel. This correlation coefficient can serve as an estimate of the voxel-level similarity of the consensus community profile. Because module allegiance is a binary variable, these correlation values are Phi coefficients. A small Phi coefficient means that the spatial pattern of brain regions that have the same community affiliation as the given voxel is different between the two conditions. For example, if a voxel is connected to the somatomotor-dominant community during the capsaicin condition and the default-mode-dominant community during the control condition, the brain regions that have the same community label as the voxel will be very different, and thus the Phi coefficient will be small. Moreover, the Phi coefficient can be small even if a voxel is assigned the same (matched) community label in both conditions, when the spatial pattern of that community differs between conditions.
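
The seed-based comparison described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the function names and the six-voxel label vectors are hypothetical, and the Phi coefficient is computed as a Pearson correlation between two binary allegiance vectors.

```python
import numpy as np

def module_allegiance(labels, seed):
    """Binary vector: 1 where a voxel shares the seed voxel's community, else 0."""
    labels = np.asarray(labels)
    return (labels == labels[seed]).astype(float)

def phi_coefficient(labels_a, labels_b, seed):
    """Pearson correlation of two binary allegiance vectors (the Phi coefficient)."""
    a = module_allegiance(labels_a, seed)
    b = module_allegiance(labels_b, seed)
    if a.std() == 0 or b.std() == 0:
        return float("nan")  # undefined when every voxel shares one community
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical community labels for six voxels in two conditions.
control = [1, 1, 1, 2, 2, 2]
capsaicin = [3, 3, 3, 3, 4, 4]   # label names differ; no matching is needed
phi_seed0 = phi_coefficient(control, capsaicin, seed=0)

# Identical partitions with different label names give Phi = 1.
phi_same = phi_coefficient([1, 1, 2, 2], [7, 7, 9, 9], seed=0)
print(round(phi_seed0, 3), round(phi_same, 3))
```

Because only "same community as the seed or not" enters the calculation, the result does not depend on how communities are numbered in each condition.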

      To calculate the statistical significance of the Phi coefficient, we conducted permutation tests, in which we randomly shuffled the condition labels in each participant and obtained the group-level consensus community structure for each shuffled condition. Then, we calculated the voxel-level correlations of the module allegiance values between the two shuffled conditions. We repeated this procedure 1,000 times to generate the null distribution of the Phi coefficients, and calculated the proportion of null samples that have a smaller Phi coefficient (i.e., a more dis-similar regional community structure) than the non-shuffled original data.
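
The final step of this permutation test, turning a null distribution into a one-tailed P-value, can be sketched as below. The null values here are made-up numbers for illustration; generating a real null distribution would require rerunning the consensus community detection on each label-shuffled dataset, which is not shown. The function name is hypothetical.

```python
import numpy as np

def permutation_p_smaller(observed_phi, null_phis):
    """One-tailed P-value: proportion of null Phi coefficients that are
    smaller (more dissimilar) than the observed Phi coefficient."""
    null_phis = np.asarray(null_phis, dtype=float)
    return float(np.mean(null_phis < observed_phi))

# Made-up null distribution for a single voxel (1,000 iterations in the text).
toy_null = [0.9, 0.8, 0.7, 0.6]
toy_p = permutation_p_smaller(0.65, toy_null)
print(toy_p)  # 0.25
```

With this definition a voxel is significant when few shuffled datasets produce a community profile as dissimilar as the real contrast does.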

      The results showed that there were multiple voxels with statistical significance (permutation tests with 1,000 iterations, one-tailed) in the areas where the community affiliations of the two contrasting conditions were different (Figure S6). For example, the frontoparietal and subcortical regions for the capsaicin vs. control comparison (cf. Figure 2), and the frontoparietal, subcortical, brainstem, and cerebellar regions for the early vs. late period of pain (cf. Figure 3) contain voxels that survived thresholding at FDR-corrected q < 0.05, suggesting the robustness of our main results.

      In particular, the somatomotor and insular cortices showed statistical significance in the permutation test, and this may reflect large changes in other areas that connect to the somatomotor and insular cortices across different conditions. Statistical significance was also observed in the visual cortex, which was unexpected. Our interpretation is that the spatial distribution of the visual network community is highly stable across conditions, and thus the permutations produced a very narrow null distribution of Phi coefficients. Therefore, even a small change in the community structure could achieve statistical significance.

      We have now added descriptions of the permutation tests.

      Revisions to the main manuscript:

      p. 9: Permutation tests confirmed that the community assignment in the frontoparietal and subcortical regions showed significant changes between the capsaicin versus control conditions (Figure S6A).

      p. 13: Permutation tests further confirmed that the community assignment in the frontoparietal, subcortical, and brainstem regions showed significant changes between the early versus late period of pain (Figure S6B).

      pp. 36-37: Permutation tests for regional differences in community structures. To test the statistical significance of the voxel-level difference of consensus community structures (Figures 2 and 3), we performed the following Phi-test (Alexander-Bloch et al., 2012; Lerman-Sinkoff & Barch, 2016). First, for each given voxel, we compared the community label of the voxel to the community labels of all the voxels, generating a list of voxel-seed module allegiance values that allows quantitative comparison of voxel-level community profiles (e.g., [1, 0, 1, 1, 0, 0, ...], whose elements are equal to 1 if the seed and target voxels were assigned to the same community and 0 otherwise). Next, a correlation coefficient was calculated between the module allegiance values of the two different brain community structures (i.e., capsaicin versus control, and early versus late). This correlation coefficient is an estimate of the regional similarity of community profiles (here, the correlation coefficient is a Phi coefficient because module allegiance is a binary variable). To estimate the statistical significance of the Phi coefficient, we performed permutation tests, in which we randomly shuffled the labels and then obtained the group-level consensus community structures from the shuffled data. Then, the Phi coefficient between the module allegiance values of the two shuffled consensus community structures was calculated. We repeated this procedure 1,000 times to generate the null distribution of the Phi coefficient for each voxel. Lastly, we examined the probability of observing a smaller Phi coefficient (i.e., a more dissimilar community profile) than the one from the non-shuffled original data, which corresponds to the P-value of the permutation test. All the P-values were one-tailed as the hypothesis of this permutation test is unidirectional.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript by Chen et al., the authors use live-cell single-molecule imaging to dissect the role of DNA binding domains (DBD) and activation domains (AD) in transcription factor mobility in the nucleus. They focus on the family of Hypoxia-Inducible Factor isoforms, which dimerize and bind chromatin to induce a transcriptional response. The main finding is that activation domains can be involved in DNA binding as indicated by careful observations of the diffusion/reaction kinetics of transcription factors in the nucleus. For example, different bound fractions of HIF-1beta and HIF-2alpha are observed in the presence of different binding partners and chimeras. The paradigm of interchangeable parts of transcription factors has been eroded over the years (the recent work of Naama Barkai comes to mind, cited herein), so the present observations are not unexpected per se. Yet, the measurements are rigorous and well-performed and have the important benefit of being in living cells. Enthusiasm is also dampened by the exclusive use of one technique and one analysis to reach conclusions.

      In the revised manuscript we complement the single molecule imaging experiments with genomic approaches, including Cut&Run and RNA-seq, that largely confirm our main conclusions derived from the SPT results. 

      Reviewer #2 (Public Review):

      The authors raise the very important question how different transcription factors with similar in vitro DNA sequence specificity are able to achieve distinct binding profiles associated with distinct functions. They use hypoxia inducible factors (HIF) as model system and combine live cell single-particle tracking with comprehensive genetic and chemical perturbations to study the mechanisms underlying isoform-specific gene regulation. Their main experimental readout is the distribution of diffusion coefficients of a molecular species, extracted from a population of single-particle trajectories. From this distribution, the authors extract the fractions of immobile and mobile molecules as well as the peak diffusion coefficient of the mobile fraction. They find that in addition to the structured DNA binding domain and the dimerization interface of HIF-1a and HIF-2a, the C-terminus of those factors, which includes intrinsically disordered regions and an activation domain, contributes to modulating the bound fraction of HIF-1b and the HIF-a isoforms. In particular, the C-terminus of HIF-2a mediates a higher bound fraction than the one of HIF-1a. This finding is important as it demonstrates that separating HIF into distinct domains that each have clearly defined functions is an oversimplification. Rather, a more holistic view seems suitable, in which all parts of HIF contribute to nuclear diffusion and binding.

      The conclusions drawn on the bound fractions and the nuclear dynamics of HIF isoforms are mostly backed up by data and proper controls. However, some controls are missing and some aspects of data analysis need to be clarified and extended. Moreover, the authors fail to answer their initial question, as the experimental readout does not contain information on the DNA sequences involved in the binding events.

      Experimental controls:

      For some imaging experiments, the authors use cell lines where endogenous HIF-1b or HIF-2a was fused to a N-terminal HaloTag by CRISPR/Cas editing. These cell lines are comprehensively controlled for proper functionality of the edited transcription factors, including expression levels, cellular localization and DNA binding. However, differential expression compared to unedited levels is not quantified and only Halo-HIF-2a is tested for functional gene transcription.

      To confirm that the tagged proteins still maintain normal function in driving target gene expression, we performed RNA-seq on WT cells, HaloTag-HIF-2α KIN cells, and Halo-HIF-1β KIN cells, and show that gene expression in these edited cells does not differ significantly from that in unedited WT cells (Figure 1—figure supplement 3B, C).

      Other experiments include overexpression of exogenously expressed factors. For those, the authors give statements such as "expressed from a relatively strong ... promoter" and "weakly expressed", but do not provide any control of the amount of overexpression. Quantifying the expression levels will be important, as some of the author's experiments demonstrate a strong dependency of results on expression level. 

      We have now included Western Blot results showing L30-driven expression of all HIF variants in comparison with KIN levels (Fig 4—Figure Supplement 1). However, we note that cells stably expressing the HIF variants are polyclonal and Western Blotting is a bulk assay only able to assess the population average. As such, Western blot analysis may not reflect the actual expression level in the individual cells used in the imaging experiments. To properly control HIF expression at the individual cell level, we instead monitored the protein concentration in each cell and only chose to image cells with similar fluorescence level, as measured by localization density (Fig 4—Figure Supplement 1 and see detailed discussion in Appendix 2).

      Moreover, the authors do not provide any control for proper functionality of domain-swap mutants.

      We now include RNA-seq results demonstrating that WT cells over-expressing HIF-α WT and domain-swap variants (Halo-HIF-1α, Halo-HIF-1α/2α, Halo-HIF-2α, Halo-HIF-2α/1α) can activate their specific target genes, confirming that all these variants are also transcriptionally active. (See Figure 6A, B and Figure 6—figure supplement 2: increased binding of wild-type or domain-swapped HIF to several gene loci or neighboring regions coincides with increased transcription levels of these genes; and Figure 7: HIF-expressing cells with the same HIF-IDR co-cluster in their mRNA transcription profiles.)

      The authors further state that they use a high illumination power of 1100 mW. Such high laser power might be detrimental to cells and the authors should control whether this laser power induces any artifacts.

      We agree that a high illumination power (indispensable for achieving a high signal-to-noise ratio and detecting single molecules) may be detrimental to cells in the long run. However, we only took 1 movie with < 2000 frames for each cell. With a 5-ms frame interval, the total imaging duration per cell was under 10 seconds. Cells are unlikely to respond to any stimulus/damage in such a short time. Moreover, we used stroboscopic illumination instead of continuous illumination, with only 1-ms laser exposure for each 5-ms frame. The total integrated laser exposure is thus only 2 seconds. In addition, all imaging was done with a red laser (633 nm), which has relatively low phototoxicity. Furthermore, the 1100 mW is the output from the laser box, but the actual laser power density used for imaging was measured to be approximately 2.3 kW/cm² at 633 nm (Graham et al., 2021). Such an imaging scheme is very unlikely to generate phototoxicity artifacts within the short time window of our measurements. Lastly, we are comparing results across all conditions with the exact same imaging set-up, so any artifact should be accounted for and controlled. We do consider fast SPT a terminal, end-point experiment, where each cell is only imaged once and never re-used.
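
The exposure figures in this reply follow from simple arithmetic, sketched below with the numbers stated in the text. Since the movies had fewer than 2000 frames, the results are upper bounds; the variable names are illustrative.

```python
# Numbers stated in the response (frame count is an upper bound).
max_frames = 2000          # "< 2000 frames" per cell
frame_interval_ms = 5.0    # one frame every 5 ms
laser_pulse_ms = 1.0       # stroboscopic pulse per frame

total_imaging_s = max_frames * frame_interval_ms / 1000.0
total_laser_exposure_s = max_frames * laser_pulse_ms / 1000.0
duty_cycle = laser_pulse_ms / frame_interval_ms

print(total_imaging_s)         # 10.0 (upper bound, seconds)
print(total_laser_exposure_s)  # 2.0 (upper bound, seconds)
print(duty_cycle)              # 0.2
```

So the laser is on for at most one fifth of an imaging session that itself lasts at most about 10 seconds.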

      Data analysis:

      Distributions of diffusion coefficients greatly vary between individual cells (e.g. Fig. 2A and B, Fig. S3A and C, Fig. S4E). Unfortunately, the authors do not explain whether this variation is a real cell-to-cell variation, or rather reflects variation of their analysis method, potentially due to a low number of single particle tracks per cell. 

      We agree with the reviewer that the cell-to-cell variation we observed could be due to a low number of trajectories collected for each cell. In fact, sampling small numbers of trajectories allows us to identify protein species with unique diffusion coefficients, which might be lost if we just looked at a large population. Also, the fact that the diffusion coefficient distribution varies between cells does not mean that a particular cell only contains the more prevalent species that was detected. Here we are not trying to determine whether proteins in each cell indeed behave differently or whether the observed variation in the diffusion coefficient distribution is simply an effect of the limited trajectories collected in each cell. We instead analyzed data collected from many cells combined to get a better estimation of the population behavior. We have modified our text to make this important point clear to the readers. 

      Moreover, the bound fraction of HIF-1b differs between two independent measurements including three biological replicates each (Fig. 5 C and F). This raises the concern that not enough data enter each biological replicate, or not enough replicates are considered.

      Unfortunately, the number of cells that could be measured in our current setup is limited. It takes approximately 1 hour to collect 20 cells per sample, including staining, washing, looking for cells with desired expression level, and acquiring movies. For experiments with multiple conditions (>12), 20 cells per sample is the upper limit that can fit into a single day. 

      To address the question of the minimum number of cells/replicates needed, we included the result of a bootstrapping analysis in Figure 2—figure supplement 3. We used data collected from a total of 243 cells of the same cell line, from over 11 replicates, as the “population” and performed a bootstrapping analysis to identify the source of variation. We have also included Appendix 1 with a detailed discussion. Our results showed that cell-to-cell variation contributes most to the total variation of the data, followed by day-to-day (replicate-to-replicate) variation. However, sampling over 800 trajectories from over 60 cells imaged in 3 replicates approximates the “population value” well (bound fraction calculated from 243 cells from over 11 replicates). As a result, in each figure we always used over 60 cells from 3 replicates to generate the reported parameters. Although this approach still gives variable numbers from figure to figure, the variations seen for the same cell line are much smaller than the differences observed between different cell lines/conditions.
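
The idea behind such a bootstrapping analysis can be sketched with synthetic data. Everything below is an assumption made for illustration (the per-cell trajectory counts, the Beta-distributed per-cell bound fractions, and the function name); it is not the authors' analysis and it ignores the replicate (day-to-day) level.

```python
import numpy as np

def bootstrap_bound_fraction(bound, total, n_cells, n_boot, rng):
    """Resample cells with replacement and pool their trajectories to
    estimate the bound fraction, repeated n_boot times."""
    bound = np.asarray(bound)
    total = np.asarray(total)
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, len(total), size=n_cells)
        estimates[i] = bound[idx].sum() / total[idx].sum()
    return estimates

rng = np.random.default_rng(0)
n_population = 243                                   # cells in the "population"
total = rng.integers(10, 41, size=n_population)      # synthetic trajectories per cell
p_cell = rng.beta(4, 6, size=n_population)           # synthetic cell-to-cell variation
bound = rng.binomial(total, p_cell)                  # synthetic bound trajectories
population_value = bound.sum() / total.sum()

results = {}
for n_cells in (10, 60):
    results[n_cells] = bootstrap_bound_fraction(bound, total, n_cells, 1000, rng)
    print(n_cells, "cells: spread =", round(float(results[n_cells].std()), 3))
```

In this toy setting the spread of the estimate shrinks as more cells are sampled, which is the kind of behavior used to decide how many cells are enough.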

      The authors compare the bound fractions among various mutants and experimental conditions. However, the peak diffusion is not, or only descriptively, evaluated. Thus, it is not clear whether the main effect of a mutation or chemical treatment is to change the bound fraction, or rather the diffusion coefficient of the mobile fraction. 

      Since there might be multiple mobile populations (defined as the fraction with a diffusion coefficient > 0.5 μm²/sec), the mean diffusion coefficient can change while the mode (peak) diffusion coefficient stays the same, and vice versa. Because of this complexity in the mobile population, we prefer to use descriptive words to report the trend of the change instead of reporting exact values. However, as requested, we have added peak diffusion coefficient information to the relevant figures as bar plots. We have also included in Table 1 a summary of the mean and mode diffusion coefficients estimated for moving molecules in all relevant figures for the reader’s reference. Note that the diffusion coefficient estimation is on a log scale, and the larger the diffusion coefficient, the lower the resolution (e.g., there is a 1-grid difference both between 2.63 and 2.75, and between 9.55 and 10).
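
The resolution remark can be illustrated with a log-spaced grid. The step of 0.02 in log10 units is an assumption inferred from the four values quoted in the reply; the actual grid used by the analysis software is not stated here, and the helper name is hypothetical.

```python
def grid_point(k, log10_step=0.02):
    """k-th point of a hypothetical grid evenly spaced in log10(D)."""
    return 10 ** (log10_step * k)

# Neighboring grid points near 2.7 and near 10 (values in μm²/sec).
low_pair = (grid_point(21), grid_point(22))
high_pair = (grid_point(49), grid_point(50))

low_gap = low_pair[1] - low_pair[0]
high_gap = high_pair[1] - high_pair[0]

print([round(x, 2) for x in low_pair], round(low_gap, 2))
print([round(x, 2) for x in high_pair], round(high_gap, 2))
```

One grid step always corresponds to the same ratio between neighboring values, so the absolute gap between neighbors grows with the diffusion coefficient.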

      Conclusions:

      The authors provide data that highlight a potential role of the intrinsically disordered domain of HIF in modulating the bound fraction of these transcription factors. They further claim that the intrinsically disordered domains have a main contribution to this bound fraction. However, the authors do not quantify how this contribution relates to those of the DNA binding domain or the dimerisation interface. Changes in bound fraction estimated from the data in e.g. Fig. 3C, Fig. 4C, Fig. 5C and F rather hint to a dominant effect of dimerisation, followed by DNA binding and a smaller contribution of the intrinsically disordered domain. The authors should quantify the relative changes of the bound fraction for all mutants and experimental conditions, to clarify the importance of the contribution of the intrinsically disordered domain.

      It would be ideal if we could quantify what percent of the bound fraction is contributed by dimerization interface, DBD and IDR, respectively. However, it is very likely that these different domains do not act independently of each other in terms of binding to chromatin fibers. In practice, it is very difficult to dissect and quantify these effects independently. For example, we did try to express HIF-1α and 2α with their IDR completely deleted; however, because the protein-degradation signals are within the IDRs, these deletions caused massive stabilization of these proteins, making it impossible to find cells that express these forms at similar levels as the full-length counterpart. As a result, although these IDR-deleted HIF-α show greatly reduced binding, we did not include the results in the paper because the loss of binding could also be due to the overall higher protein expression levels, leading to large unbound fractions. Regarding the DBD mutants, they only have 1 mutation, so it is hard to tell whether the remaining binding in Figure 5B is due to some residual binding affinity of HIF-α (HIF-α only partially lost its binding affinity), or is due to binding through its partner HIF-1β (HIF-α completely lost binding affinity, but can still bind through dimerization with HIF-1β). All we can safely conclude from Figure 5B is that HIF-α DBD is required for optimal binding, but we cannot determine how much exactly it contributes to binding. We thus argue that, given the interdependence of the different protein domains, the reviewer’s request is not experimentally feasible.

      The authors state that the intrinsically disordered domains of HIF determine their differential binding specificity to chromatin. However, the experiments provided do not allow for such a conclusion. In particular, measuring changes in the bound fractions is not sufficient. Such a conclusion requires a method that is able to inform about the DNA sequences involved in HIF binding, for example chromatin immunoprecipitation.

      As requested, we have included new Cut&Run and RNA-seq results in the revised manuscript showing HIF-α-IDR-specific binding and gene activation.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors report a public browser in which users can easily investigate associations between PGSs for a wide range of traits, and a large set of metabolites measured by the Nightingale platform in UKBB. This browser can potentially be used for identifying novel biomarkers for disease traits or, alternatively, for identifying novel causal pathways for traits of interest.

      Overall I have no major technical concerns about the study, but I would encourage the authors to revisit whether they can find a more compelling example that can better showcase the work that they have done. I understand that this is partly a resource paper but I think the resource itself can have more impact if the paper provides a clearer use-case for how it can drive novel biological insight.

      Many thanks for your comments. We have undertaken a new application of bi-directional Mendelian randomization to demonstrate how users may use this approach to disentangle whether associations in our atlas likely reflect either causes or consequences of PGS traits/diseases. This example is described on page 9:

      ‘For example, we applied Mendelian randomization (MR) to further evaluate associations highlighted in our atlas with triglyceride-rich very low density lipoprotein (VLDL) particles. For instance, both VLDL particle average diameter size and concentration were associated with the PGS for body mass index (BMI) (Beta=0.04, 95% CI=0.033 to 0.046, P<1x10⁻³⁰⁰ & Beta=0.012, 95% CI=0.006 to 0.019, P=2.7x10⁻⁴ respectively) and coronary heart disease (CHD) (Beta=0.026, 95% CI=0.019 to 0.032, P<1x10⁻³⁰⁰ & Beta=0.035, 95% CI=0.028 to 0.042, P<1x10⁻³⁰⁰ respectively). Conducting bi-directional MR suggested that the associations with average diameter of VLDL particles are likely attributed to a consequence of BMI and CHD liability as opposed to the size of VLDL particles having a causal influence on these outcomes (Supplementary Table 6). In contrast, MR analyses suggested that the concentration of VLDL particles increases risk of CHD (Beta=1.28 per 1-SD change in VLDL particle concentration, 95% CI=1.25 to 1.65, P=2.8x10⁻⁷) which may explain associations between the CHD PGS and this metabolic trait within our atlas.’
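
As a rough consistency check on a reported estimate, a two-sided P-value can be approximated from the effect size and its 95% confidence interval under a normal approximation. The sketch below applies this to the BMI PGS association with VLDL particle concentration (Beta=0.012, 95% CI=0.006 to 0.019); the function name is hypothetical, and because the inputs are rounded the result is only indicative.

```python
import math

def approx_p_from_ci(beta, lower, upper, z_crit=1.96):
    """Two-sided P-value from a point estimate and its 95% CI,
    assuming a normally distributed estimator."""
    se = (upper - lower) / (2 * z_crit)   # CI half-width divided by 1.96
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2))

p_bmi_vldl_concentration = approx_p_from_ci(0.012, 0.006, 0.019)
print(f"approximate P = {p_bmi_vldl_concentration:.1e}")
```

With these rounded inputs the approximation lands on the order of 10⁻⁴.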

      and discussed in the discussion on page 21:

      ‘We likewise conducted bi-directional MR to demonstrate that associations between the CHD PGS and VLDL particle size likely reflect an effect of CHD liability on this metabolic trait. In contrast, the association between the CHD PGS and VLDL concentrations is likely attributed to the causal influence of this metabolic trait on CHD risk, suggesting that it is the concentration of these triglyceride-rich particles that is important in terms of the aetiology of CHD risk as opposed to their actual size. We envisage that findings from our atlas, as well as other ongoing efforts which leverage the large-scale NMR data within UKB, should facilitate further granular insight into lipoprotein lipid biology.’
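
      As a rough illustration of the bi-directional MR logic described above, the sketch below computes a fixed-effect inverse-variance weighted (IVW) estimate from per-variant summary statistics. The choice of IVW, the function name `ivw_estimate` and all numbers are assumptions made for illustration; the response does not state which MR estimator was used. A bi-directional analysis would run the same function twice with the exposure and outcome roles swapped, each time using instruments selected for the trait treated as the exposure.

```python
import numpy as np

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect of exposure on outcome.

    Each variant contributes a Wald ratio (beta_outcome / beta_exposure),
    weighted by beta_exposure**2 / se_outcome**2 (first-order weights).
    """
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    se = np.asarray(se_outcome, dtype=float)
    weights = bx**2 / se**2
    estimate = np.sum(weights * (by / bx)) / np.sum(weights)
    std_error = np.sqrt(1.0 / np.sum(weights))
    return estimate, std_error

# Hypothetical summary statistics for three instruments (not from the paper).
bx = [0.10, 0.20, 0.05]    # variant effects on the exposure
by = [0.05, 0.10, 0.025]   # variant effects on the outcome
se = [0.01, 0.01, 0.01]    # standard errors of the outcome effects

estimate, std_error = ivw_estimate(bx, by, se)
print(f"IVW estimate = {estimate:.3f}, SE = {std_error:.3f}")
```

      In this toy input every variant has the same Wald ratio, so the pooled estimate equals that ratio. Real analyses would also need pleiotropy-robust sensitivity estimators, which this sketch omits.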

      PGS construction: It's unclear how well the PGS work. Should the reader prefer the stringent or lenient PGS? Perhaps there could be some validation with traits that have decent sample sizes in UKBB. Was there any filtering to remove traits with few GWS hits, low sample sizes, or low SNP heritability as these are unlikely to produce useful PGSs?

      An example of validation was previously included for the chronic kidney disease PGS and its association with circulating creatinine, although this has now been removed due to the feedback you provided in your comments below. However, we have now provided the weights for all of the PGS included in our web atlas should users want to use these scores for prediction purposes (page 7):

      ‘The specific weights for clumped variants used in all PGS can be found at https://tinyurl.com/PGSweights.’

      On page 8 we have mentioned that in this work we have used a more lenient threshold to facilitate endeavours in a ‘reverse gear Mendelian randomization’ framework. However, the option to use the more stringent threshold remains an option for users interested in this as an alternative:

      ‘In this paper, we have discussed findings using PGS that were derived using the more lenient criteria (i.e., P<0.05 & r2<0.1), although all findings based on both thresholds can be found in the web atlas.’

      ‘Specifically, we believe our findings can facilitate a ‘reverse gear Mendelian randomization’ approach to disentangle whether associations likely reflect metabolic traits acting as a cause or consequence of disease risk (Holmes and Davey Smith, 2019) as illustrated using triglyceride-rich very low density lipoprotein (VLDL) particles in the next section.’

      We have not filtered on other criteria such as the number of SNPs, given that certain scores, despite being constructed from only a few SNPs, may still prove useful to users. For example, our score for ‘Drinks per day’ based on the more stringent threshold (i.e. P<5x10-8) consists of only 6 SNPs. However, one of these is rs1229984, a missense variant located at the alcohol dehydrogenase ADH1B gene region and known to be a strong predictor of alcohol use (e.g. https://pubmed.ncbi.nlm.nih.gov/31745073/).
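
      To make the difference between the lenient and stringent scores concrete, the following minimal sketch builds a PGS as a weighted sum of effect-allele dosages over variants that pass a P-value threshold. It assumes the variants have already been clumped; the function name `polygenic_score` and the toy dosages, weights and P-values are hypothetical and are not taken from the atlas.

```python
import numpy as np

def polygenic_score(dosages, betas, pvalues, p_threshold):
    """Return (scores, n_variants_used) for one P-value threshold.

    dosages: (n_individuals, n_variants) effect-allele counts in 0..2
    betas:   (n_variants,) GWAS effect sizes used as weights
    pvalues: (n_variants,) GWAS P-values for the same, already clumped, variants
    """
    dosages = np.asarray(dosages, dtype=float)
    betas = np.asarray(betas, dtype=float)
    keep = np.asarray(pvalues, dtype=float) < p_threshold
    return dosages[:, keep] @ betas[keep], int(keep.sum())

# Toy data: 3 individuals, 4 clumped variants (illustrative values only).
dosages = np.array([[0, 1, 2, 1],
                    [2, 0, 1, 0],
                    [1, 1, 0, 2]])
betas = np.array([0.10, -0.05, 0.02, 0.30])
pvalues = np.array([1e-9, 0.03, 0.20, 4e-8])

lenient_score, n_lenient = polygenic_score(dosages, betas, pvalues, 0.05)
stringent_score, n_stringent = polygenic_score(dosages, betas, pvalues, 5e-8)
print(n_lenient, lenient_score)
print(n_stringent, stringent_score)
```

      The stringent score uses fewer variants, but as the ‘Drinks per day’ example shows, a small number of strong instruments can still carry most of the signal.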

      Reviewer #2 (Public Review):

      The authors set out to create an atlas of associations between phenome-wide polygenic scores and circulating lipids, fatty acids, and metabolites. To do so, they utilize GWAS from 129 traits available in the OpenGWAS database to derive polygenic (risk) scores (PGS) along with the recently released NMR metabolomics data containing 249 biomarkers (and ratios) in ~120,000 UK Biobank participants. The authors create a publicly available web portal containing PGS to NMR biomarker associations:

      http://mrcieu.mrsoftware.org/metabolites_PGS_atlas/.

      The strength of this study is in the comprehensive nature of the atlas, containing associations for 129 traits phenome-wide, the large sample size of the UK Biobank NMR data, and the use of PGS for prioritising molecular traits for follow-up experiments, which is an emerging area of interest (International Common Disease Alliance, 2020; Ritchie et al., 2021a). To our knowledge this study is the first to explore this for circulating metabolites.

      In its current form the atlas has several limitations, which should be straightforward to address. Notably, results in the current atlas may be confounded by (1) technical variation in the NMR data (Ritchie et al., 2021b), and (2) major biological determinants of biomarker concentrations, including body mass index, fasting time, and statin usage.

      Firstly, thank you for the suggestion to use your ‘ukbnmr’ R package to help remove technical variation from the UK Biobank NMR metabolites data. We have applied it to remove outliers and variation in the individual data due to (1) the duration between sample preparation and sample measurement, (2) the position of samples on shipment plates, and (3) the different equipment (spectrometers) used. This meant that we needed to re-run our entire analysis pipeline for this project from scratch on the updated dataset. Results do not appear to have changed drastically; nonetheless, we have updated the results from all downstream analyses in our online web atlas using the updated dataset provided by ‘ukbnmr’.

      Secondly, the reviewer is correct that biological factors, such as body mass index (BMI) and statin usage, are indeed strongly correlated with metabolite levels. However, we are not able to adjust for such biological factors directly in our analyses, given that they are potential colliders in the causal relationship between diseases/traits and metabolites. Statin usage may be caused by both a high genetic liability to coronary artery disease and abnormal lipoprotein lipid levels. Likewise, obesity (and changes in BMI) may result from a high genetic predisposition to cardiometabolic disorders and disrupted metabolism. Thus, adjusting for statin usage and BMI will induce collider bias (https://jamanetwork.com/journals/jama/fullarticle/2790247), which creates spurious associations between the disease/trait PGS and metabolites.

      To better illustrate this issue, we have added additional text on page 14 to justify this study design decision as well as added a new figure (Figure 3) to help demonstrate this clearly to the readers. Fasting time on the other hand we believe is unlikely to act as a collider and was adjusted as a covariate in all linear regression models in this work. This is mentioned on page 25.
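
      The collider argument can also be illustrated with a small simulation. In the sketch below, a PGS and a metabolite are generated independently, and a third variable (a continuous stand-in for statin-use liability) is caused by both. All variable names, effect sizes and the sample size are assumptions chosen for illustration and do not come from UK Biobank data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

pgs = rng.normal(size=n)          # disease PGS
metabolite = rng.normal(size=n)   # independent of the PGS by construction
# Collider: caused by both the PGS and the metabolite, plus noise.
statin_liability = pgs + metabolite + rng.normal(size=n)

def ols_slope(y, *covariates):
    """Coefficient of the first covariate in a linear regression with intercept."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

unadjusted = ols_slope(metabolite, pgs)
adjusted = ols_slope(metabolite, pgs, statin_liability)
print(f"unadjusted PGS effect: {unadjusted:.3f}")
print(f"adjusted for collider: {adjusted:.3f}")
```

      Under these assumptions the unadjusted estimate is close to zero, while conditioning on the collider produces a clearly non-zero (here negative) PGS coefficient even though no causal effect was simulated.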

      …Further, association results for two (of the 129) PGSs, systolic blood pressure (SBP) and diastolic blood pressure (DBP), are invalid (vastly inflated) as the GWASs used to construct these PGSs included UK Biobank samples.

      Many thanks for your suggestion. We have now removed the SBP and DBP PGS from our atlas due to overlapping samples in UKB. Furthermore, our colleagues at the University of Bristol have notified us that the Glioma GWAS data obtained from the OpenGWAS platform was uploaded with incorrect effect alleles. This PGS has also been subsequently removed from the atlas. Additionally, we removed the Alzheimer’s disease (without APOE) PGS because the pleiotropic effect of lipid associated genes is now systematically examined using lipid gene excluded PGS.

      To demonstrate how one might use these PGS to NMR biomarker associations to prioritise (or deprioritise) findings for follow-up, the authors select a biomarker of interest, glycoprotein acetyls (GlycA), to perform bi-directional Mendelian randomization to orient the direction of causal effects between GlycA and traits of associated PGS. However, the conclusions of this analysis are hampered by the heterogeneous nature of the GlycA biomarker, which captures the levels of five proteins in circulation (Otvos et al., 2015; Ritchie et al., 2019), making it a difficult target to appropriately instrument for Mendelian randomization analysis. This, however, does not detract from the broader point the authors make: that PGS can help prioritize molecular traits for experimental follow-up.

      We have now conducted further sensitivity analyses to evaluate the genetically predicted effects of each of the five proteins in the reference you have provided. This is discussed on page 11:

      ‘We also conducted further sensitivity analyses given that the NMR signal of GlycA is a composite signal contributed by the glycan N-acetylglucosamine residues on five acute-phase proteins, including alpha1-acid glycoprotein, haptoglobin, alpha1-antitrypsin, alpha1-antichymotrypsin, and transferrin (Otvos et al., 2015). Using cis-acting plasma protein (where possible) and expression quantitative trait loci (pQTLs and eQTLs) as instrumental variables for these proteins (Supplementary Table 12) did not provide convincing evidence that they play a role in disease risk for associations between PGS and GlycA (Supplementary Table 13). The only effect estimate robust to multiple testing was found for higher genetically predicted alpha1-antitrypsin levels on gamma glutamyl transferase (GGT) levels (Beta=0.05 SD change in GGT per 1 SD increase in protein levels, 95% CI=0.03 to 0.07, FDR=3.6x10-3), although this was not replicated when using estimates of genetic associations with GGT levels from a larger GWAS conducted in the UK Biobank data (Beta=1.6x10-3, 95% CI=-6.9 x10-3 to 0.01, P=0.71). For details of pleiotropy robust analysis and replication results see Supplementary Table 14.’

      There are also several important limitations to the study which cannot be addressed, which the authors discuss appropriately in the paper. First, the NMR data does not provide a comprehensive view of the metabolome - it is heavily focused on lipids and fatty acids. Many small metabolites in circulation cannot be measured by NMR spectroscopy, and further insights must wait for data from molecular profiling efforts planned or underway in UK Biobank (e.g. mass spectrometry). Second, the authors restricted analysis to participants of European ancestries. This is a pragmatic analysis choice given (1) the PGSs were derived from GWAS performed in European ancestries, (2) PGS associations are particularly susceptible to confounding from genetic stratification and differences in environment, and (3) the very small sample sizes for which NMR data is currently available in UK Biobank participants. Finally, although a large sample size, UK Biobank is not a random sample of the population: healthy adults are over-represented, meaning PGS to metabolite associations may be different in disease cases or less healthy individuals.

      Overall this study has strong potential, with straightforward to address limitations, and the resulting atlas will provide a useful characterisation of the relationships between NMR biomarkers and polygenic predisposition to various traits and diseases, which can be used by domain experts to prioritise biomarkers or traits for experimental follow-up.

      Reviewer #3 (Public Review):

      Fang et al. created an atlas for associations between the genetic liability of common risk factors or complex disorders and the abundance of small molecules as well as the characteristics of major apolipoproteins in blood. The whole study is well executed, and the statistical framework is sound. A clear strength of the study is the large array of common risk factors and diseases analyzed by means of polygenic risk scores (PGS). Further, the development of an open access platform with appealing graphical display of study results is another strength of the work. Such a reference catalog can help to identify novel biomarkers for diseases and possible causative mechanisms. The authors further show how such a systematic investigation can also help to distinguish cause from consequence. For example, an inflammatory molecule readily measured by the NMR platform and strongly associated in observational studies is likely to be a consequence rather than a cause of common complex diseases.

      However, in its current form, the study suffers from some weaknesses that would need to be addressed to improve the applicability of the 'atlas'. This includes a distinction of locus-specific versus real polygenic effects, that is, to what extent are findings for a PGS driven by strong single genetic variants that have been shown to have dramatic impact on small molecule concentrations in blood.

      Thank you for your suggestions to help refine our work. In line with this comment, we have 1) repeated all analyses after applying the ‘ukbnmr’ R package, as recommended by reviewer #2, to remove technical variation and outliers, and 2) conducted sensitivity analyses that remove an established list of lipid gene loci from PGS construction. Full results can be interrogated in the web atlas to evaluate whether PGS associations may be driven by locus-specific effects at these regions, which may be particularly informative given the representation of lipoprotein lipid metabolites on the NMR panel. Findings are reported on page 19:

      ‘The polygenic nature of complex traits means that the inclusion of highly weighted pleiotropic genetic variants in PGS may introduce bias into genetic associations within our atlas. To provide insight into this issue, we constructed PGS excluding variants within the regions of the genome which encode the genes for 14 major regulators of NMR lipoprotein lipids signals which captured 75% of the gene-metabolite associations in the Finnish Metabolic Syndrome In Men (METSIM) cohort (Gallois et al., 2019). For details of these genes see Supplementary Table 5.

      For PGS with these lipid loci excluded, anthropometric traits such as waist-to-hip ratio (N=209), waist circumference (N=206) and body mass index (N=205) still provided strong evidence of association with the majority of metabolic measurements on the NMR panel based on multiple testing corrections. Elsewhere however, the Alzheimer’s disease PGS, which was associated with 60 metabolic traits robust to P<0.05/19 in the initial analysis including these lipid loci (Supplementary Table 17), provided no convincing evidence of association with the 249 circulating metabolites after excluding the lipid loci based on the same multiple testing threshold (Supplementary Table 18). Further inspection suggested that the likely explanation for this attenuation of evidence was variants located within the APOE locus, which are recognised to exert their influence on phenotypic traits via horizontally pleiotropic pathways (Ferguson et al., 2020).’

      …Further, it is unclear how much NMR spectroscopy adds over and above established clinical biomarkers, such as LDL-cholesterol or total triglycerides. This is in particular important, since the authors do not adequately distinguish between small molecules, such as amino acids, and characteristics of lipoprotein particles, e.g., the cholesterol content of VLDL, LDL or HDL particles, the latter presenting the vast majority of measures provided by the NMR platform. Finally, the study would benefit from more intriguing or novel examples, how such an atlas could help to identify novel biomarkers or potential causal metabolites, or lipoprotein measures other than the long-established markers named in the manuscript, such as creatinine or lipoproteins.

      To address these comments, we have added a new example focusing on the granular measures of VLDL particles provided by the NMR data (on top of the examples listed at the start of the response to reviewer document), which, as the reviewer points out, is one of the strengths of the measures generated by this platform over long-established biomarkers (page 21):

      ‘We likewise conducted bi-directional MR to demonstrate that associations between the CHD PGS and VLDL particle size likely reflect an effect of CHD liability on this metabolic trait. In contrast, the association between the CHD PGS and VLDL concentrations is likely attributed to the causal influence of this metabolic trait on CHD risk, suggesting that it is the concentration of these triglyceride-rich particles that is important in terms of the aetiology of CHD risk as opposed to their actual size. We envisage that findings from our atlas, as well as other ongoing efforts which leverage the large-scale NMR data within UKB, should facilitate further granular insight into lipoprotein lipid biology.’

    1. Author Response

      We appreciate the thoughtful and thorough critique provided by the two reviewers, and generally agree with their assessment. The revised submission will address the issues they raise. In particular, we agree that the framework of the paper should be broadened to include bacteria and the deep literature associated with coincidental selection.

    1. Author Response

      Evaluation Summary:

      The work by Volante et al. studied a new plasmid partition system, in which the authors discovered that four or more contiguous ParS sequence repeats are required to assemble a stable partitioning ParAB complex and to activate the ParA ATPase. The work reveals a new plasmid partitioning mechanism in which the mechanic property of DNA and its interaction with the partition complex may drive the directional movement of the plasmid.

      Thank you for the kind evaluation. However, we wonder about the description of the pSM19035 partition system we studied here as “a new plasmid partition system”. The system itself is quite old. The editor might have meant “new” as a subject of research, but plasmid partition systems involving RHH-ParB proteins have been studied by a number of groups for some time, including the Alonso Lab, which worked on the pSM19035 partition system for a number of years prior to our current collaboration for this paper. Therefore, we wonder whether the term “new” is the most appropriate.

      Reviewer #1 (Public Review):

      This is a very thorough biochemical work that investigated the ParABS system in pSM19035 by Volante et al. Volante et al showed convincingly that a specific architecture of the centromere (parS) of pSM19035 is required to assemble a stable/functional partition complex. Minimally, four consecutive parS are required for the formation of partition complex, and to efficiently activate the ATPase activity of ParA. The work is very interesting, and the discovery will allow the community to compare and contrast to the more widespread/more investigated canonical chromosomal ParABS system (where ParB is a sliding CTPase protein clamp, and a single parS site is often sufficient to assemble a working partition complex). All the main conclusions in the abstract are justified and supported by biochemical data with appropriate controls. A proposed multistep mechanism of partition complex assembly and disassembly (summarized in Fig 6) is reasonable. Perhaps the only shortcoming of this work is that the team does not yet get to the bottom of why four consecutive parS are needed.

      Thank you for the kind evaluation. The last point is an important one. We would like to continue to test our current model to either obtain stronger supporting evidence or come up with a better alternative model.

      Reviewer #2 (Public Review):

      ParBs come in two variations, RHH and HTH. In this study, the authors examine the in vitro behavior of the RHH system, which is less studied. Two activities were carefully monitored; ATPase activation and ParA removal from DNA. The system is quite complex, but the authors have done a good job of examining parameter space. One question concerns the physiological relevance. Can this be assessed by uncoupling ParA/ParB expression (making it inducible with IPTG from the chromosome, for example) and testing plasmids with the various constructs?

      This is an excellent point; we agree this is a shortcoming of the current study. As described in response to “Essential Revisions”, we very much wanted to include in this paper an experiment testing in vivo plasmid stability for parSpSM sites of different sizes, and we put significant effort into it. However, we encountered certain technical issues with the approach we tried, and we failed to obtain conclusive data in a timely fashion before we ran out of time. Although we had preliminary data that appeared to be consistent with the notion that shorter parS sites are non-functional and full-size parS sites are functional, the experiment had certain flaws that we could not rectify immediately to our satisfaction. Therefore, we decided to postpone this part of the project and plan for a broader physiological evaluation of the parSpSM sequence arrangements in the near future. In the revision, we mention at the beginning of the discussion that the in vivo functional test of parSpSM site requirements still remains to be carried out.

      The authors appear to suggest that the requirement for at least 4 ParB binding sites is due to the inability of ParBs of this type to spread inferring that for the ParB-HTH multiple ParBs bound to ParS are required. Has this been tested in this system?

      ParB spreading has been shown to be essential for the HTH-ParB to perform its role in partition function. We clarified the importance of HTH-ParB spreading for partition function on lines 44-45.

      In any event, another major difference between the two systems is that a peptide corresponding to the N-ter of ParB is sufficient to bind DNA indicating this type of ParB does not have to be bound to DNA to stimulate ParA. It would have been useful if the authors had commented on this.

      There seems to be a typo here. “N-ter of ParB is sufficient to bind DNA indicating ……” is incorrect. Perhaps this was meant to be “N-ter of ParB is unable to bind DNA, indicating ……” This is not a qualitative difference between the HTH- and RHH-ParBs: the N-terminal ParA interacting peptides of HTH-ParBs also can activate their cognate ParA ATPase without parS DNA binding, and parS-dependency of ATPase activation for HTH-ParBs appears to be significantly less stringent compared to the case for RHH-ParB we report here. ParBpSM1-27, which cannot bind parSpSM, could only stimulate ParApSM ATPase to at most 1/10 of the full-size ParBpSM in the presence of active parSpSM. We clarified this on lines 156-157, and also added discussion about this contrast between the HTH- and RHH-ParBs and possible implications on lines 458-467.

      Reviewer #3 (Public Review):

      Drs. Volante, Alonso, and Mizuuchi presented a milestone experimental finding on how the distinct architecture of the centromere (ParS) on a bacterial plasmid drives the ParABS-mediated genome partition process. Rather than being driven by cytoskeletal filament pushing or pulling, as in its eukaryotic counterpart, the genome partition in prokaryotes is demonstrated to operate as a burnt-bridge Brownian Ratchet, first put forward by the Mizuuchi group. To drive directed and persistent movement without linear motor proteins, this Brownian Ratchet requires two factors: 1) enough bonds (10s to 100s) bridging the PC-bound ParB to the nucleoid-bound ParA to largely quench the diffusive motion of the PC, and 2) the PC-bound ParB 'kicks off' the nucleoid-bound ParA that can replenish the nucleoid only after a sufficient time delay, which rectifies the initial symmetry-breaking into a directed and persistent movement. Although the time delay in ParA replenishment is established as a common feature across different bacteria, the binding properties of PC-bound ParB vary greatly, which begs the question of how Brownian Ratcheting adapts to different cellular milieu to fulfill the functional fidelity.

      The finding in this work presented a new but important twist in the Brownian Ratchet paradigm. The authors showed that in the pSM19035 plasmid partition system, only four contiguous ParB-binding repeats in ParS are required for the ParA-ParB interactions that drive the PC partition. In other words, only four chemical bonds are needed for the PC partition. Crucially, the authors further demonstrated that distinct orientation (configuration?) of the ParB-binding repeats is required for this fidelity by their state-of-the-art biochemistry and reconstitution experiments. The authors then elaborated on a possible mechanism of how the smaller number of PC-bound ParB can drive directed and persistent PC movement by interacting with nucleoid ParA. If I understand correctly, in their proposed scheme, due to their specific orientation (configuration?), when two of the ParS-bound ParB molecules bind to the two nucleoid-bound ParA molecules there arises a torsional/distortional stress. Consequently, the thermal fluctuations preload the forming bonds, triggering the dissociation of the two ParB molecules from the PC. And the remaining PC-bound ParBs may kick off the ParAs that have a time delay in DNA-rebinding, while ParB molecules will replenish the ParS to initiate the next round. In this proposal, the key conceptual leap is that not only the substrate but the cargo remodels to underlie the Brownian Ratcheting.

      We thank the reviewer for the kind evaluation of our work. The model proposed is highly speculative at this point. Although it may appear rather detailed, in order to account for the unexpected findings, we consider it only a working hypothesis to be revised or replaced by a better model in future. We thank the reviewer for the many useful suggestions, which we will follow in our revision.

    1. Author Response

      REVIEWER #1 (PUBLIC REVIEW):

      The study by Monterisi et al. reports that loss-of-function mutations in metabolic pathways do not necessarily have a negative impact on cancer growth. The authors suggest that small solutes transferred via gap junction channels formed between wild-type cells and cells expressing mutants defective in metabolic pathways rescue the metabolic-deficient cancer cells. Through the examination of multiple human cell lines with several advanced means to determine gap junction coupling, Cx26 was identified as a major connexin molecule involved in mediating gap junction coupling between colorectal cancer (CRC) cells. Mutations in three metabolic genes were investigated, covering major metabolic functions of the cell: pH regulation, glycolysis and mitochondrial function.

      Strengths: The paper tests a new hypothesis that the mutations that inactivate key metabolic pathways do not incur functional deficits in cancer cells expressing the mutants due to their gap junction coupling to wild type cells.

      From microarray data they identified multiple connexins expressed in various CRC cells. Several advanced analyses were used to assess gap junction coupling in CRC cells including fluorescence recovery after photobleaching (FRAP). The extent of permeability at steady state was evaluated using CellTracker dyes and coupling coefficients were determined. They also used flow-cytometry to study dye transfer, which provides a quantitative, dynamic means of studying cell coupling. The data showed that knocking down Cx26 could greatly reduce diffusive exchange in most of the CRC cells tested.

      The study focused on three metabolic genes, Na+/H+ exchanger NHE1, a regulator of intracellular pH, a glycolytic gene, ALDOA and mitochondrial respiration gene, NDUFS1. These genes were knocked out in the selected CRC cells highly expressing these genes. The co-culture studies were well executed with fluorescence-markers distinguishing the WT and knockout cells and well-defined readouts such as intracellular pH, media pH, glucose/lactate levels and mitochondrial O2 consumption and glycolytic acid.

      The experiments in general were well designed and conducted, and the data supported the conclusions. The paper is also logically written and figures were well presented providing clear graphic illustrations.

      Thank you for recognising the strengths and novelty of our findings.

      Weaknesses: Although the hypothesis is innovative, no clear justification is provided that illustrates the scenario representing the clinical situation. The remaining questions include: What kind of somatic mutations in cancer cells has little impact on their growth and progression?

      We have now added in vivo data (Fig 8) and revised the Introduction and Discussion to address this point. Briefly, the broader clinical relevance of our findings relates to the notion of essential genes and their negative selection. We show that connexin-dependent coupling can rescue a genetic deficiency, provided the mutation-carrying cell can access wild-type neighbours for the missing function. This rescue effect is limited to processes that handle solutes that can pass via connexin channels, i.e. metabolic processes. As such, sporadic loss-of-function mutations in “essential genes” may not produce a functional deficit in human cancers. We demonstrate rescue extensively in vitro, and now in a xenograft model.

      We argue that our work can explain why certain metabolic genes are essential in vitro, but not in vivo. In monolayers of mutated cells, diffusion across gap junctions cannot rescue the mutant phenotype, because there is no wild-type cell available to supply the missing function. In contrast, mutations in vivo will arise sporadically and wild-type cells are typically available to couple onto the mutation-bearing cell, providing it with functional rescue. Thus, only in the former case would the lethality of essential genes emerge.

      Indeed, many notable studies have found genes of various metabolic pathways to be essential for growth in vitro. Such genes would be expected to undergo negative selection in vivo, but this is exceedingly rare according to multiple observations. By demonstrating metabolic rescue in co-cultures (i.e. a setting closer to the tumour) and (now) in xenografts, our work provides an explanation for this apparent paradox. Indeed, cells such as NDUFS1-negative SW1222 grow very, very slowly in culture compared to wild-type cells and require regular media changes to keep pH alkaline. However, coupling onto wild-type cells can rescue knock-out cells in vitro and in vivo. We argue that this finding explains why loss-of-function mutations in NDUFS1 (and similar genes) do not undergo negative selection in human tumours (despite in vitro predictions).

      The three proteins selected for this study were chosen to represent very distinct types of solute-handling processes. We illustrate our point in a (new) summary figure in Fig 8.

      What types of WT cells, within the tumour cells or with neighbouring normal cells? Whether the current experimental design closely recapitulates the scenario in vivo?

      Indeed, we find that stromal fibroblasts may also support cancer cells via gap junctions, as this is essentially the same concept (i.e. coupling onto a cell with wild-type genes). However, we feel that expanding our present submission to fibroblasts would make the volume of data exceedingly large. Also, the methods we use for fibroblasts are different, and would require a full manuscript of their own. For example, there is the issue of how to control for the radically different growth rates of fibroblasts and cancer cells. We chose co-cultures of WT and genetically altered CRC cells so that the co-cultures are of the same background, with just one element changing (i.e. the metabolite-handling gene). This makes our data easy to interpret, and thus strengthens our case. Our in vitro experiments were performed on monolayers, where cells can make contacts in 2D. In vivo, these contacts will spread in all dimensions, thus connectivity is likely to be even more significant. If anything, monolayers probably under-estimate the importance of connectivity, but this preparation is more accessible for studying cell-to-cell communication.

      We recognise the importance of adding in vivo data to firm up our conclusions. To that end, we have analysed xenografts established from co-cultures of wild-type DLD1 and NDUFS1-KO SW1222 cells on one flank of a mouse, and Cx26-KO DLD1 and NDUFS1-KO SW1222 cells on the other flank. This experiment tested whether Cx26-dependent connections between mitochondrially-defective NDUFS1 KO SW1222 cells and respiring DLD1 cells (on the left flank only) are able to stimulate growth of the former (GFP-tagged). Indeed, NDUFS1-deficient cells grew faster when rescued by Cx26-expressing DLD1 cells. In contrast, their growth decelerated when DLD1 cells were Cx26-negative. We include these experiments and their controls in Fig 8.

      The readouts for co-culturing for glycolytic ALDOA and NDUFS1 knockout are only cell mass, without determining the more relevant markers, glucose/lactate and mitochondrial O2 consumption and glycolytic acid production.

      Our readouts are two-fold: total biomass and the size of the genetically altered compartment of co-cultures (GFP). We can therefore follow the relative growth of KO cells, which is essential for describing their growth (dis)advantage. We appreciate other markers are informative. Indeed, we characterised KO and WT cells in terms of O2 consumption and acid production in Fig 7. However, it would not be possible to measure glucose consumption selectively in GFP-positive KO cells of a co-culture, as the assays available for this measure ensemble rates for the entire population of cells (e.g. in a single well). Nonetheless, we believe that biomass as a readout is highly relevant to cancer, and we hope the reviewer concurs with us.

      The study needs to include cells without functional gap junctions like the characterized negative control RKO cells.

      This is an excellent suggestion, and we have added data for RKO cells to several figures. As expected, these do not form a syncytium and cannot rescue genetic defects in co-cultured cells. New data are shown in Fig 3G-H, Fig 6-supp2 and Fig 7H, adding to existing RKO controls in Fig 2A/B. Briefly, RKO cells do not exchange CellTracker dyes in monolayers (Fig 3F/G), cannot rescue cells that are ALDOA-deficient (Fig 6-supp2), and cannot rescue NDUFS1-deficient cells (Fig 7H). We also added Cx26-KO DLD1 cells to the CellTracker experiments in Fig 3.

      REVIEWER #2 (PUBLIC REVIEW):

      This paper is a logical extension of the 50 year-old concept of the "bystander effect" in tumours, wherein the effects of anti-tumour chemotherapeutics extend beyond the cells that take them up due to spread through gap junctions to adjacent cells. In this case, however, the authors have creatively realized that the reverse might also occur, and that tumour cells with otherwise fatal mutations in essential metabolic pathways can be rescued by their neighbours through passage of the missing metabolites through gap junctions. This can explain why mutations in other critical pathways, such as protein synthesis and transporters, are selected against in rapidly growing tumours, but others in equally critical pathways of glycolysis, electron transport, etc. are not, despite these genes having been demonstrated to be essential in in vitro KO studies (where all cells in the plate have the critical gene knocked out). A series of elegant experiments are used to test this proposal in several colorectal cancer (CRC) cell lines using three examples - pH regulation (defective Na+/H+ exchanger NHE1), glycolysis (defective Aldolase A (ALDOA)) and oxidative phosphorylation (defective Complex 1 - NDUFS1).

      Thank you for these positive comments. We have added key references to the bystander effect in the Introduction, and explain how our findings build on these milestones.

      The authors first determine the levels of different Cx proteins expressed in each cell line, and determine that for most Cx26 and 31 are dominant, although some lines have a subset of cells with high Cx43 expression. They then use Cell Tracker Green to pre-label cells and use FRAP as a means to measure how well the cell population is coupled. This is a useful measurement but is significantly over-interpreted by the authors as a "permeability" in uM/min. This is not really a permeability, which requires knowledge of the concentration gradient of the permeant species, relative cell volumes, etc. Rather it is a rate of fluorescent recovery that is presumably correlated with, but not quantitatively related to, levels of coupling.

      Thank you for this comment. We would like to explain why we believe our FRAP experiments are able to estimate permeability in units of um/s. The rate of recovery of a solute in a cell following its “destruction” (here, photobleaching) is given as follows:

      dCcell/dt = p⋅P(Ccell-Csurround) … [1]

      Where subscripts ‘cell’ and ‘surround’ refer to the cell and its neighbours. P is the permeability of the barrier between these two compartments, and p is the ratio of the surface area of the barrier (i.e. membrane) to volume of the bleached cell. Within a “bleached” cell, we measure fluorescence.

      Since fluorescence (F) is proportional to concentration (C), we can substitute:

      C = α⋅F

      where α is a constant of proportionality. Thus, the rate of recovery (L.H.S. of equation [1]) becomes:

      dC/dt = d(α⋅F)/dt = α⋅dF/dt … [2]

      And the R.H.S. of equation [1] is re-written as: p⋅P⋅(Ccell-Csurround) = p⋅P⋅(α⋅Fcell-α⋅Fsurround) = α⋅p⋅P⋅(Fcell-Fsurround) … [3]

      Putting [2] and [3] together,

      dFcell/dt = p⋅P⋅(Fcell-Fsurround)

      Prior to photobleaching, there are no (net) gradients, thus initial Fcell and Fsurround are equal.

      Thus, we can re-write the equation in terms of normalized fluorescence (f=F/F0):

      dfcell/dt = p⋅P⋅(fcell-fsurround)

      P can therefore be expressed as:

      P = dfcell/dt / (p⋅ (fcell-fsurround))

      Here, dfcell/dt is measured from the fluorescence recovery time course and fcell-fsurround is measured experimentally (in fact, bleaching in the cell is set to 50%, thus this takes the value of 0.5 by default). We can approximate the monolayer as a network of cuboidal cells. The cell’s volume is thus ‘area’ times ‘height’, and the cell’s surface (at which it contacts its neighbors) is the ‘cell’s perimeter’ times ‘height’. Thus, for the bleached cell,

      p = (perimeter × height) / (area × height) = perimeter / area.

      The perimeter and area can be measured from the acquired fluorescence images. Thus, we can describe permeability using data obtained from image stacks. We appreciate that this method makes certain geometrical approximations, but we believe these are not unreasonable. We explain the assumptions and calculations in Appendix 1. More information about the method is published by us in https://pubmed.ncbi.nlm.nih.gov/28368405/. Of course, we accept that these calculations are less accurate than, say, electrical conductance measurements, and to that end, we added a note of caution to the main text.
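      As an illustration only, the calculation described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' analysis code: the function names, the least-squares estimate of the initial slope, and the example numbers (perimeter, area, fluorescence time course) are all hypothetical. The gradient is taken as a magnitude, matching the statement above that it takes the value 0.5 when the cell is bleached to 50%.

```python
import math


def surface_to_volume_ratio(perimeter_um, area_um2):
    """p for a cuboidal cell: (perimeter x height) / (area x height) = perimeter / area, in 1/um."""
    return perimeter_um / area_um2


def initial_recovery_rate(times_s, f_cell):
    """Least-squares slope of normalized fluorescence over the supplied early time points (1/s).

    Assumption: the early recovery is close to linear over these points.
    """
    n = len(times_s)
    t_mean = sum(times_s) / n
    f_mean = sum(f_cell) / n
    num = sum((t - t_mean) * (f - f_mean) for t, f in zip(times_s, f_cell))
    den = sum((t - t_mean) ** 2 for t in times_s)
    return num / den


def permeability_um_per_s(dfdt_per_s, p_per_um, f_cell, f_surround):
    """P = (df_cell/dt) / (p * |f_cell - f_surround|), in um/s."""
    gradient = abs(f_cell - f_surround)  # magnitude of the normalized gradient
    return dfdt_per_s / (p_per_um * gradient)


# Hypothetical example values, chosen only to make the arithmetic easy to follow.
times = [0.0, 1.0, 2.0, 3.0]          # seconds after bleaching
recovery = [0.50, 0.52, 0.54, 0.56]   # normalized fluorescence of the bleached cell

p = surface_to_volume_ratio(perimeter_um=40.0, area_um2=100.0)
slope = initial_recovery_rate(times, recovery)
P = permeability_um_per_s(slope, p, f_cell=0.5, f_surround=1.0)
print(f"p = {p:.2f} 1/um, df/dt = {slope:.3f} 1/s, P = {P:.3f} um/s")
```

      With these made-up inputs the units work out as (1/s) / (1/μm), i.e. μm/s, which is the dimensional argument made above for reporting the FRAP result as a permeability.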

      This fluorescent recovery is shown to be sensitive to siRNA KO of Cx expression, but strangely its reduction is only correlated with KD of Cx26 in the 5 cell lines examined. KD of Cx43 (in LOVO cells) and Cx31 in all 5 cell lines had no effect or in some cases seemed to increase the rate of recovery (DLD1 and SNU1235). This is a notable finding, yet the authors choose to completely ignore it and continue with Cx26 KDs in studies of specific metabolite transfers. Some discussion should be included as to why KD of these Cxs has no effect or causes an apparent increase in coupling of the cells.

      The effectiveness of GJB2 knockdown in ablating ensemble connectivity most likely reflects that Cx26 is the dominant conductance inherited from the parent epithelium. Other isoforms are expressed, but in most CRC cells, these do not produce major coupling, as GJB2 knockdown was sufficient to uncouple many CRCs. These observations justify our choice of connexin for studying metabolic rescue functionally. These findings are also consistent with the good correlation between ensemble connectivity and GJB2 levels.

      Our data show a trend towards increased coupling following GJB3 (Cx31) KD in DLD1 and SNU1235 cells and GJA1 (Cx43) KD in LOVO cells. However, when analysed by hierarchical (nested) analysis, these effects are not statistically significant, and for that reason we did not elaborate on these trends in the original submission. The apparent increase in conductivity in cells treated with GJA1 or GJB3 siRNA could reflect a compensatory response to the ablation of a specific message, closer contacts between cells allowing Cx26 to strengthen its connections, or a shift away from heterotypic channels involving Cx26 and Cx31/Cx43, towards homotypic Cx26. We did not see any consistent change in the intimacy of cell-cell contacts. We have now performed western blots for connexins to probe for compensatory changes (see Fig 2-supp1). In comparison to wild-type cells, expression of Cx31 was not changed by GJB2 (Cx26) or GJA1 (Cx43) knockdown in DLD1 cells. GJB2 KO DLD1 cells did not induce expression of the other major isoform, Cx43. Also, in DLD1 cells, KD of GJB3 or GJA1 did not substantially change Cx26 levels. Similarly, KD of GJB3 did not affect Cx43 levels. In GJA1-high C10 cells, KD of GJB3 did not alter Cx43 levels, although a small decrease was observed with GJB2 KD on Cx43. Also in C10 cells, KD of GJB2 and GJA1 did not induce an increase in Cx31 levels.

      We agree that complex interactions between connexin genes are possible, but we feel that a molecular study of Cx gene regulation would fall outside the scope of the present manuscript. Our findings point to a prominent role of Cx26 in metabolic rescue, and to strengthen this point, we show that Cx26-negative cells that express other connexins (e.g. C10 cells or NCIH747 cells) cannot rescue ALDOA-deficient counterparts or NDUFS1-KO SW1222 cells (new data in Fig 6 and 7). We share the Reviewer’s enthusiasm about the interplay between connexins and will endeavour to study this further in the near future.

      Rather than just focus on acute transfer of dye between cells, the authors develop a system using 50/50 mixes of cells labelled with two junctionally permeant dyes and measured the degree of mixing at equilibrium (48 hours). This is presented as a "coupling coefficient", but how it is calculated, and its significance is not well described, and does not correlate with the historical use of this term in the literature. Nonetheless, the studies do seem to demonstrate a good degree of equilibration, although it would have been informative to determine if the cells that do not exchange dyes express Connexins. To document that this equilibration requires gap junctions, the authors employ low density cultures, which significantly decrease dye exchange. However, in at least one cell line (SW1222) dye exchange is only reduced by <50%, indicating a very high background to this assay. This is not addressed.

      Thank you for these comments. We agree that our description of the method was inadequate, and we have added the necessary information in Appendix 1. We have also added information about actual confluency and restructured the figure. We also added new data for RKO cells and DLD1-Cx26 KO cells, i.e. two negative controls (Fig 3H). We pondered about the best name for describing the numerical output of the method, and concluded that “coupling coefficient” is reasonable (provided we improve our description of it) because it is dimensionless, and like many coefficients has a finite range (here, zero to one). With further explanation, we hope this terminology is acceptable. The issue with SW1222 cells is that both low- and high-seeding densities produce clusters of cells. Even though overall cell numbers were different in high and low seeded cultures, actual connectivity within “islands” of cells remained high, hence their similar coupling coefficients (see Fig 3E). Indeed, this CRC line is unusual in this behaviour, so we only present data from the higher density.

      The most compelling part of the study is the use of reporters to directly demonstrate a role of Cx26 coupling of cells to rescue cells with mutations of the three genes mentioned above when mixed with normal neighbours. This case was most convincing in the cases of ALDOA and NDUFS1, with the data for the pH regulation requiring more explanation for full understanding of the data shown (e.g. Figs 7 G and H).

      Thank you for this comment. Studies of pHi regulation provide a unique opportunity to obtain single-cell resolution (unlike e.g. glycolytic assays). We took advantage of this, and therefore the figure on pHi presents a greater depth of analysis. Nonetheless, we agree the pH data need further explanation. We have expanded the text, and also added a bar plot of data on day 7, which now provides a clearer illustration of the rescue effect. This form of presentation was also adopted for ALDOA and NDUFS1 experiments in the subsequent figures.

      Overall, the study does a credible job of demonstrating that Cx26 coupling of CRC cells serves to rescue cells with mutations in critically necessary metabolic pathways, presumably due to transfer of metabolites from surrounding wt cells. However, some of the results indicate this is not a simple process where all connexins behave similarly, and some effort should be made to investigate if Cx31 and 43, which do not seem to play the same roles in maintaining cell coupling as Cx26, also play any role in such metabolic rescue.

      Thank you for this comment. We have addressed this by selecting three additional cell lines for study: RKO – a cell line with no major Cx expression; C10 – a cell line that expresses Cx43, but very low levels of Cx26; NCIH747 – a cell line that expresses Cx31, but low levels of Cx26. These additional experiments cover lines that are GJB2 (Cx26)-low/negative to test whether metabolic rescue is best achieved with Cx26. Our new data show that these cells are unable to rescue metabolic defects (new data provided in Fig 6H/I, Fig 7H, and Fig 6-supp2). These findings strengthen our case for a major role of Cx26, at least in CRC networks. Indeed, recent analyses by Robert Gatenby and colleagues have shown that mutations in GJB2 (Cx26) are exceedingly rare in cancer (a property not shown for other connexin genes). This is interpreted to mean that Cx26 plays a particularly prominent role, ostensibly for metabolic rescue.

      REVIEWER #3 (PUBLIC REVIEW):

      Strengths of the study include that it appears to be a careful and well thought out set of experiments. The analysis and treatment of multiplexed data is also sophisticated. For the most part, the work is clearly and logically described, as well as well illustrated. In general, the authors achieved their experimental goals, and the methods while not entirely new, do provide new twists and augmentations that should be useful to the field. A general weakness is that this is not entirely a new story. Instead, it is a variant of one of the oldest concepts in the field of gap junction biology i.e. "Metabolic cooperation". The term "Metabolic cooperation" (i.e., as mediated by gap junctions) was not mentioned by the authors, but it is a long-established and foundational concept in the field. Indeed, in a classic paper by Gilula and colleagues published in 1972, the experimental approach used was similar to that of the study in hand. These earlier authors showed how transformed cell lines with deficiencies in hypoxanthine metabolism can be "rescued" by "metabolic cooperation" in co-culture with metabolically competent cells via passing a gap junctional permeant molecule. This and other relevant papers were not cited. More importantly, the extant literature places the onus on the authors to explain and convince reviewers why this study is more than an incremental step.

      We apologise for not quoting these important and classical references. We have now added these works to our reference list (quoted in Introduction). At the time of these seminal discoveries, Loewenstein and colleagues made a case that connexins are absent in cancer, and this belief persisted for many decades. More recently, the role of gap junctions in cancers has garnered attention. With new gene manipulations (e.g. CRISPR/Cas9) and imaging techniques and improved xenografting, it is now possible to precisely study the impact of GJ on cancer metabolism. Moreover, we have a wide panel of cancer cell lines to study, and identify the prominent role of Cx26. We highlight that our study is the first to offer a mechanistic explanation for the absence of negative selection in cancer, a phenomenon which was not known in the 1970s. To strengthen our novelty, we now add in vivo data to Fig 8 that confirm in vitro findings.

    1. Author Response

      Reviewer #1 (Public Review):

      1. “The major weakness of the study is that with the interpretation of the results. The changes in tractography, behavior and TBM are what would be expected following lesions of the neostriatum”

      We appreciate this comment and would like to offer clarification. We respectfully disagree that the pattern of results presented in this manuscript is akin to what would be expected following striatal lesions. In NHPs, striatal lesions typically cause more extreme phenotypes than what we observed in our 85Q-treated animals. In macaques, bilateral putamen lesions can result in phenotypes that include seizures, inappetence, hyper-aggression, and other severe features. This strongly impacts clinical scores and can make it unfeasible to care for the animals for multiple years. For these reasons, recent NHP HD lesion models have used only unilateral putamen lesions coupled with bilateral caudate lesions to model HD (as in the recent paper by Lavisse et al, 2019). Of additional relevance is that even the cognitive effects of these striatal lesions are more severe than what we observed in our 85Q-treated animals: for example, Lavisse reported reduced performance on similar “prefrontal” cognitive tasks by ~50%, whereas our AAV-HTT model exhibited only ~10% reductions in working memory. This mild, but significant, change in cognitive performance and motor function seen in our 85Q animals is much more akin to that which is observed in the early stages of HD.

      2. “The results have been interpreted as showing a progressive model, although evidence that there is progression is limited”...“begs the question as to whether or not the 85Q-lesioned monkeys would recover to a level similar to the 10Q animals if left for another 12 months”

      At the request of Reviewer 1, we added an additional 30-month timepoint and re-ran all of the analyses to include these new data. All of the behavioral and neuroimaging data were re-analyzed with this final timepoint included (see Lines 125-141, 146-163, 173-194, 228-255, 270-294, 314-345). Additionally, due to the unidirectional nature of our hypothesis and on the advice of our bio-statistician, we applied one-tailed tests to the planned comparisons in this revision. To address the Reviewer’s point directly: 85Q-treated animals showed minimal evidence of functional recovery between the 20- and 30-month timepoints on the behavior tasks. In particular, working memory deficits measured with SDR and fine motor skills measured with Lifesaver Retrieval did not improve between 20 and 30 months (Figure 1C and 1F). Additionally, neurological rating scores in group 85Q remained consistently elevated (in the 5-7 range) between the 20- and 30-month timepoints. Taken together, we feel confident that these results do not show evidence of any significant functional recovery, out to 2.5 years (30 months). In terms of the longitudinal trajectories of the behavioral measures, we appreciate the Reviewer’s feedback regarding the use of the term ‘progressive’ and have tempered our language appropriately. We removed all instances of the word progressive/progressed except in the context of the motor rating scores, which show a significant Group x Timepoint interaction and demonstrate a clear progression.

      3. “The whole manuscript is written as though this is a genetically-relevant progressive model of HD. But the animals are normal, and so there is no genetic context relevant to HD”

      We thank Reviewer 1 for this comment. We recognize that viral-based animal models of HD, including the model characterized here, are not as genetically similar to the human condition compared to some of the other modeling approaches currently under investigation (ex. knock-in and gene editing). Limitations of the AAV-based HTT85Q model include: 1) vector packaging restrictions that prohibit expression of full-length HTT, 2) the use of a CAG promoter vs. an endogenous promoter that leads to overexpression of the transgene, 3) the use of cDNA versus genomic DNA excludes introns and therefore lacks the ability to produce alternatively spliced variants (ex, Exon 1), 4) the use of a mixed CAG-CAA repeat may preclude the possibility of somatic instability and 5) expression of HTT that is restricted to specific brain regions and cell types. All of these important limitations have been added to the discussion section in this re-submission (Lines 503-517).

      Despite these limitations, we feel that this AAV2:AAV2.retro-HTT85Q based model has some features that make it genetically-relevant to human HD including: 1) the expression of an N-terminal fragment of human HTT (N171), 2) the N-terminal fragment bears a pathological PolyQ expansion (85Q), 3) the expressed mHTT fragment forms neuronal aggregates that can be detected in the nucleus, 4) mHTT fragments are expressed in many of the same brain regions where aggregates are detected in human HD cases, with both regional and sub-regional specificity (ex. higher expression in anterior vs posterior cortical regions and expression primarily limited to deep cortical layers V/VI) and 5) expression of mHTT fragments in these regions leads to many of the same pathological and behavioral changes observed in HD patients. Importantly, expression of the N-terminal portion of HTT allows for the evaluation of HTT lowering therapeutics that target the first 3 exons (ASOs, miRNAs, zinc finger repressors, CRISPR-based therapies, etc), which cannot be evaluated in lesion-based models.

      4. “The authors state in the Abstract that the injection resulted in "robust expression of mutant huntingtin in the caudate and putamen". These data are not in the manuscript.”

      Evidence of mHTT expression in the caudate and putamen, as well as several other brain regions, via immunohistochemical and immunofluorescent staining is now included in the manuscript. Please see additions to the methods, results and discussion sections regarding these findings, as well as a new Figure 5 (see Lines 347-376, 756-788). Additionally, further details regarding an associated PET imaging study in this same cohort of animals using a mHTT aggregate-binding radioligand have been added to the discussion (see Lines 437-443). Please also see response #13 (below).

      5. “The authors chose to use a fragment of the HD gene, with a very long repeat that is seen only in juvenile patients”

      Comments regarding the need to use a fragment of the HTT gene, versus the full-length gene, due to packaging constraints of the viral vector, were added to the discussion in the context of limitations (Lines 503-517), and also discussed above in response #3. The choice to use a CAG repeat length of 85 (83 pure CAGs followed by a CAA/CAG cassette; see response #17 below for further details) was based on previous studies wherein similar CAG repeat lengths were used to create animal models of HD over the past several decades. Interestingly, while CAG repeat lengths in patients with adult-onset HD typically range from ~40-60, longer repeat lengths (>60) are typically required in animal models of HD to elicit pathological and behavioral manifestations of disease: transgenic, knock-in and viral vector-based rodent models (ranging from 72-150 CAGs), OVT73 transgenic sheep model (73 CAGs), transgenic and knock-in minipig models (ranging from 85-150 CAGs), transgenic and viral vector-based macaque models (ranging from 82-103 CAGs). See Ramaswamy et al, 2007 and Howland et al, 2021 for thorough reviews of these models.

      6. “For their cognitive testing, the authors used a task (delayed non-match to sample) that measures object recognition and familiarity. Before surgery, only 11/17 of the animals were successfully trained to complete this task. It is not clear how useful the data are when only 64% of the animals can be included.”

      We appreciate the Reviewer’s concerns and have decided to conservatively remove these data from the revised manuscript.

      7. “It is not clear how this monkey model will be useful for developing either disease biomarkers or therapeutic strategies for HD (as stated in the abstract)”. “The authors state that they hope the model will become a widely used resource. This seems an unlikely scenario, given the limitations of the current study and the challenges associated with using monkeys. They say that a major advantage of their technique is being able to generate large numbers of monkeys. But this is not a relevant argument if the usefulness of the model to investigate HD is not proven.”

      We thank the reviewer for requesting clarification on these important points. We believe that this model will be useful for developing therapeutic strategies because the HTT85Q-treated macaques express mutant HTT, along with HTT aggregates, in several key brain regions that are affected in human HD, along with undergoing regional gray matter atrophy and white matter microstructural alterations that correlate well with behavioral dysfunction. Studies currently under review elsewhere also show reduced dopamine neurotransmission and regional hypometabolism via PET imaging in this model. Together, or individually, these imaging and behavioral changes can serve as outcome measures when screening potential therapies. Possible therapeutic interventions that are amenable to screening in this model are included in the discussion.

      Regarding biomarker development, we have already engaged in PET imaging biomarker development in this model in collaboration with the CHDI foundation and the Molecular Imaging Center at the University of Antwerp, evaluating a candidate radioligand that binds to aggregated mHTT. See #13 below for a more detailed description of this PET study, including recent data showing its ability to bind to aggregated species of mHTT in several brain regions in this same cohort of HTT85Q macaques that correspond to 2B4 and em48 IHC staining (a manuscript describing these results has been prepared for submission and the PDF is included for the reviewers to peruse).

      The authors do envision this AAV-based macaque model becoming a resource for the HD research community. While this model does have certain limitations (now detailed in the Discussion), we respectfully assert that all of the HD animal models, both small and large, each have their own important limitations to consider when deciding on which to use to screen therapeutics. Selecting a specific animal model based on the individual scientific questions being asked will be required, and employing a combination of models may be an even more prudent strategy.

      While NHP research presents unique challenges (cost, housing requirements and recent challenges in availability, among them), we believe that viral vector-based NHP models could be more accessible to the HD research community compared to some of the other established large animal models, in that they may be readily created at contract research organizations (CROs), in addition to various academic research institutions. There are now many CROs that exist in the US, and elsewhere around the world, that have developed specific expertise in MRI-guided, intracranial delivery of AAVs into the NHP brain (including the caudate and putamen), in the context of assessing therapeutic interventions for a variety of neurological disorders (HD, PD, and MSA, among others). Most of these same CROs also have expertise in NHP imaging (MRI/DTI) and behavioral assessments across multiple domains. It seems feasible that AAV-mediated HD macaques could be produced in sufficient numbers to appropriately power therapeutic studies, using the outcome measures established in the current study.

      Reviewer #2 (Public Review):

      1. “The major weaknesses are the manner in which the data is presented”

      We replotted all of the figures with improved color palettes and larger font sizes to make them easier to read. We also added additional details throughout the results section to aid in clarity and improve readability.

      2. “The authors would benefit from talking more about their model in the introduction and including references to some key points. For example, there has been critical new data in the field showing the importance of poly (CAG) in disease, not necessarily poly(Q), and the community will want to know (and not be required to look up), the nature of the transgene. Is it a pure CAG repeat? A mixed repeat? If it is pure, do they see or could they measure somatic expansion in the various brain regions impacted? How does that data match the phenotypes seen? Since this is a transgene, there is no possibility for the exon1/intron1 splicing variant to appear - how does this impact their interpretation”

      Further details regarding the transgene have been added to the Viral Vector Section of the Methods (Lines 531-550). The repeat is not pure and contains a single CAA interruption. The glutamine encoded repeat for HTT85Q contained 83 pure CAG repeats, followed by a single CAA/CAG cassette, while the glutamine encoded sequence for HTT10Q contained 8 pure CAG repeats followed by a single CAA/CAG cassette. Both constructs contained a proline stretch distal to the glutamine repeat in the following allelic conformation where QT represents the total glutamine length:

      HTT85Q: QT=85, (CAG)83(CAACAG)1(CCGCCA)1(CCG)7(CCT)2

      HTT10Q: QT=10, (CAG)8(CAACAG)1(CCGCCA)1(CCG)7(CCT)2
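
      To make the repeat notation above concrete, the following Python sketch expands each allelic formula into its nucleotide sequence, translates it, and counts the glutamines. It is an illustrative check only, not part of the authors' methods: the helper names are hypothetical, and the codon table is deliberately limited to the five codons that appear in these formulas (CAG and CAA encode glutamine; CCG, CCA and CCT encode proline).

```python
import re

# Minimal codon table covering only the codons used in the two constructs.
CODON_TO_AA = {"CAG": "Q", "CAA": "Q", "CCG": "P", "CCA": "P", "CCT": "P"}


def expand(spec):
    """Expand a formula such as '(CAG)83(CAACAG)1' into a nucleotide string."""
    units = re.findall(r"\(([ACGT]+)\)(\d+)", spec)
    return "".join(unit * int(count) for unit, count in units)


def translate(dna):
    """Translate in frame, three nucleotides at a time."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def leading_glutamines(protein):
    """Total glutamine length (QT): the run of Q at the start of the peptide."""
    return len(protein) - len(protein.lstrip("Q"))


def pure_cag_run(dna):
    """Number of consecutive CAG codons before the first interruption."""
    count = 0
    while dna[3 * count:3 * count + 3] == "CAG":
        count += 1
    return count


constructs = {
    "HTT85Q": "(CAG)83(CAACAG)1(CCGCCA)1(CCG)7(CCT)2",
    "HTT10Q": "(CAG)8(CAACAG)1(CCGCCA)1(CCG)7(CCT)2",
}

for name, spec in constructs.items():
    dna = expand(spec)
    protein = translate(dna)
    print(name, "QT =", leading_glutamines(protein),
          "pure CAG run =", pure_cag_run(dna),
          "prolines =", protein.count("P"))
```

      The count shows why 83 (or 8) pure CAGs give a total glutamine length of 85 (or 10): the CAA/CAG cassette contributes two further glutamine codons, with the CAA acting as the single interruption of the CAG tract.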

      There are plans to probe for somatic expansion in various brain regions, including the caudate and putamen, as well as several distal cortical regions. That analysis is ongoing and not in the scope of the present manuscript; however, these analyses are now mentioned in the discussion section (lines 540-560), as well as a discussion on the ability to either remove or duplicate the CAA/CAG cassette to potentially increase or decrease the rate of disease progression, respectively, based on the work of Ciosi et al. 2019. Additionally, Reviewer 2 is correct in that the lack of intronic sequences in the transgene precludes the formation of splicing variants, such as the exon1/intron1 variant, which we know is pathological based on the work of Bates et al. This drawback has been added to the discussion, along with other limitations of this viral vector-based model (Lines 503-517).

      3. “What about RAN translations? Is RAN translation noted at all in this over-expression model? How does that contribute (or not) to the progressive phenotype they see in their NHPs?”

      We are also curious regarding the assessment of toxic protein products from RAN translation of the expanded repeat sequence in this model. These studies are planned, and the results of these assays will be included in a future manuscript describing other ongoing post-mortem evaluations in this model.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript reports a new genetically encoded neuronal silencer BoNT-C. They show that it fully blocks neurotransmission in two classes of Drosophila motor neurons (Ib and Is; tonic and phasic, respectively). They also update a GCaMP postsynaptic reporter SynapGCaMP to express GCaMP8f instead of 6f. They selectively silence Ib or Is neurons to disambiguate the neurotransmission properties of each neuron. Finally, they show that silencing either Ib or Is neurons does not induce heterosynaptic structural or functional plasticity (only neuron ablation triggers plasticity). The data are convincing. The new silencing tool will be widely used.

      We thank this reviewer for his positive assessment of our study and for highlighting the utility of the new silencing tool presented in this study.

      Reviewer #2 (Public Review):

      The conclusions of this paper are properly supported by the provided data.

      Overall this work opens a new window to examine novel aspects of heterosynaptic structural and functional plasticity.

      We also thank this reviewer for his positive assessment of our study and for putting the importance of our findings in context.

      Reviewer #3 (Public Review):

      The strength of the manuscript by Han et al. is the comprehensive characterization of BoNT-C, showing that it truly abolishes all evoked and mini responses without structural alteration of the NMJ. Based on this, the authors then show that ablation of all neurotransmission in either Ib or Is does not cause any compensatory changes (neither functional nor structural) in the 'other' (i.e. looking at Is when silencing Ib or looking at Ib when silencing Is).

      The weakness of the manuscript lies in the modest gain over the previous work. Specifically, Aponte-Santiago had already shown that many parameters are not changed (in Ib when Is is perturbed, or in Is when Ib is perturbed), including that 'the Is terminal failed to show functional or structural changes following loss of the coinnervating Ib input' (quote from 2020 paper). Hence, the only major difference is that Han et al now show that Ib also does not really change when Is is silenced. Aponte-Santiago also clearly showed a ~50% EJP reduction when Is or Ib are perturbed alone, and adding these two equals wild type. The highly emphasized finding of Han et al. that (quote) ' composite values of Is and Ib neurotransmission can be fully recapitulated by isolated physiology from each input' quite obviously follows from the one key finding that one does not affect the other, as mentioned above in the strengths. The wording is a bit odd, but really adding Is (with Ib perturbed) and Ib (with Is perturbed) inputs is really not adding much over either the main finding nor the previous work.

      We thank this reviewer for his/her/their assessment of our study and for highlighting the strengths in characterizing the impact of BoNT-C expression at the NMJ. We also understand and appreciate the criticisms raised. It is important to note from the outset that the motivation and central goal of this study was not primarily to mechanistically dissect heterosynaptic plasticity between tonic and phasic motor inputs at the Drosophila NMJ. Rather, it was to develop an approach that would, for the first time, enable accurate isolation of complete neurotransmission from entire MN-Is or MN-Ib NMJs (both miniature and evoked transmission). By the reviewer’s own admission, we were entirely successful at achieving this central goal in our comprehensive characterization of BoNT-C.

      Next, the reviewer raises the valid question about whether this achievement is a significant advance over previous work, and discusses recent experimental findings regarding heterosynaptic plasticity at the fly NMJ. We want to emphasize here that having a tool that is capable, for the first time, of accurately discriminating complete transmission from Is vs Ib alone is a major advance, one that it is not clear the reviewer sufficiently appreciates. As summarized in Fig. 1, no previous attempts have been successful in accurately isolating synaptic transmission between Is vs Ib synapses. In particular, no previous approach was capable of isolating miniature activity from Is vs Ib, and as we show in our manuscript, miniature events exhibit major differences between the two inputs. Thus, without isolating miniature transmission, one cannot know baseline synaptic function in Is vs Ib nor whether heterosynaptic functional plasticity has been induced. Further, we detail major confounds with some of the previous approaches the reviewer alludes to in prior studies, including selective optogenetic stimulation.

      Finally, the reviewer discusses at length recent findings regarding heterosynaptic plasticity and questions whether the new insights revealed by BoNT-C provides a sufficient advance. In particular, the reviewer refers to previous work published in 2020 and 2021, where important initial insights into Is vs Ib structure and transmission after differential manipulations to either input was reported. The reviewer appears to believe that it was settled in these studies that no heterosynaptic functional plasticity was induced.

      However, a critical point that the reviewer appears not to appreciate is that while the two previous studies on heterosynaptic plasticity at the Drosophila NMJ were able to assess structural plasticity (Aponte-Santiago et al., 2020; Wang et al., 2021), no accurate or quantitative conclusions can be made about heterosynaptic functional plasticity from these studies. This is due to the authors not knowing what baseline synaptic function is at Is vs Ib (miniature frequency, miniature amplitude, and evoked transmission), so that they cannot accurately determine whether any functional changes are observed after their manipulations. Further complicating the interpretation of the previous studies is that at the muscle 1 NMJ (2020 study), like the muscle 4 NMJ (2021 study), ~30% of these NMJs fail to be innervated by an Is input in wild-type larvae. This major confound makes it difficult to know how or whether adaptive plasticity is induced in wild-type NMJs with or without Is innervation (since, interestingly, evoked transmission does not appear to change in wild-type m1 or m5 NMJs with or without an Is input), and then to determine whether any heterosynaptic plasticity is induced. Indeed, we have also struggled with how to accurately determine whether synaptic function changes compared to baseline throughout our studies at earlier stages, despite the fact that the muscle 6/7 NMJ we use in our study does not suffer from the variable Is innervation confounds observed at muscle 4 (Wang et al., 2021) and muscle 1 (Aponte-Santiago et al., 2020).

      Respectfully, we contend that the only way one can accurately and quantitatively determine baseline synaptic transmission (miniature amplitude, frequency, evoked, quantal content), and whether any changes are observed following manipulations to Is or Ib, is to fully and accurately recapitulate wild type (blended Is+Ib) neurotransmission from isolated Is vs Ib transmission. This is why we believe the data shown in Fig. 7 (and also Fig. S7 in the revised manuscript) is so important. It is true that numerous previous studies established relative and qualitative differences between Is vs Ib (miniature events are larger at Is relative to Ib, Is drives larger depolarization in response to single synaptic stimulation over Ib, etc). However, in no case did previous studies accurately assess baseline Is vs Ib synaptic function from entire inputs, and therefore could not conclude with certainty whether heterosynaptic functional plasticity was induced.

      On a different but somewhat similar topic, UAS-BoNT-C is not a new tool. I am a bit put off by the wording ' We have developed a botulinum neurotoxin, BoNT-C...'. More on this and the way the previous BoNT-C paper (Backhaus et al., 2016) is cited in the detail comments below in the recommendations for the authors.

      We understand these points raised by the reviewer. Our BoNT-C transgenic line is indeed a new tool, the only one in which synaptic transmission has ever been electrophysiologically characterized and shown to completely silence synaptic transmission in Drosophila. That being said, in retrospect, we can appreciate that the term “developed” might imply a level of innovation that reasonable people can disagree about. We have therefore elected to change the apparently offensive wording to “We have employed a botulinum neurotoxin, BoNT-C…” in the abstract of the revised manuscript.

      Additionally, the manuscript does not really dive into an analysis of phasic versus tonic functions (that's just a correlation with the Is and Ib dominant modes of function).

      We absolutely agree that selective silencing by BoNT-C now enables a rigorous study of tonic vs phasic neurotransmission at MN-Is vs MN-Ib NMJs, but that in the current manuscript we have not focused on this interesting question. We have adopted the convention the field has used to classify MN-Is and MN-Ib subtypes based on their apparent firing modes as “phasic” vs “tonic”, but like previous studies, we have not analyzed these functional distinctions on a deeper level. Although the focus of the current manuscript was to establish the properties of BoNT-C and highlight its utility as a tool for the field, we are now in the process of preparing an entirely new manuscript focused on just this reviewer’s question about the differences in tonic vs phasic synaptic physiology. This eight-figure manuscript will be entitled “Electrophysiological properties and nanoscale distinctions that define tonic vs phasic glutamatergic synapses” and is focused on the central question raised by the reviewer - how and why synaptic transmission differs between tonic vs phasic inputs. While this interesting question is outside the scope of the current manuscript, we will submit this new manuscript within the next few months, which is based on new experimental insights now enabled by selective BoNT-C silencing established in the current manuscript.

      Finally, since the authors show that loss of Is or Ib function does not cause any change in the other, we are left to wonder what actually DOES cause heterosynaptic plasticity. TNT or rpr DO cause some heterosynaptic plasticity and they also DO cause some structural changes - but whether the structural changes themselves are important here remains unclear. Substantial progress would have been to take the starting point that BoNT-C does not cause heterosynaptic plasticity, and then identify the signal that does (is it morphology? or signaling between Is and Ib? Or with the muscle?).

      We certainly agree with the reviewer that understanding how heterosynaptic plasticity is induced is an important question and worthy of additional investigation. As stated above, the focus of our current study was to establish the tool, BoNT-C, that will now enable a variety of fascinating and important future studies, both at understanding how and why synaptic strength differs between tonic vs phasic synapses and also how heterosynaptic plasticity signaling occurs at the NMJ. It required substantial time and experimental effort to establish that BoNT-C works to cleanly silence transmission without inducing structural and functional plasticity in the current manuscript (Figures 1-7 and several supplemental figures). Respectfully, we believe it is unreasonable to expect all of this data to be relegated to a “starting point” to then go on and probe heterosynaptic plasticity in more detail, all compressed into a single paper.

      It appears this reviewer is particularly interested in heterosynaptic plasticity, which we agree is a fascinating topic. First, we should clarify that in our experiments, TNT expression does NOT induce any heterosynaptic structural or functional plasticity (see Figures 6 and Table S2), at least in our studies at m6/7, m12/13, and m4 NMJs. Rather, TNT expression alters synaptic structure in the neuron in which it is expressed (“intrinsic structural plasticity”, Fig. 6), but does not induce any changes to the convergent input. Hence, the only evidence for actual heterosynaptic plasticity is the rather minor adaptations in synaptic structure and function observed following ablation of Is motor inputs (Fig. 6 and 8).

      In addition to the important insights revealed by BoNT-C in accurately distinguishing tonic vs phasic transmission outlined above, it appears that the reviewer does not fully appreciate the mechanistic constraints that the new BoNT-C tool reveals about heterosynaptic signaling. We would therefore like to highlight the key insights our study has revealed specifically about heterosynaptic plasticity. First, we show that at the muscle 6/7 NMJ, loss of MN-Ib completely eliminates Is innervation – this was not the finding reported in the 2020 study (Ib ablation was not reported in the 2021 study). Rather, Aponte-Santiago et al. 2020 reported that elimination of Ib did not trigger compensatory changes in active zone or bouton numbers of the Is input, nor were compensatory increases in the Is EPSP reported. This may be due to the confounding variable Is innervation at the muscle 1 and muscle 4 NMJs used in the previous studies. Second, to what extent miniature transmission changes after manipulating activity from Is vs Ib could not be accurately assessed in previous studies because spontaneous activity persists following TNT expression as does innervation following rpr.hid expression. Third, and perhaps most important, our study is the only one that can demonstrate no heterosynaptic functional plasticity is induced by the physical presence but functional silencing of neurotransmitter release between tonic vs phasic inputs at NMJs with consistent innervation by both Is and Ib inputs.

      It is clear to us now that we did not do a sufficient job of emphasizing these advances our study has now revealed about the baseline and heterosynaptic relationships between Is vs Ib. We have added additional details throughout the revised manuscript to ensure these insights are highlighted in an effort for the reader to better appreciate the importance of this study.

      Overall, while an initial reading of the manuscript sounded rather exciting, a deeper analysis of the work in context of the literature of the last few years diminishes my enthusiasm for the novelty and progress provided.

      We have responded to the major criticisms raised by this reviewer above and hope that he/she/they can more fully appreciate the importance of the new tool we developed, the impact it will have on the field in opening new studies on tonic vs. phasic transmission, and establishing the rules of heterosynaptic plasticity between convergent tonic and phasic inputs on common targets.

    1. Author Response

      Reviewer #1 (Public Review):

      It should also be noted that their immunohistochemical studies of human fetal tissue for TBX5 and PTK7 are not convincing. There appears to be widespread staining of multiple cell types, suggesting either very broad expression of both genes or poor specificity of the primary antibodies.

      We appreciate the reviewer’s comment that the immunohistochemistry staining does not provide definitive evidence for the functional importance of TBX5 and PTK7 in PUV; however, these images do confirm that the proteins are ‘in the right place at the right time’ during normal human urinary tract development. We have updated the discussion on page 19, line 441-445 to emphasise this. To further support a putative role for these proteins in urinary tract development, we have added additional images from a second human embryo at the same gestation, which confirm these distinct patterns of staining (Figure 8 – figure supplement 1 on page 14, lines 313-317). Even if these proteins can also be detected in other tissues or cell types, this does not detract from this idea, as in other locations the proteins may have redundant or different roles.

      PUVs have not been described as a clinical manifestation of disease associated with mutations of either gene in humans.

      The reviewer is correct that rare variants affecting TBX5 and PTK7 have not previously been associated with PUV. They have however been associated with other developmental anomalies (as stated in the discussion on page 18, line 408-411 and page 19, line 434-437) confirming a clear role in embryonic development for both these genes.

      The fact that rare variant association testing did not identify an increased burden of rare, likely deleterious variants in these two genes (although with limited power in this cohort) suggests that PUV is not driven by ultra-rare, highly penetrant alleles in these genes. However, the identification of common and low-frequency variants using GWAS suggests a complex mode of inheritance for PUV likely in combination with maternal/in utero factors. As with other complex traits, these signals provide potential insights into the underlying biology of this disease as opposed to the diagnostic implications of conventional monogenic gene discovery associated with purely Mendelian conditions. A paragraph on the Mendelian/complex trait implications of the findings of the study has been incorporated into the discussion (page 21-22, line 594-502).

      Discuss how variants in either gene or in the patterns of structural variants that they found associated with PUV intersect with sex to result in this exclusively male condition.

      The fact that PUV is a uniquely male disease is most likely the result of differences in urethra and bladder development and length differences in urethra between males and females. Sex hormones may also potentially result in tissue-specific differences in gene expression (Ober, Loisel, and Gilad 2008). We have added a paragraph into the discussion to clarify this (page 20, line 454-463) as well as clarified the results of the chromosome X and sex-specific analyses (page 7, lines 149-155; see also Reviewer 2, point 5 below) as suggested. 

      Reviewer #2 (Public Review):

      Major:

      1. The replication study is problematic given that different genotyping methods are used for cases (targeted KASP) versus controls (WGS). This may introduce differential bias. Moreover, the ancestry of the control cohort (UK-based) does not seem to be well matched to the cases (predominantly German and Polish), and the lack of genome-wide data for the cases precludes proper adjustment for population stratification. The case-control design is also imbalanced in the replication study. The authors should reconsider their replication strategy to include a more balanced cohort with ancestry-matched controls and uniform genotyping. As an alternative, genome-wide genotyping of the replication case cohort would significantly enhance the study and should be considered.

      Many thanks to the reviewer for their valuable comments regarding the replication study case-control cohort. While different sequencing technologies were used to compare allele counts at the lead variants in the replication study (KASP genotyping for cases vs WGS for controls), both techniques exhibit > 99.5% accuracy and are subjected to variant level quality control metrics. Only individuals with reliably called genotypes were included in the replication analysis. This has been clarified in the methods section (page 30, line 693).

      We were able to obtain genome-wide genotyping data for 204 of the 395 European cases in the replication cohort. While (despite sustained effort on our part) we were unable to analyze this data jointly with the control cohort in the 100KGP due to enforced limitations on data sharing, we were able to demonstrate similar ancestry of the replication study cases and controls:  we performed PCA on a set of ~80,000 overlapping autosomal, high-quality, LD-pruned variants with MAF > 10% and projected the cases and controls separately onto (the same) data from the 1000 Genomes Project (Phase 3) labelled by ‘population’ (Figure 5). This clearly demonstrates that both cohorts have homogeneous European ancestry, as stated now in the results (page 8, lines 166-168).

      We note with thanks the reviewer’s comments regarding the case-control imbalance in the replication study which can sometimes result in a type 1 error. To address this, the case control ratio was reduced from 1:27 to 1:10.5 by including only the 4,151 male controls from the cancer cohort of the 100KGP. The results remained significant for both lead variants and have been updated in the manuscript (page 8, line 162-176; Table 2).

      When the number of controls was reduced to 500 males (a case:control ratio of 1:1.3), rs10774740 (TBX5 locus) remained significant, demonstrating that case-control imbalance was not driving the observed signal (P = 9.9 × 10⁻³; OR 0.77; 95% CI 0.63-0.94). rs144171242 (PTK7 locus), however, did not reach significance due to insufficient power (P = 0.06; OR 2.24; 95% CI 0.93-5.36). For a rare variant such as rs144171242 (MAF ~ 1%), a replication study with 500 controls is only powered to detect association with large effect size (OR > 3.5). A case:control ratio of ~1:10 is therefore optimal to maximize power to detect association, while minimizing unnecessary noise from excess controls. This has been added to the results section of the manuscript (page 8-9, lines 178-184).
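      As a rough cross-check of the figures quoted above, the standard error of the log odds ratio can be backed out of a reported 95% CI and turned into an approximate two-sided Wald P value. This is only a sketch: the function name is illustrative, the normal (Wald) approximation is an assumption, and the test actually used in the replication analysis is not stated here, so the values will be close to, but not identical with, the reported ones.

```python
import math
from statistics import NormalDist


def wald_from_ci(odds_ratio, ci_low, ci_high, level=0.95):
    """Recover SE(log OR), the z statistic and a two-sided P from an OR and its CI.

    Assumes the interval is a symmetric Wald interval on the log-odds scale.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p


# Values quoted in the response for the 500-control analysis
tbx5_se, tbx5_z, tbx5_p = wald_from_ci(0.77, 0.63, 0.94)  # rs10774740
ptk7_se, ptk7_z, ptk7_p = wald_from_ci(2.24, 0.93, 5.36)  # rs144171242

print(f"TBX5: SE={tbx5_se:.3f}, z={tbx5_z:.2f}, P={tbx5_p:.3g}")
print(f"PTK7: SE={ptk7_se:.3f}, z={ptk7_z:.2f}, P={ptk7_p:.3g}")
```

      Under this approximation the TBX5 variant stays below P = 0.05 and the PTK7 variant does not, which matches the pattern described in the response; the much wider interval for the rarer PTK7 allele reflects the larger standard error.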

      2. I am reassured that the TBX5 signal remains genome-wide significant in European-only analysis. However, the signal at PTK7 appears much less robust - it has borderline statistical significance (especially given that the authors test for all rare and common variants across the genome) and is represented by a single variant with a relatively rare risk allele that is differentially distributed by ancestry. Therefore, I would like to see more information for this specific signal:

      Information on the depth of coverage and the quality of the top variant

      This has been incorporated into the manuscript for both lead variants (Page 7, lines 142-145). For rs144171242 at the PTK7 locus, the meanDP was 29.34 and the meanGQ was 75.59.

      Information if the top PTK7 variant remain genome-wide significant after application of genomic control. Of note, the calculation of genomic inflation is dependent on sample size - lambda of 1.05 may represent an underestimate given low power of the cohort, and this point deserves at least a comment. Some methods correcting lambda for sample size have been proposed, and the authors should consider applying these methods.

      We appreciate the reviewer’s comments that the value of lambda may be affected by sample size and have added a comment to this in the manuscript (Page 7, line 136-137). Despite extensive searching, we were unable to find a recent published example of how to correct lambda for sample size and would be grateful if the reviewer could suggest a reference for this.

      To answer the reviewer’s specific question, application of genomic control to the lead variant at PTK7 results in P = 4.37 × 10⁻⁸, which remains below the threshold for conventional genome-wide significance. However, while the genomic inflation factor provides a reasonable indication of possible confounding by population structure, there are recognized limitations to applying it as a corrective factor, as it assumes that all variants are confounded, i.e., the same correction is applied irrespective of differences in population allele frequency, which can be insufficient for some variants and lead to a loss of power in others. Furthermore, in addition to sample size, lambda can vary with heritability and disease prevalence (Yang et al. 2011) and its use for correction can therefore be too conservative and reduce power to detect significant associations. In this manuscript we therefore chose to use the mixed model approach (as part of SAIGE – detailed in the methods on page 28, lines 647-648), which has largely superseded older methods such as genomic control, to robustly correct for both population structure and cryptic relatedness and minimize false positive associations (Shin and Lee 2015).
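      For readers unfamiliar with genomic control, the correction discussed above amounts to dividing each 1-df chi-square test statistic by lambda before recomputing P. The sketch below illustrates that arithmetic only; the function name and the example raw P value are hypothetical and are not taken from the manuscript.

```python
import math
from statistics import NormalDist


def genomic_control_p(p_value, lam):
    """Return the P value after dividing the implied 1-df chi-square by lambda.

    By the usual convention, lambda below 1 is treated as 1 (no deflation).
    """
    lam = max(lam, 1.0)
    z = NormalDist().inv_cdf(p_value / 2)  # lower-tail quantile; z**2 is the chi-square
    chi2_adjusted = (z * z) / lam
    # Survival function of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2_adjusted / 2))


raw_p = 2e-8  # hypothetical uncorrected P value, for illustration only
adjusted_p = genomic_control_p(raw_p, 1.05)
print(f"raw P = {raw_p:.2e}, genomic-control P = {adjusted_p:.2e}")
```

      Because the same divisor is applied to every variant, the adjustment always moves P values toward 1, which is the uniform-correction limitation the response describes.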

      This locus requires more robust replication as discussed above. If more robust replication study is not possible, additional functional studies could provide more evidence in support of this locus.

      Please refer to point 1 regarding the revised and more robust evidence of replication. 

      3. There is no validation of sensitivity and specificity of SV detection by variant size or type (e.g. inversions, deletions, duplications). Also, since burden differences are not replicated independently, the authors should stress the exploratory nature of these analyses.

      We appreciate the reviewer’s comment that there is no independent validation of SV detection (e.g., by microarray or long-read sequencing) and this was reported as a limitation of our study in the discussion (page 22-23, line 520-524). However, one of the main strengths of this study is the use of clinical-grade WGS data where all samples have been sequenced on the same platform and undergone variant calling using the same bioinformatics pipeline. This essentially eliminates confounding due to differences in data generation and processing and the sensitivity and specificity of SV detection will therefore be the same for both cases and controls.

      We agree with the reviewer that the SV analyses have not yet been replicated independently and, as they suggest, have stressed the exploratory nature of the findings in the discussion (page 21, line 491-493).

      In the discussion (especially second paragraph, but also throughout), the authors overemphasize multi-ancestry nature of their study. The reality is that the included non-Europeans are very small in numbers (18 SAS cases, 11 AFR cases, and 14 admixed cases). I would suggest for the authors to specifically state these case counts and make it clear that expanded efforts to recruit non-Europeans are still needed given these very low numbers.

      We appreciate the reviewer’s comment about the overemphasis on the multi-ancestry nature of the study and the small absolute numbers of individuals included, however as a proportion of the cohort, a third of the cases are non-European: 14% are of South Asian ancestry, 8% are of African ancestry and 11% are admixed. This breakdown comprises a greater proportion of non-white European ancestry individuals than the UK as a whole (DOI: 10.5257/census/aggregate-2001-2), where the discovery cohort was based. This provides evidence that our study eliminates at least some of the Euro-centric bias present in existing genetic and genomic literature, at least as far as the UK population is concerned. Clearly, global studies fairly representing all populations would be needed to address this issue perfectly. The case counts were reported in Table 1 but we have now referenced the low absolute numbers and included the reviewer’s suggestion about expanding efforts to recruit non-European populations in the main text (page 22, line 518-520). We have also edited paragraph two of the discussion in response to the reviewer’s comments (page 17, line 387-398).   

      Supplemental figure 2 -provide case-control counts in each ancestral group (Y axis).

      These have been added to the figure legend of Figure 6 – supplemental figure 4 (previously Figure 5 - supplemental figure 2).

      Supplemental figure 3 is misleading since allelic frequencies in the cases are pooled and are not available individually for all depicted populations.

      Figure 5 - supplemental figure 3 has been removed and replaced by Figure 6 – supplemental figure 3 to show only the individual case, control and gnomAD AF by ancestry for AFR, SAS and EUR population groups instead of using the pooled allele frequencies.

      5. I did not see details of chr. X analysis. This is important given that the case group involves only Males and control group involves both Males and Females. Also, please explain how sex was used as a fixed effect (as stated in the methods) given that the case cohort is 100% male.

      We thank the reviewer for their insightful comments. Sex was used as a covariate (or fixed effect) to control for the anatomical differences in development of the urethra (and in utero hormonal changes) between the sexes in the control cohort (clarified in the methods, page 28, lines 651-653). Given the PheWAS findings (page 13, line 292-297) reveal an association between the lead variant near TBX5 and female genital prolapse and urinary incontinence, this suggests that while women do not develop PUV (due to differences in urethral development) they may manifest other lower urinary tract phenotypes. In theory, removing the female individuals from the control cohort should therefore strengthen the association as the signal would not be diluted by ‘affected’ women (i.e., those with potentially unknown lower urinary tract phenotypes). We tested this by performing a sex-specific male-only GWAS and found that the strength of association at both lead variants increased. The results of this have been added to the manuscript (page 7, line 149-155).

      The results of the chromosome X rare variant analysis are shown on the Manhattan plot (Figure 9), with no significant genes identified. We have added chromosome X to the mixed-ancestry and European GWAS as suggested (with no significant results) and the Manhattan and Q-Q plots have been updated in Figure 2 and Figure 6. The number of analyzed variants in each analysis has also been updated accordingly.

    1. Author Response

      Reviewer #2 (Public Review):

      Feeding behaviour in C. elegans has been extensively studied over decades. Several methods of measuring feeding exist, but none can directly measure both pumping and locomotion behaviour in freely-moving worm populations. The authors have developed a new imaging-based method for automated detection of pharyngeal pumping events in freely moving C. elegans populations, and can thus simultaneously measure pumping and locomotion behaviour in tens of worms, at a single-worm, single-pump resolution that is not possible with previous methods. This user-friendly method can be applied to several research directions, such as large-scale foraging, behavioural coordination, and high throughput screening.

      The authors designed their new method to be broadly applicable and user-friendly, for easy adaptation in other research labs. However, adding direct evidence to show that "the method is relatively insensitive to the optical instrument used" will better support this claim of wider application.

      We appreciate the reviewer’s suggestion to show evidence that our method will also work on data acquired on different microscopes. We now present data obtained on a second epi-fluorescent microscope, which was downscaled and analyzed in Fig. 1H-J.

      The authors carefully benchmarked their new method against expert annotations and existing results from previous methods, to both validate their method and reveal additional advantages. They also assessed potential pitfalls of the method such as by examining the effect of fluorescence imaging on the behavioural outcome, albeit only at the timescale of minutes. The effect of longer-term fluorescence imaging should be further explored, which is relevant for large-scale foraging experiments that the authors discussed. It could be helpful to determine the maximum total exposure for the method to still be valid, both in terms of pump detection (which could be sensitive to photobleaching) and behavioural modulation (which could be sensitive to higher phototoxicity).

      We thank the reviewer for this comment. In response to their comment and related comments by the other reviewers, we have provided bleaching curves and evidence of long-term imaging to show the potential of the method for longer-scale assays. We found that with our illumination intensity (see Methods), bleaching was significant at a time scale of ~1 h. We then added triggered illumination and could extend the recording time to ~5 h (Methods). Additionally, we performed a supplementary control for viability of worms exposed to continuous (not triggered) light for 5 h. We did not observe any apparent phototoxic effect.

      Overall, the manuscript is well-written and the results are clearly presented both in terms of  statistics and interpretation. Methodological details are well-documented and openly accessible.

      We thank the reviewer for their positive view of our work and their appreciation for our efforts to document both data and software.

      Reviewer #3 (Public Review):

      In this manuscript, the authors present a method for assessing pharyngeal pumping (feeding) and locomotion in many C. elegans simultaneously. In this technique, imaging of the fluorescently labeled pharynx provides a measure of velocity and pumping rate, through analysis of the spatial variations in fluorescence.

      The technique is clearly described, well-validated, and yields some novel results. It has the advantage that it can be performed using microscopes found in many C. elegans laboratories.

      We appreciate that the reviewer recognizes the wide applicability of the method across many C. elegans laboratories.

      Some limitations of the method include its reliance on fluorescence imaging, which is a hindrance to genetic analysis, computational intensiveness, and phototoxic effects of fluorescence excitation that are not fully explored in the manuscript.

      The authors show the utility of their method by assessing pharyngeal pumping and motor behavior (1) during development, (2) in the presence or absence of food, and (3) in the presence of two mutations affecting feeding.

      Although I understand these are proof-of-principle demonstrations, I still came away feeling underwhelmed by these examples. I did not see any results here that could not have been obtained fairly easily with conventional techniques.

      We appreciate the constructive criticism of the reviewer and highlight in the revised version the fact that, using conventional techniques, such studies would require tens of hours of experiment time. We would like to emphasize the comparisons in Table 1, where we show other methods and their current limitations. Obtaining a dataset such as that in Figure 3, which comprises a total of 34 worm-hours of pumping observation from unrestrained animals, is to our knowledge currently impractical with competing methods. We would like to remind the reviewer that, using our method, we were able to reveal bimodal distributions within a population, as illustrated, for instance, in Fig. 3F, 4B, and 4F. These observations are not possible when single-worm resolution is not accessible or when large sample sizes are not feasible, as is the case with previous methods.

      Given these limitations, I feel the method's eventual impact in the field will be relatively small.

      In this study, we present a method that allows behavioral studies on worm populations at high throughput and reduced cost. Such a technique opens the door to many laboratories that cannot perform EPG recordings or microfluidics due to technical difficulties, or that want to study animals in their normal plate context. We would also like to emphasize that more than 1500 strains containing a myo-2 promoter transgene are already available from the CGC and would be amenable to our imaging approach. These transgenic strains cover broad classes of interest, such as thermotolerance, ER stress resistance, aging and neural-circuit-specific genes.

      Pharyngeal pumping has also been used as a read-out for pharmacological screens; for example, bacteria pre-loaded with pharmacological agents are tested for their effect on pharyngeal pumping rate. Pharaglow offers a high-throughput and sensitive method to measure the pumping rate. This will benefit researchers who use C. elegans pumping for pharmacological screens, and pave the way for those who plan to do so but are hindered by existing techniques.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors evaluate the involvement of the hippocampus in a fast-paced time-to-contact estimation task. They find that the hippocampus is sensitive to feedback received about accuracy on each trial and has activity that tracks behavioral improvement from trial to trial. Its activity is also related to a tendency for time estimation behavior to regress to the mean. This is a novel paradigm to explore hippocampal activity and the results are thus novel and important, but the framing as well as discussion about the meaning of the findings obscures the details of the results or stretches beyond them in many places, as detailed below.

      We thank the reviewer for their constructive feedback and were happy to read that s/he considered our approach and results as novel and important. The comments led us to conduct new fMRI analyses, to clarify various unclear phrasings regarding our methods, and to carefully assess our framing of the interpretation and scope of our results. Please find our responses to the individual points below.

      1) Some of the results appear in the posterior hippocampus and others in the anterior hippocampus. The authors do not motivate predictions for anterior vs. posterior hippocampus, and they do not discuss differences found between these areas in the Discussion. The hippocampus is treated as a unitary structure carrying out learning and updating in this task, but the distinct areas involved motivate a more nuanced picture that acknowledges that the same populations of cells may not be carrying out the various discussed functions.

      We thank the reviewer for pointing this out. We split the hippocampus into anterior and posterior sections because prior work suggested a different whole-brain connectivity and function of the two. This was mentioned in the methods section (page 15) in the initial submission but unfortunately not in the main text. Moreover, when discussing the results, we did indeed refer mostly to the hippocampus as a unitary structure for simplicity and readability, and because statements about subcomponents are true for the whole. However, we agree with the reviewer that the differences between anterior and posterior sections are very interesting, and that describing these effects in more detail might help to guide future work more precisely.

      In response to the reviewer's comment, we therefore clarified at various locations throughout the manuscript whether the respective results were observed in the posterior or anterior section of the hippocampus, and we extended our discussion to reflect the idea that different functions may be carried out by distinct populations of hippocampal cells. In addition, we also now motivate the split into the different sections better in the main text. We made the following changes.

      Page 3: “Second, we demonstrate that anterior hippocampal fMRI activity and functional connectivity tracks the behavioral feedback participants received in each trial, revealing a link between hippocampal processing and timing-task performance.”

      Page 3: “Fourth, we show that these updating signals in the posterior hippocampus were independent of the specific interval that was tested and activity in the anterior hippocampus reflected the magnitude of the behavioral regression effect in each trial.”

      Page 5: “We performed both whole-brain voxel-wise analyses as well as regions-of-interest (ROI) analysis for anterior and posterior hippocampus separately, for which prior work suggested functional differences with respect to their contributions to memory-guided behavior (Poppenk et al., 2013, Strange et al. 2014).”

      Page 9: “Because anterior and posterior sections of the hippocampus differ in whole-brain connectivity as well as in their contributions to memory-guided behavior (Strange et al. 2014), we analyzed the two sections separately.”

      Page 9: “We found that anterior hippocampal activity as well as functional connectivity reflected the feedback participants received during this task, and its activity followed the performance improvements in a temporal-context-dependent manner. Its activity reflected trial-wise behavioral biases towards the mean of the sampled intervals, and activity in the posterior hippocampus signaled sensorimotor updating independent of the specific intervals tested.”

      Page 10: “Intriguingly, the mechanisms at play may build on similar temporal coding principles as those discussed for motor timing (Yin & Troger, 2011; Eichenbaum, 2014; Howard, 2017; Palombo & Verfaellie, 2017; Nobre & van Ede, 2018; Paton & Buonomano, 2018; Bellmund et al., 2020, 2021; Shikano et al., 2021; Shimbo et al., 2021), with differential contributions of the anterior and posterior hippocampus. Note that our observation of distinct activity modulations in the anterior and posterior hippocampus suggests that the functions and coding principles discussed here may be mediated by at least partially distinct populations of hippocampal cells.”

      Page 11: Interestingly, we observed that functional connectivity of the anterior hippocampus scaled negatively (Fig. 2C) with feedback valence [...]

      2) Hippocampal activity is stronger for smaller errors, which makes the interpretation more complex than the authors acknowledge. If the hippocampus is updating sensorimotor representations, why would its activity be lower when more updating is needed?

      Indeed, we found that absolute (univariate) activity of the hippocampus scaled with feedback valence, the inverse of error (Fig. 2A). We see multiple possibilities for why this might be the case, and we discussed some of them in a dedicated discussion section (“The role of feedback in timed motor actions”). For example, prior work showed that hippocampal activity reflects behavioral feedback also in other tasks, which has been linked to learning (e.g. Schönberg et al., 2007; Cohen & Ranganath, 2007; Shohamy & Wagner, 2008; Foerde & Shohamy, 2011; Wimmer et al., 2012). In our understanding, sensorimotor updating is a form of ‘learning’ in an immediate and behaviorally adaptive manner, and we therefore consider our results consistent with this earlier work. We agree with the reviewer that in principle activity should be stronger if there was stronger sensorimotor updating, but we acknowledge that this intuition builds on an assumption about the relationship between hippocampal neural activity and the BOLD signal, which is not entirely clear. For example, prior work revealed spatially informative negative BOLD responses in the hippocampus as a function of visual stimulation (e.g. Szinte & Knapen 2020), and the effects of inhibitory activity - a leading motif in the hippocampal circuitry - on fMRI data are not fully understood. This raises the possibility that the feedback modulation we observed might also involve negative BOLD responses, which would then translate to the observed negative correlation between feedback valence and the hippocampal fMRI signal, even if the magnitude of the underlying updating mechanism was positively correlated with error. This complicates the interpretation of the direction of the effect, which is why we chose to avoid making strong conclusions about it in our manuscript. Instead, we tried discussing our results in a way that was agnostic to the direction of the feedback modulation.

      Importantly, hippocampal connectivity with other regions did scale positively with error (Fig. 2B), which we again discussed in the dedicated discussion section.

      In response to the reviewer’s comment, we revisited this section of our manuscript and felt the latter result deserved a better discussion. We therefore took this opportunity to extend our discussion of the connectivity results (including their relationship to the univariate-activity results as well as the direction of these effects), all while still avoiding strong conclusions about directionality. The following changes were made to the manuscript.

      Page 11: Interestingly, we observed that functional connectivity of the anterior hippocampus scaled negatively (Fig. 2C) with feedback valence, unlike its absolute activity, which scaled positively with feedback valence (Fig. 2A,B), suggesting that the two measures may be sensitive to related but distinct processes.

      Page 11: Such network-wide receptive-field re-scaling likely builds on a re-weighting of functional connections between neurons and regions, which may explain why anterior hippocampal connectivity correlated negatively with feedback valence in our data. Larger errors may have led to stronger re-scaling, which may be grounded in a corresponding change in functional connectivity.

      3) Some tests were one-tailed without justification, which reduces confidence in the robustness of the results.

      We thank the reviewer for pointing us to the fact that our choice of statistical tests was not always clear in the manuscript. In the analysis the reviewer is referring to, we predicted that stronger sensorimotor updating should lead to stronger activity as well as larger behavioral improvements across the respective trials. This is because a stronger update should translate to a more accurate “internal model” of the task and therefore to a better performance. We tested this one-sided hypothesis using the appropriate test statistic (contrasting trials in which behavioral performance did improve versus trials in which it did not improve), but we did not motivate our reasoning well enough in the manuscript. The revised manuscript therefore includes the two new statements shown below to motivate our choice of test statistic more clearly.

      Page 7: [...] we contrasted trials in which participants had improved versus the ones in which they had not improved or got worse (see methods for details). Because stronger sensorimotor updating should lead to larger performance improvements, we predicted to find stronger activity for improvements vs. no improvements in these tests (one-tailed hypothesis).

      Page 18: These two regressors reflect the tests for target-TTC-independent and target-TTC-specific updating, respectively. Because we predicted to find stronger activity for improvements vs. no improvements in behavioral performance, we here performed one-tailed statistical tests, consistent with the direction of this hypothesis. Improvement in performance was defined as receiving feedback of higher valence than in the corresponding previous trial.
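      The difference between a directional and a non-directional test can be illustrated with a minimal sketch. This is not the SPM12 t-test used in the manuscript: it swaps in an exact sign-flip permutation test so that it runs with the Python standard library alone, and the function name and the contrast values are hypothetical.

```python
from itertools import product
from statistics import mean


def sign_flip_p_values(contrasts):
    """Exact sign-flip permutation test on per-participant contrast values.

    Each value is assumed to be one participant's (improved - not improved)
    contrast estimate. Returns (one_tailed_p, two_tailed_p), where the
    one-tailed alternative is "mean contrast > 0".
    """
    tolerance = 1e-12  # guards against floating-point ties
    observed = mean(contrasts)
    n_directional = 0  # permuted mean at least as large as the observed mean
    n_absolute = 0     # permuted |mean| at least as large as |observed mean|
    n_total = 0
    # Under the null hypothesis the sign of each contrast is arbitrary,
    # so every combination of sign flips is equally likely.
    for signs in product((1, -1), repeat=len(contrasts)):
        permuted = mean([s * c for s, c in zip(signs, contrasts)])
        n_total += 1
        if permuted >= observed - tolerance:
            n_directional += 1
        if abs(permuted) >= abs(observed) - tolerance:
            n_absolute += 1
    return n_directional / n_total, n_absolute / n_total


# Hypothetical contrast values for eight participants (illustration only).
example_contrasts = [0.8, 0.3, 0.5, -0.1, 0.6, 0.2, 0.4, 0.7]
p_one_tailed, p_two_tailed = sign_flip_p_values(example_contrasts)
print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```

      When the observed effect lies in the predicted direction, the one-tailed p-value is half the two-tailed one, which is why a directional test needs to be motivated by a prediction made in advance.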

      4) The introduction motivates the novelty of this study based on the idea that the hippocampus has traditionally been thought to be involved in memory at the scale of days and weeks. However, as is partially acknowledged later in the Discussion, there is an enormous literature on hippocampal involvement in memory at a much shorter timescale (on the order of seconds). The novelty of this study is not in the timescale as much as in the sensorimotor nature of the task.

      We thank the reviewer for this helpful suggestion. We agree that a key part of the novelty of this study is the use of the task that is typically used to study sensorimotor integration and timing rather than hippocampal processing, along with the new insights this task enabled about the role of the hippocampus in sensorimotor updating. As mentioned in the discussion, we also agree with the reviewer that there is prior literature linking hippocampal activity to mnemonic processing on short time scales. We therefore rephrased the corresponding section in the introduction to put more weight on the sensorimotor nature of our task instead of the time scales.

      Note that the new statement still includes the time scale of the effects, but it is no longer at the center of the argument. We chose to keep it in because we do think that the majority of studies on hippocampal-dependent memory functions focus on longer time scales than our study does, and we expect that many readers will be surprised about the immediacy of how hippocampal activity relates to ongoing behavioral performance (on ultrashort time scales).

      We changed the introduction to the following.

      Page 2: Here, we approach this question with a new perspective by converging two parallel lines of research centered on sensorimotor timing and hippocampal-dependent cognitive mapping. Specifically, we test how the human hippocampus, an area often implicated in episodic-memory formation (Schiller et al., 2015; Eichenbaum, 2017), may support the flexible updating of sensorimotor representations in real time and in concert with other regions. Importantly, the hippocampus is not traditionally thought to support sensorimotor functions, and its contributions to memory formation are typically discussed for longer time scales (hours, days, weeks). Here, however, we characterize in detail the relationship between hippocampal activity and real-time behavioral performance in a fast-paced timing task, which is traditionally believed to be hippocampal-independent. We propose that the capacity of the hippocampus to encode statistical regularities of our environment (Doeller et al. 2005, Shapiro et al. 2017, Behrens et al., 2018; Momennejad, 2020; Whittington et al., 2020) situates it at the core of a brain-wide network balancing specificity vs. regularization in real time as the relevant behavior is performed.

      5) The authors used three different regressors for the three feedback levels, as opposed to a parametric regressor indexing the level of feedback. The predictions are parametric, so a parametric regressor would be a better match, and would allow for the use of all the medium-accuracy data.

      The reviewer raises a good point that overlaps with question 3 by reviewer 2. In the current analysis, we model the three feedback levels with three independent regressors (high, medium, low accuracy). We then contrast high vs. low accuracy feedback, obtaining the results shown in Fig. 2AB. The beta estimates obtained for medium-accuracy feedback are ignored in this contrast. Following the reviewer’s feedback, we therefore re-ran the model, this time modeling all three feedback levels in one parametric regressor. All other regressors in the model stayed the same. Instead of contrasting high vs. low accuracy feedback, we then performed voxel-wise t-tests on the beta estimates obtained for the parametric feedback regressor.
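      The two ways of coding feedback can be sketched as follows. This is only an illustration of the design logic, not the SPM12 model from the manuscript: the function names, the 1/2/3 coding and the mean-centering are assumptions, and convolution with a hemodynamic response function is omitted.

```python
LEVELS = ("low", "medium", "high")


def indicator_regressors(feedback_levels):
    """Three separate regressors: one 0/1 column per feedback level."""
    return {
        level: [1 if f == level else 0 for f in feedback_levels]
        for level in LEVELS
    }


def parametric_regressor(feedback_levels):
    """One regressor: feedback coded 1/2/3 and mean-centered across trials.

    The numeric coding and the centering are assumptions for illustration.
    """
    coding = {"low": 1, "medium": 2, "high": 3}
    raw = [coding[f] for f in feedback_levels]
    centre = sum(raw) / len(raw)
    return [value - centre for value in raw]


def high_vs_low_contrast(betas):
    """Contrast used with three regressors; the medium beta is not used."""
    return betas["high"] - betas["low"]


# Hypothetical sequence of feedback levels over six trials.
trials = ["high", "low", "medium", "high", "medium", "low"]
print(indicator_regressors(trials))
print(parametric_regressor(trials))
```

      With three indicator columns, the high-vs-low contrast discards the medium-accuracy estimate, whereas every trial contributes to the single parametric column.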

      The results we observed were highly consistent across the two analyses, and all conclusions presented in the initial manuscript remain unchanged. While the exact t-scores differ slightly, we replicated the effects for all clusters on the voxel-wise map (on whole-brain FWE-corrected levels) as well as for the regions-of-interest analysis for anterior and posterior hippocampus. These results are presented in a new Supplementary Figure 3C.

      Note that the new Supplementary Figure 3B shows another related new analysis we conducted in response to question 4 of reviewer 2. Here, we re-ran the initial analysis with three feedback regressors, but without modeling the inter-trial interval (ITI) and the inter-session interval (ISI, i.e. the breaks participants took) to avoid model over-specification. Again, we replicated the results for all clusters and the ROI analysis, showing that the initial results we presented are robust.

      The following additions were made to the manuscript.

      Page 5: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9 × 10⁻⁴, pfwe = 0.002, d=-0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d=-0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d=-0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d=-0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Page 17: Moreover, instead of modeling the three feedback levels with three independent regressors, we repeated the analysis modeling the three feedback levels as one parametric regressor with three levels. All other regressors remained unchanged, and the model included the regressors for ITIs and ISIs. We then conducted t-tests implemented in SPM12 using the beta estimates obtained for the parametric feedback regressor (Fig. 2C). Compared to the initial analyses presented above, this has the advantage that medium-accuracy feedback trials are considered for the statistics as well.

      6) The authors claim that the results support the idea that the hippocampus is finding an "optimal trade-off between specificity and regularization". This seems overly speculative given the results presented.

      We understand the reviewer's skepticism about this statement and agree that the manuscript does not show that the hippocampus is finding the trade-off between specificity and regularization. However, this is also not exactly what the manuscript claims. Instead, it suggests that the hippocampus “may contribute” to solving this trade-off (page 3) as part of a “brain-wide network” (pages 2,3,9,12). We also state that “Our [...] results suggest that this trade-off [...] is governed by many regions, updating different types of task information in parallel” (Page 11). To us, these phrasings are not equivalent, because we do not think that the role of the hippocampus in sensorimotor updating (or in any process really) can be understood independently from the rest of the brain. We do however think that our results are in line with the idea that the hippocampus contributes to solving this trade-off, and that this is exciting and surprising given the sensorimotor nature of our task, the ultrashort time scale of the underlying process, and the relationship to behavioral performance. We tried expressing that some of the points discussed remain speculation, but it seems that we were not always successful in doing so in the initial submission. We apologize for the misunderstanding, have adapted the corresponding statements in the manuscript, and now express even more carefully that these ideas are speculation.

      The following changes were made to the introduction and discussion.

      Page 2: Here, we approach this question with a new perspective by converging two parallel lines of research centered on sensorimotor timing and hippocampal-dependent cognitive mapping. Specifically, we test how the human hippocampus, an area often implicated in episodic-memory formation (Schiller et al., 2015; Eichenbaum, 2017), may support the flexible updating of sensorimotor representations in real time and in concert with other regions.

      Page 12: Because hippocampal activity (Julian & Doeller, 2020) and the regression effect (Jazayeri & Shadlen, 2010) were previously linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. This may explain why hippocampal activity reflected the magnitude of the regression effect as well as behavioral improvements independently from TTC, and why it reflected feedback, which informed the updating of the internal prior.

      Page 12: This is in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      Page 13: This is in line with the notion that the hippocampus [...] supports finding an optimal trade off between specificity and regularization along with other regions. [...] Our results show that the hippocampus supports rapid and feedback-dependent updating of sensorimotor representations, suggesting that it is a central component of a brain-wide network balancing task specificity vs. regularization for flexible behavior in humans.

      Note that in response to comment 1 by reviewer 2, the revised manuscript now reports the results of additional behavioral analyses that support the notion that participants find an optimal trade-off between specificity and regularization over time (independent of whether the hippocampus was involved or not).

      7) The authors find that hippocampal activity is related to behavioral improvement from the prior trial. This seems to be a simple learning effect (participants can learn plenty about this task from a prior trial that does not have the exact same timing as the current trial) but is interpreted as sensitivity to temporal context. The temporal context framing seems too far removed from the analyses performed.

      We agree with the reviewer that our observation that hippocampal activity reflects TTC-independent behavioral improvements across trials could have multiple explanations. Critically, i) one of them is that the hippocampus encodes temporal context, ii) it is only one of multiple observations that we build our interpretation on, and iii) our interpretation builds on multiple earlier reports.

      Interval estimates regress toward the mean of the sampled intervals, an effect that is often referred to as the “regression effect”. This effect, which we observed in our data too (Fig. 1B), has been proposed to reflect the encoding of temporal context (e.g. Jazayeri & Shadlen 2010). Moreover, there is a large body of literature on how the hippocampus may support the encoding of spatial and temporal context (e.g. see Bellmund, Polti & Doeller 2020 for review).

      Because both hippocampal activity and the regression effect were linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. If so, one would expect that hippocampal activity should reflect behavioral improvements independently from TTC, it should reflect the magnitude of the regression effect, and it should generally reflect feedback, because it is the feedback that informs the updating of the internal prior.

      All three observations may indeed have independent explanations, but they are all also in line with the idea that the hippocampus does encode temporal context and that this explains the relationship between hippocampal activity and the regression effect. It therefore reflects a parsimonious and reasonable explanation in our opinion, even though it necessarily remains an interpretation. Of course, we want to be clear on what our results are and what our interpretations are.

      In response to the reviewer’s comment, we therefore toned down two of the statements that mention temporal context in the manuscript, and we removed an overly speculative statement from the result section. In addition, the discussion now describes more clearly how our results are in line with this interpretation.

      Abstract: This is in line with the idea that the hippocampus supports the rapid encoding of temporal context even on short time scales in a behavior-dependent manner.

      Page 13: This is in line with the notion that the hippocampus encodes temporal context in a behavior-dependent manner, and that it supports finding an optimal trade off between specificity and regularization along with other regions.

      Page 12: Because hippocampal activity (Julian & Doeller, 2020) and the regression effect (Jazayeri & Shadlen, 2010) were previously linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. This may explain why hippocampal activity reflected the magnitude of the regression effect as well as behavioral improvements independently from TTC, and why it reflected feedback, which informed the updating of the internal prior.

      The following statement was removed, overlapping with comment 2 by Reviewer 3:

      Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time.

      8) I am not sure the term "extraction of statistical regularities" is appropriate. The term is typically used for more complex forms of statistical relationships.

      We agree with the reviewer that this expression may be interpreted differently by different readers and are grateful to be pointed to this fact. We therefore removed it and instead added the following (hopefully less ambiguous) statement to the manuscript.

      Page 9: This study investigated how the human brain flexibly updates sensorimotor representations in a feedback-dependent manner in the service of timing behavior.

      Reviewer #2 (Public Review):

      The authors conducted a study involving functional magnetic resonance imaging and a time-to-contact estimation paradigm to investigate the contribution of the human hippocampus (HPC) to sensorimotor timing, with a particular focus on the involvement of this structure in specific vs. generalized learning. Suggestive of the former, it was found that HPC activity reflected time interval-specific improvements in performance while in support of the latter, HPC activity was also found to signal improvements in performance, which were not specific to the individual time intervals tested. Based on these findings, the authors suggest that the human HPC plays a key role in the statistical learning of temporal information as required in sensorimotor behaviour.

      By considering two established functions of the HPC (i.e., temporal memory and generalization) in the context of a domain that is not typically associated with this structure (i.e., sensorimotor timing), this study is potentially important, offering novel insight into the involvement of the HPC in everyday behaviour. There is much to like about this submission: the manuscript is clearly written and well-crafted, the paradigm and analyses are well thought out and creative, the methodology is generally sound, and the reported findings push us to consider HPC function from a fresh perspective. A relative weakness of the paper is that it is not entirely clear to what extent the data, at least as currently reported, reflects the involvement of the HPC in specific and generalized learning. Since the authors' conclusions centre around this observation, clarifying this issue is, in my opinion, of primary importance.

      We thank the reviewer for these positive and extremely helpful comments, which we will address in detail below. In response to these comments, the revised manuscript clarifies why the observed performance improvements are not at odds with the idea that an optimal trade-off between specificity and regularization is found, and how the time course of learning relates to those reported in previous literature. In addition, we conducted two new fMRI analyses, ensuring that our conclusions remain unchanged even if feedback is modeled with one parametric regressor, and if the number of nuisance regressors is reduced to control for overparameterization of the model. Please find our responses underneath each individual point below.

      1) Throughout the manuscript, the authors discuss the trade-off between specific and generalized learning, and point towards Figure S1D as evidence for this (i.e., participants with higher TTC accuracy exhibited a weaker regression effect). What appears to be slightly at odds with this, however, is the observation that the deviation from true TTC decreased with time (Fig S1F) as the regression line slope approached 0.5 (Fig S1E) - one would have perhaps expected the opposite i.e., for deviation from true TTC to increase as generalization increases. To gain further insight into this, it would be helpful to see the deviation from true TTC plotted for each of the four TTC intervals separately and as a signed percentage of the target TTC interval (i.e., (+) or (-) deviation) rather than the absolute value.

      We thank the reviewer for raising this important question and for the opportunity to elaborate on the relationship between the TTC error and the magnitude of the regression effect in behavior. Indeed, we see that the regression slopes approach 0.5 and that the TTC error decreases over the course of the experiment. We do not think that these two observations are at odds with each other for the following reasons:

      First, while the reviewer is correct in pointing out that the deviation from the TTC should increase as “generalization increases”, that is not what we found. It was not the magnitude of the regularization per se that increased over time; rather, overall task performance became more optimal with respect to both objectives: specificity and generalization. This optimum is at a regression-line slope of 0.5. Generalization (or regularization, as we refer to it in the present manuscript) therefore did not increase per se at the group level.

      Second, the regression slopes approached 0.5 on the group-level, but the individual participants approached this level from different directions: Some of them started with a slope value close to 1 (high accuracy), whereas others started with a slope value close to 0 (near full regression to the mean). Irrespective of which slope value they started with, over time, they got closer to 0.5 (Rebuttal Figure 1A). This can also be seen in the fact that the group-level standard deviation in regression slopes becomes smaller over the course of the experiment (Rebuttal Figure 1B, SFig 1G). It is therefore not generally the case that the regression effect becomes stronger over time, but that it becomes more optimal for longer-term behavioral performance, which is then also reflected in an overall decrease in TTC error. Please see our response to the reviewer’s second comment for more discussion on this.
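
The slope values discussed above can be illustrated with a minimal sketch. It assumes the slope is an ordinary least-squares fit of estimated TTC on true TTC; the target durations and the helper name `regression_slope` are hypothetical and chosen only for illustration, not taken from the study.

```python
from statistics import mean

def regression_slope(true_ttc, estimated_ttc):
    """Ordinary least-squares slope of estimated TTC on true TTC."""
    mx, my = mean(true_ttc), mean(estimated_ttc)
    sxy = sum((x - mx) * (y - my) for x, y in zip(true_ttc, estimated_ttc))
    sxx = sum((x - mx) ** 2 for x in true_ttc)
    return sxy / sxx

# Hypothetical target TTCs in seconds (illustrative values only).
targets = [0.5, 1.0, 1.5, 2.0]
grand_mean = mean(targets)

# Three idealized response profiles.
veridical = list(targets)                                # responses on the diagonal
full_regression = [grand_mean for _ in targets]          # responses at the grand mean
halfway = [0.5 * t + 0.5 * grand_mean for t in targets]  # midway between the two

slopes = [regression_slope(targets, est)
          for est in (veridical, full_regression, halfway)]
```

Under these assumptions the three profiles yield slopes of 1, 0 and 0.5, which correspond to veridical performance, full regression to the mean, and the intermediate value the group converged towards.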

      Third, the development of task performance is a function of two behavioral factors: a) the accuracy and b) the precision in TTC estimation. Accuracy describes how similar the participant’s TTC estimates were to the true TTC, whereas precision describes how similar the participant’s TTC estimates were relative to each other (across trials). Our results reflect the fact that participants became, on average, both more accurate and more precise over time. To demonstrate this point visually, we now plotted the Precision and the Accuracy for the 8 task segments below (Rebuttal Figure 1C, SFig 1H), showing that both measures increased as time progressed and more trials were performed. This was the case for all target durations.

      In response to the reviewer’s comment, we clarified in the main text that these findings are not at odds with each other. Furthermore, we made clear that regularization per se did not increase over time on group level. We added additional supporting figures to the supplementary material to make this point. Note that in our view, these new analyses and changes more directly address the overall question the reviewer raised than the figure that was suggested, which is why we prioritized those in the manuscript.

      However, we appreciated the suggestion a lot and added the corresponding figure for the sake of completeness.

      The following additions were made.

      Page 5: In support of this, participants' regression slopes converged over time towards the optimal value of 0.5, i.e. the slope value between veridical performance and the grand mean (Fig. S1F; linear mixed-effects model with task segment as a predictor and participants as the error term, F(1) = 8.172, p = 0.005, ε² = 0.08, CI: [0.01, 0.18]), and participants' slope values became more similar (Fig. S1G; linear regression with task segment as predictor, F(1) = 6.283, p = 0.046, ε² = 0.43, CI: [0, 1]). Consequently, this also led to an improvement in task performance over time on group level (i.e. task accuracy and precision increased (Fig. S1I), and the relationship between accuracy and precision became stronger (Fig. S1H), linear mixed-effects model results for accuracy: F(1) = 15.127, p = 1.3 × 10⁻⁴, ε² = 0.06, CI: [0.02, 0.11], precision: F(1) = 20.189, p = 6.1 × 10⁻⁵, ε² = 0.32, CI: [0.13, 1], accuracy-precision relationship: F(1) = 8.288, p = 0.036, ε² = 0.56, CI: [0, 1], see methods for model details).

      Page 12: This suggests that different regions encode distinct task regularities in parallel to form optimal sensorimotor representations to balance specificity and regularization. This is in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      Page 15: We also corroborated this effect by measuring the dispersion of slope values between participants across task segments using a linear regression model with task segment as a predictor and the standard deviation of slope values across participants as the dependent variable (Fig. S1G). As a measure of behavioral performance, we computed two variables for each target-TTC level: sensorimotor timing accuracy, defined as the absolute difference in estimated and true TTC, and sensorimotor timing precision, defined as coefficient of variation (standard deviation of estimated TTCs divided by the average estimated TTC). To study the interaction between these two variables for each target TTC over time, we first normalized accuracy by the average estimated TTC in order to make both variables comparable. We then used a linear mixed-effects model with precision as the dependent variable, task segment and normalized accuracy as predictors and target TTC as the error term. In addition, we tested whether accuracy and precision increased over the course of the experiment using separate linear mixed-effects models with task segment as predictor and participants as the error term.
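
The two behavioral measures defined above can be sketched as follows. This is a minimal illustration under stated assumptions: accuracy is aggregated as the mean absolute difference across trials, the sample standard deviation is used for the coefficient of variation, and the function names and example estimates are hypothetical.

```python
from statistics import mean, stdev

def timing_accuracy(estimates, target):
    """Mean absolute difference between estimated and true TTC (assumed aggregation)."""
    return mean(abs(e - target) for e in estimates)

def timing_precision(estimates):
    """Coefficient of variation: SD of estimated TTCs divided by their mean.

    The sample standard deviation is an assumption; the text does not specify it.
    """
    return stdev(estimates) / mean(estimates)

def normalized_accuracy(estimates, target):
    """Accuracy divided by the average estimated TTC, making it comparable to precision."""
    return timing_accuracy(estimates, target) / mean(estimates)

# Hypothetical TTC estimates (seconds) for one target-TTC level.
estimates = [0.9, 1.1, 1.0, 1.2]
target = 1.0

acc = timing_accuracy(estimates, target)
prec = timing_precision(estimates)
norm_acc = normalized_accuracy(estimates, target)
```

Note that both quantities are error-like as defined here: smaller values correspond to responses closer to the true TTC and to less trial-to-trial variability, respectively.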

      2) Generalization relies on prior experience and can be relatively slow to develop as is the case with statistical learning. In Jazayeri and Shadlen (2010), for instance, learning a prior distribution of 11-time intervals demarcated by two briefly flashed cues (compared to 4 intervals associated with 24 possible movement trajectories in the current study) required ~500 trials. I find it somewhat surprising, therefore, that the regression line slope was already relatively close to 0.5 in the very first segment of the task. To what extent did the participants have exposure to the task and the target intervals prior to entering the scanner?

      We thank the reviewer for raising the important question about the time course of learning in our task and how our results relate to prior work on this issue. Addressing the specific reviewer question first, participants practiced the task for 2-3 minutes prior to scanning. During the practice, they were not specifically instructed to perform the task as well as they could nor to encode the intervals, but rather to familiarize themselves with the general experimental setup and to ask potential questions outside the MRI machine. While they might have indeed started encoding the prior distribution of intervals during the practice already, we have no way of knowing, and we expect the contribution of this practice on the time course of learning during scanning to be negligible (for the reasons outlined above).

      However, in addition to the specific question the reviewer asked, we feel that the comment raises two more general points: 1) How long does it take to learn the prior distribution of a set of intervals as a function of the number of intervals tested, and 2) Why are the learning slopes we report quite shallow already in the beginning of the scan?

      Regarding (1), we are not aware of published reports that answer this question directly, and we expect that this will depend on the task that is used. Regarding the comparison to Jazayeri & Shadlen (2010), we believe the learning time course is difficult to compare between our study and theirs. As the reviewer mentioned, our study featured only 4 intervals compared to 11 in their work, based on which we would expect much faster learning in our task than in theirs. We did indeed sample 24 movement directions, but these were irrelevant in terms of learning the interval distribution. Moreover, unlike Jazayeri & Shadlen (2010), our task featured moving stimuli, which may have added additional sensory, motor and proprioceptive information in our study which the participants of the prior study could not rely on.

      Regarding (2), and overlapping with the reviewer’s previous comment, the average learning slope in our study is indeed close to 0.5 already in the first task segment, but we would like to highlight that this is a group-level measure. At the beginning of the experiment, the learning slopes of some subjects were closer to 1 (i.e. the diagonal in Fig 1B), and those of others were closer to 0 (i.e. the mean). The median slope was close to 0.65. Importantly, the slopes of most participants still approached 0.5 in the course of the experiment, and so did even the group-level slope the reviewer is referring to. This also means that participants’ slopes became more similar in the course of the experiment, and they approached 0.5, which we think reflects the optimal trade-off between regressing towards the mean and regressing towards the diagonal (in the data shown in Fig. 1B). This convergence onto the optimal trade-off value can be seen in many measures, including the mean slope (Rebuttal Figure 1A, SFig 1F), the standard deviation in slopes (Rebuttal Figure 1B, SFig 1G) as well as the Precision vs. Accuracy tradeoff (Rebuttal Figure 1C, SFig 1H). We therefore think that our results are well in line with prior literature, even though a direct comparison remains difficult due to differences in the task.

      In response to the reviewer’s comment, and related to their first comment, we made the following addition to the discussion section.

      Page 12: This suggests that different regions encode distinct task regularities in parallel to form optimal sensorimotor representations to balance specificity and regularization. This is well in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      3) I am curious to know whether differences between high-accuracy and medium-accuracy feedback as well as between medium-accuracy and low-accuracy feedback predicted hippocampal activity in the first GLM analysis (middle page 5). Currently, the authors only present the findings for the contrast between high-accuracy and low-accuracy feedback. Examining all feedback levels may provide additional insight into the nature of hippocampal involvement and is perhaps more consistent with the subsequent GLM analysis (bottom page 6) in which, according to my understanding, all improvements across subsequent trials were considered (i.e., from low-accuracy to medium-accuracy; medium-accuracy to high-accuracy; as well as low-accuracy to high-accuracy).

      We thank the reviewer for this thoughtful question, which relates to question 5 by reviewer 1. The reviewer is correct that the contrast shown in Fig 2 does not consider the medium-accuracy feedback levels, and that the model in itself is slightly different from the one used in the subsequent analysis presented in Fig. 3. To reply to this comment as well as to a related one by reviewer 1 together, we therefore repeated the full analysis while modeling the three feedback levels in one parametric regressor, which includes the medium-accuracy feedback trials, and is consistent with the analysis shown in Fig. 3. The results of this new analysis are presented in the new Supplementary Fig. 3C.

      In short, the model included one parametric regressor with three levels reflecting the three types of feedback, and all nuisance regressors remained unchanged. Instead of contrasting high vs. low accuracy feedback, we then performed voxel-wise t-tests on the beta estimates obtained for the parametric feedback regressor. We found that our results presented initially were very robust: Both the observed clusters in the voxel-wise analysis (on whole-brain FWE-corrected levels) as well as the ROI results replicated across the two analyses, and our conclusions therefore remain unchanged.

      We made multiple textual additions to the manuscript to include this new analysis, and we present the results of the analysis including a direct comparison to our initial results in the new Supplementary Fig. 3. The following textual additions were made.

      Page 5: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9 × 10⁻⁴, pfwe = 0.002, d = -0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d = -0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d = -0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d = -0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Page 17: Moreover, instead of modeling the three feedback levels with three independent regressors, we repeated the analysis modeling the three feedback levels as one parametric regressor with three levels. All other regressors remained unchanged, and the model included the regressors for ITIs and ISIs. We then conducted t-tests implemented in SPM12 using the beta estimates obtained for the parametric feedback regressor (Fig. S2C). Compared to the initial analyses presented above, this has the advantage that medium-accuracy feedback trials are considered for the statistics as well.

      4) The authors modeled the inter-trial intervals and periods of rest in their univariate GLMs. This approach of modelling all 'down time' can lead to model over-specification and inaccurate parameter estimation (e.g. Pernet, 2014). A comment on this approach as well as consideration of not modelling the inter-trial intervals would be useful.

      This is an important issue that we did not address in our initial manuscript. We are aware and agree with the reviewer’s general concern about model over-specification, which can be a big problem in regression as it leads to biased estimates. We did examine whether our model was overspecified before running it, but we did not report a formal test of it in the manuscript. We are grateful to be given the opportunity to do so now.

      In response to the reviewer’s comment, we repeated the full analysis shown in Fig. 2 while excluding the nuisance regressors for inter-trial intervals (ITI) and breaks (or inter-session intervals, ISI). All other regressors and analysis steps stayed unchanged relative to the one reported in Fig. 2. The new results are presented in a new Supplementary Figure 3B.

      As in our previous analysis, we again see that the results we initially presented were extremely robust even on whole-brain FWE corrected levels, as well as on ROI level. Our conclusions therefore remain unchanged, and the results we presented initially are not affected by potential model overspecification. In addition to the new Supplementary Figure 3B, we made multiple textual changes to the manuscript to describe this new analysis and its implications. Note that we used the same nuisance regressors in all other GLM analyses too, meaning that it is also very unlikely that model overspecification affects any of the other results presented. We thank the reviewer for suggesting this analysis, and we feel including it in the manuscript has further strengthened the points we initially made.

      The following additions were made to the manuscript.

      Page 16: The GLM included three boxcar regressors modeling the feedback levels, one for ITIs, one for button presses and one for periods of rest (inter-session interval, ISI) [...]

      Page 16: ITIs and ISIs were modeled to reduce task-unrelated noise, but to ensure that this did not lead to over-specification of the above-described GLM, we repeated the full analysis without modeling the two. All other regressors including the main feedback regressors of interest remained unchanged, and we repeated both the voxel-wise and ROI-wise statistical tests as described above (Fig. S2B).

      Page 17: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9 × 10⁻⁴, pfwe = 0.002, d = -0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d = -0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d = -0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d = -0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Reviewer #3 (Public Review):

      This paper reports the results of an interesting fMRI study examining the neural correlates of time estimation with an elegant design and a sensorimotor timing task. Results show that hippocampal activity and connectivity are modulated by performance on the task as well as the valence of the feedback provided. This study addresses a very important question in the field which relates to the function of the hippocampus in sensorimotor timing. However, a lack of clarity in the description of the MRI results (and associated methods) currently prevents the evaluation of the results and the interpretations made by the authors. Specifically, the model testing for timing-specific/timing-independent effects is questionable and needs to be clarified. In the current form, several conclusions appear to not be fully supported by the data.

      We thank the reviewer for pointing us to many methodological points that needed clarification. We apologize for the confusion about our methods, which we clarify in the revised manuscript. Please find our responses to the individual points below.

      Major points

      Some methodological points lack clarity which makes it difficult to evaluate the results and the interpretation of the data.

      We really appreciate the many constructive comments below. We feel that clarifying these points improved our manuscript immensely.

      1) It is unclear how the 3 levels of accuracy and feedback (high, medium, and low performance) were computed. Please provide the performance range used for this classification. Was this adjusted to the participants' performance?

      The formula that describes how the response window was computed for the different speed levels was reported in the methods section of the original manuscript on page 13. It reads as follows:

      “The following formula was used to scale the response window width: d ± ((k ∗ d)/2) where d is the target TTC and k is a constant proportional to 0.3 and 0.6 for high and medium accuracy, respectively.”

      In response to the reviewer’s comment, we now additionally report the exact ranges of the different response windows in a new Supplementary Table 1 and refer to it in the Methods section as follows.

      Page 10: To calibrate performance feedback across different TTC durations, the precise response window widths of each feedback level scaled with the speed of the fixation target (Table S1).
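
The response-window formula quoted above can be illustrated with a short sketch. It assumes that k takes the values 0.3 and 0.6 directly, that window boundaries are inclusive, and that any response outside the medium-accuracy window counts as low accuracy; the function names and the example target TTC are hypothetical, and the exact ranges used in the study are those in Table S1.

```python
K_HIGH = 0.3    # assumed constant for the high-accuracy window
K_MEDIUM = 0.6  # assumed constant for the medium-accuracy window

def response_window(target_ttc, k):
    """Return (lower, upper) bounds of the window d ± ((k * d) / 2)."""
    half_width = (k * target_ttc) / 2
    return (target_ttc - half_width, target_ttc + half_width)

def feedback_level(estimate, target_ttc):
    """Classify a TTC estimate as 'high', 'medium' or 'low' accuracy."""
    lower, upper = response_window(target_ttc, K_HIGH)
    if lower <= estimate <= upper:
        return "high"
    lower, upper = response_window(target_ttc, K_MEDIUM)
    if lower <= estimate <= upper:
        return "medium"
    return "low"  # assumption: everything outside the medium window

# Hypothetical target TTC of 1.0 s.
high_window = response_window(1.0, K_HIGH)      # roughly 0.85 to 1.15 s
medium_window = response_window(1.0, K_MEDIUM)  # roughly 0.70 to 1.30 s
levels = [feedback_level(e, 1.0) for e in (1.10, 1.25, 1.50)]
```

Because the half-width is proportional to d, the windows widen for longer target TTCs, which is how the feedback is calibrated across durations.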

      2) The description of the MRI results lacks details. It is not always clear in the results section which models were used and whether parametric modulators were included or not in the model. This makes the results section difficult to follow. For example,

      a) Figure 2: According to the description in the text, it appears that panels A and B report the results of a model with 3 regressors, ie one for each accuracy/feedback level (high, medium, low) without parametric modulators included. However, the figure legend for panel B mentions a parametric modulator suggesting that feedback was modelled for each trial as a parametric modulator. The distinction between these 2 models must be clarified in the result section.

      We thank the reviewer very much for spotting this discrepancy. Indeed, Figure 2 shows the results obtained for a GLM in which we modeled the three feedback levels with separate regressors, not with one parametric regressor. Instead, the latter was the case for Figure 3. We apologize for the confusion and corrected the description in the figure caption, which now reads as follows. The description in the main text and the methods remain unchanged.

      Caption Fig. 2: We plot the beta estimates obtained for the contrast between high vs. low feedback.

      Moreover, note that in response to comment 5 by reviewer 1 and comment 3 by reviewer 2, the revised manuscript now additionally reports the results obtained for the parametric regressor in the new Supplementary Figure 3C. All conclusions remain unchanged.

      Additionally, it is unclear how Figure 2A supports the following statement: "Moreover, the voxel-wise analysis revealed similar feedback-related activity in the thalamus and the striatum (Fig. 2A), and in the hippocampus when the feedback of the current trial was modeled (Fig. S3)." This is confusing as Figure 2A reports an opposite pattern of results between the striatum/thalamus and the hippocampus. It appears that the statement highlighted above is supported by results from a model including current trial feedback as a parametric modulator (reported in Figure S3).

      We agree with the reviewer that our result description was confusing and changed it. It now reads as follows.

      Page 5: Moreover, the voxel-wise analysis revealed feedback-related activity also in the thalamus and the striatum (Fig. 2A) [...]

      Also, note that it is unclear from Figure 2A what is the direction of the contrast highlighting the hippocampal cluster (high vs. low according to the text but the figure shows negative values in the hippocampus and positive values in the thalamus). These discrepancies need to be addressed and the models used to support the statements made in the results sections need to be explicitly described.

      The description of the contrast is correct. Negative values indicate smaller errors and therefore better feedback, which is mentioned in the caption of Fig. 2 as follows:

      “Negative values indicate that smaller errors, and higher-accuracy feedback, led to stronger activity.”

      Note that the timing error determined the feedback, and that we predicted stronger updating and therefore stronger activity for larger errors (similar to a prediction error). We found the opposite. We mention the reasoning behind this analysis at various locations in the manuscript e.g. when talking about the connectivity analysis:

      “We reasoned that larger timing errors and therefore low-accuracy feedback would result in stronger updating compared to smaller timing errors and high-accuracy feedback”

      In response to the reviewer’s remark, we clarified this further by adding the following statement to the result section.

      Page 5: “Using a mass-univariate general linear model (GLM), we modeled the three feedback levels with one regressor each plus additional nuisance regressors (see methods for details). The three feedback levels (high, medium and low accuracy) corresponded to small, medium and large timing errors, respectively. We then contrasted the beta weights estimated for high-accuracy vs. low-accuracy feedback and examined the effects on group-level averaged across runs.”

      b) Connectivity analyses: It is also unclear here which model was used in the PPI analyses presented in Figure 2. As it appears that the seed region was extracted from a high vs. low contrast (without modulators), the PPI should be built using the same model. I assume this was the case as the authors mentioned "These co-fluctuations were stronger when participants performed poorly in the previous trial and therefore when they received low-accuracy feedback." if this refers to low vs. high contrast. Please clarify.

      Yes, the PPI model was built using the same model. We clarified this in the methods section by adding the following statement to the PPI description.

      Page 17: “The PPI model was built using the same model that revealed the main effects used to define the HPC sphere.”

      Yes, the reviewer is correct in thinking that the contrast shows the difference between low vs. high-accuracy feedback. We clarified this in the main text as well as in the caption of Fig. 2.

      Caption Fig 2: [...] We plot results of a psychophysiological interactions (PPI) analysis conducted using the hippocampal peak effects in (A) as a seed for low vs. high-accuracy feedback. [...]

      Page 17: The estimated beta weight corresponding to the interaction term was then tested against zero on the group-level using a t-test implemented in SPM12 (Fig. 2C). The contrast reflects the difference between low vs. high-accuracy feedback. This revealed brain areas whose activity was co-varying with the hippocampus seed ROI as a function of past-trial performance (n-1).

      c) It is unclear why the model testing TTC-specific / TTC-independent effects (results presented in Figure 3) used 2 parametric modulators (as opposed to building two separate models with a different modulator each). I wonder how the authors dealt with the orthogonalization between parametric modulators with such a model. In SPM, the orthogonalization of parametric modulators is based on the order of the modulators in the design matrix. In this case, parametric modulator #2 would be orthogonalized to the preceding modulator so that a contrast focusing on the parametric modulator #2 would highlight any modulation that is above and beyond that explained by modulator #1. In this case, modulation of brain activity that is TTC-specific would have to be above and beyond a modulation that is TTC-independent to be highlighted. I am unsure that this is what the authors wanted to test here (or whether this is how the MRI design was built). Importantly, this might bias the interpretation of their results as - by design - it is less likely to observe TTC-specific modulations in the hippocampus as there is significant TTC-independent modulation. In other words, switching the order of the modulators in the model (or building two separate models) might yield different results. This is an important point to address as this might challenge the TTC-specific/TTC-independent results described in the manuscript.

      We thank the reviewer for raising this important issue. When running the respective analysis, we made sure that the regressors were not collinear and we therefore did not expect substantial overlap in shared variance between them. However, we agree with the reviewer that orthogonalizing one regressor with respect to the other could still affect the results. To make sure that our expectations were indeed met, we therefore repeated the main analysis twice: 1) switching the order of the modulators and 2) turning orthogonalization off (which is possible in SPM12 unlike in previous versions). In all cases, our key results and conclusions remained unchanged, including the central results of the hippocampus analyses.

      Anterior (ant.) / Posterior (post.) Hippocampus ROI analysis with A) original order of modulators, B) switching the order of the modulators and C) turning orthogonalization of modulators off. A-C) Orange color corresponds to the TTC-independent condition whereas light-blue color corresponds to the TTC-specific condition. Statistics reflect p < 0.05 at Bonferroni-corrected levels (*) obtained using a group-level one-tailed one-sample t-test against zero; A) pfwe = 0.017, B) pfwe = 0.039, C) pfwe = 0.039.

      Because orthogonalization did not affect the conclusions, the new manuscript simply reports the analysis for which it was turned off. Note that these new figures are extremely similar to the original figures we presented, which can be seen in the exemplary figure below showing our key results at a liberal threshold for transparency. In addition, we clarified that orthogonalization was turned off in the methods section as follows.

      Page 18: These two regressors reflect the tests for target-TTC-independent and target-TTC-specific updating, respectively, and they were not orthogonalized to each other.
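
To make the order dependence raised by the reviewer concrete, the following is a simplified sketch of serial orthogonalization between two parametric modulators. It is not SPM's implementation: it only projects one mean-centered vector out of another, and the modulator values and helper names are hypothetical.

```python
def mean_center(values):
    """Subtract the mean so the modulator is centered on zero."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def orthogonalize(later, earlier):
    """Remove from `later` the component that is explained by `earlier`."""
    scale = dot(later, earlier) / dot(earlier, earlier)
    return [l - scale * e for l, e in zip(later, earlier)]

# Two hypothetical, partially correlated modulators (one value per trial).
mod1 = mean_center([1, 2, 3, 4, 5, 6])
mod2 = mean_center([2, 1, 4, 3, 6, 5])

# Order A: modulator 2 is orthogonalized with respect to modulator 1.
mod2_orth = orthogonalize(mod2, mod1)
# Order B: modulator 1 is orthogonalized with respect to modulator 2.
mod1_orth = orthogonalize(mod1, mod2)
```

In order A the shared variance is credited to modulator 1, in order B to modulator 2, so the later modulator only captures what remains. With orthogonalization turned off, both modulators enter the model unchanged and neither is given priority.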

      Comparison of old & new results: also see Fig. 3 and Fig. S5 in manuscript

      d) It is also unclear how the behavioral improvement was coded/classified "we contrasted trials in which participants had improved versus the ones in which they had not improved or got worse"- It appears that improvement computation was based on the change of feedback valence (between high, medium and low). It is unclear why performance wasn't used instead? This would provide a finer-grained modulation?

      We thank the reviewer for the opportunity to clarify this important point. First, we chose to model feedback because it is the feedback that determines whether participants update their “internal model” or not. Without feedback, they would not know how well they performed, and we would not expect to find activity related to sensorimotor updating. Second, behavioral performance and received feedback are tightly correlated, because the former determines the latter. We therefore do not expect to see major differences in results obtained between the two. Third, we did in fact model both feedback and performance in two independent GLMs, even though the way the results were reported in the initial submission made it difficult to compare the two.

      Figure 4 shows the results obtained when modeling behavioral performance in the current trial as an F-contrast, and Supplementary Fig 4 shows the results when modeling the feedback received in the current trial as a t-contrast. While the voxel-wise t-maps/F-maps are also quite similar, we now additionally report the t-contrast for the behavioral-performance GLM in a new Supplementary Figure 4C. The t-maps obtained for these two different analyses are extremely similar, confirming that the direction of the effects as well as their interpretation remain independent of whether feedback or performance is modeled.

      The revised manuscript refers to the new Supplementary Figure 4C as follows.

      Page 17: In two independent GLMs, we analyzed the time courses of all voxels in the brain as a function of behavioral performance (i.e. TTC error) in each trial, and as a function of feedback received at the end of each trial. The models included one mean-centered parametric regressor per run, modeling either the TTC error or the three feedback levels in each trial, respectively. Note that the feedback itself was a function of TTC error in each trial [...] We estimated weights for all regressors and conducted a t-test against zero using SPM12 for our feedback and performance regressors of interest on the group level (Fig. S4A). [...]

      Page 17: In addition to the voxel-wise whole-brain analyses described above, we conducted independent ROI analyses for the anterior and posterior sections of the hippocampus (Fig. S2A). Here, we tested the beta estimates obtained in our first-level analysis for the feedback and performance regressors of interest (Fig. S4B; two-tailed one-sample t tests: anterior HPC, t(33) = -5.92, p = 1.2 × 10^-6, p_FWE = 2.4 × 10^-6, d = -1.02, CI: [-1.45, -0.6]; posterior HPC, t(33) = -4.07, p = 2.7 × 10^-4, p_FWE = 5.4 × 10^-4, d = -0.7, CI: [-1.09, -0.32]). See section "Regions of interest definition and analysis" for more details.
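
      The effect sizes and corrected p values quoted above can be cross-checked from the t statistics. The sketch below rests on assumptions that the response does not state explicitly: that d is Cohen's d for a one-sample test (d = t / sqrt(n)), that t(33) implies n = 34 participants, and that the FWE-corrected p values reflect a Bonferroni correction across the two ROIs. The function names are illustrative only.

```python
import math


def cohens_d_from_t(t_value: float, n: int) -> float:
    """Cohen's d for a one-sample t test, recovered as d = t / sqrt(n)."""
    return t_value / math.sqrt(n)


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-corrected p value, capped at 1 (assumed correction method)."""
    return min(1.0, p_value * n_tests)


n_participants = 34  # assumption: degrees of freedom (33) + 1
n_rois = 2           # anterior and posterior hippocampus

d_anterior = cohens_d_from_t(-5.92, n_participants)
d_posterior = cohens_d_from_t(-4.07, n_participants)
p_anterior_corrected = bonferroni(1.2e-6, n_rois)
p_posterior_corrected = bonferroni(2.7e-4, n_rois)

print(round(d_anterior, 2), round(d_posterior, 2))
print(p_anterior_corrected, p_posterior_corrected)
```

      Under these assumptions the recovered values agree with the reported d and corrected p values to the precision given in the text.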

      If the feedback valence was used to classify trials as improved or not, how was this modelled (one regressor for improved, one for no improvement? As opposed to a parametric modulator with performance improvement?).

      We apologize for the lack of clarity regarding our regressor design. In response to this comment, we adapted the corresponding paragraph in the methods to state more clearly that improvement trials and no-improvement trials were modeled with two separate parametric regressors, in line with the reviewer’s understanding. The new paragraph reads as follows.

      Page 18: One regressor modeled the main effect of the trial and two parametric regressors modeled the following contrasts: Parametric regressor 1: trials in which behavioral performance improved vs. parametric regressor 2: trials in which behavioral performance did not improve or got worse relative to the previous trial.

      Last, it is also unclear how ITI was modelled as a regressor. Did the authors mean a parametric modulator here? Some clarification on the events modelled would also be helpful. What was the onset of a trial in the MRI design? The start of the trial? The end? The onset of the prediction time?

      The inter-trial intervals (ITIs) were modeled as a boxcar regressor convolved with the hemodynamic response function. They span the time between the feedback-phase offset and the subsequent trial onset. Moreover, the start of the trial was the moment when the visual-tracking target started moving after the ITI, whereas the trial end was the offset of the feedback phase (i.e. the moment in which the feedback disappeared from the screen). The onset of the “prediction time” was the moment in which the visual-tracking target stopped moving, prompting participants to estimate the time-to-contact. We now explain this more clearly in the methods as shown below.

      Page 16: The GLM included three boxcar regressors modeling the feedback levels, one for ITIs, one for button presses and one for periods of rest (inter-session interval, ISI), which were all convolved with the canonical hemodynamic response function of SPM12. The start of the trial was taken as the trial onset for modeling (i.e. the time when the visual-tracking target started moving). The trial end was the offset of the feedback phase (i.e. the moment in which the feedback disappeared from the screen). The ITI was the time between the offset of the feedback phase and the subsequent trial onset.
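
      To illustrate what a boxcar regressor convolved with a hemodynamic response function looks like, here is a minimal pure-Python sketch. It is not the authors' SPM12 pipeline: the double-gamma shape (peak gamma with shape 6, undershoot gamma with shape 16, ratio 1/6) is only an approximation of the canonical HRF, and the onsets, durations, sampling step and function names are hypothetical values chosen for the example.

```python
import math


def gamma_pdf(t: float, shape: float) -> float:
    """Gamma density with unit scale; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * math.exp(-t) / math.gamma(shape)


def double_gamma_hrf(dt: float, duration: float = 32.0) -> list:
    """Approximate canonical HRF sampled every dt seconds (assumed parameters)."""
    n = int(duration / dt)
    return [gamma_pdf(i * dt, 6.0) - gamma_pdf(i * dt, 16.0) / 6.0 for i in range(n)]


def boxcar(onsets, durations, n_samples: int, dt: float) -> list:
    """1 while an event is on, 0 otherwise."""
    x = [0.0] * n_samples
    for onset, dur in zip(onsets, durations):
        start = int(round(onset / dt))
        stop = int(round((onset + dur) / dt))
        for i in range(start, min(stop, n_samples)):
            x[i] = 1.0
    return x


def convolve(signal, kernel, dt: float) -> list:
    """Discrete convolution truncated to the length of the signal."""
    out = [0.0] * len(signal)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, k in enumerate(kernel):
            if i + j < len(out):
                out[i + j] += s * k * dt
    return out


dt = 0.5                      # hypothetical sampling step in seconds
n_samples = 120               # 60 s of simulated time
iti_onsets = [10.0, 30.0]     # hypothetical ITI onsets (s)
iti_durations = [2.0, 2.0]    # hypothetical ITI durations (s)

iti_boxcar = boxcar(iti_onsets, iti_durations, n_samples, dt)
iti_regressor = convolve(iti_boxcar, double_gamma_hrf(dt), dt)
```

      The resulting regressor is zero before the first ITI onset and peaks several seconds after it, which is the delayed response the convolution is meant to capture.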

      On a related note, in response to question 4 by reviewer 2, we have now repeated one of the main analyses (Fig. 2) without modeling the ITI (or the inter-session interval, ISI). We found that our key results and conclusions are independent of whether or not these time points were modeled. These new results are presented in the new Supplementary Figure 3B.

      Page 16: ITIs and ISIs were modeled to reduce task-unrelated noise, but to ensure that this did not lead to over-specification of the above-described GLM, we repeated the full analysis without modeling the two. [...]

      1. Perhaps as a result of a lack of clarity in the result section and the MRI methods, it appears that some conclusions presented in the result section are not supported by the data. E.g. "Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time." The data show that hippocampal activity is higher during and after an accurate trial. This pattern of results could be attributed to various processes such as e.g. reward or learning etc. I would recommend not providing such interpretations in the result section and addressing these points in the discussion.

      Similar to above, statements like "These results suggest that the hippocampus updates information that is independent of the target TTC". The data show that higher hippocampal activity is linked to greater improvement across trials independent of the timing of the trial. The point about updating is rather speculative and should be presented in the discussion instead of the result section.

      The reviewer is referring to two statements in the results section that reflect our interpretation rather than a description of the results. In response to the reviewer’s comment, we therefore removed the following statement from the results.

      Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time.

      In addition, we replaced the remaining statement by the following. We feel this new statement makes clear why we conducted the analysis that is described without offering an interpretation of the results that were presented before.

      Page 8: We reasoned that updating TTC-independent information may support generalization performance by means of regularizing the encoded intervals based on the temporal context in which they were encoded.

    1. Author Response

      Reviewer #1 (Public Review):

      7) Can the primary cells in Figure 2E and AML#1 and AML#2 be studied for mTORC1 activity by Western, as in 2D?

      For reasons that we do not understand, we have been unable to effectively culture primary FLT3-ITD AMLs, despite being able to culture most other AMLs for weeks. This issue has prevented us from being able to perform biochemical analyses of FLT3-ITD AMLs in response to FLT3 inhibition.

      8) Additional genetic information should be provided if possible for the primary AML cells - what other mutations in addition to FLT3 were present? Were there any mTOR pathway alterations?

      We have reported the additional mutation in the AML#1 sample (an NPM1 mutation) in the Methods section "Therapeutic modeling in mice" as well as in the legends of Figures 2E and 3D. There were no evident alterations in the mTOR pathway (beyond the FLT3-ITD mutation).

    1. Author Response:

      We thank the reviewers for their thoughtful critiques and helpful suggestions for how to improve our manuscript. Described below is our response clarifying a number of issues raised by the reviewers.

      We agree with the reviewer that we cannot definitively conclude that the first division chromosome segregation defects and the later mid-blastula transition CI-induced defects are the result of distinct mechanisms. In fact, we raise this possibility in the discussion. However, our finding that the CI phenotype induces temporally and developmentally deferred chromosome segregation defects in the late blastoderm divisions (in addition to the well-studied first division defect) alters the established view of the CI phenotype and must be taken into account when considering mechanisms of CI. Our current view is that the distinct early and late defects could be caused by either 1) a common mechanism (possibly a chromosome mark/defect inherited through the early blastoderm divisions causing segregation defects in the late blastoderm divisions) or 2) distinct early and late mechanisms that do not strictly “depend” upon one another. We have clarified this point in the revised manuscript.

      We disagree with the reviewer that this result is to be expected given previous studies. In D. simulans, a small percentage of embryos derived from the CI cross hatch. These embryos are thought to have bypassed the first division defect. It is not obvious why there must be late defects in these embryos that “escape” early CI-induced defects and subsequently hatch. Previous studies interpreted embryos that exhibit late division errors as those that have lost their entire paternal complement of chromosomes as a result of strong CI-induced defects during the first mitotic division and develop as maternal haploids. These studies, including those of transgene-induced CI, have focused primarily on embryos that exhibit defects in the first mitotic division. To the best of our knowledge, no group has thoroughly examined embryos that progress normally through the pre-cortical cycles 2-9, as was done in this manuscript. Thus, it was entirely unexpected that these embryos would exhibit mitotic defects during the late blastoderm divisions and the MBT. We discuss how this finding requires modifying current models for the mechanisms of CI.

      Regarding the comment that “the primary claim of the paper that later-stage embryos die for different reasons than early-stage embryos,” we make no such claim. In fact, we provide evidence that the failure to hatch (late embryonic lethality) is, at least in part, due to haploid development, a direct result of the first division CI defect. The focus of our studies is those CI-derived embryos that progress normally, maintain the normal complement of chromosomes through the first division, and exhibit chromosome segregation errors during the late blastoderm divisions. We do not know the fate of these embryos, and previous studies have demonstrated that embryos suffering extensive late blastoderm segregation errors are able to hatch (Sullivan, 1990, Development 110:311-323). We have clarified these points in the discussion.

      While we agree that transgenic tools have proven invaluable in the study of CI, they are not appropriate for these studies. The purpose of our study was to undertake an unbiased re-examination of the CI phenotype. Of necessity, the transgenic studies rely on exogenous host promoters rather than the natural endogenous Wolbachia/Prophage promoters. Thus, while informative, it is unlikely that the transgenic alleles would capture all of the complexities and nuance of the CI phenotype. In addition, the transgenic studies of which we are aware have only interrogated a single pair of the CI-inducing genes, while the Wolbachia genome contains both Cid and Cin CI-associated gene pairs and possibly other yet-to-be-identified CI/Rescue genes.

      Our unbiased re-examination of the CI phenotype induced by W. riverside in D. simulans identified a previously unsuspected temporally and developmentally distinct set of CI-induced defects that occur during and after the mid-blastula transition. This finding must be taken into account when considering the mechanisms that cause CI. In our revisions, we clarify the above points and qualify our statements to appropriately interpret our results in context of the nuances and uncertainties of CI and early Drosophila embryogenesis.

    1. Author Response:

      Reviewer #3 (Public Review):

      The authors revealed the novel role of the DLL-4-Notch1-NICD signaling axis in platelet activation, aggregation, and thrombus formation. They first confirmed the expression of Notch1 and DLL-4 in human platelets and demonstrated that both Notch1 and DLL-4 were upregulated in response to thrombin stimulation. Further, they confirmed that exposure of human platelets to DLL-4 leads to γ-secretase-mediated release of NICD (a calpain substrate). Stimulating platelets with DLL-4 alone triggered platelet activation, as measured by integrin αIIbβ3 activation, P-selectin translocation, granule release, enhanced platelet-neutrophil and platelet-monocyte interactions, intracellular calcium mobilization, PEV release, phosphorylation of cytosolic proteins, and PI3K and PKC activation. In addition, Susheel N. Chaurasia et al. showed that when platelets were stimulated with DLL-4 and low-dose thrombin, Notch1 signaling can operate in a juxtacrine manner to potentiate low-dose thrombin-mediated platelet activation. When the DLL-4-Notch1-NICD signaling axis was blocked by γ-secretase inhibitors, platelet responses to stimulation were attenuated, and arterial thrombosis in mice was impaired.

      This study by Susheel N. Chaurasia et al. was carefully designed and used multiple approaches to test their hypothesis. Their research raised the potential of targeting the DLL-4-Notch1-NICD signaling axis for anti-platelet and anti-thrombotic therapies. Interestingly, compared to thrombin, a potent physiological platelet agonist, the signaling cascade triggered by DLL-4 was relatively weak. Given that the upregulation of DLL-4 and Notch1 happened in response to thrombin stimulation, how much DLL-4-mediated signaling could contribute to in vivo platelet activation in the presence of thrombin is questionable. This could potentially limit the application of targeting Notch1 as an anti-thrombotic therapy. Further, the authors showed that Notch1 signaling could operate in a juxtacrine manner to potentiate low-dose thrombin-mediated platelet activation, which means that DLL-4-mediated platelet signaling can act as an accelerator of early-stage hemostasis. Again, inhibition of Notch1 may slow down the hemostasis process. But given that other platelet agonists (ADP, collagen...) are present simultaneously, blocking Notch1 signaling may not have a strong anti-platelet effect.

      We concur with the Public Reviewer that further study is needed to delineate the extent to which DLL-4 signaling contributes in thrombin-activated platelets. However, it is now amply clear that Notch signaling plays a central role in the development of the thrombin-activated phenotype of platelets. Further, DLL-4-Notch1 interaction on the surfaces of adjacent platelets within the thrombus reinforces platelet-platelet aggregate formation. This is further reflected in the significant inhibition of thrombus formation in vivo in the presence of DAPT in a mouse model of intravital thrombosis. Given that there is a lot of redundancy in the stimulation of platelets by different physiological agonists (ADP, collagen, thrombin etc.), none of the present-day drugs is fully capable of effective platelet inhibition, owing to parallel signaling pathways. Thus, the discovery of Notch signaling and its seminal role in platelet activation could explain the redundancy associated with anti-platelet drugs, and Notch inhibition could complement existing anti-platelet regimens in evoking effective and complete platelet inhibition.