  1. Sep 2022
    1. Author Response

      Reviewer #3 (Public Review):

      The study by Randzavola and colleagues provides a follow-up of their previous publication (Thomas DC et al, J Exp Med 2017) describing EROS (Essential for Reactive Oxygen Species or C17Orf62) as a novel chaperone essentially required to support the phagocyte Nox NADPH Oxidase respiratory burst and bacterial killing. Here, the authors extend the investigation of the mechanism underlying EROS effect and show its very early binding in the endoplasmic reticulum and interaction with immature partially glycosylated forms of gp91phox (the catalytic subunit of the Nox complex), allowing the incorporation of heme and subsequent binding of p22phox, which later follows the usual steps for complex maturation. A novel finding was the association of EROS with the OST component of the N-glycosylation machinery. An extended proteome analysis confirmed that EROS is quite specific for the gp91phox/p22phox complex and also for the purinergic P2X7 receptor, which also interacts with EROS (as also shown previously by the authors and further investigated by Ryoden et al. J Immunol 2020). The authors further validate EROS binding to P2X7 and provide evidence that EROS loss-of-function impairs P2X7-associated functions. Particularly, mice with genetically ablated EROS show improved survival to influenza infection.

      A major strength of this line of investigation is the clear functional importance of EROS in the regulation of the protein expression of the Nox complex components. Previous work has clearly shown that human EROS deficiency is associated with the severe immunodeficiency Chronic Granulomatous Disease, which is usually caused by genetic deficiency of the Nox complex components. Indeed, loss or gain of function of EROS is very clearly associated with major changes in the expression of those components, indicating the functional relevance of EROS. Moreover, the interplay between the P2X7 receptor and EROS is also relevant, given that this receptor mediates an important arm of innate immunity, namely the nucleotide-driven inflammasome activation. Thus, the authors are likely dealing with important novel information which may be of broad impact for understanding several aspects of innate and even adaptive immunity.

      Enthusiasm for this article, however, is somewhat decreased by some aspects, as follows:

      1) While there is a substantial amount of new data, the corresponding progress in depth of mechanistic insights has not been commensurate, bearing in mind the authors' previous work. The novel findings are the clearer documentation of the EROS/gp91phox interaction and its time-course during nascent gp91phox protein processing in the ER, as well as their interplay with the OST complex. The extended list of proteins associating with EROS essentially confirms previous findings. Also, the work with P2X7 mostly confirms previous findings, while the novel and interesting experiment with EROS-silenced mice and viral infection needs further work, as commented below.

      We thank the reviewer for this comment and for seeking clarity on novelty. We have addressed this above and in the discussion section. We have not reported the EROS interactome by mass spectrometry in previous work.

      2) Some aspects of these results are less than clearcut. The association between gp91phox and EROS is generally convincing, but for many experiments the authors make wide use of transfections of tagged protein constructs. One can clearly understand that this is possibly the only feasible approach at this time, however these constructs carry the intrinsic problem of possible protein misfolding, which would make them a potentially artificial target of any endoplasmic reticulum chaperone-like protein such as EROS. This would impact exactly on the very mechanism the authors are proposing for EROS effects, i.e., early protein processing.

      We understand Reviewer 3’s concerns about using tagged constructs. However, all transfection experiments depicted in Figure 1 have been done with untagged constructs and in different cell types in both mouse and human systems. The whole approach is also validated by extensive previous work showing the ability of transfected p22phox to augment gp91phox expression (Yu et al., J Biol Chem 1997; PMID: 9341176). All our experiments showed the same result, namely the stabilisation of the 58kDa gp91phox precursor. We have now included data showing that we can immunoprecipitate endogenous gp91phox in PLB985 cells and detect endogenous EROS (Figure 3, figure supplement 1A), which confirms the specificity of the association between gp91phox and EROS. In the same sample, we can also detect endogenous p22phox (our positive control), which is well known to associate as a heterodimer with gp91phox. Furthermore, transfection of our constructs does not induce significant ER stress in HEK293 cells. Based on our own data and that of other investigators, we argue that this is a valid and useful approach to demonstrating the ability of EROS to increase gp91phox abundance. Moreover, this is just one of many orthogonal techniques used in the manuscript.

      3) The same consideration applies to the experiments in Figure 3 with the OST complex STT3A. The co-localizations shown by the authors are technically acceptable, but their meaning is unclear, given it is expected that the proteins EROS and OST occupy the same compartment, being ER-located proteins, especially if transfected as constructs (tagged or not).

      The experiment was done to assess the localisation of gp91phox relative to EROS and STT3A, which are known to occupy the ER compartment, as pointed out by the reviewer. Since HEK293 cells do not express gp91phox, this microscopy analysis allowed us to determine whether some population of gp91phox could be detected with EROS and STT3A at the ER, as opposed to its localisation as a mature protein at the plasma membrane and within granules in phagocytic cells.

      4) It would be important to assess whether cells receiving such constructs depict markers of endoplasmic reticulum stress and/or show impaired survival.

      This has been addressed in Reviewer 3’s recommendation for author point 2.

      5) The experiments with co-transfection in HEK293 cells of EROS, Nox1 and Nox4 provide results at variance with the authors' data in their previous work, in which endogenous Nox1 (intestine) and Nox4 (kidney) had no changes in expression in genetically silenced EROS mice.

      We thank the reviewer for this comment and acknowledge that this introduces some ambiguity. In showing the augmentation of NOX1 and NOX4 but not p22phox or NOX5, we are demonstrating that it is likely that EROS can bind and stabilise NOX proteins that also require p22phox. In the case of NOX4, this is also supported by our yeast two-hybrid data. Thus, these data suggest that EROS can bind p22phox-dependent NOX proteins. The key question is whether EROS has a physiological role in controlling the expression of other NOX proteins. Although we addressed this in our previous study, we have done so in a more extensive way in this manuscript. In particular, we note the subsequent publication by Diebold et al. (Methods Mol Biol 2019; PMID: 31172474), which points out that many commercially available antibodies are non-specific. Detailed examination showed this to be the case for the antibody we used in Thomas et al. (J Exp Med, 2017; PMID: 28351984). We therefore undertook specific analysis with the anti-mouse NOX1 antibody clone from Dr C. Yabe-Nishimura and Dr Misaki Matsumoto.

      Similarly, our work in Thomas et al. (J Exp Med, 2017; PMID: 28351984) suggested that NOX4 is certainly present in the kidneys of EROS-/- mice. However, this was a limited analysis, as it was not the main focus of the paper, and the conclusion was that there was no drastic effect on NOX4 expression of the kind observed for NOX2. For the revisions to this paper, we examined a cohort of 4 control and 4 EROS-/- mice and showed that EROS does not physiologically regulate NOX4 in the kidney.

      Thus, the use of HEK293 cells, which do not express NOX proteins, as a reductionist system may favour the effect of EROS on NOX1 and NOX4 abundance upon transfection of the constructs. One possible explanation could be that EROS binds to a conserved motif present on NOX1, NOX2 and NOX4 which is readily accessible in the system we are using.

      6) The article is conceptually divided into two parts. However, there is no clear cross-fertilization between them and they essentially do not integrate.

      Although the reviewer notes that there seem to be two separate stories, this reflects the fact that we have extensively characterised the function of EROS and found that it specifically and profoundly affects only two distinct pathways in immunity, which is significant in itself. A strength of our manuscript is our extensive granular mass spectrometry approach, which shows the specificity of EROS in 2 different cell types in which up to 8000 proteins have been detected. We have therefore placed the control of P2X7 and gp91phox-p22phox in the context of the entire proteome. Our paper defines just how specific EROS is in its physiological effects, and we therefore focus on the two major pathways that are affected by EROS deficiency. We integrate this in the final figure by showing how the combined lack of gp91phox and P2X7 leads to resistance to influenza A, in contrast to the susceptibility to certain bacterial infections.

      7) While the authors claim that "the loss of both ROS and P2X7 signalling leads to resistance to influenza infection", this was not in fact shown in this work. It is known that P2X7 deficiency protects against influenza infection. Thus, it follows naturally that EROS deficiency, which essentially eliminates the expression of P2X7, would have the same effect. However, the role of ROS and gp91phox, i.e. whether or not they add to this equation, remains unclear.

      We thank the reviewer for this comment. The role of phagocyte NADPH oxidase-derived ROS has been explored in gp91phox deficiency and we apologise if this is not made clear in our manuscript. We have now added the following text to the discussion section of the manuscript:

      “A particular strength of our study is that we show marked in vivo sequelae of the lack of P2X7. EROS deficiency leads to profound susceptibility to bacterial infection but protects mice from infection with influenza A. This is likely to reflect the fact that mice that are (i) deficient in gp91phox, (ii) deficient in P2X7 or (iii) treated with P2X7 inhibitors have improved outcomes following infection with influenza A, and it raises intriguing questions about the physiological role of EROS. Snelgrove et al showed that gp91phox deficiency improved outcomes in influenza A. gp91phox knockout mice exhibited a reduced influenza titre in the lung parenchyma. Inflammatory infiltrate into the lung parenchyma was markedly reduced and lung function significantly improved (Snelgrove et al., 2006). To et al then showed that the phagocyte NADPH oxidase is activated by single stranded RNA and DNA viruses in endocytic compartments. This causes endosomal hydrogen peroxide generation, which suppresses antiviral and humoral signalling networks via modification of a highly conserved cysteine residue (Cys98) on Toll-like receptor-7. In this study, targeted inhibition of endosomal reactive oxygen species production using cholestanol-conjugated gp91dsTAT (Cgp91ds-TAT) abrogated influenza A virus pathogenicity (To et al., 2017). This group went on to explore infection with a more pathogenic influenza A strain, PR8, using the same specific inhibitor. Cgp91ds-TAT reduced airway inflammation, including neutrophil influx and alveolitis, and enhanced the clearance of lung viral mRNA following PR8 infection (To et al., 2019). This group has also shown that NOX1 (Selemidis et al., 2013) and NOX4 (Hendricks et al., 2022) can drive pathogenic inflammation in influenza A, emphasising the importance of clarifying the roles of EROS in control of expression of these proteins.

      In studies on P2X7, Rosli et al showed that mice infected with 10⁵ PFU of influenza A HKx31 had improved outcomes if they were treated with a P2X7 inhibitor at day 3 post infection and every two days thereafter. Survival was also improved even if the inhibitor was given on day 7 post infection following a lethal dose of the mouse-adapted PR8. This was associated with reduced cellular infiltration and pro-inflammatory cytokine secretion in bronchoalveolar lavage fluid, but viral titres were not measured (Rosli et al., 2019). Leyva-Grado et al examined influenza A infection in P2X7 knockout mice. They infected mice with both influenza A/Puerto Rico/08/1934 virus and influenza A/Netherlands/604/2009 H1N1pdm virus. They showed that P2X7 receptor deficiency led to improved survival after infection with both viruses, with less weight loss (Leyva-Grado et al., 2017). Production of proinflammatory cytokines and chemokines was impaired and there were fewer cellular hallmarks of severe infection such as infiltration of neutrophils and depletion of CD11b+ macrophages. It is worth noting that the P2X7 knockout strain used in this study was the Pfizer strain, in which some splice variants of P2X7 are still expressed (Bartlett et al., 2014). Hence, the dual loss of the phagocyte NADPH oxidase and P2X7 in EROS-/- mice likely confers protection from IAV infection. By reducing the expression of both NOX2 and P2X7, EROS regulates two pathways that may be detrimental in influenza A, and we speculate that EROS may physiologically act as a rheostat controlling certain types of immune response.”

    1. Author Response

      Reviewer #1 (Public Review):

      This well-written paper combines a novel method for assaying ubiquitin-proteasome system (UPS) activity with a yeast genetic cross to study genetic variation in this system. Many loci are mapped, and a few genes and causal polymorphisms are identified. A connection between UPS variation and protein abundance is made for one gene, demonstrating that variation in this system may affect phenotypic variation.

      The major strength of the study is the power of yeast genetics, which makes it possible to dissect quantitative traits down to the nucleotide level. The weakness is that it is not clear whether the observed UPS variation matters on any level; however, the claims are suitably moderate and generally supported.

      We agree with the reviewer that understanding how causal variants for ubiquitin-proteasome system (UPS) activity affect other molecular, cellular, and organismal phenotypes is an important area of future research.

      The paper provides a nice example of how it is possible to genetically dissect an "endo-phenotype", and learn some new biology. It also represents a welcome attempt to put the function of a mechanism that is heavily studied in molecular cell biology in a broader context.

      We thank the reviewer for these kind words.

      Reviewer #2 (Public Review):

      In this manuscript, the authors developed an elegant quantitative reporter assay to identify quantitative trait loci that regulate the N-end rule pathway, a major quality control mechanism in eukaryotes. By crossing two yeast species with divergent proteostasis activity, they generated a population that showed broad variation in proteostasis activity. By sequencing and mapping the underlying loci, they have identified several genes that regulate N-end rule activity. They then verified them using precise genetic tools, validating the power of their approach.

      Overall, it is a very solid manuscript that would be highly interesting for the quality control field.

      In general, I really liked this manuscript for these reasons:

      • Uses fluorescent timers elegantly to quantitatively measure protein degradation.

      • Validates the approach in depth, showing the readers how the tool works.

      • Uses the power of yeast genetics and bulk segregant analysis to map loci that may have small effects.

      • Validates the mapped loci using precise genetic tools.

      In a field that is dominated by biochemistry, this manuscript will be a fresh breath of air…

      We thank the reviewer for their thoughtful evaluation of our work and these kind words.

      Reviewer #3 (Public Review):

      This manuscript, "Variation in Ubiquitin System Genes Creates Substrate-Specific Effects on Proteasomal Protein Degradation" studies the genetic basis of differences in protein degradation. The authors do so by screening natural genetic variation in two yeast strains, finding several genes and often several variants within each gene that can affect protein degradation efficiency by the Ubiquitin-Proteasome system (UPS). Many of these variants have "substrate-specific effects" meaning they only affect the degradation of specific proteins (those with specific degrons). Also, many variants located within the same genes have conflicting effects, some of which are larger than others and can mask others. Overall, this study reveals a complex genetic basis for protein degradation.

      Strengths: Revealing the genetic basis for any complex trait, such as protein degradation, is a major goal of biology. The results of this paper make a significant step towards the goal of mapping the genes and variants involved in this specific trait. Fine mapping methods are used to home in on the specific variants involved and to measure their effects. This is very nicely done and provides a detailed view of the genetic basis of protein degradation. Further, the GFP/RFP system used to quantify the efficiency of the protein degradation system is a very elegant system. Also, the completeness of the analysis, meaning that all 20 N-degrons were studied, is impressive and leads to very detailed findings. It is interesting that some genetic variants have larger and opposite effects on the degradation of different N-degrons.

      We thank the reviewer for these positive comments.

      Weaknesses: Some of the results discussed in this paper are not surprising. For example, the finding that both large effect and small effect genetic variants contribute to this complex trait is not at all surprising. This is true of many complex traits.

      We agree with the reviewer that the number and patterns of QTLs we observe are perhaps not unexpected given that most traits are genetically complex. However, we also note that our results stand in stark contrast to previous efforts to understand how natural genetic variation affects the UPS, which have focused almost exclusively on large-effect mutations in UPS genes that cause rare Mendelian disorders. We have therefore chosen to retain our discussion of the complex genetic architecture of the UPS.

      The discussion of human disease is also a bit extensive given this study was performed on yeast. It might be more productive to use these findings to understand the UPS better on a mechanistic level. Why does the same genetic variant have opposite effects on the degradation of different degrons, even in cases where those degrons are of the same type?

      Following the reviewer’s suggestion we have removed multiple references to human disease from the introduction. We retained paragraph 3 of the introduction (previously, lines 43-55, pg. 2, para. 2 in the revised manuscript), which discusses disease-causing mutations in UPS genes, because the examples presented highlight two important motivations for our work: (1) individual genetic differences create variation in UPS activity and (2) much of our knowledge of how natural genetic variation affects the UPS comes from these rare, limited examples. However, we have re-written the paragraph to focus on these points and removed descriptions of the clinical manifestations of the disorders mentioned.

      We agree with the reviewer that understanding the mechanistic basis of substrate-specific variant effects on distinct N-degrons is important. However, doing so would require additional experiments that we argue are outside the scope of the current study.

      Overall, this manuscript excels at mapping the genetic basis of variation in the UPS system. It demonstrates a very complex mapping from genotype to phenotype that begs for further mechanistic explanation. These results are important to the UPS field because they may help researchers interrogate this highly conserved essential system. The manuscript is weaker when it comes to the broader conclusions drawn about the relative importance of large vs. small effects variants on complex traits, the amount of heritability explained, and the effects of genetic variation on protein abundance vs transcript abundance. Though in the case of protein vs transcript, I feel the cursory examination of the trends is perhaps at an appropriate level for the study, as it is mainly meant to show these things differ rather than to show exactly how and why they differ.

      We state that the distribution of QTL effect sizes for UPS activity consists of many QTLs with small effects and few QTLs of large effects. While this result is similar to patterns observed for other complex traits, it differs dramatically from the results of previous studies of genetic influences on the UPS, which have been largely confined to large-effect variants. Given these differences, we think it is appropriate and worthwhile to emphasize the complex genetic architecture of UPS activity.

      We agree that estimating the fraction of heritability explained by our QTLs and variants would be valuable. However, as noted in our response to Reviewer 1, the QTL mapping method we used does not permit ready calculation of heritability estimates due to its pooled nature.

      The reviewer is correct in noting that the primary goal of our RNA-seq and proteomics experiments was to provide an initial exploration of the effects of causal variants for UPS activity on global gene expression at the protein and mRNA levels. While a comprehensive dissection of the effects of this and other causal variants is an important area of future work, our results here show broad changes in global gene expression and establish that the causal UBR1 variant affects gene expression at the protein and mRNA levels.

      Reviewer #4 (Public Review):

      Overall the paper is clear and well-written. The experimental design is elegant and powerful, and it's a stimulating read. Most QTL mapping has focused on directly measurable phenotypes such as expression or drug response; I really like this paper's distinctive approach of placing bespoke functional assays for a specific molecular mechanism into the classical QTL framework.

      We thank the reviewer for their thoughtful evaluation of the work and positive comments.

    1. Author Response

      Reviewer #2 (Public Review):

      Recent advances in the investigation of functional brain connectivity have allowed the identification of the main connectivity gradient from unimodal to transmodal brain regions. Gao et al. aimed to test whether this connectivity gradient changes according to task demands and, if so, whether this change was also related to the complexity of brain signals evoked by events of various task demands. Their results are three-fold. 1) They first compared the gradient of connectivity obtained during a semantic relatedness judgment task to a purely visual detection task and to a resting state. a) They found that the same main gradient could be extracted from the three conditions, making it suitable for investigating the effect of word relatedness. b) Additionally, they showed that word relatedness modulates the main gradient: when words are close, the gradient was strengthened, i.e., the dissociation between unimodal and transmodal areas was sharpened. 2) The authors found that the strength of word associations modulates the complexity of brain signals: the closer the words, the more convergent brain signals across participants and trials were, particularly in the transmodal areas of the main gradient. 3) They found that transmodal brain regions in the gradient were similarly activated in participants with similar relatedness judgments. Finally, they tested the link between the three results above using mediation analysis. They showed that the dimensionality difference (result 2) mediated the link between the gradient in the semantic task (result 1a) and the interindividual similarities in semantic judgment and brain activation (result 3). Altogether, this study demonstrates that the main gradient state is predictive of both task variations and inter-individual similarities of task responses. Those results suggest that gradients are a relevant measure of functional connectivity for investigating the variation of connectivity within a task and between individuals. Overall, the results support the conclusions.

      • Strengths:

      1. The main strength of the article is the methods used to obtain the results. Gradients of functional connectivity are a new measure that goes beyond classical brain network functional connectivity. Investigating the dynamics of gradients during a semantic task allows us to better understand how different brain regions (unimodal, transmodal, belonging to some specific networks, etc.) adapt to variability in a task.

      The second strength is the topic: the question is relevant to researchers interested in semantic memory or processing and to any researcher interested in brain dynamics within and between individuals. The demonstration is elegant, and the behavioral task is simple; it compensates for the complexity of the methods.

      • Weaknesses:

      1. The main weakness of the article is the lack of details about the performed analyses, which prevents a clear understanding of the results. The complexity of their methods calls for a crystal-clear description of them. The reader is not informed about how statistics are computed. New terms are sometimes used to describe already mentioned results, making reading the article particularly difficult.

      Thanks very much for the suggestions on statistics. We have now significantly updated our manuscript; please see our detailed reply to Essential Revision.

      2. Conceptually, the authors assumed that during the task, participants generated a word linking the pair of words displayed on the screen and that the neural and cognitive processes solely vary along with the distance between the two words of the pair. However, when words are close, it is not obvious that individuals will generate a third word to link them, and it might be even more challenging to find a linking word in that case as opposed to when words are quite distant from each other. Considering those potential confounds, the interpretation of the results could be different. The authors always contrast very high versus very low distance, so the observed results could also be interpreted as "observing a link" versus "generating a word link"; the first scenario is much more cognitively simple, and this could also explain the differences they observed.

      Sorry that we did not explain our task instruction clearly in our initial submission. The participants were not instructed to generate a linking word specifically and the link was typically expressed in multiple words and could involve imagery as well as words. For this reason, we are not sure that a simple recognition/generation distinction will capture the different neural effects that relate to high and low associations. However, the text now acknowledges that multiple cognitive processes could contribute to the differences we observe, including recognition vs. generation, more automatic retrieval vs. more controlled retrieval, and processes associated with creativity. We have acknowledged multiple ways that the neural patterns could be interpreted in the discussion. Please see page 29.

      ‘Though our results are in line with the controlled semantic cognition framework in general, multiple cognitive processes could contribute to the differences we observe between strong and weak associations, including observing vs. generating semantic links, more automatic retrieval vs. more controlled retrieval, imagery, and processes associated with creativity.’

      Reviewer #3 (Public Review):

      With resting-state fMRI data, recent work has mapped the organisation of the cortex along a continuous gradient, and regions that share similar patterns of functional connectivity are located at similar points on the gradient (Margulies et al., 2016). In the current study, the authors investigate how this dimension of connectivity changes during conceptual retrieval with different levels of semantic association strength. Specifically, they perform gradient analysis on task-fMRI informational connectivity data and reveal a similar principal gradient to the previous study, which captures the separation of heteromodal memory regions from the unimodal cortex. More importantly, by comparing the gradient generated with data from different experimental conditions (i.e., strong vs. weak association), the authors find the separation of the regions at the two ends of the gradient can be regulated by the association strength, with more separation for stronger association. They also examine the relationships between the gradient values and dimensionality and brain-semantic alignment measures, to explore the nature of this shifting gradient as well as the corresponding brain areas.

      Strengths:

      1. The aim of this study is clear and the relevant background literature is covered at an appropriate level of detail. With the cortical gradient analysis approach, this study has the potential to make a contribution to the understanding of the topographical neural basis of semantics in a fine-grained manner.

      2. The methodology in the current study is novel. This study validates the feasibility of performing gradient analysis on task-fMRI data, which is enlightening for future research. Using the number of PCs generated by PCA as a measure of dimensionality is also an interesting approach.

      3. The authors have conducted multiple control analyses, which tested the validity of their results. Specifically, a control task without engaging semantic processing was built in the experimental design (i.e., the chevron task), and the authors conducted multiple parallel control analyses with the data from this control task as a comparison with their main results. Other control analyses were also performed to validate the robustness of their methodological choices. For example, varied thresholds were used during the calculation of dimensionality and similar results were obtained.
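      The dimensionality measure mentioned in point 2 can be illustrated with a minimal sketch: count how many principal components are needed before the cumulative explained variance reaches a chosen threshold. This is not the authors' implementation. The function name, the 0.9 default threshold, and the example variance spectra are hypothetical, and the per-component variances (eigenvalues) are assumed to have been computed already by a PCA.

```python
from itertools import accumulate


def n_components_for_variance(eigenvalues, threshold=0.9):
    """Smallest number of components whose cumulative explained
    variance reaches `threshold` (a fraction in (0, 1]).

    `eigenvalues` are per-component variances from a PCA that is
    assumed to have been run elsewhere (hypothetical input).
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive value")
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        # Small tolerance guards against floating-point rounding.
        if cumulative / total >= threshold - 1e-12:
            return k
    return len(ordered)


# Hypothetical spectra: variance concentrated in few components
# (lower dimensionality) versus spread across many (higher).
concentrated = [8.0, 1.0, 0.5, 0.3, 0.2]
spread = [2.5, 2.2, 2.0, 1.8, 1.5]

low_dim = n_components_for_variance(concentrated)
high_dim = n_components_for_variance(spread)
```

      Repeating the count over several thresholds, as in the control analysis described in point 3, only requires calling the function with different `threshold` values.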

      Weaknesses:

      1. As a major manipulation in the experiment, it is not very clear how the authors split/define their stimuli into strong and weak semantic association conditions. If I understood correctly, word2vec was used to measure the association strength in each pair of words. Then the authors grouped the top 1/3 association strength trials as a "strong association" condition and the bottom 1/3 as "weak association" (Line 689), and all analyses comparing the effect of "strong vs. weak association" were conducted with data from these two subsets of stimuli. However, in multiple places, the authors indicate the association strength of their stimuli ranges from completely unrelated to weakly related to highly related (Line 612, Line 147, Line 690, and the examples in Figure 1B). This makes me wonder if the trials with bottom 1/3 association strength (i.e., those that were used in the current study) are actually "unrelated/no association" trials (more like a baseline condition), instead of "weak association" trials as the authors claimed. These two situations could be different regarding how they engage semantic knowledge and control processing. Besides, I am very interested in what the authors will find if they compare all three conditions (i.e., unrelated vs. weak association vs. strong association).

      Thank you very much for bringing up this point. We have conducted an additional analysis of the intermediate bin, comparing it against the bottom 1/3 for the gradient analysis and against the top 1/3 for the dimensionality analysis (i.e., against the baseline condition for each analysis). This showed a pattern similar to the contrast between strong and weak association but with a smaller effect, thus representing an intermediate profile, as expected. The correlation between the principal gradient difference (middle vs. weak association) and the principal gradient value derived from resting state was also significant (see Figure S10C), but its magnitude was smaller than that reported in the main body of the manuscript (r = 0.235 vs. r = 0.369). Given that the strongest effect is expected between the top and bottom 1/3, we have included these results in the supplementary materials. Please see Figure S10 on page 7.

      1. Following the previous point, because the comparison between weak vs. strong association conditions is the key of the current study, I feel it might be better to introduce more about the stimuli in these two conditions. Specifically, the authors only suggested the word pairs fell in these two conditions varied in their association strength, but how about other psycholinguistic properties that could potentially confound their manipulation? For example, words with higher frequency and concreteness may engage more automatic/richer long-term semantic information and words with lower frequency and concreteness need more semantic control. I feel there may be a possibility that the effect of semantic association was partly driven by the differences in these measures in different conditions.

      Thank you for raising this point. Following the reviewer’s suggestion, we have performed an additional control analysis to examine the relationship between association strength and psycholinguistic features. Association strength did not show a significant correlation with word frequency (r = -0.010, p = 0.392), concreteness (r = -0.092, p = 0.285) or imageability (r = 0.074, p = 0.377). Direct comparison of these psycholinguistic features between strongly and weakly associated word pairs also did not show any significant difference: frequency (t = 0.912, p = 0.364), concreteness (t = 1.576, p = 0.119), imageability (t = 1.451, p = 0.153). Please see page 32:

      ‘The association strength did not show significant correlation with word frequency (r = -0.010, p = 0.392), concreteness (r = -0.092, p = 0.285) or imageability (r = 0.074, p = 0.377).’

      1. The dimensionality analysis in the current study is novel and interesting. In this section, the authors linked decreasing dimensionality with more abstract and less variable representations. However, most results here were built based on the comparison between the dimensionality effects for strong and weak association conditions. I wonder if these conclusions can be generalised to results within each condition and across different regions (i.e., regions having lower dimensionality are doing more abstract and cross-modal processing). If so, I am curious why the ATL (a semantic "hub") in Figure 3A has higher dimensionality than the sensory-motor cortices (which are closely tied to experience) and AG (another semantic "hub").

      Dimensionality and its relationship to the cortical gradient were also examined for each condition. We assessed whether this relationship was influenced by associative strength, averaging dimensionality estimates for sets of four trials with similar word2vec values using a ‘sliding window’ approach. There was a negative correlation between overall dimensionality (averaged across all trials) and the principal gradient, and the magnitude of this negative relationship increased as a function of association strength. We therefore believe our conclusion generalizes across conditions. We did observe higher dimensionality in ATL/frontal orbital cortex than in sensory-motor cortices, which seems to contradict our conclusion. However, these areas are subject to severe distortion and signal loss in functional MRI, and their lower tSNR thus caused higher dimensionality estimates in PCA. We therefore conducted a control analysis in which regions in the limbic network were removed because of their low tSNR; the pattern remained significant (r = -0.346, p = 0.038).

      Please see the Discussion on page 30.

      ‘It is worth noting that not all brain regions showed the expected pattern in the dimensionality analysis – especially when considering the global dimensionality of all semantic trials, as opposed to the influence of strength of association in the semantic task. In particular, the limbic network, including regions of ventral ATL thought to support a heteromodal semantic hub, showed significantly higher dimensionality than sensory-motor areas – these higher-order regions are expected to show lower dimensionality corresponding to more abstract representations. However, this analysis does not assess the psychological significance of data dimensionality differences (unlike our contrast of strong and weak associations, which are more interpretable in terms of semantic cognition). Limbic regions are subject to severe distortion and signal loss in functional MRI, which might strongly influence this metric. Future studies using data acquisition and analysis techniques that are less susceptible to this problem are required to fully characterize global dimensionality and its relation to the principal gradient.’

      1. I am not sure about the meaning/representational content underlying the semantic similarity matrix in the semantic-brain alignment analysis. According to the authors, this matrix was built based on the correlation of participants' ratings of associative strength (0, no link; 1~4, weak to strong) across trials. The authors indicate that this matrix reflects the global similarity of semantic knowledge between participants (Line 403). However, even though two participants share very similar ratings of association strength across trials, they could still interpret the meaning/knowledge underlying the associations very differently. For example, one participant may interpret the link between "man" and "car" as a man owns a car but another participant may interpret it as a man is hit by a car, although both associations could be rated as strong for this trial. This situation may be even more obvious for those pairs with weak association. Therefore, I am not confident this is a measure of similarity of semantic knowledge.

      Thank you very much for bringing up this point. Our experimenter carefully evaluated the links generated for each trial in each participant and found that the weaker the association, the less consistent the links formed. We therefore agree with the reviewer that even when two participants share similar ratings of association strength, they could still interpret the word pairs very differently, especially for weakly associated trials. Although the retrieved content/meaning might differ (e.g., a man owns a car vs. a man is hit by a car), both scenarios are quite consistent, with no strong semantic conflict detected. We therefore argue that the semantic-brain alignment might reflect the similarity of neural states of retrieval rather than general semantic content. We have now updated this point in the manuscript. Please see page 20: ‘A semantic similarity matrix, based on the correlation of participants’ ratings of associative strength across trials (reflecting the global similarity of neural states of retrieval between participants; left-hand panel of Figure 4A), was positively associated with neural pattern similarity in inferior frontal gyrus, posterior middle temporal gyrus, right anterior temporal lobe, bilateral lateral and medial parietal cortex, pre-supplementary motor area, and middle and superior frontal cortex (right-hand panel of Figure 4A).’

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Vides et al. performed a functional analysis of the Parkinson's disease-associated leucine-rich repeat kinase 2 (LRRK2). In particular, the authors sought to address how membrane recruitment of LRRK2 leads to an increase in its kinase activity. Briefly, the authors showed that LRRK2 utilizes two distinct binding sites (350-550 #1, 17/18 #2) for Rab GTPases within its N-terminal Armadillo domain to achieve membrane association. Intriguingly, these two sites differ substantially in their preference for binding phosphorylated (Rab8a, Rab10) and non-phosphorylated (Rab8a, Rab10, Rab29, Rab32, Rab39) substrates. In cells, a LRRK2 site #2 mutant showed a significantly reduced colocalization with phosphorylated Rab10. Using LRRK2 inhibitor washout experiments, the authors demonstrate that disrupting site #2 led to slower re-phosphorylation kinetics. Lastly, the authors employed an elegant in vitro system to demonstrate that LRRK2 membrane association and Rab phosphorylation are coupled in a feed-forward reaction. Overall, the work of Vides and colleagues provide compelling mechanistic insights into the spatial regulation of LRRK2.

      Nevertheless, a few critical points remain.

      Major points:

      1) Since LRRK2 is reported to form dimers and multimers, the authors should perform their colocalization studies (Figure 6) in cells lacking endogenous LRRK2.

      Co-localization with wild type LRRK2 is not seen with the mutant in question, so dimerization/oligomerization with endogenous protein appears not to be an issue for this construct.

      2) To what extent does modification of K17 and/or K18 (e.g., acetylation or ubiquitylation) play a role in regulating LRRK2 pRab binding?

      Phosphosite indicates LRRK2 ubiquitylation at K1118, K1129, K1833, K1963, K2091, with none in the ARM domain. We have not looked at either acetylation or ubiquitylation directly but now mention that this could regulate interaction with pRabs.

      3) In their lipid bilayer-based in vitro assay, the authors should also examine the effect of an LRRK2 variant that lacks site #1.

      We have included the opposite mutant with similar impact on the model: we show that lack of pRab binding site at the N-terminus removes the cooperativity of the otherwise wild type protein.

      Reviewer #2 (Public Review):

      Vides and colleagues describe a novel feed-forward mechanism of LRRK2-mediated phosphorylation of Rab8a and Rab10. The work underlines the importance of the N-terminal armadillo domain in the binding of different Rabs. They further characterized the Rab29 binding epitope, which is involved in the membrane targeting of LRRK2 mediated by Rab29 (site #1). Beyond previous work, the authors could demonstrate that one point mutation (K499E) is sufficient to abolish Rab29 binding. Furthermore, they could show that this binding site also binds the substrate Rabs Rab8a and Rab10. In addition to this binding site (#1), the authors identified one additional site (site #2) particularly involved in the specific binding of Rab8a and Rab10 but not of Rab29 nor the non-LRRK2 substrate Rab7, providing an explanation for the LRRK2 substrate specificity observed in vivo. While the Rab29 binding site binds nonphosphorylated Rabs, the newly identified site around the N-terminal Lysine 18 shows increased binding to phosphorylated Rab and provides support for a feed-forward mechanism in the substrate phosphorylation.

      The authors provide a sound biochemical characterization of critical steps of LRRK2 activation, which is of broad interest to the field. Beyond scientific interest, a well-characterized activation mechanism might guide future drug development strategies.

      We thank the reviewer for noting that we should document the identity of the bound nucleotide. Rab8 and Rab10 are not the easiest to work with: it is much harder than for other Rabs to retain full nucleotide exchange capacity, and preps show at best 50% active molecules in terms of ability to exchange nucleotide. We maintain Mg-GTP throughout all purification steps and assays and use Q mutants in vitro to stabilize GTP binding. Even so, we have now monitored the nucleotide state of purified Rabs by mass spec and found that our routine preps of Rab8A-Q and Rab10-Q each show a 50:50 ratio of bound GTP to GDP. We have noted this caveat in the text: our work will underestimate affinities, since GTP-bound forms likely predominate in these interactions.

      Major concerns:

      • The nucleotide states of the different Rabs (after nucleotide exchange), need to be experimentally confirmed, i.e. by HPLC.

      • It is not always clear, which Rab variants (i.e. WT or Q63L) have been used for a particular experiment (information provided in the main text vs material and methods). While irrelevant for in vitro experiments, for studies in cells it should be considered that the use of Rab Q63L constructs (Q60L in Ras), does not necessarily imply that the GAP catalyzed GTP hydrolysis is completely abolished. In contrast to Ras GAPs, some RAB GAPs can provide the water-coordinating glutamine residue, critical for hydrolysis (see: Müller and Goody, 2018; PMID: 28055292).

      All studies within cells were done with endogenous Rab GTPases (WT). We have also clarified the text throughout as to which Rab form is used.

      Reviewer #3 (Public Review):

      Vides et al. present new insights into the interactions between LRRK2 and Rab GTPases. They identified two distinct Rab-binding sites in the N-terminal Armadillo (ARM) domain of LRRK2, which they named Site #1 and Site #2. One of the main findings is the striking effect of Rab GTPase phosphorylation on LRRK2's recruitment to and activation on membranes; both unmodified and phosphorylated Rabs (pRab) bind to the N-terminus of LRRK2, but to different regions. Site #1, located closer to the C-terminus of the ARM domain, binds unmodified Rab8A, Rab10, and Rab29, with Rab29 showing the highest affinity. Site #2, located at the extreme N-terminus of LRRK2, binds to the modified pRab8A and pRab10. Combining structure prediction and conservation analysis they identified the potential interaction interfaces of Site #1 and Site #2, including two conserved lysine residues (K17 and K18) in Site #2 that are critical for pRab binding. The authors propose a model where initial membrane association is mediated by binding unphosphorylated Rab8A, 10, or 29 to the lower-affinity Site #1. Membrane-associated LRRK2 then phosphorylates one of its substrates, which can now engage the higher-affinity Site #2, starting a cascade of phosphorylation events (the feed-forward mechanism).

      Overall, the authors present clear and convincing data showing the interaction between LRRK2's N-terminal ARM domain and Rab/pRab, and supporting their feed-forward mechanism. The main shortcoming in the manuscript is the absence of data directly addressing two important features of their feed-forward model: (1) The proposal that the increased activity of LRRK2 upon recruitment to membranes is only the result of its increased local concentration (without any contributions from a potential Rab-dependent activation); and (2) The ability of LRRK2 to simultaneously bind Rab and pRab. Despite this shortcoming, this manuscript presents an important contribution to our understanding of LRRK2 function, providing an elegant model for LRRK2's recruitment to and activation on membranes. This paper will be of much interest to a broad readership.

      We have fully addressed the “shortcoming”: we now demonstrate that phosphoRab10 can bind LRRK2 Armadillo domain simultaneously with Rab8 and also that pRab8 can activate kinase activity on Rab10. We thank the reviewer for these terrific suggestions.

    1. Author Response

      Reviewer #2 (Public Review):

      This study evaluates the causal relationship between childhood obesity on the one hand, and childhood emotional and behavioral problems on the other. It applies Mendelian Randomization (MR), a family of methods in statistical genetics that uses genetic markers to break the symmetry between correlated traits, allowing inference of causation rather than mere correlation. The authors argue convincingly that previous studies of these traits, both those using non-genetic observational epidemiology methods and those using standard MR methods, may be confounded by demographic effects and familial effects. One possible example of this kind of confounding is the idea that obesity in parents may contribute to emotional and behavioral problems in children; another is the idea that adults with emotional and behavioral issues may be more likely to have children with partners who are obese, and vice-versa. They then make use of a recently proposed "within-family" MR method, which should effectively control for these confounders, at the cost of higher uncertainty in the estimated effect size, and therefore lower power to detect small effects. They report that none of the previously reported associations of childhood BMI with anxiety, depression, or ADHD are replicated using the within-family MR method, and that in the case of depression the primary association appears to be with maternal BMI rather than the child's own BMI.

      This argument that these confounders may affect these phenotypes is fairly sound, and within-family MR should indeed do a good job of controlling for them. I do not see any major issues with the cohort itself or the choice of genetic instruments. I also do not see any major issues with the definitions or ascertainment of the phenotypes studied, though I am not an expert on any of these phenotypes in particular. I am especially satisfied with the series of analyses demonstrating that the results are robust to many variations of MR methodology. Overall, I think the positive result this study reports is very credible: that the known association between childhood BMI and depression is likely primarily due to an effect of maternal BMI rather than the child's own BMI (though given that paternal BMI has a similar effect size with only a slightly wider confidence interval, I would instead say that the effect is from parental BMI generally, not specifically maternal.)

      In the updated results based on the larger genetic data release, the estimates for the association of maternal BMI and paternal BMI with the child’s depressive symptoms are more clearly different than they were in the smaller dataset (for maternal BMI, beta= 0.11, CI:0.02,0.19, p=0.01; for paternal BMI, beta=0.02, CI:-0.09,0.12, p=0.71). Therefore, in this version, it makes sense to note an association with maternal BMI specifically.

      The main weakness of the study comes from its negative results, which the authors emphasize as their primary conclusion: that previously reported associations of childhood BMI with anxiety, depression, and ADHD are not replicated using within-family MR methods. These claims do not seem justified by the evidence presented in this study. In fact, in every panel of figures 2 and 3, the error bars for the within-family MR analysis encompass the estimates for both the regression analysis and the traditional MR analysis, suggesting that the within-family analysis provides no evidence one way or another about which of these analyses is more accurate. More generally, in order to convincingly claim that there is no causal relationship between two traits, an MR study must argue that the study would be powered to detect a relationship if one existed. Within-family MR methods are known to have less power to detect associations and less precision to estimate effect sizes than traditional MR methods or traditional observational epidemiology methods, so it is not sufficient to show that these other methods have power to detect the association. To make this kind of claim, it is necessary to include some kind of power analysis, such as a simulation study or analytic power calculations, and likely also a positive control to show that this method does have power to detect known effects in this cohort.

      We agree that it is imperative that negative (i.e. “non-significant”) results are correctly interpreted - it is just as important to discover what is unlikely to affect emotional and behavioural outcomes as what does affect them. Negative results (non-significant estimates) are neither a weakness nor strength of the study, but simply reflect the estimation error in our analysis of the data. The key question is whether our within-family MR estimates are sufficiently powered to detect effect sizes of interest or rule out clinically meaningful effect sizes – or are they simply too imprecise to draw any conclusions? As the reviewer suggests, one way to address this is via a post-hoc power calculation. We consider post-hoc power calculations redundant, since all the information about the power of our analysis is reflected in the standard errors and reported confidence intervals. Moreover, any post-hoc power calculation will be necessarily approximate compared to using the standard errors and confidence intervals which we report.

      Despite these methodological reservations, we have conducted simulations to estimate the power of our within-family models (the R code is included at the end of this document). These simulations indicate that we do have sufficient power to detect the size of effects seen for depressive symptoms and ADHD in models using the adult BMI PGS. They also indicate that we cannot rule out smaller effects for non-significant associations (e.g., for the impact of the child’s BMI on anxiety). Naturally, this is entirely consistent with the width of the confidence intervals reported in results tables and in Figures 1 and 2. However, although power calculations are important when planning a study, they make little contribution to interpretation once a study has been conducted and confidence intervals are available (e.g., https://psyarxiv.com/tcqrn/). For this reason, we comment on these simulations in this response to reviewers but do not include them in the manuscript or supplementary materials. At the same time, we have changed the language used in the manuscript to be clearer that the results were imprecise and that values contained within the confidence limits cannot be ruled out.
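
      For readers who want a concrete picture of what such a power simulation can look like, the following is a minimal sketch in Python. It is not the authors' R code, and every quantity in it is an assumption chosen for illustration: the function name, the sample size, the PGS effect on BMI, the confounding structure, and the use of a basic two-stage least squares estimator in which the child's PGS instruments the child's BMI while both parental PGS are included as covariates.

```python
import numpy as np


def within_family_power(beta, n_trios=2000, pgs_effect=0.25, n_sims=100, seed=0):
    """Share of simulated datasets in which a two-sided 5% test rejects beta = 0.

    All parameter values are hypothetical and only illustrate the approach.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        pgs_mother = rng.standard_normal(n_trios)
        pgs_father = rng.standard_normal(n_trios)
        # Child PGS = parental average plus random segregation (variance 1 overall).
        pgs_child = (0.5 * (pgs_mother + pgs_father)
                     + np.sqrt(0.5) * rng.standard_normal(n_trios))
        # An unobserved confounder affects both BMI and the outcome.
        confounder = rng.standard_normal(n_trios)
        bmi = pgs_effect * pgs_child + 0.5 * confounder + rng.standard_normal(n_trios)
        outcome = beta * bmi + 0.5 * confounder + rng.standard_normal(n_trios)

        ones = np.ones(n_trios)
        instruments = np.column_stack([ones, pgs_child, pgs_mother, pgs_father])
        regressors = np.column_stack([ones, bmi, pgs_mother, pgs_father])
        # Stage 1: project the regressors onto the instruments.
        fitted = instruments @ np.linalg.lstsq(instruments, regressors, rcond=None)[0]
        # Stage 2: regress the outcome on the fitted regressors.
        coef = np.linalg.lstsq(fitted, outcome, rcond=None)[0]
        # 2SLS standard errors use residuals computed from the observed regressors.
        resid = outcome - regressors @ coef
        sigma2 = resid @ resid / (n_trios - regressors.shape[1])
        cov = sigma2 * np.linalg.inv(fitted.T @ fitted)
        z = coef[1] / np.sqrt(cov[1, 1])
        rejections += int(abs(z) > 1.96)
    return rejections / n_sims


# Estimated power across a few hypothetical true effect sizes.
for beta in (0.0, 0.2, 0.5):
    print(beta, within_family_power(beta))
```

      With a true effect of zero the rejection rate should sit near the nominal 5%, and it should rise towards 1 as the true effect grows; the effect size at which it crosses 80% is the kind of quantity the simulations described above report.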

      For example, the discussion now includes the following:

      ‘However, within-family MR estimates using the childhood body size PGS are still consistent with small effects of the child’s BMI on all outcomes, with upper confidence limits around a 0.2 standard-deviation increase in the outcome per 5kg/m2 increase in BMI.’

      And the conclusion of the paper now reads:

      ‘Our results suggest that genetic variation associated with BMI in adulthood affects a child’s depressive and ADHD symptoms, but genetic variation associated with recalled childhood body size does not substantially affect these outcomes. There was little evidence that BMI affects anxiety. However, our estimates were imprecise, and these differences may be due to estimation error. There was little evidence that parental BMI affects a child’s ADHD or anxiety symptoms, but factors associated with maternal BMI may independently influence a child’s depressive symptoms. Genetic studies using unrelated individuals, or polygenic scores for adult BMI, may have overestimated the causal effects of a child’s own BMI.’

      Regarding a positive control: for analyses of BMI in adults, suitable positive controls would include directly measured biomarkers such as fat mass or blood pressure or reported medical outcomes like type 2 diabetes. In adolescents and younger adults, age at menarche or other measures of puberty can be used, as these are reliably influenced by BMI. However, the age of the participants for whom within-family effects are being estimated (8 years), together with the lack of any biomarkers such as fat mass (due to the questionnaire-based survey design) mean no suitable measures are available.

      Reviewer #3 (Public Review):

      Higher BMI in childhood is correlated with behavioral problems (e.g. depression and ADHD) and some studies have shown that this relationship may be causal using Mendelian Randomization (MR). However, traditional MR is susceptible to bias due to population stratification, assortative mating, and indirect effects (dynastic effects). To address this issue, Hughes et al. use within-family MR, which should be immune to the above-listed problems. They were unable to find a causal relationship between children's BMI and depression, anxiety, or ADHD. They do, however, report a causal effect of mother's BMI on depression in their children. They conclude that the causal effect of children's BMI on behavioral phenotypes such as depression and anxiety, if present, is very small, and may have been overestimated in previous studies. The analyses have been carried out carefully in a large sample and the paper is presented clearly. Overall, their assertions are justified but given that the conclusions mostly rest on an absence of an effect, I would like to see more discussion on statistical power.

      1) The authors show that the estimates of within-family MR are imprecise. It would be helpful to know how much power they have for estimating effect sizes reported previously given their sample size.

      As discussed in response to a comment from reviewer 2, the power of our results is already indicated by our standard errors and confidence intervals, and we consider post-hoc power calculations redundant when these are reported. Nevertheless, we conducted simulations to estimate the size of effects which we had 80% power to detect. Results, presented below, are consistent with our main results. For the reason given above, we include this information in the response to reviewers but not in the manuscript itself.
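
      The link between a reported confidence interval and the effect size detectable with 80% power can be made explicit with a short calculation. The sketch below assumes a normally distributed estimator and a two-sided 5% test; the function names are hypothetical, and the interval used is the paternal BMI estimate quoted earlier in this response (beta = 0.02, CI: -0.09, 0.12), chosen purely as an example.

```python
from statistics import NormalDist


def se_from_ci(lower, upper, level=0.95):
    """Back out the standard error from a symmetric normal-theory interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)


def minimum_detectable_effect(se, alpha=0.05, power=0.80):
    """Smallest true effect detected with the given power in a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * se


se = se_from_ci(-0.09, 0.12)
mde = minimum_detectable_effect(se)
print(round(se, 3), round(mde, 3))
```

      Under these assumptions the detectable effect is about 2.8 standard errors, which for this interval is roughly 0.15. This is an approximation, not a substitute for the simulations.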

      2) They used the correlation between PGS and BMI to support the assertion that the former is a strong instrument. Were the reported correlations calculated across all individuals? Since we know that stratification, assortative mating, and indirect effects can inflate these correlations, perhaps a more unbiased estimate would be the proportion of children's BMI variance explained by their PGS conditioned on the parents' PGS. This should also be the estimate used in power calculations.

      The manuscript has been updated to quote Sanderson-Windmeijer conditional R2 values: the proportion of BMI variance explained by the BMI PGS for each member of a trio, conditional on the PGS of the other members of the trio, and all genetic covariates included in within-family models. Similarly, we now show Sanderson-Windmeijer conditional F-statistics for a model including the child, mother, and father’s BMI instrumented by the child, mother, and father’s PGS.

      3) In testing the association of mothers' and fathers' BMI with children's symptoms, the authors used a multivariable linear regression conditioning on the child's own BMI. Was the other parent's BMI (either by itself or using the polygenic score) included as a covariate in the multivariable and MR models? This was not entirely clear from the text or from Fig. 2. I suspect that if there were assortative mating on BMI in the parent's generation, the effect of any one parent's BMI on the child's symptoms might be inflated unless the other parent's BMI was included as a covariate (assuming both mother's and father's BMI affect the child's symptoms).

      Non-genetic models include both the mother and father’s phenotypic BMI as well as the child’s, allowing estimation of conditional effects of all three. This controls for assortative mating as noted by the reviewer. This was not previously clear - all relevant text and figure captions have been updated to clarify this.

      4) They report no evidence of cross-trait assortative mating in the parents' generation. The power to detect cross-trait assortative mating in the parents' generation using PGS would depend on the actual strength of assortative mating and the respective proportions of trait variance explained by PGS. Could the authors provide an estimate of the power for this test in their sample?

      We have updated the discussion of assortative mating (in both the results and the discussion section) to note possible limitations of power and clarify that this approach to examining assortment may not capture its full extent.

      The relevant part of the results section now reads:

      “In the parents’ generation, phenotypes were associated within parental pairs, consistent with assortative mating on these traits (Appendix 1 – Table 5). Adjusted for ancestry and other genetic covariates, maternal and paternal BMI were positively associated (beta: 0.23, 95%CI: 0.22,0.25, p<0.001), as were maternal and paternal depressive symptoms (beta: 0.18, 95%CI: 0.16,0.20, p<0.001), and maternal and paternal ADHD symptoms (beta: 0.11, 95%CI: 0.09,0.13, p<0.001). Consistent with cross-trait assortative mating, there was an association of mother’s BMI with father’s ADHD symptoms (beta: 0.03, 95%CI: 0.02,0.05, p<0.001) and mother’s ADHD symptoms with father’s depressive symptoms (beta: 0.05,95%CI: 0.05,0.06, p<0.001). Phenotypic associations can reflect the influence of one partner on another as well as selection into partnerships, but regression models of paternal polygenic scores on maternal polygenic scores also pointed to a degree of assortative mating. Adjusted for ancestry and genotyping covariates, there were small associations between parents’ BMI polygenic scores (beta: 0.01, 95%CI: 0.00,0.02, p=0.02 for the adult BMI PGS, and beta: 0.01, 95%CI: 0.00,0.02, p=0.008 for the childhood body size PGS), and of the mother’s childhood body size PGS with the father’s ADHD PGS (beta: 0.01, 95%CI: 0.00,0.02, p=0.03). We did not detect associations with pairs of other polygenic scores, which may be due to insufficient statistical power.”

      And the relevant part of the discussion section now reads:

      “We found some genomic evidence of assortative mating for BMI, and cross-trait assortative mating between BMI and ADHD, but not between other traits. However, associations between polygenic scores, which only capture some of the genetic variation associated with these phenotypes, may not capture the full extent of genetic assortment on these traits.”

      5) Are the actual phenotypes (BMI, depression or ADHD) correlated between the parents? If so, would this not suffice as evidence of cross-trait assortative mating? It is known that the genetic correlation between parents as a result of assortative mating is a function of the correlation in their phenotypes and the heritabilities underlying the two traits (e.g., see Yengo and Visscher 2018). An alternative way to estimate the genetic correlation between parents without using PGS (which is noisy and therefore underpowered) would be to use the phenotypic correlation and heritability estimated using GREML or LDSC. Perhaps this is outside the scope of the paper but I would like to hear the authors' thoughts on this.

      Associations between maternal and paternal phenotypes are consistent with a degree of assortative mating (shown below). These results have been added to Appendix 1 - Table 5, which also shows associations between maternal and paternal polygenic scores, and the methods and results have been updated accordingly (see quoted text in response to the comment above). For comparability, both sets of results are based on regression models adjusting for the mother’s and father’s ancestry PCs and genotyping covariates. We agree that analysis of assortative mating using GREML or LDSC is out of scope for this paper. As noted above, we have updated the discussion to acknowledge the limitations of the approach taken:

      ‘We found some genomic evidence of assortative mating for BMI, and cross-trait assortative mating between BMI and ADHD, but not between other traits. However, associations between polygenic scores, which only capture some of the genetic variation associated with these phenotypes, may not capture the full extent of genetic assortment on these traits.’
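
      The relationship raised in comment 5 can be sketched numerically. The snippet below is a minimal illustration, assuming a simple path model of direct phenotypic assortment (a single generation, ignoring equilibrium effects), in which the expected correlation between partners' genetic values is the phenotypic correlation scaled by the square root of the product of the two heritabilities. The function name, the heritability of 0.5 and the polygenic score R² of 0.05 are hypothetical placeholders and not estimates from the study; the spousal BMI beta of 0.23 is treated as if it were a correlation.

```python
def expected_mate_correlation(r_pheno, share_x, share_y=None):
    """Expected correlation between partners' genetic predictors.

    Assumes direct phenotypic assortment: the correlation passes from
    predictor -> phenotype in one partner, across the couple, and from
    phenotype -> predictor in the other partner.

    r_pheno : phenotypic correlation between partners
    share_x : share of phenotypic variance captured by the predictor for
              partner 1's trait (heritability, or PGS R^2)
    share_y : same for partner 2's trait (defaults to share_x, i.e. same trait)
    """
    if share_y is None:
        share_y = share_x
    for value in (share_x, share_y):
        if not 0.0 <= value <= 1.0:
            raise ValueError("variance shares must lie in [0, 1]")
    return r_pheno * (share_x * share_y) ** 0.5


# Reported spousal BMI association, treated here as a correlation.
r_bmi = 0.23

# Hypothetical heritability: expected correlation of full genetic values.
genetic_r = expected_mate_correlation(r_bmi, 0.5)

# Hypothetical PGS R^2: expected correlation between partners' polygenic scores.
pgs_r = expected_mate_correlation(r_bmi, 0.05)

print(f"expected genetic-value correlation: {genetic_r:.4f}")
print(f"expected PGS correlation:           {pgs_r:.4f}")
```

      Under these assumed inputs the expected polygenic score correlation is far smaller than the expected correlation of full genetic values, which is the sense in which polygenic score associations may understate genetic assortment.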

      6) It would be helpful to include power calculations for the MR-Egger intercept estimates.

      As with our response to the comments above, post-hoc power calculations are redundant, as all the information about the power of our analysis, including MR-Egger, is indicated by the standard errors and confidence intervals. MR-Egger is less precise than other estimators, as is made clear from the wide confidence intervals reported in the relevant tables (Appendix 1 - Tables 8 and 9). However, we have now updated the discussion to give more weight to this as a limitation. The discussion of pleiotropy in the final paragraph of the discussion now reads:

      ‘While robustness checks found little evidence of pleiotropy, these methods rely on assumptions. Moreover, MR-Egger is known to give imprecise estimates (Burgess and Thompson 2017), and confidence intervals from MR-Egger models were wide. Thus, pleiotropy cannot be ruled out.’

      Similarly, we have updated the relevant line of the results section, which now reads:

      ‘MR-Egger models found little evidence of horizontal pleiotropy, although MR-Egger estimates were imprecise (Appendix 1 - Tables 8 and 9).’

      7) Finally, what is the correlation between PGS and genetic PCs/geography in their sample? A correlation might provide evidence to support the point that classic MR effects are inflated due to stratification.

      Figures presenting the association of the child’s BMI polygenic scores and their PCs have been added to the supplementary information as Appendix 1 - Figure 2 and Appendix 1 - Figure 3. Consistent with an influence of residual stratification, a regression of the child’s BMI polygenic scores against their ancestry PCs (adjusting for genotyping centre and chip) found that 7 of the 20 PCs were associated at p<0.05 with the adult BMI PGS, and 8 of 20 with the childhood body size PGS (under the null hypothesis, we would expect one association in each case). When parental polygenic scores were added to the models, these associations attenuated towards the null.
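
      The expectation of one association under the null follows from 20 tests at a 5% threshold. As a rough illustration, the sketch below also computes how unlikely 7 or 8 such associations would be by chance, assuming the 20 tests are independent (an assumption made here for simplicity; it is not stated in the response). The function name is illustrative.

```python
from math import comb


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_pcs = 20    # ancestry principal components tested
alpha = 0.05  # significance threshold used for each test

# Expected number of PCs passing the threshold if none is truly associated.
expected_hits = n_pcs * alpha

# Chance of seeing at least 7 (adult BMI PGS) or 8 (childhood body size PGS)
# associations under the null, assuming independent tests.
p_seven_or_more = binomial_tail(n_pcs, 7, alpha)
p_eight_or_more = binomial_tail(n_pcs, 8, alpha)

print(f"expected associations under the null: {expected_hits:.1f}")
print(f"P(at least 7 of 20): {p_seven_or_more:.2e}")
print(f"P(at least 8 of 20): {p_eight_or_more:.2e}")
```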

    1. Author Response

      Reviewer #1 (Public Review):

      The manuscript shows that bone is resorbed during the early steps of limb regeneration in urodeles, and osteoclasts are required for this process. In case of impaired resorption, integration of newly-formed tissue with the original bone shaft is compromised. The manuscript further shows that wound epithelium is required for bone resorption and suggests that it induces osteoclastogenesis or migration of osteoclasts. Furthermore, the authors showed that the formation of novel skeletal elements is initiated while the resorption of the old one is still actively ongoing.

      The study is well designed, conclusions are relatively well supported, and data are presented in a clear way. Two new models of transgenic axolotls have been created. The strongest and most important finding is that partial bone resorption is required for tissue reintegration. My main concern is the novelty of this study, which is quite limited in my opinion.

      Specifically, resorption of bone stump during limb regeneration has been shown before in various model organisms.

      The role of osteoclasts in this process has not been well characterized in urodeles but has been shown during the regeneration of a mouse digit.

      It is reasonable to anticipate that similarly, osteoclasts are resorbing bone in salamanders, especially since this is the only cell type known for bone resorption.

      Thus, this observation, despite being nicely and thoroughly done, is of limited interest.

      The role of wound epithelium in bone histolysis is well demonstrated via skin flap experiments in this manuscript. However, upon skin flap surgery no limb regeneration occurs, implying wound epithelium is a key tissue triggering all the processes of limb regeneration. Accordingly, the absence of bone histolysis in such conditions can be secondary to the absence of any other part of the regenerative process, e.g., blastema formation, macrophage M1 to M2 transition, reinnervation, etc. The proposed link between wound epithelium and osteoclastogenesis (i.e., Sphk1, Ccl4, Mdka) is very superficial and very suggestive.

      No functional evidence was provided to confirm these connections. Finally, the authors showed that new bone formation occurs while resorption of the bone stump is still ongoing. This is a nice observation, but again, rather indirect as it is based on the dynamics of bone resorption and bone formation in different animals. Due to high variability among animals, direct evidence, like double staining for osteoclasts and blastema markers would address this point more precisely.

      We consider that our work provides evidence, for the first time, that skeletal resorption in early stages of regeneration has a durable impact by affecting tissue integration. We show that this process occurs in a short and conserved time, which provides a window of interest for comparative research with other models, and interventional therapies. To our knowledge, limb regeneration is studied mainly in amphibians, as they are the only established lab model with this ability. Some lizards, geckos and possibly iguanas, have been reported to regrow an appendage albeit lacking the regenerative fidelity amphibians have. In an established regeneration lab model, such as the axolotl, the study of regeneration-induced resorption has been scarce.

      During murine digit tip regeneration, osteoclasts are recruited to the amputation site and resorb the bone in a similar time frame as we show here in the axolotl. Ablating osteoclasts delays regeneration; however, no study has been conducted on the impact on tissue integration. Additionally, a key difference between mouse digit and adult axolotl limb regeneration is that the new skeletal elements are built in fundamentally different ways: direct ossification (bone on top of bone) in the mouse, versus endochondral ossification (cartilage on top of osteo-cartilage elements) in the axolotl limb. The tissue integration of the latter may present different challenges worth exploring to understand its regulation. What this work adds is a characterization of the temporal and cellular dynamics of regeneration-induced resorption, the interaction of osteoclasts with skeletal cells and, lastly, the impact on tissue integration.

      Based on previous studies in mammals, it is reasonable to anticipate the presence and role of osteoclasts in salamanders. However, the growing body of work in the field, as well as our own work in the axolotl, have shown that extrapolations of mammalian skeletal biology to other species come with their risks.

      We agree that the role of the wound epithelium (WE) in skeletal histolysis will require further and extensive work. The evidence shown here provides a glimpse of the complex response and crosstalk of the WE with the tissue underneath, and we hypothesize that this response is tailored to the tissue composition exposed during the injury.

      Finally, following the reviewer’s advice, we have conducted new experiments to prove the temporal connection between skeletal resorption and regeneration, showing that these processes occur simultaneously.

      Reviewer #3 (Public Review):

      This study outlines the role of osteoclast-mediated resorption in integrating the skeletal elements during limb regeneration, using axolotls that can regenerate the entire limb upon amputation. Using calcium-binding vital dyes (calcein and alizarin red), the authors first demonstrated that a large portion of amputated skeletal elements is resorbed prior to blastema formation. They further show that 1) inhibiting bone resorption by zoledronic acid impairs proper integration of the pre-existing and regenerating skeletal elements, 2) removing the wound epithelium using the full skin flap surgery inhibits bone resorption, and 3) bone resorption and blastema formation are correlated. The authors reached the major conclusion that bone resorption is essential for successful skeletal regeneration. Notably, this study applies a well-established and elegant axolotl limb regeneration model and transgenic reporter strains to reveal the potential roles of resorption in limb regeneration.

      Strengths:

      1. The authors utilized a well-established axolotl limb regeneration model and applied elegant vital mineral dyes and transgenic reporter lines for sequential in vivo imaging. The authors also provided quantitative assessment by examining multiple animals, particularly in the early sections, ensuring the rigor and the reproducibility of the study.

      2. The authors further performed important interventions that can impinge upon successful limb regeneration, including inhibition of bone resorption by zoledronic acid and impairment of the wound epithelium by full skin flap surgery. These procedures gave rise to useful insights into the relationship between bone resorption and successful limb regeneration.

      3. The imaging presented in this manuscript is of exceptionally high quality.

      Weaknesses:

      1. Despite the high quality of the work, many analyses in this study are incomplete, making it insufficient to support the major conclusion. For example, in Figure 4, the authors did not provide any quantitative assessment to show how zol affects the integration of the skeletal elements (angulation?), which seems to be essential for supporting the conclusion. Likewise in Figure 7, the analyses of EdU+ cells and Sox9 reporter expression were not included in zol-treated animals. Similarly in Figure 5, quantification of osteoclasts was not performed with the full skin flap surgery group. Analyses of only normally regenerated animals are not sufficient to support many of the conclusions.

      2. The phenotype of zol-treated animals in limb regeneration is somewhat disappointing. Although zol-treated animals show decreased blastema formation and unresorbed pre-existing skeletal elements, limb regeneration still occurs and the only phenotype is a relatively minor defect in skeletal integration. It is possible that zol-induced defect in blastema formation is not directly linked to the failure of integration at a later stage. I find this “weakness” a bit subjective.

      3. As an integration failure of the newly formed skeleton still occurs in untreated animals, it is not entirely clear how the authors can attribute this defect to a lack of bone resorption. More quantitative analyses would be necessary to demonstrate the correlation between zol treatment and lack of integration.

      Taking into consideration the reviewer’s concerns, we have improved our analysis of the integration phenotype. Integration success was assessed using a score matrix, which allowed us to correlate the extent of resorption with integration efficiency more accurately. We believe our results provide sufficient evidence to support this correlation.

      When we first saw the phenotype of zol-treated animals, we were far from disappointed; we were actually intrigued that we could observe a significant failure in tissue integration after removing the function of osteoclasts in an early phase of regeneration. All-or-nothing results are exciting, but subtle results can prove more informative, and we think this is the case here. Our treatment does not inhibit regeneration but disrupts tissue integration, opening another fascinating aspect of regeneration: how is old tissue capable of functionally integrating newly formed tissue?

      The integration phenotypes observed in the un-resorbed limbs do not resemble anything reported in the field so far. Moreover, the range of phenotypes observed led us to better determine their correlation with resorption. Importantly, the presence of integration failures in untreated animals allowed us to look into ECM organization at this old-new tissue interface, while highlighting the normal occurrence of imperfect regeneration in the axolotl limb.

      Finally, we have included new results to complement the conclusions presented at the end of our work. Although we observed differences in blastema size in zol-treated animals, we did not observe a difference in the number of EdU+ cells, which reveals that the skeleton cannot be used as a reference for assessing blastema location. This conclusion is complemented by our in vivo assays, in which we observed condensation of cartilage despite resorption still occurring. We consider our conclusions to be justified and supported by the assays presented in our work.

    1. Author Response

      Reviewer #1 (Public Review):

      Khan et al describe how two important transcription factors functionally cooperate to activate a few of the CRP-dependent genes in Mycobacterium tuberculosis. CRP is a global regulator in eubacteria needed to activate a number of genes, while PhoP is an acid stress response regulator required for expression of a specific set of genes. The authors delineate the interaction between these two key regulators of the bacterial pathogen and show that in a subset of CRP-dependent promoters, PhoP binding recruits CRP to activate transcription.

      The experiments are well designed and executed, with a coherent presentation of the manuscript. While the data are well organized and presented with clean phosphorimages and blots that are easy to understand, the interpretation could have been more robust (see comments below).

      We thank the reviewer for these extremely encouraging comments. We have now included substantial changes throughout the ‘Results’ section to improve interpretation of the results (please see below our responses).

      Obviously, the strength of the paper is the description of hitherto unknown stress-specific cooperation between two well-studied transcription factors, with most evidence supporting the claims. In E. coli (and in other bacteria), studies of CRP-mediated control of genes have led to the identification of different classes of CRP-dependent promoters with their own specific regulators. Such a description was lacking in M. tuberculosis, and the PhoP - CRP collaboration described is likely to have implications for pathogenesis. The weakness (or possibly what remains to be explored) is that the precise mechanism of the cooperative transcription regulation is yet to be understood.

      We agree with the reviewer’s comment that the precise mechanism of cooperative transcription regulation is yet to be fully understood. While we briefly mention it as the future scope of work in the concluding part of the ‘Discussion’ section, we have now included a new paragraph on the schematic model summarizing a possible mechanism of cooperative transcription regulation.

      From the data presented it is apparent that PhoP binds to the whiB1 promoter efficiently on its own. It is also evident that CRP is recruited to its site as a result of PhoP binding. This is reminiscent of the bacteriophage Lambda paradigm of positive cooperativity. Thus, it is not reciprocal synergy (as stated in the paper in one place). It is PhoP-mediated recruitment, as claimed elsewhere. Indeed, PhoP null mutants nicely support the latter interpretation.

      The reviewer raises an important and interesting point on positive cooperativity resembling the bacteriophage lambda paradigm. We agree. We have now modified the text of the ‘Results’ section to establish clarity on this matter.

      A discussion on why and how CRP binds on its own in other CRP-dependent promoters would help better appreciate the need for PhoP sites next to CRP sites for their cooperative interaction in these promoter subsets. CRP sites could be at a varied distance with respect to the promoter as seen in E. coli.

      Again, this is an interesting point. We thank the reviewer for bringing this point to our attention. As recommended by the reviewer, we have now included the following text in the ‘Discussion’ section of the revised manuscript.

      “Notably, the subset of genes which undergo differential expression in Δcrp-H37Rv conforms to a pattern largely resembling the canonical CRP regulon of E. coli, with CRP binding sites either proximal to transcription start sites, leading to repression, or distal to transcription start sites, leading to promoter activation (Kahramanoglou et al., 2014). It is noteworthy that CRP has been suggested to function as a general chromosomal organizer (Grainger et al., 2005). In this study, we uncover that, strikingly, PhoP binding sites are present next to CRP binding sites located only distal upstream of promoters and are, therefore, associated with activation. We propose that in the case of these co-regulated promoters, the additional stability of the transcription initiation complex is derived from protein-protein interaction between CRP and PhoP. These two interacting proteins remain bound to their cognate sites away from the start site and contribute to the stability of the transcription initiation complex, providing access for mycobacterial RNA polymerase (RNAP) to bind and transcribe genes. A schematic model is shown in Fig. 6C. Together, these molecular events mitigate stress by controlling expression of numerous genes and perhaps contribute to better survival of the bacilli in cellular and animal models.”

      Reviewer #2 (Public Review):

      In this manuscript by Khan et al., the authors set out to characterize how the cAMP receptor protein, CRP, and PhoP function to coregulate a subset of virulence genes in Mycobacterium tuberculosis. To this end, the authors use a wide variety of molecular techniques to monitor gene regulation, DNA-binding activity, and protein-protein interactions between phosphorylated PhoP and CRP. The authors conclude that phosphorylated PhoP functions to recruit CRP to promoter regions, where together the two regulators function synergistically to control gene expression. In general, the conclusions of the manuscript appear to be justified by the data, however, the text is difficult to follow. The current version of the paper is likely of interest to scientists within the field of mycobacterial signal transduction.

      The major strength of the paper is that the authors test their hypothesis using a variety of complementary approaches. The authors demonstrate a genetic interaction between CRP and PhoP in vivo and reconstitute the phenomenon in vitro, providing compelling evidence that the coregulation by these well-studied regulators does take place. The major weakness is that the logic of the manuscript is difficult to follow as a reader, at times making an evaluation of results and interpretations difficult. The majority of the experimentation involves the whiB1 promoter while conclusions are extrapolated broadly.

      We would like to thank the reviewer for her/his constructive comments and suggestions. In the revised manuscript, we have now included numerous changes throughout the ‘Results’ and ‘Discussion’ sections to improve logic of the manuscript and interpretation of the results (please see below our responses). Also, we have included experiments as requested by the reviewers and provided additional data and explanations that address their concerns.

    1. Author Response

      Reviewer #3 (Public Review):

      1) Information is missing about the regions of interest in which calcium responses were measured. Judging from Fig. 1E, calcium signals were measured in the somata, and this should be specified. Also judging from this figure, calcium signals seem to be largely confined to the somata and virtually absent from dendritic arbors. Fig. 6a shows very faint signals in the dendrites, yet those signals seem to have been measured rather far from the point of force application (a scale bar is shown but undefined) and, for some unknown reason, not between the soma and the force application point. Should there be detectable calcium signals in the somata, respective image gains should be adjusted so that those signals can be appreciated by the reader. If there are no clear signals in the dendrites, this would affect interpretations concerning e.g. Ca-α1D.

      Calcium responses can be observed in the soma and dendrites, which was presented in the original manuscript (Figure 6). Inspired by the 2nd suggestion from this reviewer, we went through our data and refined our measurement of the dendritic signal in the revised manuscript (see revised Figure 6). In addition, we also showed that the dendritic response was dependent on Ca-α1D (see revised Figure 6 and Figure 6-figure supplement 1). Finally, in the revised manuscript, we made it clear that all F/F0 were measured from the soma unless otherwise stated (see Figure 2, legend).

      2) Along this line, analyzing also the spatial distribution of dendritic calcium responses to the pokes would provide a much more detailed picture about how the dendritic tree responds to the various pokes. The beauty of the imaging approach chosen here is that it provides such information. Rather than ignoring this possibility, it should be exploited in this study, especially as respective data might provide much deeper insights into the relation between the mechanosensory function of the cell and its dendritic tree (and bolster the modelling results in Fig. 4 experimentally).

      In the original manuscript, we included the data on the dendritic calcium signal and showed that the dendritic signal was reduced when the activity of VGCCs was inhibited or in the Ca-α1D knockdown mutant (see Fig. 6 A-B in the original manuscript). Inspired by the suggestion from the reviewers, we had a closer look at our data and performed additional experiments. In the revised Figure 6 A-B, we showed that the mechanical stimuli could evoke calcium responses not only in the soma, but also in the homolateral (i.e. between the soma and the force probe) and contralateral (i.e. opposite side of the force probe) dendrites, suggesting that the dendritic signals are propagating within the dendritic arbors. Moreover, in the revised Figure 6 A-B and Figure 6-figure supplement 1, we showed that these dendritic signals were reduced in the mutant strains of Ca-α1D or if the fillet preparation was treated with nimodipine, demonstrating a clear dependence on the activity of VGCCs. However, because our imaging speed is not fast enough to capture the dendritic flow of calcium signals, the dynamics of signal propagation remain undefined. This would be an interesting issue to study in the future. Along with the revised Figure 6, we also revised the text and legends accordingly.

      3) When showing response functions as in e.g. Figs. 2C, G, H, 3D, 5C-E, etc., the y-axis should have a logarithmic scaling; receptor potentials of receptor cells usually scale proportionally to the logarithm of the stimulus amplitude. Only then, the reader will be able to fully appreciate the sensitivity differences. This will also alter interpretation of response function slopes.

      We thank the reviewer for the suggestion. However, the stimulation force is actually a distal stimulus for the cell, while the proximal stimuli (e.g. local deformation) are difficult to measure/estimate. Therefore, we are not sure whether the cellular responses necessarily scale with the logarithm of the macroscopic forces (i.e. the distal stimuli). Simply by looking at the data, we found that the response is proportional to the force, and thus, for conciseness, we fitted the plot using a linear function.

      4) The knockdown and mutant data are interesting, yet important controls are missing. For the RNAi lines used, qPCR data on the knockdown efficiency should be added. For the channel mutations, available genetic rescue lines should be used as controls. Data on protein localization are presented for the mechanosensitive channels, but not for the voltage-gated calcium channel subunit. Should antibodies be available, respective stainings should be included. If not, the authors should at least check whether Ca-α1D is expressed in the cell using e.g. Mi{ET1}Ca-α1D[MB06807] that is available at Bloomington.

      First, we did not use an RNAi line for Piezo. The PiezoKO line is a genomic mutant strain.

      Second, for Ca-α1D, because there are only a small number of c4da in each animal and Ca-α1D has a quite broad expression in various types of neurons (see our revised Figure 6-figure supplement 2), we expected that the reduction in the expression level of Ca-α1D in c4da would be very difficult to detect. Therefore, we knocked down the expression of Ca-α1D in the whole animal using the same uas-Ca-α1Di strain and the tub-gal4 strain. Using RT-PCR, we showed that the expression level of Ca-α1D was significantly reduced (revised Figure 6-figure supplement 2). In fact, the same RNAi strain was also used in other functional studies.

      5) The statistics used are not entirely convincing. T-tests are used throughout, though I do not feel that all the data are distributed normally. Moreover, some figures include multiple comparisons, apparently without statistical correction. The data should be re-analyzed using appropriate statistical procedures.

      We thank the reviewer for this suggestion. We have now used the Mann-Whitney U test or the Kruskal-Wallis test for all the data that were not proven to follow a normal distribution. For multiple comparisons, we used one-way ANOVA. We have now included the relevant information in the revised figure legends.
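
      The reviewer's request for a correction when several comparisons are made can be illustrated with a Holm step-down adjustment. This is only a sketch of one common correction; the response does not state that this procedure was used, and the p-values below are hypothetical placeholders rather than results from the study.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    adjusted values are capped at 1 and forced to be non-decreasing along
    the sorted order so that the adjustment stays monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
for p_raw, p_adj in zip(raw, holm_adjust(raw)):
    print(f"raw p = {p_raw:.3f} -> Holm-adjusted p = {p_adj:.3f}")
```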

    1. Author Response

      Reviewer #3 (Public Review):

      1) Validation of reagents: The authors generated a pY1230 Afadin antibody claiming that (page 6) "this new antibody is specific to tyrosine phosphorylated Afadin, and that pY1230 is targeted for dephosphorylation by PTPRK, in a D2-domain dependent manner". The WB in Fig 1B shows a lot of background; two main bands are visible, which both diminish in intensity in ICT WT pervanadate-treated MCF10A cell lysates. The claim that the developed peptide antibody is selective for pY1230 in Afadin would need to be substantiated, for instance by pull-down studies analysed by pY-MS. However, for the current study it would be sufficient to demonstrate that pY1230 is indeed the dephosphorylated site. I suggest therefore including a site-directed mutant (Y1230F) that would confirm dephosphorylation at this site and the ability of the antibody to recognize the phosphorylation state at this position.

      We would like this antibody to be a useful and freely accessible tool in the field and have taken on board the request for additional validation. To this end we have significantly expanded Supplementary Figure 2 (now Figure 1 - figure supplement 2) and included a dedicated section of the results as follows:

      1. We have now included information about all of the Afadin antibodies used in this study, since Afadin(BD) appears to be sensitive to phosphorylation (Figure 1 - figure supplement 2A).

      2. We have demonstrated that the Afadin pY1230 antibody detects an upregulated band in PTPRK KO MCF10A cells, consistent with our previous tyrosine phosphoproteomics (Figure 1 - figure supplement 2B). This indicates that the antibody can be used to detect endogenous Afadin phosphorylation.

      3. We have included two new knock down experiments demonstrating the recognition of Afadin by our antibody (Figure 1 - figure supplement 2C). There appear to be two Afadin isoforms recognised in HEK293T cells by both the BD and pY1230 antibody, consistent with previous reports (Umeda et al. MBoC, 2015). We have highlighted these in the figure.

      4. We have performed mutagenesis to demonstrate the specificity of the antibody. We tagged Afadin with a fluorescent protein tag, reasoning that it would cause a shift in molecular weight that could be resolved by SDS PAGE, as is the case. We noted that the phosphopeptide used spans an additional tyrosine, Y1226, which has been detected as phosphorylated (although to a much lower extent than Y1230) on Phosphosite plus. The data clearly show that Afadin cannot be phosphorylated when Y1230 is mutated to a phenylalanine (compared to CIP control), indicating that this is the predominant site recognised by the antibody. In addition, the endogenous pervanadate-stimulated signal is completely abolished by CIP treatment (Figure 1 - figure supplement 2D).

      5. We have included densitometric quantification of the dephosphorylation assay shown in Figure 1B, which was part of a time course and shows preferential dephosphorylation by the PTPRK ICD compared to the PTPRK D1. The signal stops declining with time, which could indicate antibody background, or an inaccessible pool of Afadin-pY1230 (Figure 1 - figure supplement 2E).

      6. To further demonstrate that this site is modulated by PTPRK in post-confluent cells, we have used doxycycline (dox)-inducible cell lines generated in Fearnley et al, 2019. Upon treatment with 500 ng/ml Dox for 48 hours, PTPRK is induced to lower levels than wildtype; however, normalized quantification of the Afadin pY1230 against the Afadin (CST) signal clearly indicates downregulation by PTPRK WT, but not the catalytically inactive mutant (Figure 1 - figure supplement 2F and 2G).

      Together these data strengthen our assertion that this antibody recognises endogenously phosphorylated Afadin at site Y1230, which is modulated in vitro and in cells by PTPRK phosphatase activity. For clarity, we have highlighted and annotated the relevant bands in figures. We have also included identifiers indicating which total Afadin antibody was used in each experiment.

      2) The authors claim that a short, 63-residue predicted coiled coil (CC) region, is both necessary and sufficient for binding to the PTPRK-ICD. The region is predicted to have alpha-helical structure and as a consequence, a helical structure has been used in the docking model. Considering that the authors recombinantly expressed this region in bacteria, it would be experimentally simple confirming the alpha-helical structure of the segment by CD or NMR spectroscopy.

      To clarify, the helical structure in the docking model was independently predicted by several sequence and structural analysis programmes including AlphaFold2, RoseTTAFold and NetSurfP, and is annotated in UniProt (as a coiled coil). We did not stipulate prior to the AF2 prediction that it was helical. Isolated short peptides frequently adopt helical structure, therefore prediction of a helix within the context of the full Afadin sequence is, in our opinion, stronger evidence than CD of an isolated fragment.

      3) Only two mutants have been introduced into PTPRK-ICD to map the Afadin interaction site. One of the mutations changes a possibly structurally important residue (glycine) into a histidine. Even though this residue is present in PTPRM, it does not exclude that the D2 domain no longer functionally folds. Also, the second mutation represents a large change in chemical properties, and the other 2 predicted residues have not been investigated.

      The residues that were selected for mutation are all localised to the protein surface and therefore are unlikely to be involved in stable folding of PTPRK. In support of the correct folding of the mutated PTPRK, we include in Figure 1 below SEC elution traces for wild-type and mutant D2 showing that they elute as single symmetric peaks at the same elution volume as the WT protein. This is consistent with them having a similar shape and size, and not being aggregated or unfolded.

      Figure 1. PTPRK-D2 wild-type and mutant preparative SEC elution profiles. A280nm has been normalised to help illustrate that the different proteins elute at the same volume. The main peak from these samples was used for binding assays in the main paper.

      Furthermore, the yield for the double mutant was very high (4 mg of pure protein from a 2 L culture, see A280 value in graph below), whereas poorly folded proteins tend to have significantly reduced yields. This protein was also very stable over time whereas unfolded proteins tend to degrade during or following purification.

      Figure 2. Analytical SEC elution profile for the PTPRK-D2 DM construct showing the very high yield consistent with a well-folded, stable protein.

      Finally, we have carried out thermal melt curves of the WT and mutant PTPRK D2 domains showing that they all possess melting temperatures between 39.3°C and 41.7°C, supporting that they are all equivalently folded. We include these data as an additional Supplementary Figure (Figure 4 - figure supplement 3) in the paper.

      4) The interface on the Afadin substrate has not been investigated apart from deleting the entire CC or a central charge cluster. Based on the docking model the authors must have identified key positions of this interaction that could be mutated to confirm the proposed interaction site.

      We have now made and tested several additional mutations within both the Afadin-CC and PTPRK-D2 domains to further validate the AF2 predicted model of the complex.

      For Afadin-CC we introduced several single and double mutations along the helix including residues predicted to be in the interface and residues distal from the interface. These mutations and the pulldown with PTPRK are described in the text and are included as additional panels to a modified Figure 3. All mutations have the expected effect on the interaction based on the predicted complex structure. To help illustrate the positions of these mutations we have also included a figure of the interface with the residues highlighted.

      For the PTPRK-D2 we have also introduced two new mutations, one buried in the interface (F1225A) and one on the edge of the interface encompassing a loop that is different in PTPRM (labelled the M-loop). GST-Afadin WT protein was bound to GSH beads and tested for its ability to pull down WT and mutated PTPRK. These new mutations (illustrated in the new Figure 4 – figure supplement 2) further support the model prediction. F1225A almost completely abolishes binding as predicted, while the M-loop mutant retains binding. These mutations and their effects are now described in the main text, and the pull-down data, including controls and retesting of the original DM mutant, are included as panel H in a newly modified Figure 4 focussed solely on the PTPRK interface.

      5) A minor point is that ITC experiments have not been run long enough to determine the baseline of interaction heats. In addition, as large and polar proteins were used in this experiment, a blank titration would be required to rule out that dilution heats affect the determined affinities.

      All control experiments including buffer into buffer, Afadin into buffer and buffer into PTPRK were carried out at the same time as the main binding experiment and are shown below overlaid with the binding curve. These demonstrate the very small dilution heats consistent with excellent buffer matching of the samples.

      We were able to obtain excellent fits to the titration curves by fitting 1:1 binding with a calculated linear baseline (see Figure 2B,D). Very similar results were obtained by fitting to the sum (‘composite’) of fitted linear baselines obtained for the three control experiments for each titration.

    1. Author Response

      Public Evaluation Summary:

      This work presents a series of enhancements to the PhIP-seq method of autoantibody discovery, with the goal of improving scaling to larger cohorts and increasing disease specificity. The strength of the paper is the validation of the high throughput format, although results from screening patient samples confirm or only modestly extend previous data.

      We thank the reviewers for their feedback and agree that the validation of our high throughput, easily accessible approach is a strength of this work. We appreciate that the reviewers expressed uncertainty about whether there were sufficient advances to qualify this paper as a Research Advance. In addition to a point-by-point rebuttal, we quantify and enumerate the advances, improvements, and novel findings disclosed in this manuscript, relative to our original eLife paper.

      1. Demonstration of the importance of adequate healthy control cohorts in PhIP-Seq design. Using scaled protocols, we demonstrate the importance of using large control cohorts to filter out non-specific hits, as well as to detect rare but specific disease-associated antigens such as PDYN. To our knowledge, we are the first to demonstrate and discuss the consequences of PhIP-Seq dataset interpretation in the absence of sufficient controls. These findings are especially important in light of recent, high-impact papers using few to no controls (Mina et al. Science 2019, Gruber et al. Cell 2020, among others) to make conclusions about novel autoantibodies in the context of specific diseases.

      2. Design, validation and documentation of accessible, benchtop protocols for scaled PhIP-Seq. These protocols enable parallel testing of 600-800 samples without contamination or batch effects. Using a substantially expanded, multi-cohort set of patients with APS1, we validate the quality of the protocol and apply this protocol to numerous other disease contexts. Importantly, our protocols are documented (protocols.io) with each step tested for optimal quality, and are easily accessible without the need for robotics or specialized equipment.

      3. Machine Learning for disease classification using phage-based immunoprofiling. We show that large, well-controlled PhIP-Seq datasets lend themselves well to machine learning approaches and enable unsupervised classification of disease status. To our knowledge, this is the first successful application of an unsupervised machine learning approach to phage-based immunoprofiling data. We demonstrate that PhIP-Seq data enables APS1 disease classification in 97% of cases (compare even to the 95% sensitivity seen in current testing for anti-IFN antibodies in the setting of suspected APS1). This finding, while applied to only one large cohort, demonstrates that PhIP-Seq data, when appropriately controlled, can have substantial value beyond serving as a single-antigen discovery platform. The combination of machine learning and phage-based immunoprofiling will likely have extensive applications beyond APS1, including the discovery of novel diagnostic tests and biomarkers.

      4. Novel IPEX antigen BTNL8. We discovered and validated anti-BTNL8 antibodies in 42% of IPEX patients, suggesting that this may be a major autoantigen in IPEX. BTNL8 is a cell surface-expressed protein in intestinal gamma-delta T-cells, raising the novel question of a possible role for autoantibodies in directly regulating gut epithelial immune homeostasis (see discussion, lines 540-551). This is the first report, not only of BTNL8, but of any antigen discovery by PhIP-seq immunoprofiling in IPEX patients. Given the importance of this discovery, we sought to validate the presence of these autoantibodies in an additional validation cohort. We were successful, and present these findings in the new Figure 5, highlighting the generalizability of our findings to IPEX patients.

      5. BEST4 autoantibodies in IPEX and RAG-hypomorphic patients. We discovered anti-BEST4 antibodies in 15% of patients with IPEX, as well as in 2 patients with RAG1/2 mutations, demonstrating a connection between the intestinal autoimmunity seen in both IPEX and RAG1/2 deficiency. Of note, one of the 2 RAG1/2-deficient patients with anti-BEST4 antibodies is known to have very-early-onset IBD (VEO-IBD), a rare sub-phenotype in RAG-hypomorphs (and other primary immune deficiencies). Given the severity of VEO-IBD and how little is known about why certain patients with immune dysregulation develop this phenotype, these findings mark an important scientific advance and provide an essential clue into etiology. Furthermore, given that IPEX is driven by dysfunctional Treg cells, the commonality of these findings in both IPEX and hypomorphic RAG indicates a potential role for Treg dysfunction in hypomorphic RAG.

      6. Expansion of scaled PhIP-Seq to interrogate severe COVID-19 pneumonia, Kawasaki disease (KD), and Multisystem Inflammatory Syndrome in Children (MIS-C). Importantly, in MIS-C we find no evidence for any of the previously reported autoantigens described in Gruber et al. (Cell, 2020), a study which made strong conclusions about autoantibodies despite featuring only 4 PhIP-Seq control samples. Our results highlight the importance of scaling and appropriate control groups, and caution against overinterpretation of reported disease-specific autoantigens in PhIP-Seq (or other expanded antigen screening technologies such as near-proteome wide fixed protein arrays) which utilize smaller control cohorts, often without orthogonal validation experiments.

      7. Anti-CGNL1 antibodies in KD/MIS-C. We discovered and validated autoantibodies to CGNL1 in KD and MIS-C. It is possible that these antibodies represent a subset of specificities within anti-endothelial cell antibodies, given the endothelial expression of CGNL1 as well as its implications in cardiovascular disease.
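      To make the argument about control cohorts in points 1 and 6 concrete, the sketch below calls a patient "positive" for a peptide only when its enrichment exceeds the healthy-control mean by a chosen number of control standard deviations. This is one common way to filter non-specific hits and is an assumption for illustration, not the pipeline used in the manuscript; the function name, the cutoff of 10 and all enrichment values are hypothetical.

```python
from statistics import mean, stdev


def positive_fraction(patients, controls, z_cut=10.0):
    """Fraction of patients whose signal exceeds the control-derived cutoff.

    patients -- dict mapping peptide name to a list of patient enrichments
    controls -- dict mapping peptide name to a list of control enrichments
    z_cut    -- number of control standard deviations above the control mean
    """
    fractions = {}
    for peptide, patient_values in patients.items():
        control_values = controls[peptide]
        cutoff = mean(control_values) + z_cut * stdev(control_values)
        hits = sum(1 for value in patient_values if value > cutoff)
        fractions[peptide] = hits / len(patient_values)
    return fractions


# Hypothetical enrichment values (arbitrary units).
# Peptide "A" is quiet in every control; peptide "B" is enriched in some controls.
example_controls = {
    "A": [1.0, 1.2, 0.8, 1.1, 0.9],
    "B": [1.0, 40.0, 1.0, 55.0, 1.0],
}
example_patients = {
    "A": [1.0, 50.0, 60.0, 1.1],
    "B": [45.0, 50.0, 1.0, 1.0],
}

print(positive_fraction(example_patients, example_controls))
```

      In this toy data, peptide "B" is enriched in two patients but also in two controls, so the control-derived cutoff is high and no patient is called positive. With only a handful of controls, such a peptide could easily be missed among the controls and mistaken for a disease-specific antigen.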

      Reviewer #2 (Public Review):

      The authors update PhIP-seq into a high throughput format with the goal to accommodate screening of large numbers of human patient sera for the presence of novel autoantibodies and screening of more control sera to better determine standards for positivity of experimental samples. The high throughput protocol is detailed in an associated web-based format and validated in the paper using sera from patients with inherited immunodeficiencies and patients with MIS-C, Kawasaki syndrome, and COVID19. These are strengths of the work, and the high throughput PhIP-seq format will be useful to other investigators doing similar screenings. Yet, the findings do not significantly extend our knowledge of the range of autoantibodies in these illnesses, and many of the autoantibodies detected using PhIP-seq linear epitopes are not validated with other strategies, limiting significance of the results. The data from MIS-C and Kawasaki cohorts are confounded by an undetermined number of IVIG treated subjects, and limited numbers of control samples, including sera from patients with febrile illnesses that contain autoantibodies that are not discussed in the context of findings from the experimental groups.

      In summary, the paper is solid technically, with the high throughput strategy seemingly well validated; however, the advance here is primarily a technical one.

      We thank the reviewer and agree that the technical advance here is substantial and will be of value to other investigators doing similar screenings – as well as to investigators who previously did not have access to this technology due to high requirements for robotics and specialized equipment in previous iterations of the protocol. As such, we feel that this, combined with the demonstration of how to appropriately control PhIP-Seq experiments, should be considered a valuable research advance alone -- even in the absence of the extensive validation and novel findings on 5 additional disease contexts, summarized in greater detail above.

      IVIG status is discussed in lines 417-423. Briefly, the large majority of MIS-C samples are confirmed to be IVIG-free at the time of blood draw. All of our KD samples are confirmed IVIG-free.

      While pediatric febrile illness samples could conceivably contain autoantibodies, we believe that this is the best group for comparison given that these samples are taken from age-matched, acutely ill patients, thus providing a control group that is as clinically similar to MIS-C as possible. In addition, we included adult healthy sera and adult COVID19 sera as secondary control groups. Of note, this matching is much more extensive (and substantially larger in number) than the recent study in Cell (Gruber et al. 2020), which for PhIP-Seq used only 4 healthy, COVID19-negative samples to compare to 9 MIS-C samples.

      Reviewer #3 (Public Review):

      This paper presents a rigorously performed series of studies to improve the ability of the PhIP-seq method to discover autoantibodies against peptide antigens that span the whole peptidome at scale, and increase the ease of validation and definition of disease specificity. The paper is an extension of a recent paper from the DeRisi and Anderson groups done on APS1 patients, which defined and validated a novel series of tissue-specific autoantigens in APS1. The current studies show that the authors can find the antibodies they previously defined and, using larger numbers of disease and control samples, can somewhat expand what they detect. They then use the new method to look at multiple additional processes in which autoimmunity has been demonstrated/postulated.

      The dataset may be of use to others interested in defining novel autoantibodies. The findings really did not share significant new insights into the processes they studied. As the authors note, they were unable to detect the antibodies (~10% of patients) recognizing type I IFNs in severe COVID-19, where these had been demonstrated effectively using ELISA previously. Unlike APS1, where their findings about uncommon tissue-specific autoantibody responses across a population with known genetic deficiency and heterogeneous phenotypes could really illustrate the power of the method and approach, that elegant, powerful and novel conclusion is not as evident here.

      The trade-off between sensitivity, specificity, and screening power of antigen discovery tools is present in every assay. We do not feel that the comparison of our assay to a single protein ELISA assay is appropriate (nor particularly relevant for the conclusions drawn in this manuscript) given the inherent difference in nature and goals of the two assays. It has long been understood that PhIP-Seq does not have sensitivity for all protein antigens, including post-translationally modified and conformational antigens, which we state for readers in lines 190-193, within the discussion section, as well as in our previous work.

    1. Author Response

      Reviewer #2 (Public Review):

      Silberberg et al. present a series of cryo-EM structures of the ATP dependent bacterial potassium importer KdpFABC, a protein that is inhibited by phosphorylation under high environmental K+ conditions. The aim of the study was to sample the protein's conformational landscape under active, non-phosphorylated and inhibited, phosphorylated (Ser162) conditions.

      Overall, the study presents 5 structures of phosphorylated wildtype protein (S162-P), 3 structures of phosphorylated 'dead' mutant (D307N, S162-P), and 2 structures of constitutively active, non-phosphorylatable protein (S162A).

      The true novelty and strength of this work is that 8 of the presented structures were obtained either under "turnover" or at least 'native' conditions without ATP, ie in the absence of any non-physiological substrate analogues or stabilising inhibitors. The remaining 2 were obtained in the presence of orthovanadate.

      Comparing the presented structures with previously published KdpFABC structures, there are 5 structural states that have not been reported before, namely an E1-P·ADP state, an E1-P tight state captured in the autoinhibited WT protein (with and without vanadate), and two different nucleotide-free 'apo' states and an E1·ATP early state.

      Of these new states, the 'tight' states are of particular interest, because they appear to be 'off-cycle', dead end states. A novelty lies in the finding that this tight conformation can exist both in nucleotide-free E1 (as seen in the published first KdpFABC crystal structure), and also in the phosphorylated E1-P intermediate.

      By EPR spectroscopy, the authors show that the nucleotide free 'tight' state readily converts into an active E1·ATP conformation when provided with nucleotide, leading to the conclusion that the E1-P·ADP state must be the true inhibitory species. This claim is supported by structural analysis supporting the hypothesis that the phosphorylation at Ser162 could stall the KdpB subunit in an E1P state unable to convert into E2P. This is further supported by the fact that the phosphorylated sample does not readily convert into an E2P state when exposed to vanadate, as would otherwise be expected.

      The structures are of medium resolution (3.1 - 7.4 Å), but the key sites of nucleotide binding and/or phosphorylation are reasonably well supported by the EM maps, with one exception: in the 'E1·ATP early' state determined under turnover conditions, I find the map for the gamma phosphate of ATP not overly convincing, leaving the question whether this could instead be a product-inhibited, Mg-ADP bound E1 state resulting from an accumulation of MgADP under the turnover conditions used. Overall, the manuscript is well written and carefully phrased, and it presents interesting novel findings, which expand our knowledge about the conformational landscape and regulatory mechanisms of the P-type ATPase family.

      We thank the reviewer for their comments and helpful insights. We have addressed the points as follows:

      However in my opinion there are the following weaknesses in the current version of the manuscript:

      1) A lack of quantification. The heart of this study is the comparison of the newly determined KdpFABC structures with previously published ones (of which there are already 10). Yet, there are no RMSD calculations to illustrate the magnitude of any structural deviations. Instead, the authors use phrases like 'similar but not identical to', 'has some similarities', 'virtually identical', 'significant differences'. This makes it very hard to appreciate the true level of novelty/deviation from known structures.

      This is a very valid point and we thank the reviewers for bringing it up. To provide a better overview and appreciation of conformational similarities and significant differences, we have calculated RMSDs between all available structures of KdpFABC. They are summarised in the new Table 1 – Table Supplement 2. We have included individual RMSD values, whenever applicable and relevant, in the respective sections in the text and figures. We note that the RMSDs were calculated only between the cytosolic domains (KdpB N, A and P domains) after superimposition of the full-length protein on KdpA, which is rigid across all conformations of KdpFABC (see description in Materials and Methods, lines 1184-1191, or the caption to Table 1 – Table Supplement 2). We opted not to report the RMSD calculated between the full-length proteins, as the largest part of the complex does not undergo large structural changes (see Figure 1 – Figure Supplement 1; the transmembrane region of KdpB as well as KdpA, KdpC and KdpF show relatively small to no rearrangements compared to the cytosolic domains) and would otherwise obscure the relevant RMSD differences discussed here.
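      The calculation described above, superimposing on a rigid reference subunit and then measuring the RMSD of other domains without refitting them, can be sketched as follows. This is a minimal standard-library illustration and not the software used for the manuscript: the least-squares rotation is obtained with Horn's quaternion method and a small Jacobi eigen-solver, and all coordinates and function names are hypothetical.

```python
import math


def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]


def jacobi_eigh(matrix, sweeps=50):
    """Eigenvalues and eigenvectors (columns of v) of a small symmetric matrix."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-22:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-300:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of a and v
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # rotate rows p and q of a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return [a[i][i] for i in range(n)], v


def fit_transform(mobile, target):
    """Rotation and centroids that best superimpose mobile onto target."""
    cm, ct = centroid(mobile), centroid(target)
    s = [[0.0] * 3 for _ in range(3)]
    for p, q in zip(mobile, target):
        for i in range(3):
            for j in range(3):
                s[i][j] += (p[i] - cm[i]) * (q[j] - ct[j])
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    n = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    values, vectors = jacobi_eigh(n)
    best = max(range(4), key=lambda i: values[i])
    w, x, y, z = (vectors[r][best] for r in range(4))  # unit quaternion
    rot = [
        [w * w + x * x - y * y - z * z, 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), w * w - x * x + y * y - z * z, 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), w * w - x * x - y * y + z * z],
    ]
    return rot, cm, ct


def apply_transform(rot, cm, ct, points):
    moved = []
    for p in points:
        c = [p[i] - cm[i] for i in range(3)]
        moved.append([sum(rot[i][j] * c[j] for j in range(3)) + ct[i] for i in range(3)])
    return moved


def rmsd(a, b):
    total = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(total / len(a))


def domain_rmsd(ref_mobile, ref_target, dom_mobile, dom_target):
    """Fit on the reference atoms only, then measure RMSD on the domain atoms."""
    rot, cm, ct = fit_transform(ref_mobile, ref_target)
    return rmsd(apply_transform(rot, cm, ct, dom_mobile), dom_target)


# Toy example: structure B is structure A rotated 90 degrees about z and translated,
# with its "domain" atoms additionally shifted by 2 units along z beforehand.
def _move(p):
    return [-p[1] + 3.0, p[0] - 2.0, p[2] + 7.0]


ref_a = [[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]]
dom_a = [[5, 5, 20], [6, 5, 21], [5, 7, 22]]
ref_b = [_move(p) for p in ref_a]
dom_b = [_move([p[0], p[1], p[2] + 2.0]) for p in dom_a]

print(domain_rmsd(ref_b, ref_a, dom_b, dom_a))
```

      In the toy example the reference atoms superimpose exactly, so the reported value should come out close to 2.0, the displacement applied to the domain atoms. Fitting on the whole complex instead would spread that displacement over all atoms and report a smaller, less informative number.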

      Also the decrease in EPR peak height of the E1 apo tight state between phosphorylated and non-phosphorylated sample - a key piece of supporting data - is not quantified.

      EPR distance distributions have been quantified by fitting and integrating a Gaussian distribution curve, and the results have been added to the corresponding results section (lines 523-542) and the methods section (lines 1230-1232).

      2) Perhaps as a consequence of the above, there seems to be a slight tendency towards overstatements regarding the novelty of the findings in the context of previous structural studies. The E1-P·ATP tight structure is extremely similar to the previously published crystal structure (5MRW), but it took me three reads through the paper and a structural superposition (overall RMSD less than 2Å) to realise that. While I do see that the existing differences - the two helix shifts in the P- and A- domains - are important and do probably permit the usage of the term 'novel conformation' (I don't think there is a clear consensus on what level of change defines a novel conformation), it could have been made more clear that the 'tight' arrangement of domains has actually been reported before, only it was not termed 'tight'.

      As indicated above, we have now included an extensive RMSD table between all available KdpFABC structures. To ensure a meaningful comparison, the RMSDs are calculated only between the cytosolic domains after superimposition of the full-length protein on KdpA, as the transmembrane region of KdpFABC is largely rigid (see figure below panel B). However, we have to note that in the X-ray structure the transmembrane region of KdpB is displaced relative to the rest of the complex when compared to the arrangement found in any of the other 18 cryo-EM structures, which all align well in the TMD (see figure below panel C). These deviations make the crystal structure somewhat of an outlier and might be a consequence of the crystal packing (see figure below panel A). For completeness in our comparison with the X-ray structure, we have included an RMSD calculated when superimposed on KdpA and an additional RMSD that was calculated between structures when aligned on the TMD of KdpB (see figure below panel D,E). The RMSD of less than 2Å that the reviewer mentions was probably obtained when superimposing the entire complexes on each other (see figure below panel F). However, we do not believe that this is a reasonable comparison, as the TMD of the complex is significantly displaced, which stands in strong contrast to all other RMSDs calculated between the rest of the structures, where the TMD aligns well (see figure below panel B).

      From the resulting comparisons, we conclude that the E1P tight and the X-ray structures are similar but not identical, particularly in the relative orientation of the cytosolic domains to the rest of the complex. We hope that including the RMSD in the text and separately highlighting the important features of the E1P tight state in the section “E1P tight is the consequence of an impaired E1P/E2P transition” now makes the story more conclusive.

      Likewise, the authors claim that they have covered the entire conformational cycle with their 10 structures, but this is actually not correct, as there is no representative of an E2 state or functional E1P state after ADP release.

      This is correct, and we have adjusted the phrasing to “close to the entire conformational cycle” or “the entire KdpFABC conformational cycle except the highly transient E1P state after ADP release and E2 state after dephosphorylation.”

      3) A key hypothesis this paper suggests is that KdpFABC cannot undergo the transition from E1P tight to E2P and hence gets stuck in this dead end 'off cycle' state. To test this, the authors analysed an S162-P sample supplied with the E2P inducing inhibitor orthovanadate and found about 11% of particles in an E2P conformation. This is rationalised as a residual fraction of unphosphorylated, non-inhibited, protein in the sample, but the sample is not actually tested for residual unphosphorylated fraction or residual activity. Instead, there is a reference to Sweet et al, 2020. So the claim that the 11% E2P particles in the vanadate sample are irrelevant, whereas the 14% E1P tight from the turnover dataset are of key importance, would strongly benefit from some additional validation.

      We have added an ATPase assay that shows the residual ATPase activity of WT KdpFABC compared to KdpFABS162AC, both purified from E. coli LB2003 cells, which is identical to the protein production and purification for the cryo-EM samples (see Figure 2-Suppl. Figure 5). The residual ATPase activity is ca. 14% of the uninhibited sample, which correlates with the E2-P fraction in the orthovanadate sample.

      Reviewer #3 (Public Review):

      The authors have determined a range of conformations of the high-affinity prokaryotic K+ uptake system KdpFABC, and demonstrate at least two novel states that shed further light on the structure and function of these elusive protein complexes.

      The manuscript is well-written and easy to follow. The introduction puts the work in a proper context and highlights gaps in the field. I am however missing an overview of the currently available structures/states of KdpFABC. This could also be implemented in Fig. 6 (highlighting new vs available data). This is also connected to one of my main remarks - the lack of comparisons and RMSD estimates to available structures. Similarity/resemblance to available structures is indicated several times throughout the manuscript, but this is not quantified or shown in detail, and hence it is difficult for the reader to grasp how unique or alike the structures are. Linked to this, I am somewhat surprised by the lack of considerable changes within the TM domain and the overlapping connectivity of the K indicated in Table 1 - Figure Supplement 1. According to Fig. 6 the uptake pathway should be open in early E1 states, but not in E2 states, contrasting to the Table 1 - Figure Supplement 1, which show connectivity in all structures? Furthermore, the release pathway (to the inside) should be open in the E2-P conformation, but no release pathway is shown as K ions in any of the structures in Table 1 - Figure Supplement 1. Overall, it seems as if rather small shifts in-between the shown structures (are the structures changing from closed to inward-open)? Or is it only KdpA that is shown?

      We thank the reviewer for their positive response and constructive criticisms. We have addressed these comments as follows:

      1. The overview of the available structures has been implemented in Fig. 6, with the new structures from this study highlighted in bold.

      2. RMSD values have been added to all comparisons, with a focus on the deviations of the cytosolic domains, which are most relevant to our conformational assignments and discussions.

      3. To highlight the (comparatively small) changes in the TMD, we have expanded Table 1 - Figure Supplement 1 to include panels showing the outward-open half-channel in the E1 states with a constriction at the KdpA/KdpB interface and the inward-open half-channel in the E2 states. The largest observable rearrangements do however take place in the cytosolic domains. This is in full agreement with previous studies, which focused more on the transition occurring within the transmembrane region during the transport cycle (Stock et al, Nature Communications 2018; Silberberg et al, Nature Communications 2021; Sweet et al., PNAS 2021).

      4. The ions observed in the intersubunit tunnel are all before the point at which the tunnel closes, explaining why there is no difference in this region between E1 and E2 structures. Moreover, as we discussed in our last publication (Silberberg, Corey, Hielkema et al., 2021, Nat. Comms.), the assignment of non-protein densities along the entire length of the tunnel is contentious and can only be certain in the selectivity filter of KdpA and the CBS of KdpB.

      5. The release pathway from the CBS does not feature any defined K+ coordination sites, so ions are not expected to stay bound along this inward-open half-channel.

      My second key remark concerns the "E1-P tight is the consequence of an impaired E1-P/E2-P transition" section, and the associated discussion, which is very interesting. I am not convinced though that the nucleotide and phosphate mimic-stabilized states (such as E1-P:ADP) represent the high-energy E1P state, as I believe is indicated in the text. Supportive of this, in SERCA, the shifts from the E1:ATP to the E1P:ADP structures are modest, while the following high-energy Ca-bound E1P and E2P states remain elusive (see Fig. 1 in PMID: 32219166, from 3N8G to 3BA6). Or maybe this is not what the authors claim, or the situation is different for KdpFABC? Associated, while I agree with the statement in rows 234-237 (that the authors likely have caught an off-cycle state), I wonder if the tight E1-P configuration could relate to the elusive high-energy states (although initially counter-intuitive as it has been caught in the structure)? The claims on rows 358-360 and 420-422 are not in conflict with such an idea, and the authors touch on this subject on rows 436-450. Can it be excluded that it is the proper elusive E1P state? If the state is related to the E1P conformation it may well have bearing also on other P-type ATPases and this could be expanded upon.

      This is a good point, particularly since the E1P·ADP state is the most populated state in our sample, which is also counterintuitive for a “high-energy unstable state”. One possible explanation is that this state already has some of the E1-P strains (which we can see in the clash of D307-P with D518/D522), but the ADP and its associated Mg2+ in particular help to stabilize this. Once ADP dissociates and takes the Mg2+ with it, the full destabilization takes effect in the actual high-energy E1P state. Nonetheless, we consider it fair to compare the E1P tight with the E1P·ADP to look for electrostatic relaxation. We have clarified the sequence of events and the role we hypothesize for ADP/Mg2+ in stabilizing the E1P·ADP state that we can see (lines 609-619): “Moreover, a comparison of the E1P tight structure with the E1P·ADP structure, its most immediate precursor in the conformational cycle obtained, reveals a number of significant rearrangements within the P domain (Figure 5B,C). First, Helix 6 (KdpB538-545) is partially unwound and has moved away from helix 5 towards the A domain, alongside the tilting of helix 4 of the A domain (Figure 5B,C – arrow 2). Second, and of particular interest, are the additional local changes that occur in the immediate vicinity of the phosphorylated KdpBD307. In the E1P·ADP structure, the catalytic aspartyl phosphate, located in the D307KTG signature motif, points towards the negatively charged KdpBD518/D522. This strain is likely to become even more unfavorable once ADP dissociates in the E1P state, as the Mg2+ associated with the ADP partially shields these clashes. The ensuing repulsion might serve as a driving force for the system to relax into the E2 state in the catalytic cycle.”

      We believe it is highly unlikely that the reported E1-P tight state represents an on-cycle high-energy E1P intermediate. For one, we observe a relaxation of electrostatic strains in this structure, in particular when compared to the obtained E1P·ADP state. By contrast, the E1P state should be the most energetically unfavourable state possible to ensure the rapid transition to the E2P state. As such, it should be transient, making it less likely to accumulate and be captured structurally. Additionally, the association of the N domain with the A domain in the tight conformation, which would have to be reverted, would be a surprising intermediary step in the transition from E1P to E2P. Altogether, the E1P tight state reported here most likely represents an off-cycle state.

    1. Author Response

      Reviewer #1 (Public Review):

      A novel approach is introduced for targeting Protein-RNA interactions. The approach (presented in Figure 1) integrates computational techniques with cellular assays, and is applicable, in principle, whenever the protein-RNA complex has a druggable binding pocket. It is demonstrated with the discovery of inhibitors of YB-1's interaction with its mRNA target. Of 22 putative hits, discovered based on virtual screen, 11 come out as very strong hits. Far beyond the 5-10 percent success rate that one often sees in drug discovery. The main strength here is the proof of concept that protein-RNA interactions are targetable.

      We agree with the reviewer that large computational screens to identify potential inhibitors generally lead to dead ends. This is why we have rationally designed this integrative approach, in which predictions are experimentally validated with different tools and the obtained results feed back into and orient the computational approach. The workflow illustrated in Figure 1 creates a lively exchange between computational and experimental data and allows a back-and-forth between the two to enhance and refine the computational screen. We have also put in place a refined physics-based computational approach to increase our chances of avoiding these dead-end screens (details are in Computational Methods and in Appendix 2). The high predictive power of our computational approach comes from a rationally designed workflow combining the following:

      1- Understanding the dynamic behavior of the target, the binding pocket, and identification of key residues using MD simulations.

      2- The starting 3D structures used, which were refined using MD simulations.

      3- The prior identification and validation of the binding site and the identification of F1 and F4 as hits by NMR spectroscopy. F1 was then used in the pharmacophore screen.

      4- The statistical mechanics-based filter played an important role in orienting and refining this selection. For example, the use of ligand-water interactions to qualitatively estimate the residence of the ligand in the binding site.

      Nevertheless, the high success rate also comes from human intervention, where visual inspection and rational selection of structurally promising candidates (sometimes intuition-driven) also played an important role in selecting the 111 molecules issued from the static virtual screen (pharmacophore screens). We now clarify this point on pages 5 and 6 of the revised manuscript and give more details on the selection criteria used. We also specify that the large computational screen we implemented was mandatory to validate the MT bench.

      Reviewer #2 (Public Review):

      In the manuscript "Targeting RNA-Protein Interactions with an Integrative Approach Leads to the Identification of Potent YB-1 Inhibitors" the authors have tried to integrate computational, structural, and cellular imaging approaches to identify small molecule inhibitors of RNA-protein interactions. They take up as their target YB-1, an abundant RNA-binding protein (RBP) involved in regulating the translation and/or processing of multiple mRNAs, many of which encode genes involved in tumorigenesis and tumor progression. Firstly, the authors find a binding pocket in the cold shock domain (CSD) of YB-1, for the flavonoid fisetin, and more so for the analog quercetin, by NMR spectroscopy, which they name the "quercetin pocket". They then delineate and refine the RNA-binding characteristics of this pocket by MD simulations. Further, they conduct a computational screen of a large library of small molecules to find candidates which bind to this pocket. They then check the selected candidates as inhibitors of YB1-mRNA interaction using the microtubule bench (MT-bench) method. They find 11 molecules as significant hits with this approach, including one FDA-approved PARP-inhibitor drug (P1). P1 is shown to bind YB-1 by MD-simulation and NMR spectroscopy and was also shown to interfere with YB-1-mRNA interaction by NMR and in cells by the MT-bench assay. Finally, they showed that the molecule P1 reduced cellular translation by a puromycin incorporation assay and this effect was not observed in cells depleted of YB-1.

      Together, these multifarious approaches appear to establish a workflow useful for scoring for inhibitors of RNA-protein interactions. The workflow is rationally designed, moving from the identification of a binding pocket to the identification of binding molecules and then selecting molecules that inhibit protein-mRNA interactions. This workflow may be useful for other researchers attempting to screen libraries of compounds targeting RNA interactions by other RNA-binding proteins. However, as many RNA-binding proteins have large intrinsically disordered regions or no recognizable RNA-binding domains, it is to be seen whether such a structural "binding-pocket"-based approach can be generalizable to all RNA-binding proteins.

      We agree with the reviewer that this is not sufficient to generalize to all RBPs. Performing a complete study for other RBPs would require a separate paper. In the current work, we did show that we can detect mRNA-RBP interactions with two other RBPs, HuR and FUS, and used them as controls to show the specificity of the tested small molecules towards YB-1 (Figures 3d and 4b,c). We have now toned down the statements about the generality of the method (page 20).

      In the discussion, we now also explain that YB-1, because it has a single cold-shock domain and a druggable pocket, is an “ideal” target. We also explain that many RBPs harbor several RNA-binding domains, which may reduce the sensitivity of our method when a specific domain is targeted by small molecules, because the other domains would still contribute to mRNA binding. However, a single RNA-binding domain may be isolated and used as bait for the MT bench assay to overcome this obstacle. Developing molecules that target a specific domain may be sufficient to modulate the biological function exerted by the full-length protein.

      While the data presented in the paper is coherent and generally supports the demonstration of an inhibition of RNA-binding by YB-1, what appears to be lacking is evidence that the observed effect is specific to inhibition of YB-1-mediated regulation of translation and whether the expression of transcripts specifically regulated by YB-1 is affected. Secondly, it is not clear what is the effect of the putative inhibitor on cellular activity and behaviour, which is important to judge both specific phenotypic effects as well as non-specific cytotoxic effects.

      Overall the work is interesting and instructive, but the lack of the above observations detracts from its significance.

      We thank the reviewer for his feedback and for raising these interesting points. As indicated in the manuscript, it is very difficult to find functional cellular assays that would reveal a phenotype specific to a general RBP such as YB-1. This is even more difficult with YB-1 since it binds nonspecifically to most mRNAs, as shown by CLIP analysis [1]. This was one of the reasons to develop a specific cellular assay such as the MT bench assay. YB-1 originates from cold shock proteins in bacteria, which preserve global mRNA translation during cold stress, presumably by removing secondary structures. YB-1, in contrast with many RBPs, has only a single structured RNA-binding domain, which is not favorable for specific binding to particular mRNA sequences/structures. As noticed by the reviewers, YB-1 is indeed not a general translation factor but is a general protein that binds to most non-polysomal mRNAs [2]. mRNAs, even those highly translated, switch from a polysomal state (active) to a non-polysomal state (dormant) from time to time. In a recent work, we showed that YB-1 prepares non-polysomal mRNAs in a way that facilitates the transition from the dormant to the active state. We also showed that, accordingly, decreasing the expression of YB-1 reduces global mRNA translation rates in HeLa cells [3]. Consistent with this trend, a global decrease of mRNA translation, as observed with Niraparib P1 that targets YB-1, makes sense. We have no knowledge of established 3’UTRs that would be highly specific to YB-1. YB-1 binds nonspecifically to both mRNA coding sequences and 3’UTRs (YBX1 data [1], YBX3 data [4]). Large-scale and in-depth analyses should be performed to find out whether specific structures/sequences significantly increase the YB-1 dependency of mRNA translation. However, the expression of some proteins associated with malignancy, notably Vimentin and E-cadherin, has been linked to the YB-1 expression level [3]. We therefore performed a new experiment in which we measured the expression levels of these two proteins after silencing YB-1 expression in HeLa cells, in the absence and in the presence of Niraparib P1 and Olaparib P2 (used as a negative control). The results show that P1, but not P2, decreases the dependence on YB-1 of the Vimentin expression level (significant) and that of E-cadherin (non-significant). Other proteins such as eIF5a and RPL36, used here as negative controls, did not show a similar behavior. These results are thus in agreement with a specific effect of Niraparib on YB-1-mediated translation. In agreement with these results, we now add a result from a recent report showing the downregulation of Vimentin expression in ovarian cancer cells treated with Niraparib [5]. This is now discussed on pages 16 and 17 of the revised manuscript and the new data are included as a new figure, Figure 8-Figure supplement 3.

      1. Wu, S.-L. et al. Genome-wide analysis of YB-1-RNA interactions reveals a novel role of YB-1 in miRNA processing in glioblastoma multiforme. Nucleic acids research 43, 8516-8528 (2015).

      2. Singh, G., Pratt, G., Yeo, G.W. & Moore, M.J. The clothes make the mRNA: past and present trends in mRNP fashion. Annual review of biochemistry 84, 325 (2015).

      3. Budkina, K. et al. YB-1 unwinds mRNA secondary structures in vitro and negatively regulates stress granule assembly in HeLa cells. Nucleic acids research 49, 10061-10081 (2021).

      4. Van Nostrand, E.L. et al. A large-scale binding and functional map of human RNA-binding proteins. Nature 583, 711-719 (2020).

      5. Zeng, Z., Yu, J., Jiang, Z. & Zhao, N. Oleanolic acid (OA) targeting UNC5B inhibits proliferation and EMT of ovarian cancer cell and increases chemotherapy sensitivity of niraparib. Journal of Oncology 2022, 5887671 (2022). https://doi.org/10.1155/2022/5887671

      As for the effect of the putative inhibitor on cellular activity and behaviour, which is important to judge both specific phenotypic effects and non-specific cytotoxic effects, we agree with the reviewer on this remark. YB-1 is associated with the high proliferation rate of cancer cells (and silencing YB-1 does not induce apoptosis). Therefore, we performed cell proliferation assays using cells treated with siRNA and siNEG, allowing us to manipulate the endogenous YB-1 expression level rather than rely on a more artificial rescue experiment. These assays were performed in the presence of 3 PARP-1 inhibitors at low concentrations: Niraparib P1, our hit, and two negative controls, Olaparib P2 and Talazoparib P3. We used a 48 h incubation time, which allows effects to be observed at lower compound concentrations. All PARP-1 inhibitors decrease cell proliferation, albeit to a higher extent with P3. However, P2 or P3 further decrease cell proliferation in siRNA-treated cells compared to siNEG-treated cells (significant differences at 5 µM). In contrast, Niraparib rather further decreases cell proliferation in siNEG-treated cells, when YB-1 levels are high (non-significant variations but opposite to those observed with P2 and P3). This new result is now presented as new Figure 8a. In addition, we show that the separation distance between cells increases significantly in YB-1-rich cells treated with P1, in contrast to P2 and P3 (significant differences) (new figure Figure 8-Figure supplement 1). A short distance of separation between cells may be due to colony formation when cells were plated at low density and allowed to grow for 48 h. Again, this means that Niraparib inhibits cell proliferation in YB-1-rich cells better than the two other PARP inhibitors, Talazoparib and Olaparib. The text on page 17 was rewritten to include these new results and to highlight this point.

      Reviewer #3 (Public Review):

      The authors introduce an integrative platform for identifying small molecule ligands that can disrupt RNA-protein interactions (RPIs) in vitro and in cells. The screening assay is based on prior work establishing the MT bench assay (Boca et al. 2015) for evaluating protein-protein interactions in cells by utilizing microtubules as a platform to recruit and detect PPIs in cells. In the current manuscript, the authors adapted this methodology to evaluate small molecules targeting RNA-binding protein (RBPs) interactions with mRNA in cells. By combining the MT bench assay with computational docking/screening and ligand-binding evaluations by NMR, the authors discover inhibitors of the RBP YB-1, which included FDA-approved PARP-1 inhibitors. The impact of this work could be high given the critical roles of RNA-binding proteins in regulating the function and fate of coding and non-coding RNA. While the presented data are promising, the ability to generally apply this method beyond YB-1 and to RBPs in general remains to be addressed.

      We agree with the reviewer's comments. In the revised version of the manuscript, we have toned down the statements about the generality of the method. In addition, we elaborate on the potential of our assays and on how to deal with RBPs that often have more than one RNA-binding domain. If many RNA-binding domains contribute to the binding of a given RBP to mRNA, the MT bench assay may lose sensitivity. However, one option is to use an isolated RNA-binding domain as bait; targeting that domain could be enough to impair or correct the function of the full-length RBP target. A statement has been added on page 20 of the revised manuscript to discuss this point.

    1. Author Response

      Reviewer #1 (Public Review):

      GCaMP indicators have become common, almost ubiquitous tools used by many neuroscientists. As calcium buffers, calcium indicators have the potential to perturb calcium dynamics and thereby alter neuronal physiology. With so many labs using GCaMPs across a variety of applications and brain regions, it's remarkable how few have documented GCaMP-related perturbations of physiology, but there are two main contexts in which perturbations have been observed: after prolonged expression of a high GCaMP concentration (common several weeks after infection with a virus using a strong promoter); and when cytoplasmic GCaMP is present during neuronal development. As a result, GCaMP studies are often designed to avoid these two conditions.

      Here, Xiaodong Liu and colleagues ask whether GCaMP-X series indicators are less toxic than GCaMPs. GCaMP-X indicators are modified GCaMPs with an additional N-terminal calmodulin binding domain that reduces interactions of the calmodulin moiety of GCaMP with other cellular proteins. Xiaodong Liu and colleagues document effects of GCaMP expression on neuronal morphology in vitro, calcium oscillations in vitro, and sensory responses in vivo, in each case showing that GCaMP-X indicators are less toxic. Their results are compelling.

      Unfortunately, the paper suffers two main weaknesses. Firstly, the results demonstrate that GCaMP is toxic during development, after prolonged expression via viruses in vivo, and in cell culture where maturation of the culture likely recapitulates key steps in development. GCaMPs are known to be toxic in these circumstances, such toxicity is readily circumvented by driving expression in the adult, and there are countless examples of studies in which adequate GCaMP expression was achieved without toxicity. These new results are of little relevance to the majority of GCaMP experiments. That GCaMP-X indicators are less toxic during development is a new result and may be of interest to those who wish to deploy calcium indicators during development, but this is a relatively small number of neuroscientists.

      We thank the reviewer for providing valuable opinions on these critical matters. Here, we would like to clarify:

      1. In our work, the status of neurites (length, branching, etc.) is indeed one main aspect to monitor, and neuritogenesis during the early stages of development is known to have temporal trajectories with ample dynamic range thus helpful to quantitatively compare GCaMP-X versus GCaMP. However, the key factor is the actual time and level of probe expression in neurons, and the starting timepoint of expression could vary. We have conducted additional experiments using virus-infected neurons (Figure 5—figure supplement 1) and transgenic neurons with inducible expression (Figure 7—figure supplement 3), both starting to express the probes at the mature stage. Thus, GCaMP-X imaging is not necessarily limited to developing neurons. As in the original reports of GCaMP probes with toxicity, virus injection was performed for both immature (2-3 weeks, Tian 2009 PMID: 19898485) and mature mice (~2 months, Chen 2013 PMID: 23868258). According to the protocol (Huber 2012 PMID: 22538608), GCaMP virus injection was done for adult mice (>2 months), which exhibited functional and morphological deficits in nucleus-filled neurons beyond OTW (Figure 2, Figure 5 and Figure 6). Collectively, the central principles of GCaMP-X versus GCaMP are applicable to both immature and mature neurons.

      2. Chronic GCaMP-X imaging has a broad spectrum of potential applications, not limited to neural development (Resendez 2016 PMID: 26914316). As mentioned, GCaMP-X resolves the problem of longitudinal expression thus making chronic imaging more feasible. We agree with the reviewer that a large body of our data in the original version focused on the characteristics of calcium signals during the early stage of neuronal development, which served as an exemplary scenario to compare GCaMP-X with GCaMP. Indeed, the importance of Ca2+ oscillation in neural development is commonly accepted (Kamijo 2018 PMID: 29773754; Gomez 2006 PMID: 16429121). In vivo Ca2+ imaging (Figure 2 and Figure 5) and morphological analyses (e.g., Figure 6) have extended the major conclusions onto mature neurons where dysregulations of Ca2+ oscillations are also tightly coupled with neuronal health or death/damage. Importantly, GCaMP-X paves the way to unexplored directions previously impeded or discouraged due to GCaMP perturbations, e.g., chronic imaging of cultured neurons to concurrently monitor Ca2+ activities and cell morphology as in this study.

      3. Circumventing the toxicity of GCaMP is not a trivial procedure for viral infection. The expression levels need to be carefully adjusted experimentally, e.g., by dilution studies (Resendez 2016 PMID: 26914316). A delicate balance of GCaMP expression is critical: a low level (or short time) of expression would result in weak signals and poor SNR, whereas a high level (or long time) of expression would cause nuclear filling and neural toxicity. Even under the work-around conditions of time window and dilution dosage, nucleus-filled neurons are not uncommon, judging by the expression/fluorescence patterns, e.g., in the original reports of GCaMP6 (Supplementary Figure 7, Chen 2013 PMID: 23868258) and GCaMP3 (Supplementary Figure 11, Tian 2009 PMID: 19898485). Under particular conditions (subtypes of neurons, time window of imaging, dosage of virus injection, etc.), many neurons could be found without apparent perturbation/nuclear filling to proceed with calcium imaging. Using GCaMP-X, dosage is less restricted (10-fold higher concentration for GCaMP-X with improved SNR and overall performance in Figure 2, Figure 5 and Figure 6). Practically, GCaMP-X is a simple solution for the issues related to excessive/prolonged expression. Also, GCaMP-X is expected to help maintain the total number of healthy neurons and thus the general health of the brain. Reportedly, some GCaMP lines of transgenic mice exhibit epileptic activities (Steinmetz 2017 PMID: 28932809), awaiting future studies to explore whether GCaMP-X could help.

      4. As the reviewer pointed out, the key of GCaMP-X is to resolve the unwanted (apo)GCaMP binding to endogenous proteins in neurons. We agree with the reviewer that according to the empirical observations the following factors appear to increase the severity of GCaMP perturbations: prolonged time, high concentration and nuclear accumulation. GCaMP-X is able to protect GCaMP from unwanted binding and the consequent damage to neurons, validated by various tests thus far (in vitro and in vivo). In this context, the prolonged time would result in higher GCaMP concentration, meanwhile accumulating the effects due to GCaMP interactions; higher GCaMP concentration would interfere with more binding events and targets of endogenous CaM; and enhanced/prolonged expression of GCaMP is directly correlated with nuclear accumulation, a hallmark of neuronal damage.

      Secondly, the authors extend their claims to conclude that GCaMP indicators are toxic under other circumstances, claims supported by neither their results nor the literature. To provide one example, at the end of the introduction is the statement, 'chronic GCaMP-X imaging has been successfully implemented in vitro and in vivo, featured with long-term overexpression (free of CaM-interference), high spatiotemporal contents (multiple weeks and intact neuronal network) and subcellular resolution (cytosolic versus nuclear), all of which are nearly infeasible if using conventional GCaMP.' The statement's inaccurate: there are many chronic imaging studies in vitro and in vivo using GCaMP indicators without nuclear accumulation of GCaMP or perturbed sensory responses. There are more examples throughout the paper where the conclusions overreach the results and are inaccurate. The results are simply insufficient to support many of the strong statements in the paper.

      Overall, the criticisms and suggestions of the reviewer have been well taken and we have revised the text accordingly. For the particular paragraph mentioned by the reviewer, we want to clarify that it was a summary of the results in the whole manuscript, where each claim referred to the data and analyses shown in the corresponding figures. In detail, these figures were: 'free of CaM-interference' (Figure 1), 'multiple weeks and intact neuronal network' (in vitro: Figure 3 and Figure 4; in vivo: Figure 2, Figure 5 and Figure 6; transgenic neurons: Figure 7) and 'cytosolic versus nuclear' (Figure 1 and the previous Figure 8). The last clause, 'all of which are nearly infeasible if using conventional GCaMP', was meant to summarize the results comparing GCaMP versus GCaMP-X in our experimental settings of chronic imaging with prolonged/excessive probe expression. Again, we agree that for particular experimental settings and purposes the toxicity of GCaMP can be circumvented empirically. To avoid miscommunication, we have revised this paragraph by moving it to the Discussion (after all the data), also ensuring that the statements on GCaMP are backed up with data or literature. Please also see Essential Revisions, Item 3.

      Reviewer #2 (Public Review):

      Geng and colleagues provide further evidence for the lower neuronal toxicity of their improved GECI, GCaMP-X, which allows improved recordings of Ca2+ signals in neurons. As reported previously and studied in more detail here, the improved properties are primarily due to a lower tendency of GCaMP-Xc (reporting cytosolic Ca2+) to enter the nucleus. They present a systematic comparison of their cytosolic or nucleus-targeted GCaMP-Xc (and Xn) with the corresponding "conventional" GCaMPs (jGCaMP7b, GCaMP6m). They, again, confirm the absence of apoGCaMP-X binding to the CaM binding domain of Cav1.3 L-type Ca2+ channels, suggesting that this is the main or one of several GCaMP interactions leading to altered intracellular signaling affecting neuronal survival, development and architecture. Evidence for more (likely) physiological Ca2+ responses was obtained from a battery of experiments, including in vivo recordings of acute sensory responses after viral expression of GCaMPs, monitoring of long-term calcium oscillations in cultured neurons, correlations of measured Ca2+ oscillations with hallmarks of neuronal development (soma size, neurite outgrowth/arborizations), and long-term recordings of spontaneous Ca2+ activities in vivo in S1 primary somatosensory cortex. The latter experiments also showed that much higher doses of AAV-GCaMP6m-Xc could be administered than of GCaMP6m. They also show unfavorable effects of GCaMPs on neurons of adult GCaMP-expressing transgenic mice, both in slices and in cultured neurons. While most experiments aim at demonstrating improved performance of GCaMP-X, one finding also provides potential novel insight into the role of neuronal activity patterns during neuronal development in culture. Assuming more undisturbed physiological Ca2+ signaling even through longer time periods, they can follow different Ca2+ activity patterns during neuronal development. Oscillation amplitudes and the level of synchrony correlated with neurite length, and frequency inversely correlated with neurite outgrowth.

      They provide convincing experimental evidence for the improvements claimed for their novel GCaMP-X constructs. Some aspects should be clarified.

      A key finding explaining the construct differences is the nuclear localization. The authors should also provide numbers for the N/C ratio for Ca2+ imaging of sensory-evoked responses in vivo (Fig. 2; pg 6: nuclear accumulation was barely noticeable from GCaMP6m-Xc even beyond OTW). Also, for chronic experiments in brain slices they state for GCaMP6m-Xc in the text that (pg 12) "meanwhile the N/C ratio remained ultra-low", yet Fig. 6 shows a N/C ratio of 0.2. This does not appear to be "ultra low".

      We thank the reviewer for bringing up the matter of the N/C ratio (indicative of nuclear accumulation). We have appended the values of the N/C ratio for in vivo experiments (revised Figure 2). Following the previous report, the criterion for the N/C ratio was set to 0.8 to regroup the neurons into two subpopulations. A significant fraction of GCaMP neurons were nucleus-filled (N/C ratio>0.8); meanwhile, nearly no neuron expressing GCaMP-XC was found with an N/C ratio greater than 0.8 when examined 8-13 weeks post injection. Generally, due to imaging resolution, confocal microscopy provided a more precise evaluation of the N/C ratio than two-photon in vivo images. In Figure 6, an even clearer difference in nuclear distribution was observed between GCaMP and GCaMP-X, which was described as “ultralow” (GCaMP-X). Of note, the N/C ratio of YFP itself was ~1.3. The N/C ratio for GCaMP-XC was not close to zero, consistent with the measurements from other NES-tagged peptides (Yang 2022 PMID: 35589958). GCaMP-XC was not completely excluded from cell nuclei, thus producing some fluorescence there. In light of this comment, we have revised the relevant text including the phrase “ultralow” (Page 14, Line 393). In addition, Figure 5 was also revised accordingly.
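      As a minimal illustration of the grouping described above, the sketch below splits neurons into nucleus-filled and nucleus-excluded groups using the 0.8 N/C criterion quoted in the response. The function names and the example intensities are hypothetical and are not taken from the authors' analysis pipeline; how nuclear and cytosolic fluorescence are actually measured (ROI definition, background subtraction) is not specified here and is assumed to have been done beforehand.

```python
NC_THRESHOLD = 0.8  # N/C cut-off quoted in the response


def nc_ratio(nuclear_f, cytosolic_f):
    """Nuclear-to-cytosolic fluorescence ratio for one neuron."""
    if cytosolic_f <= 0:
        raise ValueError("cytosolic fluorescence must be positive")
    return nuclear_f / cytosolic_f


def split_by_nc(neurons, threshold=NC_THRESHOLD):
    """Split (nuclear, cytosolic) intensity pairs into two lists of N/C ratios.

    Returns (nucleus_filled, nucleus_excluded); a neuron counts as
    nucleus-filled when its ratio is strictly greater than the threshold.
    """
    filled, excluded = [], []
    for nuclear_f, cytosolic_f in neurons:
        ratio = nc_ratio(nuclear_f, cytosolic_f)
        (filled if ratio > threshold else excluded).append(ratio)
    return filled, excluded


# Hypothetical (nuclear, cytosolic) intensities in arbitrary units.
example_neurons = [(20.0, 100.0), (90.0, 100.0), (130.0, 100.0)]
filled, excluded = split_by_nc(example_neurons)
```

      With these made-up values, one neuron (ratio 0.2) falls in the nucleus-excluded group and two fall in the nucleus-filled group.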

      Along these lines, since nuclear-filled neurons were observed in their experiments with GCaMP-Xc, the authors should comment if altered Ca2+ signals were also seen for the few neurons expressing GCaMP-Xc in the nucleus.

      During 2-photon imaging experiments in vivo, GCaMP-XC neurons occasionally appeared to have some level of nuclear expression, especially in blurred images of low quality. Judged by the N/C ratio criterion (0.8), these neurons rarely fell into the nucleus-filled group (Figure 2B and Figure 5C, also see confocal imaging Figure 1B). On the other hand, a small fraction of GCaMP-XC could be “leaked” into the nucleus. GCaMP-XN also eliminated toxic (apo)GCaMP interactions in neurons, sharing the same design principle with GCaMP-XC (Figure 1). Therefore, nuclear GCaMP-XC is expected to resemble GCaMP-XN. Experimentally, with GCaMP-XC or GCaMP-XN present in the nucleus, no significant change in neuronal Ca2+ or neurite morphology has been observed. Meanwhile, this comment has pointed out one important direction of future research, i.e., to more precisely confine GCaMP-X within the targeted organelles, e.g., by improving or replacing localization tags.

      Since they performed a systematic comparison of two constructs to demonstrate an (expected) superiority of one of them, the experiments, or at least the analysis, should ideally be performed in a blinded way. The authors should clarify how they avoided experimental bias.

      For in vitro experiments, multiple independent trials of experiments with analyses were performed by two (or more) researchers to ensure the reproducibility and to minimize any bias. And the results and conclusions have been highly consistent (among different trials/researchers). Following the suggestion, we have assured that in vivo experiments and data analyses were separately conducted by the researchers from two different labs. For long-term expression/imaging, the differences between GCaMP-X and GCaMP were often discernable directly in the images even without further calculations or statistics (e.g., Figure 3B). Related information can be found in the Methods (Page 32, Line 799).

      In their chronic Ca2+ fluorescence imaging for autonomous Ca2+ oscillations in cultured cortical neurons ultralong lasting signals (Fig. 3B, DIV 17, GCaMP6m) could be observed. It would be helpful to further describe the nature of these transients, ideally by adding it to their video collection.

      As suggested by the reviewer, the video for Figure 3B (DIV 17, GCaMP6m) has been included in this revision (Figure 3—video supplement 2). In contrast to the oscillatory signals normally observed from healthy neurons, the pronounced and sustained Ca2+ signals are associated with apoptosis and other pathological conditions in neurons (Khan 2020 PMID: 32989314; Nicotera 1998 PMID: 9601613; Harr 2010 PMID: 20826549). The Ca2+ wave with broadened width (FWHM) was indicative of neurons damaged by GCaMP (Figure 3F), rather than of (altered) sensing characteristics of GCaMP. We agree that this observation is a notable and interesting phenomenon, worth following up in future studies.
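      For readers unfamiliar with the width measure mentioned above, the following sketch shows one common way to compute the full width at half maximum (FWHM) of a single-peaked transient by linear interpolation at the two half-maximum crossings. It is an illustrative assumption, not the authors' analysis code: the function name, the zero default baseline, and the toy trace are all hypothetical.

```python
def fwhm(t, y, baseline=0.0):
    """Full width at half maximum of a single-peaked transient.

    t and y are equal-length sequences (time, signal). The half-maximum
    level is measured relative to `baseline`, and the crossing times are
    found by linear interpolation between neighbouring samples.
    """
    if len(t) != len(y) or len(y) < 3:
        raise ValueError("need at least three paired samples")
    peak_i = max(range(len(y)), key=lambda k: y[k])
    half = baseline + (y[peak_i] - baseline) / 2.0

    def crossing(i0, i1):
        # Time at which the straight line through samples i0 and i1 equals `half`.
        return t[i0] + (half - y[i0]) * (t[i1] - t[i0]) / (y[i1] - y[i0])

    # Walk left from the peak to the first sample below the half level.
    i = peak_i
    while i > 0 and y[i] >= half:
        i -= 1
    # Walk right from the peak to the first sample below the half level.
    j = peak_i
    while j < len(y) - 1 and y[j] >= half:
        j += 1
    if y[i] >= half or y[j] >= half:
        raise ValueError("trace does not fall below half maximum on both sides")
    return crossing(j - 1, j) - crossing(i, i + 1)


# Toy triangular transient (arbitrary units): peak 2 at t = 2.
width = fwhm([0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 1.0, 0.0])
```

      A broadened, sustained transient of the kind described for damaged neurons would give a larger value than a brief oscillatory event of the same amplitude.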

      The discussion is very long. In my opinion it would benefit from shortening, avoiding redundancies and focusing only on the key findings in this paper. This includes the chapter on design and application guidelines for CaM-based GECIs. The main message about the advantage of their GCaMP-X modifications has already been made earlier in the discussion. A more detailed discussion of this appears more suitable for a review article.

      In response to this suggestion, we have made it as concise as possible, by simplifying or removing several topics including the design and application guidelines for CaM-based GECIs.

      It may be worthwhile to include another aspect in the discussion: does the improved GCaMP-Xc cause no change in neuronal function or morphology, or is it just less damaging than other GCaMPs? How can this issue be addressed experimentally?

      We have revised the discussion accordingly (Page 21, Line 588). We agree that additional experiments would help evaluate how close GCaMP-X data are to reality, considering the Ca2+-buffering effect intrinsic to Ca2+ probes and also other factors. In light of this suggestion and also those from Reviewer #1, we have incorporated more experimental controls, including Ai140 mice (GFP, Figure 7—figure supplement 2) and Fluo-4 AM (Ca2+ dye, Figure 3—figure supplement 4). The results have been encouraging in that GCaMP-X neurons were nearly indistinguishable in morphological and functional aspects from GFP or Fluo-4 AM controls. Incoming feedback from GCaMP-X users should continue to help clarify this matter, which we would like to follow up.

    1. Author Response

      Reviewer #1 (Public Review):

      This study uses the mouse calyx of Held synapse as a model to explore the presynaptic role of rac1, a regulator of actin signaling in the brain. Many of the now-classical methods and theory pioneered by Neher and colleagues are brought to bear on this problem. Additionally, the authors were able to make a cell-specific knockout of rac1 by developing a novel viral construct to express cre in the globular bushy cells of the cochlear nucleus; by doing this in a rac1 floxed mouse, they were able to KO rac1 in these neurons starting at around P14. The authors found that KO of rac1 enhanced EPSC amplitude, vesicle release probability, quantal release rates, EPSC onset time and jitter during high-frequency activity, and fast recovery rates from depression. Because the calyx synapses are the largest and most reliable of central nerve terminals, all these various effects had no effect on suprathreshold transmission during 'in vivo-like' stimulus protocols. Moreover, there was no effect morphologically on the synapse. Through some unavoidably serpentine reasoning, the authors suggest that loss of rac1 affects the so-called molecular priming of vesicles, possibly due to a restructuring of actin barriers at the active zone. The experimental analysis is at a very high level, and the work is definitely an important contribution to the field of presynaptic physiology and biophysics. It will be important to test the effects of the KO on other synapses that are not such high-performers as the calyx, and this direction might reveal significant effects on information processing by altered rac1 expression.

      We thank the reviewer for their comments and view that our work is an important contribution to the field of presynaptic physiology and biophysics.

      Major points:

      1) The measurement of onset delay was used to test whether rac1-/- affects positional priming. While there is a clear effect of the KO on the latency to EPSC onset, there is no singular interpretation one can take, due to the ambiguity of the 'onset delay'. Note that in the Results authors state Lines 201-203: "The time between presynaptic AP and EPSC onset (EPSC onset delay) is determined by the distance between SVs and VGCC which defines the time it takes for Ca2+ to bind to the Ca2+ sensor and trigger SV release (Fedchyshyn and Wang, 2007)." However, in Methods "The duration between stimulus and EPSC onset was defined as EPSC onset delay." Thus the 'onset' measured is not between presynaptic spike and EPSC but from axonal stimulus and EPSC. KO of rac might also affect spike generation, spike conduction, calcium channel function, etc. Indeed some additional options are offered in the Discussion. Since the change in onset is ~100usec at most, a number of small factors all could contribute here. Moreover, the authors conclude that the KO does NOT affect positional priming since they would have expected the onset to shorten, given the other enhancements observed in earlier sections.

      It seems to me that all the authors can really conclude is that the onset shifted and they do not know why. If onset is driven by multiple factors, and differentially affected in the KO, then all bets are off. Thus, data in this section might be removed, or at least the authors could further qualify their interpretations given this ambiguity.

      We have further qualified and clarified our interpretations of the EPSC onset measurement by adding text to the Discussion (see lines 475-491). We would like to emphasize that we do not see a statistically significant change in EPSC1 onset delay or in EPSC onset delays during 50 Hz train stimuli between the Rac1+/+ and Rac1−/− synapses, but rather an activity-dependent increase in EPSC onset delays in Rac1−/− synapses during 500 Hz stimulation. Based on these data, it is less likely that changes in spike generation, spike conduction, or calcium channel function are responsible for the change in EPSC onset delay. If SVs were closer to CaV2.1 channels, we would expect a shorter initial EPSC onset delay or shorter EPSC onset delays during 50 Hz stimulation. However, changes in spike generation, spike conduction, or calcium channel function could contribute to the increase in the EPSC onset delay at 500 Hz. Finally, it is important to note that EPSC onset delay increases during 50 Hz and 500 Hz stimulation in Rac1+/+ synapses, indicating an activity-dependent regulation. However, this activity-dependent increase was pronounced in Rac1−/− synapses during both 50 Hz and 500 Hz stimulation (Fig 4B1-B3).

      2) If the idea is that the loss of Rac1 leads to a reduced actin barrier at the active zone, is there an ultrastructural way to visualize this, labeling for actin for example? Authors conclude that new techniques are needed, but perhaps this is 'just' an EM question.

      We are not aware of a method for ultrastructural visualization of actin and SV distributions relative to the plasma membrane. Doing so requires specific labeling and detection of actin filaments while visualizing SVs using EM. While EM on samples prepared by high-pressure freezing with freeze substitution allows for detection of filamentous structures near the AZ, the molecular identity of these filamentous structures would remain uncertain. Super-resolution microscopy is amenable to immunohistochemical techniques to label actin, but visualizing SVs in 3D using super-resolution is a major technical challenge. Furthermore, changes in SV docking on the scale of 1-2 nanometers are correlated with severe changes in SV release; we would therefore need to quantify structural changes at this level of resolution. Currently, we are not aware of any study or report that has analyzed SV docking or reported changes on the scale of 1-2 nm using super-resolution light microscopy. It might be possible to use expansion microscopy to achieve such resolution, but the respective protocols would need to be established for the calyx synapse. In addition, it is proposed that the regulation of actin filaments is transient and happens on very fast time scales, which complicates their investigation by conventional methods (O'Neil et al., 2021). Thus, even if we were able to overcome all these technical hurdles and label actin, we could quite possibly still miss potential differences. Therefore, while we agree that this type of ultrastructural data would substantially strengthen our hypothesis, developing the techniques and protocols needed to perform these experiments would likely require many months if not years.

      3) Authors use 1 mM kynurenic acid in the bath to avoid postsynaptic receptor saturation. But since this is a competitive antagonist and since the KO shows a large increase in release, could saturation or desensitization have been enhanced in the KO? This would affect the interpretation of recovery rates in the KO, which are quite fast.

      We agree with the reviewer that differences in saturation or desensitization could potentially impact the measured recovery time course in Rac1−/−. However, we think this is unlikely for the following reasons: Desensitization and saturation of synaptic AMPARs are strongly reduced during calyx synapse maturation (Taschenberger et al., 2002; Taschenberger et al., 2005). We recorded from >P28 calyx synapses, which exhibit a claw-like, fenestrated terminal morphology offering many diffusional exits for released glutamate; this is expected to speed up transmitter clearance and therefore reduce postsynaptic effects (Taschenberger et al., 2005; Yang et al., 2021). We used 1 mM kynurenic acid in the external bath solution, which resulted in a ~90% reduction in EPSC amplitude in both Rac1+/+ and Rac1−/−, comparable to previous reports (e.g. Lipstein et al., 2021). In our study, we performed all experiments in 1.2 mM Ca2+ and at body temperature, which further reduces EPSC amplitudes and minimizes potential receptor saturation and desensitization compared to 2 mM Ca2+ at room temperature. Time constants of recovery from desensitization at the calyx are between 30 ms at P14-P16 (Joshi et al., 2004) and 16 ms at P21 (Koike-Tani et al., 2008), both measured at room temperature. It is conceivable that the recovery from desensitization at P30 and at physiological temperature will be significantly shorter. Since we observed the largest effect on recovery between 1 and 4 seconds, this time scale is at least two orders of magnitude slower than recovery from desensitization could plausibly account for. Finally, our numerical simulations are consistent with the possibility that the faster recovery rates observed in Rac1−/− are a direct consequence of changes in SV priming. This faster pool replenishment likely also enabled increased steady-state EPSC amplitudes at 50 Hz in Rac1−/− synapses. The fact that we were able to measure enhanced steady-state release in Rac1−/− argues against steady-state EPSC amplitudes being limited by AMPAR desensitization.
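
      The time-scale argument can be illustrated with a short calculation. This is only a sketch under an assumption not stated in the response: that recovery from desensitization follows a single exponential, so the fraction of receptors still desensitized after an interval t is exp(-t/tau). The function name is hypothetical; the tau values (30 ms and 16 ms, both measured at room temperature) and the 1-4 s window are taken from the text above.

```python
import math


def residual_desensitization(t_ms, tau_ms):
    """Fraction of receptors still desensitized after t_ms.

    Assumes mono-exponential recovery with time constant tau_ms
    (an illustrative assumption, not a result from the study).
    """
    return math.exp(-t_ms / tau_ms)


# Time constants quoted in the response (room temperature).
for tau_ms in (30.0, 16.0):
    # Recovery intervals at which the largest effect was observed.
    for t_ms in (1000.0, 4000.0):
        frac = residual_desensitization(t_ms, tau_ms)
        print(f"tau = {tau_ms:4.0f} ms, t = {t_ms:6.0f} ms -> residual = {frac:.3g}")
```

      Under this assumption, the residual desensitization is negligible well before the 1 s interval, even with the slower room-temperature time constant.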

      Reviewer #2 (Public Review):

      The aim of the study is an improved understanding of the role of the RhoGTPase Rac1 in neurotransmitter release beyond the known roles in synaptogenesis and postsynaptic function. To this end, Rac1 is ablated at P12 (when synapse development has largely progressed to maturation) and transmission is studied at the adult stage (P28 onwards). The study reports a number of interesting findings, in particular, a large increase in synaptic strength, which is interpreted as an '... increased release probability, which results in faster SV replenishment'. It is not clear whether this statement is supposed to suggest a causal relationship or just a correlation between the two parameters. By and large, the discussion of results is somewhat fuzzy with respect to the distinction between release itself (as characterized by release probability) and priming steps, which precede release.

      Besides, the authors present valuable data on Rac1-dependent timing and synchronicity of neurotransmitter release, which point towards a role of Rac1 in 'positional priming', i. e. the proper localization of synaptic vesicles relative to Ca-channels.

      We thank the reviewer for pointing out that our study presents valuable data on Rac1-dependent timing and synchronicity of neurotransmitter release.

    1. Author Response

      Reviewer #1 (Public Review):

      Redox signaling is a dynamic and concerted orchestra of inter-connected cellular pathways. There is always a debate whether ROS (reactive oxygen species) could be a friend or foe. Continued research is needed to dissect out how ROS generation and progression could diverge in physiological versus pathophysiological states. Similarly, there are several paradoxical studies (both animal and human) wherein exercise health benefits were reported to be accompanied by increases in ROS generation. It is in this context, that the present manuscript deserves attention.

      Utilizing the in-vitro studies as well as mice model work, this manuscript illustrates the different regulatory mechanisms of exercise and antioxidant intervention on redox balance and blood glucose level in diabetes. The manuscript does have some limitations and might need additional experiments and explanation.

      The authors should consider addressing the following comments with additional experiments.

      1) Although hepatic AMPK activation appears to be a central signaling element for the benefits of moderate exercise and glucose control, additional signals (on hepatic tissue) related to hepatic gluconeogenesis such as Forkhead box O1 (FoxO1), phosphoenolpyruvate carboxykinase (PEPCK), and GLUT2 needs to be profiled to present a holistic approach. Authors should consider this and revise the manuscript.

      We appreciate the constructive suggestion. Besides glycolysis, gluconeogenesis and glucose uptake are critical in maintaining liver and blood glucose homeostasis.

      FoxO1 has been tightly linked with hepatic gluconeogenesis through promoting the transcription of the gluconeogenesis-related genes PEPCK and G6Pase (1, 2). Herein, we found that the expression of FoxO1 increased in the diabetic group but was reduced in the CE, IE and EE groups (Fig. X1A, Fig.5E-F in manuscript). Meanwhile, the mRNA levels of Pepck and G6PC (one of the three G6Pase catalytic-subunit-encoding genes) also decreased in the CE, IE, and EE groups (Fig. X1B-1C, Fig.5H-I in manuscript). These results indicate that all three modes of exercise inhibited gluconeogenesis by down-regulating FoxO1.

      For glucose uptake, we measured the protein expression of GLUT2 in liver tissue. GLUT2 helps hepatocytes take up glucose for glycolysis and glycogenesis. We found that GLUT2, a glucose sensor in the liver, was up-regulated in diabetic rats but down-regulated by the CE and IE interventions. However, GLUT2 did not decrease in the EE group, which is consistent with the lack of improvement in blood glucose with the EE intervention (Figure X1A, Fig.5E and 5G in manuscript).

      Taken together, moderate exercise could benefit glucose control by increasing glycolysis and decreasing gluconeogenesis. We have added this part on Page 9, lines 251-263, and in Figure 5E-5I of this version.

      Figure X1. A. Representative protein level and quantitative analysis of FOXO1 (82 kDa), GLUT2 (60-70 kDa) and Actin (45 kDa) in the rats in the Ctl, T2D, T2D + CE, T2D + IE and T2D + EE groups. C-D. Expression of hepatic Pepck and G6PC mRNA in the Ctl, T2D, T2D + CE, T2D + IE and T2D + EE groups were evaluated by real-time PCR analysis. Values represent mean ratios of Pepck and G6PC transcripts normalized to GAPDH transcript levels.

      2) Very recently sestrin2 signaling is assumed significant attention in relation to exercise and antioxidant responses. Therefore, authors should profile the sestrin2 levels as it is linked to several targets such as mTOR, AMPK and Sirt1. Additionally, the levels of Nrf2 should be reported as this is the central regulator of the threshold mechanisms of oxidative stress and ROS generation.

      We appreciate the reviewer’s expert comments. Nrf2 is an important mediator of antioxidant signaling, playing a fundamental role in maintaining the redox homeostasis of the cell. Under unstressed conditions, Nrf2 activity is suppressed by its innate repressor Kelch-like ECH-associated protein 1 (Keap1) (3). As ROS levels increase during the development of diabetes, Nrf2 is activated to induce the transcription of several antioxidant enzymes (4, 5).

      Nrf2 expression has been reported to increase in HFD mice and in diabetic patients (6, 7). In vitro studies have found that Nrf2 activation is achieved with acute exposure to high glucose, whereas longer incubation times or oscillating glucose concentrations failed to activate Nrf2 (8, 9). These findings suggest that the increase of ROS in diabetes can cause compensatory upregulation of Nrf2. In our study, we found that Nrf2 increased in diabetic rats, which can further initiate the expression of antioxidant enzymes. As shown in Fig.X2A (Fig.2H-2K in manuscript), Grx and Trx, which are involved in thioredoxin metabolism, were up-regulated along with Nrf2. After CE intervention, the level of Nrf2 increased further (Fig.2E-2F), suggesting that CE intervention could activate the antioxidant system to achieve a high-level redox balance. We have added these new results to Figure 2.

      On the other hand, the expression levels of Sestrin2 and Nrf2 decreased after antioxidant supplementation. Our results suggest that the antioxidant treatment improved diabetes by lowering ROS levels to achieve a low-level redox balance, whereas moderate exercise enhanced ROS tolerance to achieve a high-level balance (Fig.X2D-F, Fig.3E-3G in manuscript).

      We have added the new data on Page 5, lines 147-153, and Page 7, lines 183-186, and in Figures 2-3 of the current version.

      Figure X2. A-C. Representative protein level and quantitative analysis of Nrf2 (97 kDa), Sestrin2 (57 kDa) and Actin (45 kDa) in the rats in the Ctl, T2D and T2D + CE groups. D-F. Representative protein level and quantitative analysis of Nrf2 (97 kDa), Sestrin2 (57 kDa) and HSP90 (90 kDa) in the rats in the Ctl, T2D and T2D + APO groups.

      3) Authors should discuss the exercise-associated hormesis curve. They should discuss whether moderate exercise could decrease the sensitivity to oxidative stress by altering the bell-shaped dose-response curve.

      We thank the reviewer for the valuable comments. In the literature, Radak et al. proposed a bell-shaped dose-response curve relating normal physiological function to the level of ROS in healthy individuals, and suggested that moderate exercise can extend or stretch the levels of ROS while increasing physiological function (10). Our results support this hypothesis and further suggest that moderate exercise produces ROS while increasing antioxidant enzyme activity to maintain a high-level redox balance, consistent with the bell-shaped curve, whereas excessive exercise generates a higher level of ROS, leading to reduced physiological function. In this study, we found that the state of diabetic individuals is better described by an S-shaped curve, due to the high level of oxidative stress and decreased reduction level in diabetic individuals (Fig.8B). With the increase of ROS, the physiological function of diabetic individuals gradually decreases and enters a state of redox imbalance. Moderate exercise shifts the S-shaped curve into a bell-shaped dose-response curve, thus reducing the sensitivity to oxidative stress in diabetic individuals and restoring redox homeostasis. However, with excessive exercise, ROS production increases beyond the threshold range of redox balance, resulting in decreased physiological function (Fig.8B, see the decreasing portion of the bell curve to the right of the apex).

      In contrast, the antioxidant intervention increased physiological activity by reducing ROS levels in diabetic individuals, restoring a bell-shaped dose-response curve at a low level of ROS (Fig.8B). Redox balance could therefore be achieved either at a low level of ROS mediated by antioxidant intervention or at a high level of ROS mediated by moderate exercise, both of which were regulated by AMPK activation. Both high and low levels of redox balance can thus lead to high physiological function as long as they lie within the redox balance threshold range. The activation of AMPK is therefore an important sign that exercise or antioxidant intervention has achieved a dynamic redox balance that helps restore physiological function. Accordingly, we speculate that antioxidant intervention combined with moderate exercise might offset the effect of exercise, but that antioxidants could be beneficial during excessive exercise. A human study also supports the idea that supplementation with antioxidants may preclude the health-promoting effects of exercise (11). Therefore, personalized intervention with respect to redox balance will be crucial for the effective treatment of diabetes patients.

      We have added this part to the Discussion in this version (Pages 13-14, lines 389-418).

      4) It would not be ideal to single-out AMPK as a sole biomarker in this manuscript. Instead, authors should consider AMPK activation and associated signaling in relation to redox balance. This should also be presented in Fig 7.

      We thank the reviewer for the critical comments. Accordingly, we have discussed AMPK signaling in the Discussion (Page 13, lines 373-384) and added the AMPK signaling pathway to Fig.8A.

      References:

      1. R. A. Haeusler, K. H. Kaestner, D. Accili, FoxOs function synergistically to promote glucose production. J Biol Chem 285, 35245-35248 (2010).
      2. J. Nakae, T. Kitamura, D. L. Silver, D. Accili, The forkhead transcription factor Foxo1 (Fkhr) confers insulin sensitivity onto glucose-6-phosphatase expression. J Clin Invest 108, 1359-1367 (2001).
      3. M. McMahon, K. Itoh, M. Yamamoto, J. D. Hayes, Keap1-dependent proteasomal degradation of transcription factor Nrf2 contributes to the negative regulation of antioxidant response element-driven gene expression. J Biol Chem 278, 21592-21600 (2003).
      4. R. S. Arnold et al., Hydrogen peroxide mediates the cell growth and transformation caused by the mitogenic oxidase Nox1. Proc Natl Acad Sci U S A 98, 5550-5555 (2001).
      5. J. M. Lee, M. J. Calkins, K. Chan, Y. W. Kan, J. A. Johnson, Identification of the NF-E2-related factor-2-dependent genes conferring protection against oxidative stress in primary cortical astrocytes using oligonucleotide microarray analysis. J Biol Chem 278, 12029-12038 (2003).
      6. T. Jiang et al., The protective role of Nrf2 in streptozotocin-induced diabetic nephropathy. Diabetes 59, 850-860 (2010).
      7. X. H. Wang et al., High Fat Diet-Induced Hepatic 18-Carbon Fatty Acids Accumulation Up-Regulates CYP2A5/CYP2A6 via NF-E2-Related Factor 2. Front Pharmacol 8, 233 (2017).
      8. T. S. Liu et al., Oscillating high glucose enhances oxidative stress and apoptosis in human coronary artery endothelial cells. J Endocrinol Invest 37, 645-651 (2014).
      9. Z. Ungvari et al., Adaptive induction of NF-E2-related factor-2-driven antioxidant genes in endothelial cells in response to hyperglycemia. Am J Physiol Heart Circ Physiol 300, H1133-1140 (2011).
      10. Z. Radak et al., Exercise, oxidants, and antioxidants change the shape of the bell-shaped hormesis curve. Redox Biol 12, 285-290 (2017).
      11. M. Ristow et al., Antioxidants prevent health-promoting effects of physical exercise in humans. Proc Natl Acad Sci U S A 106, 8665-8670 (2009).
    1. Author Response

      Reviewer #2 (Public Review):

      Klein et al. have developed a high-throughput tracker to evaluate operant conditioning in Drosophila larvae. Employing this device, they train larvae to prefer bending towards one specific side (left or right), by using as unconditioned stimulus (US) the optogenetic activation of dopaminergic and serotoninergic neurons, demonstrating that larvae are able to perform this behaviour. Furthermore, they show that serotoninergic neurons alone are sufficient to mediate the reward signal, and that specifically serotoninergic neurons in the VNC are required for this behaviour. However, they do not show whether serotoninergic VNC neurons are sufficient. The results are interesting and novel. Operant conditioning had been shown for Drosophila adult. Furthermore, the existence of VNC circuits sufficient for operant conditioning had been shown for other species, as the authors point out in the discussion. Nonetheless, the genetic dissection to identify serotonine expressing neurons as mediators of operant conditioning in the Drosophila larva, and the identification of VNC serotonine cells as necessary are new. Furthermore, given the experimental advantages of the Drosophila larva, including genetic accessibility and a full connectome, the findings open the door to future research into the circuit mechanisms of operant conditioning. I have some comments that I think would be important to address.

      The high-throughput tracker is impressive. However, there is no sufficient documentation to ensure that an expert would be able to easily reproduce it. All of the hardware assembly files, the list of materials, as well as the electronic circuit maps and all of the required software needs to be appropriately documented and uploaded onto a public repository. This is a basic requirement when publishing new hardware/software, particularly in an open journal such as eLife.

      We have now included all the documentation and CAD files for the high-throughput tracker. The software is publicly available in the following GitHub repository (https://github.com/ZlaticLab/multi-larva-tracker-scripts-public). The CAD files are available in the Supplementary materials of the paper.

      • The differences observed in the results of operant conditioning are very subtle (see for example figure 3c), which means that it is extremely important that statistic analyses are correctly made. The sample number (n) for these experiments is really high (n>100) and for what I understood is not equivalent to the number of animals, because the same animal can generate n >1, eg. n = 2 or n =3 if it collides one or two times, as each time it collides a new identity is given to the larvae. This means that the datapoints collected are not independent, and I think in that case a Wilcoxon rank-sum test is not the appropriate test to take. I recommend the authors and eLife editors to consult with an expert in this type of statistics. Alternatively, the authors could, for each experiment, take into account only the data from larvae that did not collide, and for those that collide only take into account the data before the collision. This can be calculated easily as they just need to exclude from their analysis in each experiment all of the larval IDs where the ID is larger than the initial number of larvae identified by the software.

      We apologise if we did not clarify sufficiently that we only took into account (for each time bin) larvae that did not collide. Within the Materials and methods, we describe how objects retained for analysis had to satisfy several criteria. The first criterion is that the object needed to be detected in every frame of the given 60 s bin. In this way, the object identity is stable throughout the bin - a reflection that the object did not collide with another object. In other words, within a single time bin, the same animal only contributes once. Text has been added to the Materials and methods to clarify that this first criterion is selecting for larvae that did not collide.

      The reviewer mentions that the Wilcoxon rank-sum test is not the appropriate nonparametric test for dependent samples. We agree. Accordingly, the test used for within-bin comparisons was the Wilcoxon signed-rank test, which is also nonparametric but is designed for dependent samples. We therefore believe that there is no need to reconsider the statistical tests used.
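
      The first retention criterion described above (an object must be detected in every frame of a given bin) can be sketched as follows. This is a minimal illustration, not the tracker's actual code: the function name, the data layout (a mapping from object ID to the frames in which it was detected), and the toy frame counts are all assumptions made for the example.

```python
def objects_present_throughout(detections, bin_start, bin_end):
    """Return IDs of objects detected in every frame of [bin_start, bin_end).

    `detections` maps a (hypothetical) object ID to the frame indices in
    which that object was detected. An object that loses its identity
    mid-bin, for example after a collision, fails this criterion.
    """
    required = set(range(bin_start, bin_end))
    return sorted(
        obj_id
        for obj_id, frames in detections.items()
        if required <= set(frames)
    )


# Toy example with a 10-frame bin (real bins are 60 s long).
detections = {
    1: range(0, 10),  # tracked through the whole bin
    2: range(0, 6),   # identity lost mid-bin, e.g. at a collision
    3: range(6, 10),  # new identity assigned after the collision
}
kept = objects_present_throughout(detections, 0, 10)
print(kept)
```

      In this toy case only object 1 is retained, so a larva that collides and is re-identified within a bin contributes no data points to that bin.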

      -The finding that serotoninergic neurons in the VNC, which with the line they used amount to only 2 neurons per VNC hemisegment, are required for operant conditioning is very interesting. It would be great if they could also test whether they are sufficient. It seems that they would just need to make two split Gal4 lines one for tsh and one for tph, so the experiment does not seem too difficult and would significantly add to their findings.

      Generating new intersections is beyond the scope of this already large study, which has been significantly impacted by the pandemic. We have therefore added the text below, explaining that we have identified candidate serotonergic neurons that are required for operant learning and that identifying specific single neuron types that may be sufficient would be an exciting avenue for future follow-up work.

      In the Results section entitled, “Serotonergic VNC neurons may play role in operant conditioning of bend direction” we have added:

      “The Tph-Gal4 expression pattern contains two neurons per VNC hemisegment (with the exception of a single neuron in each A8 abdominal hemisegment, Huser2012). Future experiments exclusively targeting a single serotonergic neuron per VNC hemisegment could be valuable in determining whether they are sufficient for operant learning.”

      In the Discussion section entitled: “Automated operant conditioning of Drosophila larvae”

      “Furthermore, developing sparser lines that target single serotonergic and dopaminergic neuron types will enable the identification of the smallest subsets of neurons that are sufficient for providing the operant learning signal. Behavioural experiments with these genetic lines may have the added benefit of mitigating conflicting or non-specific reinforcement signalling.”

    1. Author Response

      Reviewer #1 (Public Review):

      The manuscript is clear and well-written and provides a novel and interesting explanation of different illusions in visual numerosity perception. However, the model used in the manuscript is very similar to Dehaene and Changeux (1993) and the manuscript does not clearly identify novel computational principles underlying the number sense, as the title would suggest. Thus, while we were all enthusiastic about the topic and the overall findings, the paper currently reads as a bit of a replication of the influential Dehaene & Changeux (1993)-model, and the authors need to do more to compare/contrast to bring out the main results that they think are novel.

      Major concerns:

      1) The model presented in the current manuscript is very similar to the Dehaene and Changeux 1993 model. The main difference is in the implementation of lateral inhibition in the DoG layer where the 1993 model used a recurrent implementation, and the current model uses divisive normalization (see minor concern #1). The lateral inhibition was also identified as a critical component of numerosity estimation in the 1993 model, so the novelty in elucidating the computational principles underlying the number sense in the current manuscript is not evident.

      If the authors hypothesize that the particular implementation of lateral inhibition used here is more relevant and critical for the number sense than the forms used in previous work (e.g., the recurrent implementation of the 1993 model or the local response normalization of the more recent models), then a direct comparison of the effects of the different forms is necessary to show this. If not, then the focus of the manuscript should be shifted (e.g., changing the title) to the novel aspects of the manuscript such as the use of the model to explain various visual illusions and adaptation and context effects.

      Thank you for bringing up these issues. We acknowledge that there was a lack of clear explanations for the key differences between the proposed model and that of Dehaene & Changeux (hereafter D&C). Please see our revisions below, where we 1) explain the D&C model and its limitations in more detail; 2) describe our critical changes to the D&C model; and 3) show how those critical changes allow a novel way to explain numerosity perception.

      The paragraph in the Introduction where we first introduce D&C is modified to read:

      “The computational model of Dehaene and Changeux (1993) explains numerosity detection based on several neurocomputational principles. That model (hereafter D&C) assumes a one-dimensional linear retina (each dot is a line segment), and responses are normalized across dot size via a convolution layer that represents combinations of two attributes: 1) dot size, as captured by difference-of-Gaussian contrast filters of different widths; and 2) location, by centering filters at different positions. In the convolution layer, the filter that matches the size of each dot dominates the neuronal activity at the location of the dot owing to a winner-take-all lateral inhibition process. To indicate numerosity, a summation layer pools the total activity over all the units in the convolution layer. While the D&C model provided a proof of concept for numerosity detection, it has several limitations as outlined in the discussion. Of these, the most notable is that strong winner-take-all in the convolution layer discretizes visual information (e.g., discrete locations and discrete sizes yielding a literal count of dots), which is implausible for early vision. As a result, the output of the model is completely insensitive to anything other than number in all situations, which is inconsistent with empirical data (Park et al., 2021).”

      The revised Discussion describes our critical modifications to D&C and their consequences.

      “At first blush, the current model might be considered an extension of Dehaene and Changeux (1993). However, there are four ways in which the current model differs qualitatively from the D&C model. First, the D&C model is one-dimensional, simulating a linear retina, whereas we model a two-dimensional retina feeding into center-surround filters, allowing application to the two-dimensional images used in numerosity experiments (Fig. 1A). Second, extreme winner-take-all normalization in the convolution layer of the D&C model implausibly limits visual precision by discretizing the visual response. For example, the convolution layer in the D&C model only knows which of 9 possible sizes and 50 possible locations occurred. In contrast, by using divisive normalization in the current model, each dot produces activity at many locations and many filter sizes despite normalization, and a population could be used to determine exact location and size. Third, extreme winner-take-all normalization also eliminates all information other than dot size and location. By using divisive normalization, the current model represents other attributes such as edges and groupings of dots (Fig. 1B) and these other attributes provide a different explanation of number sensitivity as compared to D&C. For example, the D&C model as applied to the spacing effect between two small dots (Fig. 4A) would represent the dots as existing discretely at two close locations versus two far locations, with the total summed response being two in either case. In contrast, the current model gives the same total response for a different reason. Although the small filters are less active for closely spaced dots, the closely spaced dots look like a group as captured by a larger filter, with this addition for the larger filter offsetting the loss for the smaller filter. Similarly, as applied to the dot size effect (Fig. 4B), the D&C model would only represent the larger dots using larger filters. In contrast, the current model represents larger dots with larger filters and with smaller filters that capture the edges of the larger dots, and yet the summed response remains the same in each case owing to divisive normalization (again, there are offsetting factors across different filter sizes). The final difference is that the D&C model does not include temporal normalization, which we show to be critical for explaining adaptation and context effects.”

      In sum, the current model explains a wider range of effects by using representations and processes that more closely reflect early vision. The change to two dimensions allows application to real images. The inclusion of temporal normalization allows application to temporal effects. The change from winner-take-all to divisive normalization might appear to be a parameter setting, but it’s one that produces qualitatively different results and explanations (e.g., representations of edges and groupings that are part of the explanation of selective sensitivity to number). These behaviors are consistent with empirical data and are qualitatively different from those of the D&C model. Now that we’ve highlighted the ways in which this model differs qualitatively from the D&C model, we hope that our original title still works.
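
      The contrast between winner-take-all and divisive normalization can be illustrated with a toy example. The sketch below is not the model's implementation: it uses the textbook form of divisive normalization over a single pool of three hypothetical filters, and the drive values and the constant `sigma` are made up for illustration.

```python
def winner_take_all(drive):
    """Keep only the most strongly driven unit (one-hot response)."""
    winner = max(range(len(drive)), key=lambda i: drive[i])
    return [1.0 if i == winner else 0.0 for i in range(len(drive))]


def divisive_normalization(drive, sigma=0.1):
    """Divide each unit's drive by the pooled drive plus a constant sigma."""
    pool = sigma + sum(drive)
    return [d / pool for d in drive]


# Hypothetical drives to small, medium, and large filters (assumed values).
small_dot = [0.9, 0.3, 0.1]  # mostly drives the small filter
large_dot = [0.6, 0.5, 1.5]  # edges drive small filters, the body drives the large one

wta_small = winner_take_all(small_dot)
wta_large = winner_take_all(large_dot)
dn_small = divisive_normalization(small_dot)
dn_large = divisive_normalization(large_dot)

# Winner-take-all reports only which filter won; divisive normalization keeps
# graded activity in every filter while the summed response stays similar.
print(wta_small, wta_large)
print(round(sum(dn_small), 3), round(sum(dn_large), 3))
```

      In this toy case the summed normalized response is similar for both stimuli even though the activity is distributed differently across filter sizes, which is the kind of offsetting described in the quoted passage.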

      Reviewer #2 (Public Review):

      This is a very interesting and novel model of numerosity perception, based on known computational principles of the visual system: center-surround mechanisms at various scales, combined with divisive normalization (over space and time). The model explains, at least qualitatively, several of the important aspects of numerosity perception.

      Firstly, the model makes major and minor predictions. Major: the effect of adaptation, at least 30%, as well as independence of several densities and dot size; minor: tiny effects like irregularity, around 6%. I think it would make sense to separate these. To my knowledge, it is the first to account for adaptation, which was the major effect that brought numerosity into the realm of psychophysics: and it explains it effortlessly, using an intrinsic component of the model (divisive normalization), not with an ad-hoc add-on. This should be highlighted more. And perhaps, the fit can be more quantitative. Murphy and Burr (who they cite) showed that the adaptation is rapid. How does this fit the model? Very well, I would have thought.

      Thanks for the positive evaluation of our work. In the revised manuscript, we followed the reviewer’s suggestion to highlight the novelty of the model in its explanation of numerosity adaptation. As the reviewer says, one significant aspect of our work is that the model can explain a relatively large effect of numerosity adaptation with minimal effort. To be clear, even though we call it “numerosity” adaptation, the model does not know number in any explicit way. One way to highlight this aspect, we thought, is to compare the current adaptation results to a simulation where the adaptor and target are defined along the dimensions of size or spacing. In such cases (which are now reported in Fig. S6 and S7), no reliable under- or over-estimation was observed. These results suggest that numerosity adaptation is a natural byproduct of divisive normalization working across space and time.

      The question about the rapidity of adaptation is indeed an interesting one. However, the current model is not designed to simulate the effect of exposure duration on neural activity. More specifically, the current model operates across trials and stimuli (e.g., one response per stimulus), using a single parameter that captures the temporal gradient of divisive normalization from prior trials (e.g., the influence of two trials ago as compared to one trial ago). As currently formulated, the model does not address adaptation at the level of milliseconds, as would be necessary to model adaptor duration. To model adaptation at the millisecond level requires a dynamic model that not only specifies the rate of adaptation but also the rate of recovery from adaptation, such as in the visual orientation adaptation model of Jacob, Potter, and Huber (2021), which includes the dynamics of synaptic depression and synaptic recovery. In future work we hope to make such modifications to the model to expand the range of explained effects. Nevertheless, a dynamic version of the model should encompass this simpler trial-by-trial version of the model as a special case. Our goal in this study was a clear demonstration of the neural mechanisms underlying numerosity in early vision and so we have attempted to keep the model as simple as possible while still capturing neural behavior.
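
      To make the idea of a trial-by-trial temporal gradient concrete, here is a minimal sketch. The formula is an assumption chosen for illustration, not the model's actual equation: each trial's drive is divided by a constant `sigma`, the current drive, and an exponentially weighted sum of the drives on earlier trials, with a single hypothetical parameter `decay` setting how quickly the influence of past trials fades.

```python
def normalized_responses(drives, decay=0.5, sigma=1.0):
    """Return one normalized response per trial.

    Trial t is divided by sigma + drive[t] + sum_k decay**k * drive[t - k],
    so one trial ago weighs decay, two trials ago decay**2, and so on.
    """
    responses = []
    history = 0.0  # exponentially weighted sum of drives on earlier trials
    for drive in drives:
        responses.append(drive / (sigma + drive + history))
        history = decay * (history + drive)
    return responses


# The same target (drive 20) preceded by a high or a low adaptor (assumed values).
after_high = normalized_responses([80.0, 20.0])[1]
after_low = normalized_responses([5.0, 20.0])[1]
print(after_high, after_low)
```

      In this sketch the response to the target is smaller after the high adaptor than after the low adaptor, which is the direction of the adaptation effect. Nothing in it represents time within a trial, which is why a formulation of this kind cannot address adaptor duration.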

      We have elected not to fit data and instead we explored the model's behavior in a qualitative way, asking whether the commonly observed numerosity effects emerge from the model in the qualitatively correct direction regardless of its parameter values (e.g., as reported in Fig S2). This approach follows from our central aim, which is to explain the neurocomputational principles of the number sense rather than produce a detailed model with specific parameter values fit to data. Our aim was to show that the correct qualitative behaviors naturally emerge from these principles without requiring specific parameter values (and more importantly, to show how these behaviors emerge from these principles).

      Jacob, L. P., Potter, K. W., & Huber, D. E. (2021). A neural habituation account of the negative compatibility effect. Journal of Experimental Psychology: General, 150(12), 2567.

      Among the tiny predicted effects (visually indistinguishable bar graphs) is the connectedness effect. But this is in fact large, up to 20%. I would say they fail here, by predicting only 6%. And I would say this is to be expected, as the illusion relies on higher-order properties (grouping), which would not immediately result from normalization. Furthermore, the illusion varies with individual personality traits (Pomè et al, JAD, 2021). The fact that it works with very thin lines suggests that it is not the physical energy of the lines that normalizes, but the perceptual grouping effect. I would either drop it, or give it as an example of where the predictions are in the right direction, but clearly fall short quantitatively. No shame in saying that they cannot explain everything with low-level mechanisms. A future revised model could incorporate grouping phenomena.

      Thank you for the suggestion. We agree that trying to explain the connectedness illusion with center-surround filters is not ideal. As the reviewer says, the main driver of the connectedness illusion is likely to be groupings of dots. The current model captures groupings of dots, but it does so in a circularly symmetric way, which is not ideal for capturing the oblong groupings (barbells) that are likely to play a role in the connectedness illusion. It is probably because of this mismatch (between the shape of the groupings and shape of the filters) that the model produces a smaller magnitude connectedness illusion. If the model included a subsequent convolution layer in which the filters were oriented lines of different sizes, it would likely produce a larger connectedness illusion. Following the reviewer’s suggestion, we have placed the connectedness illusion in the supplementary materials and only refer to this in the future directions section of the discussion, writing:

      “Another line of possible future work concerns divisive normalization in higher cortical levels involving neurons with more complex receptive fields. While the current normalization model with center-surround filters successfully explained visual illusions caused by regularity, grouping, and heterogeneity, other numerosity phenomena such as topological invariants and statistical pairing (He et al., 2015; Zhao and Yu, 2016) may require the action of neurons with receptive fields that are more complex than center-surround filters. For example, another well-known visual illusion is the effect of connectedness, whereby an array with dots connected pairwise with thin lines is underestimated (by up to 20%) compared to the same array without the lines connected (Franconeri et al., 2009). This underestimation effect likely arises from barbell-shaped pairwise groupings of dots, rather than the circularly symmetric groupings of dots that are captured with center-surround filters. Nonetheless, a small magnitude (6%) connectedness illusion emerges with center-surround filters (Fig. S10). Augmenting the current model with a subsequent convolution layer containing oriented line filters and oriented normalization neighborhoods of different sizes might increase the predicted magnitude of the illusion.”

      In short, I like the model very much, but think the manuscript could be packaged better. Bring out the large effects more, especially those that have never been explained previously (like adaptation). And try to be more quantitative.

      Thank you. We now highlight the novel computational demonstrations of adaptation to a greater degree and—as also suggested by Reviewer 1—provide more quantitative reports of the illusory effects that the model naturally produces.

    1. Author Response:

      Reviewer #1 (Public Review):

      In this manuscript, the authors leverage novel computational tools to detect, classify and extract information underlying sharp-wave ripples, and synchronous events related to memory. They validate the applicability of their method to several datasets and compare it with a filtering method. In summary, they found that their convolutional neural network detection captures more events than the commonly used filter method. This particular capability of capturing additional events which traditional methods don't detect is very powerful and could open important new avenues worth further investigation. The manuscript in general will be very useful for the community as it will increase the attention towards new tools that can be used to solve ongoing questions in hippocampal physiology.

      We thank the reviewer for the constructive comments and appreciation of the work.

      Additional minor points that could improve the interpretation of this work are listed below:

      • Spectral methods could also be used to capture the variability of events if used properly or run several times through a dataset. I think adjusting the statements where the authors compare CNN with traditional filter detections could be useful as it can be misleading to state otherwise.

      We thank the reviewer for this suggestion. We would like to emphasize that we do not at all advocate abandoning filters. We feel that a combination of methods is required to improve our understanding of the complex electrophysiological processes underlying SWR. We have adjusted the text as suggested. In particular, a) we removed the misleading sentence from the abstract, and instead declared the need for new automatic detection strategies; b) we edited the introduction similarly, and clarified the need for improved online applications.

      • The authors show that their novel method is able to detect "physiological relevant processes" but no further analysis is provided to show that this is indeed the case. I suggest adjusting the statement to "the method is able to detect new processes (or events)".

      We have corrected the text as suggested. In particular, we declare that “The new method, in combination with community tagging efforts and optimized filter, could potentially facilitate discovery and interpretation of the complex neurophysiological processes underlying SWR.” (page 12).

      • In Fig.1 the authors show how they tune the parameters that work best for their CNN method and from there they compare it with a filter method. In order to offer a more fair comparison analogous tuning of the filter parameters should be tested alongside to show that filters can also be tuned to improve the detection of "ground truth" data.

      Thank you for this comment. As explained before, see below the results of the parameter study for the filter, run on the very same sessions used for training the CNN. The parameters chosen (100-300 Hz band, order 2) provided maximal performance in the test set. Therefore, both methods are similarly optimized during training. This is now included (page 4): “In order to compare CNN performance against spectral methods, we implemented a Butterworth filter, which parameters were optimized using the same training set (Fig.1-figure supplement 1D).”

      • Showing a manual score of the performance of their CNN method detection with false positive and false negative flags (and plots) would be clarifying in order to get an idea of the type of events that the method is able to detect and fails to detect.

      We have added information on the categories of False Positives for both the CNN and the filter in the new Fig.4F. We have also prepared an executable figure to show examples and to facilitate understanding of how the CNN works. See new Fig.5 and executable notebook https://colab.research.google.com/github/PridaLab/cnn-ripple-executable-figure/blob/main/cnn-ripple-false-positive-examples.ipynb

      • In fig 2E the authors show the differences between CNN with different precision and the filter method, while the performance is better the trends are extremely similar and the numbers are very close for all comparisons (except for the recall where the filter clearly performs worse than CNN).

      This refers to the external dataset (Grosmark and Buzsaki 2016), which is now in the new Fig.3E. To address this point and to improve statistical reporting, we have added more data, resulting in 5 sessions from 2 rats. The data confirm better performance of the CNN model versus the filter. The purpose of this figure is to show the effect of the definition of the ground truth on the performance of the different methods, and also the proper performance of the CNN on external datasets without retraining. Please note that in Grosmark and Buzsaki, SWR detection was conditioned on the coincidence of both population synchrony and LFP definition, thus providing a “partial ground truth” (i.e. SWR without population firing were not annotated in the dataset).

      • The authors acknowledge that various forms of SWRs not consistent with their common definition could be captured by their method. But theoretically, it could also be the case that, due to the spectral continuum of the LFP signals, noisy features of the LFP could also be passed as "relevant events"? Discussing this point in the manuscript could help with the context of where the method might be applied in the future.

      As suggested, we have mentioned this point in the revised version. In particular: “While we cannot discard noisy detection from a continuum of LFP activity, our categorization suggest they may reflect processes underlying buildup of population events (de la Prida et al., 2006). In addition, the ability of CA3 inputs to bring about gamma oscillations and multi-unit firing associated with sharp-waves is already recognized (Sullivan et al., 2011), and variability of the ripple power can be related with different cortical subnetworks (Abadchi et al., 2020; Ramirez-Villegas et al., 2015). Since the power spectral level operationally defines the detection of SWR, part of this microcircuit intrinsic variability may be escaping analysis when using spectral filters” (page 16).

      • In fig. 5 the authors claim that there are striking differences in firing rate and timings of pyramidal cells when comparing events detected in different layers (compared to the SP layer). This is not very clear from the figure, as plots 5G and 5H show that the main differences arise when comparing with SO and SLM.

      We apologize for generating confusion. We meant that the analysis compared properties of SWR detected at SO, SR and SLM, using values z-scored by those of SWR detected at SP only. We clarified this point in the revised version: “We found larger sinks and sources for SWR that can be detected at SLM and SR versus those detected at SO (Fig.7G; z-scored by mean values of SWR detected at SP only).” (page 14).

      • Could the above differences be related to the fact that the performance of the CNN could have different percentages of false-positive when applied to different layers?

      The rate of FP is similar across layers: 0.52 ± 0.21 for SO, 0.50 ± 0.21 for SR and 0.46 ± 0.19 for SLM. This is now mentioned in the text: “No difference in the rate of False Positives between SO (0.52 ± 0.21), SR (0.50 ± 0.21) and SLM (0.46 ± 0.19) can account for this effect.” (page 12)

      Alternatively, could the variability be related to the occurrence (and detection) of similar events in neighboring spectral bands (i.e., gamma events)? Discussion of this point in the manuscript would be helpful for the readers.

      We have discussed this point: “While we cannot discard noisy detection from a continuum of LFP activity, our categorization suggest they may reflect processes underlying buildup of population events (de la Prida et al., 2006). In addition, the ability of CA3 inputs to bring about gamma oscillations and multi-unit firing associated with sharp-waves is already recognized (Sullivan et al., 2011), and variability of the ripple power can be related with different cortical subnetworks (Abadchi et al., 2020; Ramirez-Villegas et al., 2015).” (Page 16)

      Overall, I think the method is interesting and could be very useful to detect more nuance within hippocampal LFPs and offer new insights into the underlying mechanisms of hippocampal firing and how they organize in various forms of network events related to memory.

      We thank the reviewer for constructive comments and appreciation of the value of our work.

      Reviewer #2 (Public Review):

      Navas-Olive et al. provide a new computational approach that implements convolutional neural networks (CNNs) for detecting and characterizing hippocampal sharp-wave ripples (SWRs). SWRs have been identified as important neural signatures of memory consolidation and retrieval, and there is therefore interest in developing new computational approaches to identify and characterize them. The authors demonstrate that their network model is able to learn to identify SWRs by showing that, following the network training phase, performance on test data is good. Performance of the network varied by the human expert whose tagging was used to train it, but when experts' tags were combined, performance of the network improved, showing it benefits from multiple input. When the network trained on one dataset is applied to data from different experimental conditions, performance was substantially lower, though the authors suggest that this reflected erroneous annotation of the data, and once corrected performance improved. The authors go on to analyze the LFP patterns that nodes in the network develop preferences for and compare the network's performance on SWRs and non-SWRs, both providing insight and validation about the network's function. Finally, the authors apply the model to dense Neuropixels data and confirmed that SWR detection was best in the CA1 cell layer but could also be detected at more distant locations.

      The key strengths of the manuscript lie in a convincing demonstration that a computational model that does not explicitly look for oscillations in specific frequency bands can nevertheless learn to detect them from tagged examples. This provides insight into the capabilities and applications of convolutional neural networks. The manuscript is generally clearly written and the analyses appear to have been carefully done.

      We thank the reviewer for the summary and for highlighting the strengths of our work.

      While the work is informative about the capabilities of CNNs, the potential of its application for neuroscience research is considerably less convincing. As the authors state in the introduction, there are two potential key benefits that their model could provide (for neuroscience research): 1. improved detection of SWRs and 2. providing additional insight into the nature of SWRs, relative to existing approaches. To this end, the authors compare the performance of the CNN to that of a Butterworth filter. However, there are a number of major issues that limit the support for the authors' claims:

      Please see below the answers to specific questions, which we hope clarify the validity of our approach.

      • Putting aside the question of whether the comparison between the CNN and the filter is fair (see below), it is unclear if even as is, the performance of the CNN is better than a simple filter. The authors argue for this based on the data in Fig. 1F-I. However, the main result appears to be that the CNN is less sensitive to changes in the threshold, not that it does better at reasonable thresholds.

      This comment now refers to the new Fig.2A (offline detection) and Fig.2C,D (online detection). Starting with offline detection: yes, the CNN is less sensitive to the threshold than the filter, and that has major consequences both offline and online. For the filter to reach its best performance, the threshold has to be tuned, which is a time-consuming process. Importantly, this is only doable when you know the ground truth. In practical terms, most labs run a semi-automatic detection approach in which events are first detected and then manually validated. The fact that the filter is more sensitive to thresholds makes this process very tedious. Instead, the CNN is more stable.

      In trying to be fair, we also tested the performance of the CNN and the filter at their best performance (i.e. looking for the threshold providing the best match with the ground truth). This is shown in Fig.3A. There are no differences between methods, indicating that the CNN meets the gold standard provided the filter is optimized. Note again that this is only possible if you know the ground truth, because optimization is based on looking for the best threshold per session.

      Importantly, both methods reach their best performance at the expert’s limit (gray band in Fig.3A,B). They cannot be better than the individual ground truth. This is why we advocate for community tagging collaborations to consolidate sharp-wave ripple definitions.
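
      The threshold sensitivity discussed above can be illustrated with a toy example. The sketch below is not our detection pipeline: the envelope values are invented, and `detect_events` is a hypothetical helper that simply reports runs of samples above a threshold.

```python
def detect_events(envelope, threshold):
    """Return (start, end) sample indices of runs where envelope > threshold."""
    events, start = [], None
    for i, value in enumerate(envelope):
        if value > threshold and start is None:
            start = i  # a run above threshold begins
        elif value <= threshold and start is not None:
            events.append((start, i))  # the run ended at the previous sample
            start = None
    if start is not None:  # the run extends to the end of the signal
        events.append((start, len(envelope)))
    return events


# Hypothetical ripple-band power envelope in arbitrary units (assumed values).
envelope = [0.2, 0.5, 3.1, 4.0, 0.4, 1.6, 1.8, 0.3, 0.2, 5.2, 0.1]

# The number of detected events changes with the threshold.
counts = {thr: len(detect_events(envelope, thr)) for thr in (1.0, 2.0, 4.5)}
print(counts)
```

      Without a ground truth there is no principled way to choose among these thresholds, which is why a detector whose output changes less with the threshold is easier to use in practice.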

      Moreover, the mean performance of the filter across thresholds appears dramatically dampened by its performance on particularly poor thresholds (Fig. F, I, weak traces). How realistic these poorly tested thresholds are is unclear. The single direct statistical test of difference in performance is presented in Fig. 1H but it is unclear if there is a real difference there as graphically it appears that animals and sessions from those animals were treated as independent samples (and comparing only animal averages or only sessions clearly does not show a significant difference).

      Please note that this refers to online detection. We are not sure we understand the comment on whether the thresholds are realistic. To clarify, we detect SWR online using thresholds that we similarly optimize for the filter and the CNN over the course of the experiment. This is reported in Fig.2C both per session and per animal, showing statistically significant differences (we added more experiments to increase statistical power). Since online-defined thresholds may still not be the best, we then annotated these data and ran an additional post hoc offline optimization analysis, which is presented in Fig.2D. We hope this is now clearer in the revised version.

      Finally, the authors show in Fig. 2A that for the best threshold the CNN does not do better than the filter. Together, these results suggest that the CNN does not generally outperform the filter in detecting SWRs, but only that it is less sensitive to usage of extreme thresholds.

      We hope this is now clarified. See our response to your first bullet point.

      Indeed, I am not convinced that a non-spectral method could even theoretically do better than a spectral method to detect events that are defined by their spectrum, assuming all other aspects are optimized (such as combining data from different channels and threshold setting).

      As can be seen in the responses to the editor synthesis, we have optimized the filter parameters similarly (new Fig.1-supp-1D) and there is no improvement from using more channels (see below). In any case, we would like to emphasize that we do not at all advocate abandoning filters. We feel that a combination of methods is required to improve our understanding of the complex electrophysiological processes underlying SWR.

      • The CNN network is trained on data from 8 channels but it appears that the compared filter is run on a single channel only. This is explicitly stated for the online SWR detection and presumably, that is the case for the offline as well. This unfair comparison raises the possibility that whatever improved performance the CNN may have may be due to considerably richer input and not due to the CNN model itself. The authors state that a filter on the data from a single channel is the standard, but many studies use various "consensus" heuristics, e.g. in which elevated ripple power is required to be detected on multiple channels simultaneously, which considerably improves detection reliability. Even if this weren't the case, because the CNN learns how to weight each channel, to argue that better performance is due to the nature of the CNN it must be compared to an algorithm that similarly learns to optimize these weights on filtered data across the same number of channels. It is very likely that if this were done, the filter approach would outperform the CNN as its performance with a single channel is comparable.

      We appreciate this comment. Using one channel to detect SWR is very common for offline detection followed by manual curation. In some cases, a second channel is used either to veto spurious detections (using a non-ripple channel) or to confirm detection (using a second ripple channel and/or a sharp-wave) (Fernandez-Ruiz et al., 2019). Many others use detection of population firing together with the filter to identify replay (such as in Grosmark and Buzsaki 2016, where ripples were conditioned on the coincidence of both population firing and LFP detected ripples). To address this comment, we compared performance using different combinations of channels, from the standard detection at the SP layer (pyr) up to 4 and 8 channels around SP using the consensus heuristics. As can be seen, filter performance is consistent across configurations and using 8 channels does not improve detection. We clarify this in the revised version: “We found no effect of the number of channels used for the filter (1, 4 and 8 channels), and chose that with the higher ripple power” (see caption of Fig.1-supp-1D).
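
      For readers unfamiliar with consensus heuristics, the following minimal sketch shows one simple variant: a sample counts as a candidate only if ripple power exceeds the threshold on at least a given number of channels simultaneously. The function name, the power values and the threshold are all hypothetical, and the exact rule differs between studies.

```python
def consensus_mask(power_by_channel, threshold, min_channels):
    """Mark samples where at least min_channels channels exceed threshold."""
    n_samples = len(power_by_channel[0])
    return [
        sum(channel[t] > threshold for channel in power_by_channel) >= min_channels
        for t in range(n_samples)
    ]


# Hypothetical ripple power: 3 channels x 4 samples (assumed values).
power = [
    [0.1, 2.0, 2.5, 0.2],
    [0.2, 1.8, 0.3, 0.1],
    [0.1, 2.2, 2.4, 3.0],
]

single = consensus_mask(power, threshold=1.5, min_channels=1)  # any channel
pairwise = consensus_mask(power, threshold=1.5, min_channels=2)
strict = consensus_mask(power, threshold=1.5, min_channels=3)  # all channels
print(single, pairwise, strict)
```

      Raising `min_channels` rejects events seen on a single channel (the last sample here) at the cost of also rejecting events that are spatially restricted.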

      • Related to the point above, for the proposed CNN model to be a useful tool in the neuroscience field it needs to be amenable to the kind of data and computational resources that are common in the field. As the network requires 8 channels situated in close proximity, the network would not be relevant for numerous studies that use fewer or spaced channels. Further, the filter approach does not require training and it is unclear how generalizable the current CNN model is without additional network training (see below). Together, these points raise the concern that even if the CNN performance is better than a filter approach, it would not be usable by a wide audience.

      Thank you for this comment. To handle different input channel configurations, we have developed an interpolation approach, which transforms any data into 8-channel inputs. We are currently applying the CNN without retraining to data from several labs using different electrode numbers and configurations, including tetrodes, linear silicon probes and wires. The results confirm the performance of the CNN. Since we cannot disclose these third-party data here, we have looked for a new dataset from our own lab to illustrate the case. See below results from 16ch silicon probes (100 μm inter-electrode separation), where the CNN performed better than the filter (F1: p=0.0169; Precision: p=0.0110; 7 sessions from 3 mice). We found that the performance of the CNN depends on the laminar LFP profile, as Neuropixels data illustrate.
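
      As an illustration of how recordings with a different channel count can be mapped onto 8 inputs, the sketch below linearly interpolates one time sample across evenly spaced channels. This is a generic linear interpolation given as an assumption; it is not a description of the interpolation approach mentioned above, and `resample_channels` is a hypothetical helper.

```python
def resample_channels(snapshot, n_out=8):
    """Linearly interpolate one time sample across channels to n_out values.

    Assumes the input channels are evenly spaced along the probe and n_out >= 2.
    """
    n_in = len(snapshot)
    if n_in == 1:
        return [snapshot[0]] * n_out  # nothing to interpolate between
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)  # fractional source channel index
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append((1 - frac) * snapshot[lo] + frac * snapshot[hi])
    return out


# One LFP sample from a hypothetical 16-channel probe, reduced to 8 channels.
sixteen = [float(c) for c in range(16)]
eight = resample_channels(sixteen)
print(eight)
```

      Applying the function to every time sample gives an 8-channel signal. Interpolation of this kind preserves the first and last channels and assumes the laminar profile varies smoothly between neighboring electrodes.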

      • A key point is whether the CNN generalizes well across new datasets as the authors suggest. When the model trained on mouse data was applied to rat data from Grosmark and Buzsaki, 2016, precision was low. The authors state that "Hence, we evaluated all False Positive predictions and found that many of them were actually unannotated SWR (839 events), meaning that precision was actually higher". How were these events judged as SWRs? Was the test data reannotated?

      We apologize for not explaining this better in the original version. We chose Grosmark and Buzsaki 2016 because it provides an “incomplete ground truth”, since (citing their Methods) “Ripple events were conditioned on the coincidence of both population synchrony events, and LFP detected ripples”. This means there are LFP ripples not included in their GT. This dataset provides a very good example of how the experimental goal (examining replay and thus relying on population firing plus LFP definitions) may limit the ground truth.

      Please, note we use the external dataset for validation purposes only. The CNN model was applied without retraining, so it also helps to exemplify generalization. Consistent with a partial ground truth, the CNN and the filter recalled most of the annotated events, but precision was low. By manually validating False Positive detections, we re-annotated the external dataset and both the CNN and the filter increased precision.

      To make the case clearer, we now include more sessions to increase the data size and test for statistical effects (Fig.3E). We also changed the example to show more cases of re-annotated events (Fig.3D). We have clarified the text: “In that work, SWR detection was conditioned on the coincidence of both population synchrony and LFP definition, thus providing a “partial ground truth” (i.e. SWR without population firing were not annotated in the dataset).” (see page 7).
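
      The effect of a partial ground truth on precision can be shown with a small worked example. The sketch below uses invented event times and a simple overlap criterion for matching detections to annotations. It is an assumption for illustration, not the scoring code used in the manuscript.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


def score(detected, ground_truth):
    """Return (precision, recall, F1) for lists of (start, end) events."""
    true_pos = sum(any(overlaps(d, g) for g in ground_truth) for d in detected)
    recalled = sum(any(overlaps(g, d) for d in detected) for g in ground_truth)
    precision = true_pos / len(detected) if detected else 0.0
    recall = recalled / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical events in seconds (assumed values).
detected = [(0.98, 1.08), (2.05, 2.12), (5.00, 5.10), (7.00, 7.10)]
partial_gt = [(1.00, 1.10), (2.00, 2.10)]  # only SWR with population firing
reannotated_gt = partial_gt + [(5.00, 5.10)]  # a missed SWR added after review

before = score(detected, partial_gt)
after = score(detected, reannotated_gt)
print(before, after)
```

      With the partial ground truth every annotated event is recalled but half of the detections count as False Positives. Once the unannotated event is added, precision rises without any change to the detector.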

      • The argument that the network improves with data from multiple experts while the filter does not requires further support. While Fig. 1B shows that the CNN improves performance when the experts' data is combined and the filter doesn't, the final performance on the consolidated data does not appear better in the CNN. This suggests that performance of the CNN when trained on data from single experts was lower to start with.

      This comment refers to the new Fig.3B. We apologize for not having included a between-method comparison in the original version. To address this, we now include a one-way ANOVA analysis for the effect of the type of the ground truth on each method, and an independent one-way ANOVA for the effect of the method in the consolidated ground truth. To increase statistical power we have added more data. We also detected a mistake involving duplicated data in the original figure, which has been corrected. Importantly, the rationale behind the experts’ consolidated data is that there is about 70% consistency between experts, and so many SWR remain unannotated in the individual ground truths. These are typically ambiguous events, which may generate discussion between experts, such as sharp-waves with population firing and few ripple cycles. Since the CNN is better at detecting them, this explains why its performance improves when data from multiple experts are integrated.
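
      One simple way to build a consolidated ground truth is to take the union of the experts' annotations and merge events that overlap. The sketch below assumes this union rule purely for illustration; the event times are invented and `consolidate` is a hypothetical helper.

```python
def consolidate(*expert_tags):
    """Merge (start, end) events from several experts into one sorted union."""
    events = sorted(e for tags in expert_tags for e in tags)
    merged = []
    for start, end in events:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous event: extend it rather than duplicating it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical annotations in seconds from two experts (assumed values).
expert_a = [(1.00, 1.10), (3.00, 3.10)]
expert_b = [(1.05, 1.12), (2.00, 2.10)]

consolidated = consolidate(expert_a, expert_b)
print(consolidated)
```

      Under this rule, events tagged by only one expert enter the consolidated set, so a detector that finds such ambiguous events is no longer penalized for them.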

      Further, regardless of the point in the bullet point above, the data in Fig. 1E does not convincingly show that the CNN improves while the filter doesn't as there are only 3 data points per comparison and no effect on F1.

      Fig.1E shows an example, so we guess the reviewer refers to the new Fig.2C, which shows data on online operation, where we originally reported the analysis per session and per animal separately with only 3 mice. We have run more experiments to increase the data size and test for statistical effects (8 sessions, 5 mice; per session p=0.0047; per mouse p=0.033; t-test). This is now corrected in the text and in the Fig.1C caption. Please note that a post hoc offline evaluation of these online sessions confirmed better performance of the CNN versus the filter for all normalized thresholds (Fig.2D).

      • Apart from the points above regarding the ability of the network to detect SWRs, the insight into the nature of SWRs that the authors suggest can be achieved with CNNs is limited. For example, the data in Fig. 3 is a nice analysis of what the components of the CNN learn to identify, but the claim that "some predictions not consistent with the current definition of SWR may identify different forms of population firing and oscillatory activities associated to sharp-waves" is not thoroughly supported. The data in Fig. 4 is convincing in showing that the network better identifies SWRs than non-SWRs, but again the insight is about the network rather than about SWRs.

      In the revised version, we now include validation of all false positives detected by the CNN and the filter (Fig.4F). To help the reader examine examples of True Positive and False Positive detections, we also include a new figure (Fig.5), which comes with executable code (see page 9). We also include comparisons of the features of TP events detected by both methods (Fig.2B), where it is shown that SWR events detected by the CNN exhibited features more similar to those of the ground truth (GT) than those detected by the filter. We feel the entire manuscript provides support for these claims.

      Finally, the application of the model on Neuropixels data also nicely demonstrates the applicability of the model on this kind of data but does not provide new insight regarding SWRs.

      We respectfully disagree. Please note that the application to ultra-dense Neuropixels recordings not only applies the model to an entirely new dataset without retraining, but also shows that some SWRs with larger sinks and sources can actually be detected at input layers (SO, SR and SLM). Importantly, those events result in different firing dynamics, providing mechanistic support for heterogeneous behavior underlying, for instance, replay.

      In summary, the authors have constructed an elegant new computational tool and convincingly shown its validity in detecting SWRs and applicability to different kinds of data. Unfortunately, I am not convinced that the model convincingly achieves either of its stated goals: exceeding the performance of SWR detection or providing new insights about SWRs as compared to considerably simpler and more accessible current methods.

      We thank you again for your constructive comments. We hope you are now convinced of the value of the new method in light of the newly added data.

    1. Author Response

      We thank the reviewers for their very thorough, detailed, and fair reviews that will help us improve the manuscript. We have two minor comments. First, we emphasize that the evidence is for pervasive positive selection being the main driver of the genetic diversity of Atlantic cod. Second, we comment on the application of the Moran process to model the reproduction of high-fecundity organisms. In the Moran process, a single individual is chosen at random to reproduce at any time, and another individual is chosen to die. However, the parent also persists in the population and can generate a large number of offspring in its lifetime. Hence, the Moran process does not imply an especially low level of fecundity. The multiple mergers seen in coalescent models of highly fecund organisms arise from a combination of high fecundity and reproductive skew; models of high fecundity without skewness are consistent with genealogies with binary mergers only. Hence, the Durrett-Schweinsberg model we employ can be thought of as a model for a highly fecund organism for which reproductive skewness manifests through selective sweeps.
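      The point about lifetime reproductive output can be illustrated with a minimal sketch of a neutral Moran process. This is an illustration only, not the Durrett-Schweinsberg model used in the manuscript; the function name `simulate_moran`, the population size, and the choice to let the parent also be eligible to die in the same step are assumptions made for the example.

```python
import random


def simulate_moran(pop_size, steps, seed=0):
    """Simulate a neutral Moran process and track offspring per individual.

    At each step one individual is chosen uniformly at random to reproduce
    and one is chosen uniformly at random to die; the newborn takes the
    slot of the individual that died.

    Returns (lifetime_counts, alive_counts):
      lifetime_counts: offspring totals of every individual that died
      alive_counts:    offspring produced so far by individuals still alive
    """
    rng = random.Random(seed)
    offspring = [0] * pop_size  # offspring so far for the occupant of each slot
    lifetime_counts = []
    for _ in range(steps):
        parent = rng.randrange(pop_size)
        dying = rng.randrange(pop_size)
        offspring[parent] += 1
        # Record the completed lifetime of the individual that dies,
        # then reset the slot for the newborn that replaces it.
        lifetime_counts.append(offspring[dying])
        offspring[dying] = 0
    return lifetime_counts, offspring


lifetime, alive = simulate_moran(pop_size=100, steps=20000, seed=1)
largest_family = max(lifetime)  # some individuals reproduce many times before dying
```

      Although only one birth occurs per step, an individual that survives many steps accumulates offspring over its lifetime, so the per-step rule by itself does not cap fecundity.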

    1. Author Response

      Public Evaluation Summary:

      This is potentially an interesting paper in which extensive MD simulations are used to probe the effect of phosphorylation of a tyrosine residue on the conformational ensemble of Ras GTPase. The insights form the basis for a screen of small molecule(s) that disrupt interaction with its target Raf kinase, and predictions are tested experimentally. Overall, the integrated approach is of interest to a wide range of biochemists and protein scientists and could potentially be used to modulate the activities of other proteins.

      We would like to thank the reviewers for their valuable comments/suggestions. We provided detailed responses to the questions raised by the reviewers and also submit the revised manuscript where the modified parts are highlighted in yellow. We believe that the original manuscript is improved in light of these changes.

      In the revised version, we (i) increased the number of replicates of MD simulations to four per system studied, (ii) extended previous simulations, which were presented in the original submission, up to 1 µs to test the statistical significance of the main results, and (iii) increased the number of SMDs to 70 per system. We provided time-line data for each replicate of the classical MD simulation in the SI and showed the results obtained from these combined trajectories in the main text along with respective statistical error values. We also repeated calculations such as RMSF, PCA, and the number of waters including the new trajectories and provided updated values/distribution plots in the revised version.

      In general, we obtained similar results to those presented in the original submission except the flexibilities of G60 and Q61. They seemed to display similar behavior among the systems studied as presented in Table 1 upon inclusion of the new replicates. On the other hand, the two residues reached relatively higher RMSF values in the phosphorylated RAS when considering the error values calculated. We presented these values in Table 1 and revised the text accordingly.

      Also, we revised a part in the original submission pertaining to the criterion used for describing the opening of the nucleotide binding pocket in HRASG12D. We noticed that Q61 was not considered for describing the wideness of the nucleotide binding pocket in the references provided. It is also important to mention that the opening of the nucleotide binding pocket, which was described by the distance measured between the Cα atoms of D12 and D34, did not change by the distance measured between the side chain of Q61 and γ-phosphate atom of GTP. Therefore, we dropped the respective distribution of Q61 in the revised version.

      In the application of the PSP methodology, we increased the number of SMD simulations for each of the ligand-bound and ligand-free systems to 70. We also made a more detailed analysis of the results, and we can now rely on not just the qualitative features of the PMFs, but also on the quantities obtained. In particular, the large barrier to cavity opening (ca. 30 kcal/mol) in the ligand-bound form is now clearly shown, and the fact that cerubidine binding leads to a barrierless transition that requires about 1/3 of the energy is demonstrated.

      Reviewer #3 (Public Review):

      In their manuscript "Inhibition of mutant RAS-RAF interaction by mimicking structural and dynamic properties of phosphorylated RAS", Ilter, Kasmer, et al. search for druggable sites in the RAS mutant G12D in computer calculations, and verify their results by experiments. RAS is a major oncogene for various types of cancer and is notoriously hard to target with drugs. Any significant insight into how to find drugs targeting RAS mutants is therefore of high interest. The present manuscript tries to provide such insight, and the connection between simulation and theory appears sound, as the identified compound cerubidine apparently indeed blocks mutant RAS activity.

      As I am an expert in simulations, but not in experiments, I will only focus on the presented computational part. In this function, however, I see some significant problems with the results: The data basis that the authors base their analysis on is quite small (only two simulations of 2.5 µs total simulation time), and from the presented data set I do not see any information on if the results on Y32 dynamics are anecdotal or reproducible. All presented distance distribution plots miss error bars/error ranges, as well as some time course plots that the simulations have indeed converged. So I cannot confirm whether the presented results are valid or if the authors were just lucky in their small data set.

      We would like to thank the reviewer for their comments on the inadequacy of the data used. During the revision period, we performed additional simulations to have four replicates, each of which is about 1 µs, per system. For ligand-bound RAS systems, we ran the simulations until Switch I was displaced from the nucleotide-binding pocket and extended them for an additional ca. 200-300 ns to check whether it returns to its original position. Respective time-line plots of replicates of both ligand-bound and non-liganded systems are provided in Figures S4–S6 and S11–S14 in the SI of the revised MS. We also provide error values in the captions of the corresponding figures in the main text. The updated simulation times are provided in the methods section. We present the total simulation times of each ligand-bound RAS system in the SI.

      To show the convergence of the systems, we provided RMSD profiles for each replicate of the system studied in panel A of Figures S4–S6 and S11-S14. For HRASWT, HRASG12D, and HRASPY32, RMSDs reached a plateau after some time while those of ligand-bound systems did not, as Switch I was highly fluctuating. Importantly, we observed similar behavior in each replicate of the systems so it can be said that the results presented in the original MS are reproducible.

      Interestingly, Switch I was displaced in one of the four replicates of HRASG12D, which might lead to the release of the nucleotide from the pocket, thus triggering a transition towards the apo state. In fact, this observation does not contradict the findings in the literature.

      It has been shown that mutant RAS can also adopt the apo state albeit with low probability due to its low intrinsic GTPase activity. Therefore, except for KRASG12C, which has a relatively higher intrinsic GTPase activity, either the GDP or GTP-bound state of RAS mutants have been targeted for therapeutic purposes. This information is now included in the manuscript on page 2 of the current version.

      Furthermore, it might be that I have overlooked this information, but this work is not the first finding of druggable sites in RAS (see e.g. review of Moor et al., Nat. Rev. Drug Discov. 2020). The authors should include such a comparison in their manuscript.

      We would like to thank the reviewer for suggesting this comprehensive review. We included it along with the sentence below in the revised version of the manuscript (page 2):

      ‘In these studies, the mutant RAS was targeted directly or in combination with other proteins including SOS, tyrosine kinase, SHP2, and RAF. Also, except for the KRASG12C mutant, the GTP-bound state has been targeted, as RAS mutants either lose their intrinsic or GAP-mediated GTPase activity. However, the intrinsic GTPase activity of KRASG12C is relatively higher than the other mutants which enables targeting the GDP-bound state of KRAS (Moor et al., Nat. Rev. Drug Discov. 2020).’

      We would also like to clarify that we do not claim our study is the first in the field presenting druggable sites in RAS but rather we claim that the study provides a perspective for mimicking the impact of phosphorylation in targeting undruggable mutant RAS.

      Especially the PMF presented in Figure 9 is erroneous, and all arguments based on this plot need to be discarded from the manuscript. From the Methods and Eq. (9), I assume the authors indeed use only the first two cumulants to calculate the PMF. The artificially low PMF with a difference of up to ~800 kcal/mol is a well-understood artefact (see Jäger et al., J. Chem. Mol. Model. 2022) that indicates the breakdown of the second-order approximation in Eq. (9) due to the presence of different pathways in the steered MD data set. This artefact overlays the PMF and obfuscates any information on the true free energy profile.

      We thank the reviewer for these details. The pulling directions remain the same. We indeed found that an insufficient number of samples, along with the breakdown of the second-order approximation due to the presence of different pathways in the SMD data set, led to this behavior. We have also included a more detailed error analysis by implementing block averaging (this information now appears on page 18). We hope that the conclusions we draw from the updated PMF curves support the findings to the satisfaction of the reviewer.
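      The artefact discussed above can be reproduced with a small numerical sketch. It assumes that Eq. (9) is the standard second-order cumulant expansion of the Jarzynski equality, ΔG ≈ ⟨W⟩ − σ²/(2 k_B T), which is not confirmed by the text here; the function names, the value of k_B T, and the work values are hypothetical, and the block-averaging variant shown is one common choice rather than necessarily the one used in the manuscript.

```python
import math
from statistics import mean, stdev, variance

KT_300K = 0.596  # kcal/mol, approximate k_B*T at 300 K (assumed temperature)


def cumulant_estimate(works, kT=KT_300K):
    """Second-order cumulant estimate: <W> - var(W) / (2 kT)."""
    return mean(works) - variance(works) / (2.0 * kT)


def exponential_estimate(works, kT=KT_300K):
    """Direct Jarzynski estimate: -kT * ln <exp(-W / kT)>.

    The minimum work is factored out before exponentiating for numerical stability.
    """
    w_min = min(works)
    avg = sum(math.exp(-(w - w_min) / kT) for w in works) / len(works)
    return w_min - kT * math.log(avg)


def block_standard_error(values, n_blocks):
    """Standard error of the mean from contiguous, equally sized blocks."""
    size = len(values) // n_blocks
    block_means = [mean(values[i * size:(i + 1) * size]) for i in range(n_blocks)]
    return stdev(block_means) / math.sqrt(n_blocks)


# Made-up work values (kcal/mol) mimicking two distinct pulling pathways.
two_pathways = [10.0] * 35 + [50.0] * 35
dg_cumulant = cumulant_estimate(two_pathways)        # dominated by the inflated variance
dg_exponential = exponential_estimate(two_pathways)  # dominated by the low-work pathway
```

      When the work distribution is bimodal, the variance term dominates and the cumulant estimate drops far below the exponential average, which is the kind of artificially low PMF the reviewer describes.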

    1. Author Response

      Reviewer #1 (Public Review):

      A limitation here is that this colony morphology only seems to manifest strongly in mutants lacking flagella, which I don't think is common among wild P. aeruginosa isolates. To the extent that groups of P. aeruginosa cells have been imaged in situ, e.g. in the sputum of CF patients, this kind of channel formation does not occur in more realistic conditions. See DePas et al. (2015) https://journals.asm.org/doi/epub/10.1128/mBio.00796-16. I think it's more likely that this colony morphology is idiosyncratic to the agar growth substrate on which the cells are growing in this case, so the more interesting thing here is the physics of the system rather than its applications to clinical or ecological settings.

      We thank the Reviewer for appreciating the novelty of our work. We have revised the third paragraph of the Discussion section to limit the generality of our findings in clinical or ecological settings (lines 440-456). Results of imaging P. aeruginosa cells in situ in sputum samples from cystic fibrosis patients are compared, and the limitation of using flagellum mutants is highlighted.

      The authors have established that flgK-null P. aeruginosa forms colonies with channels in this agar growth and incubation environment, and made a strong case for the physics underlying the spontaneous formation of this morphology. The idea that this morphology reflects a multicellular developmental program for P. aeruginosa is not strong, though, as this morphology is not found in the wild. In general, the idea that groups of microbes on agar are analogous to multicellular organisms with circulatory systems has little support from in-situ imaging experiments, or from fundamental evolutionary theory. So, I would advise shifting the introduction and discussion away from the multicellular organism focus toward a greater focus on the physics of the system and its potential for synthetic systems. See for example Yan et al. (2019) https://elifesciences.org/articles/43920

      We thank the Reviewer for the suggestion. We now focus more on the physics of canal formation in Introduction and Discussion (revising/adding texts in lines 93-99 and restructuring the paragraphs in Discussion section). We also put greater emphasis on the application of our findings for engineering living materials based on synthetic microbial consortia (lines 58-64, 428-438), while deleting the texts related to the implication for multicellularity in Introduction/Discussion.

    1. Author Response

      Reviewer #1 (Public Review):

      The study presented by AL Seufert et al. follows the trajectory of trained immunity research in the context of sterile inflammatory diseases such as gout, cardiovascular disease and obesity. Previous studies in mice have shown that a 4 week Western-type diet is sufficient to induce systemic trained immunity, with gross reorganization of the bone marrow to support a potentiated inflammatory response [PMID: 29328911]. The current study demonstrates that placing mice on a Western-type diet (WD) or the more extreme Ketogenic diet (KD; where carbohydrates are essentially eliminated from the diet) for 2 weeks results in a state of increased monocyte-driven immune responsiveness when compared to standard chow diets (SC). This increased immune responsiveness after high-fat diet resulted in a deadly hyper-inflammatory response in the mice to endotoxin (LPS) challenge in vivo.

      These initial findings as displayed in Figure 1 are made difficult to interpret because the authors use a mix of male and female mice coupled with very small sample sizes (n = 5 - 9). Male and female mice are shown to have dimorphic responses to LPS exposure in vivo, with males having elevated cytokine levels (TNF, IL-6, IL-1β, and, interestingly, also IL-10) and increased rates of severe outcomes to LPS challenge [PMID: 27631979]. As a reader it is impossible to discern from their methodological description what the proportion of the sexes were in each group, and therefore cannot determine if their data are skewed or biased due to sexual dimorphic responses to LPS rather than diet. Additionally, due to the very small sample sizes, the authors can't perform a stratified analysis based on sex to determine whether the diets are having the greatest effects in accordance with LPS-induced inflammation.

      The Reviewer brings up an important point. All studies with endotoxemia in wild-type conventional mice were carried out in 6–8-week female BALB/c mice, as mentioned in the Methods section under the “Ethical approval of animal studies” and “Endotoxin-induced model of sepsis” sections. This is extremely important to state more clearly in the Results text because, as Reviewer 1 correctly notes, sexual dimorphism and age differences can have very large effects on LPS treatment outcome. This was not stated clearly enough in the Results, and the age, sex, and background of mice are now explicitly stated in each Results and Figure Legend section for each experiment.

      When comparing SC to the KD, the authors identify large changes in fatty acid distribution circulating in the blood. The majority of the fatty acids were shown to relate to saturated fatty acids (SFA). Although Lauric, Myristic, and Myristovaccenic acid were the most altered after KD, the authors focus their research on the more thoroughly studied palmitic acid (PA).

      We followed up on multiple saturated fatty acids (SFAs; Myristic, Lauric, and Behenic acid) that were identified in the lipidomic data, and found no robust or repeatable phenotypes in vitro using physiologically relevant concentrations. The inability to reproduce some of the findings with these SFAs may be due to the instability of some of these fats in solution, and we plan to troubleshoot these assays in order to understand the complexity of SFA-dependent control of inflammation in macrophages. Please see Fig. R1 in this document for data showing LPS-stimulated BMDMs pre-treated with Myristic (Fig R1 A-C), Lauric (Fig R1 D-F), or Behenic (Fig R1 G-I) fatty acids. The physiological concentrations used in these studies were referenced from Perreault et al., 2014.

      Figure R1. The effect of Myristic Acid, Lauric Acid, and Behenic Acid on the response to LPS in macrophages. Primary bone marrow-derived macrophages (BMDMs) were isolated from age-matched (6-8 wk) C57BL/6 female and male mice. BMDMs were plated at 1x10^6 cells/mL and treated with either ethanol (EtOH; media with 0.05% or 0.35% ethanol to match MA and LA solutions, respectively), media (Ctrl), LPS (10 ng/mL) for 24 h, or myristic or lauric acid (MA, LA stock diluted in 0.05% or 0.35% EtOH; conjugated to 2% BSA) for 24 h, with and without a secondary challenge with LPS (10 ng/mL). After the indicated time points, RNA was isolated and expression of (A, B) tnf, (D, E) il-6, and (G, H) il-1β was measured via qRT-PCR. RAW 264.7 macrophages were thawed and cultured for 3-5 days, pelleted and resuspended in DMEM containing 5% FBS and 2% BSA, and treated identically to the BMDMs, with behenic acid (BA stock diluted in 1.7% EtOH) used as the primary stimulus. Expression of (C) tnf, (F) il-6, and (I) il-1β was measured via qRT-PCR. For all plates, all treatments were performed in triplicate. For all panels, a Student’s t-test was used for statistical significance. *p < 0.05; **p < 0.01; ***p < 0.001. Error bars show mean ± SD.

      PA was shown to increase the expression of inflammatory cytokines gene expression and protein production of TNF, IL-6 and IL-1β in bone marrow derived macrophages (BMDMs). The authors tie these effects to ceramide synthesis through a pharmacological blockade as well as the use of oleic acid, which allegedly sequesters ceramide synthesis. The author's claim that oleic acid supplementation reverses the inflammatory signaling induced by PA is invalid, as oleic acid was shown to induce a high level of cytokines in their model. When PA was added along with oleic acid, the cytokine levels returned to the levels produced by BMDM's stimulated with PA alone (see Figure 4 panels D- F).

      This was an unfortunate oversight in our revisions of this manuscript: the original Figure 5A-C was mislabeled (though colored correctly), and OA-12h → LPS-24h should have been switched with PA-12h → LPS-24h. These data were labeled correctly in the source file (Source_data_Fig5) and have since been updated in Figure 5 of the manuscript with correct labels. The corrected graphs have been split up in the resubmission in light of new data collected. Please see Fig 3K-M and Fig 5A-C.

      Finally the authors test whether injection of PA into mice can recapitulate the systemic inflammatory response seen by WD and KD feeding followed by LPS exposure. They were able to demonstrate that injecting 1 mM of PA, waiting for 12h, and then exposing the mice to LPS for 24h could similarly result in a hyper-inflammatory state resulting in greater mortality. The reviewer is skeptical that 1 mM of PA truly represents post-prandial PA levels as one would expect to see after a single fatty meal, and whether this injection is generally well tolerated by mice. Looking into the paper cited by Eguchi et al. to inform their methods, it's shown that the earlier study continuously infused an emulsified ethyl palmitate solution (which contained 600 mM) at a rate of 0.2 uL/min. As far as I can read by Eguchi, they only managed to reach a serum PA concentration of 0.5 mM. This is hardly the same thing as a single i.p. injection of 1 mM PA. and reflects a single bolus injection of double the serum concentration of PA achieved by Eguchi et al.

      The reviewer brings up an important point: Eguchi et al. did use infusions. From their data (Fig 1A), we calculated that after i.v. infusion of the 600 mM solution (total = 267 uL within 14 h; 0.2 uL/min) there was ~420 uM absolute PA within the blood. They were using C57BL/6 mice that were 23 g on average. Using these results, we extrapolated that one single 200 uL injection of a 750 mM PA solution within 6–8-week female BALB/c mice (~15-18 g) would equate to ~500 uM-1 mM of PA within the blood. Considering that obese healthy and unhealthy humans vary widely in total PA concentrations in the blood (0.3-4.1 mM) (1, 2), we moved forward with these calculations. That said, we thank the reviewer for this advice, and we agree that we had not definitively shown that we are increasing systemic levels of PA. Thus, we ran a lipidomic analysis of serum from SC-fed mice treated with Veh or PA for 12 h. We show that a 750 mM i.p. injection of ethyl palmitate enhances free PA levels in the serum to 173-425 μM at 2 h post-injection, which is within the reported range for humans on high-fat diets (0.3-4.1 mM). We have added these new data to Fig. S7A of the main manuscript.

      Importantly, the concentration in the PA-treated mice is greater than that of the Veh-treated mice; however, we believe the value shown is an underestimate of the maximum serum PA levels reached after i.p. injection, because free PA is known to be packaged into chylomicrons within enterocytes and to travel through the circulation with a half-life of less than an hour (3, 4). Thus, serum concentrations of free PA are only transiently enhanced by i.p. injection, and free PA is quickly taken up by adipose tissue, skeletal muscle, heart, and liver tissue. These complex lipid transport processes make it difficult to determine maximum concentrations of free PA in the serum.

      While all of the details concerning PA circulation following an i.p. injection are unknown, we suggest that this method of “force-feeding” is similar to dietary intake in that uptake of PA into the circulation occurs within the peritoneal space prior to traveling to the blood via the thoracic duct and right lymphatic duct (5).

      PA is known to induce inflammation in monocytes and macrophages, therefore the findings certainly make sense in the context of previously published literature. However, the authors have made some poor methodological decisions in their mouse studies, namely haphazardly switching between groups of young and old mice (4-6 weeks, 8-9 weeks, and 14-23 weeks), using different LPS injection protocols (6, 10, and 50 mg/ml of LPS), and including multiple sexes of mice. All of these drastically alter the interpretation of the data and prevent solid conclusions from being drawn.

      We appreciate this review and suggest that:

      1) For the LPS models, mice were all female and age-matched between 6-8 weeks. We are aware of sex differences in the endotoxemia model, which is why we specifically use female mice in our studies (6, 7). This is mentioned twice in the methods under the sections “Endotoxin-induced model of sepsis” and “Ethical approval of animal studies”. We have added these specifics of our model to all Results and Figure Legend sections for clarification.

      2) For Germ-free models, it is notoriously difficult to breed C57BL/6 germ-free mice. It was inherently difficult to obtain enough mice within the same sex and age to carry out these experiments, however since we have published in this model before with mixed sex and age we were aware that our WD phenotype is robust enough in these backgrounds (7). Further, we believe that seeing our robust phenotype independent of age or sex within germ-free mice provides more evidence of the strength of this phenotype. It is important to note that we induce endotoxemia within Germ-free mice with 50mg/kg, instead of 6mg/kg which is used in conventional mice, because this is our reported LD50 for mixed sex Germ-free C57BL/6, as we have published previously in detail (7). This difference is due to the presence of the microbiota (8, 9) and also germ-free mice have an immature immune system that correlates with a hyporesponsiveness to microbial products (10-12). We agree with the reviewer that the ages of the C57BL/6 germ-free mice are significantly older than our conventional 6-8 week mice, thus we confirmed that WD- and KD-fed conventional C57BL/6 female mice aged 20 – 21 weeks old still show enhanced disease severity and mortality in an LPS-induced endotoxemia model, compared to mice fed SC (Fig. S1G-H).

      Figure R2. PA treatment enhances survival in both female and male RAG-/- mice. Age-matched (8-9 wk) RAG-/- mice were injected i.v. with ethyl palmitate (PA, 750mM) or vehicle (Veh) solutions 12 h before C. albicans infection. Survival was monitored for 40h post-infection.

      3) In our preliminary results, we stratified survival during C. albicans infection between male and female C57BL/6 and found no notable difference in survival at 40h post IP infection with Candida albicans (Fig R2 A-B). However, the data presented in the manuscript on CFU is female kidney burden and we do not have data on fungal burden within male mice. This is an important piece of data that we would like to collect for understanding sex differences in the PA-dependent enhanced resistance to systemic C. albicans. We are currently addressing this question within the lab as well as elucidating the cell type and mechanism of PA-dependent enhanced fungal resistance.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors have used many cleverly chosen mouse models (periodontitis models; various models that lead to an on-switch of genes) and methods (immune localizations of high quality; single cell RNA sequencing) for the quest of elucidating a role for telocytes. They describe that more telocytes are present around teeth in mice that had periodontitis. These cells proliferated, and they expressed a pattern of genes that allowed macrophages to differentiate into a different direction. In particular, they showed that telocytes in periodontitis express HGF, a molecule that steers macrophage differentiation towards a less inflammatory cell type, paving the way for recovery. As a weakness, one could state that an attempt to extrapolate to human cells is missing.

      In the Discussion, we have a sentence that states further investigation in human periodontitis is required (see page 20, paragraph 416).

      Reviewer #3 (Public Review):

      Zhao and Sharpe identified telocytes in the periodontium. To address their contribution to periodontal diseases, they conducted scRNA-seq analysis and lineage tracing in mice. They demonstrated that telocytes are activated in periodontitis. The activated telocytes send HGF signals to surrounding macrophages, converting M2 to M1/M2 hybrid status. The study implies that targeting telocytes and HGF signal for the potential treatment of periodontitis.

      The significance of the study could be improved by authors testing if targeting telocytes or HGF signals could ameliorate periodontitis in the mouse model. The current form of the manuscript lacks the data that demonstrate the actual contribution of telocytes in the homeostasis of periodontium or progression of periodontitis.

      Major comments:

      1) I see the genetic validation of the role of telocytes or HGF signals are crucial to assure the significance of this manuscript. I recommend either of two experiments. a. testing the role of HGF signals by deleting the Hgf gene in telocytes. Using Wnt11-Cre; Hgf f/f mice, the authors could address the role of HGF signals in periodontitis. CX3CR1-Cre; cMet f/f mice will delete HGF signals in monocyte-derived macrophages. This will be another verification, but not sure if the PDL macrophages are derived from yolk sac or monocytes. b. measuring the contribution of telocytes in the homeostasis or disease progression. The mouse model could be challenging though, the system if achieved will be very informative. The authors could first check the expression of telocyte enriched genes, such as Lgr5 or Foxl1 reported previously in other tissue telocytes. Delete those genes under the Wnt1-Cre driver and check if telocyte lineage is removed. The system would be very useful for next-level study. DTA model could be an alternative, but Wnt1-Cre is vastly expressed in neural crest lineage.

      These are good suggestions but unfortunately not feasible, as we do not have all the mouse lines (e.g., Hgf f/f mice). Lgr5 and Foxl1 are used in the intestine but are not suitable for PDL tissue. CD34;DTA show CD34+ cells; however, we encountered challenges associated with induced genetic heterogeneity when using this model, preventing us from making concrete conclusions from the experiments using the CD34;DTA model. Lgr5/Foxl1 are either not expressed or overlap with CD34 and therefore do not seem suitable for us to pursue.

      2) This paper points out that the M1/M2 hybrid state of macrophages appears upon periodontitis. The authors could further characterize the hybrid macrophages by the expression of more markers, production of cytokines, and morphology. Need to clarify if this means some macrophages are in M1 state and others are in M2 state, or one macrophage possesses both M1 and M2 phenotype. Please conduct either FACS or immunofluorescence to demonstrate if one macrophage expresses both markers. Please introduce more information about the M1/M2 hybrid state of macrophage based on other present literature.

      In contrast to our single-cell sequencing data, immunolabelling did not allow us to determine whether a single macrophage possesses both M1 and M2 phenotypes.

      3) In the introduction part, the author lists several markers that can be used for telocyte identification, such as CD34+CD31-, CD34+c-Kit+, CD34+Vim+, CD34+PDGFRα+. Could authors explain why they chose CD34 CD31, but not other markers?

      As shown in the cluster images below, the other markers do not overlap very well with CD34 cells or, in the case of Vim, are expressed more ubiquitously. We generated a new supplementary figure (Supp Fig2) and explained this in the text (page 12, lines 235-238).

      4) In figure 5g, I don't think the yellow color cell shows the reduction trend in the Tivantinib treatment group compared with a control group. Please validate the observation by gene expression analysis, WB, etc. In addition, please show c-Met+ cells level in the Tivantinib treatment group and control group.

      New Supp Fig4 is included to show Met expression in homeostasis and periodontitis.

    1. Author Response

      Reviewer #2 (Public Review):

      Members of the WTF gene family can result in distorted meiosis (away from predicted Mendelian segregation) due to a "poison-antidote" like system. The authors find that members of the WTF gene family are found in numerous long-diverged species of fission yeast, that these genes show signatures of ongoing adaptive evolution, and that some of the novel wtf genes discovered here can also distort meiosis. Additionally, the authors show that gene conversion is quite common, and suggest that processes like gene conversion, expansion, and contraction underlie the long-term maintenance of this system in the face of potential loss of function by fixation and/or suppression. While interesting, the support for this vague model is unclear, and the novelty of this system compared to other drive systems was not sufficiently justified.

      The presented work is interesting, and I trust the bioinformatic and functional work (although both are a bit beyond my specialties). I am quite concerned, however, with the introduction, discussion, and take-home conclusions, which at times go beyond the data presented.

      Active meiotic driver genes throughout their history in fission yeasts?

      For example, the authors claim that "Our results suggest that the gene family has contained active meiotic driver genes throughout their history in fission yeasts." Evidence for such a claim would be interesting, but very difficult to obtain and not presented in this manuscript. Rather the authors show that wtf genes are present, evolving rapidly, and can distort meiosis in numerous species. What has happened in the intervening 100 million years is unclear, but I would be surprised if it included an unbroken streak of active meiotic drive. It is well known that drivers spread rapidly, and this group's previous modeling of the system showed that a wtf driver would spread rapidly. I also don't know of evidence for a strong enough cost to wtf/homozygotes in this system to sustain long-term balancing selection (which is what is needed for long-term driving). Otherwise, it seems that most of a driver's history would be fixed (or at least locally fixed), and that continuous drive activity is unlikely (unless the authors mean it "could drive").

      We agree that we have not demonstrated perpetual meiotic drive over the last 100 million years. Instead, we argue the family has retained the capacity to drive for that amount of time. We have modified the text to be more precise.

      We also disagree that long-term balancing selection is needed for long-term drive. Our work suggests an alternative option where long-term drive is not tied to a single locus, but is a property shared across the gene family. Active drive likely comes and goes at individual loci. We propose the evolution of wtf drivers is better described as a cycle of novel drivers being born and spreading (perhaps to local fixation), rather than one driver that is maintained at a given locus for a long time.

      The "model"

      The authors present a brief verbal "model" of the rejuvenation of wtf drivers by expansion/contraction/non-allelic gene conversion etc. While these processes all appear to occur in this system and likely play an important role in its evolution, it is hard to make much of this model. For example, I have trouble understanding the time scale at which these processes operate (e.g. do we expect fixation - which the authors have previously shown to occur quite rapidly at a single locus - to generally occur before an opportunity for one of these processes to occur and/or before suppression evolves? My sense is "probably"). If the scale of fixation is much more rapid than the other processes this system seems to fit in well with the other case discussed in the intro. Rather, it appears that the true excitement of the system, is the fast rate at which wtf emerges (likely facilitated by expansion/contraction/non-allelic gene conversion, etc.) and perhaps their slow breakdown after fixation (unexplained here).

      We have modified our discussion to better highlight the limitations in our understanding and to distinguish local fixation of a driver from global fixation in the species. We also clarify that mutations can rejuvenate fixed, suppressed, or pseudogenized wtf drivers.

    1. Author Response:

      Reviewer #1:

      It was previously shown that HGF and Met control development of the diaphragm muscle. In particular, the signal induces delamination and migration of muscle progenitor cells that colonize the diaphragm. The present manuscript by Sefton and coworkers confirms and extends these observations using (i) conditional mouse lines in which the HGF gene was targeted by Cre/loxP recombination in the pleuroperitoneal folds (Prx1-cre) and at other sites (PdgfraCreERT2), and (ii) Met inhibitors. Overall, the technical quality of the data on diaphragm muscle development is excellent; the conceptual advance over previous work is not exceptional; the evidence for Met/HGF-dependent development of the phrenic nerve is marginal and needs to be strengthened.

      The data show that fibroblasts provide HGF signals received by Met in muscle progenitor cells that is essential for diaphragm development. The PdgfraCreERT2 line was used to demonstrate that HGF produced by fibroblasts but not by muscle progenitors is essential for diaphragm development. Moreover, development of dorsal and ventral regions of diaphragm muscle requires continuous MET signaling. Thus, HGF is not only required for the delamination of progenitors, but also for proliferation and survival of those muscle progenitors that reached the anlage of the diaphragm.

      My major concern is the limited data on the HGF-dependent development of the phrenic nerve (defasciculation). While it is well documented that HGF acts as a trophic factor for motor neurons in culture, its role in development of motor neurons was highly debated due to the fact that some changes observed in Met or HGF mutant mice in vivo are also present in other mutants that lack the muscle groups derived from migrating muscle progenitors. Moreover, careful genetic analyses previously demonstrated indirect mechanisms of Met during motor neuron development, i.e. a non-cell-autonomous function of Met during the recruitment of motor neurons to PEA3-positive motor pools (Helmbacher et al., Neuron 2003).

      Sefton et al. provide an analysis of a single time point, one histological picture (3G, magnified in 3H) that indicate that in Met+/- animals defasciculation of the phrenic nerve does not occur correctly. This is accompanied by a quantification that barely reaches significance (Fig. 3K). Data shown in Fig. 7 using Met inhibitors show a major change in phrenic nerve branching, which is presumably due to the major change in diaphragm development, as conceded by the authors.

      Despite this weakness on the experimental side, the role of HGF/Met in phrenic nerve development is strongly emphasized in abstract /intro/discussion (e.g. line 414: However, PPF-derived HGF is crucial for the defasciculation and primary branching of the nerve, independent of muscle). The data need to be strengthened in order to conclude that HGF coordinates both, diaphragm muscle and phrenic development.

      In response to comments from the reviewers, we have more thoroughly investigated the role of Met in the development of the phrenic nerve and include two new sets of genetic experiments. In our first submission, we found a decreased number of phrenic nerve branches at E11.5 in Met Δ/Δ and Met Δ/+ compared with Met+/+ embryos. In the Met Δ/Δ embryos, no muscle is present in the diaphragm. Therefore, the greatly reduced branching in these embryos is likely a secondary effect of the requirement of Met in muscle progenitors for diaphragm muscularization. Of particular interest is the reduced branching in the Met Δ/+ embryos. Because the diaphragm is muscularized in these embryos, this suggested that Met may be required intrinsically in the phrenic nerve. One reviewer suggested that the reduced branching in the Met Δ/+ embryos could be due to a developmental delay in the whole embryo. However, we found that Met Δ/Δ and Met Δ/+ embryos are not overall delayed relative to Met+/+ embryos (as measured by crown rump length or limb length; Figure 3—figure supplement 1). Also, to increase the robustness of these data, we added additional embryos to the analysis. We then extended our analysis of Met Δ/Δ, Met Δ/+ and Met+/+ embryos to E12.5 (Figure 3—figure supplement 1) to see whether the branching phenotype persisted; we found that while the Met Δ/Δ embryos continue to have very few branches, the number of branches in Met Δ/+ embryos recovers and matches that of Met+/+ embryos.

      To explicitly test whether Met is required within the phrenic nerve, we used Olig2Cre/+ to conditionally delete Met. This line was chosen for its early expression in motor neurons (Zawadzka et al. 2010). We examined Olig2Cre/+; Met Δ/flox embryos compared to Olig2Cre/+; Metflox/+ embryos. We chose to include Olig2Cre in our controls because the Olig2Cre is a knock-in/knock-out and Olig2 has important roles in nerve development. However, deletion of Met did not affect the number of branches at E11.5 (Figure 3—figure supplement 2) or E12.5 (data not shown). These data suggest that Met does not intrinsically regulate phrenic nerve branching and that PPF-derived HGF instead regulates phrenic nerve branching indirectly via muscle. To test if HGF is sufficient to promote early stages of nerve branching in the absence of muscle, we turned to Pax3SpD/SpD mutants in which a point mutation in Pax3 prevents migration of muscle progenitors into the diaphragm (Figure 3—figure supplement 2). In these embryos, the diaphragm is muscleless, but the PPFs still express HGF. In these diaphragms the number of branches at E11.5 is severely reduced. These data demonstrate that in the absence of muscle the presence of HGF in the PPF fibroblasts is not sufficient to support phrenic nerve branching.

      Altogether our data demonstrate that PPF-derived HGF, via its regulation of muscle, controls the primary branching of phrenic nerve. The Met Δ/+ data demonstrate that Met controls phrenic nerve branching at E11.5 in a dose-dependent manner, but this effect is lost by E12.5. Although we see no obvious defects in muscle of Met Δ/+ diaphragms at later stages, the most parsimonious explanation of the reduced phrenic nerve branching at E11.5 is that this is due to fewer muscle progenitors at this time point.

      We thank the reviewers for prompting us to look at the role of HGF/Met in the phrenic nerve more closely. Our revised conclusions are presented in the Results and Discussion. We show that PPF-derived HGF is critical for integrating both muscle and phrenic nerve development, but now demonstrate that HGF’s regulation of phrenic nerve branching is via muscle, which is well-known to express multiple trophic factors required by motor neurons.

      In response to the specific point raised about the Met+/- embryos, the images shown in Figure 3G and H are representative whole mount confocal images of Met Δ/+ phrenic nerves. For each genotype, we immunolabeled and confocal imaged the phrenic nerves, rendered them in three dimensions, and counted (blinded to genotype) the number of branches. We have also added several additional embryos to this analysis. In Figure 7 the branching defects resulting from application of BMS777607 are similar, as expected, to the severe branching defects seen in the Met Δ/Δ embryos.

      Reviewer #2:

      In this study Sefton et al. interrogated the cellular source of HGF in the developing mouse embryo, which is required for muscularization and proper innervation of the diaphragm. The authors extended previous results that are over 20 years old by generating cell type specific mutants of Met and Hgf and found that inactivation of Hgf in fibroblasts via PDGFRa-CreERT2 results in muscle-less diaphragms. Similarly, Hgf inactivation in fibroblasts via PDGFRa-CreERT2 mostly abrogated limb muscle formation, formally identifying PDGFRa+ mesenchymal cells as the main source of HGF for generation of muscles in the limb and diaphragm. Similarly, inactivation of Hgf using Prx1-Cre, which targets fibroblasts derived from the pleuroperitoneal folds (PPFs), also prevented muscularization of the diaphragm and branching of the phrenic nerve. Interestingly, branching of the phrenic nerve was reduced in heterozygous Met mutants with normal diaphragm musculature, indicating that HGF-MET signaling plays a direct role in phrenic nerve branching and that failure of nerve branching in homozygous Met or Hgf mutants is not solely due to the loss of the musculature. Finally, the authors performed co-cultures between PPFs and myoblasts and found that pharmacological inhibition of MET lowered motility, survival and MyoD expression of myoblasts, leading to the claim that HGF-MET also plays a role in myogenic commitment.

      The study sheds further light on the source of HGF required for muscularization of the diaphragm and is well executed. However, the gain of knowledge is mainly incremental and deeper molecular insights are missing.

      We appreciate this critique and have added data to increase the molecular insight into the role of MET signaling in cell survival in the PPFs (Figure 6—figure supplement 1).

      The most interesting part of the study is the formal demonstration that HGF is not only required for delamination of muscle progenitor cells from the epithelium of somites but also to maintain migration at later stages. Similar conclusions have been made many years ago based on studies in chicken embryos but the current study clearly goes a step further.

      We agree that this is one of the interesting findings in our study.

      The part that claims a role of HGF-MET in myogenic commitment is not that well developed and may need further proof.

      We apologize for the misunderstanding here and have altered the text to indicate that we do not propose a role for Met in myogenic commitment, but rather that Met regulates the number of MyoD+ cells by promoting their survival.

      Reviewer #3:

      In this MS by Sefton et al., the authors investigate the role of HGF/MET pathway, as well as the cellular source of these molecules, during diaphragm development. In particular, the authors address the function of this pathway on muscle progenitors and phrenic nerve. They further provide evidence for the expression of HGF in pleuroperitoneal folds and for its requirement for muscle progenitor recruitment and maintenance during diaphragm muscle formation. This study is interesting and in general the results support the conclusions. The work could be improved by (1) providing appropriate controls for the role of HGF in the connective tissue and (2) linking the muscleless diaphragms and HGF to the hernia phenotype.

      We appreciate this review and have added controls for the role of HGF in the connective tissue. Specifically, PDGFRaCreER/+; HGF-/fl; RosamTmG/+ embryos have fibroblasts present in muscleless regions. We further link muscleless diaphragms and HGF to the hernia phenotype in our abstract. Absence of muscle is necessary, but not sufficient, for herniation.

    1. Author Response

      Reviewer #1 (Public Review):

      1-1. I do have some concerns that the differences in network clustering reported in Fig 6 may be due to noise and I think the comparisons against the HCP parcellation could be more robust. Specifically, with regard to the network clustering in Fig 6. The authors use a clustering algorithm (which is not explained) to cluster the parcels into different functional networks. They achieve this by estimating the mean time series for each parcel in each individual, which they then correlate between the n regions, to generate an nxn connectivity matrix. This they then binarise, before averaging across individuals within an age group. It strikes me that binarising before averaging will artificially reduce connections for which only a subset of individuals are set to zero. Therefore averaging should really occur before binarising. Then I think the stability of these clusters should be explored by creating random repeat and generation groups (as done for the original parcels) or just by bootstrapping the process. I would be interested to see whether after all this the observation that the posterior frontoparietal expands to include the parahippocampal gyrus from 3-6 months and then disappears at 9 months - remains.

      We thank the reviewer for this insightful comment on our clustering process. For the step of “binarizing before averaging”, we followed the method proposed by Yeo et al. (1). In this method, all correlation matrices are binarized according to individual-specific thresholds. Specifically, each individual-specific threshold is determined by percentile, so that only the top 10% of connections are kept and set to 1, while all other connections are set to 0. Yeo et al. (1) explained their motivation for doing so as “the binarization of the correlation matrix leads to significantly better clustering results, although the algorithm appears robust to the particular choice of the threshold”. A possible reason is that binarizing the connectivity of each individual offers a certain level of normalization, so that each subject contributes the same number of connections. If averaging occurs before binarizing, the actual connectivity contributed by different subjects would differ, which leads to bias. We also tested the stability of ‘binarizing first’ and ‘averaging first’, and the result is shown in Fig. R1 below. This figure suggests a similar conclusion to (1): binarizing before averaging leads to better clustering stability. We added the motivation for binarizing before averaging in the revised manuscript between line 577 and line 581.

      Fig. R1. The comparison of clustering stability of different methods. The red line refers to the clustering stability when binarizing the correlation matrices first and then averaging the matrices across individuals, while the blue line refers to the clustering stability when averaging the correlation matrices across individuals first and then binarizing the average matrix.
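
      The two orderings compared in Fig. R1 can be illustrated with a minimal NumPy sketch. This is not the code used in the study; the function names and the toy data are assumptions made for illustration, and only the 10% density comes from the response above.

```python
import numpy as np

def binarize_top_fraction(corr, keep=0.10):
    """Set the strongest `keep` fraction of off-diagonal connections to 1, the rest to 0."""
    n = corr.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # Subject-specific threshold: the (1 - keep) quantile of this matrix's own values.
    threshold = np.quantile(corr[off_diag], 1.0 - keep)
    return ((corr >= threshold) & off_diag).astype(float)

def binarize_then_average(corrs, keep=0.10):
    """Each subject contributes the same number of connections; entries are the
    fraction of subjects in which a connection survived thresholding."""
    return np.mean([binarize_top_fraction(c, keep) for c in corrs], axis=0)

def average_then_binarize(corrs, keep=0.10):
    """Alternative ordering: subjects with stronger correlations weigh more
    in the mean matrix before a single group threshold is applied."""
    return binarize_top_fraction(np.mean(corrs, axis=0), keep)

# Toy data (hypothetical): 8 subjects, 50 time points, 20 parcels.
rng = np.random.default_rng(0)
corrs = [np.corrcoef(rng.standard_normal((50, 20)), rowvar=False) for _ in range(8)]
group_binarize_first = binarize_then_average(corrs)
group_average_first = average_then_binarize(corrs)
```

      In the first ordering the group matrix is a per-connection vote across subjects, which is one way to read the normalization argument given above.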

      For the final clustering results, we repeated our clustering method on 100 bootstrap samples, and the final result is a majority vote for each parcel. The comparison of these two results is shown in Fig. R2. Overall, we observe good repeatability between these two results. However, we also observed that some parcels show different patterns between the two results, especially parcels that are spatially located around the boundaries of networks or the medial wall. The observation that “the posterior frontoparietal expands to include the parahippocampal gyrus from 3-6 months and then disappears at 9 months” was not reproduced in the bootstrapped results. These results suggest that the clustering method is quite robust and the discovered patterns are relatively stable, and that the differences between our original results and the bootstrapping results might be caused by noise or inter-subject variability.

      Fig. R2. Top panel: the network clustering results using all data in the original manuscript. Bottom panel: the network clustering results using majority voting through 100 times of bootstrapping. Black circles and red arrows point to the parahippocampal gyrus, which was included in the posterior frontoparietal network, and is not well repeated in the bootstrapped results. (M: months)
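
      The bootstrap-and-vote procedure described above could be sketched as follows. This is an illustrative outline only: `cluster_fn` is a hypothetical stand-in for the clustering algorithm, and the sketch assumes that network labels are already comparable across bootstrap runs (in practice, labels usually need to be matched to a reference first).

```python
import numpy as np

def majority_vote(label_runs):
    """label_runs: (n_runs, n_parcels) integer network labels, matched across runs.
    Returns the most frequent label per parcel (ties go to the lowest label)."""
    label_runs = np.asarray(label_runs)
    n_labels = int(label_runs.max()) + 1
    # counts[k, p] = number of runs that assigned parcel p to network k
    counts = np.stack([(label_runs == k).sum(axis=0) for k in range(n_labels)])
    return counts.argmax(axis=0)

def bootstrap_consensus(subject_matrices, cluster_fn, n_boot=100, seed=0):
    """Resample subjects with replacement, cluster each group-average matrix,
    and combine the runs by per-parcel majority vote."""
    rng = np.random.default_rng(seed)
    subject_matrices = np.asarray(subject_matrices)
    n_subjects = subject_matrices.shape[0]
    runs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_subjects, size=n_subjects)  # bootstrap sample
        runs.append(cluster_fn(subject_matrices[idx].mean(axis=0)))
    return majority_vote(runs)
```

      Parcels whose winning label receives only a slim majority are natural candidates for the unstable boundary parcels mentioned above.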

      1-2. Then with regard to the comparison against the HCP parcellation, this is only qualitative. The authors should see whether the comparison is quantitatively better relative to the null clusterings that they produce.

      Thank you for this great suggestion! As suggested, we added this quantitative comparison using the Hausdorff distance. As in the comparison of parcel variance and homogeneity, the 1,000 null parcellations were created by randomly rotating our parcellation by small angles on the spherical surface 1,000 times. We compared our parcellation and the null parcellations by evaluating their Hausdorff distances, on the spherical space, to specific areas of the HCP parcellation, including Brodmann's area 2, 3b, 4+3a, 44+45, V1, and MT+MST. The results are listed in Figure 4. Our parcellation generally shows significantly lower Hausdorff distances to the HCP parcellation, suggesting that it generates parcel borders that are closer to those of the HCP parcellation than the null parcellations do.

      However, we noticed that a small number of null parcellations show smaller Hausdorff distances than our parcellation. A possible reason is that our surface registration to the HCP template is based purely on cortical folding, without using functional gradient density maps, which are not available in the HCP template. As a result, high-quality functional alignment between our infant data and the HCP space is not ensured, which inevitably increases the Hausdorff distance between our parcellation and the HCP parcellation.
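
      A rotation-based null comparison of this kind can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the analysis code: border vertices are treated as points on a unit sphere, distance is the great-circle angle, and the 10-degree rotation limit and all function names are hypothetical choices.

```python
import numpy as np

def spherical_hausdorff(a, b):
    """Symmetric Hausdorff distance (radians) between two point sets on a sphere.
    a: (N, 3), b: (M, 3); rows are normalised to unit length."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    ang = np.arccos(np.clip(a @ b.T, -1.0, 1.0))  # pairwise great-circle angles
    return max(ang.min(axis=1).max(), ang.min(axis=0).max())

def random_small_rotation(rng, max_angle):
    """Rotation about a random axis by an angle in [-max_angle, max_angle] (Rodrigues formula)."""
    axis = rng.standard_normal(3)
    axis /= np.linalg.norm(axis)
    theta = rng.uniform(-max_angle, max_angle)
    k = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(theta) * k + (1.0 - np.cos(theta)) * (k @ k)

def null_comparison(ours, reference, n_null=1000, max_angle=np.deg2rad(10), seed=0):
    """Observed distance, null distances from rotated copies, and an empirical p-value."""
    rng = np.random.default_rng(seed)
    observed = spherical_hausdorff(ours, reference)
    null = np.array([
        spherical_hausdorff(ours @ random_small_rotation(rng, max_angle).T, reference)
        for _ in range(n_null)
    ])
    p_value = (np.sum(null <= observed) + 1) / (n_null + 1)
    return observed, null, p_value
```

      The empirical p-value is the share of null parcellations that lie at least as close to the reference border as the real one, so a handful of closer null rotations still yields a small p-value.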

      1-3. … not all individuals appear (from Fig 8) to be acquired exactly at the desired timepoints, so maybe the authors might comment on why they decided not to apply any kernel weighted or smoothing to their averaging? Pg. 8 'and parcel numbers show slight changes that follow a multi-peak fluctuation, with inflection ages of 9 and 18 months' explain - the parcels per age group vary - with age with peaks at 9 and 18 - could this be due to differences in the subject numbers, or the subjects that were scanned at that point?

      We do agree with the reviewer that subjects are not scanned at similar time points. This is designed in the data acquisition protocol to seamlessly cover the early postnatal stage so that we will have a quasi-continuous observation of the dynamic early brain development.

      We didn’t apply a kernel-weighted average or smoothing when generating the parcellation, as we would like each scan to contribute equally, so that each parcellation map is representative of the whole cohort in the covered age range rather than only part of it. Meanwhile, our final ‘age-common parcellation’ is intended to be representative of all subjects from birth to 2 years of age. However, we agree that for a parcellation map designed for use at one specific age only, e.g., 1-year-olds, a kernel-weighted average or an even more restricted age range could be a more appropriate solution.

      For the parcel number that likely shows fluctuations with subject numbers, we added an experiment, where we randomly selected 100 scans by considering the minimum scan number in each age group using bootstrapping and repeated this process 100 times. The average parcel number of each age is reported in the following Table R1. We didn’t observe strong changes in parcel numbers when reducing scan numbers, which further demonstrates that our parcel numbers do not show a strong relation to subject numbers. However, the parcel number does not increase greatly from 18M to 24M in the bootstrapping results, so we modified the statement in the manuscript about the parcel number to ‘… all parcel numbers fall between 461 to 493 per hemisphere, where the parcel number attains a maximum at around 9 months and then reduces slightly and remains relatively stable afterward. …’, which can be found between line 121 and line 122.
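
      The equal-sample-size check described above can be outlined as follows. This is a sketch under stated assumptions: `statistic` is a hypothetical placeholder for the full pipeline that turns a set of scans into a parcel count, and sampling with replacement is assumed because the response calls the procedure bootstrapping.

```python
import numpy as np

def subsampled_mean_statistic(groups, statistic, n_scans=100, n_repeats=100, seed=0):
    """For each age group, repeatedly draw `n_scans` scans, apply `statistic`
    to the draw, and return the mean over repeats.
    groups: dict mapping an age label to an array whose first axis indexes scans."""
    rng = np.random.default_rng(seed)
    results = {}
    for age, scans in groups.items():
        scans = np.asarray(scans)
        values = []
        for _ in range(n_repeats):
            idx = rng.choice(len(scans), size=n_scans, replace=True)
            values.append(statistic(scans[idx]))
        results[age] = float(np.mean(values))
    return results
```

      Because every age group is reduced to the same number of scans, any remaining differences between groups cannot be attributed to sample size alone.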

      1-4. I also have some residual concerns over the number of parcels reported, specifically as to whether all of this represents fine-grained functional organisation, or whether some of it represents noise. The number of parcels reported is very high. While Glasser et al 2016 reports 360 as a lower bound, it seems unlikely that the number of parcels estimated by that method would greatly exceed 400. This would align with the previous work of Van Essen et al (which the authors cite as 53) which suggests a high bound of 400 regions. While accepting Eickhoff's argument that a more modular view of parcellation might be appropriate, these are infants with underdeveloped brain function.

      We thank the reviewer for this insightful comment. We agree that some of the parcels might be affected by noise, as noise exists in each step, such as data acquisition, image processing, surface reconstruction, and registration, especially considering that functional MRI is noisier than structural MRI. Though our experiments show that our parcellation is fine-grained and suitable for studying infant brain functional development, it is hard to validate directly and quantitatively, as there is no ground truth available.

      Despite this, we are still motivated to create fine-grained parcellations: with the growth of larger and higher-resolution imaging data and advanced computational methods, parcellations with more fine-grained regions are desired for downstream analyses, especially considering the hierarchical nature of brain organization (2). The main reason that our method generates much finer parcellation maps is that both our registration and parcellation processes are based on the functional gradient density, which characterizes a fine-grained feature map based on fMRI. This leads to both better inter-subject alignment of functional boundaries and finer region partitions. This strategy differs from that of Glasser et al. (3), which jointly considers multimodal information for defining parcel boundaries, so that parcels revealed purely by functional MRI might be ignored in the HCP parcellation. We hope our parcellation framework can be a useful reference for this research direction. We added this discussion in the revised manuscript between line 268 and line 271.

      For the parcel number, even without performing surface registration based on fine-grained functional features, recent adult fMRI-based parcellations have greatly increased parcel numbers, such as up to 1,000 parcels in Schaefer et al. (4), 518 parcels in Peng et al. (5), and 1,600 parcels in Zhao et al. (6). For infants, we agree that functional connectivity might not be as strong as in adults. However, there are opinions (7-9) that the basic units of functional organization are likely to be present in infant brains, and that brain functional development gradually shapes the brain networks. Therefore, the functional parcel units in infants could be on a scale comparable to that of adults. Even so, we agree that more research needs to be performed on larger datasets for better evaluation. We added this discussion in the revised manuscript between line 275 and line 280.

      1-5. Further comparisons across different subjects based on small parcels increases the chances of downstream analyses incorporating image registration noise, since as Glasser et al 2016 noted, there are many examples of topographic variation, which diffeomorphic registration cannot match. Therefore averaging across individuals would likely lose this granularity. I'm not sure how to test this beyond showing that the networks work well for downstream analyses but I think these issues should be discussed.

      We agree with the reviewer that averaging across individuals inevitably brings some registration errors to the parcellation, especially for regions with high topographic variation across subjects, which would lead to loss of granularity in these regions. We believe this is an important issue that exists in most methods on group-level parcellations, and the eventual solution might be individualized parcellation, which will be our future work. We added this discussion in the revised manuscript between line 288 and line 292.

      We also agree with the reviewer that downstream analyses are important evaluations for parcellations. We provided a beta version of our parcellation with 602 parcels (10) to our colleagues, and they tested our parcellation in the task of infant individual recognition across ages using functional connectivity, to explore infant functional connectome fingerprinting (10). We compared the performance of different parcellations with 602 ROIs (our beta version), 360 ROIs (HCP MMP parcellation (3)), and 68 ROIs (FreeSurfer parcellation (11)). The results (Fig. R3) show that our parcellation with a higher parcellation number yields better accuracy compared to other parcellations. We added a description of this downstream application in the discussion between line 284 and line 287.

      Fig. R3. The comparison of different parcellations for infant individual recognition across age based on functional connectivity (figure source: Hu et al. (10)). The parcellation with 602 ROIs is the beta version of our parcellation, 360 ROIs stands for HCP MMP parcellation (3) and 68 ROIs stands for the FreeSurfer parcellation (11). This downstream task shows that a higher parcellation number does lead to better accuracy in the application.
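
      For readers unfamiliar with connectome fingerprinting, the identification task can be sketched as a nearest-neighbour match on connectivity vectors. This is a generic, correlation-based illustration and an assumption on our part; the actual procedure is described in Hu et al. (10) and may differ.

```python
import numpy as np

def identification_accuracy(session_a, session_b):
    """session_a, session_b: (n_subjects, n_features) vectorised connectivity
    matrices from two scans; row i is the same subject in both.
    Each subject in A is matched to the most correlated subject in B."""
    a = session_a - session_a.mean(axis=1, keepdims=True)
    b = session_b - session_b.mean(axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    similarity = a @ b.T                   # Pearson correlation for every subject pair
    predicted = similarity.argmax(axis=1)  # best match in session B
    return float(np.mean(predicted == np.arange(len(a))))
```

      A finer parcellation increases the number of features per subject, which is one plausible route by which it could improve identification accuracy.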

      1-6. Finally, I feel the methods lack clarity in some areas and that many key references are missing. In general I don't think that key methods should be described only through references to other papers. And there are many references, particular to FSL papers, that are missing.

      We thank the reviewer for this great suggestion. We added related references for FLIRT, FSL, MCFLIRT, and TOPUP. For the alignment to the HCP 32k_LR space, we first aligned all subjects to the fsaverage space using spherical demons, then used part of the HCP pipeline (12) to map the surface from the fsaverage space to the HCP 164k_LR space, and finally downsampled to the 32k_LR space. We modified this citation to reference the HCP pipeline by Glasser et al. (12) instead and detailed this registration process in the revised manuscript between line 434 and line 440, as below:

      “… The population-mean surface maps were mapped to the HCP 164k ‘fs_LR’ space using the deformation field that deforms the ‘fsaverage’ space to the ‘fs_LR’ space released by Van Essen et al. (13), which was obtained by landmark-based registration. By concatenating the three deformation fields of steps 1, 3, and 4, we directly warped all cortical surfaces from individual scan spaces to the HCP 164k_LR space and then resampled them to 32k_LR using the HCP pipeline (12), thus establishing vertex-to-vertex correspondences across individuals and ages …”

      Reviewer #2 (Public Review):

      2-1. Diminishing enthusiasm is the lack of focus in the result section, the frequent use of jargon, and figures that are often difficult to interpret. If those issues are addressed, the proposed atlas could have a high impact in the field especially as it is aligned with the template of the Human Connectome Project.

      We’d like to thank Reviewer #2 for the appreciation of our atlas. According to the reviewer’s suggestion, we went through the manuscript again by focusing on correcting the use of jargon, clarity in the result section, as well as figures and figure captions. We hope our corrections can help explain our work to a broader community. Our revisions are accordingly detailed in the following. Meanwhile, our parcellation maps have been aligned with the templates in HCP and FreeSurfer and made available via NITRC at: https://www.nitrc.org/projects/infantsurfatlas/.

      References

      1. B. Thomas Yeo, F. M. Krienen, J. Sepulcre, M. R. Sabuncu, D. Lashkari, M. Hollinshead, J. L. Roffman, J. W. Smoller, L. Zöllei, J. R. Polimeni, The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of neurophysiology 106, 1125-1165 (2011).

      2. S. B. Eickhoff, R. T. Constable, B. T. Yeo, Topographic organization of the cerebral cortex and brain cartography. NeuroImage 170, 332-347 (2018).

      3. M. F. Glasser, T. S. Coalson, E. C. Robinson, C. D. Hacker, J. Harwell, E. Yacoub, K. Ugurbil, J. Andersson, C. F. Beckmann, M. Jenkinson, S. M. Smith, D. C. Van Essen, A multi-modal parcellation of human cerebral cortex. Nature 536, 171-178 (2016).

      4. A. Schaefer, R. Kong, E. M. Gordon, T. O. Laumann, X.-N. Zuo, A. J. Holmes, S. B. Eickhoff, B. T. Yeo, Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity MRI. Cerebral Cortex 28, 3095-3114 (2018).

      5. L. Peng, Z. Luo, L.-L. Zeng, C. Hou, H. Shen, Z. Zhou, D. Hu, Parcellating the human brain using resting-state dynamic functional connectivity. Cerebral Cortex, (2022).

      6. J. Zhao, C. Tang, J. Nie, Functional parcellation of individual cerebral cortex based on functional mri. Neuroinformatics 18, 295-306 (2020).

      7. W. Gao, S. Alcauter, J. K. Smith, J. H. Gilmore, W. Lin, Development of human brain cortical network architecture during infancy. Brain Structure and Function 220, 1173-1186 (2015).

      8. W. Gao, H. Zhu, K. S. Giovanello, J. K. Smith, D. Shen, J. H. Gilmore, W. Lin, Evidence on the emergence of the brain's default network from 2-week-old to 2-year-old healthy pediatric subjects. Proceedings of the National Academy of Sciences 106, 6790-6795 (2009).

      9. K. Keunen, S. J. Counsell, M. J. Benders, The emergence of functional architecture during early brain development. NeuroImage 160, 2-14 (2017).

      10. D. Hu, F. Wang, H. Zhang, Z. Wu, Z. Zhou, G. Li, L. Wang, W. Lin, G. Li, U. U. B. C. P. Consortium, Existence of Functional Connectome Fingerprint during Infancy and Its Stability over Months. Journal of Neuroscience 42, 377-389 (2022).

      11. R. S. Desikan, F. Ségonne, B. Fischl, B. T. Quinn, B. C. Dickerson, D. Blacker, R. L. Buckner, A. M. Dale, R. P. Maguire, B. T. Hyman, An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage 31, 968-980 (2006).

      12. M. F. Glasser, S. N. Sotiropoulos, J. A. Wilson, T. S. Coalson, B. Fischl, J. L. Andersson, J. Xu, S. Jbabdi, M. Webster, J. R. Polimeni, The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage 80, 105-124 (2013).

    1. Author Response

      Reviewer #1 (Public Review):

      Point 1) There is affluent evidence that the cortical activity in the waking brain, even in head restrained mice, is not uniform but represents a spectrum of states ranging from complete desynchronization to strong synchronization, reminiscent of the up and down states observed during sleep (Luczak et al., 2013; McGinley et al., 2015; Petersen et al., 2003). Moreover, awake synchronization can be local, affecting selective cortical areas but not others (Vyazovskiy et al., 2011). State fluctuations can be estimated using multiple criteria (e.g., pupil diameter). The authors consider reduced glutamatergic drive or long-range inhibition as potential sources of the voltage decrease but do not attempt to address this cortical state continuum, which is also likely to play a role. For example: does the voltage inactivation following ripples reflect a local downstate? The authors could start by detecting peaks and troughs in the voltage signal and investigate how ripple power is modulated around those events.

      Our study is correlational, and hence, we cannot speak to any causal role that the awake hippocampal ripples may play in the post-ripple hyperpolarization observed in aRSC. It is indeed possible that the post-awake-ripple neocortical hyperpolarization is independent of ripples and reflects other mechanisms to which our experiments have been blind. One such mechanism is neocortical synchronization in the awake state. As reviewer 1 pointed out, it is possible that a proportion of hippocampal ripples occur before neocortical awake down-states. To test this hypothesis, we triggered the ripple power signal by the troughs (as proxies of awake down-states) and peaks (as proxies of awake up-states) of the voltage signals, captured from different neocortical regions, during periods of high ripple activity when the probability of neocortical synchronization is highest (McGinley et al., 2015; Nitzan et al., 2020). According to this analysis (see the figure below), the ripple power was, on average, higher before troughs of the aRSC voltage signal than before those of other regions. On the other hand, the ripple power, on average, was not higher after the peaks of the aRSC voltage signal than after those of other regions. This observation supports the hypothesis that a local awake down-state could occur in aRSC after the occurrence of a portion of hippocampal ripples. However, a recent work whose preprint version was cited in our submission (Chambers et al., 2022, 2021) reported that, out of 33 aRSC neurons whose membrane potentials were recorded, only 1 showed up-/down-state transitions (bimodal membrane potential distribution). Still, a portion (10 out of 30) of the remaining neurons showed an abrupt post-ripple hyperpolarization. In addition, they reported a modest post-ripple modulation of aRSC neurons’ membrane potential (~20% of the up/down-state transition range). Hence, these results suggest that the post-ripple aRSC hyperpolarization is not necessarily the result of down-states in aRSC. A paragraph discussing this point was added to the discussion (lines 262-279).
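
      For readers who want to see the shape of a trough-/peak-triggered average, a minimal sketch follows. It is an illustration only and not the analysis code used in the study: the function names, the toy signals, and the use of strict local extrema without any amplitude threshold or restriction to high-ripple-activity periods are all assumptions.

```python
def find_extrema(signal, kind="trough"):
    """Return indices of strict local minima (troughs) or maxima (peaks)."""
    sign = -1.0 if kind == "peak" else 1.0
    s = [sign * v for v in signal]  # flipping the sign turns peaks into troughs
    return [i for i in range(1, len(s) - 1) if s[i] < s[i - 1] and s[i] < s[i + 1]]


def triggered_average(trace, event_idx, half_window):
    """Average `trace` in windows centered on each event index.

    Events whose window would run past either end of the trace are skipped.
    """
    windows = [
        trace[i - half_window:i + half_window + 1]
        for i in event_idx
        if i - half_window >= 0 and i + half_window < len(trace)
    ]
    n = len(windows)
    return [sum(w[k] for w in windows) / n for k in range(2 * half_window + 1)]


# Hypothetical toy data: one voltage trace and a ripple-power trace of equal length.
voltage = [0, 1, 0, -1, 0, 1, 0, -1, 0]
ripple_power = [0, 0, 4, 1, 0, 0, 6, 3, 0]

troughs = find_extrema(voltage, "trough")
peaks = find_extrema(voltage, "peak")
power_around_troughs = triggered_average(ripple_power, troughs, half_window=1)
```

      In this toy example the ripple power is largest one sample before each voltage trough, which is the kind of asymmetry the analysis above looks for.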

      Mean ripple power triggered by troughs and peaks of voltage signal captured from aRSC, V1, and FLS1. Zero time represents the timestamp of neocortical troughs/peaks. The shading represents SEM (n = 6 animals).

      Point 2) Ripples are known to be heterogeneous in multiple parameters (e.g., power, duration, isolated events/ ripple bursts, etc.), and this heterogeneity was shown to have functional significance on multiple occasions (e.g. Fernandez-Ruiz et al., 2019 for long-duration ripples; Nitzan et al., 2022 for ripple magnitude; Ramirez-Villegas et al., 2015 for different ripple sharp-wave alignments). It is possible that the small effect size shown here (e.g. 0.3 SD in Fig. 2a) is because ripples with different properties and downstream effects are averaged together? The authors should attempt to investigate whether ripples of different properties differ in their effects on the cortical signals.

      The seemingly small effect size (e.g. 0.3 SD in Fig. 2a) arises because the individual peri-ripple voltage/glutamate traces were z-scored against a peri-non-ripple distribution and then averaged. Alternatively, the peri-ripple traces could have been averaged first, and the averaged trace could have been z-scored against a sampling distribution constructed from the abovementioned peri-non-ripple distribution, where the sample size would have been the number of ripples detected for a specific animal. In the latter case, the standard deviation of the sampling distribution would have been used as the divisor in the z-scoring process, as opposed to the former case, where the standard deviation of the original peri-non-ripple distribution would have been used. Since the standard deviation of the sampling distribution is smaller than the standard deviation of the original distribution by a factor of √(sample size), the final z-scored values in the latter case would be higher than those in the former case by a factor of √(sample size). For instance, if the sample size in Fig. 2A (number of ripples) was 100, the mean z-scored value would be 0.3*10 = 3. In any case, it is of interest to investigate the relationship between the ripple and neocortical activity features.
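
      The √(sample size) relationship between the two z-scoring orders can be checked numerically. The sketch below is illustrative only: the simulated values are hypothetical, and the sampling distribution is taken to be the theoretical one (standard deviation divided by √n) rather than one constructed by resampling.

```python
import math
import random
import statistics


def zscore_then_average(event_values, baseline):
    """Z-score every event against the baseline distribution, then average."""
    mu = statistics.fmean(baseline)
    sd = statistics.pstdev(baseline)
    return statistics.fmean((v - mu) / sd for v in event_values)


def average_then_zscore(event_values, baseline):
    """Average the events first, then z-score the mean against the sampling
    distribution of the baseline mean, whose standard deviation is sd / sqrt(n)."""
    mu = statistics.fmean(baseline)
    sd = statistics.pstdev(baseline)
    n = len(event_values)
    return (statistics.fmean(event_values) - mu) / (sd / math.sqrt(n))


rng = random.Random(0)
baseline = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # hypothetical peri-non-ripple samples
events = [rng.gauss(0.3, 1.0) for _ in range(100)]     # hypothetical peri-ripple samples

z_each = zscore_then_average(events, baseline)
z_mean = average_then_zscore(events, baseline)
ratio = z_mean / z_each  # sqrt(100) = 10, up to floating-point error
```

      The ratio depends only on the number of events, not on the simulated effect size, so the same factor applies to any per-animal ripple count.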

      To investigate the relationship between the hippocampal ripple power and the peri-ripple neocortical voltage activity, we focused on the agranular retrosplenial cortex (aRSC) as it showed the highest level of modulation around ripples. To get an idea of what features of the aRSC voltage activity might be correlated with the ripple power, the ripples were divided into 8 subgroups using 8-quantiles of their power distribution, and the corresponding aRSC voltage traces were averaged for each subgroup (similar to the work of Nitzan et al. (Nitzan et al., 2022)). The results of this analysis are summarized in the figure below.
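
      The grouping step can be sketched as follows. This is a rank-based illustration under assumed names (`quantile_groups`, `group_mean_traces`) and toy data. It is not the code used for the figure, and it handles ties and uneven group sizes only in the simplest way.

```python
def quantile_groups(values, n_groups=8):
    """Assign each value a group index 0..n_groups-1 according to its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(values)
    return groups


def group_mean_traces(traces, groups, n_groups=8):
    """Average the traces belonging to each group, sample by sample."""
    means = []
    for g in range(n_groups):
        members = [tr for tr, gi in zip(traces, groups) if gi == g]
        means.append([sum(col) / len(members) for col in zip(*members)])
    return means


# Hypothetical toy data: 16 ripples, each with a power value and a 2-sample trace.
powers = [15 - i for i in range(16)]
traces = [[p, p + 1] for p in powers]

groups = quantile_groups(powers)
mean_traces = group_mean_traces(traces, groups)
```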

      Left: peri-ripple aRSC voltage trace was triggered on ripples in the odd-numbered ripple power subgroups for each animal and then averaged across 6 animals. The standard errors of the mean were not shown for the sake of simplicity. Right: the same as the left panel but for only lowest and highest power subgroups. The shading represents the standard error of the mean.

      These results suggested that there might be a positive correlation between the ripple power and the pre-ripple and post-ripple aRSC voltage amplitude. To test this possibility, Pearson’s correlation between the ripple power and pre-/post-ripple aRSC amplitude was calculated for each animal separately. The ripple power for each detected ripple was defined as the average of the ripple-band-filtered, squared, and smoothed hippocampal LFP trace from -50 ms to +50 ms relative to the ripple's largest trough timestamp (ripple center). The pre- and post-ripple aRSC amplitude for each ripple was calculated as the average of the aRSC voltage trace over the intervals [-200 ms, 0] and [0, 200 ms], respectively. The results are as follows.
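
      A minimal sketch of the two ingredients of this analysis, a window average and Pearson's correlation, is given here for illustration. The helper names and the toy numbers are assumptions, and the p-value is omitted; in practice a library routine such as `scipy.stats.pearsonr` returns both the coefficient and its p-value.

```python
import math


def window_mean(trace, times_ms, start_ms, stop_ms):
    """Mean of the samples whose timestamps fall in [start_ms, stop_ms]."""
    vals = [v for v, t in zip(trace, times_ms) if start_ms <= t <= stop_ms]
    return sum(vals) / len(vals)


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical peri-ripple voltage trace sampled every 100 ms around the ripple center.
times_ms = [-200, -100, 0, 100, 200]
trace = [1.0, 3.0, 5.0, 7.0, 9.0]
pre_amplitude = window_mean(trace, times_ms, -200, 0)
post_amplitude = window_mean(trace, times_ms, 0, 200)

# Hypothetical per-ripple values: ripple power and pre-ripple amplitude.
ripple_power = [1.0, 2.0, 3.0, 4.0]
pre_amplitudes = [2.0, 4.0, 6.0, 8.0]
rho = pearson_r(ripple_power, pre_amplitudes)
```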

      Top: the scatter plots of the ripple power and pre-ripple aRSC voltage amplitude for individual animals. The black lines in each graph represent the linear regression line. The blue circles in each graph are associated with one ripple. The Pearson’s correlation values (ρ) and the p-value of their corresponding statistical significance are represented on top of each graph. Bottom: the same as top graphs but for post-ripple aRSC amplitude.

      According to this analysis, 4 out of 6 animals showed a weak positive correlation (ρ = 0.0806 ± 0.0115; mean ± std), 1 animal showed a negative correlation (ρ = -0.20183), and 1 animal did not show a statistically significant correlation (p-value > 0.05) between ripple power and pre-ripple aRSC voltage amplitude. Moreover, 2 out of 6 animals showed a negative correlation (ρ = -0.1 and -0.14), and 4 animals did not show a statistically significant correlation (p-value > 0.05) between ripple power and post-ripple aRSC voltage amplitude.

      To check that the correlation results were not influenced by the extreme values of the ripple power and aRSC voltage, we repeated the same correlation analysis after removing the ripples associated with the top and bottom 5% of the ripple power and aRSC voltage values. According to this analysis, 1 out of 6 animals showed a negative correlation (ρ = -0.13), and 5 animals did not show a statistically significant correlation (p-value > 0.05) between ripple power and pre-ripple aRSC voltage amplitude. Moreover, 2 out of 6 animals showed a negative correlation (the same animals that showed a negative correlation before removing the extreme values; ρ = -0.12 and -0.14), 1 animal showed a positive correlation (ρ = 0.1), and 3 animals did not show a statistically significant correlation (p-value > 0.05) between ripple power and post-ripple aRSC voltage amplitude.
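
      One way to implement such trimming is sketched here, assuming that a ripple is dropped when either its power or its voltage amplitude falls in the top or bottom 5% by rank. The function name and the toy data are hypothetical, and the exact rule used in the analysis may differ.

```python
def trim_extremes(x, y, fraction=0.05):
    """Keep only the pairs whose x and y both lie outside the top and bottom
    `fraction` of their own rank-ordered values."""
    def kept_indices(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        k = int(len(v) * fraction)  # number of values dropped at each end
        return set(order[k:len(v) - k])

    keep = kept_indices(x) & kept_indices(y)
    idx = [i for i in range(len(x)) if i in keep]
    return [x[i] for i in idx], [y[i] for i in idx]


# Hypothetical data: 20 ripples, one of which has an outlying voltage amplitude.
power = list(range(20))
amplitude = list(range(20))
amplitude[10] = 1000

power_trimmed, amplitude_trimmed = trim_extremes(power, amplitude)
```

      The correlation can then be recomputed on the trimmed pairs to see whether a few extreme ripples were driving the original estimate.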

      Based on these results, we cannot conclude that there is a meaningful correlation between the ripple power and the amplitude of aRSC voltage activity before and after the ripples. It is worth noting that Nitzan et al. (see Fig S6 in (Nitzan et al., 2022)) did not report a statistically significant correlation between ripple power octile number (obtained by discretizing a continuous-valued random variable into 8 subgroups) and pre-ripple firing rate of the mouse visual cortex. However, they reported a statistically significant negative correlation (ρ = -0.13) between the ripple power octile number and post-ripple firing rate of the mouse visual cortex. It appears that their reported negative correlation was influenced by the disproportionately larger values of the firing rate associated with the first ripple power octile compared to the other octiles. Therefore, repeating their analysis after removing the first octile would probably lead to a weak correlation value close to 0.

      Next, we investigated the relationship between ripple duration and aRSC voltage activity. To get an idea of what features of the aRSC voltage activity might be correlated with the ripple duration, the ripples were divided into 8 subgroups using 8-quantiles of their duration distribution, and the corresponding aRSC voltage traces were averaged for each subgroup. The results of this analysis are summarized in the figure below.

      Left: peri-ripple aRSC voltage trace was triggered on ripples in the odd-numbered ripple duration subgroups for each animal and then averaged across 6 animals. The standard errors of the mean were not shown for the sake of simplicity. Right: the same as the left panel but for only the lowest and highest duration subgroups. The shading represents the standard error of the mean.

      These results do not reveal a qualitative difference between the patterns of aRSC peri-ripple voltage modulation and ripple duration. However, the same correlation analysis performed for the ripple power was also conducted for the ripple duration. Only 1 animal out of 6 showed a statistically significant correlation (ρ = 0.08) between pre-ripple aRSC voltage amplitude and ripple duration.

      Moreover, only 1 animal out of 6 showed a statistically significant correlation (ρ = -0.08) between post-ripple aRSC voltage amplitude and ripple duration. In conclusion, there does not seem to be a meaningful linear relationship between peri-ripple aRSC voltage amplitude and ripple duration.

      Next, we investigated whether the peri-ripple aRSC voltage modulation differs depending on whether a single or a bundled ripple occurs in the dorsal hippocampus. The bundled ripples were detected following the method described in our previous work (Karimi Abadchi et al., 2020). We found that 9.4 ± 3.5 (mean ± std across 6 animals) percent of the ripples occurred in bundles. Then, the aRSC voltage trace was triggered by the centers of the single as well as centers of the first/second ripples in the bundled ripples, averaged for each animal, and averaged across 6 animals. The results of this analysis are represented in the following figure.

      Left: animal-wise average of mean peri-ripple aRSC voltage trace triggered by centers of the single ripples and centers of the first ripple in the bundled ripples. Right: Same as the left but triggered by the centers of the second ripple in the bundled ripples.

      These results suggest that the amplitude of aRSC voltage activity is larger before bundled than single ripples, and the timing of aRSC voltage activity is shifted to the later times for bundled versus single ripples. The pre-ripple larger depolarization might signal the occurrence of a bundled ripple (similar to larger pre-bundled- than pre-single-ripple deactivation observed during sleep (Karimi Abadchi et al., 2020)).

      Point 3) The differences between the voltage and glutamate signals are puzzling, especially in light of the fact that in the sleep state they went hand in hand (Karimi Abadchi et al., 2020, Fig. 2). It is also somewhat puzzling that the aRSC is the first area to show voltage inactivation but the last area to display an increase in glutamate signal, despite its anatomical proximity to hippocampal output (two synapses away). The SVD analysis hints that the glutamate signal is potentially multiplexed (although this analysis also requires more attention, see below), but does not provide a physiologically meaningful explanation. The authors speculate that feed-forward inhibition via the gRSC could be involved, but I note that the aRSC is among the two major targets of the gRSC pyramidal cells (the other being homotypical projections) (Van Groen and Wyss, 2003), i.e., glutamatergic signals are also at play. To meaningfully interpret the results in this paper, it would be instrumental to solve this discrepancy, e.g., by adding experiments monitoring the activity of inhibitory cells.

      Observing that glutamate and voltage signals do not go hand-in-hand in awake versus sleep states was surprising for us as well, and it was the main reason that SVD analysis was performed, especially because a portion of aRSC excitatory neurons showed elevated calcium activity despite the reduction of voltage and delayed elevation of glutamate signals in aRSC at the population level. At the time of initial submission, pre-ripple reduction and post-ripple elevation of calcium activity in a portion of three subclasses of the superficial aRSC inhibitory neurons were reported (Chambers et al., 2022, 2021), and this was the basis of our speculation on the potential involvement of feed-forward inhibition in the post-ripple voltage reduction. We speculated that the source of this potential feed-forward inhibition could be gRSC excitatory neurons, as reviewer 1 pointed out, or other neocortical or subcortical regions projecting to aRSC. It is also possible that feedback inhibition is involved, where the principal aRSC neurons that are excited by gRSC (as reviewer 1 pointed out) or any other region, including aRSC itself, excite aRSC inhibitory neurons.

      Point 4) I am puzzled by the ensemble-wise correlation analysis of the voltage imaging data: the authors point to a period of enhanced positive correlation between cortex and hippocampus 0-100 ms after the ripple center but here the correlation is across ripple events, not in time. This analysis hints that there is a positive relationship between CA1 MUA (an indicator for ripple power) and the respective cortical voltage (again an incentive to separate ripples by power), i.e. the stronger the ripple the less negative the cortical voltage is, but this conclusion is contradictory to the statements made by the authors about inhibition.

      A closer look at Figure 2B iv reveals that the elevation of the cross-correlation function between peri-ripple aRSC voltage and hippocampal MUA starts with a short delay (~20 ms) and peaks around 75 ms after the ripple centers. This means the maximum correlation between the two signals occurs at the point (75 ms, 75 ms) on the MUA time-voltage time plane, whose origin (i.e. the point (0, 0)) is the ripple centers in the hippocampal MUA and the corresponding imaging frame in the voltage signal. Reviewer 1’s interpretation would be correct if the maximum correlation occurred at the point (0, 0), not at the point (75 ms, 75 ms). This is because the MUA value at the time of the ripple centers (t = 0), not at t = 75 ms, is the indicator of the ripple power. Figure 2B iii shows that the amplitude of hippocampal MUA is more than 2 dB lower at t = 75 ms than at t = 0, which reflects the fact that ripples are often short-duration events. Instead, if the maximum correlation occurred at the point (0, 100 ms), where the ripples had maximum power and aRSC voltage was at its trough (Figure 2B iii), it could have been concluded that “the stronger the ripple the less negative the cortical voltage”.

      Point 5) Following my previous point, it is difficult to interpret the ensemble-wise correlation analysis in the absence of rigorous significance testing. The increased correlation between the HPC and RSC following ripples is equal in magnitude to the correlation between pre-ripple HPC MUA and post-ripple cortical activity. How should those results be interpreted? The authors could, for example, use cluster-based analysis (Pernet et al., 2015) with temporal shuffling to obtain significant regions in those plots. In addition, the authors should mark the diagonal of those plots, or even better compute the asymmetry in correlation (see Steinmetz et al., 2019 Extended Fig. 8 as an example), to make it easier for the reader to discern lead/lag relationships.

      The purpose of calculating the ensemble-wise correlation coefficient was to provide further information about the relationship between the two random processes peri-ripple HPC MUA and peri-ripple neocortical activity. In general, the correlation between the two random processes cannot be inferred from the temporal relationship between their mean functions. In other words, there are infinitely many options for the shape of the correlation function between two random processes with given mean functions. Moreover, the point was to compare the correlation of peri-ripple neocortical activity and HPC MUA across neocortical regions. The fact that mean peri-ripple activity in, for example, RSC and FLS1 are different does not necessarily mean their correlation functions with peri-ripple HPC MUA are also different.
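
      To make the notion of an ensemble-wise correlation concrete, the sketch below computes, for every pair of time points (i, j), the Pearson correlation across events between one signal at time i and another at time j. It is a schematic illustration with hypothetical names and toy values, not the code used to produce the figures.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ensemble_correlation(a, b):
    """a[k][i] is signal A in event k at time i; b[k][j] likewise for signal B.

    Returns C with C[i][j] = correlation, across events k, of a[k][i] and b[k][j].
    """
    n_a, n_b = len(a[0]), len(b[0])
    return [
        [pearson_r([ev[i] for ev in a], [ev[j] for ev in b]) for j in range(n_b)]
        for i in range(n_a)
    ]


# Hypothetical toy ensemble: 3 events (ripples), 2 time points per signal.
mua = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0]]
voltage = [[0.0, 2.0], [0.5, 4.0], [2.0, 6.0]]
C = ensemble_correlation(mua, voltage)
```

      Because each entry of C is computed across events, two signals with similar mean traces can still yield very different correlation functions, which is the point made above.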

      As requested, we performed cluster-based significance testing via temporal shuffling for each individual VSFP (n = 6), iGluSnFR Ras (n = 4), and iGluSnFR EMX (n = 4) animal. The following figures summarize the number of animals showing significant regions in their correlation functions between peri-ripple HPC MUA and different neocortical regions. The diagonal of the correlation functions is marked; however, the temporal lead/lag should not be inferred from these results, mainly because the temporal resolutions of the two signals, one electrophysiological and one optical, are not the same.

      Point 6) For the single cell 2-photon responses presented in Fig. 3, how should the reader interpret a modulation that is at most 1/20 of a standard deviation? Was there any attempt to test for the significance of modulation (e.g., by comparing to shuffle)? If yes, what is the proportion of non-modulated units? In addition, it is not clear from the averages whether those cells represent bona fide distinct groups or whether, for instance, some cells can be upmodulated by some ripples but downmodulated by others. Again, separation of ripples based on objective criteria would be useful to answer this question.

      As explained in response to point 2, the seemingly small modulation size (e.g. 0.05 SD in Fig. 3b) arises because the individual peri-ripple calcium traces were z-scored against a peri-non-ripple distribution and then averaged. Alternatively, the peri-ripple traces could have been averaged first, and the averaged trace could have been z-scored against a sampling distribution constructed from the abovementioned peri-non-ripple distribution, where the sample size would have been the number of ripples detected for a specific animal. In this latter case, the standard deviation of the sampling distribution would have been used as the divisor in the z-scoring process, as opposed to the former case, where the standard deviation of the original peri-non-ripple distribution would have been used. Since the standard deviation of the sampling distribution is smaller than that of the original distribution by a factor of √(sample size), the final z-scored values in the latter case would be higher than those in the former case by a factor of √(sample size).

      As suggested by the reviewer and to make our results more comparable with those of electrophysiological studies, we deconvolved the calcium traces and tested for the significance of the modulation of each neuron by comparing its mean peri-ripple deconvolved trace with a neuron-specific shuffled distribution (see the methods section for details). We found that 8.46 ± 3% (mean ± std across 11 mice) of neurons were significantly modulated over the interval [0, 200 ms], and 81.08 ± 8.91% (mean ± std across 11 mice) of these were up-modulated. If the criterion of being distinct is being significantly up- or down-modulated, these two groups could be considered distinct groups. The following figures show mean peri-ripple calcium and deconvolved traces, averaged across up- or down-modulated neurons for each mouse and then averaged across 11 mice.
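
      The general idea of a neuron-specific shuffle test can be sketched as follows. This is a generic permutation-style test under stated assumptions (random surrogate event times, a two-sided p-value, hypothetical function names and toy data); the procedure actually used is the one described in the methods section and may differ in its details.

```python
import random


def mean_response(trace, event_idx, window):
    """Mean of the trace over `window` samples following each event."""
    vals = [
        sum(trace[i:i + window]) / window
        for i in event_idx
        if i + window <= len(trace)
    ]
    return sum(vals) / len(vals)


def shuffle_test(trace, event_idx, window, n_shuffles=1000, seed=0):
    """Compare the observed post-event mean with a null distribution built
    from randomly placed surrogate events on the same neuron's trace."""
    rng = random.Random(seed)
    observed = mean_response(trace, event_idx, window)
    last_start = len(trace) - window
    null = []
    for _ in range(n_shuffles):
        surrogate = [rng.randint(0, last_start) for _ in event_idx]
        null.append(mean_response(trace, surrogate, window))
    center = sum(null) / len(null)
    n_extreme = sum(abs(v - center) >= abs(observed - center) for v in null)
    p_value = (n_extreme + 1) / (n_shuffles + 1)
    direction = "up" if observed > center else "down"
    return p_value, direction


# Hypothetical neuron: silent except for 5 samples of activity after each ripple.
ripple_idx = [100, 300, 500, 700, 900]
trace = [0.0] * 1000
for i in ripple_idx:
    for k in range(5):
        trace[i + k] = 1.0

p_value, direction = shuffle_test(trace, ripple_idx, window=5)
```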

      Point 7) Fig. 3: The decomposition-based analysis of glutamate imaging using SVD needs to be improved. First, it is not clear how much of the variance is captured by each component, and it seems like no attempt has been made to determine the number of significant components or to use a cross-validated approach. Second, the authors imply that reconstructing the glutamate imaging data using the 2nd-100th components 'matches' the voltage signal but this statement holds true only in the case of the aRSC and not for other regions, without providing an explanation, raising questions as to whether this similarity is genuine or merely incidental.

      The first 100 components explained about 99.9% of the variance in the concatenated stack of peri-ripple neocortical glutamate activity for each animal, which is practically equivalent to the entire variance in the data. Our goal was not to obtain a low-rank approximation of the data, for which the number of significant components would have had to be determined. Instead, we decomposed the data into the activity along the first principal component, for which there was no noticeable topography among neocortical regions, and the activity along the rest of the components, for which there was a noticeable topography among neocortical regions. The first component explained 83.11 ± 6.75% (mean ± std across 4 iGluSnFR Ras mice) and 83.3 ± 5.07% (mean ± std across 4 iGluSnFR EMX mice) of the variance in the concatenated stack of peri-ripple neocortical glutamate activity.
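
      The split into "first component" and "everything else" can be illustrated with a small sketch. It uses power iteration as a stand-in for a full SVD, treats the data as an uncentered matrix, and operates on a toy 2 × 2 matrix; the function names are hypothetical and none of this is the code used in the study.

```python
import math


def first_component(X, n_iter=200):
    """Leading right singular vector of X by power iteration on X^T X.

    Assumes the starting vector is not orthogonal to the leading singular
    vector. Returns (scores, v), where scores[i] is the projection of row i on v.
    """
    n_rows, n_cols = len(X), len(X[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iter):
        Xv = [sum(row[j] * v[j] for j in range(n_cols)) for row in X]
        w = [sum(X[i][j] * Xv[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    scores = [sum(row[j] * v[j] for j in range(n_cols)) for row in X]
    return scores, v


def split_first_component(X):
    """Return the fraction of total (uncentered) variance along the first
    component and the residual data with that component removed."""
    scores, v = first_component(X)
    total = sum(x * x for row in X for x in row)
    explained = sum(s * s for s in scores) / total
    residual = [
        [X[i][j] - scores[i] * v[j] for j in range(len(v))]
        for i in range(len(X))
    ]
    return explained, residual


# Hypothetical toy matrix with singular values 3 and 1.
X = [[3.0, 0.0], [0.0, 1.0]]
explained, residual = split_first_component(X)
```

      For this toy matrix the first component carries 9/10 of the total variance, and the residual retains only the second, weaker direction.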

      As we discussed in the discussion section of the manuscript, SVD is agnostic about brain mechanisms and only cares about capturing maximum variance. Specifically, it is not designed to capture the maximum similarity between glutamate and voltage activity in the brain. Therefore, the only thing we can say with certainty is the following: when the activity along the axis with maximum co-variability (1st principal component) across the neocortical regions’ glutamate activity is removed, only aRSC, and no other region, shows a post-ripple down-modulation, whose timing matches that of the aRSC post-ripple voltage down-modulation. Moreover, the timing of the activity of the 1st principal component matches better with that of calcium activity among the up-modulated portion of aRSC neurons. Even though the genuineness of these results is not guaranteed, the similarity between the timing of the SVD output in aRSC glutamatergic activity and that in two independently collected signals in aRSC, i.e. voltage and calcium, could support the idea that peri-ripple aRSC glutamatergic activity is likely a mixture of up- and down-modulated components.

      Point 8) The estimation of deep pyramidal cells' glutamate activity by subtracting the Ras group (Fig. 4d) is not very convincing. First, the efficiency of transgene expression can vary substantially across different mouse lines. Second, it is not clear to what extent the wide field signal reflects deep cells' somatic vs. dendritic activity due to non-linear scattering (Ma et al., 2016), and it is questionable whether a simple linear subtraction is appropriate. The quality of the manuscript would improve substantially if the authors probe this question directly, either by using deep layer specific line/ 2-P imaging of deep cells or employing available public datasets.

      Simulation studies have suggested that the signal, captured by wide-field imaging of voltage-sensitive dye, can be modeled as a weighted sum of voltage activity across neocortical layers (Chemla and Chavane, 2010; Newton et al., 2021). Hence, modeling the glutamate signal as a weighted sum of the glutamate activity across neocortical layers is a good starting point. Future studies would be needed to improve this starting point by imaging glutamate activity in a cohort of mice with iGluSnFR expression in only deep layers’ neurons. Moreover, Ma et al. (Ma et al. 2016) stated that “This means that signal detected at the cortical surface (in the form of a two-dimensional image) represents a superficially weighted sum of signals from shallow and deeper layers of the cortex”.

      Reviewer #2 (Public Review):

      Point 1) The authors throughout the manuscript compare the correlation between hippocampal MUA and the imaged cortical ensemble activity (Example: Lines 120-122). There is a potential time lag in signal detection with regard to the two detection methods. While the time lag using electrophysiological recording is at the scale of milliseconds, the glutamate-sensitive imaging might take several 100s of ms to be detected. It is not clear in the manuscript how the authors considered this problem during the analysis.

      The ensemble-wise correlation analysis characterizes the relationship between two random processes, peri-ripple HPC MUA and peri-ripple neocortical activity (please see the response to reviewer 1’s major point 5). Although it is a valid point that the temporal resolution of the two signals is not the same which could introduce an error in the exact timing of the relationship between the two processes, we did not draw any conclusion based on the exact timing of the elevated correlation between the two processes. Moreover, we smoothed (equivalent to low-pass filtering) and down-sampled the MUA signal (please see the methods section) to bring the temporal scale of the two processes closer to each other. We also want to clarify that the temporal resolution of voltage and glutamate imaging is in the range of 10s of ms (Xie et al., 2016).

      Point 2) In the results section "The peri-ripple glutamatergic activity is layer dependent", are the Ras and EMX expressed in two different experimental animal groups? If yes, and there was a time lag between the two groups, is it valid to estimate the deeper layer activity using a scaled version of the Ras from the EMX signal?

      This comment is addressed in response to reviewer 1’s major point 8.

      Point 3) The authors did not discuss the results adequately in the discussion section. Since there is no behavioral paradigm and no behavioral read-out to induce or correlate it with possible planning and future decision-making process, the significance of the paper will be enhanced by discussing the possible underlying circuitry mechanism that might cause the reported observations. With no planning periods in the task (instead just sitting on a platform), it is actually quite unclear what the purpose of wake ripples should be. For example, the authors discuss the superficial and deep layer responses and their relation to the memory index theory. However, the RSC possesses different groups of excitable neurons in different layers. Specifically, three excitable neurons are found within the different layers of the RSC; the intrinsically bursting neurons (IB), regular spiking (RS), and low-rheobase (LR) neurons. These neurons are distributed heterogeneously within the RSC cortical layer. Although the RS are abundant in the deeper layers of the RSC, they occupy 40% of the total amount of excitable neurons found in layers II/III. On the other hand, the LR is the dominant excitable neuron in the superficial layers. It will add to the significance of the work if the authors discussed the results in the context of the cellular structure of the RSC and how would that impact the observed inhibition in the peri-ripple time window. It would be helpful for the readers and the reviewers to add a schematic diagram to the discussion section.

      The goal of our study was to characterize the patterns of neocortical activity around hippocampal ripples in the awake state and not shed light on the function (purpose) of awake ripples. However, we speculated about what our results could mean in the discussion section. To address the reviewer’s comment on the differences across RSC layers, the following paragraph was added to the discussion section lines 342-353.

      “Our results suggest that dendrites of deep pyramidal neurons, arborized in the superficial layers of the neocortex, receive glutamatergic modulation earlier than those of the superficial ones. However, the results do not provide a mechanistic explanation of the phenomenon. It is possible that the observed layer-dependency of the glutamatergic modulation would partially result from the heterogeneity of the excitatory as well as inhibitory neurons across aRSC layers. But, the question is how this heterogeneity may lead to the above-mentioned layer-dependency to which our data does not provide an answer. It could be speculated that the difference in the dendritic morphology and firing type of different types of RSC excitatory neurons (Yousuf et al., 2020) or the difference in connectivity of different RSC layers with other brain regions would play a role (Sugar et al., 2011; van Groen and Wyss, 1992; Whitesell et al., 2021). This is a complicated problem and could only be resolved by conducting experiments specifically designed to address this problem.”

      Point 4. A general issue (in addition to the missing behaviour), is the mix of the methods. On one side this makes the article very interesting since it highlights that with different methods you actually observe different things. But on the other side, it makes it very difficult to follow the results. It would be a major improvement of the article if the authors could include (as mentioned above) a schematic of the results and their theory, especially highlighting how the different methods would capture different parts of the mechanism. Finally, the authors should not use calcium signals as a direct measure of neuronal firing. Calcium influx is only seen in bursts of firing, not with individual spikes. It is a plasticity signal and therefore should be treated and discussed as such. Just recently it was shown by Adamantidis lab that the calcium signal changes between wake and sleep and this change does not parallel changes in neuronal firing/spikes.

      We agree with the reviewer that the calcium signal is biased toward bursts of spikes (Huang et al., 2021). To address this concern, the term “spiking activity” was replaced with “calcium activity” throughout the manuscript. Moreover, the calcium signal was deconvolved to obtain a better estimate of the spiking activity (please refer to our response to reviewer 1’s point 6).

      Point 5. In the discussion section, the authors focus their discussion on the connectivity between the CA1 area and the RSC. Although it is an important point, since the authors are examining the peri-ripple cortical dynamics, it is critical to discuss other possible connectivity effects. Furthermore, the hippocampal input preferentially targets the granular RSC, how would that impact the results and the interpretation of the authors? Additionally, a previous study reported the suppression of the thalamic activity during hippocampal ripples (Yang et al., 2019). Importantly, the thalamic inputs to the RSC target the superficial layers. It will add to the value of the paper if the authors expanded the discussion section and elaborated further on the possible interpretation of the results.

      At the time of our initial submission, pre-ripple reduction and post-ripple elevation of calcium activity in a portion of three subclasses of the superficial aRSC inhibitory neurons were reported (Chambers et al., 2022, 2021), and it was the basis of our speculation on the potential involvement of feed-forward inhibition in the post-ripple voltage reduction. We speculated that the source of this potential feed-forward inhibition could stem from gRSC excitatory neurons or other neocortical or subcortical regions projecting to aRSC (please see the discussion section). However, the source being from the thalamus is less likely because multiple studies have observed the suppression of the majority of thalamic neurons during awake ripples (Logothetis et al., 2012; Nitzan et al., 2022; Yang et al., 2019). Moreover, peri-awake-ripple suppression of thalamic axons projecting to the first layer of aRSC is reported (Chambers et al., 2022). On the other hand, it is also possible that feedback inhibition would be involved where the excitatory aRSC neurons that are excited by gRSC (as reviewer 1 pointed out) or any other region, including aRSC itself, excite aRSC inhibitory neurons which in turn inhibit pyramidal cells. To address this comment, the following paragraph was added to the discussion section in lines 323-328.

      “Thalamus is another source of axonal projections to aRSC (Van Groen and Wyss, 1992). However, it is less likely that thalamic projections contribute to the peri-awake-ripple aRSC activity modulation because multiple studies have observed the suppression of the majority of thalamic neurons during awake ripples (Logothetis et al., 2012; Nitzan et al., 2022; Yang et al., 2019). Moreover, peri-awake-ripple suppression of thalamic axons projecting to the first layer of aRSC is reported (Chambers et al., 2022).”

    1. Author Response

      Reviewer #3 (Public Review):

      Lillvis et al present a new method for quick targeted analysis of neural circuits through a combination of tissue expansion and (lattice) light sheet microscopy. Three color labeling is available which allows to label neurons of a molecularly specific type, presynaptic and/or post-synaptic sites.

      Strengths:

      • The experimental technique can provide much higher throughput than EM

      • All source code has been made available

      • Manual correction of automatic segmentations has been implemented, allowing for an efficient semi-automatic workflow

      • Very different kinds of analyses have been demonstrated

      • Inclusion of electrical connections is really exciting, what a great complement to the existing EM volumes!

      Weaknesses:

      • Limitations of the method are not really discussed. While the approach is simpler and cheaper than EM, it's still important to give the readers a clear picture of the use cases where it's not expected to work before they embark on the journey of acquiring tens of terabytes of data. Here are just a few examples of the questions I would have if I wanted to implement the method myself - I am a computational person and can easily imagine my "wet lab" colleagues would have even more to ask about the experimental side:

      Please see our response to the Essential Revisions (for the authors) section above in addition to the responses to each point below.

      • It is not very clear to me if the resolution of the method is sufficient to disentangle individual neurons of the same type. It has been demonstrated for a few examples in the paper, but is it generally the case? Are there examples of brain regions/neuron types where it wouldn't be possible? If another column was added to the table in Figure 1, e.g. "individual neuron connectivity", EM would be "+", LM "-", what would ExLLSM be?

      Individual neuron connectivity is possible using this current version of ExLLSM either by labeling individual neurons genetically or by manually segmenting neurons in sparsely labeled samples. Of course, the exact answer to this question depends on labeling density and sample quality, and we have added a statement to address this.

      Lines 585-591: The difficulty of such manual segmentation can vary substantially depending on labelling density and signal quality. For instance, manually segmenting individual L2 outputs (Fig. 3) took ~10 minutes/neuron whereas segmenting a pair of SAG neurons from off-target neurons (Fig. 4) took 1-5 hours depending on the sample. Of course, more densely labeled samples will take more time. Finally, while it is possible to segment individual neurons from entangled bundles as shown here and elsewhere (Gao et al., 2019), the expansion factor will need to be increased by an order of magnitude or more and neuron labels must be continuous to approach EM levels of reconstruction density.

      • Similarly, the procedures for filling gaps in the signal could result in falsely merged neurons. Does it ever happen in practice?

      Because the gap filling process is not utilized until after semi-automatic segmentation this was not a concern (the gaps were filled on manually inspected neuron masks that should only include signals from the neuron(s) of interest). This would certainly be a concern if we were using this gap filling step – or the fully automated neuron segmentation approach – to segment individual neurons from samples in which off-target neurons are also labeled, but that was not the case here.

      • How long does semi-manual analysis take in person-hours/days for a new biological question similar in scope to the ones demonstrated in the paper?

      The statement discussed above (lines 585-591) and an additional statement (lines 581-583) aim to address this.

      Lines 580-582: As such, analyzing the DA1-IPN data, for example, required relatively little human time. The semi-automatic neuron segmentation steps required a maximum of one hour per sample and all other steps are automated.

      • How robust are the networks for synaptic "blob" detection? The authors have shown they work for different reporters, when are they expected to break? Would you recommend to retrain for every new dataset? How would you recommend to validate the results if no EM data is available?

      We expect that the network for blob detection is quite robust, as it essentially acts as a high-signal detector for punctate signals, as opposed to classifying a high-level shape or structure. We have modified the text to suggest that the synapse and neuron segmentation models we include be attempted before retraining.

      Lines 368-372: Furthermore, the convolutional neural network models for synapse and neuron segmentation are classifiers of high signal punctate and continuous structures, respectively. As such, the models may already work well for segmenting similar structures from other species or microscopes. If not, these models can be retrained with a suitable ground truth data set and the entire computational pipeline can be applied to these new systems.

    1. Author Response

      Reviewer #2 (Public Review):

      Burger et al. present their compilation of 3 well established cervical cancer natural history micro-simulation models from the US (Harvard), Denmark (Miscan) and Australia (Policy 1) to evaluate what effect Covid, or any systemic problem impeding screening over a time duration, will have on cervical cancer incidence ("symptomatic cervical cancer") in the short and long term. They use the United States for the modeling example and establish that a temporary screening delay has less deleterious effect on cervical cancer incidence and morbidity than being under-screened. Screening test and previous screening frequency also impact on the outcomes.

      The authors evaluate a number of factors in their analysis:

      1. Screening test type: HPV (every five years) or cytology (every three years), as per guidelines.

      2. Screening delay such as with Covid: 1, 2, or 5 years from the participant's last screening encounter.

      3. The participant screening frequency: 1, 3, 5 or 10 yearly.

      4. Three birth cohorts: 35yo, 45yo and 55yo in 2020.

      As the Covid pandemic meant a delay of months toward a year, a key finding was the projected relative increase in symptomatic cervical cancer cases from a year delay which varied from 38% higher with Policy 1, to 170% higher with Miscan-Cervix. The comparison was for women who had not had cytology screening for 5 years before the delay versus those appropriately screened at 3 years. In the long term, over a lifetime, a one year (up to 5 year) delay, had less effect on developing cervical cancer than screening frequency or test type. This finding is reassuring for the general public. Most importantly, however, the authors showed that not being screened for a long duration (underscreened) is the most significant factor to developing cervical cancer, especially with a further systemic delay such as Covid. Being under, or never screened, is a clinically well known fact in the cervical cancer screening community. HPV screening type was also shown to be more protective against developing cervical cancer given its superior sensitivity for longer duration over cytology allowing HPV to be done every 5 years versus cytology at 3 years.

      The strength of the paper is showing the above findings through the multiple permutations of effects in detailed analytic tables for quantitative mathematical modeling experts, and summary figures simple enough for a more general reader to follow. The variation in results among the models was explained well with most of the difference due to the "dwell" time before an HPV infection develops into precancer and cancer, the Miscan model having the shortest dwell time and thus some of the higher relative rate and absolute increases in cancer. The authors emphasized that "heterogeneity" in screening history could be due to socioeconomic factors that aren't directly evaluated in the model, but women with greater socioeconomic barriers, tend to be those that are under-screened and most at risk of developing cervical cancer.

      The results are grouped into short and long term impacts. For the short term impact, the authors concentrated on showing excess cancer in women screening less frequently than guidelines, and used cytology every 3 years as a baseline. So women screening 5 or 10 years before disruption, did worse than q3 yearly guideline compliant women. There was then discussion about guideline compliant HPV screening which is done every 5 years, so only the 10 year group was non compliant. The authors discuss changing to HPV at 30 yo. Without knowing the actual guidelines for screening in the US, this section can be a bit confusing for the reader. It would be very helpful if the authors clearly state that cytology is offered every 3 years to women under 30, starting at 21 yo, and then HPV is offered q 5 yearly from 30yo. Alternatively, q3 yearly cytology can be done throughout a screening lifetime. This background information makes the short term results clearer to understand. The Figure is helpful and clear for interpretation.

      For the long term impacts, the authors are able to show that a temporary disruption in screening is less deleterious than overall poor screening history (not following guidelines). They also show that HPV testing from age 30 is better at preventing cervical cancer than 3 yearly cytology, and had less impact from a screening delay. (the notation to figure reads right panel but is likely Lower panel).

      Thank you for flagging this typo; we have changed the notation.

      Overall, the authors clearly show the effects of a temporary screening disruption in the context of a woman's overall screening history, frequency and test mode.

      This work is very relevant and timely in the cervical screening field and emphasizes the importance of assuring women are not under-screened, the greatest risk factor for cervical cancer. They give a comprehensive discussion of how their results are relevant for cervical cancer screening today and in the near future.

      Thank you for the nice summary and feedback.

      As alluded to earlier, clarification about the age related switch to HPV testing at 30yo would help the reader better understand the point about the two factors having to be balanced when considering HPV testing. Are the two factors the greater protective test sensitivity vs the benefit of the actual screening moment? This section was slightly confusing.

      In addition to the request for a more complete description of the US guidelines (addressed in Essential Revisions), we have clarified the description of the “2 competing factors” on page 6-7 of the manuscript.

      “Similarly, we found that the impact of disrupting an HPV-based screening program has different implications than the disruption of a cytology-based program. This can be explained by the fact that HPV screening has a higher sensitivity to detect (pre-invasive) cervical lesions than cytology; therefore, the cancer risk at time of disruption is lower (as there are fewer undetected lesions) and this may provide a greater buffer to endure temporary disruptions. On the other hand, in case of the more sensitive HPV test, disruption takes away a relatively more valuable (i.e., sensitive) screening moment. The balance between these two factors causes a greater or smaller excess risk per delay duration in case of HPV screening compared to cytology screening, which contributes to the within model differences of cytology-based versus HPV-based screening in Table 1. If in a model the first effect (HPV screening contributes to lower risk at the time of disruption) is larger than the second effect (removal of a valuable screening moment), disruption of the HPV program would have a smaller effect than that of the cytology program, which is the case for all screening frequencies in both the Harvard and Policy1-cervix models and the annual screeners in MISCAN-cervix (Table 1). The MISCAN-Cervix model predicted relatively more excess cancers for women screened with HPV 3-yearly, 5-yearly or 10-yearly due to disruptions, where delaying a more sensitive test (the second factor) seems to outweigh the first (less underlying disease at the time of the disruption). Differences in dwell time for HPV and cervical precancer among the three models predominately contributes to this balance between the two factors (Appendix), where the MISCAN-Cervix model has the shortest preclinical dwell time from HPV acquisition to cancer development (20). 
      In addition to the shorter dwell times, the MISCAN model also assumes that some precancerous lesions are structurally missed over time by cytology-based screening because they are located deeper into the cervical canal. For women with such lesions, missing a cytology screen due to a disruption is less harmful, which increases the relative difference between primary cytology and primary HPV screening in case of a disruption, and increases the effect of women missing a very sensitive screen (second factor).”

      The authors speak to self collection as a potential solution for some underscreened women (people with cervix). It would be important to outline how self sampling is actually done. Some people believe cytology can be done on self sample. Self sampling can also include urine HPV, thus some detail about self sampling in the discussion would be helpful and give another benefit to HPV testing (DNA based for self sampling).

      Although there is research interest in urine-based HPV screening, this is not yet at the point where it has been widely implemented in screening programs; however, we agree some additional information on self-sampling would be helpful for the general reader.

      On page 7 we have added, “Importantly, vaginal HPV-based screening (unlike cytology-based screening) enables self-collection of samples at home, which may provide a tool to reduce screening barriers and facilitate outreach to under-screened people who are also most vulnerable to screening disruptions.”

    1. Author Response

      Reviewer #1 (Public Review):

      Anopheles is an important disease vector and the efforts to characterize the extent of genetic variation in the system are welcome. In this piece, the authors propose a Variational Autoencoders method to assign species boundaries in a large sample of Anopheles mosquitoes using a panel of 62 nuclear amplicons. Overall, the method performs well as it can assign samples to an acceptable granularity. The main advantage of the method is that it takes reduced representation genome sampling which should cut costs in genotyping. The authors do not compare the effectiveness of their amplicon panel with other approaches to do reduced representation sequencing, or the computational method with other previously published methods. Additionally, the manuscript does not clearly state what is the importance of species assignments and the findings/method are -by definition- limited to a single biological system.

      It is important to draw the reviewer’s attention to the fact that this is a two part approach – the reviewer seems to have overlooked the Nearest Neighbour component of the work. The approach is not solely a VAE – the VAE only comes into play at the species complex level. The higher level assignments are done using NN approaches.

      The manuscript has three main limitations. First, there is no explicit test of the performance of ANOSPP compared to other methods of low-dimensional sampling. While the authors state that the ANOSPP panel will lead to genotyping for low cost (justifiably so), there is no direct comparison to other low-representation methods (e.g., RAD-Seq, MSG).

      The key advantage of ANOSPP is that it works on the entire Anopheles genus, whereas the other suggested sequencing methods are more applicable to a group of specimens of the same or closely related species. The purpose of the panel is to do species identification for the whole genus, so it really is an alternative to the current methods of species identification, which commonly consist of morphological identification of the species complex, followed by complex-specific PCR amplification of a single species-diagnostic locus. The only other species identification method for Anopheles that is not limited to a single species complex, that we are aware of, is a mass spectrometry approach (Nabet et al. Malar J, 2021); however, they only investigate three different species and reach a classification accuracy of at most 67.5%.

      The main advantage of ANOSPP over other reduced representation sequencing methods, like MSG and RAD-Seq, is that it is specifically designed to work for the entire Anopheles genus to support genus-wide species identification. In a genus comprising an estimated 100 million years of divergence, a sequencing approach that relies on restriction enzymes is likely to introduce a lot of variability in which parts of the genome are sequenced for different species. Moreover, both MSG and RAD-Seq typically map the reads to a reference genome; any choice of reference genome will likely introduce considerable bias when dealing with such diverged species. In general, the sequence data generated by those sequencing methods require more complicated and labour intensive processing. And lastly, the costs per sample for library preparation and sequencing are substantially lower with ANOSPP than with MSG and RAD-Seq: for library prep <1 USD (ANOSPP) versus 5 USD (RAD-Seq) (Meek and Larson, Mol Ecol Resour, 2019) and with 768 samples (ANOSPP), 384 samples (MSG; Andolfatto et al, Genome Res., 2011) and 96 samples (RAD-Seq; Meek and Larson, Mol Ecol Resour, 2019) per run.

      Second, and in a related vein, the authors present NNoVAE as a novel solution to determine species boundaries in Anopheles. Perusing the very references the authors cite, it is clear that VAEs have been used before to delimit species boundaries, which diminishes the novelty of the approach on its own.

      The VAE is only a part of the method presented in this manuscript. We believe a substantial amount of the value of NNoVAE lies in its ability to perform assignments for the entire Anopheles genus comprising over 100 MY of divergence - the closest analogous approach would be COI or ITS2 DNA barcoding, neither of which is robust for species complexes. Using NNoVAE, samples are first assigned to their relevant groups, and in many cases to their species, by the Nearest Neighbour method. Only those samples that are identified by the Nearest Neighbour method as members of the An. gambiae complex and cannot be unambiguously assigned to a single species, are passed through the VAE assignment method.

      Indeed, in (Derkarabetian et al, Mol Phylogenet Evol, 2019) VAEs are used to delimit species boundaries in an arachnid genus. However, this study works with ultra conserved elements, incorporating a total of 76kB of sequence, which is much more data than the approximately 10kB we get for all amplicons combined. Moreover, a crucial difference is that the referenced work uses SNP calls, based on alignment to one of their sequenced samples, as input for the VAE, where our VAE takes k-mer based inputs. This is also an important consideration in working with a large number of highly diverged species.

      Perhaps more importantly, the manuscript does not present a comparison with other methods of species delimitation (SPEDEStem, UML -this approach is cited in the paper though-), or even of assessment of population differentiation, such as STRUCTURE, ADMIXTURE, or ASTRAL concordance factors (to mention a few among many). The absence of this comparative framework makes it unclear how this method compares to other tools already available.

      NNoVAE is primarily a method for species assignment rather than for species delimitation. SPEDEStem addresses the question whether different groups of samples are separate species or not; different groups can be defined by e.g. described races, described subspecies, different morphotypes or different collection locations. The aim of ANOSPP and NNoVAE is to remove the necessity of any prior sorting of samples into groups – all that needs to be known is that the sample is an Anopheline. This avoids the issues associated with morphological identification and single marker molecular barcodes. So to perform species assignment with SPEDEStem, we’d have to run many replicates, each time asking whether a single sample is of the same species as one of the species represented in our reference database. For example, for the 2218 samples presented in the case studies, we would have to run SPEDEStem more than 130,000 times, to check for each of these samples whether they are any of the 62 species represented in the reference dataset NNv1.
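
      The run count quoted above follows from testing every sample against every reference species once. The short check below is a sketch under that assumption (one SPEDEStem run per sample-species pair); the variable names are ours.

```python
# One hypothetical SPEDEStem run per (sample, reference species) pair.
n_samples = 2218            # samples in the case studies
n_reference_species = 62    # species represented in NNv1
n_runs = n_samples * n_reference_species
print(n_runs)  # 137516, i.e. "more than 130,000"
```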

      However, we agree that it would be good to check that the species-groups in the reference database, NNv1, are indeed supported as separate species. We attempted to run SPEDEStem, but the web browser no longer exists, and we were not able to install the command line application, which runs on Python 2. Moreover, the example files provided in the tutorial are not complete. Therefore, we were unable to even carry out this basic comparison.

      UML (unsupervised machine learning) approaches comprise quite a wide range of methods, including VAE. We have conducted a comparison between the VAE assignments and assignments based on UMAP; for the discussion, see below, page 20 of the manuscript, and the newly added supplementary information section 4.

      As requested by the reviewer, we have compared our assignment approach to ADMIXTURE on the Anopheles gambiae complex training set (see Supplementary information section 5). It is a good sanity check to compare the structure revealed by ADMIXTURE to the structure revealed by the VAE. We found that ADMIXTURE does not satisfyingly differentiate between the species in the complex that are only represented by a handful of samples, while the VAE suffers much less from the differences in group sizes in the training set. Moreover, we want to point out that ADMIXTURE is a tool for assessing population differentiation, not for species assignment. To use it as an assignment method, there are two options: either infer the allele frequencies in the ancestral populations from the training set and use those to compute the maximum likelihood of ancestry frequencies for the test set; or run ADMIXTURE on the training and test sets combined and use the labels from the training set to label ancestral populations. A major drawback of the former approach is that it is tricky to discover cryptic taxa or outliers in the test set, while with the second approach we create a dependency of the training set results on the test set it is combined with during the run. But more importantly, ADMIXTURE performs worse than the VAE on the An. gambiae complex training set by itself, and identifies only two to three different groups among the five diverged species (An. melas, An. merus, An. quadriannulatus, An. bwambae and An. fontenillei). For more information, see page 20 in the manuscript and the newly added supplementary information section 5.

      One important use case of our method is to identify interesting samples, e.g. potential hybrids or cryptic taxa, for subsequent whole genome sequencing. After selection and whole genome sequencing of interesting samples detected by ANOSPP+NNoVAE, ADMIXTURE may be useful as one of the tools to investigate such samples.

      A final concern is less methodological and more related to the biology of the system. I am curious about the possibility of ascertainment bias induced by the amplicon panel. In particular, the authors conclusively demonstrate they can do species assignment with species that are already known. Nonetheless, there is the possibility of unsampled species and/or cryptic species. This latter issue is brought up in passing in the 'Gambiae complex classifier datasets' section but I think the possibility deserves a formal treatment. This is particularly important because the system shows such high levels of hybridization that the possibility of speciation by admixture is not trivial.

      We appreciate the reviewer’s concern regarding ascertainment bias in the amplicon panel. The targets have been selected based on multiple sequence alignments of all Anopheles reference genomes at the time (Makunin et al. Mol Ecol Resour, 2022). Using sequenced species from four different subgenera, the species span a considerable amount of evolutionary time in the Anopheles genus. For all species we have since tested the panel on, we find that at least half of the targets get amplified.

      We share the reviewer’s concern regarding species which are not (yet) represented in the reference database. This is one of the main advantages of the Nearest Neighbour method: it works on three levels of increasing granularity. So for samples that cannot be assigned at species level, we are often able to identify the group of species from the reference database it is closest to. In particular, the situation of a test sample whose species is not represented in the reference database, is mimicked in the drop-out experiment by the species-groups which contain only one sample. On page 16 in the manuscript, we explain how NNoVAE deals with such samples and we show that in the majority of cases NNoVAE assigns the sample to a group of closely related species rather than misclassifying it more specifically to the wrong species.
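
      The idea of falling back to a coarser group when a species-level call is not supported can be sketched as follows. This is an illustrative simplification, not the NNoVAE implementation: the level names, the score dictionaries, and the 0.8 threshold are hypothetical placeholders.

```python
def assign_with_fallback(scores_by_level, threshold=0.8):
    """Return the finest (level, label) whose best score reaches the threshold.

    scores_by_level: list of (level_name, {label: score}) ordered from the
    coarsest to the finest level. Names and threshold are hypothetical.
    Returns None if even the coarsest level is not supported.
    """
    assignment = None
    for level_name, scores in scores_by_level:
        if not scores:
            break
        label, score = max(scores.items(), key=lambda item: item[1])
        if score < threshold:
            break  # stop refining; keep the last supported, coarser call
        assignment = (level_name, label)
    return assignment

# A sample that is well supported at two levels but ambiguous at species level.
levels = [
    ("coarse", {"group_1": 0.95, "group_2": 0.05}),
    ("intermediate", {"subgroup_1a": 0.90, "subgroup_1b": 0.10}),
    ("fine", {"species_x": 0.55, "species_y": 0.45}),
]
print(assign_with_fallback(levels))  # ('intermediate', 'subgroup_1a')
```

      Under this scheme a sample from an unrepresented species ends up labelled with its closest group of species rather than being forced into a wrong species-level call.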

      In summary, the main limitation of the manuscript is that the authors do not really elaborate on the need for this method. The manuscript does show that the method is feasible but it is not forthcoming on why this is of importance, especially when there is the possibility of generating full genome sequences.

      ANOSPP and NNoVAE are specifically designed for high throughput accurate species identification across the entire Anopheles genus – WGS is important to address many questions, but is complete overkill for doing species identification. ANOSPP costs only a small fraction of whole genome sequencing, which makes it possible to monitor mosquito populations at much larger scale (e.g., in partnership with our vector biologist collaborators in Africa, we have already generated ANOSPP data for approximately 10,000 mosquitoes and will be running 500,000 over the next few years). Moreover, for most analyses using whole genome sequencing, a reference genome of a sufficiently similar species is required. While we are in a position of privilege having reference genomes for more than 20 species in Anopheles, we have a long way to go before we have 100s of reference genomes covering the true diversity of the genus.

      NNoVAE can also be used to select interesting samples (e.g. species that have not been through the panel before, divergent populations, potential hybrids), which can be submitted for whole genome sequencing subsequently.

      Since Anopheles is arguably one of the most important insects to characterize genetically, the ANOSPP panel is certainly important but I am not completely sure the method of species assignment is novel or groundbreaking.

      Reviewer #2 (Public Review):

      The medically important mosquito genus Anopheles contains many species that are difficult or impossible to distinguish morphologically, even for trained entomologists. Building on prior work on amplicon sequencing, Boddé et al. present a novel set of tools for in silico identification of anopheline mosquitoes. Briefly, they decompose haplotypes generated with amplicon sequencing into kmers to facilitate the process of finding similar sequences; then, using the closest sequence or sequences ("nearest neighbors") to a target, they predict taxonomic identity by the frequency of the neighbor sequences in all groups present in a reference database. In the An. gambiae species complex, which is well-known for its historical and ongoing introgression between closely-related species, this approach cannot distinguish species. Therefore, they also apply a deep learning method, variational autoencoders, to predict species identity. The nearest neighbor method achieves high accuracy for species outside the gambiae complex, and the variational autoencoder method achieves high accuracy for species within the complex.
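
      The nearest-neighbour idea summarised above can be illustrated with a minimal sketch. It is not the authors' NNoVAE implementation: the Jaccard similarity on k-mer sets, the default k of 8, and the function names `kmers`, `jaccard` and `assign_group` are all assumptions made for illustration.

```python
from collections import Counter


def kmers(seq, k=8):
    """Return the set of all overlapping k-mers in a haplotype sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_group(target, reference, k=8):
    """Return group frequencies among the reference haplotypes nearest to target.

    reference: non-empty list of (sequence, group_label) pairs (hypothetical layout).
    """
    target_kmers = kmers(target, k)
    scored = [(jaccard(target_kmers, kmers(seq, k)), label)
              for seq, label in reference]
    best = max(score for score, _ in scored)
    # Keep every reference haplotype tied for the highest similarity.
    nearest = [label for score, label in scored if score == best]
    counts = Counter(nearest)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

      Returning group frequencies rather than a single label mirrors the description above, in which the prediction is based on how often the neighbour sequences occur in each reference group.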

      The main strength of this method (along with the associated methods in the paper on which this work builds) is its ability to speed up the identification of anopheline mosquitoes, therefore facilitating larger sample sizes for a wide breadth of questions in vector biology and beyond. This technique has the added advantage over many existing molecular identification protocols of being non-destructive. This high-throughput identification protocol that relies on a relatively straightforward amplicon sequencing procedure may be especially useful for the understudied species outside the well-resourced gambiae complex.

      An additional and intriguing strength of this method is that, when a species label cannot be predicted, some basic taxonomic predictions may still be made in some cases. Indeed, even in the case of known species, the authors find possible cryptic variation within An. hyrcanus and An. nili, demonstrating how useful this new tool can be.

      The main weakness of this method is that, as the authors note, accuracy is dependent on the quality and breadth of the reference database (which in turn relies on the expertise of entomologists). A substantial portion of the current reference database, NNv1, comes from one species complex, An. gambiae. This is reasonable given the complex's medical importance and long history of study; however, for that same reason, robust molecular and computational tools for identifying species in this complex already exist. The deep learning portion of this manuscript is a valuable development that can eventually be applied to other species complexes, but building up a sufficient database of specimens is non-trivial. For that reason, the nearest neighbor method may be the more immediately impactful portion of this paper; however, its usefulness will depend on good sampling and coverage outside the gambiae complex.

      Another potential caveat of this method is its portability. It is not clear from either the manuscript or the code repository how easy it would be for other researchers to use this method, and whether they would need to regenerate the reference database themselves. The authors clearly have expansive and immediate plans for this workflow; however, as many researchers will read this manuscript with an eye towards using these methods themselves, clarifying this point would be valuable.

      This is an important point; currently the amplicon panel is only run on specialised robots, but we are working to adapt the protocol so that it can be run in any basic molecular lab. We have now clarified this in the conclusion. Furthermore, there is never a need to regenerate the reference databases – this is fully publicly available at github.com/mariloubodde/NNoVAE and version controlled. As we obtain ANOSPP data from additional samples, representing new species or new within-species diversity, we will add these to the reference database and create an updated openly available version.

      The authors present data suggesting that their method is highly accurate in most of the species or groups tested. While the usefulness of this method will depend on the reference database, two points ameliorate this potential concern: it is already accurate on a wide breadth of species, including the understudied ones outside the An. gambiae complex; additionally, even when a specific species identification cannot be made, the specimen may be able to be placed in a higher taxonomic group.

      Overall, these new methods offer an additional avenue for identifying anopheline species; given their high-throughput nature, they will be most useful to researchers doing bulk collections or surveillance, especially where multiple morphologically similar species are common. These methods have the potential to speed up vector surveillance and the generation of many new insights into anopheline biology, genetics, and phylogeny.

    1. Author Response

      Reviewer #1 (Public Review):

      1) The relationship between lobe cannibalism and mtDNA reduction seems to be too mild. The authors first show that about half of mitochondria are removed in PGCs between the embryo and L1 stages. At this point, the number of mitoDNA/cell decreases by half compared to the embryonic stage, and based on this result, they propose that this is a bottleneck. To me (intuitively) a 50% reduction does not seem strong enough to be a bottleneck. Perhaps it is better to tone down the claim a bit here (unless they can provide stronger evidence, such as modeling, that a 50% reduction is sufficient to cause a bottleneck). Textual editing would suffice, though (unless they already have the evidence for a bottleneck).

      We used the term ‘bottleneck’ to indicate the point when mtDNA numbers are at their lowest point (prior to mtDNA expansion) that we could detect in the early germ line lineage, which results from a combination of reductive cell divisions to get to the PGC lineage, followed by lobe cannibalism and autophagy for a final two-fold reduction. The ~150-fold total reduction we detect in PGC mtDNA number relative to the estimated number of mtDNAs present in the oocyte (inferred from analyzing whole early embryos) is comparable to the ~100-fold reduction in mtDNAs that occurs between the oocyte and early PGCs in mouse, which has been proposed to be a germline mtDNA genetic bottleneck based on computer simulation studies (Cree et al., 2008, NCB 40: 249-254). In addition, the number of mtDNAs we detect per PGC (~200) at its low point in L1 larvae is comparable to the number of mtDNAs in mouse PGCs purified using two different reporter transgenes (280 mtDNAs per PGC, Wai et al., 2008, NCB 40: 1484-1488; 203 mtDNAs per PGC, Cree et al., 2008, NCB 40: 249-254). However, we agree with Reviewer 1 that our data does not show whether the 150-fold total reduction in mtDNA number we detect in PGCs relative to the oocyte has a functional consequence on segregation of mtDNA mutations in the germ line. To clarify what we have shown and what we propose based on findings and simulations in other systems, we refer to the number of mtDNAs in L1 PGCs as a ‘low point’ and introduce in the discussion how this reduction could affect segregation and inheritance of mtDNA mutations (pg. 14, lines 333-339).
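
      As a back-of-envelope check of the approximate figures quoted above (not a calculation taken from the manuscript), the numbers are mutually consistent. The symbols below are introduced here for illustration only, and the oocyte value is merely what the stated ~150-fold reduction and ~200 mtDNAs per PGC imply.

```latex
N_{\mathrm{L1}} \approx 200, \qquad
N_{\mathrm{pre}} \approx 2 \times N_{\mathrm{L1}} \approx 400, \qquad
N_{\mathrm{oocyte}} \approx 150 \times N_{\mathrm{L1}} \approx 3 \times 10^{4}
```

      Under these assumptions, the reductive cell divisions would account for roughly a 75-fold reduction ($N_{\mathrm{oocyte}} / N_{\mathrm{pre}}$), with lobe cannibalism and autophagy contributing the final two-fold step.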

      2) Overall, one thing that struck me was that, when they assay 'selection' by mtDNA (e.g. the number of mtDNA, frequency of mutant mtDNA, reduction by autophagy pathway, reduction by pink1, etc), the effect seems to be way too mild. However, in Fig1c, d, and Fig2c, the amount of mitoGFP that goes to the lobe seems to be at least 80-90%. Is this because the 'striking' images were selected for presentation? Alternatively, I wonder if mito with more mtDNA actually end up surviving, and mito with fewer mitoDNA goes to the lobe (as a result, the amount of mito removed to the lobe is much higher than the amount of mtDNA removed). If so, is this actually THE selection that happens during embryo-to-L1 transition? Is there any way to measure the amount of mito and amount of mtDNA simultaneously?

      Thank you for bringing up this point. Reviewer 1 is correct that 80% of mitochondria are in lobes initially, so the images in Figures 1 and 2 are representative. However, prior to lobe scission, some mitochondria move back into the cell body, such that only 60% are in lobes at the two-fold stage (we have not performed this analysis even later, just as cannibalism begins between the two and three-fold stage, because of embryo rotation in the eggshell). This was shown and quantified in our original Fig. S1. To stress this point, the quantification of this data has been moved to Figure 1, while representative images remain in the supplement (Figure 1—figure supplement 1) (See Figure 1F). The movement of mitochondria back into the cell body is an avenue that we plan to explore in future studies, although we feel that it is beyond the scope of the current manuscript.

      Although we have not quantified mtDNA distribution due to the challenges of imaging late embryos, we have no evidence that there is a significant asymmetry in mtDNA density (mtDNAs per unit of mitochondrial mass) between lobe and cell body mitochondria; mtDNAs are distributed among mitochondria in both lobes and the cell body (see Figure 1I). Our experiments on uaDf5 mutant mtDNA also show that even if a small asymmetry in uaDf5 were present, it is not responsible for selecting against uaDf5 mtDNAs since uaDf5 mtDNA heteroplasmy in PGCs still decreases between embryo and L1 in nop-1 mutants, though we cannot exclude the possibility that a small asymmetry exists.

      Reviewer #2 (Public Review):

      Major points:

      1) I wish that the authors provided more direct evidence to support their conclusion that there is no mtDNA replication in embryonic PGCs and mtDNA only starts to replicate before the first division of PGCs in early L1.

      See essential revisions.

      2) It will also be interesting to show how compromising cannibalism (e.g. using nop1 mutant) affects the replication of mtDNA after the first division of PGCs in L1.

      See essential revisions.

      3) Finally, given that the total mtDNA copy number in later GSCs is similar between worms with and without the PGC cannibalism (wt vs nop-1 mutant) (Fig 3), and cannibalism does not selectively eliminate detrimental mtDNA mutation, I also wonder why PGCs need a bottleneck for the mtDNA population.

      We also do not fully understand why PGC lobe cannibalism is necessary. However, PGCs are born with a relatively high number of mtDNAs, as they arise from a small number of invariant cell divisions during embryogenesis (5) relative to somatic cells (on average ~8); lobe cannibalism could be a way to eliminate this excess to reach ~200 before PGCs differentiate into GSCs in larvae. Our experiments on nop-1 mutants clearly show that this number is important, as it is achieved through an independent mechanism even when lobe cannibalism is blocked. We have dedicated an entire paragraph to discussing these important points (pg. 13-14, lines 326-339).

    1. Author Response

      Reviewer #3 (Public Review):

      Understanding the relevance of skewed X-Chromosome Inactivation (XCI) in women's disease susceptibility and development is an intriguing open question. In this manuscript entitled "Age acquired skewed X Chromosome Inactivation is associated with adverse health outcomes in humans" Roberts et al. attempt to characterize this relationship by assaying skewed X-Chromosome Inactivation in >1,500 females from the TwinsUK population cohort. The authors reported an association between skewed XCI and increased cardiovascular risk across the tested population. This association is reinforced by a twin study based on age matched twin pairs discordant in their degree of XCI skewing. This approach is indeed powerful as it controls for age in predicting the cardiovascular disease risk score. The authors also found an association between skewed XCI and a haematopoietic bias towards the myeloid lineage. Finally, skewed XCI was shown to be predictive of future cancer incidence. This area of research is timely and of great interest for the community. However, in my opinion, the conclusions of this manuscript are not fully supported by the presented data and some aspects of the data analysis and results need to be extended.

      We thank the reviewer for their kind comments on the importance of our study. We do hope they agree that, through addressing the necessary changes outlined above, particularly with regards to the discussion, that the conclusions of the manuscript are now more nuanced and present the work in the broader context of related fields, in particular clonal haematopoiesis. Overall, we hope the reviewer agrees that the conclusions of the manuscript now better reflect the results presented.

    1. Author Response

      Reviewer #1 (Public Review):

      Liu et al. applied an existing method to study the subtypes of CRC from a network perspective. In the proposed framework, the authors calculated the perturbation of expression-rank differences of predefined network edges in both tumor and normal samples. By clustering the derived perturbation scores in CRC tumors using publicly available gene expression datasets, they reported six subtypes (referred to as GINS 1-6) and then focused on the association of each subtype with clinical features and known molecular mechanisms and cell phenotypes. My recommendation is major revision.

      Major concerns:

      (1) While this study originates from the network-perspective, it is unclear to me if the new subtypes provide key novel insights into the gene regulatory mechanisms for the development of CRC. For example, the "Biological peculiarities of six subtypes" section is descriptive and lacks a punch point.

      Thanks for your professional suggestions. In this study, we focused on the global network perturbations instead of snapshot transcriptional profiles, because snapshot transcriptional profiles largely ignore the dynamic changes of gene expressions in a biological system, and conversely, biological networks remain relatively stable irrespective of time and condition. In this perturbation network, we only use the global network perturbation matrix to perform consensus clustering, rather than for exploring the gene regulatory mechanisms of each subtype. However, subtype-related studies tend to investigate the biological characteristics of each subtype.

      Thus, we then delineate the biological attributes inherent to GINS subtypes using two different algorithms (SSEA and GSVA). These works were done to understand the underlying biological characteristics of these subtypes and define them biologically, similar to previous subtype studies (PMID: 31563503; 31875970; 26457759; 30833271; 30842092; 32164750; 30837276). As you commented, the section is descriptive and lacks a punch point. Hence, we highlighted potential transformation among GINS2/4/5. In this study, GINS2 was endowed with higher stromal activity and lower immune activity, whereas GINS5 conveyed the opposite trend entirely, concordant with the tumor invasiveness and prognosis of two subtypes, and GINS4 was characterized by a mixed phenotype that displayed moderate level of stromal and immune pathways. As three subtypes with abundant TME components, GINS2/4/5 may mutually evolve in stromal and immune functions. Thus, we intended to extract consistently upregulated and downregulated genes among these three subtypes, using Mfuzz package, a noise-robust soft clustering analysis with the fuzzy c-means form(Kumar and M, 2007). The Mfuzz analysis revealed 10 gene clusters, and gene cluster 3 and 10 displayed the stable expression pattern from GINS2 to GINS5 (Figure 5C and Supplementary File 8). As expected, gene cluster 3 was prevailingly associated with immune infiltration and activation (Figure 5D), whereas gene cluster 10 was prominently characterized by stromal activation and remodeling (Figure 5E), which further supported our findings. This also indicated that TME had profound impacts on the progression and prognosis of tumors, and GINS2/5 acted as two extremes of TME components, indeed showing diametrically opposite clinical outcomes (Red mark in “Biological peculiarities of six subtypes” part). Subsequently, we further investigate the immune regulations of GINS subfamilies. 
      We found that GINS5 was also characterized by higher immune infiltration and stronger immunogenicity based on the transcriptome and proteome analysis. For example, GINS5 harbored remarkably higher tumor mutation burden (TMB) and neoantigen load (NAL) (P <0.001, Figure 6C), possibly further inducing abundant immune elements and regulations. GINS5 also possessed abundant infiltration of Th1, Th2, and M1 macrophages (Mills et al., 2016) (Figure 6-figure supplement 1A-C), which could secrete proinflammatory cytokines and enhance immune activation. Conversely, M2 macrophages, traditionally regarded as promoting tumor growth by suppressing cell-mediated immunity and subsequent cancer cell killing (Mills et al., 2016), were significantly elevated in GINS2 (Figure 6-figure supplement 1D). In line with this, three other classical immunosuppressive cells, including fibroblasts, myeloid-derived suppressor cells (MDSC), and Treg cells (Hicks et al., 2022), were also significantly enriched in GINS2.

      Additionally, in the “GINS6 tumors conveyed rich lipid metabolisms” part, we further observed that lipid metabolisms were the most significant metabolic processes in GINS6. Metabolomics analysis suggested that GINS6 exhibited higher levels in four fatty acids including α-glycerophosphate, adipate, taurocholate, and aconitate. These findings validated that GINS6 was closely associated with metabolic reprogramming and accumulated fatty acids.

      Overall, it is difficult to investigate the underlying biological mechanisms of all subtypes in depth within a single paragraph, so we first used the ‘Hallmark’ genesets to preliminarily explore the biological characteristics of these subtypes, which gave us inspiration and direction for further exploration. The subsequent analyses in this part refine and deepen this initial exploration.

      Thank you for your academic discussion with us.

      (2) To further demonstrate the novelty of the identified subtypes, the authors need to show the additional benefit of the GINS1-6 to patient stratification derived from existing methods, such as integrative clustering based on multiple genomic evidence (copy number alterations, gene expression and somatic mutations).

      Thanks for your thoughtful comments. We wanted to clarify this issue from the following three aspects:

      1) First of all, the basis that inspired us is that the global network perturbations have advantages over snapshot transcriptional profiles (main traditional methods in CRC), because snapshot transcriptional profiles largely ignore the dynamic changes of gene expressions in a biological system, and conversely, biological networks remain relatively stable irrespective of time and condition. The gene interactions in a biological network are overall stable in a particular type of normal human tissue but widely perturbed in diseased tissues (PMID: 29040359 and 25165092). These perturbations in gene interactions (edge perturbations) in each sample can be measured by the change in the relative gene expression value. The edge perturbations at an individual level can be used to characterize the perturbation of the biological network for each sample efficiently. Thus, this is the starting point for cancer clustering in this study.
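
      The edge-perturbation idea described above can be sketched as follows. This is an illustrative reading of the description, not the authors' pipeline: it assumes that an edge's value is the difference in within-sample expression ranks of its two genes, and that the perturbation is that difference minus its mean across normal samples. The function names and the input layout are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks (1 = lowest expression); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda g: values[g])
    result = [0] * len(values)
    for rank, gene in enumerate(order, start=1):
        result[gene] = rank
    return result


def edge_deltas(sample, edges):
    """Rank difference between the two genes of each edge within one sample."""
    r = ranks(sample)
    return [r[i] - r[j] for i, j in edges]


def edge_perturbation(sample, normal_samples, edges):
    """Per-edge deviation of one sample from the mean rank difference in normals.

    sample: list of expression values indexed by gene.
    normal_samples: non-empty list of such lists from normal tissue.
    edges: list of (gene_index, gene_index) pairs from a predefined network.
    """
    normal = [edge_deltas(s, edges) for s in normal_samples]
    baseline = [sum(col) / len(col) for col in zip(*normal)]
    return [d - b for d, b in zip(edge_deltas(sample, edges), baseline)]
```

      Stacking the per-sample vectors returned by `edge_perturbation` would give an edges-by-samples matrix of the kind that could then be passed to a clustering step.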

      2) Second, the essence of molecular clustering is to investigate tumor heterogeneity. In order to detect multiple subtypes (some of which may represent relatively small fractions of the patient population) (PMID: 23584089), the clustering methods require moderately large numbers of samples – more than contained in any one of the individual CRC data sets published to date. With that in mind, we began our analysis by identifying suitable and comparable microarray datasets (n=2167, Supplementary File 15). The sample number in our discovery dataset is the largest among the current CRC subtype-related studies. For multi-omics clustering, there is currently no multi-omics sequencing cohort with a large number of samples and good sequencing quality; only the TCGA-CRC cohort has eligible multi-omics data (fewer than 300 patients with multi-omics data). Therefore, subtypes that represent relatively small fractions of the patient population cannot be detected by multi-omics clustering.

      3) Third, we actually tested several methods and datasets before determining GINS subtypes. Clustering always divides tumors into several subgroups, but we expect these subgroups to reproduce in other cohorts. Thus, we need to validate the robustness of our subtypes in multiple independent cohorts. Our validation work focused on the following four contexts: (1) data from the same platform (GPL570); (2) data from different platforms and sequencing techniques (microarray or RNA-seq); (3) microdissected or whole tumors; (4) in-house clinical setting. However, as mentioned above, only TCGA-CRC has eligible multi-omics data, so a comparably rigorous verification of multi-omics clustering cannot be carried out.

      Thank you for your academic discussion with us.

    1. Author Response

      Reviewer #1 (Public Review):

      Huang et al. sought to study the cellular origin of Tuft cells and the molecular mechanisms that govern their specification in severe lung injury. First the authors show ectopic emergence of Tuft cells in airways and distal parenchyma following different injuries. The authors also used lineage tracing models and uncovered that p63-expressing cells and to some extent Scgb1a1-lineaged labeled cells contribute to tuft cells after injury. Further, the authors modulated multiple pathways and claim that Notch inhibition blocks tuft cells whereas Wnt inhibition enhances Tuft cell development in basal cell cultures. Finally, the authors used Trpm5 and Pou2f3 knock-out models to claim that tuft cells are indispensable for alveolar regeneration.

      In summary, the findings described in this manuscript are somewhat preliminary. The claim that the cellular origin of Tuft cells in influenza infection was not determined is incorrect. Current data from pathway modulation is preliminary and this requires genetic modulation to support their claims.

      We thank the reviewer for the comments and we have performed extensive experiments to address the reviewer’s comments. In the revised manuscript we provide additional data including genetic modulation findings to support our model.

      Major comments:

      1) The abstract sounds incomplete and does not cover all key aspects of this manuscript. Currently, it is mainly focusing on the cellular origin of Tuft cells and the role of Wnt and notch signaling. However, it completely omits the findings from Trpm5 and Pou2f3 knock-out mice. In fact, the title of the manuscript highlights the indispensable nature of tuft cells in alveolar regeneration.

      We have modified the abstract and title accordingly.

      2) In lines 93-94, the authors state that "It is also unknown what cells generate these tuft cells.....". This statement is incorrect. Rane et al., 2019 used the same p63-creER mouse line and demonstrated that all tuft cells that ectopically emerge following H1N1 infection originate from p63+ lineage labeled basal cells. Therefore, this claim is not new.

      We thank the reviewer for the comment. Although Rane et al. reported that p63-expressing lineage-negative epithelial stem/progenitor cells (LNEPs) could contribute to the ectopic tuft cells after PR8 virus infection, it is still not clear whether the p63+ cells give rise to tuft cells directly or through EBCs. Thus, we performed Tmx injection after PR8 infection, unlike Rane et al. (2019), who performed Tmx injection before viral infection, to indicate that the ectopic tuft cells are derived from EBCs, as shown in revised Figure 2.

      3) Lines 152-153 state that "21.0% +/- 2.0 % tuft cells within EBCs are labeled with tdT when examined at 30 dpi...". It is not clear what the authors meant here ("within EBC's")? And also, the same sentence states that "......suggesting that club cell-derived EBCs generate a portion of tuft cells....". In this experiment, the authors used club cell lineage tracing mouse lines. So, how do the authors know that the club cell lineage-derived tuft cells came through intermediate EBC population? Current data do not show evidence for this claim. Is it possible that club cells can directly generate tuft cells?

      We apologize for the confusion and have revised the text accordingly. Here, “within EBCs” means within the “pods” area where p63+ basal cells are ectopically present. The sentence is revised as “21.0% +/- 2.0 % tuft cells that are ectopically present in the parenchyma are labeled by tdT. Notably, these lineage labeled tuft cells were co-localized with EBCs.” We do not know whether the club cell lineage-derived tuft cells transit through intermediate EBCs, which is why we used “suggest”. It is also possible that club cells can directly generate tuft cells. To avoid confusion, we have deleted the sentence.

      4) Based on the data from Fig-3A, the authors claim that treatment with C59 significantly enhances tuft cell development in ALI cultures. Porcupine is known to facilitate Wnt secretion. So, which cells are producing Wnt in these cultures? It is important to determine which cells are producing Wnt and also which Wnt. Further, based on DBZ treatments, it appears that active Notch signaling is necessary to induce Tuft cell fate in basal cells. Where are Notch ligands expressed in these tissues? Is Notch active only in a small subset of basal cells (and hence generates rare tuft cells)? This is one of the key findings in this manuscript. Therefore, it is important to determine the expression pattern of Wnt and Notch pathway components.

      We thank the reviewer for the interesting questions and agree on the importance of identifying the specific ligands and receptors for relevant Wnt and Notch signaling during tuft cell derivation. That being said, we think the topic is beyond the scope of this study, which is focused on the role of tuft cells in alveolar regeneration. The point is well taken and we will investigate the topic in a future study.

      5) How do the authors explain different phenotypes observed in Trpm5 knockout and Pou2f3 mutants? Is it possible that Trpm5 knockout mice have a subset of tuft cells and that they might be something to do with the phenotypic discrepancy between two mutant models?

      Again we thank the reviewer for the interesting question. As discussed in the discussion section, Trpm5 is also reported to be expressed in B lymphocytes (Sakaguchi et al., 2020). It is possible that loss of Trpm5 modulates the inflammatory responses following viral infection, which may contribute to improved alveolar regeneration. However, it is also possible that Trpm5-/- mice keep a subset of tuft cells that facilitate lung regeneration as suggested by the reviewer.

      6) One of the key findings in this manuscript is that Wnt and Notch signaling play a role in Tuft cell specification. All current experiments are based on pharmacological modulation. These need to be substantiated using genetic gain- and loss-of-function models.

      We have performed the genetic studies.

      Reviewer #2 (Public Review):

      In this manuscript, the authors describe the ectopic differentiation of tuft cells that were derived from lineage-tagged p63+ cells post influenza virus infection. These tuft cells do not appear to proliferate or give rise to other lineages. They then claim that Wnt inhibitors increase the number of tuft cells while inhibiting Notch signaling decreases the number of tuft cells within Krt5+ pods after infection in vitro and in vivo. The authors further show that genetic deletion of Trpm5 in p63+ cells post-infection results in an increase in AT2 and AT1 cells in p63 lineage-tagged cells compared to control. Lastly, they demonstrate that depletion of tuft cells caused by genetic deletion of Pou2f3 in p63+ cells has no effect on the expansion or resolution of Krt5+ pods after infection, implying that tuft cells play no functional role in this process.

      Overall, in vivo and in vitro phenotypes of tuft cells and alveolar cells are clear, but the lack of detailed cellular characterization and molecular mechanisms underlying the cellular events limits the value of this study.

      We thank the reviewer for the comments and acknowledging that our findings are clear. In the revised manuscript we provide more detailed characterization and genetic evidence to elucidate the role of tuft cells in lung regeneration.

      1) Origin of tuft cells: Although the authors showed the emergence of ectopic tuft cells derived from labelled p63+ cells after infection, it cannot be ruled out that pre-existing p63+Krt5- intrapulmonary progenitors, as previously reported, can also contribute to tuft cell expansion (Rane et al. 2019; by labelling p63+ cells prior to infection, they showed that the majority of ectopic tuft cells are derived from p63+ cells after viral infection). It would be more informative if the authors show the differentiation of tuft cells derived from p63+Krt5+ cells by tracing Krt5+ cells after infection, which will tell us whether ectopic tuft cells are differentiated from ectopic basal cells within Krt5+ pods induced by virus infection.

      We thank the reviewer for the helpful suggestion. We have performed the experiment accordingly.

      2) Mechanisms of tuft cell differentiation: The authors tried to determine which signaling pathways regulate the differentiation of tuft cells from p63+ cells following infection. Although Wnt/Notch inhibitors affected the number of tuft cells derived from p63+ labelled cells, it remains unclear whether these signals directly modulate differentiation fate. The authors claimed that Wnt inhibition promotes tuft cell differentiation from ectopic basal cells. However, in Fig 3B, Wnt inhibition appears to trigger the expansion of p63+Krt5+ pod cells, resulting in increased tuft cell differentiation rather than directly enhancing tuft cell differentiation. Further, in Fig 3D, Notch inhibition appears to reduce p63+Krt5+ pod cells, resulting in decreased tuft cell differentiation. Importantly, a previous study has reported that Notch signalling is critical for Krt5+ pod expansion following influenza infection (Vaughan et al. 2015; Xi et al. 2017). Notch inhibition reduced Krt5+ pod expansion and induced their differentiation into Sftpc+ AT2 cells. In order to address the direct effect of Wnt/Notch signaling in the differentiation process of tuft cells from EBCs, the authors should provide a more detailed characterization of cellular composition (Krt5+ basal cells, club cells, ciliated cells, AT2 and AT1 cells, etc.) and activity (proliferation) within the pods with/without inhibitors/activators.

      Again we thank the reviewer for the insightful suggestions. We agree that it will be interesting to further address the direct effect of Wnt/Notch signaling in the differentiation process of tuft cells from EBCs. In this revised manuscript we added new findings of EBC differentiation into tuft cells in mice with genetic deletion of Rbpjk.

      3) Impact of Trpm5 deletion in p63+ cells: It is interesting that Trpm5 deletion promotes the expansion of AT2 and AT1 cells derived from labelled p63+ cells following infection. It would be informative to check whether Trpm5 regulates Hif1a and/or Notch activity which has been reported to induce AT2 differentiation from ectopic basal cells (Xi et al. 2017). Although the authors stated that there was no discernible reduction in the size of Krt5+ pods in mutant mice, it would be interesting to investigate the relationship between AT2/AT1 cell retaining pods and the severity of injury (e.g. large Krt5+ pods retain more/less AT2/AT1 cells compared to small pods). What about other cell types, such as club and goblet cells, in Trpm5 mutant pods? Again, it cannot be ruled out that pre-existing p63+Krt5- intrapulmonary progenitor cells can directly convert into AT2/AT1 cells upon Trpm5 deletion rather than p63+Krt5+ cells induced by infection.

      We thank the reviewer for the comments and suggestions. Our new data using the KRT5-CreER mouse line confirmed that pod cells (Krt5+) do not contribute to AT2/AT1 cells, consistent with previous studies (Kanegai et al., 2016; Vaughan et al., 2015). Our data also show that p63-CreER lineage-labeled AT2/AT1 cells are separated from the pod cell area, suggesting that pod cells and these AT2/AT1 cells are generated from different cells of origin. We also checked Notch activity in pod cells in Trpm5-/- mice; some pod cell-derived cells are Hes1 positive, whereas others are Hes1 negative (RLFigure 1). As indicated in the Discussion, we think that AT2/AT1 cells are possibly derived from pre-existing AT2 cells that transiently express p63 after PR8 infection. It will be interesting to test whether Trpm5 regulates Hif1a in this population (p63+, Krt5-), and this is our next plan.

      RLFigure 1. Representative area staining in Trpm5-/- mice at 30 dpi. Area 1: Notch signaling is active (Hes1+, arrows) in pod cells following viral infection. Area 2: pod cells exhibit reduced Notch activities. Note few Hes1+ cells in pods (arrows). Scale bar: 50 µm.

      4) Ectopic tuft cells in COVID-19 lungs: The previous study by the authors' group revealed the presence of ectopic tuft cells in COVID-19 patient samples (Melms et al. 2021). There appears to be no additional information in this manuscript.

      In Melms et al. (Nature, 2021), we showed tuft cell expansion in COVID-19 lungs but not the potential origin of the tuft cells. In this manuscript we show some cells co-expressing POU2F3 and KRT5, suggesting pod-to-tuft cell differentiation.

      5) Quantification information and method: Overall, the quantification method should be clarified throughout the manuscript. Further, in the method section, the authors stated that the production of various airway epithelial cell types was counted and quantified on at least 5 "random" fields of view. However, virus infection causes spatially heterogeneous injury, resulting in a difficult to measure "blind test". The authors should address how they dealt with this issue.

      We have clarified the quantification method as suggested. For the in vitro cell culture assays on the signaling pathways, we took pictures from at least five random fields of view for quantification. For lung sections, we tile-scanned the lung sections including at least three lung lobes and performed quantification.

      Reviewer #3 (Public Review):

      In this manuscript Huang et al. study how the lung regenerates after severe injury due to viral infection. They focus on how tuft cells may affect regeneration of the lung by ectopic basal cells and come to the conclusion that they are not required. The manuscript is intriguing but also very puzzling. The authors claim they are specifically targeting ectopic basal progenitor cells and show that they can regenerate the alveolar epithelium in the lung following severe injury. However, it is not clear that the p63-CreERT2 line the authors are using only labels ectopic basal cells. The question is what is a basal cell? Is an ectopic basal progenitor cell only defined by Trp63 expression?

      The accompanying manuscript by Barr et al. uses a Krt5-CreERT2 line to target ectopic basal cells and using that tool the authors do not see a significant contribution of ectopic basal cells towards alveolar epithelial regeneration. As such the claim that ectopic basal cell progenitors drive alveolar epithelial regeneration is not well-founded.

      We thank the reviewer for the positive comments and for agreeing that our findings are interesting.

      The title itself is also not very informative and is a bit misleading. That being said I think the manuscript is still very interesting and can likely easily be improved through a better validation of which cells the p63-CreERT2 tool is targeting.

      We have revised the title accordingly and performed extensive experiments to address the reviewer’s concerns.

      I, therefore, suggest the following experiments.

      1) Please analyze which cells p63-CreERT2 labels immediately after PR8 and tamoxifen treatment. Are all the tdTomato labeled cells also Krt5 and p63 positive or are some alveolar epithelial cells or other airway cell types also labeled?

      We thank the reviewer for the question. To answer it, we performed PR8 infection (250 pfu) on three Trp63-CreERT2;R26tdT mice and TMX treatment at days 5 and 7 post viral infection. We didn't perform TMX injection immediately as the mice were sick for a few days post infection. The lung samples were collected at 14 dpi. We observed that tdT+ cells are present in the airways (rebuttal letter RLFigure 2A, B), and it appears that the lineage-labeled cells (tdT+) include club cells (CC10+) that are underlain by tdT+Krt5+ basal cells (RLFigure 2C). We think that these labeled basal cells give rise to club cells. However, we also noticed that rare club cells and ciliated cells (FoxJ1+) are labeled by tdT in areas lacking surrounding tdT+ basal cells (RLFigure 2D). Moreover, a minor population of tdT+ SPC+ cells is present in the terminal airways that were disrupted by viral infection (RLFigure 2E and D). We did not see any pods formed in this experiment and we did not observe any tdT+ cells in the intact alveoli (uninjured area).

      RLFigure 2. Trp63-CreERT2 lineage-labeled cells are present in the airways but not the alveoli when tamoxifen was administered at days 5 and 7 after PR8 H1N1 viral infection. Trp63-CreERT2;R26-tdT mice were infected with PR8 at 250 pfu and Tmx was delivered at a dose of 0.25 mg/g bodyweight by oral gavage. Lung samples were collected and analyzed at 14 dpi. Antibodies used for staining are as indicated. Scale bar: 100 µm.

      2) Please also show if p63-CreERT2 labels any cells in the adult lung parenchyma in the absence of injury after tamoxifen treatment.

      Dr. Wellington Cardoso’s group demonstrated that Trp63-CreERT2 only labels very few cells in the airways but not the lung parenchyma in the absence of injury after tamoxifen treatment (Yang et al., 2018). Dr. Ying Yang has revisited the data and she did not observe any labeling in the lung parenchyma (n = 2).

      3) Please analyze if p63-CreERT2 labels any cells with tdTomato in the absence of injury or after PR8 infection but without tamoxifen treatment.

      We performed the experiment and didn't observe any labeled cells in the lung parenchyma without Tamoxifen treatment (n = 4).

      4) Please analyze when after PR8 infection do the first p63-CreERT2 labeled tdTomato positive alveolar epithelial cells appear.

      We administered tamoxifen at days 5 and 7 after PR8 infection and harvested lung tissues at day 14. As shown in Figure 1, we observed a few tdT+ SPC+ cells in the terminal airways that are disrupted by viral infection. Notably, we did not observe any lineage-labeled cells in the intact alveoli (uninjured) in this experiment.

      5) A clonal analysis of p63-CreERT2 labeled cells using a confetti reporter might also help interpret the origin of p63-CreERT2 labeled cells.

      We thank the reviewer for the suggestion. Our new data demonstrate that a rare population of SPC+tdT+ cells is present in the disrupted terminal airways of Trp63-CreERT2;R26tdT mice. Our data in the original manuscript and the new data suggest that the initial SPC+;tdT+ cells are rare because we have to administer multiple doses of tamoxifen to label them. Given the lower labeling efficiency of confetti compared with R26tdT mice, it is possible that we will not be able to label these SPC+ cells. Moreover, our original manuscript clearly shows individual clones of SPC+tdT+ cells in the regenerated lung, and they do not seem to be composed of multiple clones. Therefore we think that the use of confetti mice may not add new information.

      6) Lastly, could the authors compare the single-cell RNAseq transcription profile of p63-CreERT2 labeled cells immediately after PR8 and tamoxifen treatment and also at 60 dpi? A pseudotime analysis and trajectory inference analysis could help elucidate the identity of p63-CreERT2 labeled cells that are actually not ectopic basal progenitor cells.

      We appreciate the reviewer’s suggestion and agree that single cell RNA sequencing with pseudotime analysis can provide further information regarding the origin of the lineage-labeled alveolar cells of Trp63-CreERT2;R26tdT mice. That said, our new data clearly show that KRT5-CreER lineage-labeled cells do not give rise to AT1/2 cells, as previously described (Kanegai et al., 2016; Vaughan et al., 2015), suggesting that the ectopic basal progenitor cells do not generate alveolar cells. By contrast, Trp63-CreERT2 lineage-labeled cells do give rise to AECs, suggesting that the p63+ cell population capable of generating AECs is different from Krt5+ ectopic basal progenitor cells. Our single cell core has an extremely long waiting list due to the pandemic and we hope that our new findings are enough to address the reviewer’s concern without the need for single cell analysis.

    1. Author Response

      Reviewer #2 (Public Review):

      Activation of TEAD-dependent transcription by YAP/TAZ has been implicated in the development and progression of a significant number of malignancies. For example, loss of function mutations in NF2 or LATS1/2 (known upstream regulators that promote YAP phosphorylation and its retention and degradation in the cytoplasm) promote YAP nuclear entry and association with TEAD to drive oncogenic gene transcription and occurs in >70% of mesothelioma patients. High levels of nuclear YAP have also been reported for a number of other cancer cell types. As such, the YAP-TEAD complex represents a promising target for drug discovery and therapeutic intervention. Based on the recently reported essential functional role for TEAD palmitoylation at a conserved cysteine site, several groups have successfully targeted this site using both reversible binding non-covalent TEAD inhibitors (i.e., flufenamic acid (FA), MGH-CP1, compound 2 and VT101~107), as well as covalent TEAD inhibitors (i.e., TED-347, DC-TEADin02, and K-975), which have been demonstrated to inhibit YAP-TEAD function and display antitumor activity in cells and in vivo.

      Here, Fan et al. disclose the development of covalent TEAD inhibitors and report on the therapeutic potential of this class of agents in the treatment of TEAD-YAP-driven cancers (e.g., malignant pleural mesothelioma (MPM)). Optimized derivatives of a previously reported flufenamic acid-based acrylamide electrophilic warhead-containing TEAD inhibitor (MYF-01-37, Kurppa et al. 2020 Cancer Cell), which display improved biochemical- and cell-based potency or mouse pharmacokinetic profiles (MYF-03-69 and MYF-03-176) are described and characterized.

      Strengths:

      All of the authors' claims and conclusions are very well supported and justified by the data that is provided. Clear improvements in biochemical- and cell-based potencies have been made within the compound series. Cell-based selective activities in the HIPPO pathway defective versus normal/control cell types are established. Transcriptional effects and the regulation of BMF proapoptotic mRNA levels are characterized. A 1.68 Å X-ray co-crystal structure of MYF-03-69 covalently bound to TEAD1 via Cys359 is provided. In vivo efficacy in a relevant xenograft is demonstrated, using a 30 mg/kg, BID PO dose.

      We thank the reviewer for appreciating and highlighting the strengths of our study.

      Weaknesses:

      Beyond the impact on BMF gene regulation, new biological insights reported here for this compound series are moderate. Progress and differentiation with respect to activity and/or ADME PK profiles relative to the very closely related and previously described (Keneda et al. 2020 Am J Cancer Res 10:4399. PMID 33415007) acrylamide-based covalent TEAD inhibitor K-975 (identical 11 nM cell-based potencies when compared head-to-head and identical reported in vivo efficacy doses of 30 mg/kg) is not entirely clear. Demonstration of on-target in vivo activity is lacking (e.g., impact on BMF gene expression at the evaluated exposure levels).

      We thank the reviewer for the question. We have compared the mouse liver microsome stability and hepatocyte stability of K-975 and MYF-03-176 and found that K-975 is metabolically less stable.

      Consistent with this, when NCI-H226 cell-derived xenograft mice were dosed with 30 mg/kg K-975 twice daily, the tumors kept growing and reached more than 1.5-fold volume by the 14th day, whereas at the same dosage MYF-03-176 produced significant tumor regression. K-975 did not reach such efficacy even at 100 or 300 mg/kg twice daily, in either the NCI-H226 or the MSTO-211H CDX mouse model, according to the paper (Keneda et al. 2020 Am J Cancer Res 10:4399).

      To demonstrate on-target in vivo activity, we tested expression of the TEAD downstream genes and BMF in tumor samples after 3-day BID treatment (PD study). We observed a reduction of CTGF, CYR61 and ANKRD1 and an increase of BMF, which indicates on-target activity in vivo.

    1. Author Response:

      We would like to thank the reviewers for a very thorough and careful analysis of our manuscript. All the comments and suggestions were taken to heart, and we feel that our revised manuscript is vastly improved because the reviewers clearly put in a significant effort to help us interpret and clarify our conclusions. We appreciate that the reviewers took the time to help us convey our results within the context of the field. Understanding and working with BRCA2 variants of uncertain significance is challenging and complex, and we strive to report accurate and solid data that will help predict cancer risk and guide targeted therapies for patients.

      We appreciate the comment by reviewer 1 stating: “Identification of truly pathogenic BRCA2 missense mutations is a challenging but very important task for early diagnostics.” Our goal is to define cancer risk to prevent tumor formation and to promote personalized medicine with targeted therapies for homologous recombination deficient tumors. In future studies, we will expand these analyses to potentially pathogenic mutations of BRCA1 and PALB2.

      We noticed a few mistakes in the public review by reviewer 1:

      1. In the text it says RAD52, but it should say RAD51. This is the sentence: “Using an impressive array of cellular and biochemical approaches they demonstrated that the first two BRCA2 mutants have a detrimental effect on RAD52-dependent DNA repair, and therefore likely to be pathogenic”

      2. In the text it says T1980I instead of T1346I. This is the sentence: “In contrast, T1980I seems to have no effect on DNA repair in various tested assays and is likely to be a passenger mutation.”

      We thank reviewer 2 for the thoughtful comments and questions and we agree that this paper is important to the field of homologous recombination, replication, genome stability maintenance, DNA double strand break repair and classification of VUS. We wish that the full-length BRCA2 S1221P mutant could have been purified to provide us with deeper mechanistic insights into how this mutant affects BRCA2 biochemical functions.

      We appreciate the constructive question about the impact of the results raised by reviewer 3, and we acknowledge the immense efforts from different laboratories to classify VUS over the years with different approaches (segregation studies, protein prediction algorithms, viability analysis in ES, HDR reporters, in vitro analysis, etc.). However, we see the need for rigorous and thorough in vitro and cell-based analyses to understand fundamental BRCA2 biology and to better classify VUS through a more comprehensive analysis of altered BRCA2 functions.

      To answer the comments and questions raised by reviewer 3, we have expanded the introduction, results, and discussion in the manuscript. In brief, our comprehensive analysis of three independent variants located in the BRC repeats of BRCA2 highlights the importance of using multiple analyses to understand altered BRCA2 functions, because there is no single assay to measure BRCA2 tumor suppression activity. We had two goals: 1) to identify potentially pathogenic variants with important clinical implications for cancer risk in patients and 2) to leverage deleterious variants to uncover the specific functions carried out by individual BRC repeats.

      As an example of our answers to the questions of reviewer 3, we incorporate the last paragraph of our discussion here:

      “Clinical integration of functional assays into the genetic counseling setting is an important goal but should be met with caution until we fully understand how specific variants impact the tumor suppressor functions of BRCA2. The use of robust and accurate functional assays will be essential to correctly evaluate BRCA2 VUS. Our study demonstrated that novel pathogenic variants exist not only in the DBD domain of BRCA2 but also in the BRC region leading to defects in RAD51 binding, activity, and subsequent HDR deficiencies. Mechanistic studies leveraging patient variants will continue to reveal the many functions of BRCA2.”

    1. Author Response

      Reviewer #1 (Public Review):

      This paper tackles a very important question in somatosensory biology - the identity of the sodium channel controlling excitability in proprioceptors. While whole rainforests' worth of papers have been published on sodium channels in nociceptors, there has been a significant gap in our understanding of which NaV isoforms are at play in the large fiber proprioceptors and LTMRs. Using pharmacology, gene KO, behavior, and histology, the authors show quite convincingly that NaV1.1 in sensory neurons is essential for normal motor behavior and contributes to proprioceptor excitability. Interestingly, they find NaV1.1 is haploinsufficient. This finding is all the more exciting given the many human NaV1.1 het and homo mutants and points to future possibilities for interrogating the role of this channel in human proprioception and using human tissue (e.g. iPSCs).

      We are delighted that the Reviewer finds our results address a “very important question in somatosensory biology”.

      Reviewer #2 (Public Review):

      The manuscript by [Espino et al, 2022] characterizes the role of the sodium channel Nav1.1 in DRG sensory neurons, focusing on its role in proprioceptive sensory neurons. Nav1.1 expression has previously been observed in myelinated DRG neurons (including proprioceptive muscle afferents) but its significance for proprioceptive function remains unknown. In a series of molecular and in vitro patch clamp studies (using pharmacological Nav1.1 inhibitors and activators), the authors demonstrate that all proprioceptors express Nav1.1 and that this sodium channel is required for repetitive firing in the majority of proprioceptors. A pan sensory conditional deletion of Nav1.1 leads to a loss in motor coordination, suggesting that Nav1.1 in sensory neurons is required for normal motor control. While this is a somewhat generic and slightly unsatisfying conclusion, further morphological studies and ex vivo electrophysiological recordings of functionally identified muscle spindle afferents begin to offer a more interesting take on the role of Nav1.1 in proprioceptor function. First, while proprioceptor number and spindle morphology are unchanged, it appears as if the number of synapses between muscle spindle afferents to motor neurons is reduced, perhaps suggesting that a reduction in proprioceptor excitability during development affects the formation of proprioceptive sensory-motor circuits. Second, ex vivo recordings of MS afferents indicate that the loss of Nav1.1 primarily affects the static phase of their response to increases in muscle length, suggesting a role in the regulation of proprioceptor slow adaptation response properties.

      There are two clear strengths of the manuscript. First, mutations in Nav1.1 have been shown to be associated with a number of central brain disorders, including those that lead to motor impairments. The notion that a sensory neuron restricted loss of Nav1.1 similarly leads to motor coordination defects indicates that some phenotypes that previously had been suggested to be due to a central role of Nav1.1 could in fact have a peripheral basis. A second strength is that these studies further our understanding of the molecules that regulate excitability in proprioceptors and offer a foundation for further work to tease apart the molecular underpinnings of the physiological response properties of individual proprioceptor subtypes.

      While the studies generally support the main conclusion that Nav1.1 in mammalian sensory neurons is required for normal motor behaviors, the depth of some of the analyses leaves a bit more to be desired. For example, it seems that a little more could have been done to strengthen the in vitro analyses of Nav1.1 in proprioceptors with additional controls, and by expanding this analysis to genetically identified Nav1.1 mutant (heterozygous or homozygous) proprioceptors. In addition, it feels a bit of a missed opportunity that there is no further exploration of the relationship of Nav1.1 function in the context of specific proprioceptor subtypes (even if only through discussion). In addition, the observation that a loss in Nav1.1 may cause disruptions in sensory-motor connectivity could benefit from additional analyses to support these findings.

      We thank the reviewer for identifying the strengths of our study and pointing out that these new findings will “offer a foundation for further work”. We are eager to continue this line of investigation and are currently developing new approaches in the lab that will allow us to deepen our analyses in the future.

      Reviewer #3 (Public Review):

      The authors characterize the role of voltage-gated sodium channel Nav1.1 expression in proprioceptors in the peripheral nervous system. They use genetically modified mice, pharmacological blockers, and electrophysiological methods to support their claims. Albeit it was known for a long time that Nav1.1 is expressed in the peripheral nervous system, Espino et al. here present a thorough characterization of its role in proprioception and show its importance for motor behaviour, proprioceptor function, and synaptic transmission in the spinal cord. Characterizing the sodium channel subtype's function is crucial for our understanding of the function and dysfunction of the nervous system and to potentially develop new therapeutic approaches.

      We thank the Reviewer for their comments on the importance of our work investigating sodium channel function in proprioceptors.

    1. Author Response

      Reviewer #1 (Public Review):

      This paper is a follow-up of the authors' previous paper (2018), in which they carefully described the organisation of the junctions between cells of the adult Drosophila midgut epithelium and their control from the basal side by integrin signalling. Here, the authors used state-of-the-art imaging and genetics to unravel step-by-step the events leading from an initially unpolarised cell to an epithelial cell that integrates into the existing epithelium. Many of the images are accompanied by cartoons, which help the reader to better understand the images and follow the conclusions. It would, however, have been helpful, in particular with respect to the mutant phenotypes described later, if they had named each of the steps/stages. In addition, mentioning the timescale would give an idea about the temporal frame in which this process elapses.

      We have used terms such as “unpolarised cells, polarised Actin/Cno” to label different stages in Figure 6, since this sequence of steps is inferred from results obtained from fixed samples with still images. We have illustrated the septate junction mutant phenotype in Figure 8I.

      We have also performed a new experiment to estimate the time taken for an activated EB to form a PAC and to become a mature enterocyte, by overexpressing Sox21a with esg[ts]>GFP to induce enteroblast differentiation. Counting the number of GFP+ve cells without a PAC, with a PAC and with a full apical domain at different time points suggests that activated EBs take about a day to form a PAC and another day to form a fully-integrated enterocyte. We have summarised the results in Figure 5-figure supplement 1C.

      We have also included this result in the main text as: “To estimate the time taken for enteroblasts to progress to pre-enterocytes with a PAC, and for pre-enterocytes to become enterocytes, we induced enterocyte differentiation by over-expressing UAS-Sox21a under the control of esg[ts]-Gal4 and counted the number of GFP+ve cells without a PAC or apical domain, with a PAC and with a full apical domain at different time points after induction (Chen et al., 2016; Meng and Biteau, 2015; Zhai et al., 2017). 17 hours after shifting the flies to 25ºC to inactivate Gal80ts, almost no GFP+ve cells had progressed to pre-EC with a PAC (0.1%) or EC (1%), and these few cells probably started to differentiate before Sox21a induction. 24 hours later, 10% of the GFP+ve cells had developed into pre-ECs with a PAC and 20% had become ECs (Figure 5-figure supplement 1B-C). After an additional 24 hours, the number of cells with a PAC fell to 1%, whereas 50% were ECs. Assuming that it takes 12-17 hours to induce high levels of Sox21a expression, these results suggest that most activated EBs take about 24 hours to develop into a pre-EC with a PAC and a further 24 hours to differentiate into a mature EC, although some cells differentiate faster. This time frame is in agreement with a previous study using similar approaches to accelerate differentiation (Rojas Villa et al., 2019) and a recent live imaging study tracing the enteroblast to enterocyte transition (Tang et al., 2021). These results also indicate that down-regulation of Sox21a is not essential for enteroblast to pre-enterocyte differentiation, since enteroblasts overexpressing Sox21a still form a PAC (Figure 5-figure supplement 1B).”

      The authors convincingly show that septate junctions are instrumental for proper polarisation and integration of the enteroblast. However, while they nicely showed that Canoe is neither required in the enteroblast nor in the enterocytes for this process, it remains unclear whether septate junction proteins are required in enteroblast or in enterocytes or in both and at which particular step the process fails in the mutant.

      Early stage enteroblasts neither express nor require septate junction proteins, whereas late stage enteroblasts and pre-enterocytes do (Chen et al., 2020; Hung et al., 2020; Izumi et al., 2019; Xu et al., 2019). Since cells mutant for septate junction proteins do not develop into mature enterocytes with an apical domain facing the gut lumen, we cannot answer the reviewer’s question of whether septate junction proteins are required in enterocytes.

      As we discussed in the paper, we think that “differentiating enteroblasts only require a basal cue to establish their initial apical-basal polarity, whereas the formation of the pre-assembled apical compartment also requires a junctional cue. The septate junctions are not necessary for apical domain formation per se, however, as mesh mutant enteroblasts form a fully developed apical domain with a brush border inside the cell. This suggests that septate junctions define the site of apical domain formation by delimiting the region where apical membrane proteins are secreted to assemble the brush border, but do not control the process of apical domain formation directly.”

      Reviewer #2 (Public Review):

      The authors recently showed that the polarization of the cells of the adult Drosophila midgut does not require any of the canonical epithelial polarity factors, and instead depends on basal cues from adhesion to the ECM, as well as septate junction proteins (Chen et al, 2018). Here they extend this research to examine in greater detail precisely how midgut epithelial cells integrate in the pre-existing epithelium and become polarized. Surprisingly, they show that enteroblasts form an apical membrane initiation site prior to polarizing. Furthermore, they show that this develops into a pre-apical compartment containing a fully formed brush border. This is a very interesting finding - it explains how integrating enteroblasts can integrate into a pre-existing epithelium without disrupting barrier function. The conclusions of this paper are mostly well supported by data, but some aspects could do with being clarified and extended as outlined below.

      Model presented in Figure 6

      While the separation of membranes indicated in Figure 6 steps 3-5 can be seen in the image shown in Figure 3B, this is one of the only images which supports the idea that there is a separation of membranes between the enteroblast and overlying enterocytes during PAC formation. Is the model in Figure 6 supported by EM data - can you see a region where there is brush border and separation of cells? Supplementing Figure 3 with corresponding EM images would greatly aid the reader in interpreting the data and strengthen the model.

      We think that AJ clearing and membrane separation is a brief process that is quickly followed by the separation of the apical and junctional proteins and apical secretion at the AMIS to form the PAC. We have not captured this stage in our EM images, but have many other examples that show this step (e.g., Figure 4C and Figure 8F). Another example is shown below.

      A key step in the model is that the clearance of E-Cadherin from the apical membrane leads to a loss of adhesion between the enteroblast and the overlying enterocytes. This would need to be supported by functional data such as overexpression of E-Cad or E-CadDN in enteroblasts or by generating shg mutant clones. If the model is correct, perturbing E-Cad levels in enteroblasts should lead to defects in PAC formation, such as loss of de-adhesion/early de-adhesion/excessive de-adhesion.

      We think it is the local clearance of ECad from the apical membrane, not the downregulation of the total level of ECad, that is important for local membrane separation and future PAC formation. The experiment of overexpressing ECad or ECad-DN proposed by the reviewer might be crucial to demonstrate the importance of the total amount of ECad, but might not be very helpful in determining the importance of membrane separation in PAC formation. Moreover, AJ formation in the fly midgut epithelium does not depend on ECad, suggesting that ECad and NCad act redundantly, which further complicates this approach (Choi et al., 2011; Liang et al., 2017).

      Role for the septate junction proteins

      Septate junction proteins were previously shown by these authors to be required for enteroblast polarization and integration into the midgut epithelium (Chen et al, 2018). Here they extend this by examining enteroblasts mutant for septate junction proteins, and conclude that septate junction proteins are required for normal PAC formation. However, it is not clear what aspect of the polarization of the enteroblasts is disrupted, because a number of mesh mutant cells (albeit a lower proportion than in wildtype) do form PACs. The main phenotype seems to be that cells fail to polarize (as previously reported) or have internalised PACs. It is hard to know what to conclude from this data about the role of the septate junction components in PAC formation.

      The major phenotype of the septate junction mutants is the loss of polarity, i.e. an inability to form an apical domain and integrate into the epithelial layer, as shown in Figure 8. Neither mesh nor Tsp2a mutants can form a PAC, even though mesh mutant cells have a higher propensity to form an internal PAC-like structure (Figure 8B,C,E,G,H, Figure 8-figure supplement 1L). Thus, we think that septate junctions are required for AMIS and PAC formation. What complicates the interpretation is that some (6-20%) septate junction mutant cells do form an AMIS-like structure (Figure 8D-F, Figure 8-figure supplement 1F&K). The simplest explanation for this result is that this is due to perdurance of the wild-type proteins after clone induction, with the weaker phenotype of ssk mutants being due to longer perdurance of this protein. However, we cannot rule out the alternative explanation that AMIS and PAC formation is facilitated by the septate junction proteins, but that they can still form very inefficiently in their absence.

      We realise that this section was quite confusing in the original version of the manuscript and have now re-written it to make this interpretation clearer.

      Coracle is used as a readout for the localization of septate junction components, yet the staining for Cora in Figure S3B looks quite different to Mesh in S3D. If Cora is to be used as a readout for the localization of septate junction components, then staining for Cora/Mesh and/or Cora/SSk or Tsp2a should be shown.

      When discussing the requirement for septate junctions for enteroblast integration - Coracle and Mesh are used interchangeably - but as mentioned before, it is not clear if they colocalize, or if their localization is interdependent (as demonstrated for Mesh, Tsp2a and Ssk in Figure 7). What is the phenotype of enteroblasts mutant for cora?

      Following from the previous point - while it is clear that Coracle is apical early during AMIS formation, it is not clear if Mesh, Tsp2a and Ssk also are, yet these are the mutants that are examined for a role in AMIS/PAC formation. It would be good to know whether the loss of cora would lead to defects in AMIS formation.

      The reason we used mainly Coracle as a marker for the septate junctions is that Mesh and Tsp2A localise to the basal labyrinth as well as to the septate junctions which could confuse the reader. We have now added new panels to Figure 3-figure supplement 3E&F showing the colocalization of Cora with Mesh/Tsp2a at the septate junctions and during the crucial stages of PAC formation.

      Additional Results:

      "Coracle is a peripheral septate junction protein whose localisation depends on the structural septate junction components such as Mesh/Ssk/Tsp2a (Chen et al., 2018; Izumi et al., 2016, 2012). Cora antibody staining provides a clearer marker for the septate junctions than Mesh or Tsp2a antibody staining, because the latter also label the basal labyrinth (Figure 3-figure supplement 1E&F). To determine whether Cora is required for PAC formation or epithelial polarity in the adult midgut, we generated a null mutant allele with a premature stop codon in the FERM domain using CRISPR. Cells mutant for this allele, corajc, or a second cora null allele, cora5, can form a PAC, septate junctions and a full apical domain, indicating that Cora is also not required for enteroblast integration or enterocyte polarity (Figure 7F&G, Figure 7-figure supplement 1E-H)."

      Additional Materials and Methods:

      We used the CRISPR/Cas9 method (Bassett and Liu, 2014) to generate null alleles of canoe and coracle. sgRNA was in vitro transcribed from a DNA template created by PCR from two partially complementary primers:

      forward primer:

      For coracle:

      5′-GAAATTAATACGACTCACTATAGAAGCTGGCCATGTACGGCGGTTTTAGAGCTAGAAATAGC-3′;

      The sgRNA was injected into…Act5c-Cas9 embryos to generate coracle null alleles (Port et al., 2014). Putative…coracle mutants in the progeny of the injected embryos were recovered, balanced, and sequenced. …The coraclejc allele contains a 2bp deletion around the CRISPR site, resulting in a frameshift that leads to a stop codon at amino acid 225 in the middle of the FERM domain, which is shared by all isoforms. No Coracle protein was detectable by antibody (DSHB C615.16) staining in either midgut or follicle cell clones. The coraclejc allele was recombined with FRT G13 to make the FRTG13 coraclejc flies.

      It is unclear what is happening in Figure 8A,C,E, S7D. Is that a detachment phenotype or an integration phenotype? Are the majority of cells unpolarised due to loss of integrin attachment rather than failure to form an AMIS/PAC?

      Cells mutant for septate junction proteins do not detach from the basement membrane and still localise Talin basally, as illustrated by the new panel we have added (Figure 8-figure supplement 1N), showing Talin localisation in a Tsp2a mutant cell.

      However, because the mutant cells cannot integrate and remain stuck beneath the septate junctions between the enterocytes, they sometimes become displaced from a portion of the basement membrane by younger EBs that derive from the same mutant ISC, leading to a pile up of cells in the basal region of the epithelium (e.g. Figure 8A, E and H).

      We have added the following sentences to the Results, explaining these points:

      "Because the mutant cells remain trapped beneath enterocyte-enterocyte septate junctions, they accumulate in the basal region of the epithelium, with new EBs derived from the same mutant ISC forming beneath them and reducing their contact with the basement membrane (Figure 8A)."

      "The majority of cells mutant for septate junction components fail to polarise or form an AMIS, although they form normal lateral and basal domains, as the basal integrin signalling component, Talin, localises normally (Figure 8-figure supplement 1N)."

      It is unclear whether enteroblasts really pass through an 'unpolarized stage'. In Figure 6, when they are described as 'unpolarised', they clearly have distinct basal and AJ domains. In septate junction mutants, when cells are classified as unpolarized, do they still have distinct regions of integrin/E-Cad expression?

      This is a semantic question. We agree that they have distinct lateral and basal domains, but they do not have an apical domain. In this respect, these "unpolarised" cells are similar to a mesenchymal fibroblast migrating on a substrate, which has a distinct basal side contacting the substrate that is different from the non-contacting regions of the cell surface. They also match the description of the migratory, "mesenchymal" enteroblasts (Antonello et al., 2015). To make this clearer, we have added the following notes to the legend for Figure 6: "“Unpolarised” in the second panel of this figure indicates that the enteroblast has not formed a distinct apical domain. At this stage, no marker is clearly apically localised. “Unpolarised” or “polarised” in the third and fourth panels describe the localisation of marker proteins, such as Actin and Cno."

    1. Author Response

      Reviewer #1 (Public Review):

      Solving the puzzle of this paper was clearly not easy, and the authors used an impressive set of tools and statistical methods to get to the bottom of what they observed in a very creative way. However, the presentation of the manuscript and its relevance could perhaps be improved.

      We are pleased to see that the referee was favorably impressed. We hope that this revision has improved the presentation of the manuscript and has clarified its relevance.

      First, I find the arguments in some parts of the manuscript to be a bit awkwardly formulated. For example, there is much discussion about social evolution and the paradox of why cells invest into rhamnolipid production, but this does not seem to be the topic of the paper, which focuses more on understanding P. aeruginosa's metabolism. Instead, there is very little discussion about the origin of these isolates and to what extent these findings may be relevant for P. aeruginosa's natural environment. I understand that this may be very speculative, but there could at least be more discussion on why glycerol was chosen as a growth medium, and what would happen if a more realistic growth medium were used instead. What environment does this bacterium experience and might it be surrounded by other species that could reduce oxidative stress?

      We understand that the referee would like a broader analysis of how the growth environment impacts surfactant secretion. We have added an entirely new section titled “Mathematical model predicts impact of carbon sources on surfactant production” at the end of the results section to address this issue (p. 16-19, l. 358-435). In the new section we present new data on how a range of carbon sources, beyond glycerol, impact P. aeruginosa growth and biosurfactant secretion. We then use our model to determine which carbon sources favor secretion, and we identify D-glucose as being better than glycerol. This new experimental and computational work refines our model to explain surfactant secretion more broadly than in glycerol alone. The biosynthesis of this secondary metabolite is favored when the carbon and energy source imposes a low burden on the primary metabolism. We also investigated the rhlAB expression dynamics in PA14 in glucose to further support our results.

      The overall message of the paper could be clarified: essentially, cells only produce rhamnolipids when they are not experiencing oxidative stress. I am sure the message is more nuanced, but this is not clear from the current abstract.

      We have changed the abstract to clarify our main point: that cells only produce surfactants when they are not experiencing oxidative stress and they have more carbon source than needed for growth.

    1. Author Response

      Reviewer #1 (Public Review):

      Weaknesses:

      The methods section lacks sufficient detail, and arbitrary choices made in the simulation setup may have biased the results. The author's finding that the LR is disordered does not provide obvious mechanistic insights, and the simulations with the bound ligand are too preliminary to make solid conclusions. Although this manuscript is technically strong, the significance of the results is often unclear.

      We did not make “arbitrary choices”. The one setup choice that we made was guided by heuristics, and its adequacy was amply confirmed by the robustness of the simulated system. We have emphasized this in the revision.

      Reviewer #2 (Public Review):

      Strengths:

      1) The authors have focused on the LR region of TSHR and perform rigorous MD simulations to identify its various conformers and tried to give a reasoning for this observation. The authors also showed the stability of LR increased in the presence of the ligand, TSH.

      2) The authors have done many simulations of the TMD helix bundle and meticulously tried to quantitate the differences by assessing the changes in helix length, radius and angles.

      Weaknesses:

      1) Although the focus of the paper was the full model of the TSHR, the authors have broken down the whole protein into smaller sequences and have done separate simulations, and discussed the result. The whole picture of the TSHR is not clear. For example, in Figure 5, the various conformations (and secondary structures) of only the LR are shown at different times. For the TMD helix bundle, separate tables have been shown, focusing only on the TMD.

      The whole picture of the TSHR is shown in Figure 1. The reason the TMD is not shown in Figure 5 is that there is little variation in the TMD (as analyzed in detail in Tables 1-3), and not including it allowed us to show more detail in the ectodomain.

      2) The authors have analyzed the cysteines in the LR during simulation, showing the propensity for various pairs of disulfide formation. However, the authors have not further discussed this point. Can this information be used to better guide the modelling process?

      We have added a statement in the Discussion section suggesting that the closeness of these cysteines during the simulation indicates that they indeed should form disulfide bonds. Furthermore, their separation when TSH was introduced indicated the likely role of these disulfide bonds in signal transduction.

      3) Based on the data in this manuscript, the authors claim that the LR domain makes significant contact with the TSH ligand. However, if one refers to the crystal structure of the FSH ligand with the ectodomain of its receptor (pdb: 4AY9), the corresponding loop for the LR is missing, suggesting that the interaction between this loop (LR) and the ligand is either very weak or absent.

      While our results on the TSH-TSHR complex are still preliminary, we pointed out in the revision that (a) the TSH-LR contact we see involves the parts of the LR that are missing in the FSH-FSHR structure, (b) there is only 39.3% sequence identity between the LR of TSHR and FSHR, and (c) the large fluctuations we see in the LR conformations suggest that it is very unlikely that the contacts seen are artifacts of the initial structure.

    1. Author Response

      Reviewer #2 (Public Review):

      The manuscript presents an interesting study on a timely topic (hyperacusis). The study was carried out in awake animals using modern approaches in neurosciences (calcium imaging, optogenetic). The amount of data is impressive, the study is very ambitious, and overall its quality is indisputable. However, I have some general comments and questions on some concepts that are critical for the study, and also on the interpretation of the data, in particular the behavioral data.

      We appreciate Reviewer 2’s overall positive evaluation as well as their more specific critiques, which we address below.

      The first point I want to mention is the concept of 'homeostatic plasticity'. I am not sure we agree on its definition. My understanding of it is that the AVERAGE of central activity will remain constant around a set point value. In case of a reduction of sensory inputs (hearing loss), the neurons' sensitivity will be enhanced in such a way that the averaged activity will be preserved. So, neural hyperactivity after partial or sensory deprivation is not 'maladaptive': it is a collateral effect, 'the price to pay' for maintaining neural activity stable around a given value. In my opinion, this point is crucial. The authors should also mention and cite the model's paper from Schaette et al.

      “Homeostasis” is a term used widely in physiology to describe a negative feedback process in which an internal adjustment compensates for an external perturbation to return a given system (temperature, pH, etc.) to a set point. To the reviewer’s point, homeostatic processes – broadly defined – can work at many different biological scales including perhaps large, distributed systems like the example s/he gave of neurons throughout the central auditory pathway. By contrast, “homeostatic plasticity” is a mechanism studied by dozens of laboratories in hundreds of papers by which neurons (typically studied in cortical neurons) adjust their synaptic and intrinsic excitability to maintain their activity around a set point range. A key feature of homeostatic plasticity is that neurons “sense” deviations from their set point and initiate a compensatory process to offset this deviation. Up to this point, it seems that we are on the same page as the reviewer.

      The first point of possible disagreement lies in the interpretation of how excess neural activity relates to homeostatic plasticity. The reviewer mentioned modeling papers by Schaette and Kempter (2006, 2007, 2012) on the cochlear nucleus, which are also based on homeostatic plasticity and their work is now cited in the revised text (see line 71). The reviewer is correct that there is a difference in how the term is used and interpreted, but the difference is fairly subtle. Their work and our work propose that homeostatic plasticity processes are applied within a single neuron to offset the reduced afferent input that accompanies cochlear damage. As the reviewer recalled, they describe hyperactivity as a consequence of this compensation, as we do as well. The only difference is that they and the reviewer describe hyperactivity as the byproduct of the normal, successful implementation of homeostatic plasticity, which it unequivocally is not because – by definition – homeostatic plasticity is a stabilizing process that maintains activity at a predetermined set point range.

      The second point of disagreement lies in the reviewer’s statement that “neural hyperactivity after partial or sensory deprivation is not 'maladaptive': it is a collateral effect, 'the price to pay' for maintaining neural activity stable around a given value.” We disagree. Hyperactivity can be both a collateral and maladaptive effect. Hyperactivity and hypersynchrony are understood to be the basis of tinnitus, which is a maladaptive, disordered state. The reviewer’s comment implies that there is no alternative for compensating for sensory deprivation but to make cortical neurons hyperactive. We see no reason why this must be so. In fact, stabilization of activity rates after sensory deprivation has been demonstrated in hundreds of studies in the developing visual system. In the adult auditory system, activity in cortical neurons is initially depressed after injury before rebounding to exceed baseline levels (see Resnik Polley 2017 eLife, Asokan 2018 Nat Comm., Resnik Polley 2021 Neuron). It is not obligatory for cortical activity rates to pass through the set point range and continue into hyperactivity, nor is it obligatory for cortical activity rates to remain elevated above baseline many days after the injury. Additional evidence for this point comes from Figures 4, 6, and 8, which show that some cortical neurons actually do homeostatically regulate their activity back to baseline (i.e., show stable gain). This raises the intriguing question of why some neurons recover to their homeostatic activity set point while others do not. Figure 8 provides new insight into this question by showing that their baseline response properties can account for 40% of the variability in gain stabilization after peripheral insult.

      A third point of disagreement relates to the reviewer’s statement that “My understanding of it is that the AVERAGE of central activity will remain constant around a set point value. In case of a reduction of sensory inputs (hearing loss), the neurons' sensitivity will be enhanced in such a way that the averaged activity will be preserved”. We agree that homeostatic plasticity processes are influenced by activity propagating through distributed neural networks. However, the biological implementation of the process is programmed into individual neurons. The activity set point is neuron-specific, the error signal that encodes a deviation from the set point is neuron-specific, and the transcriptional/translational changes deployed to stabilize the activity rate are neuron-specific. As an analogy, home climate control systems work autonomously for each house, because the sensors (thermostat) and actuators (heating/cooling) are sensitive to fluctuations in that home, not across other houses in the town. The heating and cooling systems for each house in town may be driven by a distributed, common source (e.g., a hot day) but the mechanisms that bring the ambient temperature back to the set point for each house are autonomous and reflect the particular thermostat programming for each house. The widely studied homeostatic plasticity mechanisms mentioned in our manuscript (e.g., excitatory synaptic scaling) are not sensitive to and do not target the averaged neural activity among millions of neurons distributed throughout the sensory neuroaxis.
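
      The neuron-specific feedback loop described in this paragraph can be illustrated with a toy sketch. It is not the model used in the manuscript or any published synaptic scaling model: the function name `settle`, the multiplicative update rule, the learning rate, and the set point values are all assumptions chosen only to show that each unit can return to its own set point without reference to a population average.

```python
def settle(drive, set_point, gain=1.0, rate=0.1, steps=300):
    """Toy negative-feedback loop for ONE neuron (hypothetical rule).

    The neuron compares its own activity with its own set point and
    scales its gain multiplicatively until the two match. No
    population-level quantity enters the update.
    """
    for _ in range(steps):
        activity = gain * drive
        # Relative error with respect to this neuron's own set point.
        error = (set_point - activity) / set_point
        gain *= 1.0 + rate * error
    return gain, gain * drive


# Two hypothetical neurons with different set points (arbitrary units).
set_points = [5.0, 10.0]
# Afferent drive is halved for both, mimicking a shared reduction in input.
reduced_drive = [0.5 * sp for sp in set_points]

results = [settle(d, sp) for d, sp in zip(reduced_drive, set_points)]
for sp, (gain, activity) in zip(set_points, results):
    print(f"set point {sp}: final gain {gain:.3f}, final activity {activity:.3f}")
```

      In this sketch a common perturbation (halved drive) is compensated independently by each unit, in the same way that each house in the thermostat analogy regulates its own temperature.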

      As a final point on this statement, there is no demonstration that we are aware of that average central activity remains constant after a reduction of sensory inputs. This would require recording from many neurons across multiple stages of the sensory pathway in a single animal to show that the increased gain at later stages in the system exactly offsets the reduced responsiveness at earlier stages of the system. So, the reviewer’s definition of homeostatic plasticity is based on a general supposition about a distributed process that has never been empirically demonstrated whereas the definition we use is consistent with the mechanisms and terminology used throughout the neuroscience literature (albeit often incorrectly in the hearing loss literature).

      The second point is that a lot is built on the behavioral procedure and d'. I am not convinced that the behavioral procedure (and the d') is a convincing measurement of loudness (and therefore loudness hyperacusis). So, in my opinion, the title may be changed and more importantly the entire spirit of the paper should be modified.

      The reviewer’s critique as well as comments from other reviewers helped us realize that we had used the terms “hyperacusis” and “loudness” imprecisely. We think that is part of the confusion. What we have studied here is auditory hypersensitivity after sensorineural hearing loss, which may or may not be a model of why persons with hyperacusis can exhibit loudness hypersensitivity.

      Once “hyperacusis” and “loudness” have been stripped away from the behavior, we contend that we have a behavioral assay for auditory hypersensitivity, which is the main point of our study. To be clear, the behavioral readout most commonly employed in the animal literature to model hyperacusis is reaction time, which has a less direct relationship to hypersensitivity than does d’. D-prime is widely used as the sensitivity index in detection behaviors. The main advantage of d’ is that it controls for differences in response bias either between subjects or after noise exposure. We used the d’ metric to show that mice can more reliably detect tone levels near their sensation threshold and can more reliably detect direct stimulation of thalamocortical projection neurons after acoustic trauma. These observations provide the framework for all of the neural measurements that follow.
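
      For readers less familiar with the sensitivity index, the following is a minimal sketch of how d′ is conventionally computed from Go-NoGo trial counts under the equal-variance Gaussian signal detection model, as z(hit rate) minus z(false alarm rate). The function name, the example counts, and the log-linear correction for extreme rates are assumptions made for illustration; they are not taken from the analysis code used in the study.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false alarm rate).

    A log-linear correction (add 0.5 to each count, 1 to each total) is
    applied so that rates of exactly 0 or 1 do not give infinite z-scores.
    This correction is an assumption of the sketch, not the study's method.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts: 100 Go (tone) trials and 100 NoGo (catch) trials.
chance = d_prime(hits=50, misses=50, false_alarms=50, correct_rejections=50)
detecting = d_prime(hits=80, misses=20, false_alarms=20, correct_rejections=80)
print(f"responding at chance: d' = {chance:.2f}")
print(f"reliable detection:   d' = {detecting:.2f}")
```

      Because the false alarm rate is subtracted on the z-scale, an animal that simply licks more often on every trial raises both terms and gains little in d′, which is the sense in which the index controls for response bias.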

      On balance, the reviewer was correct that our imprecise use of hyperacusis and loudness was confusing and contradictory. The terms “hyperacusis” and “loudness” now only appear in the manuscript to describe other published findings or to describe what our study does not address. This resulted in several small text changes throughout the manuscript as well as a direct statement about the relationship between our work, loudness, and hyperacusis on Pg. 14, Lns 448-466.

      “While the findings presented here support an association between sensorineural peripheral injury, excess cortical gain, and behavioral hypersensitivity, they should not be interpreted as providing strong evidence for these factors in clinical conditions such as tinnitus or hyperacusis. Our data have nothing to say about tinnitus one way or the other, simply because we never studied a behavior that would indicate phantom sound perception. If anything, one might expect that mice experiencing a chronic phantom sound corresponding in frequency to the region of steeply sloping hearing loss would instead exhibit an increase in false alarms on high-frequency detection blocks after acoustic trauma, but this was not something we observed. Hyperacusis describes a spectrum of aversive auditory qualities including increased perceived loudness of moderate intensity sounds, a decrease in loudness tolerance, discomfort, pain, and even fear of sounds (Pienkowski et al., 2014a). The affective components of hyperacusis are more challenging to index in animals, particularly using head-fixed behaviors, though progress is being made with active avoidance paradigms in freely moving animals (Manohar et al., 2017). Our noise-induced high-frequency sensorineural hearing loss and Go-NoGo operant detection behavior were not designed to model hyperacusis. Hearing loss is not strongly associated with hyperacusis, where many individuals have normal hearing or have a pattern of mild hearing loss that does not correspond to the frequency dependence of their auditory sensitivity (Sheldrake et al., 2015). While the excess central gain and behavioral hypersensitivity we describe here may be related to the sensory component of hyperacusis, this connection is tentative because it was elicited by acoustic trauma and because the detection behavior provides a measure of stimulus salience, but not the perceptual quality of loudness, per se.”

      A lot is derived/interpreted from the results, but I believe there is a lot of over-interpretation. I would suggest the authors be more cautious and moderate in their speculations and conclusions. I would reconfigure the manuscript, and simplify it.

      We believe that the changes mentioned above and in the response to their specific comments below reduce over-interpretation and simplify the manuscript.

      As an example of a change made to moderate the conclusions from our work, we added the following to Pg. 14, Lns 442-447

      “Further, while the perceptual salience (Figure 2) and neural decoding of spared, 8kHz tones (Figure 5) were both enhanced after high-frequency sensorineural hearing loss, these measurements were not performed in the same animals (and therefore not at the same time). Definitive proof that increased cortical gain is the neural substrate for auditory hypersensitivity after hearing loss would require concurrent monitoring and manipulations of cortical activity, which would be an important goal for future experiments.”

      Reviewer #3 (Public Review):

      The study uses a mouse animal model of sensorineural hearing loss after sound overexposure at high frequencies that mimics ageing sensorineural hearing loss in humans. Those mice present behavioural hypersensitivity to mid-frequency tone stimuli that can be recreated with optogenetic stimulation of thalamocortical terminals in the auditory cortex. Chronic calcium imaging in pyramidal neurons in layers 2-3 of the auditory cortex shows reorganization of the tonotopic maps and changes in sound intensity coding in line with the loudness hypersensitivity shown behaviourally. After an initial state of diffuse neural hyperactivity and high correlation between cells in the auditory cortex, changes concentrate in the deafferented high-frequency edge by day 3, especially when using mid-frequency tones as sound stimuli. Those neurons can show homeostatic gain control or non-homeostatic excess gain depending on their previous baseline spontaneous activity, suggesting a specific set of cortical neurons prone to develop hyperactivity following acoustic trauma.

      This study is excellent in the combination of techniques, especially behaviour and calcium chronic imaging. Neural hyperactivity, increase in synchrony, and reorganization of the tonotopic maps in the auditory cortex following peripheral insult in the cochlea has been shown in seminal papers by Jos Eggermont or Dexter Irvine among others, although intensity level changes are a new addition. More importantly, the authors show data that suggest a close association between loudness hypersensitivity perception and an excess of cortical gain after cochlear sensorineural damage, which is the main message of the study.

      The problem is that not all the high-frequency sensorineural hearing loss in humans present hyperacusis and/or tinnitus as co-morbidities, in the same manner that not all animal models of sensorineural hearing loss present combined tinnitus and/or hyperacusis. In fact, among different studies on the topic, there is a consensus that about 2/3rds or 70% of animals with hearing loss develop tinnitus too, but not all of them. A similar scenario may happen with hearing loss and hyperacusis. Therefore, we need to ask whether all the animals in this study develop hyperacusis and tinnitus with the hearing loss or not, and if not, what are the differences in the neural activity between the cases that presented only hearing loss and the cases that presented hearing loss and hyperacusis and/or tinnitus. It could be possible that the proportion of cells showing non-homeostatic excess gain were higher in those cases where tinnitus and hyperacusis were combined with hearing loss.

      We thank the reviewer for her/his careful reading of the original manuscript and many helpful suggestions and critiques that have been addressed in the revision. Both Reviewer 2 and Reviewer 3 understood that we were presenting our high-frequency sensorineural hearing loss manipulation as a way to model the clinical phenomenon of hyperacusis. This was not our intent, and we regret that the wording of the original manuscript gave this impression. In fact, the clinical literature shows that hyperacusis does not have a strong association with hearing loss and moreover our behavioral and neural outcome measures were not designed to index the core phenotype of hyperacusis (a spectrum of sound-evoked distress, disproportionate scaling of loudness with sound level, and sound-evoked pain). Our study addresses the neural and behavioral signatures of auditory hypersensitivity, which is an “upstream” condition that may (or may not) be related to the presentation of clinical phenomena like hyperacusis and tinnitus.

      The reviewer mentions a litmus test for animal models of tinnitus, in which the utility of an animal model for tinnitus would be evaluated in part based on whether a controlled insult only produced a behavioral change suggestive of a chronic phantom percept in a fraction of animals. That may be so, but our study is clearly not modeling tinnitus and we make no claims to this effect in the original or revised manuscript. The Reviewer then goes on to say that “a similar scenario may happen with hearing loss and hyperacusis”. “May” is the operative word here because the association between sensorineural hearing loss and the clinical presentation of hyperacusis is quite weak overall in human subjects, and no study (that we are aware of) has attempted to document the probabilistic appearance of hyperacusis before and after acoustic trauma. So, we really don’t know whether hyperacusis has a probabilistic appearance like tinnitus or is more deterministic like cochlear threshold shift. But, again, the main point is that our experiments make no direct claim about hyperacusis one way or the other, which we now clarify and discuss throughout the revised text, as detailed below.

      We do contend that our experiments allow us to study auditory hypersensitivity, though again there is no precedent or consensus in the literature for expecting auditory hypersensitivity to present probabilistically or deterministically across mice after a controlled insult. Regardless, we agree with the reviewer that it is a very good idea to provide the individual animal data to the reader. We added new panels to Figure 2C to show that an increase in the 8kHz d’ slope after noise exposure (i.e., a change > 1) was observed in 7/7 mice that underwent acoustic trauma but in only 1/6 mice in the sham exposure group, suggesting a deterministic, binary behavioral effect found in every mouse with noise-induced high-frequency sensorineural damage. On the other hand, within the acoustic trauma cohort, 3 mice showed marked increases in the d’ growth slope (> 2) while 4 showed more subtle changes, suggesting a more graded or probabilistic effect. By providing the individual animal data as per the Reviewer’s request, the reader can now make a more informed determination about the reliability of auditory hypersensitivity within the acoustic trauma cohort.

      Regarding the relationship between the peripheral/cortical/perceptual auditory hypersensitivity we report here and the clinical conditions of tinnitus and hyperacusis, we revised the text such that the word “hyperacusis” only appears in the context of other publications and have added the following text (Pg. 14, Lns 448-466).

      “While the findings presented here support an association between sensorineural peripheral injury, excess cortical gain, and behavioral hypersensitivity, they should not be interpreted as providing strong evidence for these factors in clinical conditions such as tinnitus or hyperacusis. Our data have nothing to say about tinnitus one way or the other, simply because we never studied a behavior that would indicate phantom sound perception. If anything, one might expect that mice experiencing a chronic phantom sound corresponding in frequency to the region of steeply sloping hearing loss would instead exhibit an increase in false alarms on high-frequency detection blocks after acoustic trauma, but this was not something we observed. Hyperacusis describes a spectrum of aversive auditory qualities including increased perceived loudness of moderate intensity sounds, a decrease in loudness tolerance, discomfort, pain, and even fear of sounds (Pienkowski et al., 2014a). The affective components of hyperacusis are more challenging to index in animals, particularly using head-fixed behaviors, though progress is being made with active avoidance paradigms in freely moving animals (Manohar et al., 2017). Our noise-induced high-frequency sensorineural hearing loss and Go-NoGo operant detection behavior were not designed to model hyperacusis. Hearing loss is not strongly associated with hyperacusis, where many individuals have normal hearing or have a pattern of mild hearing loss that does not correspond to the frequency dependence of their auditory sensitivity (Sheldrake et al., 2015). While the excess central gain and behavioral hypersensitivity we describe here may be related to the sensory component of hyperacusis, this connection is tentative because it was elicited by acoustic trauma and because the detection behavior provides a measure of stimulus salience, but not the perceptual quality of loudness, per se.”

    1. Author Response

      Reviewer #1 (Public Review):

      The article by Solvi and colleagues aims to investigate what type and degree of information (either absolute, relative, or a weighted combination of both) is used by bumblebees when retrieving the value of an item. The authors reported recent evidence in humans and birds that suggest they seem to use a combination of absolute memories and remembering of subjective ranking, and an absence of relevant studies for other species, including invertebrates. Thus, the authors conducted four different experiments to study what type of information is guiding the decision of bumblebees when facing different qualitative and quantitative comparisons.

      In the first two experiments, the authors reported the use of relative ranking of stimuli instead of a memory of their absolute value. According to the authors, these results are confirmed by experiment three, where bees were presented with two equally-ranked choices which, in fact, were not treated as different by bees. In the last experiment, bumblebees showed a preference for the highest rank item.

      Despite the presentation of well-designed experiments, the conclusions that bumblebees are using only memories of ordinal comparisons, thus showing a different strategy with respect to humans and birds, seems to not be fully supported by the results. The behaviour on the first two experiments, for instance, could be explained by a recency effect, where the higher item of the last comparison is better retrieved (the work of Giurfa on transitive inferences in bees was not mentioned, though is relevant here). Furthermore, in the last experiment, bumblebees could not have used an ordinal ranking; their choice for the higher-ranking item could be based on its higher absolute quantitative value in terms of sucrose solution.

      We’re sorry for not being clearer in our descriptions in our original submission. In each of the first three experiments, the order of sessions in which the different pairs of sucrose concentrations were used was counterbalanced. For example, in experiment 1, half of the bees experienced 45 vs 30 first and 30 vs 20 second, and half of the bees experienced 30 vs 20 first and 45 vs 30 second. Further, our GLM results show that the order of training did not affect bees’ preferences. Therefore, a recency effect cannot explain the results of experiments 1 or 2 (or 3). We now highlight this on lines 101 - 105 in the Results and in each of the Experimental descriptions in the Methods, and explain that the GLMs showed no effect of these factors on lines 424 - 426 of Methods.

      With regard to experiment 6 (the last experiment in our original submission), there is no reason bees could not have used ordinal ranking. However, they also could have used absolute memories. We apologise for not making the rationale for and interpretation of experiment 6 clearer. The rationale was to determine how our results spoke to a situation that was more ecologically relevant for a bumblebee. In response to Reviewer #3’s concerns, we have now added new data from an experiment which also helps better explain both our rationale and our interpretation of experiment 6. We discuss this in more detail now in the revised manuscript. In short, bumblebees must use absolute properties, otherwise they would not be able to discriminate or rank any two sequentially visited flowers. However, our results suggest that they only retain (or only utilise memory of) absolute information for a short period of time (a few minutes). Despite this, experiment 6 suggests that in normal foraging situations, bees’ preferences for the highest rewarding flowers will not be affected. This is because, in the wild, bumblebees could commonly experience short time intervals (a few minutes) between flowers, which would allow them to compare each flower’s absolute information and encode ranking information. We discuss the new data on lines 204 - 233 and add clarification for experiment 6 on lines 249 - 261.

      The different behaviours and strategies used by bees here could be better explained by differences in the experimental task proposed, rather than supporting a general statement about the evolution of different strategies in comparison to other species.

      We hope that our explanations and clarifications to your above comments and to the other referees’ comments remedy this concern.

      Reviewer #2 (Public Review):

      This manuscript analyzes if bumblebees choose feeding options based on their absolute or relative remembered subjective value. The experiments relate to previous work done in starlings where comparable questions were raised (1). The design used in the four experiments presented is elegant and provides support for the conclusion that bees guide their choices by remembered ranking of feeders instead of focusing on their absolute rewards. Bees preferred the options that were ranked higher within each experimental context experienced, irrespective of the absolute reward they provided. As a consequence, they even preferred a sucrose solution of low concentration (15%) to one that was more profitable (30%), simply because the former was experienced together with a poorer alternative (10%) while the latter was experienced together with a more attractive alternative (45%) (Exp. 2). All four experiments provide results that are consistent with the hypothesis that contextual ranking is essential to determine the bees' choices.
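      The contrast between the two hypotheses can be made concrete with a small sketch. The concentrations are those reported above for Exp. 2; the option labels A to D, the context names, and the function names are hypothetical and serve only to illustrate the opposite predictions for the 30% versus 15% test.

```python
# Sucrose concentrations (% w/w) as described for Exp. 2.
# Labels A-D and the context names are illustrative assumptions.
CONTEXTS = {
    "rich": {"A": 45, "B": 30},
    "poor": {"C": 15, "D": 10},
}


def absolute_value(option):
    """Remembered value if bees stored the concentration itself."""
    for options in CONTEXTS.values():
        if option in options:
            return options[option]
    raise KeyError(option)


def remembered_rank(option):
    """Remembered value if bees stored only the within-context rank."""
    for options in CONTEXTS.values():
        if option in options:
            # 1 for the better option in its training context, 0 otherwise.
            return int(options[option] == max(options.values()))
    raise KeyError(option)


def predicted_choice(x, y, value):
    """Option predicted to be preferred, or None if the two are tied."""
    vx, vy = value(x), value(y)
    if vx == vy:
        return None
    return x if vx > vy else y
```

      With these definitions, `predicted_choice("B", "C", absolute_value)` favours the 30% option, whereas `predicted_choice("B", "C", remembered_rank)` favours the 15% option, which is the pattern the reviewer summarises.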

      Thank you for the kind and supportive words.

      Three main points require consideration to render this manuscript even more attractive than what it is already.

      1) The experiments involved in all cases four different colors and different sucrose concentrations (range: 5 - 45 % w/w). An essential requisite of these experiments is that bees should be able to discriminate the options provided, both in terms of color and in terms of reward quality. Asking about ranking or absolute value makes no sense if bees cannot distinguish, say, 15% from 10%, or yellow from orange, and so on. The authors are obviously aware of this point as they mention it explicitly (lines 267-269). Yet, although they mentioned that they verified this point, the only experimental proof available is provided in Fig. Suppl. 4, where a single comparison (from the many possible) was tested; the discrimination test provided involved blue and yellow, which were associated in a balanced way with the two highest sucrose concentrations used, 45% and 30%. In terms of color information, the choice involved the colors that were easy to distinguish (see their loci in the color hexagon). Yet, what about the other colors? Could they be equally well discriminated? Probably not, because some occupied very close loci in the hexagon. Admittedly, the tests B vs. C involved similar colors (yellow vs. orange) and bees showed significant preferences supporting the presence of color discrimination. Yet, no information is available for yellow and green and other color combinations assayed. Even more important would be to show that bees rank the different sucrose solutions differently, which is not clear in all cases. Concentrations were chosen following theoretical considerations based on Weber's law (2), but do bees really respond differently to them? Providing an experimental assessment of this question would be important.

      The perceptual distances between colours used in our experiments ranged from 0.141-0.333 hexagon units, which are intermediate to large colour distances that can be trained to high asymptotic levels of discrimination within 50-60 visits [Dyer and Chittka 2004, doi: 10.1007/s00359-003-0475-2]. Our training procedure involved 50 drinking experiences for each colour (100 drinking experiences in total), which is enough for clear discrimination by bees. Further, we adopted a counterbalanced paradigm, for which the results of each group are displayed in the presented figures. The visualisation of the individual data points indicates there was no significant difference between colour combinations, but more importantly, the GLM statistical analyses show that colour combinations had no effect on bees’ preferences (reported on lines 424 – 426). Finally, we have added data and a description of new experiments using orange and yellow flowers, which show that bumblebees are able to quickly and easily discriminate the orange and yellow colours (the shortest hexagon loci distances for the colours used) even when not learned side-by-side (new Figure 2). Importantly, we used sucrose concentration pairs with much greater differences compared to what bees are capable of discriminating behaviourally (bees can tell differences in sucrose concentration as low as 1.5% [Whitney et al., 2008, doi:10.1007/s00114-008-0393-9]) and neurophysiologically (Miriyala et al., 2018, doi: 10.1016/j.cub.2018.03.070). This is now discussed on lines 348 - 353.
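      For readers unfamiliar with hexagon units, the sketch below shows how a colour distance of this kind is conventionally obtained in the colour hexagon model: each stimulus is placed at a locus computed from the relative excitations of the UV, blue and green receptors, and the distance is the Euclidean distance between two loci. The function names are hypothetical, and no excitation values from the study are reproduced; any numbers passed in would be the user's own.

```python
import math


def hexagon_locus(e_uv, e_blue, e_green):
    """Colour hexagon coordinates from receptor excitations (each in 0..1)."""
    x = (math.sqrt(3) / 2) * (e_green - e_uv)
    y = e_blue - 0.5 * (e_uv + e_green)
    return x, y


def hexagon_distance(stimulus_a, stimulus_b):
    """Euclidean distance between two loci, in hexagon units.

    Each stimulus is an (e_uv, e_blue, e_green) tuple of excitations.
    """
    return math.dist(hexagon_locus(*stimulus_a), hexagon_locus(*stimulus_b))
```

      A stimulus that excites all three receptors equally sits at the origin, so its distance to any other stimulus is simply that stimulus's distance from the centre of the hexagon.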

      2) Figures B, D, and F are of fundamental importance to draw conclusions about the strategy used by the bees in the three adjacent experiments. Yet, the kind of representation chosen by the authors does not help to follow their conclusions. Firstly, it is not clear what the data points represent. If, for instance, in Fig. 1B, 40 bees were tested (line 247), how many bees per combination were tested (only one combination is mentioned in line 249)? Moreover, given that bees were tested with B vs. C, and if I guess correctly, there are ca. 10 data points per combination, what do these 10 proportions represent? How were these values computed? I could not find this information in the Methods section. No description of the test methodology is provided for Experiments 1 to 3. Moreover, data points in Figs B, D, F are barely visible and appear clustered around 50% in several cases, thus casting doubt on the reported significance of the comparisons. This needs to be improved by means of visible and clear graphic displays. The same kind of consideration can be applied to Fig. 2B, even if the results are clearer.

      We apologise for the lack of clarity and missing information. We now note in the figure legend that each filled circle represents the proportion of choices for a particular option by an individual bumblebee (10 individuals per group). We have now added more description on the test methodology and analyses for experiments 1-3 on lines 101 – 105, 126-127, 380 - 384. We have also increased the size of the individual data points in each of the figures for better clarity.

      3) A final point relates to results obtained in a different experimental framework but which asks whether animals can rank and order in transitive terms experienced alternatives. A considerable amount of work in the field of experimental psychology has addressed the question of transitive inferences in many species (3-12), and even in bees and wasps (13, 14). In these studies, animals are trained with premise pairs presenting different reinforcement outcomes (e.g. A+ B-/B+ C-/C+ D-/D+ E-) to determine if they establish relative rankings (A ˃ B ˃ C ˃ D ˃ E), or on the contrary use associative learning of absolute reinforcement outcomes (in which case, A ˃ B = C = D ˃ E). To determine the strategy followed by animals, they are tested with a non-overlapping pair never experienced during the training (B vs. D). In the first case, animals prefer B to D while in the second case they choose equally between both options. There might be, therefore, some parallels or contact points between these experiments and the experiments reported in this manuscript. Could the authors discuss these parallels and provide a broader view of absolute vs. relative remembered subjective value?

      Thank you for this suggestion. We believe that tests for transitive inference only resemble our methods superficially. However, we do feel that because of this we should provide a brief explanation as to how and why they are fundamentally different. We have now included a paragraph in the Introduction on lines 73 - 89.

      Reviewer #3 (Public Review):

      The central conclusion of this beautiful experimental study is that bumblebees prefer flowers on the basis of their remembered ranking in their context, but are insensitive to their absolute properties. Thus, let's say that there are 4 flower types, ranked as follows in nectar concentration: A>B>C>D. However, when the bee learns about these flowers, it does so in either of two 'contexts', populated as follows: A & B, or C & D. Thus, the bee experiences that B is the worse option in the context in which it is found, and C is the better one in its own context. If, at a later time, the bee has to make a novel choice, this time between B and C, its memory for ranking leads it to prefer C over B, while its (putative) memory for nectar concentration should favour B over C. The authors find, in a variety of different treatments, evidence for the influence of ranking, but they do not find any evidence for sensitivity to absolute properties (i.e., concentration).

      Thank you for the complimentary sentiments.

      One difficulty that permeates the argument is the ubiquitous difficulty in proving the null hypothesis as true: lack of significant evidence for a putative effect in one or a few experiments, does not mean reliable absence of the effect.

      We appreciate this thought, and we hope that the additional experiments on absolute information usage, together with the original experiments, might collectively form a clearer argument here.

      Another difficulty is that in my view memory for absolute properties was not given a full chance: bees were always trained in situations where both dimensions (concentration and ranking) were present. In such situations, they preferentially used ranking. However, to learn ranking between flower types in sequential encounters, they must remember the absolute properties, so that in each encounter they contrast the present flower with the memory for others. Say the bee encounters a type B flower. How does it store its ranking if it doesn't remember the properties of A at all? To take this objection into account and still maintain the claim, it is necessary to say that it remembers the properties of A when in the A & B context, but it erases it from memory when in the context B & C.

      Neglecting memory for concentration may be an overshadowing effect. Overshadowing is known in learning studies, and it means that, when more than one cue is paired with an outcome, the most salient between them may reduce learning about the predicting power of the other. In this case, bees may remember and use concentration when trained in contexts where there is only flower type, so that there is no chance of using ranking, and then offered choices between pairs of them. In this case, the bees would not have access to ranking, so that there would be a stronger opportunity for absolute memory to manifest itself.

      Thank you for these suggestions. We apologise for not making our arguments clearer in our first submission. Yes, absolute information must be used at some level, otherwise no ranking, or even discrimination, could take place. Our claim is that while bumblebees must detect and use absolute information to compare flowers, our experiments show they do not retain (or utilise memory of) this information for very long (i.e. for much longer than several minutes). We now make this clearer throughout the manuscript, e.g. on lines 31 - 36 and 204 - 233. We have also added new experiments which show that when bees were trained with each of two flowers alone and then tested together, absolute information can be used if visits to the different flowers are separated by only a few minutes, but not if separated by an hour. This suggests that absolute information is retained and used by bumblebees as short-, but not mid-term memories (Menzel 2001), in order to make comparisons of options and rank them. Please see lines 385 - 401 for a detailed description of these experiments, as well as Figure 2.

      In experiment 4, during training, they could move between two zones representing the 'contexts', each with 2 flower types, and they were then given choices between the 4 types, rather than just binary choices as previously. In this case, the bees did prefer the top-quality flower type (type A), which is consistent with memory for absolute concentration and with ranking, because A offered the highest concentration of the 4-type context. Why this happened is not clear, but it indicates that the context of choice may be crucial. It is known from other studies that the number of options at the time of choice can be very influential. For instance, in one study, it was shown that starlings appeared to be risk prone when offered a binary choice and risk averse when offered a trinary choice, even if the choices were all intermingled in the same sessions. In any case, this experiment raises doubts as to the claimed insensitivity to memory for nectar concentration. Another possibility is that the separation between contexts in this experiment (a partially avoidable wall) was not extreme as in the previous ones, so that the bees could now establish a ranking among the 4 types because they were all encountered intermingled to an extent.

      Again, we apologise for the lack of clarity in our original argument. We have made clear in the manuscript that bumblebees are not insensitive to absolute metrics, and indeed require them to distinguish between sequentially visited flowers. We have also added new data and descriptions of experiments which we believe help set the stage for and help interpret the results of experiment 6 (previously experiment 4). The newly added experiments (experiments 4 and 5) show that when bees learn flowers in isolation, and therefore have no ranking information available, they still do not retain or utilise memory of absolute information in a new context, unless the temporal separation between flower experiences is short (a few minutes). The results of experiment 6 (previously experiment 4) essentially help show that in a more ecologically realistic scenario (what bees normally experience in the wild), the time between flower visits is short enough that absolute information can be compared and used to rank flowers. We now explain this better on lines 204 – 233 and 249 - 260.

      There is one potential mechanism that may also be discussed. It is known from other species, that state at the time of learning influences subjective value of alternatives. To explain this effect I will exemplify the problem with a non-eusocial consumer. Say that food sources B and C are of equal caloric value. Say, further, that B is encountered when the subject is less food deprived than when it encounters C. Then the hedonic (conditioning) power of B will be lower, because it causes a smaller improvement in fitness (this was Daniel Bernoulli's argument regarding the concept of utility). In animal studies this effect is called State-Dependent Valuation Learning (SDVL). Since in the present experiments the context A & B was richer than the context C & D, the bees would have been in a consequently more favourable state (maybe carrying bigger sugar loads), so that each encounter with B would cause a smaller improvement than each encounter with C. This effect is totally different from remembering the ranking of flower types. The two alternative explanations for preference of C over B (ranking and SDVL) can, fortunately, be confronted because it is possible to change the state of the bees by a common 3rd source that could be used to equate or manipulate the average richness of the contexts.

      Thanks for this suggestion. Although we believe SDVL cannot account for bumblebees’ behaviours as well as ordinal ranking, we agree that it would be valuable to discuss SDVL with reference to some of the literature on the subject. We have now added description of SDVL to Table 1 on lines 156 – 162 and added a paragraph in the Discussion on lines 283 – 290.

      All the reasons mentioned above should make it clear that this reviewer finds the study of very great interest and much merit, but considers that the conclusion for exclusive impact of ranking on preference should be tempered, or at least defended more strongly against these doubts.

      Thanks again for the valuable and positive critique.

    1. Author Response

      Reviewer #3 (Public Review):

      In this ms Li et al. examine the molecular interaction of Rabphilin 3A with the SNARE complex protein SNAP25 and its potential impact in SNARE complex assembly and dense core vesicle fusion.

      Overall, the literature on rabphilin as a major rab3/27 effector in synaptic function has been quite enigmatic. After its cloning and initial biochemical analysis, rather little new has been found about rabphilin, in particular since loss of function analysis has shown few synaptic phenotypes (Schluter 1999, Deak 2006), arguing against a crucial role for rabphilin in synaptic function.

      While the interaction of rabphilin with SNAP25 via the bottom part of its C2 domain has already been described biochemically and structurally in Deak et al. 2006, and others, the authors make significant efforts to further map the interactions between SNAP25 and rabphilin and indeed identified additional binding motifs in the first 10 amino acids of SNAP25 that appear critical for the rabphilin interaction.

      Using KD-rescue experiments for SNAP25, TIRF-based imaging analysis of labeled dense core vesicles showed that the N-terminus of SN25 is absolutely essential for SV membrane proximity and release. Similar, somewhat weaker phenotypes were observed when binding-deficient rabphilin mutants were overexpressed in PC12 cells coexpressing WT rabphilin. The loss of function phenotypes in the SN25 and rabphilin interaction mutants led the authors to claim that rabphilin-SN25 interactions are critical for docking and exocytosis. The role of these interaction sites was subsequently tested in SNARE assembly assays, which were largely supportive of rabphilin accelerating SNARE assembly in a SN25 N-terminal-dependent way.

      Regarding the impact of this work, the transition of synaptic vesicles to form fusion-competent trans-SNARE complexes is critical to our understanding of regulated vesicle exocytosis, and the authors put forward an attractive model in which rabphilin aids in catalyzing SNARE complex assembly by controlling the α-helicity of the SNAP25 SNARE motif. This would provide a regulatory mechanism similar to that put forward for the other two SNARE proteins via their interactions with Munc18 and intersectin, respectively.

      We thank Reviewer #3 for the summary of the paper and for the praise of our work. The point-by-point replies are as follows:

      While discovery of the novel interaction site of rabphilin with the N-terminus of SNAP25 is interesting, I have issues with the functional experiments. The paper hinges on whether it provides convincing data on the functional role of the interactions, given the history of loss of function phenotypes for Rabphilin. First, the authors use PC12 cells and dense core vesicle docking and fusion assays. Primary neurons, where rabphilin function has been tested before, have unfortunately not been utilized, reducing the impact of the docking and fusion phenotype.

      We have discussed these questions as mentioned in our response to Essential Revisions 3 and added this corresponding passage to the Discussion section (pp.18-19, lines 407-427).

      In particular, the loss of function phenotype in figure 3 of the N-terminally deleted SNAP25 in docking and fusion is profound, and at a similar level to the complete loss of the SNARE protein itself. This is of concern as this is in stark contrast to the phenotype of rabphilin loss in mammalian neurons, where the phenotype of SNAP25 loss is very severe while rabphilin loss has almost no effect on secretion. This would argue that the N-terminus of SNAP25 has other critical functions besides interacting with rabphilin. In addition, it could argue that the N-terminal SNAP25 deletion mutant may be made in the cell (as indicated from the western blot) but may not be properly trafficked to the site of release.

      To test whether the N-peptide deletion mutant of SN25 can properly target to the plasma membrane, we overexpressed SN25 FL or SN25 (11–206) with a C-terminal EGFP tag in PC12 cells and monitored the localization of SN25 FL-EGFP and SN25 (11–206)-EGFP near the plasma membrane by TIRF microscopy. We observed that the average fluorescence intensity of SN25 (11–206)-EGFP did not differ significantly from that of SN25 FL-EGFP, as shown below, suggesting that the N-peptide deletion may not influence the trafficking of SN25 to the plasma membrane.

      (A) TIRF imaging assay to monitor the localization of SN25-EGFP near the plasma membrane. Overexpression of SN25 FL-EGFP (left) and SN25 (11–206)-EGFP (right) using pEGFP-N3 vector in PC12 cells. Scale bars, 10 μm. (B) Quantification of the average fluorescence intensity of SN25-EGFP near the plasma membrane in (A). Data are presented as mean ± SEM (n ≥ 10 cells in each). Statistical significance and P values were determined by Student’s t-test. ns, not significant.
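      The summary statistics named in the legend above (mean ± SEM and Student’s t-test) can be sketched as follows. This is only an illustration of the standard formulas, assuming an unpaired, equal-variance test; the function names are hypothetical and no intensity values from the study are reproduced.

```python
import math
from statistics import mean, stdev


def mean_sem(values):
    """Return (mean, standard error of the mean) for one group of cells."""
    return mean(values), stdev(values) / math.sqrt(len(values))


def student_t(group_a, group_b):
    """Unpaired Student's t statistic (pooled variance) and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance assumes both groups share the same population variance.
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2
                  + (n_b - 1) * stdev(group_b) ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, df
```

      The t statistic and degrees of freedom would then be compared against the t distribution to obtain the P value reported as "ns" or otherwise.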

    1. Author Response

      Reviewer #1 (Public Review):

      This work focuses on the mechanisms that underlie a previous observation by the authors that the type VI secretion system (T6SS) of a Pseudomonas chlororaphis (Pchl) strain can induce sporulation in Bacillus subtilis (Bsub). The authors bioinformatically characterize the T6SS system in Pchl and identify all the core components of the T6SS, as well as 8 putative effectors and their domain structures. They then show that the Pchl T6SS, and in particular its effector Tse1, is necessary to induce sporulation in Bsub. They demonstrate that Tse1 has peptidoglycan hydrolase activity and causes cell wall and cell membrane defects in Bsub. Finally, the authors also study the signaling pathway in Bsub that leads to the induction of sporulation, and their data suggest that cell wall damage may lead to the degradation of the anti-sigma factor RsiW, leading to activation of the extracellular sigma factor σW that causes increased levels of ppGpp. Sensing of high ppGpp levels by the kinases KinA and KinB may lead to phosphorylation of Spo0F, and induction of the sporulation cascade.

      The findings add to the field's understanding of how competitive bacterial interactions work mechanistically and provide a detailed example of how bacteria may antagonize their neighbors, how this antagonism may be sensed, and the resulting defensive measures initiated.

      While several of the conclusions of this paper are supported by the data, additional controls would bolster some aspects of the data, and some of the final interpretations are not substantiated by the current data.

      • The Bsub signaling pathway that is proposed is intricate and extensive as shown in Fig 5A. However, the data supporting that is very sparse:

      a) The authors show no data showing that the proteases PrsW and/or RasP, or the extracellular sigma factor σW are necessary, or that the cleavage of RsiW is needed, for induction of sporulation - this could presumably be tested using mutants of those genes.

      It has been previously demonstrated that the proteases PrsW and/or RasP cleave RsiW under certain conditions such as alkaline shock (Heinrich et al., 2009). First, PrsW cleaves RsiW, and the resulting cleaved RsiW serves as a substrate for RasP. In the previous version of the manuscript, we already demonstrated that treatment with Tse1 causes damage to PG and delocalization of RsiW; however, as the reviewer comments, we did not show the participation of any of these proteases in the proposed signaling pathway. We have now generated single mutants in rsiW and prsW and treated them with Tse1. We have observed no variation in the levels of sporulation compared to untreated strains (Figure 1), a finding consistent with their suggested involvement in the sporulation signaling pathway activated by Tse1. Positive controls, that is, the single mutants grown at 37ºC, were still able to sporulate. These data have been added to Figure 6B in the new version of the manuscript.

      As suggested by other reviewers, we have generated a sister plot of this figure showing the raw CFUs in each case. These data are included in Supplementary file 3. This experiment and the related figure have been incorporated into the new version of the manuscript.

      Figure 1. A) Quantification of the percentage of sporulated Bsub, rsiW and prsW cells after treatment with purified Tse1 showing that rsiW and prsW single mutants are blind to the presence of Tse1. B) Cell density (CFUs/mL) of total (blue bars) and sporulated population (brown bars) of different Bacillus strains (Bsub, ∆rsiW and ∆prsW) untreated and treated with Tse1. Sporulation at 37ºC is shown as positive control in each strain. Statistical significance was assessed via t-tests. p value < 0.1, p value < 0.001, **p value < 0.0001.

      b) Similarly, they don't demonstrate that the levels of ppGpp increase in the cell upon exposure to Pchl.

      We have not been able to measure the levels of ppGpp; however, given that in the same proposed sporulation cascade the levels of different nucleotides are altered (Kriel et al., 2013, Tojo et al., 2013, López and Kolter, 2010), we have alternatively analyzed the levels of ATP using an ATP Determination Kit (Thermo, A22066). We have found that ATP levels increased by 3-fold in Bsub cells treated with Tse1 compared to untreated control cells. Consistently, no increase in ATP levels was observed in rsiW or prsW mutants treated with Tse1. We have incorporated all the raw luminescence data obtained for each sample and treatment in Figure 6-source data 1. This experiment, the figure (Figure 6A in the new version of the manuscript) and the description in “Materials and Methods” have been added to the new version of the manuscript.

      c) There is some data showing that kinA and kinB mutants don't induce sporulation (Fig supplement 7A), but that is lacking the 'no attacker' control that would demonstrate an induction.

      We have included in the new version of the manuscript the ‘no attacker’ control sporulation (%). The figure shows that the presence of Pchl strains induces the sporulation of all kinase mutants. This new data has been incorporated in Figure 6-figure supplement 1A in the new version of the manuscript.

      d) There is some data showing that RsiW may be cleaved (Fig 5C, D), but that data would benefit from a positive control showing that the lack of YFP foci is seen in a condition where RsiW is known to be cleaved, as well as from a time-course showing that the foci are present prior to the addition of Tse1, and then disappear. As it is shown now, it is possible that the addition of Tse1 just blocks the production of RsiW or its insertion into the membrane (especially given the membrane damage seen). Further, there is no data that the disappearance of the YFP loci requires the proteases PrsW and /or RasP - such data would also support the idea that the disappearance is due to cleavage of RsiW.

      Thank you for your useful suggestion. It is important to consider that, in our transcriptomics analysis, we did not see repression of the genes encoding either of the two proteases in cells treated with Tse1. However, we agree that additional experiments would enhance the significance of our findings. We have repeated the whole experiment including a positive control to demonstrate that YFP foci disappear in a condition in which RsiW is known to be degraded by PrsW and RasP. Bacillus cells were incubated in medium at pH 10, which provokes an alkaline shock that triggers RsiW cleavage (Asai, 2017; Heinrich et al., 2009). As shown in Figure 6D, under this condition we also observed the disappearance of YFP foci. We have also provided extra images with quantification of the average signal from YFP foci in Figure 6-figure supplement 2.

      • The entire manuscript suggests that T6SS is solely responsible for the induction of sporulation. While T6SS does appear to play a major part in explaining the sporulation induction seen, in the absence of 'no attacker' controls for Fig. 2A, it is impossible to see this. From the data shown in Fig. 2C, and figure supplement 2A, the 'no attacker' sporulation rate seems to be ~20%, while the rate is ~40% with Pchl strains lacking T6SS, suggesting that an additional factor may be playing a role.

      This must be a misunderstanding of the message of this manuscript. The conceptual foundation of this study was laid in our previous manuscript (Molina-Santiago et al., 2019). We demonstrated that B. subtilis sporulated in the presence of P. chlororaphis. Interestingly, the overgrowth of P. chlororaphis over the B. subtilis colony did not eliminate B. subtilis cells, given that most of them had sporulated. The data we obtained strongly suggested that a functional T6SS was involved in the cellular response of Bacillus during close cell-to-cell contact. In this new manuscript, we have explored this idea and found that the T6SS of P. chlororaphis indeed mobilizes at least one effector, Tse1, which is able to trigger sporulation in Bacillus. Thus, we did not conclude then, nor do we conclude in this new study, that T6SS is the only factor expressed by P. chlororaphis responsible for activating sporulation in Bacillus. We have accordingly rephrased some sentences of the manuscript to clarify the proposed involvement of T6SS in B. subtilis sporulation.

      In addition, as mentioned above, we have included data of sporulation percentages in the absence of an attacker to better compare the induction of sporulation observed in the presence of the different Pchl strains and in the presence of Tse1.

      Reviewer #2 (Public Review):

      In a previous study, the authors showed that cell-cell contact with Pseudomonas chlororaphis induces sporulation in Bacillus subtilis. Here, the authors build on this finding and elucidate the mechanism behind this observation. They describe the enzymatic activity of a protein (Tse1) secreted by the type VI secretion system (T6SS) of P. chlororaphis (Pch), which partially degrades the peptidoglycan (PG) of targeted B. subtilis cells and triggers a signal cascade culminating in sporulation.

      Most of the key conclusions of this paper (Tse1 being secreted by the T6SS and inducing sporulation in targeted cells) are well supported by the data. One conclusion (sporulation response being an anti-T6SS "defense" strategy) is not well supported by the data and should be removed or rephrased.

      The authors elucidate the enzymatic activity of Tse1, a T6SS effector protein, in a genus (Pseudomonas) of great interest to microbiologists, and to researchers studying the T6SS specifically. They also carefully dissect the cellular response (signal cascade and sporulation) of an important model organism (B. subtilis; Bsub) specifically to exposure to Tse1. The results describing this cellular response contribute substantially to our understanding of how T6SS effector proteins interact with cells of Gram-positive species.

      My only major concerns regard the interpretation of these results as sporulation being an adaptive and/or specific response to attacks by the T6SS. I outline my reasoning below.

      • Interpretation of sporulation as a "defense" mechanism/strategy against the T6SS. In order for a phenotype X to be regarded as a "defense against Y" mechanism, it has to be shown that phenotype X (sporulation in response to Tse1) evolved - at least in part - for the purposes of increasing survival in the presence of Y (T6SS attacker). There are no experiments in this study comparing e.g. a sporulating Bsub with a non-sporulating Bsub, that would allow testing if sporulation increases survival. The experiments carefully describe the cellular response to Tse1, but no inference can be made with regards to this being adaptive for Bsub, or if it helps the cells survive against T6SS attacks, etc. A more parsimonious explanation would be that Tse1 happens to target the PG and causes envelope stress, triggering sporulation. So, it would be a general stress response that also happens to be triggered by T6SS. Now, some general (cell envelope) stress responses are known to be very effective at protecting against the T6SS. But in those instances, a beneficial effect for survival in the face of T6SS attacks has been shown in dedicated experiments. Purely observing a response to a T6SS effector, as this study does (very well), is not evidence that the response has evolved for the purpose of surviving T6SS attacks. Tucked away in the supplement (and briefly mentioned in the main text) is data on Bsub and Bacillus cereus, showing that i) cell densities of the sporulating Bsub and a sporulating B. cereus strain are not affected by an active T6SS, and ii) cell densities of an asporogenic B. cereus are slightly reduced by an active T6SS. However, the effect sizes of density reduction by the T6SS in the asporogenic B. cereus are minute (20x10^6 vs. ~50x10^6). In typical killing assays against e.g. gram-negative strains, a typical effect size for T6SS killing would be a several order of magnitude reduction in survival of the target strain when exposed to a T6SS attacker. 
      Based on this dataset alone (Figure Suppl. 8), I would say that all three Bacillus strains are not experiencing any "fitness-relevant" killing by the T6SS, which is in line with the T6SS often being useless against gram-positives when it comes to killing. Hence, no claims about fitness benefits of sporulation in response to a T6SS attack, or this being a "defense mechanism/strategy" should be made in the manuscript.

      Thanks for these interesting introductory and specific comments. We agree with the reviewer and have rephrased some sentences of the manuscript. Sporulation is not an adaptive or specific response of Bacillus to T6SS; indeed, as stated by reviewer 2, sporulation is a general stress response. The way the manuscript was written may, at some points, have given the wrong impression, and we have rephrased those sentences accordingly. However, we made a mistake when generating Figure supplement 8 (Figure 6-figure supplement 3 in the new version of the manuscript). We have repeated this experiment and generated a new, corrected chart that shows a three orders of magnitude reduction in survival of the asporogenic B. cereus strain in competition with Pchl mutant strains compared to Pchl WT strain. These new findings show that the absence of sporulation ability leads to a severe reduction in survival of the Bacillus cereus DSM 2302 population in competition with Pchl with an active T6SS compared to its survival in competition with the Pchl hcp mutant. This figure also shows that the Bacillus population decreased in competition with the tse1 mutant, demonstrating that Tse1 is responsible for killing Bacillus. However, there is a statistically significant difference in the survival of Bacillus competing with the hcp or tse1 mutants. The increased survival of Bacillus in the interaction with the tse1 strain compared to the Bacillus-hcp competition is suggestive of the ability of this strain to deliver additional T6SS-dependent toxins. This observation is in accordance with the data presented in Fig. 2B, which indicated that the tse1 mutant has an active T6SS able to kill E. coli.

      • Data supporting baseline "no competitor" sporulation rates being no different from those triggered by T6SS mutants is not convincing. For the data shown in Fig. 2A, a key comparison here would be to show baseline Bsub sporulation rates in absence of a competitor. This measurement is shown in Fig supplement 2A, and the value shown there (roughly 22% on average) appears to be much lower than the average T6SS mutant shown in Fig. 2A. The main text states that sporulation rates induced by the different T6SS mutants are "statistically" similar to the no-competitor baseline (L206/207). I am not convinced by this, since i) overall sporulation rates (incl of WT Pch) appear to have been lower in the experiment shown in supplement 2A, so a direct comparison between the no-competitor baseline and the data shown in Fig. 2A is not possible; and ii) hcp and tse1 mutants were tested in different experiments throughout the study, and sporulation rates appear to consistently hover around 30-40%, which is higher than the roughly 22% for "no competitor" depicted in Supplement Fig2A. I am focussing on this, because for the interpretation of the results, and the main narrative of the paper, knowing if "simply interacting with a T6SS-negative P. chlororaphis" induces some sporulation would make a big difference. One sentence in the discussion adds to my confusion about this: L464/465, "... a strain lacking paar (Δpaar) had an active T6SS that triggered sporulation comparably to Δhcp, ΔtssA, and Δtse1 strains", suggesting that the authors claim that even strains lacking active T6SS trigger increased sporulation (which I would agree with, based on the data).

      We understand the reviewer's comment that a direct comparison between the two figures is not correct due to fluctuations of the baseline sporulation rates between experiments. To solve this issue, we have added the baseline "no competitor" sporulation percentages in the experiments represented in Figure 2B in the new version of the manuscript.

      Regarding the sporulation provoked by a T6SS-negative P. chlororaphis, the reviewer is right. Bacillus sporulation occurs in response to many external factors (abiotic and biotic stresses), so the presence of P. chlororaphis in the competition already has an effect on the sporulation percentage of B. subtilis. Accordingly, we have removed the statement that the sporulation rates induced by the different T6SS mutants are "statistically" similar to the no-competitor baseline. However, our previous data (Molina-Santiago, Nat Comm 2019) and current findings convincingly demonstrate the relevance of the T6SS and, specifically, the Tse1 toxin in the induction of sporulation, at least during close cell-to-cell contact.

      • Claim regarding "bacteriolytic activity" when tse1 is heterologously expressed in E. coli. The data supporting this claim (Fig2-supplement 2C) only shows a lower net population growth rate after induction of tse1 (truncated vs. non-truncated) expression. This could be caused by: slower growth (but no death), equal growth (with some death), or a combination of the two. The claim of "bacteriolytic" activity in E. coli is therefore not supported by this dataset.

      We agree with the reviewer and we have decided to remove this figure and the experiment of “bacteriolytic activity” given that it does not contribute conceptually to the message of the manuscript.

      I cannot comment in more detail on the validity of the biochemistry/enzymatic activity assays as these are not my area of expertise.

      Reviewer #3 (Public Review):

      The authors identify tse1, a gene located in the type 6 secretion system (T6SS) locus of the bacterium Pseudomonas chlororaphis, as necessary and sufficient for induction of Bacillus subtilis sporulation. The authors demonstrate that Tse1 is a hydrolase that targets peptidoglycan in the bacterial cell wall, triggering activation of the regulatory sigma factor sigma-w. The sporulation-inducing effects of sigma-w are dependent on the downstream presence of the sensor histidine kinases KinA and KinB. Overall, this is a well-structured paper that uses a combination of methods including bacterial genetics, HPLC, microscopy, and immunohistochemistry to elucidate the mechanism of action of Tse1 against B. subtilis peptidoglycan. There are some concerns regarding a few experimental controls that were not included/discussed and (in a few figures) the visual representation of the data could be improved. The structure of the manuscript and experiments is such that key questions are addressed in a logical flow that demonstrates the mechanisms described by the authors.

      To begin, we have concerns regarding the sporulation assays and their results. The data should be presented as "Percent sporulation" or "Sporulation (%)" - not as a "sporulation rate": there is no kinetic element to any of these measurements, so no rate is being measured (be careful of this in the text as well, for instance near line 204). More importantly, there is no data provided to indicate that changes in percent spores are not instead just the death of non-sporulated cells. For example, imagine that within a population of B. subtilis cells, 85% of the cells are vegetative and 15% are spores. If, upon exposure to Tse1, a large proportion of the vegetative cells are killed (say, 80% of them), this could lead to an apparent increase in sporulation: from 15% for the untreated population to ~50% for the treated population, but the difference would be entirely due to a change in the vegetative population, not due to a change in sporulation. The authors need to clearly describe how they conducted their sporulation assays (currently there is no information about this in the methods) as well as provide the raw data of the counts of vegetative cells for their assays to eliminate this concern.
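
      The reviewer's thought experiment can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the population of 100 cells are assumptions made for this example, not anything taken from the manuscript.

```python
def apparent_sporulation_percent(vegetative, spores, vegetative_killed_fraction):
    """Percent of surviving cells that are spores after some vegetative cells die.

    The number of spores is held constant, so any change in the returned
    percentage comes only from the loss of vegetative cells.
    """
    surviving_vegetative = vegetative * (1.0 - vegetative_killed_fraction)
    return 100.0 * spores / (surviving_vegetative + spores)


# Hypothetical population of 100 cells: 85 vegetative, 15 spores.
untreated = apparent_sporulation_percent(85, 15, 0.0)  # no killing
treated = apparent_sporulation_percent(85, 15, 0.8)    # 80% of vegetative cells killed

print(f"untreated: {untreated:.1f}%  treated: {treated:.1f}%")
```

      With 80% of the vegetative cells removed, 15 spores out of roughly 32 surviving cells gives about 47%, close to the ~50% quoted by the reviewer, even though no additional cell has sporulated.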

      Thanks for the suggestion. We have replaced “sporulation rate” with “sporulation (%)” or “sporulation percentage” in all titles and data presentations. As also suggested by reviewer 2, we have included the raw CFU counts of the total population and sporulated cells to show that there is no substantial change in the rate of death. We have also added a section in Materials and Methods specifying how the sporulation assays were done. Quoted text:

      “Sporulation assays

      Spots of bacteria were resuspended in 1 mL sterile distilled water. Then, serial dilutions were made and cultured in LB solid media for vegetative cells CFU counts. The same serial dilutions were further heated at 80ºC for 10 minutes to kill vegetative cells and immediately cultured again in LB solid media. Plates were grown overnight at 28 ºC and the resulting colonies were counted to calculate the percentage of Bsub sporulation (%). A list of raw CFUs (total and spore population) from all figures with sporulation percentage is shown in Supplementary file 3.”
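
      As a rough illustration of the calculation described in the quoted methods, the sketch below converts colony counts into CFU/mL and then into a sporulation percentage. It assumes that the unheated plating gives the total population and the heat-treated plating gives the spore population, as the reference to "total and spore population" suggests. The colony counts, dilution factor and plated volume are hypothetical values chosen for the example.

```python
def cfu_per_ml(colonies, dilution_factor, plated_volume_ml):
    """Convert a plate count into CFU/mL of the original 1 mL suspension."""
    if plated_volume_ml <= 0:
        raise ValueError("plated volume must be positive")
    return colonies * dilution_factor / plated_volume_ml


def sporulation_percent(total_cfu_per_ml, spore_cfu_per_ml):
    """Heat-resistant (spore) CFUs as a percentage of total CFUs."""
    if total_cfu_per_ml <= 0:
        raise ValueError("total CFU/mL must be positive")
    return 100.0 * spore_cfu_per_ml / total_cfu_per_ml


# Hypothetical counts from the same 10^5-fold dilution, 0.1 mL plated.
total = cfu_per_ml(colonies=120, dilution_factor=1e5, plated_volume_ml=0.1)
spores = cfu_per_ml(colonies=36, dilution_factor=1e5, plated_volume_ml=0.1)  # after 80ºC, 10 min

print(f"total: {total:.2e} CFU/mL, spores: {spores:.2e} CFU/mL")
print(f"sporulation: {sporulation_percent(total, spores):.1f}%")
```

      Reporting both CFU/mL values alongside the percentage is what allows a reader to tell a real increase in spores apart from a loss of vegetative cells.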

      A related concern is regarding the analysis of the kinases and the effects of their deletions on the impact of Tse1. Previous literature shows that the basal levels of sporulation in a B. subtilis kinA or a kinB mutant are severely defective relative to a wild-type strain; these mutants sporulate poorly on their own. Therefore, the data presented on Lines 394+ and the associated Supplemental Figure regarding the sporulation defects of these two mutants are not compelling for showing that these kinases are required for this effector to act. It is likely that simply missing these kinases would severely impact the ability of these strains to sporulate at all, irrespective of the presence of Tse1, and no discussion of this confounding concern is discussed.

      Previous literature shows that mutation of kinases affects sporulation of B. subtilis. The histidine kinases KinA and KinB are primarily responsible for initiating the sporulation cascade through phosphorylation of Spo0F. However, as shown in Figure 6-figure supplement 1A, single mutants in these kinases (ΔkinA, ΔkinB) still sporulate, given that the phosphorylation cascade is controlled by numerous intermediaries and other histidine kinases that form a multicomponent phosphorelay (KinA-E). In this context, the sporulation of B. subtilis can also be triggered by KinC or KinD in the absence of KinA or KinB, as KinC/KinD can act directly on the master regulator of sporulation Spo0A (Burbulys et al., 1991; Wang et al., 2017).

      In addition, as suggested by reviewer 1, we have added to Figure 6-figure supplement 1A of the new version of the manuscript, the sporulation percentage 'no competitor' control of each kinase mutant and B. subtilis WT. The results show that, as commented by the reviewer and also supported by literature, these mutants sporulate poorly on their own in the absence of an attacker (none). However, as shown in the figure, all kinase mutants increase the sporulation percentage in the presence of a competitor.

      Another concern is regarding the statistical tests used in Figure 2. For statistical tests in A, B, and D, it should be stated whether a post-test was used to correct for multiple comparisons, and, if so, which post-test was used. For C, we suggest the inclusion of a mock control in addition to the two conditions already included (i.e., an extraction from an E. coli strain expressing the empty vector) to provide a stronger control comparison.

      We have clarified the statistical tests used in Figure 2. Briefly, we have used one-way ANOVA followed by the Dunnett test in Figure 2A, B and D for the statistical analysis of the sporulation percentage of Bsub in competition with Pchl as control group. In relation to Figure 2C, it is not possible to add a mock control with a strain carrying the empty vector, because this is a suicide plasmid (pDEST17) unable to replicate in E. coli without chromosome integration.

      An additional concern regarding controls is that there is an absence of loading controls for the immunoblot assays. In Figure 5D and all immunoblot assays, there is no mention of a loading control, which is a critical control that should be included.

      In the previous version of the manuscript, we already included a loading control for Figure 5D in Figure supplement 7B, both for cell and for supernatant fractions. In the new version of the manuscript, the loading control of Figure 6E (in the previous version of the manuscript Figure 5D) is shown in Figure 6-figure supplement 2C. We have also included the original unedited gels and blot (Figure 6-figure supplement 2-source data 1 and Figure 6-figure supplement 2-source data 2).

      Some of the visualizations could be improved to help the reader understand and appropriately interpret the data presented. For instance, in Figures 3 and 4 the scale bars are different across each of the Figure's imaging panels. These should be scaled consistently for better comparison. Additionally, the red false colorization makes the printed images difficult to see. Black-and-white would be easier to see and would not subtract from the images.

      The reviewer is right. Scale bars equal 2 in Figure 3A, but the length of the bars was not the same. We have edited the images to have the same magnifications for better comparison.

      In relation to Figure 4, we have changed the magnifications and now all the figures have the same scale bars and magnifications. In addition, we have added more images of broader fields in Figure 4-figure supplement 1 which were used to measure the percentage of permeabilized cells and to obtain the fluorescence intensity measures shown in Figure 4.

      An additional weakness of the paper is that the RNA-seq data is not fully investigated, and there is an absence of methods included regarding the RNA-seq differential abundance analysis (it is mentioned on L379-380 but no information is provided in the methods). As stated by the authors, 58% of differentially regulated genes belonged to the sw regulon, but the other 42% of genes are not discussed, and will hopefully be a target of future investigations.

      The methods section has been modified for a better explanation of the RNA-seq differential abundance analysis. Quote text: “The raw reads were pre-processed with SeqTrimNext (Falgueras et al., 2010) using the specific NGS technology configuration parameters. This pre-processing removes low-quality, ambiguous and low-complexity stretches, linkers, adapters, vector fragments, and contaminated sequences while keeping the longest informative parts of the reads. SeqTrimNext also discarded sequences below 25 bp. Subsequently, clean reads were aligned and annotated using the Bsub reference genome with Bowtie2 (Langmead and Salzberg, 2012) in BAM files, which were then sorted and indexed using SAMtools v1.484(Li et al., 2009). Uniquely localized reads were used to calculate the read number value for each gene via Sam2counts (https://github.com/vsbuffalo/sam2counts). Differentially expressed genes (DEGs) were analyzed via DEgenes Hunter, which provides a combined p value calculated (based on Fisher’s method) using the nominal p values provided by edgeR (Robinson et al., 2010) and DEseq2. This combined p value was adjusted using the Benjamini-Hochberg (BH) procedure (false discovery rate approach) and used to rank all the obtained DEGs. For each gene, combined p value < 0.05 and log2-fold change > 1 or < −1 were considered as the significance threshold”
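
      The statistical steps named in the quoted methods (Fisher's method to combine nominal p values, Benjamini-Hochberg adjustment, and the significance threshold) can be sketched as follows. This is a minimal standard-library illustration of the arithmetic, not the DEgenes Hunter implementation. The function names are assumptions made for this example, and Fisher's method as written here treats the input p values as independent.

```python
import math


def fisher_combined_p(p_values):
    """Combine p values with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with 2k
    degrees of freedom under the null; for even degrees of freedom the
    survival function has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # equals statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, len(p_values)):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def is_deg(adjusted_p, log2_fold_change):
    """Threshold from the quoted methods: adjusted p < 0.05 and |log2FC| > 1."""
    return adjusted_p < 0.05 and abs(log2_fold_change) > 1.0
```

      For example, two nominal p values of 0.05 for the same gene combine to roughly 0.017, and a gene is then called differentially expressed only if its adjusted combined p value and its fold change both pass the threshold.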

      Regarding the RNA-seq analysis, we are aware of the amount of information that can be extracted. Before filtering the information shown in the manuscript, we carried out bioinformatic analyses to look for a connection with the cellular response, that is, the increase in sporulation. Beyond this, we made some observations with no direct connection to sporulation, which would be interesting to pursue in future studies but would not add to the clarity of this story (Figure 2 below). In any case, we are including the whole picture of the transcriptomic changes occurring in Bsub after treatment with Tse1. KEGG pathway analyses of differentially expressed genes showed induction of flagellar assembly, aminobenzoate degradation, and nitrogen and amino acid metabolism. Interestingly, the fatty acid degradation and CAMP resistance pathways were also induced, probably in relation to the changes in the cell wall caused by the Tse1 toxin. On the other hand, the synthesis and degradation of ketone bodies pathway was mostly repressed.

      Figure 2. KEGG pathway analyses of genes differentially expressed occurring in Bsub after treatment with Tse1.

      Another methodological concern in this paper is the limited details provided for the calculation of the permeabilization rate (Figure 4, L359, L662-664). It is not clear how, or if, cell density was controlled for in these experiments.

      We agree with the reviewer and we have explained with more detail how the permeabilization rate was calculated. Quote text: “N=3 for Bsub treated with Tse1 and N=3 for untreated Bsub. N refers to the number of CLSM fields analyzed to calculate the number of permeabilized cells of the total of cells in the field”

      Finally, one weakness of the paper is the broad conclusions that they draw. The authors claim that the mechanism of sporulation activation is conserved across Bacilli when the authors only test one B. subtilis and one B. cereus strain. They further argue (lines 469+) that Tse1 requires a PAAR repeat for its targeting, but do not provide direct evidence for this possibility.

      We have toned down the final conclusion to specify that the activation of sporulation is a mechanism that can be found in different Bacillus species such as Bsub and Bcer. Regarding the second point, we have included a further explanation for this argument. Quoted text: “As shown in Figure 2B, a paar mutant has an active T6SS able to kill E. coli. However, as shown in Figure 2A, we noticed that a paar mutant (which encodes tse1) is not able to trigger B. subtilis sporulation to a similar level than Pchl WT strain. Given that paar deletion apparently abolishes Tse1 secretion, we suggest that Tse1 is a PAAR-associated effector that requires a PAAR repeat domain protein to be targeted for secretion, thereby increasing Bacillus sporulation during contact with Pseudomonas cells (Cianfanelli et al., 2016; Hachani et al., 2014; Whitney et al., 2014)”.

    1. Author Response

      Reviewer #3 (Public Review):

      The work is of general interest to audiences of public policy and public health. The data found some evidence that mobile health interventions may be affected by the type of mobile phone used but failed to substantiate conclusively the claim that the lack of mobile phone ownership may hinder their rollout. The claim about gender or geographic inequality must be elaborated in detail, and many developing countries are now connecting more users in rural areas through unconventional methods such as village phones instead of just mobile ownership.

      Strengths:

      The main strength of this paper is the usage of the cross-sectional data from the R7 Afrobarometer survey which is a newly available dataset and contains comprehensive data from more than 50 African countries. The usage of the Bayesian Logistic Regression (BLR) model produced some useful findings.

      Weakness:

      1) The authors have generalized a lot of things in a very simple manner. For example, they have assumed if participants have access to the internet means they own a smartphone and if they don't then they are basic phone users. It is possible a lot of smartphone owners do not have subscriptions to the internet due to the high cost of internet in African countries.

      We agree with the Reviewer that some smartphone owners may not have access to the internet due to the high cost of internet in African countries. Therefore, to estimate the percentage of SP owners who may not pay to access the internet, we looked at the frequency of access to the internet within this sub-group (Methods: lines 133-138). In the Afrobarometer surveys, participants were asked how often they accessed the internet; they were not asked to specify how they accessed the internet. We analyzed these data, stratified on the basis of the type of mobile phone that we assumed individuals owned (we assumed that an individual owned a smartphone if they reported that their mobile phone could access the internet, and that an individual owned a basic mobile phone if they reported that their mobile phone could not access the internet).

      Notably, we found that only 13% of individuals that we classified as SP owners (and 89% of individuals that we classified as owners of BP) reported that they never accessed the internet. We now include the results of this analysis in our revised manuscript (Results: lines 219-221); they are presented in Figure 1—figure supplement 2.
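
      The classification rule and the stratified tabulation described above can be sketched in a few lines. The field names (`phone_can_access_internet`, `internet_use`) and the toy records are hypothetical and do not reflect the actual Afrobarometer variable names or data.

```python
from collections import Counter


def classify_phone(phone_can_access_internet):
    """Assumed rule: internet-capable phone -> smartphone, otherwise basic phone."""
    return "smartphone" if phone_can_access_internet else "basic phone"


def percent_never_online(respondents):
    """Percent of each phone-type group reporting that they never access the internet."""
    totals, never = Counter(), Counter()
    for person in respondents:
        group = classify_phone(person["phone_can_access_internet"])
        totals[group] += 1
        if person["internet_use"] == "never":
            never[group] += 1
    return {group: 100.0 * never[group] / totals[group] for group in totals}


# Toy records for illustration only.
toy_sample = [
    {"phone_can_access_internet": True, "internet_use": "daily"},
    {"phone_can_access_internet": True, "internet_use": "never"},
    {"phone_can_access_internet": False, "internet_use": "never"},
    {"phone_can_access_internet": False, "internet_use": "never"},
]
result = percent_never_online(toy_sample)
print(result)
```

      A tabulation of this kind shows how many assumed smartphone owners never go online, which is the group the reviewer was concerned about.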

      Additionally, we now mention that in order to implement mHealth interventions that are based on smartphones, individuals will need to both own a smartphone and have financial means to access the internet.

      2) They have consistently talked about inequalities in gender, and rural-urban geographic regions based on the odds ratio derived from the BLR. A regression decomposition technique can quantify these differences more elaborately in detail.

      The purpose of our study was to determine – for 33 African countries – what proportion of people owned mobile phones (basic phones & smartphones) in each country, and if there were inequalities/inequities in the ownership of mobile phones based on: (i) gender, (ii) age, (iii) urban-rural residency, (iv) wealth, and (v) distance to a healthcare facility.

      We found high mobile phone ownership, which our results show varies substantially among the 33 countries. Additionally, by conducting a Bayesian Logistic Regression, we found significant inequalities/inequities in all five variables. We also identified substantial differences in the degree of these inequities among the 33 countries.

      We agree with the Reviewer that we have not explained why these inequalities exist, and that we could use a regression decomposition analysis to identify explanatory factors. We note that this is the next stage, and current focus, of our research. This next stage requires constructing new statistical models – and utilizing a different dataset – than the models that we present and the dataset that we utilize in our submitted manuscript. Consequently, conducting a regression decomposition analysis is beyond the scope of the present study: it will be an article in its own right.

      However, in response to this Comment, we have now added a description of potential factors that may explain inequalities in gender and rural-urban geographic regions (Discussion: lines 328-339). These factors have been identified in previous studies.

      3) They failed to explain why a lot of poor people own smartphones. This could be due to the usage of village phones (first implemented by Grameen Phone in Bangladesh). This has expanded in African countries as well where multiple users communicate through a community phone connecting more users in rural areas.

      We agree with the Reviewer. We now discuss the utilization of village phones in Africa, as well as other explanatory reasons for why a lot of poor people own smartphones (Discussion: lines 339-354).

      4) Basic phones may also be effective for mobile health interventions through voice-enabled systems and disseminating important messages to communities. (For example, there is extensive literature on how community-level messages, such as instructions on personal hygiene and usage of masks, were transmitted through basic phones during the beginning of COVID-19 in developing parts of Asia.)

      We agree with the Reviewer that basic mobile phones may also be effective for mHealth interventions through voice-enabled systems and disseminating important messages to communities. We have added a paragraph (Discussion: lines 370-396) to discuss current mHealth interventions that are being utilized in Africa, including both those based on smartphones and those based on basic mobile phones.

      5) Further clarification of why lack of ownership of a mobile phone may propagate inequalities in health is needed beyond just simple associations. A latent factor may also cause these differences.

      We have added a paragraph (Discussion: lines 356-368) to discuss this topic.

    1. Author Response

      Reviewer 1 (Public Review):

      To me, the strengths of the paper are predominantly in the experimental work, there's a huge amount of data generated through mutagenesis, screening, and DMS. This is likely to constitute a valuable dataset for future work.

      We are grateful to the reviewer for their generous comment.

      Scientifically, I think what is perhaps missing, and I don't want this to be misconstrued as a request for additional work, is a deeper analysis of the structural and dynamic molecular basis for the observations. In some ways, the ML is used to replace this and I think it doesn't do as good a job. It is clear for example that there are common mechanisms underpinning the allostery between these proteins, but they are left hanging to some degree. It should be possible to work out what these are with further biophysical analysis…. Actually testing that hypothesis experimentally/computationally would be nice (rather than relying on inference from ML).

      We agree with the reviewer that this study should motivate a deeper biophysical analysis of molecular mechanisms. However, in our view, the ML portion of our work was not intended as a replacement for mechanistic analysis, nor could it serve as one. We treated ML as a hypothesis-generating tool. We hypothesized that distant homologs are likely to have similar allosteric mechanisms which may not be evident from visual analysis of DMS maps. We used ML to (a) extract underlying similarities between homologs (b) make cross predictions across homologs. In fact, the chief conclusion of our work is that while common patterns exist across homologs, the molecular details differ. ML provides tantalizing evidence to this effect. The conclusive evidence will require, as the reviewer rightly suggests, detailed experimental or molecular dynamics characterization. Along this line, we note that we have recently reported our atomistic MD analysis of allostery hotspots in TetR (JACS, 2022, 144, 10870). See ref. 41.

      Changes to manuscript: “Detailed biophysical or molecular dynamics characterization will be required to further validate our conclusions(38).”

      Reviewer 3 (Public Review):

      However - at least in the manuscript's present form - the paper suffers from key conceptual difficulties and a lack of rigor in data analysis that substantially limits one's confidence in the authors' interpretations.

      We hope the responses below address and allay the reviewer’s concerns.

      A key conceptual challenge shaping the interpretation of this work lies in the definition of allostery, and allosteric hotspot. The authors define allosteric mutations as those that abrogate the response of a given aTF to a small molecule effector (inducer). Thus, the results focus on mutations that are "allosterically dead". However, this assay would seem to miss other types of allosteric mutations: for example, mutations that enhance the allosteric response to ligand would not be captured, and neither would mutations that more subtly tune the dynamic range between uninduced ("off") and induced ("on") states (without wholesale breaking the observed allostery). Prior work has even indicated the presence of TetR mutations that reverse the activity of the effector, causing it to act as a co-repressor rather than an inducer (Scholz et al (2004) PMID: 15255892). Because the work focuses only on allosterically dead mutations, it is unclear how the outcome of the experiments would change if a broader (and in our view more complete) definition of allostery were considered.

      We agree with the reviewer that mutations that impact allostery manifest in many different ways. Furthermore, the effect size of these mutations runs the full gamut from subtle changes in dynamic range to drastic reversal of function. To unpack allostery further, allostery of aTF can be described, not just by the dynamic range, but by the actual basal and induced expression levels of the reporter, EC50 and Hill coefficient. Given the systemic nature of allostery, a substantial fraction of aTF mutations may have some subtle impact on one or more of these metrics. To take the reviewer’s argument one step further, one would have to accurately quantify the effect size of every single amino acid mutation on all the above properties to have a comprehensive sequence-function landscape of allostery. Needless to say, this is extremely hard! Resolution of small effect sizes is very difficult, even at high sequencing depth. To the best of our knowledge, a heroic effort approaching such comprehensive analysis has been accomplished so far only once (PMID: 3491352).

      Our focus, therefore, was to screen for the strongest phenotypic impact on allostery i.e., loss of function. Mutations leading to loss of function can be relatively easily identified by cell-sorting. Because our goal was to compare hotspots across homologs, we surmised that loss of function mutations, given their strong phenotypic impact, are likely to provide the clearest evidence of whether allosteric hotspots are conserved across remote homologs.

      The reviewer raised the point of activity-reversing mutations. Yes, there are activity-reversing mutations in TetR. However, they represent a very small fraction: in the paper cited by the reviewer, there are 15 activity-reversing mutations among 4000 screened. Furthermore, the paper shows that activity reversal in TetR requires two to four mutations, whereas our library consists exclusively of single amino acid substitutions. For these reasons, we did not screen for activity-reversing mutations. Nonetheless, we agree with the reviewer that screening for activity-reversing mutations across homologs would be very interesting.

      The separation in fluorescence between the uninduced and induced states (the assay dynamic range, or fold induction) varies substantially amongst the four aTF homologs. Most concerningly, the fluorescence distributions for the uninduced and induced populations of the RolR single mutant library overlap almost completely (Figure 1, supplement 1), making it unclear if the authors can truly detect meaningful variation in regulation for this homolog.

      Yes, the reviewer is correct that the fold induction ratio varies among the four aTF homologs. However, we note that such differences are common among natural aTFs. Depending on the native downstream gene regulated by the aTF, some aTFs show higher ligand-induced activation and others lower. While this is not a hard and fast rule, aTFs that regulate efflux pumps tend to have higher fold induction than those that regulate metabolic enzymes. In summary, the variation in fold induction among the four aTFs is neither a flaw in experimental design nor a sign of experimental inconsistency; it is an inherent property of the protein-DNA interaction strength and the allosteric response of each aTF.

      Among the four aTFs, wildtype RolR has the weakest fold induction (15-fold), which makes sorting the RolR library particularly challenging. To minimize false positives as much as possible, we require that a dead mutant be present in (a) non-fluorescent cells after ligand induction, (b) non-fluorescent cells before ligand induction, and (c) at least two out of the three replicates for both sorts. Additionally, for RolR specifically, we adjusted the non-fluorescent gate to be far more stringent than for the other three aTFs (Fig. 1 – figure supplement 1). Furthermore, we assign residues as allosteric hotspots, not individual dead mutations. This buffers against false strong signals from stray individual dead mutations. Finally, the top interquartile range winnows them to residues showing a strong, consistent dead phenotype. As a result of these “safeguards” we have built in, the number of allosteric hotspots of RolR (57) is comparable to the other three aTFs (51, 53 and 48). This suggests that we are not overestimating the number of hotspots despite the weaker fold induction of RolR. We highlight in a new supplementary figure (Figure 1 – figure supplement 4) that changing the read count threshold from 5X to 10X produces near-identical patterns of mutations, suggesting that our results are also robust to changes in read depth stringency.

      Changes to manuscript: In response to the reviewer's comment, we have added the following sentence.

      “We note that the lower fold induction (dynamic range) of RolR makes it particularly challenging to separate the dead variants from the rest.”

      The methods state that "variants with at least 5 reads in both the presence and absence of ligand in at least two replicates were identified as dead". However, the use of a single threshold (5 reads) to define allosterically dead mutations across all mutations in all four homologs overlooks several important factors:

      Depending on the starting number of reads for a given mutation in the population (which may differ in orders of magnitude), the observation of 5 reads in the gated nonfluorescent region might be highly significant, or not significant at all. Often this is handled by considering a relative enrichment (say in the induced vs uninduced population) rather than a flat threshold across all variants.

      We regret the lack of clarity in our presentation. We wish to better explain the rationale behind our approach. First, we understand the reviewer’s point on considering relative enrichment to define a threshold. This approach works well in DMS experiments involving genetic selections, which is commonly the case, because activity scales well with selection stringency. One can then pick enrichment/depletion relative to the middle of the read count distribution as a measure of gain or loss of function.

      Second, this strategy does not, in practice, work well for cell-sorting screens. While it may be tempting to think of cell sorting as comparably activity-scaled as genetic selections, in reality, the fidelity of fluorescence-activated cell sorters is much lower. Making quantitative claims of activity based on cell sorting enrichment can be risky. It is wiser to treat cell sorting results as a yes/no binary, i.e., does the mutation disrupt allostery or not. More importantly, the yes/no binary classification suffices for our need to identify whether a certain mutation adversely impacts allosteric activity.

      Third, the above argument does not imply that all mutations have the same effect size on allostery. They don’t. We capture the effect size on individual residues, not individual mutations, by counting the number of dead mutations at a residue position. This is an important consideration because it safeguards us from minor inconsistencies that inevitably arise from cell sorting.

      Fourth, for a variant to be classified as allosterically dead, it must be present in both the uninduced and induced DNA-bound populations in at least two out of three replicates (four conditions total). This is a stringent criterion for selecting dead variants, resulting in highly consistent regions of importance in the protein even upon varying read count thresholds. To the extent possible, we have minimized the possibility of false positive bleed-through.

      Finally, two separate normalizations were performed on the total sequence reads to be able to draw a common read count threshold 1) between experimental conditions & replicates and 2) across proteins. First, total sequencing reads were normalized to 200k total across all sample conditions (presorted, -inducer, and +inducer) and replicates for each homolog, allowing comparisons within a single protein. Next, reads were normalized again to account for differences in the theoretical size of each protein’s single-mutant library, allowing for comparisons across proteins by drawing a common read count cutoff. For example, total sequencing reads of RolR (4,332 possible mutants) increased by 1.18x relative to MphR (3,667 possible mutants) for a total of 236k reads.
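
      The two normalization steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the pipeline used in the study: the function names, the choice of MphR as the reference library, and the example raw counts are hypothetical; only the 200k base total and the two library sizes come from the response.

```python
BASE_TOTAL = 200_000                           # per-sample read total stated in the response
LIBRARY_SIZES = {"MphR": 3667, "RolR": 4332}   # possible single mutants, from the response
REFERENCE = "MphR"                             # assumption: scale relative to MphR


def library_scale(protein: str) -> float:
    """Ratio of a protein's theoretical library size to the reference library."""
    return LIBRARY_SIZES[protein] / LIBRARY_SIZES[REFERENCE]


def target_total(protein: str) -> float:
    """Read total after the second (cross-protein) normalization."""
    return BASE_TOTAL * library_scale(protein)


def normalize_counts(raw_counts: dict, total: float) -> dict:
    """Rescale raw per-variant read counts so that they sum to `total`."""
    raw_total = sum(raw_counts.values())
    return {variant: n * total / raw_total for variant, n in raw_counts.items()}


# Hypothetical raw counts for three variants in one sorted sample.
raw = {"variant_1": 30, "variant_2": 50, "variant_3": 20}
within_protein = normalize_counts(raw, BASE_TOTAL)

print(round(library_scale("RolR"), 2))      # scale factor for RolR
print(round(target_total("RolR") / 1000))   # RolR total in thousands of reads
```

      With the library sizes quoted above, the scale factor for RolR rounds to 1.18 and the rescaled total to about 236k reads, matching the figures in the response.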

      Changes to manuscript: We have provided substantial additional details in the Fluorescence-activated cell sorting and NGS preparation and analysis sections.

      We also added the following in the main text.

      “In other words, we use cell sorting as a binary classifier i.e., does the mutation disrupt allostery or not. We capture the effect size on individual residues, not individual mutations, by counting the number of dead mutations at a residue position. This is an important consideration because it safeguards us from minor inconsistencies that inevitably arise from cell sorting.”
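
      The binary call described in this response can be sketched in a few lines. This is an illustration under stated assumptions, not the analysis code used in the study: the function name `is_dead`, the data layout, and the example counts are hypothetical. The sketch also assumes that a replicate counts only when both the uninduced and induced samples pass the cutoff in that same replicate, which is one reading of the criterion.

```python
from typing import Sequence, Tuple

MIN_READS = 5        # read-count cutoff stated in the response
MIN_REPLICATES = 2   # replicates (out of three) that must pass


def is_dead(counts: Sequence[Tuple[int, int]],
            min_reads: int = MIN_READS,
            min_replicates: int = MIN_REPLICATES) -> bool:
    """Binary call for one variant.

    `counts` holds one (uninduced, induced) pair of normalized read counts
    per replicate, both taken from the non-fluorescent gate. The layout is
    an assumption made for this sketch.
    """
    passing = sum(
        1 for uninduced, induced in counts
        if uninduced >= min_reads and induced >= min_reads
    )
    return passing >= min_replicates


# Hypothetical counts for two variants across three replicates.
variant_a = [(12, 9), (7, 6), (0, 3)]   # passes in replicates 1 and 2
variant_b = [(12, 1), (7, 6), (0, 3)]   # passes in replicate 2 only

print(is_dead(variant_a))                 # True
print(is_dead(variant_b))                 # False
print(is_dead(variant_a, min_reads=10))   # False under a stricter 10-read cutoff
```

      Because the output is a yes/no call per variant, effect size is carried by how many such calls accumulate at a residue position rather than by the read counts themselves.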

      Depending on the noise in the data (as captured in the nucleotide-specific q-scores) and the number of nucleotides changed relative to the WT (anywhere between 1-3 for a given amino acid mutation) one might have more or less chance of observing five reads for a given mutation simply due to sequencing noise.

      All the reads considered in our analyses pass the Illumina quality threshold of Q-score ≥ 30 which as per Illumina represent “perfect reads with no errors or ambiguities”. This translates into a probability of 1 in 1000 incorrect base call or 99.9% base call accuracy.
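
      The conversion between a Q-score and a base-call error probability follows the standard Phred relation, Q = -10 · log10(P). A minimal sketch (the function names are illustrative):

```python
def phred_error_probability(q: float) -> float:
    """Probability that a base call is wrong, given its Phred quality score."""
    return 10 ** (-q / 10)


def phred_accuracy(q: float) -> float:
    """Probability that a base call is correct."""
    return 1.0 - phred_error_probability(q)


for q in (20, 30, 40):
    print(q, phred_error_probability(q), phred_accuracy(q))
```

      At Q = 30 this gives an error probability of 1 in 1000, i.e. 99.9% base-call accuracy, as quoted above.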

      We use chip-based oligonucleotides to build our DMS library, which allows us to pre-specify the exact codon that encodes a point mutation. This means the nucleotide count and protein count are the same. The scenario referred to by the reviewer, i.e., “anywhere between 1-3 for a given amino acid mutation”, only applies to codon-randomized or error-prone PCR library generation. We regret if the chip-based library assembly part was unclear.

      Depending on the shape and separation of the induced (fluorescent) and uninduced (non-fluorescent) population distributions, one might have more or less chance of observing five reads by chance in the gated non-fluorescent region. The current single threshold does not account for variation in the dynamic range of the assay across homologs.

      We have addressed the concern raised by the reviewer on fluorescent population distributions in answers to questions 10 and 11.

      The reviewer makes an important point about the choice of sequencing threshold. We use the sequencing threshold simply to make a binary choice for whether a certain variant exists in the sorted population or not. We do not use the sequencing reads to scale the activity of the variant. To address the reviewer's comment, we have included a new supplementary figure (Fig 1 – figure supplement 4) where we compare the data by adjusting the threshold to two levels – 5 and 10 reads. As is evident in the new figure, the fundamental pattern of allosteric hotspots and the overall data interpretation do not change.

      TetR: 5x – 53 hotspots, 10x – 51 hotspots

      TtgR: 5x – 51 hotspots, 10x – 51 hotspots

      MphR: 5x – 48 hotspots, 10x – 48 hotspots

      RolR: 5x – 57 hotspots, 10x – 60 hotspots

      In other words, changing the threshold to be more or less strict may have a modest impact on the overall number of hotspots in the dataset. Still, the regions of functional importance are consistent across different thresholds. We have expanded the discussion in the manuscript to address this point.

      Changes to manuscript: We have now included a new supplementary figure comparing hotspot data at two thresholds: Figure 1 – figure supplement 4.

      We also added the following in the main text.

      “To assess the robustness of our classification of hotspots, we determined the number of hotspots at two different sequencing thresholds – 5x and 10x. At 5x and 10x, the numbers of hotspots are – TetR: 53, 51; TtgR: 51, 51; MphR: 48, 48 and RolR: 57, 60, respectively. Changing the threshold has a modest impact on the overall number of hotspots and the regions of functional importance are consistent at both thresholds.”

      The authors provide a brief written description of the "weighted score" used to define allosteric hotspots (see y-axis for figure 1B), but without an equation, it is not clear what was calculated. Nonetheless, understanding this weighted score seems central to their definition of allosteric hotspots.

      We regret the lack of clarity in our presentation. The weighted score was used to quantify the “deadness” of every residue position in the protein. At each position in the protein, the number of mutations that inhibited activity was summed up and the “deadness” of each mutation was weighted based on the number of replicates in which it appeared to inactivate the protein. Weighted score at each residue position is given by

      Where at position x in the protein, D1 is the number of mutations dead in one replicate only, D2 is the number of mutations dead in 2 replicates, D3 is the number of mutations dead in 3 replicates, and Total is the total number of variants present in the data set (based on sequencing data). Any dead mutation that is seen in only one replicate is discarded and does not contribute to the “deadness” of the residue. Mutations seen in two and three replicates contribute to the score. We have included a new supplementary figure (Fig. 1 – figure supplement 2) to give the reader a detailed heatmap of all mutations and their impact for each protein.
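
      The structure of such a per-residue score can be sketched as below. The response text does not state the weights, so `w2` and `w3` are placeholders and the values used in the example are hypothetical; the only properties taken from the description above are that mutations dead in a single replicate contribute nothing, that mutations dead in two or three replicates contribute, and that the sum is divided by the number of variants observed at the position.

```python
def weighted_score(d1: int, d2: int, d3: int, total: int,
                   w2: float, w3: float) -> float:
    """Per-residue "deadness" score (illustrative form only).

    d1, d2, d3: number of mutations at this position that are dead in
    exactly one, two, or three replicates. `total` is the number of
    variants observed at the position. w2 and w3 are placeholder weights.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    # d1 is accepted for completeness but discarded, as described in the text.
    return (w2 * d2 + w3 * d3) / total


# Hypothetical position: 19 observed variants, 3/4/6 dead in 1/2/3 replicates.
score = weighted_score(d1=3, d2=4, d3=6, total=19, w2=0.5, w3=1.0)
print(round(score, 3))
```

      Any weighting with `w3` greater than `w2` keeps the intended ordering, in which a position whose dead mutations reproduce across all three replicates scores higher than one supported by two replicates only.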

      Changes to manuscript: The weighted scoring scheme is now described in greater detail under Materials and Methods in the “NGS preparation and analysis” section.

      The authors do not provide some of the standard "controls" often used to assess deep mutational scanning data. For example, one might expect that synonymous mutations are not categorized as allosterically dead using their methods (because they should still respond to ligand) and that most nonsense mutations are also not allosterically dead (because they should no longer repress GFP under either condition). In general, it is not clear how the authors validated the assay/confirmed that it is giving the expected results.

      As we state in response to question 12, we use chip-based oligonucleotides to build our DMS library, which allows us to pre-specify the exact codon that encodes a point mutation. We have no synonymous or nonsense mutations in our DMS library. Each protein mutation is encoded by a single unique codon. The only stop codon is at the 3′ end of the gene.

      The authors performed three replicates of the experiment, but reproducibility across replicates and noise in the assay is not presented/discussed.

      Changes to manuscript: A new supplementary table (Table 1) is now provided with the pairwise correlation coefficients between all replicates for each protein.

      In the analysis of long-range interactions, the authors assert that "hotspot interactions are more likely to be long-range than those of non-hotspots", but this was not accompanied by a statistical test (Figure 2 - figure supplement 1).

      In response to the reviewer's comment, we now include a paired t-test comparing non-hotspots and hotspots with long-range interactions in the main text.

      Changes to manuscript: In all four aTFs, hotspots constituted a higher fraction of LRIs than non-hotspots (Figure 2 – figure supplement 1; P = 0.07).

    1. Author Response

      Reviewer #1 (Public Review):

      The authors trained rats to self-initiate a trial by poking into a nose poke, and to make a sequence of 8 licks in the nose poke after a visual cue. Trials were considered valid (called "timely") only if rats waited for more than 2.5 sec after the end of the previous trial. An attempt to initiate a trial (nose poking) before the 2.5 sec criterion was regarded as "premature". The authors recorded from the dorsal striatum while rats performed in this task. The authors first show that some neurons exhibited a phasic activation around the time of port entry detected using an infrared detector ("Entry cell"), as well as port exit ("Exit cell"). Some neurons showed activation at both entry and exit ("Entry and Exit cell") or between these two events ("Inside-port cell"). Fractions of neurons that fall into these four categories are roughly the same (Fig. 3C). The main conclusions drawn from this study are that (1) the activity preceding a port entry was positively correlated with the latency to initiate a trial (or "waiting time"; Fig. 4E), which appears to reflect the value of the upcoming reward, and that (2) in adolescent rats, the activity rose more steeply with the latency to trial initiation (Fig. 7J).

      These observations are potentially interesting, in particular, the possible difference between adult and adolescent rats is intriguing. However, this study does not examine whether this brain region actually plays a role in the task. Some of the conclusions appear to be premature.

      1) Previous studies have found correlations between the activity of neurons in the striatum and the latency to trial initiation (e.g. Wang et al., Nat. Neurosci., 2013) or action initiation more generally (e.g. Kunimatsu et al., eLife, 2018). In the former study, the trial initiation was self-generated, similar to the present study, and was modulated by the overall reward value (state value). In the latter study, the latency was instructed by a cue. Furthermore, there are many studies that showed correlations between striatal activity and future rewards (e.g. Samejima et al., Science, 2005; Lau and Glimcher, 2008). Many of these studies varied the value of upcoming reward (e.g. amount or probability). Although some details are different, the basic concepts have been demonstrated in previous studies.

      Although there are other studies linking striatal activity to trial/action initiation and reward probability, here the striatal activity preceding the execution of a learned sequence is dependent on the internal representation of the time waited. Elapsed time is the only cue the animal has regarding the possible outcome until it is too late and the trial has already been initiated. Although a light cue then tells the rat if the timing was correct or not, providing an opportunity to stop the behavior, the behavior released during premature trials resembles very closely that observed during unrewarded timely trials. This remarkable similarity between premature trials and timely unrewarded trials made it possible to compare the wait time-based modulation of anticipatory striatal activity under closely matched behavior. Moreover, we have compared striatal activity between adult and adolescent rats, finding a steeper wait time-based modulation of striatal activity in adolescent animals that correlates with more impulsive behavior in these animals.

      2) The authors conclude that "in this task, the firing rate modulation preceding trial initiation discriminates between premature and timely trials and does not predict the speed, regularity, structure, value or vigor of the subsequently released action sequence". This conclusion is based on the observation that premature and timely trials did not differ in terms of kinematic parameters as measured using accelerometer. Although the result supports that the difference in activity between premature and timely cannot be explained by the kinematic variables, it does not exclude the possibility that the activity is modulated by some kinematic variables in a way orthogonal to these trial types.

      While our accelerometer data do not support that differences in movement initiation time or velocity could explain the differences in striatal activity between adolescent and adult rats, we cannot rule out that kinematic variables not captured by the head accelerometer recordings could explain some of the results. This is acknowledged in the main text, results section, page 8, line 180.

      3) The firing rate plot shown in Figure 4D should be replotted by aligning trials by movement initiation (presumably available from accelerometer or video recording). Is it possible that the activity rises similarly between trial types but the activity is cut off depending on when the animal enters the port at different latency from the movement initiation? In any case, the port entry is a little indirect measure of "trial initiation".

      Unfortunately, we have not systematically obtained video recordings of the sessions and only have accelerometer recordings of a few of the animals that provided the neuronal data, which precludes replotting the data as suggested. Accelerometer recordings are available from two adult and two adolescent rats. Latency from movement initiation to port entry does not differ between premature and timely trials at either age. This is now reported on page 8 line 175 for adult rats, and page 15 line 341 for adolescent rats. These results appear to be at odds with the idea that decreased neuronal activity in premature trials is the result of a cut-off of the response.

      4) The difference between adult and adolescent rats are not particularly big, with the data from the adolescent rats showing a noisy trace.

      New data from two adolescent rats reduced the variability and confirmed the behavioral and physiological differences with adult rats. All panels from figure 7 now include the data from 5 adolescent animals instead of 3. The number of neurons analyzed in the adolescent group increased from 552 to 876. The inclusion of these new data allowed us to perform new statistical comparisons. We fitted a logistic function to accumulated trial initiation timing data (Fig. 7N) and found that the rate of accumulation is higher in adolescent rats. Importantly, this is observed not only in the part of the curve corresponding to premature responding but also during timely responding, indicating that adolescent rats' premature responding is a manifestation of a more general behavioral trait that makes them self-initiate trials faster than adults (Fig. 7N). The noisy trace of curves showing the amplitude modulation of anticipatory activity as a function of waiting time was partly due to the relatively low number of premature trials, which demanded using relatively long time bins. With more data available we have been able to replot these curves using a smaller bin size for the short waiting times (Fig. 7M). We have fitted a logistic function to these data and observed a higher rate of increase of this activity modulation in adolescent rats, paralleling the behavioral data. Moreover, we report a significant correlation between the behavioral and neurophysiological data (a steeper rate of trial initiation times curve correlates with a steeper wait modulation of anticipatory activity, Fig. 7O). These new findings are reported in the results section, from page 17 line 405 to page 18 line 417.
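
      For readers unfamiliar with the curve used here, a logistic function and its rate parameter can be sketched as follows. This is a generic illustration, not the fitting code used for Fig. 7: the parameterization (rate, midpoint, upper asymptote) is the common one and is assumed, and the parameter values are hypothetical, chosen only to contrast a shallow and a steep curve.

```python
import math


def logistic(t: float, rate: float, midpoint: float, upper: float = 1.0) -> float:
    """Logistic curve rising from 0 to `upper`, steepest at `midpoint`."""
    return upper / (1.0 + math.exp(-rate * (t - midpoint)))


def slope_at_midpoint(rate: float, upper: float = 1.0) -> float:
    """Maximum slope of the curve, reached at t == midpoint."""
    return upper * rate / 4.0


# Hypothetical parameter sets (waiting time in seconds).
shallow = {"rate": 1.0, "midpoint": 2.5}
steep = {"rate": 3.0, "midpoint": 2.5}

for t in (1.5, 2.5, 3.5):
    print(t, round(logistic(t, **shallow), 3), round(logistic(t, **steep), 3))
```

      A larger rate parameter means that the curve climbs faster around its midpoint, which is the sense in which a "higher rate of accumulation" or a "steeper" modulation is compared between groups.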

      Reviewer #2 (Public Review):

      The authors conduct an ambitious set of experiments to study how neural activity in the dorsal striatum relates to how animals can wait to perform an action sequence for reward. There are a lot of interesting studies on striatal encoding of actions/skills, and additionally evidence that striatal activity can help control response timing and time-related response selection. The authors bridge these issues here in an impressive effort. Recordings were made in the dorsal striatum on several tasks, and activity was assessed with respect to action initiation, completion, and outcome processing with respect to whether animals could wait appropriately or could not wait and responded prematurely. Conducting recordings of this sort in this task, particularly in some adolescent animals, is technically advanced. I think there is a very timely and potentially very interesting set of results here. However, I have some concerns that I hope can be addressed:

      It seems like the recordings were made throughout the dorsal striatum (histology map), including some recordings near/in the DLS. Is this accurate? The manuscript is written as though only the DMS was recorded.

      We acknowledge that our recordings are spread along the medial and central regions of the dorsal striatum. Although we are not sure that there is a consensus regarding the limits of the DMS and DLS, we believe that none of our recordings are clearly located within the DLS. Following your suggestion, we have modified the text and refer to the location of our recordings as “dorsal striatum”. Because there is a lot of work on the roles of the DLS and DMS in reward learning, we believe it is still important to refer to this work in the Introduction section and to discuss our findings in its context, particularly since we find that most task-related activity is concentrated at the beginning and end of the task, as shown in several studies focused on the DLS.

      If I understand correctly, the rats must lick 8 times to get the water. If this is true, one strategy is to just keep licking until the water comes. Therefore, the rats may not have learned an 8-lick action sequence. The authors should clarify this possibility, and if it is, to consider avoiding using phrases like "automatized action sequence" since no real action sequence might have been learned. In short, I am not convinced the animals have learned an action pattern rather than to just keep licking once a waiting period has elapsed.

      We acknowledge that the experiments do not allow us to establish if the rats know what the exact number of licks needed is; when the skill is acquired, licking becomes highly stereotyped and the rats might as well be learning a time after which continuous licking leads to reward. We still believe that the stereotyped performance, the inability to stop the behavior when the absence of the light cue unequivocally indicates that no reward will be obtained in premature trials, and the rapid decrease of lick rate after the eighth lick was emitted and no reward was obtained, support that the behavior is automatic until the time of expected reward delivery. A representative raster plot showing lick sequences during a whole session in a trained adult rat is presented in Fig. 1I and Figure 7 – supplement 1H shows an example of the licks of an adolescent rat.

      The number of subjects per group is very low. This is fine for analysis of within-animal neural activity. However, comparing the behavior between these groups of animals does not seem appropriate unless the Ns are substantially increased.

      The revised version of the manuscript includes a higher number of adolescent rats from which striatal activity and behavior were recorded, which allowed us to perform a more detailed statistical analysis of the correlations between these measures. In addition, we now include new behavioral data from an independent sample of 6 adult and 6 adolescent non-implanted rats that confirm the results obtained with the implanted animals (presented in Figure 7 – supplement 4).

      I found the manuscript difficult to decipher. There are many groups. If I understand correctly, there are the following:

      -ITI 2.5 s experiment

      -ITI 5 s experiment

      -ITI 2.5-5 s experiment

      -ITI 2.5 s experiment (adolescent)

      -Two accelerometer animals (unclear which experiment)

      -Two animals in ITI 2.5 sec without recordings (unclear how incorporated into analyses)

      Within each group, there are multiple categories of behavioral performance. This produces a large list of variables. In some parts of the results, these groups are separated and compared, but not all groups are compared in such sections. In other sections the different groups (all or just some?) appear to be combined for analysis, but it is not clearly described. Another consequence of mixing the groups and conditions together in analysis as they do is that some of the statements in the results are very hard to follow (E.g., line 305 "...similar behavior observed in 8-lick prematurely released and timely unrewarded trials...").

      To clarify the experimental groups, we now include a table (Table 1) summarizing which tasks were used and how many animals were trained in each task.

      Generally, it is difficult to understand the results without first understanding the details of the different tasks, the different groups of animals, and the different epochs of comparison for neural analysis. It took me a long time to work through the methods and I am still not sure I completely understand it. On this point, some sentences are very long and should be broken up into smaller, clearer sentences. There are a lot of phrases that only someone familiar with the cited articles might understand what they mean (e.g., even one paragraph starting with line 39 includes all of the following terms: automaticity in behavior; behavioral unit or chunk; reward expectancy; reward prediction errors and trial outcomes; explore-exploit; cost-benefit; speed-accuracy tradeoffs; tolerance to delayed rewards; internal urgency states). It is very hard to follow how each of these processes are to be understood in terms of behavioral measures used to study them and how they do or do not relate to the hypothesis of the present study. The discussion similarly uses a lot of different phrases to discuss the task and neural responses in a way that makes it hard to understand exactly what the author's interpretation of the data are. Is there maybe a 'most likely' interpretation that can be stated for some of the responses?

      Our main aim is to disclose the mechanisms underlying differences between adult and adolescent rats relating to impulsivity. We hope that this is clearer in this version of the manuscript, in which we have deepened the analysis of the differences between them. We believe that our data do not allow us to determine unequivocally which cognitive process ultimately produces the striatal activity differences between adult and adolescent rats (e.g., differences in internal urgency states, time perception, or tolerance to delayed rewards), and we have tried to reflect that fairly in the Discussion.

      The data set is extremely rich; there are lot of data here. As a result it can be hard to understand how all of the data relate to the main hypothesis of the article. It often reads as an exploratory set of results section rather than a series of hypothesis tests.

      We have tried to improve the overall clarity of the text.

      Reviewer #3 (Public Review):

      Cecilia-Martinez et al., implement a task that allows the study of premature versus timely actions in rats. First, they show that rats can learn this task. Next, they record the activity in the DMS showing start/stop signals in the cells recorded, next they propose that the activity detected before the release of actions sequences discriminate the premature vs the timely initiations showing a relationship between the waiting time and the activity of cells recorded, furthermore they show that it could be the expectancy of reward what could be encoded in the activity before entering the port. Last they show that adolescent rats show more premature starts than adult rats documenting a difference in activity modulation of DMS cells in the relation between waiting time and firing rate (although above the premature threshold, see comments below).

      Overall the paper is well presented describing a well-developed set of experiments and deserves publication attending only minor comments.

      1) I understand rats learn to execute sequences of <8licks or 8 licks, although diagrams are presented, no examples of the individual trials with 8 licks, neither distributions of bouts of these licks are presented.

      Rats learn to execute a lick sequence to obtain the reward. The experiments do not allow us to establish if they know what the exact number of licks needed is; when the skill is acquired, licking becomes highly stereotyped and the rats might as well be learning a time after which continuous licking leads to reward. A representative raster plot showing lick sequences in a session in a trained adult rat is presented in Figure 1I and Figure 7 - supplement 1H shows an example of the licks of an adolescent rat.

      2) Relevant to the statement: "in this task, the firing rate modulation preceding trial initiation discriminates between premature and timely trials and does not predict the speed, regularity, structure, value or vigor of the subsequently released action sequence"... It is not clear if the latency to first lick (plot 2D) and the inter-lick interval (2E) is only from the 8Lick sequences or not. If that is not the case, it is important to compare only the ones with 8Licks.

      The data are from 8-lick sequences; this is now indicated in the figure legend.

      3) Related to the implications of the previous statement, there seems to be a tendency for longer latency to first lick in timely vs premature trials in Figure 2D (timely-trials-Late vs premature-trials-late)? Again here it is important to compare the 8licks sequences only.

      Only 8-lick sequences are compared. The two-way ANOVA showed a significant effect of training stage, no significant effect of trial timing (premature versus timely), and no significant interaction. The mean ± SEM latencies to the first lick of the 8-lick sequence were 0.717 ± 0.063 s for late-stage timely trials and 0.805 ± 0.086 s for late-stage premature trials.

      4) I could not find in the main text whether the individual points in Fig.2 (e.g. 2B-E) are individual animals. Please specify that.

      In these figure panels, every individual point corresponds to the mean of a session; the data come from 5 adult animals (2-5 sessions per animal and timing condition). Whether the data correspond to animals or sessions is now clarified in all figure legends.

      5) Although very elegant the argument presented in Figure 4C and 6C, I wonder if the head acceleration may lose differences in movements outside the head in the two kinds of trials. If that is the case please acknowledge it.

      We acknowledge in the main text, results section, page 8, line 180, that the accelerometer does not allow us to determine if the movements of other body parts differ between trial types.

      6) Also in 4C, small separations between timely vs premature signals are seen before 0. Is there a way to know if animals in timely vs premature trials approached the entry port in the same way? This request is pertinent in order to rule out motor contribution to the differences in Figure 4A-B.

      Although it is not possible to completely rule out small movement differences between premature and timely trials, no evident behavioral differences can be detected by trained observers or by analyzing video recordings taken during some sessions. The available accelerometer recordings also suggest that a similar motor pattern is displayed in premature and timely trials (Figure 4C).

      7) when saying: "Similar results were obtained in rats trained with a longer waiting interval (Supplementary Figure 5)", "is hard to see the similarity in the premature range, while in the 2.5 seconds task there is a positive relationship in the 5 seconds task it is not.

      Please note that a positive relationship is observed for the two bins preceding trial initiation, which are about 2.75 s and 1 s before port entry. The bin that does not seem to fit is centered 4 s before port entry (1 s after exiting the port in the previous trial). Because of the longer waiting time, behavior in the 5 s task becomes less organized during the first seconds after port exit; however, the modulation of activity is still observed in the bins that are close to port entry.

      8) The data showing that the waiting modulation of reward anticipation grows at a faster rate in adolescent rats is clear, however, it is not clear how it could be related to the data showing that the adolescent rats were more impulsive.

      We acknowledge that the data do not provide a causal link with behavior. After adding two new adolescent rats, we were able to study in more detail the relationship between the waiting modulation of neuronal activity and the accumulation of trial initiations (depicted in Figures 7M and 7N, respectively) by fitting logistic functions to the data. The new results are explained on page 17, line 384. There is a striking parallel between the growth rates of both curves, and the curves of adolescent rats are significantly steeper than those of adult rats. Moreover, there is a significant correlation between the coefficients that mark the rate of growth of the behavioral and neurophysiological data (Fig. 7O).
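
      As an illustration of what comparing logistic growth rates involves, the sketch below estimates the rate coefficient k of a curve p(t) = 1 / (1 + exp(-k (t - t0))). It is not the authors' procedure, which is not described in this response: it assumes each curve has been normalized to the 0-1 range and uses a simple logit linearization instead of nonlinear least squares. The function names and the synthetic values are hypothetical.

```python
import math


def logistic(t, k, t0):
    """Logistic curve with growth rate k and midpoint t0, bounded in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-k * (t - t0)))


def fit_logistic_rate(times, proportions, eps=1e-6):
    """Estimate (k, t0) by ordinary least squares on the logit scale.

    Assumption: `proportions` are normalized to the 0-1 range, so that
    logit(p) = k * t - k * t0 is linear in t.
    """
    xs, ys = [], []
    for t, p in zip(times, proportions):
        p = min(max(p, eps), 1.0 - eps)  # clip so the logit stays finite
        xs.append(t)
        ys.append(math.log(p / (1.0 - p)))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                    # slope on the logit scale = growth rate
    intercept = mean_y - k * mean_x
    t0 = -intercept / k              # waiting time at which p = 0.5
    return k, t0


# Synthetic, noise-free curves (hypothetical values, not data from the study).
wait_times = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
steep_curve = [logistic(t, 2.0, 3.0) for t in wait_times]
shallow_curve = [logistic(t, 1.0, 3.5) for t in wait_times]

k_steep, t0_steep = fit_logistic_rate(wait_times, steep_curve)
k_shallow, t0_shallow = fit_logistic_rate(wait_times, shallow_curve)
```

      A larger k means a steeper curve, so comparing k between groups, or correlating the behavioral k with the neurophysiological k across animals, is one way to quantify the parallel described above.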

      9) Related to the sentence: "the strength of anticipatory activity increased with the time waited before response release and was higher in the more impulsive adolescent rats"....One may expect to see a difference in the range of the premature time however the differences were observed in the range >2.5 seconds. Please explain how to reconcile this finding with the fact that the adolescent rats were more impulsive.

      Please note that the more impulsive behavior of adolescent rats (and the faster growth of the wait modulation of anticipatory activity) is observed along waiting times that exceed the 2.5 s criterion wait time; we added a phrase in the Results section (page 18, line 413) and in the Discussion section (page 19, line 443) to emphasize this point. Regarding the premature trials, a related issue was raised by reviewer #1, concern 4. The addition of new data from adolescent animals allowed us to use smaller bins to better discriminate what happens at short waiting times, and we included an inset in Figure 7M that makes it easier to appreciate what happens at these intervals.

    1. Author Response

      Reviewer #2 (Public Review):

      “The authors wish to relate beat-to-beat coordination of cardiac function (in this case as measured left ventricular pressure) to the activity of sympathetic neuron spiking within the stellate ganglion. A strength includes the challenging measurements from multiple stellate neuron activity over long durations in situ in the anesthetized pig.”

      We thank the reviewer for their feedback.

      “A major and overriding weakness is the founding assumption of the analysis that the underlying sympathetic neurons are all cardiac functioning in nature - an assumption that is overwhelmingly unlikely given the evidence in other species including humans that stellate postganglionic neurons are functionally mixed and have functional noncardiac targets. The use of broad and poorly explained/defined terms such as "event entropy" is difficult to follow and find meaning from. The manuscript is filled with difficult-to-follow text like "The neural specificity metric (Sudarshan et al., 2021). Fig. 5", is used to evaluate the degree to which neural activity is biased toward control target states taken here as LVP" and "The neural specificity is reduced from a multivariate signal to a univariate signal by computing the Shannon entropy at each timestamp of the mapped neural specificity metric". The figures are difficult to understand with axes that often bear no units or are quite compressed obscuring the intuitive meaning of the data trends. Fundamentally, cardiac pressure cycles with each heartbeat - roughly once per second - yet fluctuations in the depicted mean spike rate data with changes perhaps ten times in 25 minutes. Such plots are disorienting and difficult to associate with cardiac or neuron "functioning". Only 17 of the 38 references are not self-citations and thus the cited literature represents a narrow view of sympathetic regulation and sympathetic/stellate ganglion knowledge. Much of the foundations are self-professed in earlier publications by the present group and assumed to be accepted.”

      “Fundamentally, cardiac pressure cycles with each heartbeat - roughly once per second - yet fluctuations in the depicted mean spike rate data with changes perhaps ten times in 25 minutes. Such plots are disorienting and difficult to associate with cardiac or neuron "functioning”

      We would like to clarify this point with the understanding that the reviewer is referring to the time axis in Figure 3C in the manuscript.

      The coactivity matrix constructed in Figure 3C computes the cross-correlation of the sliding mean/std spike activities for different pairs of channels. The mean spiking activities across channels, as the reviewer correctly pointed out, do indeed have a weak autocorrelation at the period of the heart rate. This weak correlation at the heart rate period, possibly due to slow firing rates, was seen across all channels of both control and HF animals. However, the cause of a large proportion of channel pairs exhibiting high coactivity, termed cofluctuation (shown as red tracings in Fig 3D), is not known and cannot be directly associated with cardiac functioning.
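
      A minimal sketch of the kind of computation described here: smooth each channel's binned spike counts with a sliding mean, then correlate every pair of channels. This is an illustration under stated assumptions, not the authors' pipeline. The window length, the use of a single zero-lag Pearson correlation over the whole recording, and all names are hypothetical; the manuscript also uses a sliding standard deviation and tracks coactivity over time.

```python
import math


def sliding_mean(values, window):
    """Mean of each full window of `window` consecutive samples."""
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        out.append(running / window)
    return out


def pearson(a, b):
    """Zero-lag Pearson correlation; returns 0.0 if either series is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def coactivity_matrix(spike_counts, window):
    """Pairwise correlation of sliding-mean spike activity across channels.

    `spike_counts` is a list of channels, each a list of binned spike counts.
    """
    smoothed = [sliding_mean(channel, window) for channel in spike_counts]
    n = len(smoothed)
    return [[pearson(smoothed[i], smoothed[j]) for j in range(n)]
            for i in range(n)]


# Toy input (hypothetical): channel 1 follows channel 0, channel 2 opposes it.
channels = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 4, 6, 8, 10, 12, 14, 16],
    [8, 7, 6, 5, 4, 3, 2, 1],
]
matrix = coactivity_matrix(channels, window=3)
```

      In this toy input the entry for channels 0 and 1 is close to 1 and the entry for channels 0 and 2 is close to -1; a large fraction of pairs with high values is what the response calls cofluctuation.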

      The cofluctuation was also found to be aperiodic in nature approximating a lognormal distribution (Fig R1) with the HF animals containing heavy tails outside their confidence intervals (Fig R1B). The event rate computed from the cofluctuation time series (shown as blue steps in Fig 3E) for an animal is a measure of spatial coherence among SG neural populations and was developed as a novel metric to be used in future studies.

      Figure R1: Cofluctuation histograms (calculated from the mean or standard deviation of the sliding spike rate, referred to as Cofluctuation_MEAN and Cofluctuation_STD, respectively) and log-normal fits for each animal group. μ_FIT and σ_FIT are the respective mean and standard deviation (STD) of the fitted distribution, used for the 68% confidence interval bounds. A-B: Control animals have narrower bounds and represent a better fit to the log-normal distribution. C-D: Heart failure (HF) animals display more heavily skewed distributions that indicate heavy tails.

      “Only 17 of the 38 references are not self-citations and thus the cited literature represents a narrow view of sympathetic regulation and sympathetic/stellate ganglion knowledge. Much of the foundations are self-professed in earlier publications by the present group and assumed to be accepted.”

      We thank the reviewer for pointing this out. We have added four additional citations from the neuroscience literature that cover methods such as neural population bias and linkages between spatiotemporal dynamics and control targets. We have added these citations to page 15 in the “Conclusion” section of the manuscript. In addition, these cardiac nervous system experiments are our group’s specialty; we are not aware of another group collecting multi-electrode array data from the cardiac nervous system and studying the population dynamics of cardiac neurons. Hence, we build on our previous work. The most relevant literature (not necessarily related to the cardiac nervous system) can be found in the neuroscience references we cited, which contain applications of neural population recordings for different brain areas, mainly in the neuropsychiatry domain, to understand disease dynamics.

      “For the expert or even the uninformed reader, this report is broadly confused and confusing. The premises (beat to beat or whether LVP conveys cardiac function) are poorly supported. The conclusions are quite vague.”

      Thank you for your feedback. To make the manuscript easier to follow, we moved all mathematical details to the supplementary material, rewrote the abstract and the conclusion from scratch, and split the methods figures that may have been confusing. We believe that our novel metrics, event rate and entropy, capture non-trivial linkages between heart failure status, cardiac neural activity (spike activity), and peripheral activity (LVP). We have supported our metrics with data from 17 animals obtained with state-of-the-art surgical techniques and technology, and reported our results with detailed statistical analyses. Our manuscript essentially highlights that the event rate and entropy metrics are significantly different between control animals and animals with heart failure. These metrics can be used to design future studies with these animal models to provide a more quantitative approach to heart disease, rather than binary (yes or no) descriptions.

      “Discussion: The abstract does not convey conclusions from the findings and contains broad statements such as "signatures based on linking neuronal population cofluctuation and examine differences in "neural specificity" of SG network" that have little substantive value or conclusion for the reader. Fundamentally what does the title "signatures based on linking neuronal population" cofluctuation mean to the reader? What changed in HF?”

      Thank you for this comment. We completely revised the abstract and conclusion as detailed in our response to Essential Revision #1. Event rate is a metric related to neural activity recordings and entropy is related to the association of neural activity to left ventricular blood pressure. Our findings suggest that both the neural population activity itself (event rate) and its ability to pay attention to cycles of left ventricular pressure (neural specificity) are significantly higher in animals with HF compared to controls.

    1. Author Response

      Reviewer #2 (Public Review):

      McCoy et al. has developed a new urban tree species database from existing city tree inventories. They designed procedures to collect and clean a large amount of data, i.e., more than five million trees from 63 US cities. They found that urban trees were significantly clustered by species in 93% of cities using the compiled data. They also showed that climate significantly shaped both nativity and tree diversity. Also, they identified the homogenization effect of the non-native species. The interest in patterns of urban biodiversity and its driving mechanism has been rising recently. This paper provides an important data source for addressing research questions on this topic. The finding presented by the authors exemplified its potential.

      Strengths

      Compared to the existing urban tree database, such as the one developed by Ossola et al. (Global Ecology and Biogeography 2020), the new database added information on spatial location, nativity statuses, and tree health conditions besides occurrences. The new information expands data usability and saves valuable time for researchers. The authors also make the tools available so others can use them to process their own data sets. Because of the added information, various analyses of the diversity pattern of urban trees and the potential driving mechanism could be conducted. The authors found that individual species nonrandomly clustered urban trees. This finding corroborates the existing knowledge that some common species dominate urban trees. Nevertheless, the authors showed that the dominance was apparent in the spatial dimension. The preliminary finding that the native status of a tree had no apparent impact on tree health is interesting. It can potentially contribute to the debate on native vs. exotic in urban tree species selection, which the author mentioned in the paper.

      Thank you for the feedback!

      Weakness

      While the new database and the analysis based on it has strengths, some aspects of the concepts and data analysis need to be clarified and extended.

      We appreciate these helpful comments and have made many changes in response, detailed below.

      First, the authors need to define several critical concepts used in the paper, including city trees, urban forests, biodiversity, and species diversity. The authors used city trees and urban forests interchangeably throughout the paper. Nevertheless, a widely accepted definition of the urban forest is:"All woody and associated vegetation in and around dense human settlements." Konijnendijk et al. had a good discussion on the terminology used in urban forestry (Urban Forestry & Urban Greening, 2006). Similarly, biodiversity is different from species diversity. Effective species number is a diversity indicator. Therefore, it is challenging to accept conclusions being drawn on biodiversity in urban forests without clear definitions.

      We appreciate these clarifications; we have made our terminology consistent throughout and added these important definitions.

      • “...urban forests, which are the woody and associated vegetation in and around dense human settlements (Konijnendijk et al., 2006).”

      • “City tree communities, an essential component of urban forests, provide many services.”

      We replaced the term “biodiversity” throughout the text wherever we meant “tree species diversity” or just “diversity.”

      Second, the tree inventories varied significantly regarding the number of records (214~720,140). The variation can be due to the actual variation of tree abundance in studied cities or incomplete inventories. Biases can be introduced into the findings when comparing these inventories without adjusting the unequal sample sizes. The authors did not detail how they dealt with this issue when conducting the analysis.

      We redid all of our relevant analyses and applied Chao’s rarefaction and extrapolation techniques throughout the manuscript. The (substantial) changes are fully described above in the “Essential Revisions” section. We also copy them here.

      First, we redid all of our diversity calculations applying Chao’s rarefaction and extrapolation techniques through the R package iNext. Therefore, our summary datasheet now has many new columns to include the following values for each city:

      ○ Effective species number:

      ■ Raw effective species number

      ■ Asymptotic estimate of effective species number with confidence interval

      ■ Estimate of effective species number for a given population size (37,000 trees– the median population size rounded to the nearest 1,000) with confidence interval

      ○ Species richness:

      ■ Raw species richness (number of species)

      ■ Asymptotic estimate of number of species with confidence interval

      ■ Estimate of number of species for a given population size (37,000 trees– the median population size rounded to the nearest 1,000) with confidence interval

      ○ The same for the native-only population of trees in each city (i.e., not just the raw effective number of native species but also the iNext estimates and confidence intervals)

      ○ Whether or not each of the values above was calculated using extrapolation or interpolation

      ○ Sample coverage estimates

      Second, we re-ran our models testing for significant correlations between species diversity in a city and other factors (including climate), where we used the extrapolated / interpolated effective species numbers from iNext. Specifically, we found the best fit model, which included the following predictors: environmental PCA1, environmental PCA1:environmental PCA2, and whether or not a city was designated as a Tree City USA. Then, we ran this model under six sensitivity conditions, varying the independent variable and/or which cities we included based on completeness of their sample. Climate was still a significant correlate of diversity.

      ○ first, with independent variable = effective species as calculated for a given population of 37,000 trees ("effective species for a standardized population size");

      ○ second, independent variable = the asymptotic estimate of the effective species number for that city as calculated using iNext;

      ○ third, the raw effective species number;

      ○ fourth, excluding cities with fewer than 10,000 trees;

      ○ fifth, excluding cities with <50% spatial coverage;

      ○ sixth, excluding cities with <0.995 sample coverage as calculated by iNext.

      ○ For the fourth, fifth, and sixth models, the independent variable was effective species for a standardized population size of 37,000 trees.

      Third, we redid our comparisons of tree populations in parks versus those in urban areas. Parks were still more diverse than urban areas.

      ○ Specifically, we used iNext to calculate diversity metrics based on the smaller of the two population sizes (park vs urban) to enable fair comparison for each city.

      ○ We reported comparison results for (i) raw effective species number, (ii) asymptotic estimate, and (iii) estimate for a given population.

      ○ In doing so, we eliminated Milwaukee from the comparison (it had only 28 trees recorded as being in an urban setting).

      Fourth, we redid our pairwise comparisons of tree community composition between cities in order to account for different population sizes and sampling efforts. To do so, we randomly subsampled the larger city to make its population equal to the smaller city, calculated comparison metrics, and repeated this process 50 times. We report the average comparison metrics.

      Our new Methods text is copied here for your convenience:

      ○ “Throughout our analyses, it was necessary to control for different sample sizes (and different, but unknown, sampling efforts across cities). To do so, we relied on the rarefaction / extrapolation methods developed by Chao and colleagues (Chao et al., 2015, 2014; Chao & Jost, 2012) and implemented through the R software package iNext (Hsieh et al., 2016). In short, these methods use statistical rarefaction and/or extrapolation to generate comparable estimates of diversity across populations with different sampling efforts or population sizes, alongside confidence intervals for these diversity estimates. iNext performs these tasks for Hill numbers of orders q = 0, 1, and 2. We used two techniques in iNext to allow for comparisons across cities (and between parks and urban areas within cities). First, we generated asymptotic diversity estimates for each; second, we generated diversity estimates for a given standardized population size. For our diversity analyses, the standardized population size we used was 37,000 trees (the rounded median of all cities). For analyses of the diversity of native trees, we used a standardized population size of 10,000 trees. For comparisons of the diversity between park and urban areas in a city, we used the smaller of the two population sizes (park or urban). In all cases we also recorded confidence estimates, and plotted rarefaction/extrapolation curves.

      ○ To control for variation in how uniformly trees were sampled across a city’s geographic range, we developed a procedure to score each city’s spatial coverage (see section Spatial Structure below).

      ○ We identified the best-fitting model, and then repeated our analysis under six sensitivity conditions to control for differences in population size, sampling effort, spatial coverage, and sample coverage. Our sensitivity analyses were as follows: first, with independent variable = effective species as calculated for a given population of 37,000 trees ("effective species for a standardized population size"); second, independent variable = the asymptotic estimate of the effective species number for that city as calculated using iNext; third, the raw effective species number; fourth, excluding cities with fewer than 10,000 trees; fifth, excluding cities with <50% spatial coverage; sixth, excluding cities with <0.995 sample coverage as calculated by iNext. For the fourth, fifth, and sixth models, the independent variable was effective species for a standardized population size of 37,000 trees.”
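
      To make the diversity measures above concrete, the sketch below computes Hill numbers of orders q = 0, 1, and 2 from raw species counts, plus the classical expected richness in a random subsample of n individuals. It is an illustration only: it does not reproduce the coverage-based rarefaction/extrapolation estimators or the confidence intervals provided by the R package, and the function names and example counts are hypothetical.

```python
import math


def hill_number(counts, q):
    """Hill number (effective number of species) of order q.

    q = 0 gives species richness, q = 1 the exponential of Shannon entropy,
    and q = 2 the inverse Simpson concentration.
    """
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if q == 1:
        # The general formula is undefined at q = 1; use its limit.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


def rarefied_richness(counts, n):
    """Expected number of species in n individuals drawn without replacement.

    Classical individual-based rarefaction: a species with N_i individuals
    is missed with probability C(N - N_i, n) / C(N, n).
    """
    total = sum(counts)
    if not 0 < n <= total:
        raise ValueError("n must be between 1 and the number of individuals")
    all_draws = math.comb(total, n)
    return sum(1.0 - math.comb(total - c, n) / all_draws
               for c in counts if c > 0)


# Hypothetical inventories: one even, one dominated by a single species.
even_city = [50, 50]
uneven_city = [90, 5, 5]

richness_uneven = hill_number(uneven_city, 0)       # three species recorded
effective_uneven = hill_number(uneven_city, 1)      # far fewer "effective" species
richness_at_20 = rarefied_richness(uneven_city, 20)  # richness expected in 20 trees
```

      The uneven inventory has three species but an effective species number well below three, which is why effective species number and raw richness can rank cities differently, and why both depend on how many trees were sampled.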

      Reviewer #3 (Public Review):

      This paper's strength is in the utility of the assembled datasets and some interesting and creative proof of concept analyses. This is an amazing resource for comparative analysis. However the paper felt a little sparse in the conceptual and methodological underpinnings of the questions asked to demonstrate the utility of the analysis. Specifically, I suggest:

      A) More substance in the introduction (currently only two short paragraphs) and a clear statement of research questions.

      We have added text to frame our goals and hypotheses:

      ○ “In particular, we wanted to know whether local climatic conditions are associated with the species diversity of city tree communities, how species diversity was distributed in space within cities, and whether introduced tree species contribute to biotic homogenization among urban ecosystems.”

      B) Add data on the extent to which each dataset represents a complete sample of each city's trees. I know are complete inventories, but some consist of 720 trees and cannot be a complete sample. A column in the meta data indicating effort and if there were any bias in where sampling occurred if the dataset is not complete are needed for others to use this data appropriately. For example, we know tree cover/diversity increases with wealth (which the author rightly cites). Let's say in City X, trees were only inventoried in one wealthy neighborhood. They would not be a representative sample of the city and dataset users need to be aware of this before they draw incorrect conclusions about City X where the sample was biased compared to city Y where the inventory was complete, including a sampling of all affluent and poor areas. This is also needed to support the research questions throughout the paper.

      We completely agree, and have made two major changes in response.

      First, we redid all of our diversity analyses after applying Chao’s rarefaction and extrapolation methods to permit comparison between populations of different sizes and sampling efforts. We added new columns to our datasheet with sample coverage estimates, asymptotic estimates of diversity, and diversity estimates for a given population size.

      Second, we also examined spatial coverage in a city because of the valid concern you raised that trees may only be sampled from particular neighborhoods or areas. In short, we divided each city into grid cells, counted trees per grid cell, and calculated metrics of coverage (adjusted number of trees per grid cell, and proportion of grid cells that were empty) and bias (skew and kurtosis of the number of trees in occupied grid cells). These factors are presented in Spatial_Coverage_Supplement.zip. As you can see even from a glance at the spatial coverage plots, some cities are indeed extremely biased! Therefore, we ran a sensitivity analysis where we excluded cities with <50% spatial coverage.
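
      The grid-based scoring described above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes a rectangular bounding box in place of the real city boundary, square cells of a hypothetical size, population (uncorrected) moment formulas for skew and kurtosis, and an unadjusted mean of trees per cell, because the adjustment used in the manuscript is not described here.

```python
import math
from collections import Counter


def grid_coverage(points, bounds, cell_size):
    """Summarize how evenly (x, y) tree locations cover a rectangular area.

    `bounds` is (min_x, min_y, max_x, max_y). Returns the proportion of empty
    cells, the mean number of trees per cell, and the skew and excess kurtosis
    of tree counts in occupied cells.
    """
    if not points:
        raise ValueError("at least one tree location is required")
    min_x, min_y, max_x, max_y = bounds
    n_cols = max(1, math.ceil((max_x - min_x) / cell_size))
    n_rows = max(1, math.ceil((max_y - min_y) / cell_size))

    per_cell = Counter()
    for x, y in points:
        # Points on the upper edges are assigned to the last row or column.
        col = min(int((x - min_x) // cell_size), n_cols - 1)
        row = min(int((y - min_y) // cell_size), n_rows - 1)
        per_cell[(row, col)] += 1

    occupied = list(per_cell.values())
    n = len(occupied)
    mean = sum(occupied) / n
    m2 = sum((c - mean) ** 2 for c in occupied) / n
    m3 = sum((c - mean) ** 3 for c in occupied) / n
    m4 = sum((c - mean) ** 4 for c in occupied) / n
    return {
        "prop_empty": 1.0 - n / (n_rows * n_cols),
        "trees_per_cell": len(points) / (n_rows * n_cols),
        "skew": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0,
    }


# Four hypothetical trees in a 2 x 2 grid of unit cells; one cell stays empty.
trees = [(0.5, 0.5), (0.6, 0.4), (1.5, 0.5), (0.2, 1.7)]
coverage = grid_coverage(trees, bounds=(0.0, 0.0, 2.0, 2.0), cell_size=1.0)
```

      A city where `prop_empty` is high, or where the counts in occupied cells are strongly skewed, would be a candidate for exclusion in a sensitivity analysis like the one described above.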

      C) The authors chose to use effective species counts as their alpha diversity metric of choice. They explain why: "effective species counts (a measure that allows comparison between cities of different sizes)" (Ln 109). While effective species number is an excellent metric with much better behavior and attributes in linear modeling, I believe it is still strongly dependent on both city area and the number of individual trees sampled and so the above statement and all of the comparisons that flow out of it in the manuscript are currently unsupported. Just as species richness needs to be rarified or extrapolated to be compared at an equivalent # of individuals or area to be accurate so too does EFN (effective species count). Fortunately there is an R package (iNext) based on Chao's method (citation below) that makes it very easy to create effective species accumulation curves for each city by tree individuals sampled.

      a. Chao, Anne, Nicholas J. Gotelli, T. C. Hsieh, Elizabeth L. Sander, K. H. Ma, Robert K. Colwell, and Aaron M. Ellison. 2014. "Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies." Ecological Monographs 84 (1): 45-67. https://doi.org/https://doi.org/10.1890/13-0133.1.

      b. The standardization (rarefaction/extrapolation) of EFN or richness for # individual trees sampled needs to be made for all analyses that make claims to compare diversity metrics across cities or between groups like urban and park areas (i.e. Fig 2a,b,c; Fig 3b; Fig 5a,b, S1a, S2a, S5, Table S2)

      c. If the authors have an argument for why diversity/area or diversity/sampling effort relationships do not apply for a particular question, then they should make that case instead.

      We very much appreciate this suggestion. Indeed, as described above, we applied Chao’s method to all of our analyses.
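      To illustrate what standardising effective species number to a common number of individuals means, here is a minimal sketch. It is not Chao's analytic estimator and not the iNEXT package: it simply averages repeated random subsamples, and it assumes effective species number is the exponential of Shannon entropy (Hill number of order 1). The function names and parameters are hypothetical.

```python
import math
import random
from collections import Counter


def effective_species(labels):
    """Hill number of order 1: exp of Shannon entropy of species labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    shannon = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(shannon)


def rarefied_effective_species(labels, n_individuals, n_draws=100, seed=0):
    """Mean effective species number over random subsamples of fixed size.

    Simulation-based stand-in for rarefaction; assumes n_individuals
    does not exceed the number of trees recorded for the city.
    """
    if n_individuals > len(labels):
        raise ValueError("cannot rarefy above the observed sample size")
    rng = random.Random(seed)
    values = [
        effective_species(rng.sample(labels, n_individuals))
        for _ in range(n_draws)
    ]
    return sum(values) / len(values)


# Hypothetical city with two equally common species, compared at 20 trees.
city = ["oak"] * 50 + ["elm"] * 50
standardised = rarefied_effective_species(city, n_individuals=20)
```

      Comparing cities at the same `n_individuals` removes the dependence of the metric on how many trees each city happened to record.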

      D) The question posed by the Beta diversity analysis is fascinating (i.e. is it non-native species that are driving biotic homogenization across species?). However, while frequency (which I assume is relative abundance, but maybe it is incidence data; please define) is used to deal with different sample sizes, consider whether it makes sense to include incomplete or very small city datasets in the analysis even with frequency data. For example, one city only has ~720 trees listed. If this is an incomplete dataset, which seems likely, it will probably be much more differentiated (overlap less) from another city with small numbers simply due to incomplete sampling. Diversity analysis in cities always requires tradeoffs and cannot be identical to methods used in "natural" forested ecosystems, but I encourage the authors to explore this a bit. Perhaps a sensitivity analysis could help where incomplete or small sample sizes are dropped or datasets are resampled via random draw to equalize sizes? The latter would handle incomplete samples but would not deal with bias in which neighborhoods were sampled (see point B above).

      Great suggestion. We redid this analysis using a random-draw approach, as you suggested, to equalize sizes. The new analysis found the same results as our old analysis, with slightly different values. The new method is described here:

      ○ “How similar are species compositions across cities? For N = 1953 city-city comparisons of street tree communities, we could calculate weighted measures of similarity because we had frequency data. We calculated similarity scores for the entire tree population, the naturally-occurring trees only, and the introduced trees only. We used chi-square distance metrics on species frequency data, and we controlled for different population sizes (and potentially, sampling efforts) between cities by sub-sampling the larger city 50 times to match the smaller city’s tree population size and calculating average metrics. In this manner we controlled for differences in sample size.”
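      The subsampling procedure quoted above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the exact chi-square distance formula is not given in the text, so a common symmetric form on relative frequencies is used, and the function names are hypothetical.

```python
import random
from collections import Counter


def relative_freq(labels):
    """Map each species to its share of the trees in one city."""
    n = len(labels)
    return {species: c / n for species, c in Counter(labels).items()}


def chi_square_distance(p, q):
    """Symmetric chi-square distance: 0 for identical, 1 for disjoint profiles.

    Assumed form: 0.5 * sum((p_i - q_i)^2 / (p_i + q_i)).
    """
    species = set(p) | set(q)
    return 0.5 * sum(
        (p.get(s, 0.0) - q.get(s, 0.0)) ** 2 / (p.get(s, 0.0) + q.get(s, 0.0))
        for s in species
    )


def subsampled_distance(city_a, city_b, n_draws=50, seed=0):
    """Average distance after drawing the larger city down to the smaller size."""
    rng = random.Random(seed)
    small, large = sorted((city_a, city_b), key=len)
    p_small = relative_freq(small)
    distances = [
        chi_square_distance(p_small, relative_freq(rng.sample(large, len(small))))
        for _ in range(n_draws)
    ]
    return sum(distances) / n_draws


# Hypothetical cities of unequal size.
same_composition = subsampled_distance(["oak"] * 10, ["oak"] * 100)
no_shared_species = subsampled_distance(["oak"] * 10, ["elm"] * 30)
```

      Drawing the larger city down to the smaller city's size before each comparison is what keeps a difference in tree counts from being read as a difference in composition.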

      E) Additional context/conceptual underpinning the clustering analysis would be great.

      a. The authors state in Line 390-395:"For city trees, which are often organized along grids or the underlying street layout of a city, this method can more meaningfully cluster trees than merely calculating the meters between trees and identifying nearest neighbors (which may be close as the crow flies but separated from each other by tall buildings)."- I very much agree with this sentiment and it is biologically meaningful for animal and plant dispersal, but as written it is unclear to me how the method described in the text "knows" that a tall building or elevation or some sort of feature exists to separate clusters rather than empty space or a ball field. Please clarify.

      We appreciate these comments, and we have added text and references for the interested reader. Here is the new description in full:

      ○ “We wanted to quantify the degree to which trees were spatially clustered by species within a city (rather than randomly arranged). To do so, we first clustered all trees within each city using hierarchical density based spatial clustering through the hdbscan library in Python (McInnes et al., 2017). HDBSCAN, unlike typical methods such as “k nearest neighbors”, takes into account the underlying spatial structure of the dataset and allows the user to modify parameters in order to find biologically meaningful clusters. For city trees, which are often organized along grids or the underlying street layout of a city, this method can more meaningfully cluster trees than merely calculating the meters between trees and identifying nearest neighbors (which may be close as the crow flies but separated from each other by tall buildings). In particular, using the Manhattan metric rather than Euclidean metrics improves clustering analysis in cities (which tend to be organized along city blocks). For further discussion of why hdbscan is preferable to other clustering metrics, see (Berba, 2020; Leland McInnes et al., 2016; McInnes et al., 2017).”
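      The effect of choosing the Manhattan metric can be seen in a small example. This sketch does not run HDBSCAN; it only shows how the nearest neighbour of a tree can change between the two metrics. The coordinates are hypothetical and assume a street grid aligned with the coordinate axes.

```python
import math


def euclidean(a, b):
    """Straight-line ("as the crow flies") distance."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def manhattan(a, b):
    """Distance travelled along axis-aligned streets."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest(origin, candidates, metric):
    """Name of the candidate closest to origin under the given metric."""
    return min(candidates, key=lambda name: metric(origin, candidates[name]))


# Hypothetical trees, coordinates in metres.
tree_a = (0, 0)
others = {
    "B": (70, 70),   # diagonally across a city block
    "C": (110, 0),   # further along the same street
}

nearest_euclidean = nearest(tree_a, others, euclidean)
nearest_manhattan = nearest(tree_a, others, manhattan)
```

      Here tree B is closest in a straight line (about 99 m versus 110 m), but tree C is closest along the streets (110 m versus 140 m), which is the kind of difference the Manhattan metric is meant to capture. In the hdbscan library the distance metric is chosen through the `metric` argument of the clusterer.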

      b. Would you ever expect composition to be truly random either in a city or a natural forest given environmental conditions etc.? In some sense, the ones closest to random are the most surprising. Can you dive into one to give an example of what is going on in that city?

      c. It seems like there are two metrics here- the size of the cluster and then the observed/expected EFN per cluster. The latter is analyzed in this paper but is there any important information in the former? It seems like an interesting structural measurement of the city and possibly useful in its own right.

      d. Are there any target levels of randomness? Could the authors suggest how this might be determined moving forward with their datasets to illustrate this for foresters?

      Great points. We have given a lot of thought to your comments: these are large and interesting questions! In the end, we think these questions fall mostly beyond the scope of this study, but we added a substantial amount of text to address your comments:

      ○ “Clustering by species is not necessarily a negative, nor indeed should we necessarily expect trees to be randomly arranged (see suggestions for further research in “Future Analyses” section below). Here, we take a first step toward making spatial clustering a metric of interest in city tree planning.”

      ○ “Researchers could also use this dataset to perform more refined analysis of clustering. For example, what is the biological significance of variation in cluster size (as determined by the hdbscan clustering algorithms)? The size and arrangement of the clusters themselves may be useful metrics. How clustered should we expect trees to be in both wild and urban settings? That is, what are our null expectations? Further, researchers could apply network theory to predict how pest species would proliferate through each of these cities (depending on the spatial arrangement of pest-sensitive trees).”

      F) The statement that this dataset enables "the design of rich heterogenous ecosystems built around urban forests" (Ln 72) seems strange. To my mind this tool will enable a more nuanced evaluation of the urban forests that already exist and suggest ways to target future plantings for increased resilience to climate, pest resistance, biodiversity support etc. I don't understand what ecosystem you would build around and not in the urban forest. If this is what is meant please elaborate. For example, do you mean non-tree installations?

      We agree with you and have changed the text as follows:

      ○ “With these tools, we may evaluate existing city tree communities with more nuance and design future plantings to maximize resistance to pests and climate change. We depend on city trees.”

    1. Author Response

      Reviewer #2 (Public Review):

      According to the authors, the goal is to identify a method to study changes in hospital presentation and outcomes of new COVID-19 variants using publicly available population-level data on variant relative frequency to infer the SARS-CoV-2 variants likely responsible for clinical cases. This would assist in answering questions asked by public health authorities as to differences in disease severity and risk factors and vaccine protection.

      The authors use patients' data collected prospectively in 30 countries to compare the pre-Omicron period (Omicron variant less than 10% of SARS-CoV-2 variants) with the Omicron period (Omicron variant prevalence >90% of circulating variants). The following factors are analyzed and adjusted for: age/gender, symptoms, comorbidities, vaccination, and outcomes during the pre-Omicron and Omicron periods.

      Their model shows that, overall, patients were younger and had fewer symptoms, and that the mortality rate was lower in the Omicron period (even if this is not reflected in some country reports). No conclusion can be made on vaccination status.

      Major weaknesses and strengths:

      1) The study is presented as a multi-center international study that includes more than 100,000 patients from 30 countries, however, 96.6% of the study patients originated from 2 countries, South Africa (54%) and the United Kingdom (42.6%) (and the relative contribution of South Africa to the study data was hugely different in the 2 study periods, pre-Omicron and Omicron period).

      The huge imbalance in the number of patients recruited by center could create many biases in data interpretation. For example, some countries do not report any increase in patients aged less than 12 years old in the Omicron period. Country-specific medians suggest that the younger age of patients after the Omicron variant experience in the combined dataset is at least partially explained by an increase of data contributed by South Africa, relative to the proportion of data contributed by other countries. In total only 11 countries contributed data on more than 100 hospitalized cases.

      The differences in study data contribution between countries, with more than 90% of all records being from the United Kingdom and South Africa, required both an adapted analytical approach, that transparently presented country-level data rather than only aggregated estimates, and careful discussion of our findings. Indeed, we agree with the reviewer that this imbalance in country-level data contribution and the varying contribution of some countries to the two study periods could lead to erroneous inferences if ignored (i.e. if only aggregated results were reported); for this reason, we presented country-specific data in the Results section. In our descriptive analyses, to achieve this goal without jeopardising intelligibility, we present findings for a subset of countries, those with at least 50 observations per study period; note that this criterion was modified based on another comment from this reviewer. This approach also addresses the reviewer’s concern, which we share, that the varying relative contribution of different countries to study periods could lead to spurious aggregated patterns. In fact, we highlight this problem in the following paragraph of the Results section:

      “The median (IQR) ages of patients during the pre-Omicron and Omicron periods were 62 (43 – 76) and 50 (30 – 72) years, respectively; however, country-specific medians suggest that the younger age of patients after Omicron variant emergence in the combined dataset is at least partially explained by an increase in the proportion of data contributed by South Africa, relative to the proportion of data contributed by other countries (Table S6).”
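      The composition effect described in this paragraph can be illustrated with a toy example. The ages and country labels below are entirely hypothetical and are not study data; the sketch only shows that a pooled median can fall between periods even when every country-specific median is unchanged, purely because the mix of contributing countries shifts.

```python
from statistics import median


def pooled_median(ages_by_country):
    """Median age over all patients, ignoring which country they came from."""
    return median(age for ages in ages_by_country.values() for age in ages)


def country_medians(ages_by_country):
    """Median age within each country."""
    return {country: median(ages) for country, ages in ages_by_country.items()}


# Hypothetical age profiles: country A admits older patients than country B.
older = [60, 65, 70, 75, 80]
younger = [25, 30, 35, 40, 45]

# Period 1 is dominated by country A, period 2 by country B.
period_1 = {"A": older * 4, "B": younger * 1}
period_2 = {"A": older * 1, "B": younger * 4}

pooled_1 = pooled_median(period_1)
pooled_2 = pooled_median(period_2)
medians_1 = country_medians(period_1)
medians_2 = country_medians(period_2)
```

      In this toy case the pooled median drops from 65 to 40 while both country-specific medians stay at 70 and 35, which is why country-level estimates are reported alongside the aggregated ones.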

      Recruitment of patients is unclear. We don't really know which patients are selected to be part of the study. The authors mention the use of the ISARIC (International Severe Acute Respiratory and Emerging Infections Consortium) COVID-19 database (l. 173). This would imply that patients with severe respiratory symptomatic COVID-19 are recruited in the study. It could explain why patients recruited from Brazil or the Netherlands have the same proportion of patients presenting with shortness of breath in the pre- and Omicron period.

      Due to the time-sensitivity and scale of this work, involving hundreds of investigators in 30 countries, although the study only included hospitalised patients with SARS-CoV-2 infection, the approach used for patient recruitment in each institution was defined by local investigators. Whilst the sampling strategy was not uniform across sites, one should keep in mind that: (i) recommendations on sampling strategy were shared with local investigators; and (ii) most of the partner institutions involved in this work had previously contributed data to the ISARIC platform and are experienced in patient recruitment and clinical and epidemiological research.

      More generally, recruitment approaches could influence the interpretation of our findings in two ways: by reducing the representativeness of the study population in each country; and by inducing bias that could affect the association of interest (the association between study period and fatality risk). Regarding the former, it is possible that in some countries hospitals contributing to this effort admitted patients with more severe disease compared to the local population of COVID-19 hospitalised patients, the target population. Regarding the second potential problem, bias, hospital-based studies might suffer from collider bias, where both the exposure of interest and the outcome directly influence recruitment (selection) to the study or are associated with selection or recruitment through confounders; this is a well-described problem in hospital-based studies that assess COVID-19 outcomes (see Griffith et al. Collider bias undermines our understanding of COVID-19 disease risk and severity. Nature Communications 2020. for a discussion on how different COVID-19 clinical factors can induce bias when different sampling frames are used). Note that collider bias is not the only mechanism of selection bias affecting effect measures; as explained by Miguel Hernán (in Invited Commentary: Selection Bias Without Colliders. American Journal of Epidemiology 2017) between-exposure stratum heterogeneity in the association between the outcome and selection could bias the association between the exposure and the outcome (relative to the effect measure in the target population). However, recruitment approaches used by partner institutions are unlikely to have systematically changed during the study period, and we are unaware of evidence suggesting any association that might have existed between recruitment procedure and outcome differed in the two study periods for most, or indeed some, partner institutions.

      We have now modified the Discussion section to highlight this potential weakness of our study:

      “Another weakness of our study is that recruitment procedure was not standardised and was defined locally. Whilst this likely affected the generalisability of our descriptive estimates (fatality risk and frequencies of symptoms and comorbidities) to local populations of hospitalised COVID-19 cases (Lash and Rothman, Selection Bias and Generalizability. in Modern Epidemiology 4th Edition 2021; Rothman et al. Why representativeness should be avoided. International Journal of Epidemiology 2013), it might not have affected the association between study period and fatality risk, at least not beyond the well-described potential for collider bias in hospital-based studies on COVID-19 outcomes (Griffith et al. Collider bias undermines our understanding of COVID-19 disease risk and severity. Nature Communications 2020).”

      In Nepal, patients were more often recruited from critical care setting (l.572).

      However, the authors mention elsewhere that patients recruited for the study were:

      • Omicron variant infections in hospitalised patients (l. 161),

      • Patients with confirmed or suspected COVID-19 (l.183),

      • "some patients were admitted for a medical condition other than covid19 but tested incidentally during hospitalization (l.243)"

      • In some countries, information on whether covid-19 was the main reason for hospitalization was also collected. 69.0% of patients admitted during the omicron periods were admitted due to covid-19, patients for whom this information was available were primarily from South Africa (94.9%), (L.310)

      • For 35.5% of patients admitted to hospital date of symptoms onset was missing and it was assumed that these were not hospital acquired infections (l.233)

      • Information on whether covid-19 was the main reason for hospitalization was collected during the study period and suggests that, for a non-negligible proportion of patients, other clinical conditions might have prompted hospitalization.

      • In their discussion the authors state that "Finally it is also possible that the question on the primary reason for hospitalization might have been interpreted differently in different countries and even in different hospitals in the same country." In the few clinical studies from United Kingdom and South Africa 40% to 70% of admissions were qualified as "incidental" COVID-19.

      This comment relates to the previous comment and to the sampling strategy used in the study. Please, see our response to the previous comment.

      Regarding incidental infections, we have now included information on recent studies (Klann et al. Distinguishing Admissions Specifically for COVID-19 From Incidental SARS-CoV-2 Admissions: National Retrospective Electronic Health Record Study. J Med Internet Res; Voor in ’t holt et al. Admissions to a large tertiary care hospital and Omicron BA.1 and BA.2 SARS-CoV-2 polymerase chain reaction positivity: primary, contributing, or incidental COVID-19. International Journal of Infectious Diseases 2022).

      “One possible explanation for this finding would be if incidental SARS-CoV-2 infections, i.e. infections that were not the primary reason for hospitalisation, were more frequent during the Omicron period; the high transmissibility of this variant, and the consequent peaks in numbers of infections, together with its reported association with lower severity, provides support for this hypothesis. However, in the subset of patients with data on the reason for hospitalisation there was no increase in the proportion of admissions thought to be incidental infections and indeed proportions in both study periods were consistent with frequencies of incidental infections in recent studies in the United States (Klann et al. Distinguishing Admissions Specifically for COVID-19 From Incidental SARS-CoV-2 Admissions: National Retrospective Electronic Health Record Study. J Med Internet Res) and the Netherlands (Voor in ’t holt et al. Admissions to a large tertiary care hospital and Omicron BA.1 and BA.2 SARS-CoV-2 polymerase chain reaction positivity: primary, contributing, or incidental COVID-19. International Journal of Infectious Diseases 2022), although in the latter, non-incidental infections included patients for whom COVID-19 was a contributing but not the main cause of hospitalisation.”

      Absence of data standardization.

      There doesn't seem to be standardized questionnaires across all countries. Some countries do not report on symptoms, others do not report on vaccination status. In total, it seems that less than a third of patients have full data (symptoms, co-morbidities, vaccination, and outcome), and such patients are reported by few countries.

      South Africa (which represents 54% of patients) didn't systematically report on symptoms. Hence the data shown for symptoms might mainly reflect, by volume, the United Kingdom patients. In the United Kingdom the vaccination rate during the Omicron period was 70.3%, compared with 27.9% for South Africa. The authors find that patients with the Omicron variant display fewer symptoms (which confirms previous findings); however, it would have been just as plausible for patients from South Africa, being less vaccinated, to exhibit more symptoms.

      Analysis for each group of data is based on different patients' group according to the data available for such group.

      Data from South Africa used in this analysis are part of the DATCOV national hospital surveillance database. The case report form (CRF) used by the National Institute for Communicable Diseases in South Africa was adapted from the ISARIC CRF; although most sections of that CRF were used for the data collection in the country, information on symptoms was not systematically collected. However, as mentioned above, in our analysis, we also report country-level frequencies of symptoms, rather than only presenting aggregated estimates. We agree with the reviewer that we cannot exclude the possibility that in South Africa a different pattern occurred. Based on this comment, we have now included the following statement in the Discussion section:

      “Finally, missing information on symptoms for patients from South Africa prevented our descriptive analysis of changes in clinical presentation in an African setting.”

      Vaccination data.

      Vaccination data are available for less than 50% of the patients and there is considerable inter-country variation in vaccination rates, as we know but also in the recruitment of patients for the study.

      As an example, Table 1 shows the vaccination status by country and study period for 24 countries: Brazil has a vaccination rate of 84.6% and India of 34.8% but on respectively 13 and 23 observations. There are less than 30 observations in 19 countries for pre omicron and less than 30 observations in 15 countries for the omicron period. No conclusion can be made.

      Our study was not designed to assess vaccine effectiveness against the Omicron and non-Omicron variants as controls (e.g. patients hospitalised with respiratory infection caused by pathogens other than SARS-CoV-2) were not recruited. Whilst we descriptively report the frequency of previous vaccination by country and age groups (see Figure S3 in the Supplementary Appendix, with numbers of records in each category presented for transparency), the primary objective in using vaccination data was to control confounding by this factor. The point made by the reviewer, that missing data on vaccination reduced sample size for this comparison, is valid and we have included the following statement in the Discussion section:

      “We also observed that history of COVID-19 vaccination was more frequent during the Omicron period, although for most countries the number of patients with vaccination information was limited, especially after stratification by age. Whilst this pattern would be expected if current vaccines were less effective against the Omicron variant compared to previously circulating variants, as suggested by a recent study in England analysing symptomatic disease, there were changes in vaccination coverage in many settings during the second half of 2021 and early 2022, including in response to the reports of Omicron variant cases. Since non-COVID-19 patients (e.g., patients with respiratory infections caused by other pathogens) were not systematically recruited for this multi-country study, it is not possible to estimate vaccine effectiveness during the two study periods and assess its change.”

      Major findings of the study:

      Major findings of the study match previous individual-based reports:

      1) In many settings, patients hospitalized with Omicron less often presented with commonly reported symptoms compared to patients infected with pre-Omicron variants.

      2) In a mixed-effects logistic model on 14-day fatality risk that adjusted for sex, age categories and vaccination status, hospitalization during the Omicron period was associated with lower risk of death. Similar results were obtained when using 28-day fatality risk and when excluding patients who reported being admitted to hospital due to a medical condition other than covid-19.

      3) History of COVID-19 vaccination was more frequent during the Omicron period, but the authors cannot make any conclusion on vaccine effectiveness

      How to interpret these data? The impact in terms of disease severity of new variants has been shown to be context specific due to regional differences in terms of variability of previous exposure, vaccinations rates and population comorbidity level frequency. As a result of recruitment bias and small recruitment in some countries, several countries have different findings described that do not fit with the conclusions.

      As mentioned by the authors, the strength of the project is to have succeeded in engaging so many countries to work together which could definitely assist in the future in understanding new variants characteristics shared globally and identify country specific impact on these variants according to the history of previous variant exposure, vaccine coverage, population morbidity and access to health.

      Reviewer #3 (Public Review):

      The authors combine outcomes data from patients hospitalised with COVID-19 across 30 countries to investigate differences in likelihood of death from the Omicron variant vs pre-Omicron variants. Data are from the ISARIC COVID-19 database; variant status is inferred from country-specific GISAID data. The principal finding is a 36% reduced risk of 14-day death in the Omicron period (OR 0.64 (0.59 - 0.69)) compared with the pre-Omicron period, after multiple adjustment.
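      For readers who want to see how the 36% figure relates to the reported odds ratio, a minimal sketch follows. The percentage is a reduction in odds, which approximates a reduction in risk only when the outcome is uncommon. The conversion to a risk ratio uses a standard formula and a baseline risk of 20% that is purely hypothetical, not a value from the study.

```python
def percent_reduction_in_odds(odds_ratio):
    """Percentage by which the odds are lower when the odds ratio is below 1."""
    return (1 - odds_ratio) * 100


def risk_ratio_from_odds_ratio(odds_ratio, baseline_risk):
    """Approximate risk ratio implied by an odds ratio at a given baseline risk.

    Uses RR = OR / (1 - p0 + p0 * OR), where p0 is the risk in the
    reference group. baseline_risk here is an illustrative assumption.
    """
    return odds_ratio / (1 - baseline_risk + baseline_risk * odds_ratio)


# Point estimate and confidence limits quoted in the review.
point = percent_reduction_in_odds(0.64)
lower = percent_reduction_in_odds(0.69)
upper = percent_reduction_in_odds(0.59)

# Hypothetical 20% baseline fatality risk, for illustration only.
implied_rr = risk_ratio_from_odds_ratio(0.64, baseline_risk=0.20)
```

      With a non-trivial baseline risk the implied risk ratio sits closer to 1 than the odds ratio does, so the reduction in risk is somewhat smaller than the reduction in odds.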

      The strengths of this paper are the large N and large number of participating countries from different regions, and also the careful and thorough analytical approaches. The main findings are stress-tested through a range of sensitivity analyses using different variant-dominance thresholds and statistical approaches and found to be robust. The figures are clear, well-chosen and easily interpretable.

      The principal weaknesses, as acknowledged in the discussion, are the imbalance in the data sources (96.6% of the observations came from GBR or SA), and the lack of fidelity of data on vaccination (vaccination status is limited to a binary 'one or more vaccinations received Y/N' variable). This latter means that conclusions about the innate severity of Omicron vs pre-Omicron variants cannot be drawn.

      Nonetheless the findings represent a useful contribution to the literature on the severity of COVID-19 variants, and the approach establishes a template for rapid international collaboration, using GISAID data to infer variant status, that will be useful for formulating policy in response to new variants in the future.

      The limited data on timing of vaccination and number of previous doses imply that residual confounding could partially explain the observed association; we mention this limitation in the Discussion section. Although our data alone cannot provide sufficient evidence for differences in innate severity between variants, mechanistic studies (see Shuai et al. Attenuated replication and pathogenicity of SARS-CoV-2 B.1.1.529 Omicron. Nature 2022, and Halfmann et al. SARS-CoV-2 Omicron virus causes attenuated disease in mice and hamsters. Nature 2022) suggest the Omicron variant might be less virulent. We modified the following paragraph in the Discussion section:

      “All these factors might have contributed to the observed association, possibly to different degrees in different countries, reason for which this result should not be assumed to necessarily relate to the differences in variant virulence previously suggested by mechanistic studies (Shuai et al. Attenuated replication and pathogenicity of SARS-CoV-2 B.1.1.529 Omicron. Nature 2022; Halfmann et al. SARS-CoV-2 Omicron virus causes attenuated disease in mice and hamsters. Nature 2022).”

  2. Aug 2022
    1. Author Response

      Reviewer #2 (Public Review):

      The time-dependency of the model simulations was not analyzed, and the nature of the observed biphasic time-dependent APAP response remains elusive. It would be interesting to see how the model can explain the time course of the APAP stimulation experiment.

      The alternative model in its current state can only describe steady-state conditions. We understand that the reviewer is interested in the dynamic behavior of the model; however, our approach provides a proof of principle that the alternative model can phenomenologically explain the changes of YAP localization as a response to APAP treatment. The question of how to model the Hippo pathway in a time-dependent manner as a response to APAP treatment is very challenging and would require further investigations and, most notably, further development of the PDE simulation algorithms and the SME software. Hence, a technical update of the software algorithms would be required, which is beyond the scope of this manuscript.

      Nevertheless, we decided to share our first and preliminary analyses on dynamic processes caused by APAP with the reviewer. For this, we simulated the steady state model in an arbitrary manner, where APAP initiates (early time-point) and slows down (late time-points) YAP phosphorylation in the nucleus (see Figure below).

      The simulated alternative model shows that increasing YAP phosphorylation by about 50% leads to the cytoplasmic localization of YAP (Rebuttal Figure R5A/B). However, this shuttling is not detectable in our protein fractionation and live-cell imaging experiments (see also Rebuttal Figure R7C/D). At late time points, decreasing YAP phosphorylation (by about 60%) led to a clear nuclear enrichment, and dephosphorylation of YAP was observed in our experiments. Thus, our mathematical model nicely describes cellular events of Hippo pathway dynamics observed at later stages after APAP treatment (nuclear enrichment). However, early events cannot be completely explained (the suggested nuclear YAP exclusion is not detectable).

      We suggest two explanations for this observation. First, other molecular mechanisms (not yet identified and therefore not part of the model topology) oppose the nuclear exclusion of YAP that is expected at early time points. Second, the detection methods used in this study (Western blotting and live-cell imaging) cannot capture minimal changes and cellular heterogeneity in the chosen experimental setup. We clarify this aspect/limitation of our study in the discussion chapter of the manuscript (page 12, lines 436-440).

      Time-dependency of YAP (orange) localization based on the simulated APAP treatment. (A): Simulated control (ctrl) and APAP treatment for 2 and 48h. The treatment was simulated by changing the phosphorylation coefficient of YAP in the nucleus. (B): Simulated pYAP/YAP ratio during control and APAP treatment for 2 and 48 hours at the steady state of the model. (C): Simulated NCR of the total YAP during control and APAP treatment for 2 and 48 hours at the steady state.

    1. Author Response

      Reviewer #1 (Public Review):

      This study is a follow-up to the previous work by the authors in establishing a surprising role for the presynaptic adhesion molecules, neurexin (Nrxn) variants containing the SS4+ splice site, in differentially controlling postsynaptic NMDA and AMPA receptors by forming links through a shared system of extracellular cerebellins (Cbln) and postsynaptic GluD1. Here the authors show at CA1 to subiculum synapses that the role for Cbln2 in mediating the effects of Nrxn1-SS4+ and Nrxn3-SS4+ in enhancing NMDAR and suppressing AMPAR, respectively, is redundant with that of Cbln1. Moreover, Cblns do not appear to play a role in synapse formation. Dai and colleagues also extend their previous work by highlighting the common function for the Nrxn-Cbln signaling system across different synapses, albeit with subtle differences, and point to a lack of a role for Nrxn-Cbln signaling in morphological synapse development. Overall the data are solid, while the key findings are mostly incremental, and the basis for the selectivity in the observed differential regulation of AMPARs and NMDARs via the same trans-synaptic link through Cblns at various types of synapses remains to be clarified. Importantly, the authors make a definitive conclusion concerning the lack of a role for Nrxn-Cbln signaling complexes in synapse formation during development. Nevertheless, this is a contentious issue, and as such, the conclusions could be more compellingly supported with further experiments.

      We appreciate the reviewer’s positive assessment of our study.

      Reviewer #2 (Public Review):

      In this manuscript Dai et al. investigated the role of Nrxn-Cbln complexes in regulating AMPA- and NMDA- receptor function in different brain regions. Using a combination of genetic manipulations, together with electrophysiological and biochemical assays, the authors showed that, at CA1-subiculum synapses, Cbln2 regulates NMDA- and AMPA- receptors via Nrxn1SS4+ -Cbln2 and Nrxn3SS4+-Cbln2 signaling complexes, respectively. In the prefrontal cortex, only Nrxn1SS4+-Cbln2 signaling-dependent regulation of NMDA receptors occurs, while in the cerebellum, only Nrxn3SS4+-Cbln1 signaling-dependent regulation of AMPA receptor occurs. This systematic investigation of the function of different Neurexin-Cerebellin signaling complexes contributes to our understanding of how different members of the same family, in combination pairs, regulate synaptic transmission with circuit specificity. This work adds to the authors' systemic investigation of molecular mechanisms regulating synaptogenesis, synaptic transmission and synaptic plasticity.

      We thank the reviewer for the positive and astute comments.

      Some suggestions for clarifications:

      1) Regarding expression of Cbln1 in the subiculum, in lines 271-273, the authors stated that "in these and earlier experiments we only studied Cbln2, but quantifications show that Cbln1 is also expressed in the subiculum, albeit at lower levels Figure S3)." However, Figure S3 does not include any quantifications, and the example image does not show visible Cbln1 expression. Thus, the above-mentioned statement is inconsistent with the data presented. Please revise. If the authors would like to keep the statement about quantifications of Cbln1, then quantification should be provided for all panels of this Figure, in order to give the readers some ideas about relative expression levels.

      We agree, and have addressed this issue as described above (introductory point 4).

      2) Does Cbln4, which is also broadly expressed in the brain, play a role in regulating AMPA- and NMDA-receptors at the synapses investigated? Does Cbln3 contribute to regulation of synaptic transmission in the cerebellum? Please discuss.

      Cbln4 is not expressed in the subiculum, but is expressed in the PFC. Specifically, Cbln1, Cbln2, and Cbln4 are broadly expressed in brain, whereas Cbln3 is restricted to cerebellar granule cells and requires Cbln1 or Cbln2 for secretion (Bao et al., 2006; Miura et al., 2006). Remarkably, Cbln1, Cbln2, and Cbln4 are not uniformly expressed in all neurons, but synthesized in restricted subsets of neurons (Seigneur and Südhof, 2017). For example, cerebellar granule cells express high levels of Cbln1 but only modest levels of Cbln2, excitatory entorhinal cortex (EC) neurons express predominantly Cbln4, and neurons in the medial habenula (mHb) express Cbln2 or Cbln4 (Seigneur and Südhof, 2017).

      Cbln4 is poorly studied, and Cbln3 has not been functionally studied at all. To the best of our knowledge, there are only four studies on Cbln4 function, three of which are from our lab. The Seigneur & Südhof (2018) paper showed that the deletion of Cbln4 in a large number of brain regions caused no change in excitatory or inhibitory synapse numbers. Subsequently, the Seigneur et al. (2018) paper demonstrated that genetic deletion of Cbln4 in the mHb had no major effect on synapse numbers, but because of the limits of this preparation (synaptic transmission is hard to monitor in the mHb), no detailed synaptic studies were done. The Fossati et al. (2019) paper in Neuron shows that Cbln4 regulates inhibitory synapse numbers in the cortex by binding to GluD1, but this study depended on RNAi, not genetic manipulations. Its results are puzzling because structural biology studies have shown that Cbln4 does not bind to GluD2, which is highly homologous to GluD1 and has the same function as GluD1. Instead of binding to GluD’s, Cbln4 binds to another class of receptors, Neogenin-1 and DCC, making the Fossati et al. (2019) paper difficult to interpret. The Liakath-Ali et al. (2022) paper, finally, demonstrated that deletion of Cbln4 in the EC or deletion of Neo1 in the dentate gyrus (DG) blocks long-term potentiation at EC→DG synapses but does not change basal synaptic transmission or synapse numbers, again consistent with the notion that Cbln4 regulates synapse properties similar to Cbln1 and Cbln2.

      We have now described these studies in the introduction to the paper. Many synaptic proteins are associated with contentious studies in the literature, and we completely concur that it is essential to evenly discuss the issues in detail, even if this expands the size of a paper.

      Reviewer #3 (Public Review):

      In this study, Dai and colleagues used genetic models combined with electrophysiological recordings and behavior as well as immunostaining and immunoblotting to investigate the role of trans-synaptic complexes involving presynaptic neurexins and cerebellins in shaping the function of central synapses. The study extends previous findings from the same authors as well as other groups showing an important role of these complexes in regulating the function of central synapses. Here, the authors sought to achieve two main objectives: (1) investigating whether their previous findings obtained at mature CA1→subiculum synapses (Aoto et al., 2013; Dai et al., Neuron 2019; Dai et al., Nature 2021) extend to different synapse subtypes in the subiculum as well as to other central synapses including cortical and cerebellar synapses and (2) investigating whether Nrxn-Cbln-GluD trans-synaptic complexes play a role in synapse formation as previously proposed by other groups.

      Overall, the study provides interesting and solid electrophysiological data showing that different Nrxns and Cblns assemble trans-synaptic complexes that differently regulate AMPAR- and NMDAR-mediated synaptic transmission across distinct synaptic circuits (most likely through binding to postsynaptic GluD receptors).

      We appreciate the reviewer’s accurate and positive assessment of our study.

      However, the study has several important weaknesses:

      1) The novelty of the findings appears limited. Indeed, previous studies from the same authors with similar experimental paradigms and readouts already demonstrated the role of Nrxn-Cbln-GluD complexes in regulating AMPARs versus NMDARs in mature neurons (Aoto et al., Cell 2013; Dai et al., Neuron 2019; Dai et al., Nature 2021). Moreover, the absence of a role of Cblns and GluD receptors in synapse formation was already suggested in previous studies from the same authors (Seigneur and Südhof, J Neurosci 2018; Seigneur et al., PNAS 2018; Dai et al., Nature 2021).

      Not surprisingly, we disagree with this comment. We do concur that our data are consistent with previous studies, but believe that this reproducibility is a strength since so many data in the literature are irreproducible.

      We do not agree, however, that our findings lack novelty. The novelty is admittedly limited (after all, we like to be consistent), but our paper is the first to demonstrate that the Nrxn1/Cbln/GluD and Nrxn3/Cbln/GluD complexes are differentially active in different synapses, with the subiculum synapses having both, the mPFC synapses only the former, and the cerebellum only the latter. This is a very important innovation that illustrates the power of the Nrxn/Cbln/GluD signaling complex in shaping synapses. In addition, our paper is the first to analyze a possible developmental function of Cbln2 in depth, to analyze its differential role at the two dominant types of pyramidal neurons in the subiculum, regular- and burst-spiking neurons, to analyze conditional deletions of Cbln1 in the adult cerebellum, and to directly measure the effect of Cbln2 deletions in the PFC. Especially in view of the recent Nature paper that concluded that Cbln2 regulates spine numbers in the PFC, these findings are highly relevant.

      2) The conclusion made by the authors that the Nrxn-Cbln-GluD trans-synaptic complexes do not play a role in synapse formation/development is not sufficiently supported by their data, while previous studies suggest the opposite. Actually, this conclusion is essentially based on the two following measurements taken as a 'proxy' for synapse density: (1) 'the average vGluT1 intensity calculated from the entire area of subiculum' and (2) the 'synaptic proteins levels' assessed by immunoblotting. None of these measurements (only performed in the subiculum) allow to precisely assess synapse density on the neurons of interest. While the average vGluT1 intensity over large fields of view does not directly reflect the density of synapses and does not take into account the postsynaptic compartment, the immunoblotting data only reflects the overall expression of synaptic proteins without discriminating between intracellular, surface and synaptic pools and between cell types. In the subiculum from Cbln1+2 KO mice, the authors performed mEPSCs recordings and found an increase in frequency. However, this increase may reflect the unsilencing and/or potentiation of AMPAR-EPSCs above the detection threshold, irrespectively of the actual synapse number. Finally, the decrease in NMDAR-EPSCs is not discussed by the authors while it could actually reflect a decrease in synapse number.

      We agree that additional data on synapse numbers are helpful for our paper. We have now performed these studies as described in detail in our response to introductory point 1 above. However, we would also like to refer to the already existing body of evidence on the role of neurexin-based complexes in synapse numbers. We have shown in papers published over the last two decades that deletions of individual neurexins or of multiple neurexins, as well as blocking cerebellin binding to neurexins by ablating splicing site #4 (SS4) in neurexins, have NO effect on synapse numbers. The most important of these papers are:

      1. Missler, M., Zhang, W., Rohlmann, A., Kattenstroth, G., Hammer, R.E., Gottmann, K., and Südhof, T.C. (2003) α-Neurexins Couple Ca2+-Channels to Synaptic Vesicle Exocytosis. Nature 423, 939-948.
      2. Kattenstroth, G., Tantalaki, E., Südhof, T.C., Gottmann, K., and Missler, M. (2004) Postsynaptic N-methyl-D-aspartate receptor function requires α-neurexins. Proc. Natl. Acad. Sci. U.S.A. 101, 2607-2612.
      3. Dudanova, I., Tabuchi, K., Rohlmann, A., Südhof, T.C., and Missler, M. (2007) Deletion of α-Neurexins Does Not Cause a Major Impairment of Axonal Pathfinding or Synapse Formation. J. Comp. Neurol. 502, 261-274.
      4. Etherton, M.R., Blaiss, C., Powell, C.M., and Südhof, T.C. (2009) Mouse neurexin-1α deletion causes correlated electrophysiological and behavioral changes consistent with cognitive impairments. Proc. Natl. Acad. Sci. U.S.A. 106, 17998-18003.
      5. Soler-Llavina, G.J., Fuccillo, M.V., Ko, J., Südhof, T.C., and Malenka, R.C. (2011) The neurexin ligands, neuroligins and LRRTMs, perform convergent and divergent synaptic functions in vivo. Proc. Natl. Acad. Sci. U.S.A. 108, 16502-16509.
      6. Aoto, J., Martinelli, D.C., Malenka, R.C., Tabuchi, K., and Südhof, T.C. (2013) Presynaptic Neurexin-3 Alternative Splicing Trans-Synaptically Controls Postsynaptic AMPA-Receptor Trafficking. Cell 154, 75-88. PMCID: PMC3756801.
      7. Aoto, J., Földy, C., Ilcus, S.M., Tabuchi, K., and Südhof, T.C. (2015) Distinct circuit-dependent functions of presynaptic neurexin-3 at GABAergic and glutamatergic synapses. Nat Neurosci. 18, 997-1007.
      8. Anderson, G.R., Aoto, J., Tabuchi, K., Földy, C., Covy, J., Yee, A.X., Wu, D., Lee, S.-J., Chen, L., Malenka, R.C., and Südhof, T.C. (2015) α-Neurexins Control Neural Circuit Dynamics by Regulating Endocannabinoid Signaling at Excitatory Synapses. Cell 162, 593-606. PMCID: PMC4709013
      9. Chen, L.Y., Jiang, M., Zhang, B., Gokce, O., and Südhof, T.C. (2017) Conditional Deletion of All Neurexins Defines Diversity of Essential Synaptic Organizer Functions for Neurexins. Neuron 94, 611-625. PMCID: PMC5501922
      10. Dai, J., Aoto, J., and Südhof, T.C. (2019) Alternative Splicing of Presynaptic Neurexins Differentially Controls Postsynaptic NMDA- and AMPA-Receptor Responses. Neuron 102, 993-1008. PMCID: PMC6554035
      11. Luo, F., Sclip, A., Jiang, M., and Südhof, T.C. (2020) Neurexins Cluster Ca2+ Channels within presynaptic Active Zone. EMBO J. 39, e103208. PMCID: PMC7110102
      12. Khalaj, A.J., Sterky, F.H., Sclip, A., Schwenk, J., Brunger, A.T., Fakler, B., and Südhof, T.C. (2020) Deorphanizing FAM19A Proteins as Pan-Neurexin Ligands with an Unusual Biosynthetic Binding Mechanism. J. Cell Biol. 219, e202004164
      13. Luo, F., Sclip, A., and Südhof, T.C. (2021) Universal role of neurexins in regulating presynaptic GABAB-receptors. Nature Comm. 12, 2380. PMCID: PMC8062527
      14. Wang, C.Y., Trotter, J.H., Liakath-Ali, K., Lee, S.J., Liu, X., and Südhof, T.C. (2021) Molecular Self-Avoidance in Synaptic Neurexin Complexes. Science Advances 7, eabk1924. PMCID: PMC8682996
      15. Dai, J., Patzke, C., Liakath-Ali, K., Seigneur, E., and Südhof, T.C. (2021) GluD1, A signal transduction machine disguised as an ionotropic receptor. Nature 595, 261-265. PMCID: PMC8776294

      Individual papers may not convince the reviewer, but the cumulative evidence is, we hope, persuasive. We have published less evidence on the lack of a role of cerebellins and GluD’s in synapse numbers than on neurexins, but the only in-depth studies of these molecules by others are in the cerebellum. Here, deletions of Cbln1 and GluD2 indeed cause a significant, albeit partial, loss of synapses. However, this loss may not be due to a lack of synapse formation, but to an elimination of synapses that have been formed, as demonstrated by many beautiful papers from leading investigators. It is regrettable that reviews and textbooks continue to state as an established fact that cerebellins mediate synapse formation, because as far as we can see, there is limited evidence for that conclusion, yet it keeps coming back again and again.

      3) The authors do not provide sufficient data in order to interpret the increase in AMPAR-EPSCs and decrease in NMDAR-EPSCs amplitudes. Are the changes in AMPARs and NMDARs occurring at pre-existing synapses or do they result from alterations in the number of physical synapses and/or active synapses (see point#2)? In particular, the increase in AMPAR/NMDAR ratio accompanied by the increase in mEPSCs frequency might be well explained by the unsilencing of some synapses and/or by the fact that the available pool of AMPARs is distributed over a smaller number of synapses, resulting in higher quantal size. These effects could explain the blockade of LTP, i.e., through an occlusion mechanism.

      We addressed these points in previous studies (Aoto et al., 2013; Dai et al., 2019 and 2021), as discussed and cited in the present paper, and expanded on these points in the present paper.

      In a nutshell, we showed by surface AMPAR staining that presynaptic Nrxn3-SS4+ decreases postsynaptic AMPAR levels, and by direct application of AMPA that it decreases the functional surface levels of AMPARs, whereas presynaptic Nrxn1-SS4+ increases the functional surface levels of NMDARs. We also demonstrated the opposite effects for the GluD1 KO, and furthermore showed by minimal stimulation experiments that the Cbln2 deletion does not alter the number of silent synapses. In the present manuscript, we performed a detailed analysis of the miniature quantal size for AMPAR- and NMDAR-EPSCs.

      Finally, we have demonstrated in a large number of papers, including this one, that genetic manipulations of neurexins, cerebellins, and GluD’s do not alter synapse numbers with a few exceptions in which synapses are secondarily eliminated, like in the cerebellum. Together, these data show that the observed changes are mediated by a regulation of postsynaptic functional AMPARs and NMDARs, not by alterations in synapse numbers or by synapse silencing/unsilencing.

      4) The authors did not demonstrate (or did not cite relevant studies) that the deletion of Cbln1 and/or Cbln2 does not affect the expression of the remaining Cblns isoforms (Cbln2 and/or Cbln4) or Nrxns1/3 and GluD1/2. This verification is important to preclude the emergence of any compensatory effect.

      To address this point, we have now measured the mRNA expression levels of Nrxns, Cblns, and GluDs in both the subiculum and the prefrontal cortex in littermate P35-42 Cbln2 WT and KO mice. The results show that the constitutive Cbln2 deletion causes no compensatory expression effects (new suppl Fig. S5). Please note that compensatory expression effects are often raised as a possibility for explaining genetically induced changes (or the lack thereof), but such effects are virtually never found.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors try to shed light on how plant stem cells located in a ring‐like structure in the (the procambial cells or cambium) can generate two distinct differentiated tissues, one filling the interior of the ring (the xylem) and the other one surrounding the ring (the phloem). To achieve this goal, the authors propose different models increasing in complexity, and perform a series of comparisons between the model outcomes and experimental data in the Arabidopsis hypocotyl. This work seems to provide for the first time a computational framework to model the radial formation of the cambium, xylem and phloem in the hypocotyl. Some of the features of the wild type and mutants could be qualitatively recapitulated, such as the radial organization of the xylem, cambium and phloem in wild type, and a striking phenotype upon the overexpression of CLE41 transgene.

      We thank the reviewer for appreciating the novelty of this work.

      Although this work is very well written and understandable at the introduction, when paying careful attention to the presented results, there are different aspects that would require further work and investigation, on both experimental and modelling sides: The authors chose to study different models increasing in complexity, reaching a more complete model (Model 3, Figure 5A‐D) that the authors claim it is recapitulating the experimental data and the explored experimental perturbations (Figure 5E‐F). This model is substantially more complex than Model 1 and Model 2, and it is difficult to understand all the claims by the authors, and the radial pattern formation capabilities of it. Yet, a feature that is clear to the eye, both in the pictures and in the movies, is that this model seems more likely to present a front instability of the cambium front progression, disrupting the radial organization of the different tissues (see Figure 5B), which does not seem to happen in the wild type hypocotyl from Arabidopsis. This effect is even more extreme when looking at the pxy mutant (Figure 5F) and when the xylem cell wall thickness is explored through the simulations (Figure 6). The authors claim this model is able to recapitulate a basic feature of the pxy mutant, which is the fact that the distal cambium appears in patches. Although these patches appear in the simulations, this effect in the model might be produced by the instability of the cambium front progression itself, which might be fundamentally different from what happens in the experimental data. In the experimental data, the PXYpro:CFP cambium does not seem to present such front instability, but rather is the xylem that gets fragmented. To make a link between the Model 3 and the pxy mutant, a careful study of the different stages of this phenotype could be useful to do, both on the modelling and experimental side.

      Thanks for this valuable comment and for appreciating our writing style. Front stability was not part of our considerations but certainly adds a very interesting aspect to our study. The reviewer is correct in noticing that the front of domains observed in planta is very stable but that this is not the case for our computational simulations. We believe that instability in the computational models is due to local noise in the cellular pattern, leading to differential diffusion of chemicals with respect to their radial position and to a progressive deviation of the domain from a perfect circle. Such a deviation seems to be corrected by an unknown mechanism in planta, but such a corrective mechanism is, due to the absence of a good idea of its nature, not implemented in our models. In order to investigate this point and the contribution of front instability to phenotypes of perturbed lines, we performed a time course analysis of anatomies of wt, IRX3pro:CLE41 and pxy lines with the help of the PXYpro:CFP/SMXL5pro:YFP markers, now shown in Fig. S1, and compared their dynamics to the respective movies 4A, 5A, and 6A. For pxy mutants, we observed ‘gaps’ in the cambium domain already at early stages of development (Fig. S1I, J). This argues against the pxy anatomy being caused by increased front instability and instead points to differential signaling within a circular domain, leading to a breakdown of cambium patterning and cell fate determination. Although a corrective mechanism ensuring front stability in planta is difficult to predict, we believe that our model now allows testing of respective ideas like directional movement of chemicals or stabilizing communication between cells within a particular circular domain. This aspect is now discussed in the discussion.

      The authors have a parameter search strategy based on matching the proportion of cell types in Model 3. I am wondering how effective is this strategy in a system where these features are evolving in time, especially in Model 3, which seems to present a front instability. Moreover, this strategy does not tell anything about the model robustness for recapitulating the different features of the pattern.

      We thank the reviewer for pointing out these aspects regarding the parameter search. We agree that there are some limitations to estimating dynamic parameters based on the proportion of cell types. As a consequence, we have focused our parameter search on those parameters that directly impact tissue formation: cell division thresholds, cell differentiation thresholds and maximal cell sizes. We have further expanded our parameter search until we obtained five distinct parameter sets that recapitulate central features of cambium activity. This increases the likelihood that the behavior we saw in the subsequent analyses was actually a feature of the system and not a characteristic of that particular parameter set. This strategy did not solve the front instability of model 3, which suggests that there are factors at play ‐ beyond the CLE41‐PXY module and cell wall stability – which are currently beyond the scope of our model.

      In the last model, the authors try to link the cell wall thickness with the radiality of the divisions. Although the idea of looking at the division trajectories seems interesting, more clarity is needed to see how helpful is the radiality measure, and perhaps a better measure is needed ‐ note that the proliferation trajectory in Figure 6C might have the same amount of ramifications than in Figure 6B, and therefore, effectively speaking, the amount of periclinal divisions might be the same in both cases. The authors claim that the increase of xylem thickness contributes in having a more radial growth, but this could be related to the cambium front instability, which seems to be more pronounced as well for higher xylem thickness.

      We agree with the reviewer that this is a critical point, as a robust measurement of ‘radiality’ of cell lineages is central for assessing the degree of periclinality of cell divisions with the computational model. After extensively considering different measurement methods, we indeed think that calculating R² of cell connectors is the most appropriate and quantitative one in the context of our computational model. In fact, this method does not consider the number of ramifications but the geometry of ‘cell connectors’, which clearly shows a more ‘radial’ pattern of cell lineages when xylem cells are ‘stiffer’ (Fig. 6D). Ramifications would be a measurement of the number of cell divisions, which we did not want to target in this case. We also did not claim that increased xylem thickness leads to more radial growth. In fact, Fig. S4 shows that this is rather the opposite. We expect that increased front instability when ‘xylem stiffness’ is increased would rather decrease radiality of cell lineages and mask respective positive effects. The fact that we still see increased ‘radiality’ argues against the assumption that front instability is causative.
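
      The response does not spell out how R² of the cell connectors is computed, so the following is only a rough illustration of the general idea of scoring how straight a lineage is. It is a sketch under stated assumptions, not the authors' method: the function name, the input format (a list of cell centroids), and the choice of an orientation-independent total-least-squares score in place of an ordinary regression R² are all hypothetical.

```python
import math


def lineage_straightness(points):
    """Score in [0.5, 1.0] for how closely 2D centroids follow a straight line.

    points: list of (x, y) cell centroids of one lineage (hypothetical format).
    The score is the share of positional scatter that lies along the best-fit
    line (total least squares), so a vertical lineage scores like a horizontal
    one. This is a stand-in for, not a reproduction of, the R2 in the response.
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least three centroids")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 scatter matrix around the centroid mean.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    total = sxx + syy
    if total == 0:
        raise ValueError("all centroids coincide")
    # Largest eigenvalue of the scatter matrix, in closed form.
    largest = total / 2 + math.hypot((sxx - syy) / 2, sxy)
    return largest / total
```

      A perfectly straight file of cells gives a score of 1, whereas centroids scattered equally in all directions give 0.5; averaging the score over many lineages would give one number per simulation.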

      On the experimental side, the claims about the proximal and distal cambium, together with the cell proliferation data are not very well supported with the presented data in Figures 2, 3A and S1. Moreover, these different figures seem to show different behaviors ‐ are these sections at different stages of the hypocotyl? Also, seeing more of the H4 marker in a region of the tissue not necessarily indicates a higher proliferation rate (it could also simply be that cells are more synchronized in the S phase in that region of the cambium, and/or the cell cycle lasts for longer in that part of the tissue). A quantification and the proper repeats to support these claims is lacking. A quantitative and more extensive study of the pxy mutant would enable a better comparison with the simulated model. Is there PXYpro:CFP expression between the fragmented xylem?

      We agree with these concerns about the H4 marker used in the initial submission. Because H4 expression is not specifically associated with cell division but with DNA synthesis in general and, thus, with endoreduplication, H4 expression does not report faithfully on cell division. In response, we removed the related figures and now reference our previous study characterizing cell division levels in different cambium domains based on cell lineage analyses (Shi et al., 2019). Because this is a far more reliable analysis and convincingly supports our claims, we believe that we have thereby addressed this concern. As mentioned above, we also added a more extensive analysis of the pxy mutant (Fig. S1) showing that there is no PXY expression between the fragmented xylem domains.

      This work might help progress in the field of understanding radial patterning in plants. The introduction and the first models could attract a more general plant audience, but once the models increase in complexity, the narrative and presented results are more relevant to those scientists more specialized in xylem and phloem formation.

      We thank the reviewer for appreciating the general relevance of our models for a larger audience.

      Reviewer #2 (Public Review):

      The paper uses computer modeling and simulations to show how a radially growing circular plant organ, such as a hypocotyl, can develop and maintain its organization into tissues including, in particular, cambium, xylem and phloem. The results are illustrated with useful movies representing the simulations. The paper is organized as a sequence of models, which has some rationale ‐ it presumably depicts the path of refinements through which the authors arrived at the final model ‐ but the intermediate steps are of limited interest. At the same time, mathematical details of the models are not presented to the full extent. Fortunately, the models can be downloaded over the Internet, and the supplementary materials include detailed instructions for executing them (using the VirtualLeaf framework). Consequently, the paper and its results can potentially serve as a stepping stone for further model‐assisted studies of radial tissue organization and growth.

      Again, we thank the reviewer for appreciating the usefulness of our model and its general implications. In the revised version of the manuscript we substantially expanded explanations of the mathematical details in the main text and the supplemental methods. We would still argue that the intermediate steps are of general interest, as they illustrate why certain assumptions that are extensively discussed within the field were rejected, providing important justifications for the final model.

    1. Author response

      Reviewer #3 (Public Review):

      Sensory preconditioning (SPC) refers to a conceptually important, higher-order form of Pavlovian conditioning. It involves two training phases and a final test. In the first, pre-conditioning training phase two 'neutral' stimuli are presented together (S1, S2). In the second training phase, one of them is paired with for example a punishment (S1+). In the final test conditioned response to the respective other stimulus is assessed (S2).

      The conclusion that sensory preconditioning does indeed occur requires showing that i) conditioned responding is observed for S2 but not for other, not pre-conditioned stimuli (S3); ii) that conditioned responding to S2 depends on the jointness of presentation of S1 and S2; iii) that conditioned responding to S2 depends on S1 indeed being paired with punishment. It is a strength of the current paper that these requirements are met and that this is the case both at the behavioural level and for a plausible stand-in at the physiological level.

      A weakness is that key data belonging together are not shown and analysed together.

      We have rearranged the data.

    1. Author Response

      Reviewer #1 (Public Review):

      Mikelov et al. investigated IgH repertoires of memory B cells, plasmablasts, and plasma cells from peripheral blood collected at three time-points over the course of a year. In order to obtain deep and unbiased repertoire sequences, authors adopted uniquely developed IgH repertoire profiling technology. Based on collected peripheral blood data, authors claim that:

      1) A high degree of clonal persistence in individual memory B cell subsets with inter-individual convergence in memory and ASCs.

      2) ASC clonotypes are transient over time and related to memory B cells.

      3) Reactivation of persisting memory B cells with new rounds of affinity maturation during proliferation and differentiation into ASCs.

      4) Both positive and negative selection contribute to persisting and reactivated lineages preserving the functionality and specificity of BCRs.

      The present study provides useful technical application for the analysis of longitudinal B cell repertoires, and bioinformatics and statistical data analysis are impressive. Regarding point 1), clonal persistence of memory B cells is already well known. On the other hand, inter-individual convergence between memory B cells and plasma cells might not be shown in healthy individuals even though the biological significance of circulating plasma cells is questionable.

      We thank the reviewer for careful analysis of our manuscript and are grateful for the positive view and all the criticism of our study.

      To the best of our knowledge the clonal persistence of memory B cells was previously studied mostly in the contexts of active immune response after natural challenge or after immunization. Here we used the full set of modern experimental and analytical repertoire sequencing approaches to characterize the connection and dynamics of memory and the two antibody-secreting B cell subpopulations during a long period in healthy donors, i.e. in donors without severe inflammatory diseases or who were not experienced intensive response against a natural antigen close to the sample collection time points. In other words, we carefully dissected the repertoire of peripheral blood antigen-experienced B cells in normal state. Thus we believe that our study brings a number of essentially new details to the overall picture of B cell immunity.

      By assessing the intra- and inter-individual repertoire overlaps, we found high reproducibility of B cell memory clones between time points, only slightly lower than the overlap between replicates. About 5% of the largest clonotypes were identical (Fig. 2B left), while the V usage distribution changed more substantially over time (Fig. 2A left), suggesting an impact of non-persistent memory IGH clonotypes. Compared to the intra-individual reproducibility, the number of shared clonotypes between unrelated donors was extremely low but still detectable, showing the contribution of convergent clonotypes to the repertoire overlap of antigen-experienced B cells between unrelated donors. Together, our findings show a high level of individuality of the IGH repertoire of antigen-experienced B cells, while common challenges make it converge to some extent at the level of the most expanded clones, which are extremely stable (persistent) over time. On the way from naive to antigen-experienced B cells, the germline-encoded sequences of CDR1 and CDR2 make an impact, which is similar between individuals with similar genetic and environmental context. The latter further supports the previously reported findings on the role of germline-encoded parts of IGH in the response against specific antigens (Collins et al. DOI: 10.1016/j.coisb.2020.10.011).

      Regarding 2), temporal stability of plasma cell clonotypes has been demonstrated already in the bone marrow with serial biopsies over time (Wu et al. DOI: 10.1038/ncomms13838). The association of clonotypes between memory and plasma cells in the blood of healthy donors might be new; however, again its biological significance is questionable.

      The long-term stability of plasma cells was previously shown by a number of studies demonstrating the presence of antigen-specific clones, or even cells, for months and years in human bone marrow and other sites, as well as in mice and primates (Wu et al. DOI: 10.1038/ncomms13838; Landsverk et al. DOI: 10.1084/jem.20161590; Manz et al. DOI: 10.1038/40540; Hammarlund et al. DOI: 10.1038/s41467-017-01901-w; Xu et al. DOI: 10.7554/eLife.59850; Davis et al. DOI: 10.1126/science.aaz8432). We agree that BM samples would add an additional layer to our investigation by describing the interconnection of the B cell memory pool with BM PCs. We also agree that the nature of circulating plasma cells is not fully clear at the moment and that the relation of such cells/clones to BM PCs remains to be detailed. However, we cannot agree with the reviewer’s remark about the low (or absent) biological significance of circulating ASCs. According to the modern view, arising from a large number of studies conducted over the past several decades in mice, humans and other organisms, the differentiation events in the GC after antigen priming lead to the formation of cells switched to an antibody-secreting program, and some of them subsequently reach the bone marrow as a site of residence. The bone marrow niches provide the signals required for further differentiation of newly migrated ASCs into long-lived or short-lived plasma cells and for their survival in the BM. However, ASCs migrating to the BM can be sampled from blood during their migration. The presence of an apoptosis-resistant subset of PCs expressing high-affinity Abs in circulation early after booster immunization in humans was previously shown (Inés González-García et al. DOI: 10.1182/blood-2007-08-108118). A similar in vitro survival ability of transcriptomically different blood ASC subsets was demonstrated by other authors (Garmilla et al. DOI: 10.1172/jci.insight.126732). A recent study, using an artificial system modeling the BM niche in vitro, showed that peripheral blood ASCs are able to differentiate into LLPCs (Joyner et al. DOI: 10.26508/lsa.202101285). Besides, a number of other studies have shown an increase of plasmablasts and plasma cells in PB during an intensive immune response after primary or secondary immunization/natural challenge (Blink et al. DOI: 10.1084/jem.20042060; Odendahl et al. DOI: 10.1182/blood-2004-07-2507; Lee et al. DOI: 10.4049/jimmunol.1002932) or in active autoimmune conditions (Szabo et al. DOI: 10.1111/cei.12703; Jacobi et al. DOI: 10.1002/art.10949). We therefore considered the ASC subsets in our work as a source of ASCs enriched in recently differentiated antibody producers that differ in expression of CD138, which is the marker of LLPCs among BM plasma cells and seemingly marks differently differentiated ASCs in circulation. Thus, these ASC subsets complement antigen-primed peripheral blood B cells, play an important role in the ongoing immune response, and influence the plasma cell population in the BM. The connection at the clonal lineage level between persisting memory B cells and the ASC subsets shown in our study, together with findings recently published by Antonio Lanzavecchia’s lab (Phad et al. DOI: 10.1038/s41590-022-01230-1), supports the idea that the circulating CD19-/lowCD20-CD27+CD138+/- B cells in PB represent the antibody-producing progeny of reactivated memory.

      Regarding 3) and 4), it is hard to generalize observations from the presented data because the analysis was based on just four donor cases with different health conditions, i.e. a combination of healthy and allergic. The cell number of plasmablasts and plasma cells isolated from peripheral blood is extremely low compared to memory B cells, and in fact, the vast majority of ASCs reside in the tissues such as lymphoid organs, bone marrow, and mucosal tissues rather than in circulating blood (Mandric et al. DOI: 10.1038/s41467-020-16857-7). As the most critical problem, direct pieces of evidence to claim points, 3) and 4) are missing.

      We fully agree that our study has a set of limitations and added more detailed discussion of them to the revised version (lines 582-600). We agree that our cohort group is not large, nevertheless our observations demonstrate reproducibility among different donors and hold statistical significance for detected differences. To justify our generalization of this cohort group, combined from healthy and allergic donors, we added more detailed analysis as a Supplementary Note, showing that within our study design we observe no difference between healthy and allergic donors both on the level of the clonal repertoire and the level of clonal lineages.

      The number of sampled plasmablasts and plasma cells relative to memory B cells in our study reflects the ratio between these subpopulations in the peripheral blood of middle-aged donors and corresponds to estimates previously published by others. Given that about 15% of the most abundant clonotypes on average were reproducible between parallel samples (replicates), the sampled numbers of PBL and PL allowed us to reach relatively high reproducibility of clone sampling at the level of cells. This, as well as the diversity estimates, indicates that we sequenced a representative number of ASCs in peripheral blood to characterize their clonal repertoire and their connection with the B cell memory pool. Indeed, the vast majority of plasma cells reside in different tissues, mostly in the bone marrow, but we believe that the ASCs in circulation represent the pool of newly generated ASCs and/or ASCs migrating between sites at different stages of differentiation. However, further studies showing the clonal relationship between memory B cells and ASCs in circulation and tissue-resident ASCs are still required to provide a more detailed view of this aspect.

      We agree that we cannot provide much direct evidence to support points 3) and 4); however, we found several lines of indirect evidence that are highly consistent with each other and support the claims on memory reactivation and clonal selection:

      1. Biologically, the rapid increase in frequency of LBmem lineages and its perfect reproducibility between replicates (Supplementary Figure S7E) indicate an increase in the number of sampled cells, i.e. lineage expansion, which occurred due to proliferation after antigen challenge or migration between tissues of residence in response to some other signals. The predominance of the ASC phenotype indicates their involvement in an ongoing immune response.

      2. The large G-MRCA distance in LBmem lineages, together with low inter-lineage genetic divergence, indicates that the observed clonotypes of LBmem lineages diverged recently, originate from some mature clonotype, and represent only a single clade of the full lineage phylogeny.

      3. Most LBmem lineages (47 out of 52) include Bmem clonotypes, showing the interconnection of the LBmem cluster with the Bmem subset. For 38 out of 52 LBmem lineages, we detected a Bmem clonotype at the time point prior to lineage expansion.

      4. The significant difference in SHM patterns between HBmem and LBmem lineages reflects a difference in the selection forces affecting their evolution. In evolutionary genomics, it is rarely possible to study evolution directly, and most often changes in genetic sequences are the only type of data available. Therefore, we are inclined to trust the conclusions drawn from the use of tools designed for this type of problem. While negative selection is expected in the evolution of any protein, positive selection is much trickier to detect. Thus the presence of its signs suggests new rounds of affinity maturation or the presence of some mechanism leading to reactivation of the best-fitted representatives of the lineage.

      In addition to the indirect evidence, we found a direct and clear example of memory reactivation inside a clonal lineage (Fig. 4F). We added an alignment of the CDR3 region of this lineage as Supplementary Figure S7 to confirm that both its HBmem-like and LBmem-like parts originate from the same recombination event.

      These findings lead to the conclusion that most of the LBmem lineages in the analysis originated from some pre-existing memory. However, we cannot say for sure that in all cases this memory is similar in properties to the persistent memory of the HBmem cluster. The one exemplary clonal lineage shows that at least some LBmem lineages represent reactivation of persistent HBmem lineages. The most recent study in the field, published by Phad et al. (DOI: 10.1038/s41590-022-01230-1), has also demonstrated clonal relatedness of peripheral blood plasmablasts to persistent memory. It should also be noted that in the present study we focused on the most expanded clones and clonal lineages, while the mechanisms determining the extent of expansion are not well defined, and thus the behavior of smaller clones may differ. To conclude, we believe that our findings can be generalized, while probably representing only a part of the whole complex picture describing the behavior of B cell memory in the normal state.

      Reviewer #2 (Public Review):

      The findings in this manuscript have been properly hypothesized and adequately demonstrated, and have some levels of practical guidance. The authors performed a detailed longitudinal analysis of a subset of immune-experienced B cells from donors without severe pathology. They selected a comprehensive analytical framework for BCR clonal lineage from these data and suggested interconnected B-cell clone-level subsets, B-cell memory fusion in donor-independent, and long-term persistent peripheral blood memory-enriched clonal lineages. Lastly, their evolutionary results analyzing the B-cell clonal lineage plus annotation suggest that activating B-cell subsets of preexisting memory-B cells is accompanied by the maturation of new rounds of affinity.

      We thank the Reviewer for careful analysis and positive view on our study.

    1. Author Response

      Reviewer #1 (Public Review):

      Overall, the science is sound and interesting, and the results are clearly presented. However, the paper falls in-between describing a novel method and studying biology. As a consequence, it is a bit difficult to grasp the general flow, central story and focus point. The study does uncover several interesting phenomena, but none are really studied in much detail and the novel biological insight is therefore a bit limited and lost in the abundance of observations. Several interesting novel interactions are uncovered, in particular for the SPS sensor and GAPDH paralogs, but these are not followed up on in much detail. The same can be said for the more general observations, eg the fact that different types of mutations (missense vs nonsense) in different types of genes (essential vs non-essential, housekeeping vs. stress-regulated...) cause different effects.

      This is not to say that the paper has no merit - far from it even. But, in its current form, it is a bit chaotic. Maybe there is simply too much in the paper? To me, it would already help if the authors would explicitly state that the paper is a "methods" paper that describes a novel technique for studying the effects of mutations on protein abundance, and then goes on to demonstrate the possibilities of the technology by giving a few examples of the phenomena that can be studied. The discussion section ends in this way, but it may be helpful if this was moved to the end of the introduction.

      We modified the manuscript as suggested.

      Reviewer #2 (Public Review):

      Schubert et al. describe a new pooled screening strategy that combines protein abundance measurements of 11 proteins determined via FACS with genome-wide mutagenesis of stop codons and missense mutations (achieved via a base editor) in yeast. The method identifies genetic perturbations that affect steady-state protein levels (vs transcript abundance), and in this way defines regulators of protein abundance. The authors find that perturbation of essential genes more often alters protein abundance than perturbation of nonessential genes, and that proteins with core cellular functions more often decrease in abundance in response to genetic perturbations than stress proteins. Genes whose knockouts affected the level of several of the 11 proteins were enriched in protein biosynthetic processes, while genes whose knockouts affected specific proteins were enriched for functions in transcriptional regulation. The authors also leverage the dataset to confirm known and identify new regulatory relationships, such as a link between the SPS amino acid sensor and the stress response gene Yhb1 or between Ras/PKA signalling and GAPDH isoenzymes Tdh1, 2, and 3. In addition, the paper contains a section on benchmarking of the base editor in yeast, where it has not been used before.

      Strengths and weaknesses of the paper

      The authors establish the BE3 base editor as a screening tool in S. cerevisiae and very thoroughly benchmark its functionality for single edits and in different screening formats (fitness and FACS screening). This will be very beneficial for the yeast community.

      The strategy established here allows measuring the effect of genetic perturbations on protein abundances in highly complex libraries. This complements capabilities for measuring effects of genetic perturbations on transcript levels, which is important as for some proteins mRNA and protein levels do not correlate well. The ability to measure proteins directly therefore promises to close an important gap in determining all their regulatory inputs. The strategy is furthermore broadly applicable beyond the current study. All experimental procedures are very well described and plasmids and scripts are openly shared, maximizing utility for the community.

      There is a good balance between global analyses aimed at characterizing properties of the regulatory network and more detailed analyses of interesting new regulatory relationships. Some of the key conclusions are further supported by additional experimental evidence, which includes re-making specific mutations and confirming their effects on protein levels by mass spectrometry.

      The conclusions of the paper are mostly well supported, but I am missing some analyses on reproducibility and potential confounders and some of the data analysis steps should be clarified.

      The paper starts on the premise that measuring protein levels will identify regulators and regulatory principles that would not be found by measuring transcripts, but since the findings are not discussed in light of studies looking at mRNA levels it is unclear how the current study extends knowledge regarding the regulatory inputs of each protein.

      See response to Comment #10.

      Specific comments regarding data analysis, reproducibility, confounders

      1) The authors use the number of unique barcodes per guide RNA rather than barcode counts to determine fold-changes. For reliable fold changes the number of unique barcodes per gRNA should then ideally be in the 100s for each guide, is that the case? It would also be important to show the distribution of the number of barcodes per gRNA and their abundances determined from read counts. I could imagine that if the distribution of barcodes per gRNA or the abundance of these barcodes is highly skewed (particularly if there are many barcodes with only few reads) that could lead to spurious differences in unique barcode number between the high and low fluorescence pool. I imagine some skew is present as is normal in pooled library experiments. The fold-changes in the control pools could show whether spurious differences are a problem, but it is not clear to me if and how these controls are used in the protein screen.

      Because of the large number of screens performed in this study (11 proteins, with 8 replicates for each) we had to trade off sequencing depth and power against cell sorting time and sequencing cost, resulting in lower read and barcode numbers than what might be ideally aimed for. As described further in the response to Comment #5, we added a new figure to the manuscript that shows that the correlation of fold-changes between replicates is high (Figure 3–S1A). The second figure below shows that the correlation between the number of unique barcodes and the number of reads per gRNA is highly significant (p < 2.2e-16).
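
      The barcode-based fold-change discussed in this comment can be illustrated with a minimal Python sketch. It assumes reads have already been parsed into (gRNA, barcode) pairs for each sorted pool. The function names, the input format, the pseudocount, and the normalization by each pool's total unique barcode count are illustrative assumptions, not the pipeline used in the manuscript.

```python
import math
from collections import defaultdict


def unique_barcodes_per_grna(reads):
    """Map each gRNA to the set of distinct plasmid barcodes seen with it.

    `reads` is an iterable of (grna_id, barcode) tuples (hypothetical format).
    Repeated reads of the same barcode count only once.
    """
    barcodes = defaultdict(set)
    for grna, barcode in reads:
        barcodes[grna].add(barcode)
    return barcodes


def log2_fold_changes(high_reads, low_reads, pseudocount=1.0):
    """Log2 ratio of unique-barcode fractions (high pool vs low pool) per gRNA.

    Each pool is normalized by its total number of unique barcodes; the
    pseudocount (an assumption) avoids division by zero for absent gRNAs.
    """
    high = unique_barcodes_per_grna(high_reads)
    low = unique_barcodes_per_grna(low_reads)
    total_high = sum(len(b) for b in high.values())
    total_low = sum(len(b) for b in low.values())
    result = {}
    for grna in set(high) | set(low):
        frac_high = (len(high.get(grna, ())) + pseudocount) / (total_high + pseudocount)
        frac_low = (len(low.get(grna, ())) + pseudocount) / (total_low + pseudocount)
        result[grna] = math.log2(frac_high / frac_low)
    return result
```

      Counting unique barcodes rather than reads makes each independently transformed cell lineage contribute once, which is why the number of barcodes per gRNA determines how finely such a fold-change can be resolved.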

      2) I like the idea of using an additional barcode (plasmid barcode) to distinguish between different cells with the same gRNA - this would directly allow to assess variability and serve as a sort of replicate within replicate. However, this information is not leveraged in the analysis. It would be nice to see an analysis of how well the different plasmid barcodes tagging the same gRNA agree (for fitness and protein abundance), to show how reproducible and reliable the findings are.

      We agree with the reviewer that this would be nice to do in principle, but our sequencing depth for the sorted cell populations was not high enough to compare the same barcode across the low/unsorted/high samples. See also our response to Comment #5 for the replicate analyses.

      3) From Fig 1 and previous research on base editors it is clear that mutation outcomes are often heterogeneous for the same gRNA and comprise a substantial fraction of wild-type alleles, alleles where only part of the Cs in the target window or where Cs outside the target window are edited, and non C-to-T edits. How does this reflect on the variability of phenotypic measurements, given that any barcode represents a genetically heterogeneous population of cells rather than a specific genotype? This would be important information for anyone planning to use the base editor in future.

      We agree with the reviewer that the heterogeneity of editing outcomes is an important point to keep in mind when working with base editors. In genetic screens, like the ones described here, often the individual edit is less important, and the overall effects of the base editor are specific/localized enough to obtain insights into the effects of mutations in the area where the gRNA targets the genome. For example, in our test screens for Canavanine resistance and fitness effects, in which we used gRNAs predicted to introduce stop codons into the CAN1 gene and into essential genes, respectively, we see the expected loss-of-function effect for a majority of the gRNAs (canavanine screen: expected effect for 67% of all gRNAs introducing stop codons into CAN1; fitness screen: expected effect for 59% of all gRNAs introducing stop codons into essential genes) (Figure 2). In the canavanine screen, we also see that gRNAs predicted to introduce missense mutations at highly conserved residues are more likely to lead to a loss-of-function effect than gRNAs predicted to introduce missense mutations at less conserved residues, further highlighting the differentiated results that can be obtained with the base editor despite the heterogeneity in editing outcomes overall. We would certainly advise anyone to confirm by sequencing the base edits in individual mutants whenever a precise mutation is desired, as we did in this study when following up on selected findings with individual mutants.

      4) How common are additional mutations in the genome of these cells and could they confound the measured effects? I can think of several sources of additional mutations, such as off-target editing, edits outside the target window, or when 2 gRNA plasmids are present in the same cell (both target windows obtain edits). Could some of these events explain the discrepancy in phenotype for two gRNAs that should make the same mutation (Fig S4)? Even though BE3 has been described in mammalian cells, an off-target analysis would be desirable as there can be substantial differences in off-target behavior between cell types and organisms.

      Generally, we are not very concerned about random off-target activity of the base editor because we would not expect this to cause a consistent signal that would be picked up in our screen as a significant effect of a particular gRNA. Reproducible off-target editing with a specific gRNA at a site other than the intended target site would be problematic, though. We limited the chance of this happening by not using gRNAs that may target similar sequences to the intended target site in the genome. Specifically, we excluded gRNAs that have more than one target in the genome when the 12 nucleotides in the seed region (directly upstream of the PAM site) are considered (DiCarlo et al., Nucleic Acids Research, 2013).
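
      The seed-based exclusion rule described above can be sketched as follows. This is a simplified illustration, assuming an NGG PAM, an uppercase A/C/G/T genome string, and exact matching of the 12-nucleotide seed on both strands; the function names are hypothetical and the actual filtering procedure may differ.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_seed_sites(genome, seed):
    """Count sites where `seed` is immediately followed by an NGG PAM.

    Both strands are scanned; the lookahead allows overlapping matches.
    """
    pattern = re.compile("(?=" + re.escape(seed) + "[ACGT]GG)")
    forward = len(pattern.findall(genome))
    reverse = len(pattern.findall(reverse_complement(genome)))
    return forward + reverse


def has_unique_seed(genome, protospacer, seed_length=12):
    """True if the PAM-proximal seed of the protospacer has exactly one site."""
    seed = protospacer[-seed_length:]
    return count_seed_sites(genome, seed) == 1
```

      A gRNA library could then be filtered by keeping only protospacers for which `has_unique_seed` returns True.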

      We do observe some off-target editing right outside the target window, but generally at much lower frequency than the on-target editing in the target window (Figure 1B and Figure 1–S2). Since for most of our analyses we grouped perturbations per gene, such off-target edits should not affect our findings. In addition, we validated key findings with independent experiments. For our study, we used the Base Editor v3 (Komor et al., Nature, 2016); more recently, additional base editors have been developed that show improved accuracy and efficiency, and we would recommend these base editors when starting a new study (see, e.g., Anzalone et al., Nature Biotechnology, 2020).

      We are not concerned about cases in which one cell gets two gRNAs, since the chance that the same two gRNAs end up in one cell repeatedly is low, and such events would therefore not result in a significant signal in our screens.

      We don’t think that off-target mutations can explain the discrepancy between pairs of gRNAs that should introduce the same mutation (Figure 3–S1). The effects of the two gRNAs are actually well correlated, but often one of the two gRNAs doesn’t pass our significance cut-off or simply doesn’t edit efficiently (i.e., most discrepancies arise from false negatives rather than false positives). We may therefore miss the effects of some mutations, but we are unlikely to draw erroneous conclusions from significant signals.

      5) In the protein screen normalization uses the total unique barcode counts. Does this efficiently correct for differences from sequencing (rather than total read counts or other methods)? It would be nice to see some replicate plots for the analysis of the fitness as well as the protein screen to be able to judge that.

      We made a new figure that shows a replicate comparison for the protein screen (see below; in the manuscript it is Figure 3–S1A) and commented on it in the manuscript. For this analysis, the eight replicates for each protein were split into two groups of four replicates each and analyzed the same way as the eight replicates. The correlation between the two groups of replicates is highly significant (p < 2.2e-16). The second figure shows that the total number of reads and the total number of unique barcodes are well correlated.

      For the fitness screen, we used read counts rather than barcode counts for the analysis since read counts better reflect the dropout of cells due to reduced fitness. The figure below shows a replicate comparison for the fitness screen. For this analysis, the four replicates were split into two groups of two replicates each and analyzed the same way as the four replicates. The correlation between the two groups of replicates is highly significant (p < 2.2e-16).
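
      The split-replicate comparison described in this response can be sketched in Python. The sketch assumes each replicate is a list of per-gRNA effects in a shared gRNA order and summarizes each half by a simple mean; in the manuscript each half was analyzed with the full pipeline, so the averaging step here is a stand-in assumption.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def split_half_correlation(effects_by_replicate):
    """Split replicates into two halves and correlate their mean effects.

    `effects_by_replicate` is a list of replicates, each a list of per-gRNA
    effects in the same gRNA order (hypothetical input format).
    """
    half = len(effects_by_replicate) // 2
    group_a = effects_by_replicate[:half]
    group_b = effects_by_replicate[half:]
    mean_a = [sum(vals) / len(vals) for vals in zip(*group_a)]
    mean_b = [sum(vals) / len(vals) for vals in zip(*group_b)]
    return pearson_r(mean_a, mean_b)
```

      With eight replicates this compares two groups of four, and with four replicates two groups of two, matching the splits described for the protein and fitness screens.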

      6) In the main text the authors mention very high agreement between gRNAs introducing the same mutation but this is only based on 20 or so gRNA pairs; for many more pairs that introduce the same mutation only one reaches significance, and the correlation in their effects is lower (Fig S4). It would be better to reflect this in the text directly rather than exclusively in the supplementary information.

      We clarified this in the manuscript main text: “For 78 of these gRNA pairs, at least one gRNA had a significant effect (FDR < 0.05) on at least one of the eleven proteins; their effects were highly correlated (Pearson’s R² = 0.43, p < 2.2E-16) (Figure 3–S1B). For the 20 gRNA pairs for which both gRNAs had a significant effect, the correlation was even higher (Pearson’s R² = 0.819, p = 8.8e-13) (Figure 3–S1C). These findings show that the significant gRNA effects that we identify have a low false positive rate, but they also suggest that many real gRNA effects are not detected in the screen due to limitations in statistical power.”

      7) When the different gRNAs for a targeted gene are combined, instead of using an averaged measure of their effects the authors use the largest fold-change. This seems not ideal to me as it is sensitive to outliers (experimental error or background mutations present in that strain).

      We agree that the method we used is more sensitive to outliers than averaging per gene. However, because many gRNAs have no effect either because they are not editing efficiently or because the edit doesn’t have a phenotypic consequence, an averaging method across all gRNAs targeting the same gene would be too conservative and not properly capture the effect of a perturbation of that gene.
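
      The trade-off between the two aggregation strategies can be made concrete with a small Python sketch. The function name and the dictionary-based input format are hypothetical; the sketch only contrasts taking the largest-magnitude fold-change with averaging across all gRNAs of a gene.

```python
def aggregate_by_gene(grna_effects, grna_to_gene):
    """Summarize per-gRNA log fold-changes per gene in two ways.

    "largest" keeps the signed fold-change with the greatest magnitude;
    "mean" averages over all gRNAs, including those with no effect.
    """
    per_gene = {}
    for grna, lfc in grna_effects.items():
        per_gene.setdefault(grna_to_gene[grna], []).append(lfc)
    return {
        gene: {
            "largest": max(values, key=abs),
            "mean": sum(values) / len(values),
        }
        for gene, values in per_gene.items()
    }
```

      For a gene with three gRNAs whose effects are 0.0, 0.1 and -2.0, the largest-magnitude summary is -2.0 while the mean is about -0.63, which illustrates how inactive gRNAs dilute an averaged estimate and how a single outlier can dominate the maximum.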

      8) Phenotyping is performed directly after editing, when the base editor is still present in the cells and could still interact with target sites. I could imagine this could lead to reduced levels of the proteins targeted for mutagenesis as it could act like a CRISPRi transcriptional roadblock. Could this enhance some of the effects or alter them in case of some missense mutations?

      To reduce potential “CRISPRi-like” effects of the base editor on gene expression, we placed the base editor under a galactose-inducible promoter. For both the fitness and protein screens we grew the cultures in media without galactose for another 24 hours (fitness screen) or 8-9 hours (protein screens) before sampling. In the latter case, this recovery time corresponded to more than three cell divisions, after which we assume base editor levels to have strongly decreased, and therefore to no longer interfere with transcription. This is also supported by our ability to detect discordant effects of gRNAs targeting the same gene (e.g., the two mutations leading to loss-of-function and gain-of-function of RAS2), which would otherwise be overshadowed by a CRISPRi effect.

      9) I feel that the main text does not reflect the actual editing efficiency very well (the main numbers I noticed were 95% C to T conversion and 89% of these occurring in a specific window). More informative for interpreting the results would be to know what fraction of the alleles show an edit (vs wild-type) and how many show the 'complete' edit (as the authors assume 100% of the genotypes generated by a gRNA to be conversion of all Cs to Ts in the target window). It would be important to state in the main text how variable this is for different gRNAs and what the typical purity of editing outcomes is.

      We now show the editing efficiency and purity in a new figure (Figure 1B), and discuss it in the main text as follows: “We found that the target window and mutagenesis pattern are very similar to those described in human cells: 95% of edits are C-to-T transitions, and 89% of these occurred in a five-nucleotide window 13 to 17 base pairs upstream of the PAM sequence (Figure 1A; Figure 1–S2) (Komor et al., 2016). Editing efficiency was variable across the eight gRNAs and ranged from 4% to 64% if considering only cases where all Cs in the window are edited; percentages are higher if incomplete edits are considered, too (Figure 1B).”
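
      The "complete edit" assumption mentioned in the comment can be sketched as follows. The sketch assumes a 20-nucleotide protospacer written 5' to 3' on the edited strand, with the base adjacent to the PAM counted as 1 bp upstream, so that the window 13 to 17 bp upstream of the PAM corresponds to protospacer positions 4 to 8. The function names are hypothetical, and real outcomes are heterogeneous as discussed above.

```python
def edit_window_indices(protospacer_length=20, window_from_pam=(13, 17)):
    """0-based protospacer indices covered by the editing window.

    A base `n` bp upstream of the PAM sits at index `protospacer_length - n`.
    """
    near, far = window_from_pam
    return range(protospacer_length - far, protospacer_length - near + 1)


def complete_edit(protospacer):
    """Convert every C inside the editing window to T (idealized outcome)."""
    chars = list(protospacer)
    for i in edit_window_indices(len(protospacer)):
        if chars[i] == "C":
            chars[i] = "T"
    return "".join(chars)
```

      Comparing the sequenced alleles of a target site against the output of `complete_edit` is one simple way to separate complete edits from partial edits and wild-type alleles.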

      Comments regarding findings

      10) It would be nice to see a comparison of the results to the effects of ~1500 yeast gene knockouts on cellular transcriptomes (https://doi.org/10.1016/j.cell.2014.02.054). This would show where the current study extends established knowledge regarding the regulatory inputs of each protein and highlight the importance of directly measuring protein levels. This would be particularly interesting for proteins whose abundance cannot be predicted well from mRNA abundance.

      We agree with the reviewer that it would be very interesting to compare the effect of perturbations on mRNA vs protein levels. We have compared our protein-level data to mRNA-level data from Kemmeren and colleagues (Kemmeren et al., Cell 2014), and we find very good agreement between the effects of gene perturbations on mRNA and protein levels when considering only genes with q < 0.05 and Log2FC > 0.5 in both studies (Pearson’s R = 0.79, p < 5.3e-15).

      Gene perturbations with effects detected only on mRNA but not protein levels are enriched in genes with a role in “chromatin organization” (FDR = 0.01; as a background for the analysis, only the 1098 genes covered in both studies were considered). This suggests that perturbations of genes involved in chromatin organization tend to affect mRNA levels but are then buffered and do not lead to altered protein levels. There was no enrichment of functional annotations among gene perturbations with effects on protein levels but not mRNA levels.
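
      An enrichment of this kind is commonly assessed with a one-sided hypergeometric test against the chosen background. The sketch below is a generic illustration of that test, not necessarily the method used for the analysis above; the function name is hypothetical and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(background, annotated, selected, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    background: number of genes considered (e.g. those covered in both studies)
    annotated:  background genes carrying the functional annotation
    selected:   genes in the list being tested
    overlap:    selected genes that carry the annotation
    """
    upper = min(annotated, selected)
    total = comb(background, selected)
    tail = sum(
        comb(annotated, k) * comb(background - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

      Restricting `background` to the genes covered in both studies, as done above, keeps the test from rewarding annotations that are simply over-represented in the perturbed gene set.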

      We did not include these results in the manuscript because there are some limitations to the conclusions that can be drawn from these comparisons, including that our study has a relatively high number of false negatives, and that the genes perturbed in the Kemmeren et al. study were selected to play a role in gene regulation, meaning that differences in mRNA-vs-protein effects of perturbations are limited to this function, and other gene functions cannot be assessed.

      11) The finding that genes that affect only one or two proteins are enriched for roles in transcriptional regulation could be a consequence of 'only' looking at 10 proteins rather than a globally valid conclusion. Particularly as the 10 proteins were selected for diverse functions that are subject to distinct regulatory cascades. ('only' because I appreciate this was a lot of work.)

      We agree with this, and we think it is clear in the abstract and the main text of the manuscript that here we studied 11 proteins. We made this point also more explicit in the discussion, so that it is clear for readers that the findings are based on the 11 proteins and may not extrapolate to the entire yeast proteome.

      Reviewer #3 (Public Review):

      This manuscript presents two main contributions. First, the authors modified a CRISPR base editing system for use in an important model organism: budding yeast. Second, they demonstrate the utility of this system by using it to conduct an extremely high-throughput study of the effects of mutation on protein abundance. This study confirms known protein regulatory relationships and detects several important new ones. It also reveals trends in the type of mutations that influence protein abundances. Overall, the findings are of high significance and the method appears to be extremely useful. I found the conclusions to be justified by the data.

      One potential weakness is that some of the methods are not described in the main body of the paper, so the reader has to really dive into the methods section to understand particular aspects of the study, for example, how the fitness competition was conducted.

      We expanded the first section for better readability.

      Another potential weakness is that the comparison of this study (of protein abundances) to previous studies (of transcript abundances) was a little cursory and left some open questions. For example, is it remarkable that the mutations affecting protein abundance are predominantly in genes involved in translation rather than transcription, or is this an expected result of a study focusing on protein levels?

      We thank the reviewer for pointing out that this paragraph requires more explanation. We expanded it as follows: “Of these 29 genes, 21 (72%) have roles in protein translation—more specifically, in ribosome biogenesis and tRNA metabolism (FDR < 8.0e-4, Figure 5C). In contrast, perturbations that affect the abundance of only one or two of the eleven proteins mostly occur in genes with roles in transcription (e.g., GO:0006351, FDR < 1.3e-5). Protein biosynthesis entails both transcription and translation, and these results suggest that perturbations of translational machinery alter protein abundance broadly, while perturbations of transcriptional machinery can tune the abundance of individual proteins. Thus, genes with post-transcriptional functions are more likely to appear as hubs in protein regulatory networks, whereas genes with transcriptional functions are likely to show fewer connections.”

      Overall, the strengths of this study far outweigh these weaknesses. This manuscript represents a very large amount of work and demonstrates important new insights into protein regulatory networks.

    1. Author Response

      Reviewer #2 (Public Review):

      In this paper, the authors identify topological metrics in gene-regulatory networks that have the potential to predict the sub-types of phenotypic steady states that the network can lead to. The results hold great value for the field of Theoretical Systems Biology.

      The paper becomes too technical too quickly and assumes a lot of knowledge from the reader. Equations and theoretical concepts are not always well defined. In general, I would recommend connecting the results from the simulations/topology metrics to EMP biology earlier in the paper. Alternatively, rather than investigating 5 networks related to EMP, the generalization of the statements could become stronger if the authors explore the trends of the theoretical analysis in networks modeling other biological processes (such as SCLC).

      One of the main findings of the paper is that the distance between the matrix of correlation values between nodes in all steady states obtained from simulation and the influence matrix indicates that the mean group strength is a good quantity to identify teams of nodes in the network. However, it remains unclear how to identify groups/teams in the networks based on influence: is it unsupervised (hierarchical?) clustering? How do the authors identify the number of teams of nodes in randomized networks?

      The authors also explore whether team structure correlates with the stability of relevant biological phenotypes. To characterize stability, they define static (e.g., frustration and steady state frequency) and dynamic network metrics (e.g., coherence and higher-order perturbations), and correlate them to the mean group strength in both WT and randomized networks. Results are promising: team structure and group mean strength show interesting correlative trends with both the static and dynamic metrics. However, everything relies on the mean group strength, which as mentioned before is not convincingly defined in randomized networks.

      Taken together, the conclusions of this paper would be better supported if a better explanation of team identification in gene-regulatory networks would be provided, and if networks related to other biological processes would be investigated.

      We thank the referee for their encouraging remarks and valuable suggestions about improving the manuscript. We are excited that the referee finds our results promising and of great value to the field of theoretical systems biology. Following the suggestions given here, we have now included further clarification on various aspects, included results for regulatory networks of melanoma and small cell lung cancer (SCLC, Fig 9, S11), and described in detail the algorithm used to identify teams in a given network (Methods).

    1. Author Response

      Reviewer #3 (Public Review):

      The manuscript by Barr et al., investigates the molecular phenotype, regulation by type 2 immunity, and function, of ectopic tuft cells that appear in the lungs of mice recovering from infection by the mouse-adapted PR8 strain of influenza A virus. They use reporter mice and either bulk or single cell RNA sequencing to reveal the molecular heterogeneity among tuft cells present in lungs of mice 43 days after PR8 infection. Lineage tracing using a Krt5-CreER driver line was used to demonstrate the basal cell origin of ectopic tuft cells and mice harboring homozygous null alleles for either Pou2f3, Trpm5, IL4Ra or IL25, were evaluated to determine roles for tuft cells and type 2 immunity in regulation of dysplastic epithelial remodeling. Their data confirm that ectopic tuft cells are derived from dysplastic Krt5-expressing cells that appear following PR8 infection, that pre-existing tuft cells play no role in basal cell dysplasia, and that ectopic tuft cells derived from dysplastic basal cells play no role in lung remodeling. Furthermore, they show that neither type 2 cytokines nor IL25, an upstream regulator of type 2 immune responses, play roles in regulating the pulmonary response to PR8 infection. Finally, they show that tuft cells are also induced in the lungs of bleomycin-injured mice and that the presence of tuft cells in alveolar regions of PR8-infected mice does not influence the inability of dysplastic basal cells to assume alveolar epithelial cell fates. The manuscript is well written and experiments were performed with rigorous experimental design and data of high quality. However, even though findings have potential importance and could be of interest, results seem preliminary and lack a strong rationale.

      Major concerns:

      1) Studies of tuft cells in the gut and their response to type 2 immunity, which were the basis for this line of investigation into ectopic tuft cells in the PR8-infected lung, have shown that tuft cells are part of a feed-forward loop leading to tuft cell expansion and enhanced type 2 immune responses including increased abundance of goblet cells. Since ectopic pulmonary tuft cells are derived from dysplastic basal cells after PR8 infection, rather than the reverse, this is clearly not the case in lungs of PR8 infected mice. Furthermore, since tuft cells are derived from hyperplastic basal cells in lungs of PR8-infected mice, it would seem unlikely that they impact the extent of basal cell hyperplasia.

      Ultimately the reviewer is correct in that the mechanisms at play in the post-flu lung promoting ectopic tuft cell expansion are clearly distinct from those in the small intestine. However, this was not a foregone conclusion, especially given that similar Type 2-dependent mechanisms clearly have a role in brush cell (now also termed tuft cell) expansion in the trachea. Regarding tuft cell influence on basal cell hyperplasia, we originally hypothesized that tuft cells differentiating from the migrating, proliferating basal cells may act in a feed-forward fashion to promote continued proliferation of the basal cells, akin to what happens upon tuft cell activation in the intestine. Nevertheless the Reviewer is correct in that our results show that basal cell hyperplasia is independent of tuft cell differentiation, and we feel this is valuable information for the field.

      2) Tuft cell expansion following parasitic infection of the gut and associated type 2 inflammation, and basal cell differentiation into tuft cells leading to their increased abundance following lung injury, are distinct processes and likely to be regulated through distinct mechanisms. As such, the rationale for investigating the roles of type 2 cytokines in the regulation of tuft cell appearance is rather weak. In the absence of data demonstrating how basal to tuft cell differentiation is regulated, this component of the study seems preliminary.

      Amplification of tuft cells in the small intestine (Gerbe et al., 2016; Howitt et al., 2016; von Moltke et al., 2016) and upper airways (Ualiyeva et al., 2021, Bankova et al., 2018) is either totally dependent on or highly influenced by Type 2 cytokines, respectively. Accordingly, it was critical to examine whether a similar mechanism was at play in the lung after influenza injury, i.e. promoting tuft cell amplification downstream of Type 2 cytokines. While our findings demonstrate that post-flu tuft cells arise largely independent of Th2 signals, new findings in other tissues published after submission of the current manuscript do indeed demonstrate Th2 / ILC2-independent functions of tuft cells (O’Leary et al., DOI: 10.1126/sciimmunol.abj1080). Our findings support the existence of novel mechanisms regulating tuft cell differentiation, and as the Reviewer suggests, we hope to uncover these mechanisms in future work.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors here follow up on roles for signaling pathways like ERK in epithelial patterning that have been studied in an emerging literature in both, broadly, the cell competition field and, more specifically, in mouse intestinal organoids. They employ timelapse microscopy to study behavior of human colonic organoids in monolayers as the organoids initially self-organize. They then follow maintenance of organization into densely clustered nodes that have increased cells in cell cycle and the remaining more sparsely populated regions with fewer cycling cells. Nodes also show markers of in vivo colonic stem cells (Lgr5 and myc). They follow propagation of ERK waves using a genetic tool (ERK-KTR) and show that they can emerge from single apoptotic cells in between nodes.

      Strengths of the study include novelty of showing self-organization and behavior of human organoids over time, with good resolution, using microscopy, as well as sophisticated analysis techniques to interpret and present cumulative data over many experiments. Additionally, the paper adds important pieces of the puzzle with respect to how cells may compete and respond across an entire monolayer, and the tools and approaches lend themselves to studying many genes and signaling pathways besides simply Wnt vs ERK.

      Weaknesses in the current version of the manuscript:

      1) The manuscript is focused nearly exclusively on ERK and Wnt but not in terms of the broader context of interpretation of the response of a monolayer to apoptosis of single cells. Some of the original work in the field showed that apoptotic cells enacted Rho- and MLCK-dependent actomyosin contractility, which was proposed to signal neighboring cells by initially pulling them inwards via the contraction (PMIDS: 9456322, 10459006, 11283606, 21721944). But a more intestine-specific literature has long-been extant following up on the critical role of ROCK and MLCK in maintaining barrier after specifically intestinal-cell apoptosis (15825080, 21237166).

      -- A suggestion would be 1) to cite the relevant literature and 2) to interpret some of the experiments within the cytoskeletal mechanistic context already known. In addition to comments about PMA and ERK activation (see next point), the authors could test whether the ERK waves cause myosin II activation and/or are ROCK/MLCK-dependent. Given ROCK inhibition is frequently used in organoid culture, this would seem an obvious avenue to explore. Does the ERK wave propagate the cytoskeletal changes to close the gap and increase centrifugal motility and/or conversely does the actomyosin tugging of the apoptotic cell trigger ERK activation? (Admittedly, the latter question may be hard to address). In short, there is a lot known about monolayer behavior in terms of dynamic cytoskeletal changes that can be addressed here to integrate with the Wnt/ERK roles.

      We completely agree that contractility and the cytoskeleton play vital roles in this process. We have added a section on this in the discussion and cited the relevant literature you suggested. We have conducted an unbiased screen for Erk wave dynamics and have several novel hits related to the mechanical aspect of this process. We are currently validating these hits and feel it would be too preliminary to include here. We are preparing a separate study that will focus on the role of mechanical signaling during Erk wave propagation.

      2) The authors use only PMA as an ERK activator. PMA is a broadly acting drug, principally known as a PI3K inducer. Obviously, Akt and other downstream action of PI3K means many other pathways are stimulated besides ERK. Indeed, ROCK and Src and other cytoskeleton-modifying pathways are modulated by PMA that may not correlate with the ERK effects. Additionally, the movies showing the effects of PMA treatment show a striking increase in apoptotic cells throughout the field, which would obviously confound the interpretation of what happens after relatively rare, internodal apoptotic cells die.

      -- A strong suggestion would be to increase the routes to ERK activation the authors use. This could be via receptor tyrosine kinase stimulation (again, like ROCK, EGF is a key organoid medium component), though obviously that would not be much more specific than PMA, but the authors use EGFr inhibition to block ERK, so wouldn’t stimulation be an apt converse approach? Genetic constitutively active KRAS might be introduced. Alternatively, there are pharmacological ways to increase pERK dramatically by inhibiting the dual action phosphatase (see eg PMID: 30475204 in a previous eLife paper). At the least, it would seem the authors should not use an approach that increases apoptosis dramatically.

      This is a great suggestion. We have added an additional figure describing a set of experiments that activate Erk through the expression of an oncogenic KRAS allele (G12V) under control of doxycycline. This resulted in increased uncoordinated Erk activity and loss of nodes. Further, we show that the Wnt inhibitor Pyrvinium also increased Erk activity in organoid monolayers and led to node loss. Consequently, we have tested three independent activators of Erk, all of which led to loss of the proliferative/stem cell niche.

      3) The movies clearly show many dividing cells that are between nodes, and they show apoptotic cells within nodes (eg movie 3a towards the end). While it's clear that apoptotic cells in internodal regions can elicit the wave behavior, it would seem that apoptosis does not universally do this, given the counter-examples.

      -- It would help if the authors could speak to this. Namely, in what cases are there no waves after apoptosis and what are the factors that might contribute (nearness to a node? nearness in time and space to another apoptotic cell?). Presumably, the events are relatively stochastic so there would be occasions for non-stereotypical behavior like wave front interference or augmentation in the case of closely located apoptotic cells.

      We agree. As shown in movie 3a, there are occasional cell death events in the proliferative region of organoid monolayers. We observed that these cell death events did induce waves but were less frequent compared to non-proliferative regions, as quantified in figure 3H. Cells within the proliferative compartment also contain elevated Wnt signaling, as shown by Top-GFP signal in figure S6 and LGR5 staining in figure 2B. The margin of the proliferative compartment is also the region where Erk waves tend to die off. Our hypothesis is that Wnt largely suppresses apoptosis and Erk waves.

      Reviewer #2 (Public Review):

      The work by Pond, et al., uses patient derived organoid monolayers to interrogate MAPK signaling in real-time using an ERK reporter. This technology was developed previously to use a target domain of ERK that responds to phosphorylation by altering nuclear-cytoplasmic localization. The active ERK kinase can be inferred by cytoplasmic localization of the reporter. The premise of the paper is that this reporter can be used in human organoid cultures to understand ERK signaling dynamics. Figures 1 and 2 demonstrate the monolayer culture properties and how stem-like and differentiated domains form within the cultures, validated using RNA FISH for MYC, LGR5, and KRT20. Figure 3 describes how an ERK wave radiates out from an apoptotic cell in the cultures, and that the living cells migrate towards the dying cell, presumably to sustain a barrier. In figure 4, data is presented showing that PMA-mediated activation of ERK disrupts the patterning of the monolayers, dispersing the nodes of cells associated with stem/proliferative identity. Finally, in figure 5, the authors show that treating cultures with Wnt3a suppresses ERK activity, while inhibiting ERK may expand WNT/stem cells in the cultures.

      The study is interesting and the model system has a lot of potential.

      However, there are some concerns about the novelty. The reasons for this are:

      1 - the monolayer system has been demonstrated before, very nicely in a 2018 Dev. Cell paper from the Altschuler lab and one of the current manuscript authors.

      2 - ERK-KTR reporters have been used to demonstrate apoptosis induced signaling waves in the epithelium (Gagliardi, 2021, Dev. Cell.)

      3 - ERK activity suppressing stem cell fate has been documented previously (Riemer, 2015; Leach, 2021; Reischmann, 2020; Tong, 2017)

      So while there are exciting aspects of the work, including use of human tissues and live imaging of pathway dynamics, I feel that the novel discoveries using these technologies are somewhat limited.

      Point 1: We agree that the Thorne 2018 paper showed the feasibility of 2D enteroid monolayers using mouse small intestine, yet it was not obvious that this approach would translate to human organoid models. We have demonstrated that this approach can be used for patient derived organoids from human colon, which contributes greatly to the translational potential. Additionally, a major challenge with organoids is tracking cells in space and time in 3D culture conditions. We have shown that these primary cultures can be combined with lentiviral live kinase reporters and are amenable to long term culture for the study of single cell dynamics of heterogeneous organoid cultures without laborious 3D image analysis.

      Point 2 and 3: We agree KTRs are a well-known and useful tool for studying single cell kinase dynamics. In mammalian cell lines (Gagliardi 2021) and Drosophila epithelium (Valon 2021), Erk waves driven by apoptosis were reported to prevent apoptosis in nearby cells and instruct movement to prevent barrier disruption. Here, we showed that Erk waves affect the patterning of the differentiated and stem cell compartments. Our work establishes 1) that Erk waves are found in human colonic epithelium, 2) that they affect the patterning of the differentiated and stem cell compartments, and 3) that Erk wave signaling is a fundamental part of human colonic epithelial homeostasis. The novelty of this report is connecting apoptosis-driven Erk dynamics to spatial partitioning of cell fates.

    1. Author Response

      Reviewer #1 (Public Review):

      The main result of the paper is a statistical dependence between the evolved size control strategy and the structure of the cell cycle, in that size control that manifests early (later) in the cell cycle tends to give adder- (weakly sizer-) like strategies. Notably, even when the final evolved network shows weak adder or weak sizer-like behaviour, they find strong sizer-like control in the evolutionary transient. Finally, they constrain the evolutionary algorithm to sense cell size only through stochastic fluctuations of protein concentrations and uncover a strategy that exhibits hallmarks of self-organised criticality.

      The questions studied by the authors are both interesting and timely, and their results are intriguing and well documented. On the whole, the conclusions are convincingly argued, and the authors do an excellent job of extracting qualitative features from their evolved networks. However, the manuscript is a little difficult to read, with the figures being crowded and difficult to parse. In addition, while there is a lot of detail in some places (as in the description of one particular feedback control strategy), other results are less fleshed out (such as statistical summaries of the different simulations). The manuscript would benefit from a sharper presentation of the results.

      We have done our best to tighten the writing and better focus on the main results of the paper. We have done this in response to the specific criticisms of the reviewers. However, most of the comments indicated that our manuscript was rather dense and so important points had been lost. Therefore, in the revision, we have mostly focused on increasing the clarity rather than condensing our prose further.

      A particularly interesting question addressed in the paper is why adders are more commonly found when sizers are believed to be better at controlling cell size. Here, the authors' simulations give two answers: first, that sizers tend to appear when cell size control is exerted later in the cycle (as in S. pombe). Second, that even when adders eventually evolve, the evolutionary transient passes through a strong sizer strategy. As the adder-vs-sizer question is repeatedly raised, it would strengthen the paper to have a longer and sharper discussion on (a) why early cell size control favours adders, and (b) why sizers appear as transients when fluctuations in cell size are large?

      We now clarify these key points and extend our discussion. The question as to why sizers appear as transients when fluctuations in cell size are large is more complex. We see repeatedly that sloppy sizers evolve first. But, these sizers are not necessarily that good at giving a low CV. Then, as the system continues to evolve, adders appear that are better at reducing CV than the noisy sizers. This emphasizes that the contribution to reducing the CV comes from two parts, first the slope contribution defining the relationship between the amount of growth in the cell cycle and the cell size at birth, and second, the amount of noise in this process, i.e., how variable the result will be for two cells born the same size. The system proceeds from a noisy sizer to a less noisy adder while reducing the CV as selected for. Thus, we speculate that in the later stages of evolution, where the system has already significantly reduced cell size variability, the ability to more accurately perform size control with less noise reduces the selection pressure on the slope so that adders tend to emerge. To address the comment, we have extended our discussion as to why early cell size control favors adders. We have broken the penultimate paragraph in the discussion into two parts where we now write:

      “Our evolution simulations gave insight into factors that bias evolution towards sizer or adder type control mechanisms (Fig. 4). First, it is worth noting that our evolution simulations were not deterministic. There was no one-to-one correspondence between a given evolutionary pressure and any one specific cell size control mechanism. Rather, our claims represent an average behavior observed over the course of many simulations. It is first worth noting that size control, as measured by the CV at a particular point in the cell cycle, has contribution both from the slope of the correlation between cell size and the amount of cell growth and from the amount of noise characterizing the differences between cells that are initially the same size (Di Talia et al., 2007). It is therefore possible that a low noise adder can produce a lower CV than a higher noise sizer. This is reflected in the evolutionary paths of some of our simulations, which traverse from a noisy sizer to a less noisy adder (Fig. 5). However, we anticipate even noisy sizers will be better than adders at controlling cell size in response to large deviations away from the steady state distribution. This is because sizers will always return the cell size to be within the steady state distribution within a cell cycle.

      In the selection of a size controlling G1 network followed by a timer in S/G2/M, we observed a prevalence of adders that is consistent with the prevalence of adders reported in the literature. While fewer in number, sizers have also been observed. That the most accurate sizers have been observed in the fission yeast S. pombe (Fantes, 1977; Sveiczer et al., 1996; Wood & Nurse, 2015), and that this organism performs cell size control at G2/M rather than at G1/S led us to explore the effect of cell cycle structure on the evolution of cell size control. We found that controlling cell size later in the cycle in S/G2/M biases evolution away from adders and towards sizers. In retrospect, this result can be rationalized since any size deviations incurred earlier during the timer period can be compensated for by the end of the cycle with the sizer. However, when the order is inverted, any size deviations escaping a G1 control mechanism would only be amplified by exponential volume growth during the S/G2/M timer period. A second recent case exhibiting sizer control was found in mouse epidermal stem cells, which exhibit a greatly elongated G1 phase and a relatively short S/G2/M phase (Mesa et al., 2018; Xie & Skotheim, 2020). We found that if we increased the relative duration of G1 in our simulations by shortening the S/G2/M timer, we also see a bias towards sizer control. In essence, by extending G1 to a larger and larger fraction of the cell cycle the control system is gradually approaching a size control taking place at the end of the cell cycle, i.e., an S/G2/M size control. Taken together, these simulations suggest the principle that having size-dependent transitions later in the cell cycle selects for sizers, while having such transitions earlier selects for adders.”
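
      The statement that a low-noise adder can reach a lower CV than a noisier sizer can be made concrete with a toy linear model. This model is an assumption introduced here for illustration and is not the evolved network from the manuscript: division volume is V_d = a * V_b + b + noise (noise standard deviation sigma), followed by symmetric division. At steady state, the mean birth volume is b / (2 - a) and the variance is sigma^2 / (4 - a^2). In this notation a = 0 is a sizer and a = 1 is an adder; the slope of added growth against birth size is a - 1.

```python
from math import sqrt


def birth_cv(slope, sigma, mean_birth=1.0):
    """Steady-state CV of birth volume in a toy linear size-control model.

    Assumed model (illustrative only):
        V_division = slope * V_birth + offset + noise,  noise ~ sd sigma
        V_birth(next) = V_division / 2
    The offset is implicitly mean_birth * (2 - slope), so that the
    stationary mean birth volume equals mean_birth for every slope.
    """
    if not 0 <= slope < 2:
        raise ValueError("slope must lie in [0, 2) for a stable size distribution")
    variance = sigma ** 2 / (4 - slope ** 2)
    return sqrt(variance) / mean_birth


# A noisy sizer versus a less noisy adder at the same mean birth volume.
cv_noisy_sizer = birth_cv(slope=0.0, sigma=0.20)
cv_quiet_adder = birth_cv(slope=1.0, sigma=0.15)
```

      Under these assumptions the adder has the lower CV whenever its noise is below roughly 0.87 times that of the sizer (sqrt(3)/2), which is one way to read the transition from a noisy sizer to a less noisy adder.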

      The final part of the paper, which describes a strategy based on sensing size through concentration fluctuations, is very interesting but brief, which is understandable given the quantity of results presented earlier in the paper. Nonetheless, it provides an excellent example of the power of the authors' approach.

      Overall, the results in this paper are a compelling addition to the recent interest in cell size control.

      We thank the reviewer for their careful reading of our manuscript and their support.

      Reviewer #2 (Public Review):

      The use of evolutionary models to understand the emergence of cell size control is novel and interesting. One strength of the approach is that simulations do not impose any mechanistic model for cell size control, rather the feedback motif for size control emerges from optimisation of chosen fitness functions. This allows the authors to come up with various size control motifs for given evolutionary pressures and model rules. Interestingly, the authors find that there is no one-to-one correspondence between specific size control mechanisms and evolutionary pressures, rather size control mechanisms are dependent on cell cycle structures. The authors also evolve a size control model based on the sensing of protein concentration fluctuations. This model exhibits interesting features such as self-organized criticality and the existence of very large cells that achieve size homeostasis by undergoing rapid cell divisions. The authors' model, however, comes with many arbitrary choices and assumptions that need further justifications and theoretical results should be compared with experimental data to establish the applicability of the model.

      We thank the reviewer for their careful reading of our manuscript and have worked to address its previous shortcomings as described below.

      Major Comments:

      1) Fitness function choices: Two fitness functions are used for the majority of this paper, number of cell divisions and CV_birth. What motivates the choice of these fitness functions and how do they relate to single-cell fitness?

      We added some text describing the choice of fitness function in the Supplement in the S3A - Fitness subsection. Using the number of cell divisions as a fitness makes sense since the higher the number of divisions in a given window of time, the bigger the population, which corresponds to the classical Darwinian fitness. Adding CV as an extra fitness specifically pushes the system towards better size control, which is the problem we aim to study, and also helps the optimization process. This is an effective way to include in our simulated evolution all observed detrimental effects observed when cell size is not controlled well. In the methods section we write:

      “We impose two evolutionary selection pressures in the form of two fitness functions. The first fitness function is simply the number of cell divisions during a long period, which we call NDiv. This is consistent with the classical definition of fitness as optimizing the number of offspring and is to be maximized by the algorithm. The second fitness function is the coefficient of variation of the volume distribution at birth for those NDiv generations, which we call CVBirth and is to be minimized by the algorithm. This penalizes broad distributions of volume at birth, which are detrimental to cell size homeostasis, which is what we aim to examine here.”
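
      A minimal sketch of how the two fitness values described above could be computed from one simulated lineage is given below. The function names and the input format (a list of birth volumes, one per division) are hypothetical, and the Pareto-style comparison is an assumption about how two objectives might be combined; it is not a description of the actual evolutionary algorithm.

```python
from statistics import mean, pstdev


def fitness(birth_volumes):
    """Return (n_div, cv_birth) for one simulated lineage.

    birth_volumes: hypothetical list with one birth volume per division
    recorded during the simulation window.
    n_div is to be maximized, cv_birth is to be minimized.
    """
    n_div = len(birth_volumes)
    if n_div < 2:
        # Too few divisions to define a spread: treat as worst possible CV.
        return n_div, float("inf")
    cv_birth = pstdev(birth_volumes) / mean(birth_volumes)
    return n_div, cv_birth


def dominates(a, b):
    """True if fitness pair a is at least as good as b on both objectives
    and strictly better on at least one (assumed Pareto comparison)."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


tight = fitness([1.0, 1.0, 1.0])
broad = fitness([1.0, 3.0])
```

      With two objectives, a network that divides often but has a broad birth-size distribution and a network that divides less often with a tight distribution are not directly ranked against each other, which avoids the all-or-nothing selection of a single fitness function.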

      Since the selection for tight size distribution is enforced via minimization of CV_birth, the model is unlikely to explain the timer control that is observed in some parts of the cell cycle. The authors discuss how a single fitness function results in all-or-nothing selection in the evolutionary algorithm, however, a third simultaneous fitness function is not considered. Are the results of this paper robust with respect to the addition of other selection pressure (for instance, optimization of growth rate)? This is a crucial question that is not addressed in the text.

      While we could always add more fitness functions, we have to start somewhere. The two fitness functions we use make the most sense for the problem we are interested in, and allow us to obtain some clear results from the examination of an already complex starting point. Adding more than two fitness functions greatly increases the complexity of the problem. In fact, we are not aware of any work in the field of computational evolution using more than two fitness functions. One reason is that simulated evolution under control of two fitness functions is already not well understood in general (as we discussed previously in Francois & Siggia, Physical Biology 2008; Henry et al Plos Comp Bio 2018). We hope our simulations will inspire other work in this direction.

      2) Cell-cycle structure not considered to be changeable in evolution: Based on the presented details of the evolutionary algorithm, the network topology parameters are varied but not the temporal structure of the cell cycle, i.e. timer in G1/S and sizer S/G2/M or sizer in G1/S and timer in S/G2/M, etc. How do you justify evolution in one part of the cell cycle but not in the other? Do your results hold when the temporal structure is permitted to evolve?

      We are very interested in how the network structure affects the results. To address this point, we inverted the size-dependence of the cell cycle phases as suggested by the reviewer, i.e., we considered a fission yeast-like network with a timer in G1 and a sizer in S/G2/M (see Fig. 4,5, and S10). The possibilities for performing different types of evolution experiments are almost endless. We therefore restricted our examination to cases inspired by naturally occurring networks in well studied model organisms such as budding and fission yeasts. While it is in principle possible that size control could take place in multiple cell cycle phases, we do not yet know of a naturally occurring example and so chose not to explore this possibility at the present time. Nevertheless, the reviewer is raising a very interesting question as to why evolution selecting for cell size control tends to pick one or another cell cycle phase, but possibly not both, in a particular organism. We do not know the answer to this question at present and refrain from attempting to address it since our manuscript is already quite dense. Future work can explore this interesting direction.

      3) Noise sources: The authors consider noise in protein quantity or concentration while neglecting noise in growth rate or division. Can the assumption that growth noise is negligible compared to protein production noise be supported by experimental data? This is a crucial assumption that is not supported by a discussion of physical values or citations. In addition, it is assumed later in the supplement (S132-133) that there is no division noise without presenting justification for why that noise is negligible on the scale of protein production noise.

      As for many other points raised by the referee, there is a necessary balance to achieve between biochemical realism and simplifying assumptions to theoretically study such problems. Of course we fully agree with the reviewer that there are multiple sources of noise in the system. In this study, we chose a hierarchical way of introducing noise in the system, starting with the biggest contributing factor and incrementally adding sources of noise if needed. We chose to first focus on noise in the cell cycle phases themselves whose CV can be as high as 50% (cf Fig. 1 in Di Talia et al 2007 Nature). For this reason, we first introduced noise in the precise timing of the G1/S transition as well as in the timing of the S/G2/M phase duration. Next, we introduced protein production noise because it is larger than the noise associated with cell division and cell growth rate in several cases where it has been measured. For example, the CV of cell growth rate in a diploid budding yeast is ~14% (Di Talia et al 2007 Nature; cf Table S12). The noise in partitioning at cell division is easier to measure in symmetrically dividing cells. For human cells grown in culture, division noise is ~10% (cf Fig. 3G in Zatulovskiy et al 2020 Science). In contrast, noise in protein concentrations is typically higher. This can be seen in the examination of molecular noise across all GFP labeled proteins in budding yeast (Newman et al, Nature 2006, PMID: 16699522). The CV in concentration of regulatory proteins in similarly sized cells is ~20-30% which is larger than noise in division by partitioning or noise in cell growth rate. We therefore next focused our analysis on the effects of protein production noise.

      In revising our manuscript, we now also consider noise in cell growth rate and noise in partitioning of mass at division as suggested by the reviewer. This results in slightly weaker control and more noise, in line with our intuition. However, broadly speaking, our results are unchanged (see new supporting figures Fig. S6-S7 shown below). We now describe the logic of our series of simulations of increasing complexity in the methods section, which has two new paragraphs that read as follows: “In this study, we chose a hierarchical way of introducing noise in the system, starting with the biggest contributing factor and incrementally adding additional sources of noise in subsequent analyses. All simulations presented include noise (stochastic control of G1/S transition and timing of S/G2/M, see below) in the cell cycle phases, whose CV has been found to be as high as 50% (Di Talia et al., 2007). Then, we introduced protein production noise via Langevin noise because the CV of regulatory protein concentrations is typically 20-30% (Newman et al., 2006). Importantly, the cell volume also contributes to stochastic effects, which are larger in smaller cells with fewer molecules. Thus, for stochastic simulations, we include a multiplicative 1/√V contribution to the added Gaussian noise term (see more complete description in the Supplement).

      We also checked that our results are largely invariant when adding other sources of noise (see Figs. S5-S7). In these simulations, we also included noise in cell growth rate (CV ~15%; e.g., Di Talia et al., 2007) and in mass partitioning at cytokinesis (CV ~10%; e.g., Zatulovskiy et al., 2020).”
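      The volume-dependent Langevin term mentioned in the quoted methods text can be sketched as a single Euler-Maruyama update. This is an illustration under stated assumptions, not the authors' implementation: the drift (constant synthesis balanced by degradation and dilution), the parameter names `k_syn`, `k_deg`, `growth_rate`, and `sigma`, and their default values are all hypothetical. Only the 1/√V scaling of the Gaussian term is taken from the text.

```python
import math
import random


def langevin_step(c, volume, dt, rng,
                  k_syn=1.0, k_deg=0.5,
                  growth_rate=math.log(2.0), sigma=0.1):
    """Advance a protein concentration c by one time step dt.

    Drift: synthesis minus degradation and dilution (assumed form).
    Noise: Gaussian, scaled by 1/sqrt(volume) so that smaller cells,
    which hold fewer molecules, fluctuate more.
    """
    drift = k_syn - (k_deg + growth_rate) * c
    noise = sigma / math.sqrt(volume) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    # Concentrations cannot be negative, so clip at zero.
    return max(0.0, c + drift * dt + noise)
```

      Under this sketch, a cell with a quarter of the volume receives a noise kick twice as large for the same random draw, which is the sense in which stochastic effects are larger in smaller cells.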

      4) Types of biochemical interactions considered: It is assumed that inhibitor protein production rate scales with cell volume. Is this assumption supported by data? The assumption is contrary to the production rate of the inhibitor protein Whi5 in budding yeast, which does not scale with cell volume.

      In general, most proteins are at relatively constant concentration as cells grow. This means that their production rate (measured in number of proteins per time) has to scale in proportion to cell volume. As noted by the reviewer, Whi5 in budding yeast is an exception to this general rule: its production rate does not scale with cell volume. This is why Whi5 is diluted by growth, leading to a sizer in G1. However, allowing the network to generate size control with a diluted inhibitor starting point is basically too simple because it would start with a size sensor and would not need to evolve any feedback mechanism. Here, we are focused on exploring how cell size control can be done by a network with multiple feedbacks rather than just the concentration of a single protein. We made those points more explicit in the text, which now includes the following sentences in the methods section: “We note that we are not allowing the cell to employ proteins such as Whi5 in budding yeast whose production is independent of cell size so that its concentration is a direct readout of cell size (Schmoller et al 2015; Swaffer et al 2021). We chose to do this because we want to explore how cell size control can be done by a network with multiple feedbacks rather than just the concentration of a single protein with a special dedicated synthesis mechanism.”

      5) Comparisons to data: Currently no attempt has been made to compare the model predictions quantitatively with experimental data that are easily available. For instance, how does the CV of cell birth size predicted by the model compare with cell size distribution in budding yeast or in the fission yeast? The same goes for the scaling of added volume with initial cell volume in different phases of the cell cycle. Furthermore, the noise parameters should also be calibrated to reproduce the cell size variability seen in experiments.

      To facilitate the comparison of our evolution simulations with model organisms we have included Table S1 in the supporting material, where we show the published results for budding yeast, fission yeast, and mammalian cells grown in culture and mouse epidermal stem cells growing in the animal. In fact, it turns out that distribution and CV that we obtained in our simulations are relatively similar in some cases to what is observed experimentally, but can also be much lower and exhibit a tighter control when optimized. However, the comparison is not perfectly fair since the model organisms were grown in laboratory conditions rather than their natural environment for which they are likely more optimized.

      Reviewer #3 (Public Review):

      In this paper, Proulx-Giraldeau et al. develop evolutionary simulations to study how size control can emerge. In the first part of the paper, the authors initiate cell cycle simulations with a simple network that does not allow cell size sensing and ask what molecular networks can lead to size control after evolution. Results show that a wide range of network types allows size control, some of which are comparable to experimentally identified networks such as the dilution inhibitor model in budding yeast. In the second part of the paper, the authors use their framework to ask how the structure of the cell cycle, including the duration of G1 vs. S/G2/M and the form of size control in each of these phases (i.e. 'sizer' or 'adder'), affects the overall size control. While this is a very important question and the authors bring comprehensive and interesting answers, it is less clear that framing the findings in the context of evolution is meaningful. Indeed, the solutions for how the combination of strength of size control, noise levels, and respective duration of the phases can be found analytically/with simulations that are not 'evolving' the cell cycle structure. Additionally, the finding that a sizer in G1 can lead to an overall adder if it is followed by a timer in S/G2/M is only true if a significant amount of noise is added during the timer phase. At present, this finding is discussed as a result of 'evolution' which is confusing and the dependency of this conclusion on the level of noise during S/G2 does not appear very clearly.

      With more cautiously formulated conclusions and a better discussion of already established theoretical and experimental work, this paper will become more accessible to experimentalists and will be a very valuable contribution to the field of cell size control.

      We thank the reviewer for their careful reading of the manuscript and their thoughtful comments.

      Major suggestions:

      1) Fig 4-5. While the use of the evolution simulation seems interesting to identify which underlying network(s) can result in size control, the use of the same framework to compare the result of sizer+timer vs. timer+sizer is less easy to interpret. Previous analytical/simulation approaches have explored how noise & duration of the timer phase can alter the 'sizer' or 'adder' signature (see doi.org/10.1016/j.celrep.2020.107992, doi.org/10.3389/fcell.2017.00092, for example) and what evolutionary simulations add to this question is unclear.

      We thank the reviewer for pointing out this highly relevant work, which we now cite where appropriate at various places in the manuscript. We agree that several of our results could have been derived from non-evolutionary analysis as performed in this work (such as the conclusion that a sizer followed by a timer can yield an adder). However, many of our other results cannot. For example, we are interested in how a network based on constant concentrations of proteins can measure cell size. Our evolution simulations yield highly non-trivial networks which we then proceed to analyze. We now clarify the distinction between our approach using evolution simulations to the more traditional analytical approach in the discussion. We added the following text: “We note that these generic results of how sizers and adders can govern cell size homeostasis can be derived from more traditional analytical methods (Barber et al., 2017; Willis et al., 2020). However, our evolution simulations are particularly useful because the molecular networks that evolved give non-trivial insights into how the observed size homeostasis dynamics can be regulated.”

      – What is the authors' interpretation of why the optimization of Pareto vs. number of divisions yield different size control results (Fig. 4A)? Is it possible that these different fitness parameters allow for the evolution of different levels of noise/duration of the timer phase?

      This relates to what we discuss in the section “A two-step evolutionary pathway for cell size control”. We think the effect is intuitive: if there is no selection on CV, there is no reason for the system to evolve good noise control in general. Then, in the absence of secondary effects such as size-dependent growth rates, networks such as the one presented in Fig 5 A are essentially optimal for the number of divisions, and are pure sizers. This is not related to the timer phase as far as we can see. We added a few words at the end of that section to make this more explicit.

      – In the conclusion: 'G1 control is more conducive to the evolution of adders, while G2 control is more conducive to sizers', do the authors really believe that this is an evolutionary acquired trait, or are their observations instead the natural consequence of having a noise-adding phase (timer + multiplicative noise) after a phase with size control?

      We agree with the reviewer, i.e., the adder is a consequence of a noise-adding phase after the size control. We do not think this is necessarily an evolutionarily acquired trait. As discussed above, and now in our discussion, this result could have been found using traditional analytical approaches. That the result is similar in a computational evolution simulation is interesting because the flexibility of the PhiEvo algorithm might have allowed for different phenomenological results to emerge. That they did not do so further strengthens the intuition built up from the analytical approach.

      – A perfect sizer in G1, followed by a timer (with exponential growth) in S/G2/M would simply give an overall 'noisy sizer' (i.e. the slope of final volume vs. initial volume would still be 0 but with some variability around the slope). Only beyond a certain level of noise added in S/G2/M, would the sizer signature be lost. Would it be possible for the authors to perform simulations with different levels of noise (on the timer in S/G2) to help understand this conclusion better? This conclusion could be one of the most valuable to experimentalists studying different organisms.

      This is an excellent suggestion by the reviewer and we have performed these evolution experiments examining the effect of modulating the noise in the S/G2/M timer. We consider a CV in the timer of 0, 5, and 8%, corresponding to no, medium, and high noise respectively. The average duration of the timer is half the time it takes to double the cell’s volume. Having specified the S/G2/M timer parameters, we then evolved and selected networks as previously, and compared ensembles of 60 networks for each noise level. The results are in line with our and the reviewer’s intuition. Increasing the noise progressively leads to a loss of the sizer signature and increases the CV of cell size at birth. These results are described in a new paragraph in the results section “Modulating cell cycle structural constraints selects for sizers and adders”, which reads: “We next considered the effect of changing the amount of noise in the timer phase of the cell cycle. To do this, we examined the evolution of networks performing size control in G1 and where the S/G2/M phase had an increasing amount of noise. Increasing the noise in the timer progressively reduced the amount of size control done by the network (Fig. S5). This is likely because the fixed duration of S/G2/M allows the system to accurately reset protein concentrations for the subsequent cell cycle to promote accurate G1 control (Willis et al., 2020). We also examined the effects of adding noise to the cellular growth rate and to volume partitioning at division and found similar results (Fig. S6-S7).”

      The results are shown in the new supporting figure 5.
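      For orientation, the reviewer's idealized scenario (a perfect sizer in G1 followed by a noisy timer with exponential growth) can be written as a toy simulation. This is a hedged sketch, not the evolved networks analyzed in the manuscript: the hard size threshold `v_c`, symmetric division, the Gaussian timer, and all function names are assumptions made for illustration. Only the mean timer duration of half a doubling time and the timer CVs of 5% and 8% come from the response above.

```python
import math
import random
from statistics import mean, pstdev


def simulate_lineage(timer_cv, n_cycles=5000, v_c=1.0, seed=0):
    """Follow one lineage; return (birth volumes, division volumes).

    Time is measured in units of the volume doubling time.
    """
    rng = random.Random(seed)
    mean_timer = 0.5  # S/G2/M lasts half a doubling time on average
    v_birth = v_c / math.sqrt(2.0)
    births, divisions = [], []
    for _ in range(n_cycles):
        # Idealized sizer: G1 ends once the cell reaches v_c
        # (immediately if the cell is born larger than v_c).
        v_g1s = max(v_birth, v_c)
        # Noisy timer: Gaussian duration, clipped at zero.
        timer = max(0.0, rng.gauss(mean_timer, timer_cv * mean_timer))
        v_div = v_g1s * 2.0 ** timer  # exponential growth during S/G2/M
        births.append(v_birth)
        divisions.append(v_div)
        v_birth = v_div / 2.0  # symmetric division
    return births, divisions


def cv(values):
    """Coefficient of variation (population standard deviation / mean)."""
    return pstdev(values) / mean(values)


def slope(x, y):
    """Least-squares slope of y against x (nan if x does not vary)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx < 1e-12:
        return float("nan")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx
```

      In this idealization the slope of division volume against birth volume stays close to zero while the CV of birth volume grows with timer noise, which corresponds to the reviewer's 'noisy sizer'. The loss of the sizer signature reported in Fig. S5 concerns the evolved networks and is not reproduced by this toy model.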

      2) Some aspects of the mathematical formalism were unclear: - Working with the hypothesis that growth is exponential and at a constant rate is reasonable. However, the description of the scenario where growth modulation contributes to size homeostasis is incorrect. E.g. the statement 'cells further from the optimum size grow slower' is not accurate. If size control occurs via growth regulation, what is expected is a negative correlation between size and growth rate (big cells grow slow, small cells grow fast).

      To clarify this point, we have modified the sentence to read as: “In the first class, it is crucial that the growth rate per unit mass of a cell depends on cell size so that cells that are significantly larger than the optimum cell size grow slower.”

      – 'The quantity I is produced with a rate proportional to volume, degraded at a constant rate, diluted by cell growth': why is I diluted? Concentration should be constant if I increases at the same rate as volume. 'the quantity of I does not initially depend in any way on the volume'. Does the quantity of I not increase with volume (since concentration is constant)?

      The equation for the amount of I does not have a dilution term, but the equation for the concentration of I does. This is easy to see if you consider stopping synthesis of I but continuing cell growth. In the case where I is stable, the concentration of I would decrease in proportion to the growth rate of the cell, which is the dilution term. In the case of constant synthesis of I, the concentration is indeed constant at equilibrium and reflects a balance between protein synthesis and dilution and degradation (e.g., see Eq. S4).
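      The distinction between the amount and the concentration of I can be written out explicitly. The following is a sketch under assumed notation: $k_s$ is the synthesis rate per unit volume, $k_d$ the degradation rate, $\mu$ the exponential growth rate, and $c = I/V$ the concentration. These symbols are illustrative and may differ from those used in Eq. S4.

```latex
\begin{aligned}
\frac{dI}{dt} &= k_s V - k_d I, \qquad \frac{dV}{dt} = \mu V, \\
\frac{dc}{dt} &= \frac{d}{dt}\!\left(\frac{I}{V}\right)
  = \frac{1}{V}\frac{dI}{dt} - \frac{I}{V^{2}}\frac{dV}{dt}
  = k_s - (k_d + \mu)\,c, \\
c^{*} &= \frac{k_s}{k_d + \mu}.
\end{aligned}
```

      The term $-\mu c$ appears only in the concentration equation and is the dilution term. Setting $k_s = 0$ and $k_d = 0$ (synthesis stopped, stable protein) leaves $dc/dt = -\mu c$, so the concentration falls at the growth rate, while with constant synthesis it settles at $c^{*}$.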

      Fig. 3, The rescaling of the variables to tau and Veq was difficult to understand. Fig. 3A: If T_S/G2/M is at ~0.5 of the doubling time tau, how relevant is it to look at the behaviour of T_(Vc) for values of T_(Vc)/tau above 0.5 (and beyond 1)? Fig 3B: for which value of T(Vc) is the prediction made?

      Time is rescaled to the amount of time it takes to double the biomass. Volume was rescaled to the average volume at the G1/S transition for a population of cells at the size distribution's steady state. We realize now that this nomenclature is unclear, and have replaced Veq with <VG1/S>, which we believe is more clear.

      Because of the timer constraint, T_(Vc)/tau has to be at least 0.5, which corresponds to a G1 phase with 0 duration. But, in principle, T_(Vc)/tau could have any value larger than 0.5. The range of T_(Vc)/tau is set by the size control mechanism after we specify the range of Vc that we wish to examine. To clarify this, we now denote what parts of the plot correspond to cells increasing or decreasing in size.

      The prediction is the solid line and is made for a bit more than the range of cell sizes that we see in the steady state simulation. We think there is confusion about our nomenclature for a single point indicated on each line as ‘Added Veq’. This point represents the average amount of volume added at steady state. To clarify this we now label this as <∆V>.

      4) Discussion:

      – Including a discussion of previous theoretical work that explored the consequences of varying the relative duration of the timer and sizer phases would be valuable.

      As discussed above, we have now cited the previous theoretical work in the introduction, results, and discussion. We thank the reviewer for pointing out this omission.

      – A reason commonly evoked to explain why cells might show sizer vs. adder behaviour is the role of the growth mode: S. pombe is a sizer but is thought to grow linearly, E. coli behaves like a sizer when it grows slower than usual (see Walden et al. 2015). It would be helpful to mention this when discussing S. pombe and remind the reader that the findings of this paper are limited to exponential growth mode.

      As suggested, we clarify that our analysis is restricted to exponential growth rates and that S. pombe growth rates have been reported to deviate from exponential.

      – The paper seems to be focusing on the noise of the size control mechanism (i.e. probability of transitioning through G1/S based on levels of I) but does not address the question of other sources of noise (i.e. asymmetry at division). What do the authors think about the role of such sources of noise as selective pressure on size control mechanisms evolution?

      This point was also raised by referee 2. There is a necessary balance to achieve between biochemical realism and simplifying assumptions to theoretically study such problems. Of course we fully agree with the reviewer that there are multiple sources of noise in the system. In this study, we chose a hierarchical way of introducing noise in the system that starts with the biggest contributing factor and incrementally adds sources of noise if needed.

      In revising our manuscript, we now also consider noise in cell growth rate and noise in partitioning of mass at division as suggested by the reviewer. This results in slightly lower control, and more noise in alignment with our intuition. However, broadly speaking, our results are unchanged (see new supporting figures Figs. S6-S7). We now describe the logic of our series of simulations of increasing complexity in the methods section, which has a new paragraph that reads as follows: “In this study, we chose a hierarchical way of introducing noise in the system, starting with the biggest contributing factor and incrementally adding additional sources of noise in subsequent analyses. All simulations presented include noise (stochastic control of G1/S transition and timing of S/G2/M, see below) in the cell cycle phases, whose CV has been found to be as high as 50% (Di Talia et al., 2007). Then, we introduced protein production noise via Langevin noise because the CV of regulatory protein concentrations is typically 20-30% (Newman et al., 2006). Importantly, the cell volume also contributes to stochastic effects, which are larger in smaller cells with fewer molecules. Thus, for stochastic simulations, we include a multiplicative 1/√V contribution to the added Gaussian noise term (see more complete description in the Supplement).

      We also checked that our results are largely invariant when adding other sources of noise (see Figs. S5-S7). In these simulations, we also included noise in cell growth rate (CV ~15%; e.g., Di Talia et al., 2007) and in mass partitioning at cytokinesis (CV ~10%; e.g., Zatulovskiy et al., 2020).”

    1. Author Response

      Reviewer #1 (Public Review):

      “The synthesis and metabolism of sphingolipid (SL) are involved in wide range of biological processes. In the present study, the authors investigate the role of SPTLC1, one of the essential subunits of serine palmitoyl transferase complex, in both physiological and pathophysiological angiogenesis, via using inducible endothelial-specific SPTLC1 knockout mice. They found SPTLC1 deficiency in ECs inhibited retinal angiogenesis along with reducing several SL metabolites in plasma, red blood cells, and peripheral organs. In addition, the authors found SPTLC1 EC-KO mice are resistant to APAP-induced liver injury. Overall, the in vivo findings in the present study are of potential interest and the authors have given clear evidence that endothelial SPTLC1 is critical to retinal angiogenesis. However, the underlying mechanisms are completely lacking in the present study. Most of the evidence provided is circumstantial, associative, and indirect.”

      We appreciate the positive comments of the reviewer. We have addressed the reviewer’s concern regarding underlying mechanisms as detailed below.

      “To be specific,

      1. The authors found endothelial SPTLC1 is important to both angiogenesis and the plasma lipid profile. However, the authors did not present the data to demonstrate the relationship between them. The in vivo findings about the phenotype and the plasma lipid profile might be true and unrelated. It would be important to know whether supplementing the reduced lipid induced by SPTLC1 KO could rescue the angiogenesis related phenotype in mice, or, whether the alternative way to inhibit the SL synthesis could mimic the phenotype of KO mice.”

      In the manuscript, we discussed the possibility that S1P is involved, since it is one of the most down-regulated SLs in the plasma and a major regulator of angiogenesis. We think it is unlikely that reduced plasma S1P is responsible for the phenotype. First, the retinal angiogenesis defect in Sptlc1 ECKO mice is the opposite of that in S1pr1 ECKO mice, as we have published previously (PMID: 22975328, PMID: 32059774). Moreover, deletion of sphingosine kinase, the enzyme that produces S1P, in the endothelium does not influence retinal angiogenesis at P6 (Figure 3 Supplement 2 A and B). Loss of the S1P chaperone ApoM (i.e., Apom KO, which exhibits a 50% reduction of plasma S1P) does not change retinal vascular development (Figure 3 Supplement 2 C and D). Taken together, our results strongly suggest that reduction in plasma S1P is not the cause of the vascular defect in Sptlc1 ECKO retinas.

      Based on our results in the manuscript, loss of SPT enzyme activity in endothelial cells reduced SL species in the endothelial cells and the plasma. Our in vitro and VEGF intraocular injection experiments (new data) suggest that the angiogenic defects seen in Sptlc1 ECKO mice are due to cell-intrinsic defects in VEGF signaling and not due to changes in plasma SL levels. We have edited the discussion section to address this issue.

      “2. A major issue is that the present study did not reveal a real downstream target. It is possible that VEGF signaling might be impaired by SPTLC1 knockout as discussed by the authors. However, the authors did not demonstrate this point with data. Including both in vivo and in vitro data to evaluate the effects of SPTLC1 deficiency on VEGF signaling might further strengthen the hypothesis. Besides, with in vitro experiments, the authors might further find the critical metabolite(s) involved in VEGF signaling and angiogenesis.”

      As discussed above, we agree with the reviewer’s critique and have addressed this essential point with new experiments (both in vitro and in vivo) in Figure 5. Our new data show that the SPT pathway supplies the glycosphingolipid GM1, which is needed for efficient VEGF-induced ERK phosphorylation and tip cell formation.

      Reviewer #2 (Public Review):

      “Andrew Kuo et al. investigated the role of endothelial de novo sphingolipids (SL) synthesis using endothelial cell specific SPTLC1 knockout (ECKO) mice. They showed that these mice exhibited low concentration of various SL species in not only ECs but also RBC, circulation, and other non-EC tissues. They also showed that ECKO mice exhibited impaired angiogenesis in normal and oxygen-induced retinopathy models, consistent with the decrease of endothelial proliferation and tip cell formation. They finally revealed that these mice were resistant to acetaminophen-induced acute liver injury in early phase. The experiments were well-designed, and the results were clear and convincing. The authors concluded that endothelial cells were the major source of SL in circulation and various organs (liver and lung) other than retina (and probably brain). The weakness of the current version of the manuscript is that the authors did not elucidate the mechanisms underlying the observed phenomena.

      1) The authors showed impaired angiogenesis in ECKO mice using neonatal retina model. Based on the fact that this phenotype was similar to that in endothelial VEGFR2 deficient mice, they suggested that VEGF responsiveness is altered in ECKO mice. Although this hypothesis is plausible, the authors would need to prove it by evaluating VEGFR signaling (VEGFR phosphorylation, Akt activation etc.) in ECKO mice.”

      We thank the reviewer for positive comments. As for the weakness identified, we have addressed this point by conducting new in vitro and in vivo experiments (detailed above). The new Figure 5 addresses this issue directly.

      “2) The acetaminophen-induced liver injury was reduced in ECKO mice in early phase. However, it is still unclear whether SL production itself affects liver injury. The authors discussed the possibility that gene deficiency increases unconsumed serine resulting in GSH increase, but it is essentially independent to SL. If possible, it would be good if the authors could investigate the effect of SL administration on the liver injury progression.”

      We appreciate the reviewer’s concern about the liver injury model in the Sptlc1 ECKO mice. Our data suggest that SL species supplied from ECs impact the hepatocyte response to stress. Since acetaminophen-induced liver injury is highly dependent on reactive oxygen species, the increased glutathione levels that we found in the Sptlc1 ECKO mice may be involved in the phenotype. However, we are simply considering them as biochemical markers of liver injury. This has been addressed in the discussion.

      “3) This paper showed the impaired cell proliferation in Sptlc1 KO EC mice, and discussed it. Authors described that this phenotype was similar to that of Nos3 KO mice, but its inconsistency with Sptlc2 ECKO adult mice was only justified by a word "isoform-selective function". Authors could quantify eNOS expressions in Sptlc1 KO mice, compared results and then discuss this matter.”

      In Figure 1C, we used eNOS as an EC marker to show purity during our EC isolation process. In fact, we did not observe a change in eNOS expression in Sptlc1 ECKO. We also did not detect elevated phospho-eNOS in Sptlc1 ECKO, in contrast to Sptlc2 ECKO adult mice (Figure 1 Supplement 4). Additionally, our work in the retina was performed in postnatal gene-deletion pups from P6-P17, which is different from the published Sptlc2 ECKO study. The differences in gene deletion strategy (early postnatal vs. adult) could result in differences in eNOS expression. We have added discussion about this issue.

    1. Author Response

      Joint Public Review

      1) The structures of the PDZ domains of PSD95 have been determined and they are well-folded and stable. In addition, the PSG module has been shown to adopt a stable structure after expression and purification. The authors should cite papers, their own and those by Zeng et al. (e.g. J. Mol. Bio, 2018), to reassure readers that the protein is not destabilized by the cysteine mutations. The authors need to state how many purifications of the mutants have been done and how many replicates have been made for the FRET measurements. Did the FRET data change over time?

      We appreciate the importance of selecting labeling sites that do not disrupt protein structure and activity. There are two protein constructs in this work: full-length PSD-95 and the PSG truncation of this same protein, which have been expressed hundreds of times over more than a decade in my lab. The cysteine mutations used in this work have all been validated as non-disruptive to the protein and the dyes in several ways. 1) We selected labeling sites using the available x-ray and NMR structures to ensure surface-accessible residues within alpha helices or short loops to minimize tertiary structural disruption; 2) we ensured that the two point mutations do not affect the expression and purification protocols. Misfolding or changes in conformation would be visible on elution profiles from chromatography as well as proteolytic cleavage patterns, which are sensitive to protein folding; 3) in our previous work, we measured both donor anisotropy and acceptor quantum yield for all of the variants in use here but one, which relied on existing sites in a new combination. Dyes involved in interactions with proteins or changes in dye environment would become apparent through changes in quantum yield and anisotropy. Any problematic labeling sites have been purged from the current work, which uses a small subset of the mutants from our earlier work. The repeatability of the expression and purification of all these constructs has been demonstrated in our published work and is not affected by the specific labeling mutants in use. The stability of these constructs is supported by the numerous other NMR and x-ray crystallography studies published on these robustly expressing proteins. To highlight this important issue, we have added additional discussion of the origin and validation of these mutants in the text on page 4 and in the methods section. We also included references to the tables of photophysical measurements for the library of PSD-95 cysteine mutants adapted for this study.

      We did not explicitly track the number of purifications used in this work, which spanned more than five years. We were not aware of any expectation to provide such records but will be more aware going forward. The measurements for this paper come from one or in some cases two protein expression runs, each of which generates 2 or more cell pellets. Each of these pellets generates a single affinity and ion exchange purified sample. This is then aliquoted and frozen, which may produce more than a dozen samples for fluorescent labeling. Individual labeled samples are given additional rounds of desalting and size exclusion chromatography immediately before measurements to ensure that the full-length proteins are used and that there has been no aggregation or degradation. In terms of repeatability, the data shown in this manuscript involve repeat measurements of the same constructs using different FRET dye pairs, collected on different instruments at different times, and still show excellent agreement. All of the measurements involve as few as one protein expression run and a minimum of two separate labeling and purifications for two independent sets of measurements. Some variants exceeded this standard but this was not tracked during this long study.

      Regarding the agreement of experimental observables across different protein preparations, one of the variants within the existing dataset (P2-S3) was measured on two experimental setups, two years apart, using two different expression runs, each with separate protein purifications and labeling reactions. Comparison of these measurements revealed that the mean FRET efficiency measured at Clemson was 0.70 while that measured at HHU was 0.71, and the mean DDA lifetimes were 2.29 and 2.4, respectively.
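      To give a feel for how close these two efficiencies are in distance terms, the following is a minimal sketch that inverts the standard Förster relation E = 1 / (1 + (R/R0)^6). It assumes a single static donor-acceptor distance, which is a simplification for a dynamic ensemble like the one described here, and it reports distances only as a ratio to the Förster radius R0 because no R0 is quoted in this response. The function name is hypothetical.

```python
def distance_over_r0(efficiency: float) -> float:
    """Invert E = 1 / (1 + (R/R0)**6) to get R/R0 for one static distance."""
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    return ((1.0 - efficiency) / efficiency) ** (1.0 / 6.0)


# Mean FRET efficiencies quoted above for the P2-S3 variant on the two setups.
ratio_clemson = distance_over_r0(0.70)
ratio_hhu = distance_over_r0(0.71)

# Relative difference between the two apparent distances; R0 cancels out.
relative_difference = abs(ratio_clemson - ratio_hhu) / ratio_clemson

print(f"R/R0 at E = 0.70: {ratio_clemson:.3f}")
print(f"R/R0 at E = 0.71: {ratio_hhu:.3f}")
print(f"relative difference: {relative_difference:.2%}")
```

      Under this single-distance assumption, the two apparent distances differ by less than 1%, because the sixth-root dependence compresses small efficiency differences near the middle of the FRET range.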

      2) The authors have not explained how the approach taken in this paper compares to their previous simulated annealing approach of mapping PDZ3 using FRET data in McCann et al., 2012. That study resulted in a model in which PDZ3 binds to a completely different interface, which is not mentioned in this manuscript.

      We apologize for this oversight and thank the reviewers for this reminder. The omission was an error of trimming the manuscript for brevity and we appreciate the opportunity to highlight how much our approach has improved over the intervening time. We have included commentary on our previous modeling in the revised discussion.

      3) The biochemical disulfide (DS) mapping experiments provide a useful check of predictions of the FRET and DMD conclusions. However, in order to interpret these correctly, the authors need to show data from negative controls testing cysteine pairs that are predicted NOT to interact.

      We agree that negative controls are a critical part of the disulfide mapping experiments and thank the reviewers for this suggestion. As a negative control, we selected a cysteine pair that showed low FRET in our 2012 PNAS paper (Q374C-K591C). This pair was not otherwise used in this work, nor is it involved in the contact interfaces identified from simulations or modeling. This cysteine pair showed no evidence of intramolecular disulfide formation. In the manuscript, we have provided an additional supplemental figure panel to document that this negative control sample does not form disulfides.

      4) The SH3-GUK domain of PSD95 can undergo domain swap dimerization and the dimerization is promoted by binding of the synGAP PDZ-ligand to PDZ3. The authors should mention the existence of domain-swap dimerization (citing McGee [2001] and Zeng et al. [2018]) and indicate whether they tested that the FRET-labeled proteins are monodisperse. This is particularly important in light of the high variation in diffusion time for individual variants - 0.91-10.19 ms (see also #10 below). In particular, the P3-G4 FRET variant has a long diffusion time of 10.19; could it be undergoing domain swap dimerization?

      We are very interested in the prospect of domain swapping as has been suggested previously. However, we have not seen evidence for this at the concentrations used here. As reported in our 2012 PNAS paper, both full-length PSD-95 and the PSG fragment are monodisperse as judged by size exclusion chromatography, which suggests a lack of stably populated oligomeric states under these conditions at 10⁻⁵ molar concentrations. The PSG fragment runs very true to its calculated formula weight, while the full-length protein does migrate faster than expected based on formula weight but not fast enough to be a dimer.

      The DS mapping experiments did reveal some higher molecular weight species. However, these higher order species never accounted for more than 5% of the total input. Thus, any intramolecular interaction is transient and not well occupied under the buffer conditions and concentrations used in these studies. Our size exclusion and disulfide mapping experiments are carried out at protein concentrations that are orders of magnitude higher than used for single molecule imaging. Thus, dimerization is unlikely at the single-molecule concentrations used for the present FRET experiments. If dimerization were to occur, we would expect the appearance of additional static subpopulations in the MFD histograms. If dimerization were significant, we would also expect the appearance of an additional diffusion term in fluorescence correlation curves, which was not the case in these experiments.

      5) On page 4, line 5 the authors state: "the number and occupancy of conformational states were set as global fitting parameters". This assumes that the protein is unbiased by the labeling and that the protein behaviour is independent of the purification batch. Have the authors verified this?

      The reviewers are correct in stressing the importance of quality control in the selection of labeling sites and reproducibility in sample preparation. The PSD-95 purification has been carried out hundreds of times in the Bowen lab using different variants. The cysteine mutations used in this work have all been validated as non-disruptive to the protein and the dyes in several ways. 1) We selected labeling sites using the available x-ray and NMR structures to ensure surface accessible residues within alpha helices or short loops to minimize tertiary structural disruption; 2) we ensured that the two point mutations don’t affect the expression and purification protocols. Misfolding or changes in conformation would be visible on elution profiles from chromatography as well as proteolytic cleavage patterns, which are sensitive to protein folding; 3) in our previous work, we measured both donor anisotropy and acceptor quantum yield for all of the variants in use here but one, which relied on existing sites in a new combination. We have ensured that sites with poor properties are never included in our published work. Indeed, the reproducibility of sample preparation, using chromatography before and after labeling, gives confidence that the attachment of fluorescent dyes is not altering macromolecular properties. For the dyes to change the protein structure, they would have to interact competitively with the protein interfaces or disrupt local structure. These would be expected to change the dye quantum yield or the anisotropy, which were each measured in our previous work. In addition, the multiparameter fluorescence detection includes anisotropy measurements of the current samples. None of these measurements reveal aberrant fluorophore behavior (Supplemental File 3C).

      This alone does not rule out that the dyes affect the conformational ensemble. One can take additional confidence that our protein handling workflow does not affect the results from the cross-methods agreement that we demonstrate in the current work. First, the agreement between confocal and TIRF measurements of both full-length PSD-95 and its PSG truncation boosts confidence. The labeled samples for each experiment were prepared from the same purified proteins but labeled independently with different dye pairs. The different dyes attached to the samples used for confocal and TIRF did not impact the time-averaged distances between these residue pairs, save for one slight outlier. Additionally, our cross-validation using disulfide mapping, which is entirely label free, provides additional confidence that the interdomain contact interfaces observed in the data collected using the labeled proteins are preserved when the labels are not present. Finally, independent DMD simulations of label-free PSG were in excellent agreement with the predominant states identified from rigid body docking based on experimental FRET distances and with the disulfide mapping.

      6) On line 6 the authors state: "Based on fitting statistics, we demonstrate that a two-state model with a small donor-only (or no FRET) population (Supplementary file 1C &D) is sufficient to fit all data.” From the average Χ2 this can be concluded, but for individual datasets sometimes a 1 state model or 3 state model seems more appropriate. The authors should explain why measuring more cys mutants justified using 'one unifying model'? How can the data contain donor-only contributions if pulsed-interleaved excitation (PIE) is used to select only molecules with active donor and acceptor fluorophores?

      We apologize for the lack of clarity as to how we arrived at the determination that two states were present in the conformational ensemble. The fitting statistics show that there is an improvement in global fitting upon increasing the number of states in the model from one state to multiple states. The statistics in the former Supplementary File 1C show significant improvement upon fitting with two states relative to one, while adding a third state marginally improves 3 variants and leaves the remaining 9 unchanged or slightly worse. The former Supplementary File 1D (now 3C) provides a list of the values for each of the constants that arise from fitting the two-state model to all datasets simultaneously and the individual fit statistic for fitting this model to the specific variant dataset. This table assigns the global population fractions and their associated donor lifetimes but was not used to assign the number of states. That there are two states is based solely on the improvement in fitting statistics with two states shown in the former Supplementary File 1C. Thus, the statistics do not justify including an additional state. Because this is such a critical point, we have moved the former Supplementary File 1C to the main text as Table 2 and added discussion to the manuscript to highlight how we arrived at a two-state model.
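      The logic of stopping at the number of states where the fit statistics stop improving can be sketched as follows. This is an illustration only: the acceptance rule (a minimum fractional drop in the mean reduced chi-square), the 5% threshold, the function name, and all numbers in the example are hypothetical placeholders, not the values or the exact criterion used for Table 2.

```python
from statistics import mean


def choose_state_count(chi2_by_states, min_improvement=0.05):
    """Pick the smallest state count after which the fit stops improving.

    chi2_by_states maps a number of states to the per-variant reduced
    chi-square values from the global fit with that many states.
    A step to more states is accepted only if the mean reduced chi-square
    drops by at least min_improvement (as a fraction of the previous mean).
    """
    counts = sorted(chi2_by_states)
    chosen = counts[0]
    for previous, current in zip(counts, counts[1:]):
        previous_mean = mean(chi2_by_states[previous])
        current_mean = mean(chi2_by_states[current])
        if (previous_mean - current_mean) / previous_mean >= min_improvement:
            chosen = current
        else:
            break  # further states are not justified by the statistics
    return chosen


# Hypothetical placeholder statistics for four variants (not real data).
example = {
    1: [1.90, 2.40, 1.70, 2.10],
    2: [1.10, 1.20, 1.00, 1.10],
    3: [1.10, 1.15, 1.00, 1.12],
}
print(choose_state_count(example))
```

      With these placeholder numbers, the step from one to two states is accepted and the step from two to three is rejected, mirroring the reasoning described above.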

      The reviewer is correct that a global fit of the dataset could result in suboptimal fits for individual FRET pairs in order to satisfy the global minimum. In this case, most variants were best fit by a two-state model. The reason for using one unifying model is our underlying assumption that the same conformational distribution for PSD-95 is sensed differently by each labeling combination. A primary conclusion from this assumption is that all variants share a population distribution. A secondary assumption is that protein handling is not biasing this conformational ensemble, which we verify as described above. Each measurement provides part of the same story, so we were only interested in models which simultaneously explained all observed FRET data, and as such enforced the single global model. A global fit also proved the best way to uniquely assign each distance to its corresponding state. Furthermore, the FRET Network Robustness analysis explicitly examined how much our model depends on any one labeling variant and found no systematic deviations. This revealed an ensemble of structures that satisfy the data without enforcing a global model for all samples simultaneously.

      We also thank the reviewer for correctly observing that we misapplied the term donor-only (DO) in the manuscript. The population we referred to is more appropriately termed a “No-FRET” or “low-FRET” population. The reviewer is correct that active, FRET-labeled molecules were selected using PIE parameters. We have corrected this in the manuscript.

      7) All variants are shown to be dynamic, but they are positioned differently on the dynamic FRET line (Fig. 1D and S3). Does the same kinetic model underlie each variant? If the same state occupancies are implied, then why not the same kinetic constants, especially for distances probing the same two domains?

      While the global population fraction is shared between variants, the transition rates for individual variants are not constrained. As such, the variants do not share a single equilibrium rate constant. While the FRET data is fit to two global states, our DMD simulations showed that there is substantial fuzziness within these global states. Thus, the full kinetic network is more complex than the global two-state transition. As our screening of DMD snapshots showed, each FRET variant is uniquely sensitive to the underlying conformational transitions. Hence, the system is underdetermined and we are not able to adequately determine forward and backward kinetic rates for each variant individually.

      It is important to recall that the data shown in multiparameter FRET histograms has been binned with millisecond time resolution, which is slower than the local conformational dynamics arising from fuzzy domain rearrangements. The position of the peak will depend on the underlying rate constants. Our Photon Distribution Analysis reveals the kinetic processes that dominate the broadening of the FRET efficiency distributions. This analysis also measures the fractions of the effectively “static” population. Fast transitions, which do not significantly contribute to changes in FRET efficiency (or broadening) on the binning timescale, will appear as static populations. Thus, the simple PDA model captures the broadening that is also present in MFD histograms, but does not adequately describe dynamics at the fastest timescales.

      8) Could the data also be explained by "fuzziness" within domains, without interdomain dynamics? The authors should discuss this given the possibility of domain swap dimerization of the SH3-GuK domain.

      In this work, we use the term fuzziness to refer to alternate residue interactions and domain orientations within a global contact basin. Using this definition, we do not expect significant structural rearrangements within the PDZ, SH3 or GuK domains. These domains are well folded and have been studied individually and in combination using x-ray crystallography and NMR, which did not reveal local distortions of the domain fold (e.g. SH3-GuK interactions). This is not to say that there are not conformational dynamics within loop regions or other small scale subdomain motions. Our rubric for selection of labeling sites is to avoid large loops to minimize the local dynamics as this conformational variability compromises the resolving power of the FRET restraint. Our DMD screening provides details as to how each FRET pair senses changes in local and global conformation. In comparison to the global changes extracted from the fluorescence lifetime decays, the intradomain dynamics are occurring rapidly on small length scale and are not expected to affect our global positioning of PDZ3. We do not observe a significant population of dimers or other multimers under the concentrations used for these experiments as discussed above.

      9) Regarding supplemental File 2: The authors should justify that PDA is an appropriate method to quantify relaxation time of Fluorophores. Dynamics being so fast, how do the authors explain that when binned in 2 ms time bins, discrete subpopulations in the PDA histograms are still clearly observed (e.g. Figure 2B, Fig. 2 supp 3)? Why would the protein move through certain very discrete states and not others? Doesn't this imply that the model is oversimplifying the actual mechanism (even though the Chi^2 is alright)? It is strange that for some mutants (fig 2 supp 3B P1G3) PDA displayed discrete states, while for others (e.g. fig 2 supp 3A P2G6) PDA histograms were smooth, implying it cannot be a low-histogram-count artifact. Or can it?

      We apologize for this confusion but the photon distribution analysis was not used to “quantify relaxation times” of the fluorophores, which comes from fitting of the lifetime decays. Rather, PDA was used to estimate the rates of exchange between limiting states (i.e. the inter-fluorophore distances derived from fitting the fluorescence decays). Obtaining the rates is accomplished by fitting time-binned FRET efficiency histograms with a model that accounts for broadening due to exchange between limiting states.

      We agree with the reviewers that the two-state model, which is sufficient to fit the lifetime decays, is too simplified to fully describe the dynamic exchange between limiting states. To address this, we performed the FNR analysis to describe the limiting state basins within which fast dynamics occur. This extends the model beyond two discrete limiting states. Further, DMD screening shows that different FRET variants do report differently on the underlying conformational landscape. Some exhibit a degree of degeneracy showing similar FRET efficiency for different conformations making each variant insensitive to specific subsets of possible transitions.

      Using fluctuation correlation analysis to probe FRET-induced changes in intensities, we observed dynamics on the 10⁻⁵ second timescale, which is much too fast to give rise to broadening in the fluorescence observable histograms. However, these dynamic transitions did not correspond to exchange between states with large differences in FRET efficiency because, if such fast dynamics involved a large change in FRET, this would be associated with a narrow distribution about the mean in MFD histograms. We explain the appearance of distinct peaks for some variants as an increase in the relative contribution of fast dynamics within limiting ensembles compared to the slower processes of exchange between limiting ensembles. This can occur without a relative shift in forward/backward exchange rates and with only a slight shift in the overall relaxation rates on the timescales to which PDA is sensitive (~0.01-1 ms).

      10) Regarding supp file 3A and Table S9: The spread on tdiff, (the average diffusion time through the confocal volume) for individual variants is very broad - 0.91-10.19 ms. Considering that the authors use global fits for many different parameters, it's surprising that they didn't use it for this parameter which should unbiasedly be the same for all the protein mutants, at least if all are well-behaved (i.e. non-aggregating). The high variation in tdiff may be a warning that the model is not accurately accounting for all dynamics. For example, might the P3-G4 variant be undergoing domain swap dimerization?

      We thank the reviewers for their observation and apologize for the confusion as to why there are differences in the diffusion time through the confocal volume for the different variants. We expect that there would be three distinct diffusion times because the samples were measured on two experimental setups using different confocal volumes and pinhole sizes. There are also two distinct protein constructs (full-length and PSG), which differ in molecular weight. The longest timescale processes included in the fFCS fits are attributed to long-timescale photophysical effects, such as blinking. As discussed above, we do not expect a significant population of dimers or other multimers at the pM concentrations used for these single molecule experiments.

      We agree with reviewers that the diffusion time for a given construct on a given instrumental setup should be a constant. In this light, we reanalyzed the filtered fFCS curves with enforced consistency for the diffusion times in measurements involving the same construct measured on the same setup. While this refitting slightly changed the values of fit parameters, none of these differences significantly affected the parameters used for modeling and therefore the conclusions of the paper have not been impacted. We have updated the manuscript to indicate the change in the fit models.

      11) In the results section, the authors state: "Summarizing the dynamics observed for the PDZ3-GuK variants, fFCS depicts three relaxation times." This is an overstatement because the authors imposed these three broad relaxation times. The authors should describe how they made these assignments. Is this common practice? Regarding Supplemental File 2 versus Supplemental File 3A: In principle, the relaxation time implied from fFCS and that from PDA should align. However, the 'Average' of fFCS and the T_R of PDA do not align. Is it possible that the dynamics analysis from PDA should have been constrained in some way by the results from fFCS? It would be useful to add error estimations for PDA here.

      We agree with the reviewer that it is an overstatement to say that the number of relaxation terms arises from the correlation analysis. We have removed this statement and instead focus on the differences in dynamics. The assignment of three relaxation terms was made to probe the extent of dynamics across decades in time as each time regime is typically associated with distinct forms of protein dynamics. We enforced these consistent timescales in order to directly compare amplitudes across different FRET variants. However, we do not enforce any assignment that dynamics arising from a particular type of exchange process occur at the same timescale.

      We also agree that obtaining agreement between PDA and fFCS is desirable. In our experience, such agreement is only obtainable for simple kinetic schemes when dynamics probed by fFCS and PDA all occur within the same relative timescales. Here, the contributions to dynamics occur across several decades in time including those obtainable only through fFCS analysis but too fast to be quantified by PDA. Using the methods we employed, we recover only the effective relaxation times rather than the absolute kinetic rate constants because the system is underdetermined. Differences for individual variants arise because the variants differ in sensitivity to specific transitions (Figure 8-Figure Supplement 1) while fFCS and PDA differentially report on the underlying kinetic scheme.

      12) Regarding the DS bond formation data, the authors state, "The α-basin variant showed slightly more DS formation than the beta-basin variant in full-length PSD-95 but the rates of DS formation were similar". It isn't clear what this means physically. It seems to suggest that there is static heterogeneity in the population, i.e. some proteins can and some proteins cannot form DS bonds. The presence of this effect may contradict the assumption that every state at some point interconverts to any other state, which underlies the FRET PDA analysis. The authors should discuss this possible inconsistency.

      We agree with the reviewer that this statement was not clear. It was never our intention that the DS formation kinetics be directly related to FRET data in this way. The goal of DS mapping experiments was to provide qualitative confirmation that supertertiary structures suggested by DMD and FRET experiments occur in solution. We meant to focus on the DS formation kinetics, which are an indication of structural proximity. The extent of DS formation comes from the fitting as a matter of course. The reactions progress to near completion (Figure 7-Figure Supplement 1). The differences in extent of disulfide formation, while real, are very small and we did not intend to highlight them. We have removed any discussion of the extent of DS formation in the manuscript.

      13) In the discussion of the DS experiments, the authors state, "We also observe significant kinetic differences when PSD-95 is truncated in agreement with FRET studies." This sentence is vague. The authors need to state more completely what they mean here. Exactly what is in agreement with the FRET studies?

      We agree with the reviewers that the claim was vague. We intended to communicate that the DS mapping is generally consistent with FRET experiments in that they confirm the proposed limiting conformational states. The formation of disulfides at these points confirms the accessibility and proximity of these sites with respect to one another within the supertertiary structure. Also, both DS mapping and fFCS observed changes when PSD-95 was truncated to the PSG fragment. However, the rates of DS formation are not directly comparable to the rates of conformational dynamics. We have removed this statement from the paper to avoid directly linking these two unrelated kinetic measurements.

      14) The text in the section on "Structural Modeling with Experimental FRET Restraints" is often unclear. The authors appear to have equated States A and B, formerly used only in the seTCSPC analysis to the alpha and beta basins extracted from the DMD snapshots. The authors should discuss whether there might be other conformations in the DMD results that would be consistent with the FRET-derived distances from seTCSPC? It seems possible that there could be, given that in Fig 6 sup 1, large discrepancies exist between simulated distances and FRET-measured distances for some of the FRET pairs. The authors should discuss explanations for the discrepancies that do not compromise the actual model.

      We apologize for the lack of clarity in our description of structural modeling with FRET restraints. We thank the reviewer for the suggestions as to how we can improve this discussion. In the course of this study, we do reach the conclusion that states A and B, obtained from modeling solely based on FRET data, are equivalent to conformations within the alpha and beta basins from DMD, respectively. Because the representative structures were obtained independently via distinct techniques, we felt that it would be premature to use the same terminology when we are introducing the FRET results.

      We agree that more than a single snapshot from DMD per basin appropriately satisfies the FRET restraints and that no one model satisfies all restraints equally. Our goal with the later FNR analysis, which explicitly incorporates FRET-derived restraints, was to identify ensembles of structural snapshots from DMD that are compatible with experimental data. Instead of finding the single best model for the full set of FRET-derived distances, each snapshot in the ensembles from FNR satisfies all distance thresholds independently. Thus, the ensembles from FNR do refer to both experiment and DMD.

      Further, the vertical lines shown in Figure 8 Figure Supplement 1 represent the distances from the initial global fit of all samples simultaneously. For some variants, this likely includes biases in certain distances due to the enforcement of this global model, which FNR seeks to alleviate. For SH3-GK FRET pairs, these deviations are most likely the result of the restraints placed on the motions of the GK domain in the DMD simulations.

      15) A weakness of the modeling approaches in this manuscript is that they are difficult to validate. Could the authors include a test of the modeling in which they show how small changes of the input FRET data would influence the final FRET-restrained model? Could they quantify their confidence in the final model, given all the limitations of the FRET data?

      We agree with the reviewers as to the importance of validating structural models regardless of the data modality used in their determination. We respectfully disagree that this study is lacking in model validation. In this work, we generated models based on confocal FRET data and validated the FRET models using independent DMD simulations and disulfide mapping. We also employed smTIRF measurements using a different dye pair to independently validate the time-averaged FRET from confocal measurements. While this may fall short of complete validation of the associated dynamic information, we feel that this represents the state of the art in model validation regardless of the experimental approach. While it is difficult to validate novel methods for deriving structural models, we feel that we have done so through cross-validation against other established techniques.

      As suggested, we did test the dependency of the experimental models on small changes in the input FRET data. To accomplish this, we used the same analysis framework described for FRET Network Robustness Analysis. Instead of removing datasets as in FNR, we introduced a random, artificial error into the FRET distance for each variant and repeated the classification of structures from DMD into the two basin ensembles using the altered distances. To check the dependence on the magnitude of the error, we drew the random error for each variant from between -5 and +5% or between -15 and +15% of the original distance. Each condition was repeated 3 times with different random errors. To compare conditions, we measured the change in the center of mass of the surface distribution composed from the individual PDZ3 centers of mass identified by that screen (Figure 8-Figure Supplement 2). We found that increasing the distance error did not significantly impact the classification of structures into the two ensembles. The variance in the mean ensemble positions over three repeats increased with increasing error along with small shifts in the mean positions. Notably, +/-15% is greater than the uncertainties in distances obtained via global fitting of fluorescence decays, suggesting that the intrinsic uncertainty in the FRET-derived distances from a single fit (Supplemental File 3D) does not significantly impact the ensemble assignment or their fuzziness.
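      The perturbation test described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the analysis code used for the manuscript: the pair names, reference distances, acceptance tolerance, and synthetic snapshots are all hypothetical, and a snapshot is accepted only if every pair distance lies within the tolerance of the (possibly perturbed) reference, following the description that each snapshot must satisfy all distance thresholds independently.

```python
import math
import random


def perturb(distances, max_fraction, rng):
    """Scale each distance by an independent random factor in [1 - f, 1 + f]."""
    return {pair: d * (1.0 + rng.uniform(-max_fraction, max_fraction))
            for pair, d in distances.items()}


def select(snapshots, reference, tolerance):
    """Keep snapshots whose every pair distance is within tolerance of the reference."""
    return [s for s in snapshots
            if all(abs(s["distances"][pair] - reference[pair]) <= tolerance
                   for pair in reference)]


def centroid(snapshots):
    """Mean position of the selected snapshots (stand-in for the PDZ3 center of mass)."""
    n = len(snapshots)
    return tuple(sum(s["position"][i] for s in snapshots) / n for i in range(3))


def shift_after_perturbation(snapshots, reference, tolerance, max_fraction, rng):
    """Distance between ensemble centroids before and after perturbing the reference."""
    baseline = select(snapshots, reference, tolerance)
    perturbed = select(snapshots, perturb(reference, max_fraction, rng), tolerance)
    if not baseline or not perturbed:
        return None  # no snapshots satisfied the restraints
    return math.dist(centroid(baseline), centroid(perturbed))


rng = random.Random(0)
# Hypothetical reference distances (arbitrary length units) for three label pairs.
reference = {"pairA": 40.0, "pairB": 55.0, "pairC": 62.0}
# Synthetic snapshots scattered around the reference; real input would come from DMD.
snapshots = [
    {"distances": {p: d + rng.gauss(0.0, 4.0) for p, d in reference.items()},
     "position": tuple(rng.gauss(0.0, 5.0) for _ in range(3))}
    for _ in range(500)
]

for max_fraction in (0.05, 0.15):
    shifts = [shift_after_perturbation(snapshots, reference, 10.0, max_fraction, rng)
              for _ in range(3)]  # three repeats per error level, as described above
    print(max_fraction, shifts)
```

      In this toy setup the positions are generated independently of the distances, so the printed shifts reflect only sampling noise; with real snapshots the same loop would report how far the ensemble center moves as the input distances are perturbed.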

    1. Author Response

      We thank the reviewers for their thoughtful and constructive comments which have helped us improve our manuscript. In our revised manuscript, we will respond to three main weaknesses:

      1. We will address the inconsistency in the experimental design across the behavior and the transcription experiments by repeating the behavior with an experimental timeline that more exactly matches that of the animals used in transcriptional studies;

      2. We will further validate and justify our use of TRAP and our focus on the NAc as the sole brain region of investigation;

      3. We will revise the language throughout the manuscript, especially in the discussion, to reduce anthropomorphizing of our results and interpretations. Below we have provided responses to specific concerns articulated by each reviewer.

      Reviewer #1 (Public Review):

      The monogamous vole provides unique opportunities to study the neural basis of pair bonding and this study exploits that opportunity in a novel way. Focusing on the nucleus accumbens, the authors conduct RNA-Seq to characterize the transcriptome in same-sex and opposite-sex pairs when bonded, when separated for a short time and when separated for a long time at which point the literature has in the past demonstrated the willingness to form a new bond. They determine that the transcriptome of pair bonding includes a preponderance of glial-associated gene changes and that it degrades with long-term separation. To the latter point, they then conduct a neuron enriching trap schema to find those genes subject to change specifically in neurons.

      The strength of the report is the clever experimental design, the unusual animal model, and the comparisons of same-sex and opposite-sex pairs and long-term and short-term separations.

      The weakness is that the behavioral changes observed are not what was expected based on prior work and are relatively modest, providing a disconnect between the outcome and the more dramatic transcriptional changes. A second weakness is the focus on the nucleus accumbens which is a brain region most closely associated with reward. While pair bonding may be rewarding, that component may be independent of the memory of a partner or the willingness to partner anew. Lastly, there is no clear connection between the identified transcriptome and either the formation or degradation of the pair bond.

      We thank the reviewer for noting the unique strengths of using prairie voles to investigate this specific question and for praising our experimental design, which compares opposite-sex and same-sex paired males at each time point to disentangle the effects of pair bonding from general social affiliation and isolation.

      Reviewers #1 and #3 noted the mismatch between the behavioral and transcriptional responses. Specifically, we found little evidence of bond dissolution following long-term separation despite substantial erosion of the pair bond transcriptional signature. They further note that the experimental designs employed to assess behavior and transcription differed, which may have contributed to the apparent mismatch. Importantly, our initial behavioral assessment as presented in Figure 1 of the manuscript had two strengths: it measured intra-animal changes in behavior over time, and it minimized the number of animals required. However, we agree with the reviewers, and we are currently repeating the behavior experiments to match the transcription experiments. Specifically, separated partners will be kept in separate colony rooms to ensure no possible access to partner-associated sensory cues (visual, auditory, olfactory), and we will use separate cohorts of animals for short- and long-term separation. This design avoids partner re-introduction during the short-term partner preference test. The results of this work will be informative regardless of outcome. If we observe a dissolution of pair bond behaviors, it would indicate that re-exposure to a partner after a short, 48-hour separation has a powerful effect on bond duration following separation. If we do not observe any change in pair bond behaviors following separation, it would confirm that pair bond behaviors are more resistant to erosion than are transcriptional signatures of pair bonding.

      We have focused on the NAc because it is a critical hub that is engaged upon attachment formation and is implicated in loss processing. Specifically, studies have shown that blockade of neuromodulatory signaling (i.e. oxytocin and dopamine) in this region impairs bond formation and can lead to bond dissolution. Our group and others have demonstrated that plasticity within this region, both in patterns of neuronal activity and in synaptic response to oxytocin, is associated with bond formation and maturation (1, 2). In addition, the literature on drugs of abuse has demonstrated an important role for the NAc in the encoding of reward associations (3), which ultimately underlies partner preference. Additionally, in human neuroimaging studies, Prolonged Grief Disorder is associated with an enhanced signal in the NAc when viewing images of the lost loved one, suggesting that normal resolution of grief corresponds with a decrease in NAc activity elicited by reminders of the lost loved one (4). Thus, our focus on this region is well supported. Nonetheless, we recognize that the NAc does not act in a vacuum, and the efferent and afferent connectivity of different NAc cell types is well delineated, paving the way for future work (5, 6).

      Additionally, we agree with the reviewer that pair bonding behavior is multifaceted and composed of several discrete behaviors that are not dissociable in the partner preference test. Partner-associated reward and partner memory may be independently encoded, and disruption of either process would manifest as a decrease or lack of partner preference. In our complete response to reviewers and revision of the manuscript, we will address this point more thoroughly. Finally, we interpret the reviewer’s last comment to be a request for functional manipulations to validate that the predicted transcriptional changes have a behavioral effect. This is beyond the scope of this manuscript but is an active area of future research.

      Reviewer #2 (Public Review):

      The goal of this study is to understand the molecular mechanisms by which pair bonded animals recover following the loss of a partner.

      Strengths of this work include: (1) The organism - a novel model for studying pair bonding and loss; (2) The integrative nature of the study; it integrates behavior and brain gene expression RNASeq data and vTRAP; (3) The important and understudied question about how pair bonded animals recover from loss; (4) The thorough and careful analysis of highly multidimensional and complex datasets

      Weaknesses include: (1) the major comparison is between same vs opposite sex housed pairs. This design controls for social effects somewhat, but the two treatment groups differ not just with respect to whether or not they are pair bonded, but also in whether or not they had associated with a male or female. Differences between the treatments could reflect pair bonding, or perhaps something about the sex of the partner. It would be useful to have an additional control group, or data on the behavior of individuals within both types of pairs while they are co-housed. Were transcriptomic effects more detectable in pairs that were more bonded together behaviorally? That would suggest that the gene expression signatures really reflect something about the bond rather than other confounds, for example; (2) The vTRAP method is fancy but what is it really adding? (3) The authors interpret the transcriptomic differences as promoting the ability to form a new bond but there are probably other processes that are contributing to the differences in gene expression. Some of the differentially expressed genes could be involved in promoting a new pair bond, but there could also be a signature of the memory of the identity of the partner, the signature of the bond itself, etc. (4) Some of the interpretations go a little too far, especially in terms of anthropomorphism. The impact of the work includes further development of voles as an important model for studying social behavior and insights into the molecular processes important for recovering from the loss of a partner.

      We thank the reviewer for recognizing the strength of our study organism and experimental techniques as well as rigorous analyses to answer an important question about adapting to partner loss.

      Regarding the noted weaknesses:

      (1) We chose to compare opposite-sex pair bonds to same-sex affiliative relationships as this is the standard within the field, and we note that reviewers 1 and 3 found this to be a strength of our study design (7–11). Peer relationships in prairie voles are difficult to distinguish behaviorally from those of opposite-sex pairs (Fig 1) because both same- and opposite-sex paired voles show selective preference for their pairmate and selective aggression towards other voles (7). As such, the critical feature that makes pair bonding different is mating, which requires an opposite-sex partner in voles, and our experiments are optimally designed to identify the longitudinal transcriptional changes that result from mating and cohabitating with an opposite-sex partner. In order to best match our two groups, only animals with a preference score >50% were included in the transcriptional experiment, ensuring that we were comparing animals with an affiliative preference for their partner, whether that individual was the same or opposite sex.

      We interpret the reviewer’s comment to be a request that we compare opposite-sex-paired animals with and without bonds. This can be achieved in two ways. First, we can compare to a promiscuous species, such as meadow voles, which will mate and cohabitate without forming bonds, but this is confounded by species differences in transcription that may exist independent of bonding. Second, we can compare bonded voles to the small subset that do not form bonds. While intriguing, this is experimentally challenging as only ~10-20% of males fail to form a bond when paired with a sexually receptive female (in the current study, 16% had a preference < 50% after two weeks of pairing, which is consistent with prior reports - (9–11)). Put simply, we would need to pair hundreds of voles to opportunistically collect a sufficient number of non-bonders for transcriptional assessment across our experimental conditions. While we hope to eventually be able to do such an experiment, litter sizes, consideration of animal welfare, and other constraints make this largely untenable at present.

      Data on the behavior of individuals within both types of pairs while they are co-housed are already provided via the results of a partner preference test performed after 2 weeks of co-housing and prior to re-housing or separation (Fig 2B and 3B). We find the reviewer’s suggestion of relating the transcriptional signature to pair bond strength interesting, and we undertook a preliminary analysis examining whether animals with different pair bond strengths aggregate in a PCA of gene expression. There was no apparent relationship, although we are performing additional analyses such as exploratory factor analysis. The fact that we have not found a relationship between the baseline partner preference and transcription in these initial analyses is perhaps unsurprising. First, bonding may require some threshold change in gene expression, with bond strength reflected in non-genomic information, such as synapse formation or strengthening, or axonal ensheathment. Second, we only performed transcriptional analyses on animals with a baseline partner preference >50%; we would not necessarily expect a dissociation given the uniformly strong bonds across these animals.

      (2) We feel that inclusion of TRAP adds substantially to this manuscript and to our understanding of the neuromolecular underpinnings of bonding and loss in the NAc. The value of this experiment is twofold. First, as noted by Reviewer 3, “the TRAP approach in prairie voles is novel and will provide a great resource to the research community.” The prairie vole community has just developed its first transgenic Cre lines, which could be paired with vTRAP to query bond-associated gene expression changes exclusively in Cre-expressing neurons (15). Second, we noticed a puzzle in our tissue-level data. The majority of cells in the NAc are neurons (16, 17), and the vast majority of pair bonding studies of this region have focused on neuronal phenotypes, but our transcriptional signatures were linked to changes in glial populations. Ultimately, changes in glia are likely to act via their interactions with neurons, and vTRAP enables us to query the neuronal transcriptional changes within our data. In support of the view that this provides novel insights into our datasets, when we cluster transcripts based on their expression profiles following short- and long-term separation, we predict different GO terms from the tissue-level and neuronally-enriched gene sets. For instance, the GO terms resulting from cluster 2 for neuronal genes (Fig 4) include “response to amphetamine” within the top 10 results, but the same cluster of genes from tissue-level sequencing predicts this GO term as the 34th result.

      (3) We agree with the reviewer that adapting to partner loss is a multifaceted process that likely engages numerous biological and emotional systems in voles. The explanation we offer for the transcriptional changes during loss is based on previous work in the field and is one possible interpretation. We will expand on this point during revision of the manuscript.

      (4) We thank the reviewer for encouraging us to be objective with our interpretations. We will address this comment during revision of the manuscript.

      Finally, we thank the reviewer for recognizing the value of our study for not only the field of voles but the bereavement field more broadly.

      Reviewer #3 (Public Review):

      In this manuscript, the authors investigate the behavioral and brain transcriptional alterations associated with short- and long-term partner separation in the monogamous male prairie vole. Male prairie voles continue to show affiliative behavior after short- (2 days) and long-term (4-weeks) partner separation, with similar effects for same and opposite-sex pairs. However, the transcriptional signature in the nucleus accumbens exhibits marked alterations after long-term separation.

      Strengths:

      1) A key strength of this manuscript is its use of the monogamous prairie vole to investigate transcriptional alterations associated with pair bonding and subsequent pair separation. This sort of behavior cannot be investigated in typical rodent model systems (e.g., mice, rats), and the choice of using prairie voles allows for dissection of potential mechanisms of social bonding with relevance to partner loss in humans.

      2) Investigation of behavioral measures and transcriptional alterations at both short- and long-term time points after pairing and separation is a strength of the manuscript. These time points were selected based on previous studies in laboratory and wild prairie voles related to the time it takes to form a pair bond and for the male prairie vole to leave the nest after the loss of the female pair. The datasets generated will be of great use to the scientific community.

      3) The authors investigate the behavior and transcriptional profiles after same-sex as well as opposite-sex pairing. This is considered a thoughtful decision on the authors' part which allows them to tease apart the effects of same vs. opposite sex.

      4) The use of numerous behavioral measures to assess both affiliative and aggressive behaviors is a strength of the approach.

      5) The authors use many biostatistical approaches (e.g., RRHO, WGCNA, Enrichr) to probe the transcriptomics data. These approaches allow the authors to move beyond simply assessing transcriptional profiles separately, but to look for patterns that are similar or different across datasets.

      6) The authors use rigorous statistical methods to assess behavioral measures.

      7) The TRAP approach in prairie voles is novel and will provide a great resource to the research community.

      Weaknesses:

      1) The methods state that prairie voles were treated differently in the behavioral and transcriptomics studies. Specifically, for the separation in the behavioral studies, prairie voles were separated by sight, but not necessarily by the smell from partners (i.e., partners were kept ~1 foot apart). However, prairie voles in the transcriptomics studies were separated by both sight and smell (i.e., partners were sacrificed after separation). Thus, it is possible that the lack of degradation of pair bond-related behavior after long-term separation might be due to these prairie voles being able to smell their partners after separation. This is considered a moderate flaw in the design of the studies which limits the integration of results between behavior and transcriptomics. This might be why the authors do not see a strong behavioral degradation of pair bond-related behavior after long-term separation but do see a strong transcriptional signature.

      2) While RRHO is helpful to assess overall patterns of transcriptional signatures across datasets, its utility for determining the exact transcripts is limited. This is because of how RRHO determines the overlapping transcripts for its Venn diagram feature (by taking the point where the p-value is most significant and taking the list to the outside corner of that quadrant).

      3) TRAP expression was verified in only one animal. Thus, the approach has not been appropriately confirmed.

      We thank the reviewer for their thoughtful comments on the innovative strengths and advantages of our manuscript.

      Regarding the noted weaknesses:

      (1) Please see our response to Reviewer #1, who shares your concerns.

      (2) We agree that RRHO is particularly useful for assessment of overall patterns. We interpret the Reviewer’s comment to mean that when extracting the overlapping gene lists from an RRHO quadrant for downstream analyses, we should filter that list for genes whose differential expression passes a nominal p-value cutoff to reduce the number of biologically insignificant conclusions we are drawing from the RRHO data. Our initial analyses used just such a threshold-based approach by identifying GO terms via differentially expressed genes of the combined pair bond (Figure 2) using both p-value and log2 fold-change cutoffs. This analysis revealed a number of terms associated with glial cell proliferation, differentiation, and function (Fig 2H). Such processes occur over a time frame of days to weeks, with different phases of differentiation characterized by different gene expression profiles. To explore this further, we used the genes in the UU and DD RRHO quadrants without implementing a p-value cutoff to see if additional genes associated with these GO-identified pathways may be showing subtle but consistent directional changes (Fig 3). Importantly, we only use the overlapping RRHO gene lists to determine how previously defined biological processes via DEG-predicted GO terms change across conditions; we are not using the RRHO gene lists to generate new GO terms. This allowed us to look for patterns within the identified pathways that may give insight into how transcription might be affecting gliogenesis. This analysis was also suggested to us by other experienced users of RRHO plots (see Acknowledgements). There are also several published studies that use RRHO UU and DD quadrant overlap (18–22).

      (3) Most labs rarely confirm Cre-dependence of vectors in more than one or two animals as the results, including those shown in Fig S9A, are typically definitive (i.e. no expression in the absence of Cre, expression in the presence of Cre). In addition to the images shown in Fig S9A, we used fluorescence-guided dissection to harvest tissue/mRNA, serving as an additional visual confirmation of RPL10-GFP expression in the animals used to generate Figure 4. Since submission, we have also confirmed that this vector expresses in rats when Cre-recombinase is present. However, prior to resubmission, we will perform additional surgeries to confirm that TRAP is only expressed in the presence of Cre-recombinase.

      References

      1. J. L. Scribner, E. A. Vance, D. S. W. Protter, W. M. Sheeran, E. Saslow, R. T. Cameron, E. M. Klein, J. C. Jimenez, M. A. Kheirbek, Z. R. Donaldson, A neuronal signature for monogamous reunion. Proceedings of the National Academy of Sciences. 117, 11076–11084 (2020).
      2. A. M. Borie, S. Agezo, P. Lunsford, A. J. Boender, J.-D. Guo, H. Zhu, G. J. Berman, L. J. Young, R. C. Liu, Social experience alters oxytocinergic modulation in the nucleus accumbens of female prairie voles. Current Biology. 32, 1026-1037.e4 (2022).
      3. E. S. Calipari, R. C. Bagot, I. Purushothaman, T. J. Davidson, J. T. Yorgason, C. J. Peña, D. M. Walker, S. T. Pirpinias, K. G. Guise, C. Ramakrishnan, K. Deisseroth, E. J. Nestler, In vivo imaging identifies temporal signature of D1 and D2 medium spiny neurons in cocaine reward. Proc. Natl. Acad. Sci. U.S.A. 113, 2726–2731 (2016).
      4. M.-F. O’Connor, D. K. Wellisch, A. L. Stanton, N. I. Eisenberger, M. R. Irwin, M. D. Lieberman, Craving love? Enduring grief activates brain’s reward center. NeuroImage. 42, 969–972 (2008).
      5. T. Hikida, S. Yao, T. Macpherson, A. Fukakusa, M. Morita, H. Kimura, K. Hirai, T. Ando, H. Toyoshiba, A. Sawa, Nucleus accumbens pathways control cell-specific gene expression in the medial prefrontal cortex. Sci Rep. 10, 1838 (2020).
      6. C. Baimel, L. M. McGarry, A. G. Carter, The Projection Targets of Medium Spiny Neurons Govern Cocaine-Evoked Synaptic Plasticity in the Nucleus Accumbens. Cell Reports. 28, 2256-2263.e3 (2019).
      7. N. S. Lee, N. L. Goodwin, K. E. Freitas, A. K. Beery, Affiliation, aggression, and selectivity of peer relationships in meadow and prairie voles. Frontiers in Behavioral Neuroscience. 13 (2019), doi:10.3389/fnbeh.2019.00052.
      8. O. J. Bosch, H. P. Nair, T. H. Ahern, I. D. Neumann, L. J. Young, The CRF System Mediates Increased Passive Stress-Coping Behavior Following the Loss of a Bonded Partner in a Monogamous Rodent. Neuropsychopharmacology. 34, 1406–1415 (2009).
      9. O. J. Bosch, J. Dabrowska, M. E. Modi, Z. V. Johnson, A. C. Keebaugh, C. E. Barrett, T. H. Ahern, J. Guo, V. Grinevich, D. G. Rainnie, I. D. Neumann, L. J. Young, Oxytocin in the nucleus accumbens shell reverses CRFR2-evoked passive stress-coping after partner loss in monogamous male prairie voles. Psychoneuroendocrinology. 64, 66–78 (2016).
      10. A. J. Grippo, B. S. Cushing, C. S. Carter, Depression-like behavior and stressor-induced neuroendocrine activation in female prairie voles exposed to chronic social isolation. Psychosomatic Medicine. 69, 149–157 (2007).
      11. A. J. Grippo, D. Gerena, J. Huang, N. Kumar, M. Shah, R. Ughreja, C. Sue Carter, Social isolation induces behavioral and neuroendocrine disturbances relevant to depression in female and male prairie voles. Psychoneuroendocrinology (2007), doi:10.1016/j.psyneuen.2007.07.004.
      12. J. R. Williams, C. S. Carter, T. Insel, Partner Preference Development in Female Prairie Voles Is Facilitated by Mating or the Central Infusion of Oxytocin. Annals of the New York Academy of Sciences. 652, 487–489 (1992).
      13. C. Sue Carter, A. Courtney Devries, L. L. Getz, Physiological substrates of mammalian monogamy: The prairie vole model. Neuroscience and Biobehavioral Reviews. 19, 303–314 (1995).
      14. L. L. Getz, C. S. Carter, L. Gavish, The mating system of the prairie vole, Microtus ochrogaster: Field and laboratory evidence for pair-bonding. Behavioral Ecology and Sociobiology. 8, 189–194 (1981).
      15. K. Horie, K. Inoue, S. Suzuki, S. Adachi, S. Yada, T. Hirayama, S. Hidema, L. J. Young, K. Nishimori, Oxytocin receptor knockout prairie voles generated by CRISPR/Cas9 editing show reduced preference for social novelty and exaggerated repetitive behaviors. Horm Behav. 111, 60–69 (2019).
      16. K. E. Savell, J. J. Tuscher, M. E. Zipperly, C. G. Duke, R. A. Phillips, A. J. Bauman, S. Thukral, F. A. Sultan, N. A. Goska, L. Ianov, J. J. Day, A dopamine-induced gene expression signature regulates neuronal function and cocaine response. Sci Adv. 6, eaba4221 (2020).
      17. D. Avey, S. Sankararaman, A. K. Y. Yim, R. Barve, J. Milbrandt, R. D. Mitra, Single-Cell RNA-Seq Uncovers a Robust Transcriptional Response to Morphine by Glia. Cell Reports. 24, 3619-3629.e4 (2018).
      18. S. L. Fulton, S. Mitra, A. E. Lepack, J. A. Martin, A. F. Stewart, J. Converse, M. Hochstetler, D. M. Dietz, I. Maze, Histone H3 dopaminylation in ventral tegmental area underlies heroin-induced transcriptional and behavioral plasticity in male rats. Neuropsychopharmacology. 47, 1776 (2022).
      19. S. G. Caradonna, T.-Y. Zhang, N. O’Toole, M.-J. Shen, H. Khalil, N. R. Einhorn, X. Wen, C. Parent, F. S. Lee, H. Akil, M. J. Meaney, B. S. McEwen, J. Marrocco, Genomic modules and intramodular network concordance in susceptible and resilient male mice across models of stress. Neuropsychopharmacol. 47, 987–999 (2022).
      20. J. S. Wang, T. Kamath, C. M. Mazur, F. Mirzamohammadi, D. Rotter, H. Hojo, C. D. Castro, N. Tokavanich, R. Patel, N. Govea, T. Enishi, Y. Wu, J. da Silva Martins, M. Bruce, D. J. Brooks, M. L. Bouxsein, D. Tokarz, C. P. Lin, A. Abdul, E. Z. Macosko, M. Fiscaletti, C. F. Munns, P. Ryder, M. Kost-Alimova, P. Byrne, B. Cimini, M. Fujiwara, H. M. Kronenberg, M. N. Wein, Control of osteocyte dendrite formation by Sp7 and its target gene osteocrin. Nat Commun. 12, 6271 (2021).
      21. D. A. Gallegos, M. Minto, F. Liu, M. F. Hazlett, S. Aryana Yousefzadeh, L. C. Bartelt, A. E. West, Cell-type specific transcriptional adaptations of nucleus accumbens interneurons to amphetamine. Mol Psychiatry, 1–15 (2022).
      22. B. J. Hilton, A. Husch, B. Schaffran, T. Lin, E. R. Burnside, S. Dupraz, M. Schelski, J. Kim, J. A. Müller, S. Schoch, C. Imig, N. Brose, F. Bradke, An active vesicle priming machinery suppresses axon regeneration upon adult CNS injury. Neuron. 110, 51-69.e7 (2022).
    1. Author Response

      Reviewer #1 (Public Review):

      In this paper the authors present variations in carbon oxidation state and hydration state in proteomes available in RefSeq. Then they use this information to predict community level proteomes, and their corresponding carbon oxidation states and hydration states, based on available 16S rRNA gene sequences from selected previously published datasets. When combining this with information about the environmental setting of the individual samples analyzed, the authors are able to demonstrate connections between redox conditions and proteomic carbon oxidation state and hydration state. Furthermore, they explore how individual taxonomic groups at different taxonomic levels contribute to forming these connections.

      A weakness with the study is that the described environmental proteomes are inferred from 16S rRNA gene sequence data and not observed directly. However, there is good reason to believe that the conclusions drawn in the paper are valid.

      The study sheds light on microbial adaptations on the genome level that so far have received relatively little attention. The paper is also interesting from an ecological perspective regarding the general question of how microbial communities are shaped by environmental settings.

      To draw more attention to environmental constraints, a plot (Figure 4E in the published paper) was redrawn to show more clearly that the carbon oxidation state of estimated community proteomes is not only lower in more reducing conditions across a variety of environments but also shows the largest differences for hydrothermal systems and shale-gas wells. This finding is discussed in terms of geological sources of reductants and provides new evidence that the chemical makeup of microbial communities reflects their geological context.

      Reviewer #2 (Public Review):

      This manuscript mainly investigated the carbon oxidation and stoichiometric hydration states of the inferred community proteomes according to 16S rRNA gene compositions from the published datasets and explored their potential associations with environmental parameters such as redox gradients, oxygen concentrations and salinity.

      Predictions of the carbon oxidation and stoichiometric hydration states on the basis of microbial proteomes can provide some meaningful information for disentangling microbial response to environmental changes. As we know, some genes in microbial genomes are not expressed and translated into proteins. Therefore, such gene redundancy in genomes may lead to bias in predicting the carbon oxidation and stoichiometric hydration states.

      Our study uses available data sources to identify informative differences of elemental compositions of proteomes predicted from genomes. There are numerous examples in the literature of using protein sequences predicted from genomes to make comparisons of amino acid composition (for example, in eLife: https://doi.org/10.7554/eLife.57347), so it would appear to be acceptable with some level of uncertainty to use genomic data to make comparisons between (amino acid or elemental) compositions of predicted proteomes.

      Furthermore, this study compiled many 16S rRNA gene datasets from previous studies. Different primer sets were applied in those studies, and such difference will result in distinct 16S rRNA gene compositions. Accordingly, it is essential to deal with the influence of different primer sets on the 16S rRNA gene compositions among samples. Unfortunately, such information is missing in the method section.

      Primer sets used in the source studies have been added to Table 1 in the published paper. The Discussion was modified to acknowledge limitations in making comparisons *between* datasets obtained using different primers. However, the main results of this study are based on differences of carbon oxidation state (Zc) *within* individual datasets (for instance, along the vertical redox gradients shown in Figure 3).

      The intra-dataset differences of Zc themselves are compared across datasets in Figure 4E. However, it can be expected that the effects of technical variability – including not only primer pairs but also DNA extraction methods, etc. – would tend to be reduced in these inter-dataset comparisons of intra-dataset differences, in contrast to direct inter-dataset comparisons. The index plot at the center of Figure 2 does make a direct inter-dataset comparison, but the outcome is consistent with trends identified in previous analyses of shotgun metagenomic datasets, 16S primers and other technical differences between studies notwithstanding.
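
      To make the distinction between direct inter-dataset comparisons and comparisons of intra-dataset differences concrete, a minimal sketch is given here. It assumes the standard definition of the average oxidation state of carbon for a formula with c, h, n, o, and s atoms of C, H, N, O, and S and net charge z, namely Zc = (-h + 3n + 2o + 2s + z) / c. The function names and the per-sample Zc values are hypothetical and are not taken from the published analysis.

```python
def carbon_oxidation_state(c, h, n, o, s, charge=0):
    """Average oxidation state of carbon, Zc, for a formula C_c H_h N_n O_o S_s
    with net charge `charge`: Zc = (-h + 3n + 2o + 2s + charge) / c."""
    if c <= 0:
        raise ValueError("formula must contain at least one carbon atom")
    return (-h + 3 * n + 2 * o + 2 * s + charge) / c


def intra_dataset_difference(zc_oxidizing, zc_reducing):
    """Mean Zc of samples from more oxidizing conditions minus the mean Zc of
    samples from more reducing conditions, within a single dataset."""
    def mean(values):
        return sum(values) / len(values)
    return mean(zc_oxidizing) - mean(zc_reducing)


# Worked check on two amino acids: glycine C2H5NO2 and alanine C3H7NO2.
zc_glycine = carbon_oxidation_state(c=2, h=5, n=1, o=2, s=0)
zc_alanine = carbon_oxidation_state(c=3, h=7, n=1, o=2, s=0)

# Hypothetical per-sample Zc values for one dataset (illustration only).
delta_zc = intra_dataset_difference(
    zc_oxidizing=[-0.150, -0.155],
    zc_reducing=[-0.170, -0.175],
)
```

      Because delta_zc is computed within one dataset, an offset that applies equally to all samples in that dataset (for example, a systematic primer-related shift, if it were constant) would cancel in the subtraction; this is the intuition behind the expectation stated above, not a demonstration of it.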

      Additionally, the community proteomes in this study were inferred from 16S rRNA genes. The 16S rRNA marker gene cannot reliably predict the corresponding genomes, possibly leading to the prediction of biased proteomes. Therefore, the use of 16S rRNA genes for predicting microbial genomes and proteomes should be avoided.

      Despite the various sources of uncertainty in making estimates of elemental composition of communities from 16S rRNA genes and reference proteomes, comparisons with shotgun metagenomic data support the reliable identification of trends within datasets (Figure 5 in the published paper).

      It seems that the relationships between carbon oxidation states/stoichiometric hydration state and redox/salinity gradients have been reported in previous studies (e.g., Dick et al 2019, 2020, 2021). The finding of this study is not new in comparison with the previously reported.

      The explorations in previous studies of chemical links between communities and environments were based on analysis of shotgun metagenomic data. The ability to reproduce those findings by analyzing 16S rRNA gene sequence data is a new advance in this study.

      Other new results in the published paper are the different magnitudes of Zc differences in various environments (which were not previously documented from shotgun metagenomes; Figure 4E) and the comparison of shotgun metagenome and 16S-based estimates of Zc for the time series of injected fluids in the Marcellus Shale (Figure 5B). The latter results are particularly interesting; the close correspondence for Days 0, 7, and 13 supports the basic reliability of the 16S-based estimates, while the increasing divergence at Days 82 and 328 suggests the onset of some interfering mechanisms (the speculation is made that this could be related to viral lysis and heterotrophic degradation of the released DNA). Also, the published paper presents the first analysis of carbon oxidation state of proteins – from either shotgun metagenome sequences or 16S rRNA-based estimates – for microbial communities in various body sites using data from the Human Microbiome Project (Figure 5D).

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript by de la Vega and colleagues describes Neuroscout, a powerful and easy-to-use online software platform for analyzing data from naturalistic fMRI studies using forward models of stimulus features. Overall, the paper is interesting, clearly written, and describes a tool that will no doubt be of great use to the neuroimaging community. I have just a few suggestions that, if addressed, I believe would strengthen the paper.

      Major comments

      1) How does Neuroscout handle collinearity among predictors for a given stimulus? Does it check for this and/or throw any warnings? In media stimuli that have been adopted for neuroimaging experiments, low-level audiovisual features are not infrequently correlated with mid-level features such as the presence of faces on screen (see Grall & Finn, 2022 for an example involving the Human Connectome Project video clips). How to disentangle correlated features is a frequent concern among researchers working with naturalistic data.

      We agree with the reviewer that collinearity between predictors is one of the biggest challenges for naturalistic data analysis. However, absent consensus on how to best model these data, we find that it is out of scope of the present report to make strong recommendations. Instead, our goal was to design an agnostic platform that would enable users to thoughtfully design statistical models for their particular goal. Papers such as Grall & Finn (2022) will be critical in advancing the debate on how to best analyze and interpret such data.

      We explicitly address this challenge in a new paragraph in the discussion under “Challenges and future directions”:

      “A major challenge in the analysis of naturalistic stimuli is the high degree of collinearity between features, as the interpretation of individual features is dependent on co-occurring features. In many cases, controlling for confounding variables is critical for the interpretation of the primary feature— as is evident in our investigation of the relationship between FFA and face perception. However, it can also be argued that in dynamic narrative driven media (i.e. films and movies), the so-called confounds themselves encode information of interest that cannot or should not be cleanly regressed out (Grall & Finn, 2022).[…] Absent a consensus on how to model naturalistic data, we designed Neuroscout to be agnostic to the goals of the user and empower them to construct sensibly designed models through comprehensive model reports. An ongoing goal of the platform—especially as the number of features continues to increase—will be to expand the visualizations and quality control reports to enable users to better understand the predictors and their relationship. For instance, we are developing an interactive visualization of the covariance between all features in Neuroscout that may help users discover relationships between a predictor of interest and potential confounds.” (pg. 11)

      Note we shortened the second paragraph of the discussion by two sentences as it had touched on this subject, and was better addressed separately.

      In addition, we now highlight the covariance structure visualization in the Results section:

      “At this point, users can inspect the model through quality-control reports and interactive visualizations of the design matrix and predictor covariance matrix, iteratively refining models if necessary.” (pg. 3)
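
      As a rough illustration of the kind of check a predictor covariance report supports, the sketch below computes pairwise Pearson correlations between predictors and flags highly correlated pairs. It is not Neuroscout's implementation; the predictor names, the toy values, and the 0.8 threshold are all assumptions chosen for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_collinear(predictors, threshold=0.8):
    """Return (name_a, name_b, r) for each pair with |r| >= threshold."""
    names = list(predictors)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(predictors[a], predictors[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged

# Hypothetical, made-up predictor time series (one value per time point).
predictors = {
    "brightness": [0.1, 0.4, 0.5, 0.9, 0.8, 0.3],
    "faces":      [0.0, 0.5, 0.4, 1.0, 0.9, 0.2],
    "speech":     [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
}

flagged = flag_collinear(predictors)
for a, b, r in flagged:
    print(f"{a} and {b} are highly correlated (r = {r:.2f})")
```

      In this toy example only the brightness and faces predictors are flagged, which mirrors the reviewer's concern that low-level audiovisual features can track mid-level ones.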

      2) On a related note, do the authors and/or software have opinions about whether it is more appropriate to run several regressions each with a single predictor of interest or to combine all predictors of interest into a single regression? (Or potentially a third, more sophisticated solution involving variance partitioning or another technique to [attempt to] isolate variance attributable to each unique predictor?) Does the answer to this depend on the degree of collinearity among the predictors? Some discussion of this would be helpful, as it is a frequent issue encountered when analyzing naturalistic data.

      This is a very sensitive methodological point, but one for which it is hard to find a univocal answer in the literature. While on the one hand it can be deceptive to model a single feature in isolation (as illustrated by our face perception analyses), more complex models pose different challenges in terms of robust parameter estimation and variance attribution. Resolving these challenges goes beyond the scope of our work, and it is ultimately our goal to provide a flexible tool which will enable these types of investigations, and enable users to take responsibility and provide motivations for methodological choices made using the platform. We touch on Neuroscout’s agnostic philosophy on this issue under “Challenges and future directions” (pg. 11; quoted above).

      However, we also agree that in part the solution to this problem will be methodological. This is particularly true for modeling deep learning based embeddings, which can have hundreds of features in a single model. We are currently working on expanding beyond traditional GLM models in Neuroscout, opening the door to more sophisticated variance partitioning techniques, and more robust parameter estimation in complex models. We highlight current and future efforts to expand Neuroscout’s statistical models in the following paragraph:

      “However, as the number of features continues to grow, a critical future direction for Neuroscout will be to implement statistical models which are optimized to estimate a large number of covarying targets. Of note are regularized encoding models, such as the banded-ridge regression as implemented by the Himalaya package. These models have the additional advantage of implementing feature-space selection and variance partitioning methods, which can deal with the difficult problem of model selection in highly complex feature spaces such as naturalistic stimuli. Such models are particularly useful for modeling high-dimensional embeddings, such as those produced by deep learning models. Many such extractors are already implemented in pliers and we have begun to extract and analyze these data in a prototype workflow that will soon be made widely available.” (pg. 11)
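
      To make the idea of banded ridge regression concrete, here is a minimal sketch in which each group ("band") of features receives its own penalty. It deliberately does not use the Himalaya package or its API; the function names, the toy design matrix, and the penalty values are assumptions for illustration, and the closed-form solve shown here would not scale to real encoding models.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def banded_ridge(X, y, bands, penalties):
    """Ridge regression with one penalty per feature band.

    bands[j] is the band index of feature j; penalties[k] is the penalty
    applied to every feature in band k. Solves (X'X + D) beta = X'y, where
    D is diagonal and holds each feature's band penalty.
    """
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    for j in range(p):
        XtX[j][j] += penalties[bands[j]]
    return solve(XtX, Xty)

# Toy data: features 0 and 1 form band 0, feature 2 forms band 1.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [1.0, 2.0, 3.0]]
y = [4.0, 5.0, 3.0, 14.0]  # generated without noise from weights [1, 2, 3]
bands = [0, 0, 1]

beta_unpenalized = banded_ridge(X, y, bands, penalties=[0.0, 0.0])
beta_band1_shrunk = banded_ridge(X, y, bands, penalties=[0.0, 1e6])
```

      With no penalty the generating weights are recovered; with a very large penalty on band 1 only, that band's coefficient is driven toward zero while band 0 remains free to fit the data. This is the sense in which per-band penalties can act as a form of feature-space selection.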

      3) What the authors refer to as "high-level features" - i.e., visual categories such as buildings, faces, and tools - I would argue are better described as "mid-level features", reserving the term "high-level" for features that are present only in continuous, engaging, narrative or narrative-like stimuli. Examples: emotional tone or valence, suspense, schema for real-world situations, other operationalizations of a narrative arc, etc. After all, as the authors point out, one doesn't need naturalistic paradigms to study brain responses to visual categories or single-word properties. Much of the work that has been done so far with forward models of naturalistic stimuli has been largely confirmatory (e.g., places/scenes still activate PPA even during a rich film as opposed to a serial visual presentation paradigm). This is a good first step, but the promise of naturalistic paradigms is ultimately to go beyond these isolated features toward more holistic models of cognitive and affective processes in context. One challenge is that extracting true high-level features is not easily automated, although the ability to crowdsource human ratings using online data collection has made it feasible to create manual annotations. However, there are still technical challenges associated with collecting continuous-response measurement (CRM) data during a relatively long stimulus from a large number of individuals online. Does Neuroscout have any plans to develop support for collecting CRM data, perhaps through integration with Amazon MTurk and/or Prolific? Just a thought and I am sure there are a number of features under consideration for future development, but it would be fabulous if users could quickly and easily collect CRM data for high-level features on a stimulus that has been uploaded to Neuroscout (and share these data with other end users).

      The reviewer makes a very good point regarding the fact that many so-called “high-level” features are best called “mid-level”. As such, we have changed our use of “high-level” to “mid-level perceptual features” throughout the manuscript.

      “Currently available features include hundreds of predictors coding for both low-level (e.g., brightness, loudness) and mid-level (e.g., object recognition indicators) properties of audiovisual stimuli…” (pg. 3)

      That said, we do believe that as machine learning (and in particular deep learning) models evolve, it will become more feasible to extract higher-level features automatically. This has already been shown with transformer language models, which are able to extract higher-level semantic information from natural text. To this end, we have designed our underlying feature extraction platform, pliers, to be easily extensible, so that the platform can continue to grow as algorithms evolve. We highlight this in the Results section ‘Automated annotation of stimuli’:

      “The set of available predictors can be easily expanded through community-driven implementation of new pliers extractors, as well as public repositories of deep learning models, such as HuggingFace and TensorFlowHub. We expect that as machine learning models continue to evolve, it will be possible to automatically extract higher-level features from naturalistic stimuli.” (pg. 3)

      We also highlighted the extensibility of pliers to increasingly powerful deep learning models in the Discussion by revising this sentence:

      “As a result, we have designed Neuroscout and its underlying feature extraction framework pliers to facilitate community-led expansion to novel extractors— made possible by the rapid increase in public repositories of pre-trained deep learning models such as HuggingFace and TensorFlow Hub” (pg. 10)

      As to the point of a potential extension to Neuroscout for easily collecting crowdsourced stimulus annotations, we are in full agreement that this would be very useful. In fact, this feature was part of the original plan for Neuroscout, but fell out of scope as other features took priority. Although we are unsure if this extension is a short-term priority for the Neuroscout team (as it likely would take substantial effort to develop a general-purpose extension), the ability to submit user-generated features to the Neuroscout API should make it possible to design a modular extension to Neuroscout to collect such features.

      We mention this possibility briefly in the future directions section:

      “Other important expansions include facilitating analysis execution by directly integrating with cloud-based neuroscience analysis platforms (such as Brainlife.io) and facilitating the collection of higher-level stimulus features by integrating with crowdsourcing platforms such as MechanicalTurk or Prolific.” (pg. 11)

      4) Can the authors talk a bit more about the choice to demean and rescale certain predictors, namely the word-level features for speech analysis? This makes sense as a default step, but I wonder if there are situations in which the authors would not recommend normalizing features prior to computing the GLM (e.g., if sign is meaningful, if the distribution of values is highly skewed, if the units reflect absolute real-world measurements, etc). Does Neuroscout do any normalization automatically under the hood for features computed using the software itself and/or features that have been calculated offline and uploaded by the user?

      In keeping with Neuroscout’s philosophy to be a general purpose platform, we have not performed any standardization of features. Instead, users can choose to modify raw predictor values by applying transformations on a model-by-model basis. Currently available transformations through the web interface include: scale, orthogonalize and threshold. Note that there is a wider range of transformations available in the BIDS Stats Model, but we are hesitant to advertise these yet, as they are more difficult to use.

      We revised our description of transformations in the Result section to clarify these transformations are model specific:

      “Raw predictor values can be modified by applying model-specific transformations such as scaling, thresholding, orthogonalization, and hemodynamic convolution.” (pg. 3)

      We also clarify that variables are ingested without any in-place modifications in the Methods section. The only exception is that we down-sample highly dense variables (such as those from auditory files, which can result in thousands of values per second), to save disk space:

      “Feature values are ingested directly with no in place modifications, with the exception of down sampling of temporally dense variables to 3hz to reduce storage on the server.” (pg. 13)

      With respect to the word frequency analysis, the primary reason we scaled variables was to facilitate imputing missing values for words not found in the look-up dictionary. By scaling the variable, we were able to replace missing values with zero, effectively assigning them the average word frequency value. We clarified this strategy in the Methods section:

      “In all analyses, this variable was demeaned and rescaled prior to HRF convolution. For a small percentage of words not found in the dictionary, a value of zero was applied after rescaling, effectively imputing the value as the mean word frequency.” (pg. 17)
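
      The imputation strategy described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the variable names and toy values are made up, and rescaling to unit variance is an assumption, since the text only says the variable was "demeaned and rescaled".

```python
import math

def standardize_with_imputation(values):
    """Demean and rescale the observed values; set missing entries to 0.

    Missing values are given as None. After demeaning, 0 corresponds to the
    mean of the observed values, so imputing 0 amounts to mean imputation.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    sd = math.sqrt(sum((v - mean) ** 2 for v in observed) / (len(observed) - 1))
    return [0.0 if v is None else (v - mean) / sd for v in values]

# Hypothetical word-frequency values; None marks words absent from the dictionary.
word_frequency = [5.0, 3.0, None, 4.0, None, 6.0]
scaled = standardize_with_imputation(word_frequency)
print(scaled)
```

      Because the imputed entries sit exactly at the mean, they contribute nothing to the demeaned regressor before HRF convolution.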

      On a more general note, when interpreting a single variable with a dummy coded contrast (i.e. 1 for the predictor of interest, and 0 for all other variables), it’s not necessary to normalize features prior to modeling, as fMRI t-stat maps are scale-invariant (although the parameter estimates will be affected).

      We added a note with our recommendations in the Neuroscout Documentation: https://neuroscout.github.io/neuroscout-docs//web/builder/transformations.html#scale
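
      The scale invariance of the t-statistic can be checked numerically. The sketch below fits an ordinary least-squares model with an intercept and a single predictor, which is a simplification of the multi-predictor GLM discussed above; the data are made up, and the factor of 10 is an arbitrary positive rescaling chosen for illustration.

```python
import math

def slope_and_t(x, y):
    """OLS fit of y = a + b*x; returns the slope b and its t-statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    return slope, slope / se

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]

slope_raw, t_raw = slope_and_t(x, y)
slope_scaled, t_scaled = slope_and_t([10.0 * v for v in x], y)
print(slope_raw, slope_scaled, t_raw, t_scaled)
```

      Multiplying the predictor by 10 divides the parameter estimate by 10 but leaves the t-statistic unchanged, because the standard error shrinks by the same factor. A negative scaling factor would flip the sign of both.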

      Reviewer #2 (Public Review):

      The authors present a new platform for constructing and sharing fMRI analyses, specifically geared toward analyzing publicly-available naturalistic datasets using automatically-extracted features. Using a web interface, users can design their analysis and produce an executable package, which they can then execute on their local hardware. After execution, the results are automatically uploaded to NeuroVault. The paper also describes several examples of analyses that can be run using this system, showing how some classical feature-sensitive ROIs can be derived from a meta-analysis of naturalistic datasets.

      The Neuroscout system is impressive in a number of ways. It provides easy access to a number of publicly-available datasets (though I would like to see the current set of 13 datasets increase in the future), has a wide variety of machine-learning features precomputed on the video and audio features of these stimuli, and builds on top of established software for creating and sandboxing analysis workflows. Performing meta-analyses across multiple datasets is challenging both practically and statistically, but this kind of multi-dataset analysis is easy to specify using Neuroscout. It also allows researchers to easily share a reproducible version of their pipeline simply by pointing to the publicly-available analysis package hosted on Neuroscout. The platform also provides a way for researchers to upload their own custom models/predictors to extend those available by default.

      The case studies described in the paper are also quite interesting, showing that traditional functional ROIs such as PPA and VWFA can be defined without using controlled stimuli. They also show that, running a contrast for faces does not produce FFA until speech (and optionally adaptation) is properly controlled for, and that VWFA shows relationships to lexical processing even for speech stimuli.

      I have some questions about the intended workflow for this tool: is Neuroscout meant to be used for analysis development in addition to sharing a final pipeline? The fact that the whole analysis is packaged into a single command is excellent for reproducibility but seems challenging to use when iterating on a project. For example, if we wanted to add another contrast to a model, it appears that this would require cloning the analysis and re-starting the process from scratch.

      An important principle of Neuroscout from the outset of the project was to minimize undocumented researcher degrees of freedom, and maximize transparency in order to reduce the file drawer effect which can contribute to biased results in the published literature. As such, we require analyses to be registered and locked as the modal usage of our application. In the case of adding a contrast, it is true that this would require a user to clone the analysis. Although all of the information from the previous model would be encoded in the new model, this would require re-estimating the design matrix, which could be time consuming. However, in our experience, users almost always add new variables to the design matrix when a study is cloned, which would in any case require re-estimating the design matrix for all runs and subjects. We believe this trade-off is worthwhile to ensure maximal reproducibility, but also point out that since Neuroscout’s data is freely available via our API, power users could directly access the data if they need to use it in a less constrained manner.

      We believe that these important distinctions are best addressed in the newly developed Neuroscout documentation which we now reference throughout the text (https://neuroscout.org/docs/web/browse/clone.html).

      I'm also unsure about how versioning of the input datasets and the predictors is planned to be handled by the platform; if datasets have been processed with multiple versions of fmriprep, will all of those options be available to choose from? If the software used to compute features is updated, will there be multiple versions of the features to choose from?

      The reviewer makes an astute observation regarding the versions of input data (predictors & datasets). Currently we have only pre-processed the imaging data once per dataset, and as such this has not been an issue. However, in the long run we certainly agree it would be important to give users the ability to choose which pre-processed version of the raw dataset they want to use, as certainly there could be differing but equally valid versions. We have opened an issue in Neuroscout’s repository to track this, and plan to incorporate this ability in a future version (https://github.com/neuroscout/neuroscout/issues/1076).

      With respect to feature versions, every time a feature is re-extracted, a new predictor_id is generated, and the accompanying meta-data such as time of extraction is tracked for that specific version. As such, if a feature is updated and re-extracted, this will not change existing analyses. By default, we have chosen to obscure this from the user to make the user experience simpler. However, there is an open issue to expand the frontend’s ability to explicitly display different versions, and allow users to update older analyses with newer versions of features. Advanced users already have access to this functionality by using the Python API (PyNS) to directly access all features, and create analyses with more precision.

      We have made a note regarding this behavior in the Neuroscout Documentation: https://neuroscout.github.io/neuroscout-docs/web/builder/predictors.html

      I also had some difficulty attempting to test out the platform, so additional user testing may be necessary to ensure that novice users are able to successfully run analyses.

      We thank the reviewer for this bug report, which allowed us to fix a previously unnoticed issue with a subset of Neuroscout datasets. We have been in contact with the reviewer to ensure that this issue was successfully addressed.

    1. Author Response

      Reviewer #1 (Public Review):

      1) While the authors identify the suppressors in known genetic interactors (GIs) of the yeast SEC53, it is worth testing if the compensatory mutations are rewiring the GIs, thereby explaining the lack of comparable compensations observed in reconstituted strains. If altered GIs explain the suppression, then while yeast serves as an excellent tool to perform these assays, the human context of the disease may require a different set of genetic suppressors and, therefore, a different target than the yeast PGM1 ortholog.

      Our data show that pgm1 mutations alone greatly improve growth of sec53-V238M strains. Our data also indicate other pathways of compensation. Whether each of these compensatory mechanisms translate to humans is unknown. However, the observed enrichment of compensatory mutations in genes whose human homologs are associated with Type 1 CDG, suggests that many of these genetic interactions are likely to be conserved.

      Also, do the Sec53 and Pgm1 proteins directly interact in yeast, and are these mutations on the interaction interface?

      As we mention above, there is no support for a direct physical interaction between Sec53 and Pgm1.

      2) Based on the data obtained between pACT1 and pSEC53-driven expression of the SEC53 mutant alleles, the pattern of suppressors appears to be different. Authors report that the variants expressed from strong pACT1 promoters show more suppressors than those driven by native promoters. Is this a general trend in experimental evolution that slower-growing strains tend to show fewer suppressors? For example, on Page 6, line 154, "compensating for Sec53-F126L dimerization defects are rare or not easily accessible". The statement suggests that the authors did obtain suppressors that compensate for the dimerization defect. At the same time, while rare (also, are authors suggesting suppression of dimerization defect as in better dimerization?), the rate of obtaining suppressors seems to be linked to the severity of the fitness defects of the strains. The lack of suppressors may be a limitation of the evolution experiments. Indeed later in the manuscript, the authors noticed that while PGM1 suppressors obtained in V238M can also suppress F126L alleles, the suppression was not as efficient. Could it be that evolution experiments in slower-growing strains predominantly enrich suppressors in other pathways (i.e., not in the CDG orthologs) that restore the growth better and compete out the relatively weaker suppressors in PGM1? In fact, the authors report similar effects on Page 7, lines 204-210. These two paragraphs are contradictory and should be explained further.

      All of our sequencing was performed on strains with sec53 under the control of the pACT1 promoter. While we did not identify unique sec53-F126L suppressors, we cannot exclude that sec53-F126L suppressors exist, so we describe them as “rare or not easily accessible”. While it is possible that the slower growth rate of the sec53-F126L allele could impact the likelihood of observing suppressors, we think it is more likely due to the nature of the variant (dimerization defect versus stability defect) rather than growth rate. In other laboratory evolution experiments the same beneficial mutation typically has a greater effect in slower-growing backgrounds (for example: doi.org/10.1126/science.1250939).

      3) Authors report that the LOF of PGM1 compensates for the SEC53 mutations. However, the evolution experiments did not capture any LOFs in PGM1. The fitness comparisons in evolution experiments are different as many different genotypes compete in a mix. Therefore, the fitness assays in a clonal population may not represent these differences well. To test this argument, authors can try to mimic the evolution experiments by mixing two genotypes to check competitive fitness, like the co-culture of pgm1 suppressor obtained via evolution experiments with pgm1Δ.

      Though we did not perform a direct head-to-head competition between a pgm1 suppressor and a pgm1Δ, our data suggest that the pgm1 deletion would outcompete some of the lower-fitness suppressors. In the Discussion we speculate as to why we do not see deletion mutations: “Given that most of the evolved clones containing pgm1 mutations are more fit than the reconstructed strains, it is possible that other evolved mutations interact epistatically only with non-loss-of-function pgm1 mutations.” Though it is beyond the scope of the present manuscript, it would be possible to rerun the evolution experiment in sec53-V238M strains carrying either a pgm1 missense suppressor or a pgm1Δ. Under the hypothesis of additional interacting loci, only the pgm1 missense suppressors would be more likely to acquire additional compensatory mutations.

      Reviewer #3 (Public Review):

      Vignogna et al. used yeast genetics, experimental evolution and biochemistry to tackle human congenital disorders of glycosylation (CDG), a disease mostly caused by mutations in PMM2. They took advantage of the observation that the budding yeast gene SEC53 is almost identical to human PMM2, and used experimental evolution to find interactors of SEC53/PMM2. They found an overrepresentation of mutations in genes corresponding to other human CDG genes, including PGM1. Genetic and biochemical characterizations of the pgm1 mutations were carried out. This work is solid, although authors did not reveal why reduction of pgm1 activity could compensate for defects of a particular mutant allele of sec53.

      Out of curiosity, if the authors were to simply focus on the preexisting mutations, would they have gotten the materials for most of the experiments in this article? In other words, how important is the experimental evolution?

      The evolution experiment was crucial as the specific pgm1 mutations we identified here have not been reported elsewhere, nor have the orthologous mutations been identified in human PGM1.

      A strain table with full genotypes is needed.

      We added a strain genotype table (Supplemental Dataset 2).

    1. Author Response

      Reviewer #2 (Public Review):

      In this MEG work employing two types of bistable perception test and unique regression analyses, the authors identified different neural frequencies to different components of visual perception: its content and stability.

      Strengths:

      This study has a nice set of three different experiments to clarify neural differences between content, memory and stability of visual perception.

      The state space analysis appears to be powerful to identify such different neural signatures for different cognitive components as well.

      Weaknesses:

      Despite such strengths, this work may have the somewhat critical weakness specified in the recommendations for the authors.

      First, in the analysis to identify content-specific neural frequency, the authors concluded that the SCP is more relevant to the visual perceptual content compared to the neural activity in the alpha and beta-band frequencies. In my impression, to claim this, it would be necessary to show statistically significant differences in the prediction accuracy between the SCP and the other frequencies. Given the not-so-high prediction accuracy seen in the SCP-based analysis, such statistical supports appear essential.

      We have now directly compared decoding accuracy for SCP and alpha/beta oscillations, which showed statistically significant differences in both the ambiguous and unambiguous conditions for both ambiguous images. We have added these results as a supplementary figure (new Figure 2—figure supplement 1).

      Second, two behavioural metrics in the neural state space analysis-i.e., Switch and Direction-may be too arbitrary. As suggested by the power-law distribution of the percept duration, the neural dynamics during seemingly stable percept may not be able to be described in linear functions. Instead, the brain may go back and forth between several neural states even when we are thinking we're experiencing stable visual consciousness. If so, the current definition of the Switch metric and Direction index, which seems to be based on the behaviour of the Switch index, may be arbitrary. In other words, I feel the authors may have to elaborate the rationale for the definitions of such metrics.

      First, we note it is generally accepted in the field that the distribution of percept durations follows a gamma distribution instead of a power-law distribution (e.g., Sterzer et al., TiCS 2009; Blake & Logothetis Nature Rev. Neurosci 2002; Kleinschmidt et al., 1998; Leopold et al., TiCS 1999), and microswitches have not been reported using either the more classic task employed here or the more recently developed ‘no-report’ task of using eye-tracking statistics to deduce perceptual switches without overt report (e.g., Frassle et al., J Neurosci 2014).

      Second, while brain activity may fluctuate during these time periods, it never crosses the threshold of evoking a conscious report, and thus we would expect that such fluctuations, if they do occur, would be of a lower magnitude than those that do produce a conscious report.

      Most importantly, our goal here is to define behavioral metrics in order to identify components of neural dynamics underpinning the relevant aspect of behavior. As such, our definition of the behavioral metric should not be directly informed by observed spontaneous dynamics of brain activity (especially those that may be observed in the data but are of unclear relevance to perceptual switching); otherwise the analysis would be prone to circularity and spurious correlations (i.e., using observed brain dynamics to inform construction of behavioral metrics might pick up aspects of brain dynamics not really relevant to behavior in the analysis results).

      Finally, the timing characteristics of the ‘Switch’ and ‘Direction’ behavioral metrics are not arbitrary; instead they are the simplest behavioral functions that allow a comparison of pre- and post-switching periods (or when the percepts might be in the ‘stabilizing’ phase vs. the ‘destabilizing’ phase). Nevertheless, the regression analysis can pick up on other temporal patterns of changes not exactly the same as our defined behavioral metric. This can be seen for SCP and beta activity projected onto the Direction axis, where it has the lowest value at the ~20th percentile of the trial (not the 50th percentile as assumed by the behavioral metric). To confirm that the analysis is not highly dependent on the precise timing definition of the behavioral metrics, we ran a control analysis, where the switching point was set at the 30th percentile (rather than the 50th percentile as in the original analysis). This control analysis resulted in a similar pattern of neural results (Figure R1).

      Figure R1: Changing temporal behavior definition (switching point moved from 50th percentile to 30th percentile of percept duration) does not significantly alter the neural results. Compare to Figure 4—figure supplement 1, ‘Switch’ and ‘Direction’ Columns.

    1. Author Response

      Reviewer #1 (Public Review):

      This paper shows that a principled, interpretable model of auditory stimulus classification can not only capture behavioural data on which the model was trained but somewhat accurately predict behaviour for manipulated stimuli. This is a real achievement and gives an opportunity to use the model to probe potential underlying mechanisms. There are two main weaknesses. Firstly, the task is very simple: distinguishing between just two classes of stimuli. Both model and animals may be using shortcuts to solve the task, for example (this is suggested somewhat by Figure 8 which shows the guinea pig and model can both handle time-reversed stimuli).

      The task structure is indeed simple. In the context of categorization tasks that are typically used in animal experiments, however, we would argue that we are at the higher end of stimulus complexity. Auditory categories used in most animal experiments typically employ a category boundary along a single stimulus parameter (for example, tone frequency or modulation frequency of AM noise). Only a few recent studies (for example, Yin et al., 2020; Town et al., 2018) have explored animal behavior with “non-compact” stimulus categories. Thus, we consider our task a significant step towards more naturalistic tasks.

      We were also faced with the practical factor of the trainability of guinea pigs (GPs). Prior to this study, guinea pigs have been trained using classical conditioning and aversive reinforcement on detecting tone frequency (e.g., Heffner et al., 1971; Edeline et al., 1993). More recently, competitive training paradigms have been developed for appetitive conditioning, using a single “footstep” sound as a target stimulus and manipulated sounds as non-target stimuli (Ojima and Horikawa, 2016). But as GPs had never been trained on more complex tasks before our study, we started with a conservative one vs. one categorization task. We mention this in the Discussion section of the revised manuscript (page 27, line 665).

      To determine whether these results hold for more complex tasks as well, after receiving the reviews of the original manuscript, we trained two GPs (that were originally trained and tested on the wheeks vs. whines task) further on a wheeks vs. many (whines, purrs, chuts) task. As earlier, we tested these GPs with new exemplars and verified that they generalized. In the figure below, the average performance of the two GPs on the regular (training) stimuli and novel (generalization) stimuli are shown in gray bars, and individual animal performances are shown as colored discs. The GPs achieved high performance for the novel stimuli, demonstrating generalization. We also implemented a 4-way WTA stage for a wheek vs. many model and verified that the model generalized to new stimuli as well.

      For frequency-shifted calls, these two GPs performed better for wheeks vs. many compared to the average for wheeks vs. whines shown in the main manuscript. The 4-way WTA model closely tracked GP behavioral trends.

      The psychometric curves for wheeks vs. many categorization in noise (different SNRs) did not differ substantially from the wheeks vs. whines task.

      We focused our one vs. many training on the two conditions that showed the greatest modulation in the one vs. one tasks. However, these preliminary results suggest that the one vs. one results presented in the manuscript are likely to extend to more complex classification tasks as well. We chose not to include these new data in the revised manuscript because we performed these experiments on only 2 animals, which were previously trained on a wheeks vs. whines task. In future studies, we plan to directly train animals on one vs. many tasks.

      Secondly, the predictions of the model do not appear to be quite as strong as the abstract and text suggest.

      We now replace subjective descriptors with actual effect size numbers to avoid overstating results. We also include additional modeling (classification based on the long-term spectrum) and discuss alternative possibilities to provide readers with points of comparison. Thus, readers can form their own opinions of the strengths of the observed effects.

      The model uses "maximally informative features" found by randomly initialising 1500 possible features and selecting the 20 most informative (in an information-theoretic sense). This is a really interesting approach to take compared to directly optimising some function to maximise performance at a task, or training a deep neural network. It is suggestive of a plausible biological approach and may serve to avoid overfitting the data. In a machine learning sense, it may be acting as a sort of regulariser to avoid overfitting and improve generalisation. The 'features' used are basically spectro-temporal patterns that are matched by sliding a crosscorrelator over the signal and thresholding, which is straightforward and interpretable.

      This intuition is indeed accurate – the greedy search algorithm (described in the original vision paper by Ullman et al., 2002) sequentially adds to the final MIF set the features that contribute the most hits and the fewest false alarms relative to the existing members of the set. The latter criterion (fewest false alarms) essentially guards against over-fitting for hits alone. A second factor is the intermediate size and complexity of MIFs. When MIFs are too large, there is certainly overfitting to the training exemplars, and the model does not generalize well (Liu et al., 2019).
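
      The greedy selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the response describes the criterion only qualitatively, so the scoring rule used here (newly added hits minus newly added false alarms), the function name `greedy_select`, and the toy detection sets are all assumptions introduced for illustration.

```python
def greedy_select(features, n_select):
    """Greedily pick features that add the most new hits and the fewest new false alarms.

    features: dict mapping a (hypothetical) feature name to a pair
    (hits, false_alarms), each a set of call indices on which the feature
    is detected (within-category calls for hits, other calls for false alarms).
    """
    selected = []
    covered_hits, covered_fas = set(), set()
    remaining = dict(features)

    def gain(name):
        # Assumed score: hits not yet covered minus false alarms not yet incurred.
        hits, fas = remaining[name]
        return len(hits - covered_hits) - len(fas - covered_fas)

    while remaining and len(selected) < n_select:
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break  # no remaining feature improves on the current set
        hits, fas = remaining.pop(best)
        selected.append(best)
        covered_hits |= hits
        covered_fas |= fas
    return selected


# Toy example: f2 is redundant with f1, and f4 adds many false alarms.
toy = {
    "f1": ({0, 1, 2, 3}, {10}),
    "f2": ({0, 1, 2}, set()),
    "f3": ({4, 5}, set()),
    "f4": ({0, 1, 2, 3, 4, 5}, {10, 11, 12, 13, 14}),
}
chosen = greedy_select(toy, n_select=3)
print(chosen)
```

      In this toy case the redundant feature and the false-alarm-heavy feature are both skipped, which is the sense in which the false-alarm criterion acts like a regularizer.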

      It is surprising and impressive that the model is able to classify the manipulated stimuli at all. However, I would slightly take issue with the statement that they match behaviour "to a remarkable degree". R^2 values between model and behaviour are 0.444, 0.674, 0.028, 0.011, 0.723, 0.468. For example, in figure 5 the lower R^2 value comes out because the model is not able to use as short segments as the guinea pigs (which the authors comment on in the results and discussion). In figure 6A (speeding up and slowing down the stimuli), the model does worse than the guinea pigs for faster stimuli and better for slower stimuli, which doesn't qualitatively match (not commented on by the authors). The authors state that the poor match is "likely because of random fluctuations in behavior (e..g motivation) across conditions that are unrelated to stimulus parameters" but it's not clear why that would be the case for this experiment and not for others, and there is no evidence shown for it.

      Thank you for this feedback. There are two levels at which we addressed these comments in the revised manuscript.

      First, regarding the language – we have now replaced subjective descriptors with the statement that the model captures ~50% of the overall variance in behavioral data. The ~50% number is the average overall R2 between the model and data (0.6 and 0.37 for the chuts vs. purrs and wheeks vs. whines tasks respectively). We leave it to readers to interpret this number.

      Second, our original manuscript lacked clarity on exactly what aspects of the categorization behavior we were attempting to model. As recent studies have suggested, categorization behavior can be decomposed into two steps – the acquisition of the knowledge of auditory categories, and the expression of this knowledge in an operant task (Kuchibhotla et al., 2019; Moore and Kuchibhotla, 2022). Our model solely addresses how knowledge regarding categories is acquired (through the detection of maximally informative features). Other than setting a 10% error in our winner-take-all stage, we did not attempt to systematically model any other cognitive-behavioral effects such as the effect of motivation and arousal. Thus, in the revised manuscript, we have included a paragraph at the top of the Results section that defines our intent more clearly (page 5, line 117). We conclude the initial description of the behavior by stating that these factors are not intended to be captured by the model (page 6, line 171). We also edited a paragraph in the Discussion section for clarity on this point (page 26, line 629).

      In figure 11, the authors compare the results of training their model with all classes, versus training only with the classes used in the task, and show that with the latter performance is worse and matches the experiment less well. This is a very interesting point, but it could just be the case that there is insufficient training data.

      This could indeed be the case, and we acknowledge this as a potential explanation in the revised manuscript (page 22, line 537; page 27, line 653). Our original thinking was that if GPs were also learning discriminative features only using our training exemplars, they would face a similar training data constraint as well. But despite this constraint, the model’s performance is above d’=1 for natural calls – both training and novel calls; it is only the similarity with behavior on the manipulated stimuli that is lower than the one vs. many model. This phenomenon warrants further investigation.

      Reviewer #2 (Public Review):

      Kar et al aim to further elucidate the main features representing call type categorization in guinea pigs. This paper presents a behavioral paradigm in which 8 guinea pigs (GPs) were trained in a call categorization task between pairs of call types (chuts vs purrs; wheek vs whines). The GPs successfully learned the task and are able to generalize to new exemplars. GPs were tested across pitch-shifted stimuli and stimuli with various temporal manipulations. Complementing this data is multivariate classifier data from a model trained to perform the same task. The classifier model is trained on auditory nerve outputs (not behavioral data) and reaches an accuracy metric comparable to that of the GPs. The authors argue that the model performance is similar to that of the GPs in the manipulated stimuli, therefore, suggesting that the 'mid-level features' that the model uses may be similar to those exploited by the GPs. The behavioral data is impressive: to my knowledge, there is scant previous behavioral data from GPs performing an auditory task beyond audiograms measured using aversive conditioning by Heffner et al. in 1970. [One exception that is notably omitted from the manuscript is Ojima and Horikawa 2016 (Frontiers)]. Given the popularity of GPs as a model of auditory neurophysiology these data open new avenues for investigation. This paper would be useful for neuroscientists using classifier models to simulate behavioral choice data in similar Go/No-Go experiments, especially in guinea pigs. The significance of the findings rests on the similarity (or not) of the model and GP performance as a validation of the 'intermediary features' approach for categorization. At the moment the study is underpowered for the statistical analysis the authors attempt to employ, which frequently relies on non-significant p values for its conclusions; using a more sophisticated approach (a mixed effects model utilizing single trial responses) would provide a more rigorous test of the manipulations on behavior and allow a more complete assessment of the authors' conclusions.

      We thank the reviewer for their feedback and the suggestion for a more robust statistical approach. We have now replaced the repeated measures ANOVA based statistics for the behavior and model where more than 2 test conditions were presented (SNR, segment length, tempo shift, and frequency shift) with generalized linear models with a logit link function (logistic activation function). In these models, we predict the trial-by-trial behavioral or model outcome from predictors including stimulus type (Go or Nogo), parameter value (e.g., SNR value), parameter sign (e.g., positive or negative freq. shift), and animal ID as a random effect. To evaluate whether parameter value and sign had a significant contribution to the model, we compare this ‘full’ model against a null model that only has stimulus type as a predictor and animal ID as a random effect. These analyses are described in detail in the Materials and Methods section of the revised manuscript (page 36, line 930).
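
      The nested-model comparison described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' analysis code: it fits plain fixed-effects logistic regressions by Newton-Raphson and omits the random effect of animal ID that the authors include. The simulated trials, the coefficient values, the coding of the outcome as correct/incorrect, and all function names are hypothetical.

```python
import math
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit_logistic(X, y, n_iter=50):
    """Fit a logit-link GLM by Newton-Raphson; return (coefficients, log-likelihood)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    hess[a][c] += w * xi[a] * xi[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    ll = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        ll += yi * eta - math.log1p(math.exp(eta))
    return beta, ll


def likelihood_ratio_test(X_full, X_null, y):
    """Likelihood-ratio test for a full model that adds exactly 2 predictors to the null."""
    _, ll_full = fit_logistic(X_full, y)
    _, ll_null = fit_logistic(X_null, y)
    stat = 2.0 * (ll_full - ll_null)
    p = math.exp(-max(stat, 0.0) / 2.0)  # chi-square survival function for df = 2
    return stat, p


# Hypothetical simulated trials: the outcome depends on the parameter value, not its sign.
rng = random.Random(1)
X_full, X_null, y = [], [], []
for _ in range(1000):
    go = rng.choice([0.0, 1.0])                       # stimulus type (Go = 1, Nogo = 0)
    value = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0])   # scaled parameter value
    sign = rng.choice([-1.0, 1.0])                    # parameter sign
    eta = 0.5 + 0.3 * go + 2.0 * value
    correct = 1.0 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0.0
    X_full.append([1.0, go, value, sign])             # intercept, type, value, sign
    X_null.append([1.0, go])                          # intercept, type only
    y.append(correct)

lr_stat, p_value = likelihood_ratio_test(X_full, X_null, y)
print(f"LR statistic = {lr_stat:.2f}, p = {p_value:.3g}")
```

      A mixed-effects fit, as used by the authors, would additionally estimate a per-animal intercept; a dedicated statistics package is the appropriate tool for that, and the sketch above only conveys the logic of comparing a full model against a null model.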

      These analyses reveal significant effects of segment length changes, and weak effects of tempo changes on behavior (as expected by the reviewer). Both the behavior and model showed similar statistical significance (except tempo shift for wheeks vs. whines) for whether performance was significantly affected by a given parameter.

      The behavioral data presented here are descriptive. The central conceptual conclusions of the manuscript are derived from the comparison between the model and behavioral data. For these comparisons, the p-value of statistical tests is not used. We realized that a description of how we compared model and behavioral data was not clear in the original manuscript. To compare behavioral data with the model, we fit a line to the d’ values obtained from the model plotted against the d’ values obtained from behavior, and computed the R2 value. We used the mean absolute error (MAE) to quantify the absolute deviation between model and behavior d’ values. Thus, high R2 values would signify a close correspondence between the model and behavior regardless of statistical significance of individual data points. We now clarify this in page 12, line 289. We derive R2 values for individual stimulus manipulations, as well as an overall R2 by pooling across all manipulations (presented in Fig. 11). This is now clarified in page 21, line 494.
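
      The model-versus-behavior comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the clipping floor applied to hit and false alarm rates, and the example d’ values are hypothetical and are not taken from the manuscript.

```python
from statistics import NormalDist, mean


def d_prime(hit_rate, fa_rate, floor=0.01):
    """Sensitivity index d' = z(hit rate) - z(false alarm rate).

    Rates are clipped to [floor, 1 - floor] (an assumed convention) so that
    the inverse normal CDF stays finite.
    """
    z = NormalDist().inv_cdf

    def clip(p):
        return min(max(p, floor), 1.0 - floor)

    return z(clip(hit_rate)) - z(clip(fa_rate))


def compare(behavior, model):
    """Fit a line to model d' plotted against behavior d'; return slope, intercept, R^2, MAE."""
    mb, mm = mean(behavior), mean(model)
    sxx = sum((b - mb) ** 2 for b in behavior)
    sxy = sum((b - mb) * (m - mm) for b, m in zip(behavior, model))
    slope = sxy / sxx
    intercept = mm - slope * mb
    ss_res = sum((m - (intercept + slope * b)) ** 2 for b, m in zip(behavior, model))
    ss_tot = sum((m - mm) ** 2 for m in model)
    r2 = 1.0 - ss_res / ss_tot
    mae = mean([abs(m - b) for b, m in zip(behavior, model)])
    return slope, intercept, r2, mae


# Hypothetical d' values for four test conditions.
behavior_d = [0.0, 1.0, 2.0, 3.0]
model_d = [0.5, 1.5, 2.5, 3.5]
slope, intercept, r2, mae = compare(behavior_d, model_d)
print(slope, intercept, r2, mae)
```

      The example illustrates why the two metrics are reported together: a model that is offset from behavior by a constant can have a perfect R2 while still showing a non-zero MAE.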

      Reviewer #3 (Public Review):

      The authors designed a behavioral experiment based on a Go/ No-Go paradigm, to train guinea pigs on call categorization. They used two different pairs of call categories: chuts vs. purrs and wheeks vs. whines. During the training of the animals, it turned out that they change their behavioral strategies. Initially, they do not associate the auditory stimuli with rewards, and hence they overweight the No-Go behavior (low hit and false alarm rate). Subsequently, they learned the association between auditory stimuli and reward, leading to overweighting the Go behavior (high hit and false alarm rates). Finally, they learn to discriminate between the two call categories and show the corresponding behaviors, i.e. suppress the Go behavior for No-go stimuli (improved discrimination performance due to stable hit rates but lower false alarm rates).

      In order to derive a mechanistic explanation of the observed behaviors, the authors implemented a computational feature-based model, with which they mirrored all animal experiments, and subsequently compared the resulting performances.

      Strengths:

      In order to construct their model, the authors identified several different sets of so-called MIFs (most informative features) for each call category, that were best suited to accomplish the categorization task. Overall, model performance was in general agreement with behavioral performance for both the chuts vs. purrs and wheeks vs. whines tasks, in a wide range of different scenarios.

      Different instances of their model, i.e. models using different of those sets of MIFs, performed equally well. In addition, the authors could show that guinea pigs and models can generalize to categorize new call exemplars very rapidly.

      The authors also tested the categorization performance of guinea pigs and models in a more realistic scenario, i.e. communication in noisy environments. They find that both, guinea pigs and the model exhibit similar categorization-in-noise thresholds.

      Additionally, the authors also investigated the effect of temporal stretching/compression of calls on categorization performance. Remarkably, this had virtually no negative effect on both, models and animals. And both performed equally well, even for time reversal. Finally, the authors tested the effect of pitch change on categorization performance, and found very similar effects in guinea pigs and models: discrimination performance crucially depends on pitch change, i.e. systematically decreases with the percentage of change.

      Weaknesses:

      While their computational model can explain certain aspects of call categorization after training, it cannot explain the time course of different behavioral strategies shown by the guinea pigs during learning/training.

      Thank you for bringing this up – in hindsight the original manuscript lacked clarity on exactly what aspects of the behavior we were trying to model. As recent studies have suggested, categorization behavior can be decomposed into two steps – the acquisition of the knowledge of auditory categories, and the expression of this knowledge in an operant task (Kuchibhotla et al., 2019; Moore and Kuchibhotla, 2022). Our model solely addresses how knowledge regarding categories is acquired (through the detection of maximally informative features). Other than setting a 10% error in our winner-take-all stage, we did not attempt to systematically model any other cognitive-behavioral effects such as the effect of motivation and arousal, or behavioral strategies. Thus, in the revised manuscript, we have included a paragraph at the top of the Results section that defines our intent more clearly (page 5, line 117). We conclude the initial description of the behavior by stating that these factors are not intended to be captured by the model (page 6, line 171). We also edited a paragraph in the Discussion section for clarity on this point (page 26, line 629).

      Furthermore, the model cannot account for the fact that short-duration segments of calls (50ms) already carry sufficient information for call categorization in the guinea pig experiment. Model performance, however, only plateaued after a 200 ms duration, which might be due to the fact that the MIFs were on average about 110 ms long.

      The segment-length data indeed demonstrate a deviation between the data and the model. As we had acknowledged in the original manuscript, this observation suggests further constraints (perhaps on feature length and/or bandwidth) that need to be imposed on the model to better match GP behavior. We originally did not perform this analysis because we wanted to demonstrate that a model with minimal assumptions and parameter tuning could capture aspects of GP behavior.

      We have now repeated the modeling by constraining the features to a duration of 75 ms (the lowest duration for which GPs show above-threshold performance). We found that the constrained MIF model better matched GP behavior on the segment-length task (R2 of 0.62 and 0.58 for the chuts vs. purrs and wheeks vs. whines tasks; with the model crossing d’=1 for 75 ms segments for most tested cases). The constrained MIF model maintained similarity to behavior for the other manipulations as well, and yielded higher overall R2 values (0.66 for chuts vs. purrs, 0.51 for wheeks vs. whines), thereby explaining an additional 10% of variance in GP behavior.

      In the revised manuscript, we included these results (page 28, line 699), and present results from the new analyses as Figure 11 – Figure Supplement 2.

      In the temporal stretching/compressing experiment, it remains unclear, if the corresponding MIF kernels used by the models were just stretched/compressed in a temporal direction to compensate for the changed auditory input. If so, the modelling results are trivial. Furthermore, in this case, the model provides no mechanistic explanation of the underlying neural processes. Similarly, in the pitch change experiment, if MIF kernels have been stretched/compressed in the pitch direction, the same drawback applies.

      We did not alter the MIFs in any way for the tests – the MIFs were purely derived by training the model on natural calls. In learning to generalize over the variability in natural calls, the model also achieved the ability to generalize over some manipulated stimuli. The fact that the model tracks GP behavior is a key observation supporting our argument that GPs also learn MIF-like features to accomplish call categorization.

      We had mentioned in a few places that the model was only trained on natural calls. To add clarity, we have now included sentences in the time-compression and frequency-shifting results affirming that we did not manipulate the MIFs to match test stimuli. We also include a couple of sentences in the Discussion section’s first paragraph stating the above argument (page 26, line 615).

    1. Author Response

      Reviewer #1 (Public Review):

      The actual description of the methods does not allow the reader to evaluate the precision of two important processing steps. First, rCBF measures are supposed to be restricted to the cortex, but given the pCASL image spatial resolution, partial volume effects with white matter probably exist, especially in younger infants. Furthermore, segmenting tissues on the basis of anatomical images (especially T1-weighted) is complicated in the first postnatal year. As rCBF measurements are very different between grey and white matter, the performed procedure might impact the measures at each age, or even lead to a systematic bias on age-dependent changes. Second, the methodology and accuracy of the brain registration across infants are little detailed whereas it is a challenging aspect given the intense brain growth and folding, the changing contrast in T1w images at these ages, and the importance of this step to perform reliable voxelwise comparison across ages.

      We thank the reviewer for this comment. We have added more description to the Methods to address it. Briefly, each individual rCBF map was generated in individual space and calibrated by phase-contrast MRI to minimize individual variations in processing parameters such as the T1 of arterial blood (Aslan et al., 2010). Cortical segmentation was also conducted in individual space. The different image types, including the rCBF map and the gray matter segmentation probability map in individual space, were then normalized into template space. An averaged gray matter probability map was generated after inter-subject normalization. After carefully testing multiple thresholds on the averaged gray matter probability map, we used a 40% probability threshold, which minimized contamination by white matter and CSF while keeping the cortical gray matter mask continuous across the cerebral cortex, to generate the binary gray matter mask shown in the left panel of Figure R1 below. Despite the poor contrast and poor cortical segmentation of T1-weighted images of younger infants, rightfully pointed out by the reviewer, the poor cortical segmentation in younger infants was compensated for by the averaged cortical mask and by measuring rCBF in template space. As demonstrated in the right three panels of Figure R1, the rCBF measure within the cortical mask in template space is consistent across ages, allowing accurate and reliable voxelwise comparison across ages.

      Figure R1. The gray matter mask and segmented cortical mask overlaid on the rCBF maps of three representative infants aged 3, 6, and 20 months in the template space. The gray matter mask in the left panel was created to minimize the contamination of white matter and CSF while keeping the continuity of the cortical gray matter mask across the cerebral cortex. The contour of the gray matter mask is highlighted with a blue line.

      The authors achieved their aim in showing that the rCBF increase differs across brain regions (the DMN showing intense changes compared to the visual and sensorimotor networks). Nevertheless, an analysis of covariance (instead of an ANOVA) including the infants' age as covariate (in addition to the brain region) would have allowed them to evaluate the interaction between age and region (i.e. different slopes of age-related changes across regions) in a more rigorous manner. Regarding the evaluation of the coupling between physiological (rCBF) and functional connectivity measures, the results only partly support the authors' conclusion. Actually, both measures strongly depend on the infants' age, as the authors highlight in the first parts of the study. Thus, considering this common age dependency would be required to show that the physiological and connectivity measurements are specifically related and that there is indeed a coupling.

      We thank the reviewer for this comment. Following the reviewer’s suggestion, we conducted an analysis of covariance (ANCOVA) and found significant interaction between regions and age (F(6, 322) = 2.45, p < 0.05) with age as a covariate. This ANCOVA result is consistent with Figure 3c showing differential rCBF increase rates across brain regions. The ANCOVA result was added in the last paragraph in the Results section “Faster rCBF increases in the DMN hub regions during infant brain development”.

      Regarding the evaluation of the coupling between the physiological measure (rCBF) and functional connectivity (FC), Figure 5 and Figure 5–figure supplements 1 and 2 were generated precisely to test whether the FC-rCBF coupling localized specifically in the DMN is due to mutual age dependency. Briefly, Figure 5B demonstrates significant correlation clustered only in the DMN regions, using the correlation method illustrated in Figure 5–figure supplement 1. Furthermore, nonparametric permutation tests with 10,000 permutations were conducted. Such permutation tests are sensitive and effective, and Figure 5c reveals significant coupling only in the DMN regions. If the coupling were driven by mutual age dependency, Figure 5c would show significant coupling in the Vis and SM network regions too.
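
      A nonparametric permutation test of this kind can be sketched as follows. This is a generic illustration, not the authors' pipeline: the response does not specify what was permuted, so shuffling the pairing between two measures across subjects is an assumption, as are the function names and the toy values.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p(x, y, n_perm=10000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y.

    The pairing between x and y is shuffled n_perm times (an assumed scheme);
    the p-value is the fraction of shuffles whose |r| reaches the observed |r|,
    with the usual +1 correction so that p is never exactly zero.
    """
    rng = random.Random(seed)
    observed = pearson_r(x, y)
    y_perm = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson_r(x, y_perm)) >= abs(observed):
            count += 1
    return observed, (count + 1) / (n_perm + 1)


# Toy data: a hypothetical regional rCBF measure and an FC measure in 12 subjects.
rcbf = [float(v) for v in range(12)]
fc = [2.0 * v + 1.0 for v in rcbf]
r_obs, p_perm = permutation_p(rcbf, fc, n_perm=2000)
print(r_obs, p_perm)
```

      Note that a permutation test of this form assesses significance but does not by itself remove a shared dependence on age; the authors' argument on that point rests on the spatial specificity of the coupling.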

    1. Author Response

      Reviewer #1 (Public Review):

      In this work, Maxime R. and co-authors intended to investigate the consequence of dystrophin absence/alteration in myoblasts, the effector cells of muscle growth and regeneration, and the early role of such cells in the pathogenesis of the disease. They carried out a transcriptomic analysis, comparing transcripts expressed by dystrophic myoblasts isolated from two murine models of DMD (Dmdmdx and Dmdmdx-βgeo) and control healthy mice. The expression of a large number of genes, comprising key regulator of myogenic differentiation (Myod1, Myog, Pax3 etc.) resulted affected in comparison to control in both mouse lines.

      We believe that the novelty and importance of these results lie in demonstrating for the first time that the loss of full-length dystrophin expression is both necessary and sufficient to trigger molecular and functional abnormalities in myoblasts. The fundamental point is that, contrary to the prevailing belief, the dystrophin function may not be just to provide sarcolemma stability in myofibers but rather that there is a disease continuum: DMD defects in satellite cells (Dumont et al., 2015, Ref 45) cause myoblast dysfunctions that diminish muscle regeneration (this work) and also impair myofiber differentiation (Shoji et al., Ref 4), with the resulting fibre being unstable and therefore degenerating. These data can better explain all the symptoms of dystrophic muscle pathology, where abnormalities in satellite cells, myoblasts and myofibers form the pathological vicious cycle. Moreover, we identify the key trigger behind these abnormalities in dystrophic myoblasts, which is MyoD downregulation. Furthermore, we demonstrate that the additional loss of short dystrophin isoforms, although these are expressed in myoblasts, does not exacerbate the phenotype. This latter finding is very important given the near complete lack of understanding of the pathology in dystrophin-null patients.

      Authors highlighted similar gene expression modifications also in a myoblast cell line previously established from the mdx mouse.

      Analogous alterations found in both primary myoblasts and in the established myoblast cell line demonstrate that this change is cell-autonomous and not evoked by the external factors in the dystrophic niche, e.g. inflammatory mediators. This also shows that the dystrophic phenotype resists the transcriptomic drift as it is maintained through numerous passages. This approach was praised later on in the review.

      To assess the outcomes from the gene ontology analysis, which pointed on the alteration of muscle system and regulation of muscle system processes, authors evaluated the proliferative, chemotactic and differentiative capacities of dystrophic myoblasts. Myoblasts presented increased proliferation, reduced chemotaxis and quite surprisingly, improved differentiating capacity, if considering the transcriptomic data.

      The key pathways (proliferation, migration and differentiation) that are essential for myoblasts to evoke muscle regeneration were confirmed to be altered in functional analyses, thus proving these transcriptomic alterations to be functional and biologically relevant. Our data showing accelerated differentiation in mdx myoblasts fully agree with findings by others, both in primary cultures and in isolated myofibers (Yablonka-Reuveni & Anderson, 2005, Ref 22).

      Finally, Maxime R. and co-authors carried out a transcriptomic analysis in myoblasts from DMD human subjects. Even though the profile of altered gene expression resulted similar and the GO studies seemed to focus on the same pathway categories, a significative divergence was observed particularly at the level of gene expression.

      Given that myoblasts from individual DMD patients present heterogeneous phenotypes (Choi et al., 2016), such divergence at the level of individual gene expression between mouse and human is to be expected. Nevertheless, these changes become convergent in altered GO categories and pathways. In the revised manuscript we have included additional genome-scale metabolic analysis in human DMD myoblasts. This revealed significant alteration in specific metabolic pathways. These are consistent with the metabolic alterations found previously in dystrophic muscle and brain, thus confirming the commonality of dystrophic defects found here in myoblasts and described before in dystrophic tissues. Moreover, this analysis is an additional proof that DMD myoblasts are significantly altered when compared to healthy cells.

      Authors link transcriptomic abnormalities and functional changes in proliferation, chemotaxis and differentiation of the dystrophic myoblasts with the alterations (probably epigenetic changes) occurring in satellite cells of dystrophic mice, consequent to the absence of the dystrophin protein. Such modifications in gene expression are supposed to be inherited by pathological myoblasts due to the division of the SC that is no longer asymmetric as occurring in healthy tissue.

      Strengths

      Transcriptomic data from samples of different sources are solid and rigorous statistical analyses have been carried out.

      Transcriptomic and functional data from primary proliferating myoblasts of the two mouse models and from the myoblast cell line are similar. This is a convincing evidence that the transcriptomic alterations observed in primary myoblasts are not influenced by the exposure to the niche environment present in the dystrophic muscle, but that are cell autonomous.

      Authors adopted a 3D culture for the functional analysis concerning myoblasts differentiations, in this way better mimicking the process occurring in vivo.

      Weaknesses

      The mdx mouse phenotype is mild in comparison to the severe symptoms and the rapid disease progression experimented by most of the human DMD subjects. Mdx mice is characterized by cycle of degeneration/regeneration initiating around the age of 6 weeks and continuing for several weeks. It was expected that authors discussed this point in detail, also considering that the animals used in this study were 8 weeks old.

      The mdx mouse has a mutation resulting in the loss of full-length dystrophin expression, which reflects the molecular defect affecting the majority of DMD patients. Therefore, mdx is the most commonly used pre-clinical model in DMD studies. The intensity of myonecrosis during this active degeneration and regeneration period (starting at 12 days and not at 6 weeks) is as aggressive as in patients. In fact, it has been suggested that the intensity of myonecrosis seen in mdx mice would be lethal to DMD patients (Duddy et al., 2015). The difference between human and mdx mouse pathology is that, starting at 10 weeks of age, the fibre replacement in mdx leg muscles reduces gradually, due to an unknown mechanism. Therefore, we isolated myoblasts at 8 weeks, when mdx replicates the human pathology. To emphasise the relevance of our findings for the human pathology, we discuss this point in detail in the revised manuscript.

      Furthermore, transcriptomic analysis of the human DMD myoblasts highlighted many differences as well as similarities when compared to mouse samples. Why do not focus more on this aspect? According to the authors, dystrophic abnormalities in myoblasts manifest irrespective of differences in genetic backgrounds and across species. The last one is a strong statement that should have been supported at least by functional data regarding chemotaxis proliferation and differentiation of human DMD myoblasts.

      What we meant by “dystrophic abnormalities in myoblasts manifest irrespective of differences in genetic backgrounds and across species” is that the lack of full-length dystrophin expression results in identical molecular defects in mouse and human primary myoblasts and also in the dystrophic cell line, despite numerous gene expression alterations triggered by the long-term culture in the latter. We agree that linking the functional alterations in human dystrophic myoblasts to the transcriptomic alterations that we identified is important. Indeed, altered proliferation, migration and differentiation of human DMD myoblasts have been described before (Witkowski and Dubovitz, 1985; Nesmith et al., 2016; Sun et al., 2020). In fact, these previous findings, which were never fully investigated, prompted us to undertake this study. Thus, our data provide a molecular underpinning for these abnormalities. In the revised manuscript we have elaborated on the existing functional data supporting alterations in human myoblasts.

      Further functional analyses will be needed to understand their consequences. This would require investigation of numerous parameters, including the significant alterations in metabolic pathways that we identified and described in the revised version of this manuscript. Given the aforementioned individual variability in the patient population, demonstrated by heterogeneous phenotypes in myoblasts, such functional analyses would need to involve a significant number of probands.

      Therefore, a detailed study in a sufficiently large cohort of DMD myoblasts is a logical next step from the identification of specific pathway alterations described here. But it is an extensive new project beyond our immediate capability.

      In the discussion, the authors suggest two possible mechanisms as responsible for alterations in the behavior of the SC that ultimately affect the functionality of myoblasts: an RNA-mediated pathological process or an alteration in epigenetic regulation. They consider the latter mechanism more likely. This is based in particular on transcriptomic data showing the downregulation of important genes involved in histone modifications, normally linked to transcriptional activation. They also reported from the literature that HDAC inhibitors upregulate MyoD, a gene that is effectively downregulated in this study. Since the authors postulate that the epigenetic dysregulation of Myod1 expression is responsible for the pathological cascade of gene downregulation, ultimately leading to the pathological phenotype, it would have been interesting to evaluate the impact of HDACi on this pathway, or of the overexpression of enzymes responsible for H3K4 methylation such as Smid1 (downregulated in this study).

      We have presented several hypotheses regarding the mechanism by which loss of full-length dystrophin expression could affect myoblasts, including a restricted spatio-temporal requirement for small amounts of full-length dystrophin and an RNA-based mechanism. The notion that epigenetic dysregulation of Myod1 expression causes a pathological cascade of transcriptional downregulation of genes controlled by MyoD was based on our finding that transcripts downregulated in dystrophic myoblasts exhibit overrepresentation of MyoD binding sites. We discussed this as a likely mechanism, supported by a body of literature on the known alterations of epigenetic regulation found in DMD (fifteen papers in total). We also offered the hypothesis that, since treatment of mdx mice with histone deacetylase inhibitors (HDACi) promoted myogenesis (Saccone et al., 2014) and HDACi upregulate Myod1 (Mal et al., 2001), HDACi could increase myogenesis by counteracting the changes we found in dystrophic myoblasts. However, while evaluating the impact of HDACi or of the overexpression of enzymes responsible for H3K4 methylation would prove or disprove one of the working hypotheses we made in the Discussion, it would in no way alter the key discovery of this study, which is that loss of full-length dystrophin expression results in major cell-autonomous abnormalities in proliferating myoblasts. Thus, if preferred, this Discussion paragraph could be shortened so as not to distract the reader from the main findings of this manuscript.

      Reviewer #2 (Public Review):

      This study is one of many that explore various abnormalities in the mononuclear myogenic cell compartments in DMD. Although the aim has been extensively investigated in the last several decades, it is still relevant.

      It is correct that abnormalities of proliferation, migration and differentiation in dystrophic myogenic cells have been reported over decades, but these were not followed up and often disregarded. Certainly, their causative link to DMD mutations and their consequences for the pathology were never investigated. Our study is the first to provide the comprehensive molecular underpinning for these abnormalities, demonstrating that the loss of full-length dystrophin expression directly and significantly affects myoblasts.

      The biggest limitation of this study is that it relies on the RNAseq analyses of extensively cultured myoblasts. While the computation analyses are profound, the study lacks any mechanistic explanation for the relevance of the transcriptional differences seen in the DMD myoblasts.

      We are not sure where this opinion originated. In fact, we used freshly isolated primary myoblasts in the RNAseq experiments and then confirmed the key alterations functionally in primary myoblasts freshly isolated from two strains of DMD mice. Furthermore, we performed mechanistic analyses in which we linked process alterations to functional defects, focussing on proliferation, migration and differentiation as processes known to impact the DMD pathology.

      In an approach considered one of the strengths of our work by the other Reviewer, these findings in primary myoblasts were then reproduced in a myoblast cell line, to demonstrate that the alterations observed are not evoked by exposure to the niche environment present in the dystrophic muscle, but are cell-autonomous. Importantly, DMD mutant cells show these alterations despite being extensively cultured in vitro, demonstrating the expressivity of this mutation. Finally, the alterations were confirmed in human primary myoblasts.

      Cell purity, the myogenic status of the cells, passage number, and the period that cells were in culture are not well described. This study's cell isolation method allows contamination with non-myogenic cells that can significantly influence the RNAseq analyses. Immunostaining for myogenic markers, for example MyoD, would indicate the purity of the cell culture. Extensive culturing of the primary myoblasts promotes clonal selection and introduces numerous molecular alterations; thus, the passage number and duration of the culture are significant factors. It looks like some assays were conducted with cells at a high passage, for example the myogenic differentiation assay, where one million cells were needed for each pellet. Maybe that is the reason for the low differentiation rate presented in Sup. Fig 2.

      Cell homogeneity across genotypes was fully confirmed by sample-based hierarchical clustering, which clearly segregated transcripts into groups corresponding to genotypes. Furthermore, the same alterations were found in the corresponding myoblast cell lines, whose purity and myogenic potential were demonstrated previously (Onopiuk et al., 2015). Therefore, varying contamination with non-myogenic cells could not significantly influence these results. However, for completeness, in the revised manuscript (Supplementary Figure 8) we described cell characterisation using MyoD as a marker, proving that the well-established myoblast isolation procedure we used produces pure myoblast cultures.

      As for the differentiation assay, isolated myoblasts were never passaged extensively (one passage only) but sufficient numbers were obtained through the efficient isolation. Moreover, cells from every genotype were maintained and treated identically. Therefore, under these given conditions, any differences were the result of the DMD gene mutation and not culturing.

      It is hard to explain how DMD myoblasts differentiate better than the WT controls if they have a suppressed myogenic program in the proliferation stage. Even at day 0 of differentiation, DMD myoblasts differentiated better according to the RT-qPCR presented in Figure 5c. Additionally, it is unusual that the marker of differentiation Myog and Myh1 reached the peak at day 2 of differentiation for the WT myoblasts.

      In fact, our data fully agree with findings by others that mdx cells display accelerated differentiation both in primary cultures and in isolated myofibers (Yablonka-Reuveni & Anderson, 2005). Our team recently demonstrated that DMD mutations evoke marked transcriptome and miRNome dysregulations early in human muscle cell development (Mournetas et al., 2021). Expression of key coordinators of muscle differentiation was dysregulated in proliferating dystrophic myoblasts, the differentiation of which was subsequently found to be altered, in line with the mouse cells studied here. Clearly, further studies into the mechanisms of this and the numerous other alterations we describe here are urgently needed, as these may uncover new therapeutic targets.

      As to whether it is unusual for these differentiation markers to peak at that time, we cannot comment, as no reference for this statement was given and the expressions can vary depending on the experimental conditions used – in our case the 3D culture could make the difference. Yet, again, cells from every genotype were maintained and treated identically and so any differences reflect the impact of the DMD mutation.

    1. Author Response

      Reviewer #1 (Public Review):

      The current manuscript examined patients with inborn errors of immunity (IEI) using whole exome sequencing (WES) and identified de novo variants (DNVs) associated with the disease. They found 14 genes associated with DNVs, including four novel genes - PSMB10, DDX1, KMT2C, and FBXW11, and conducted a systematic assessment of affected genes.

      Given the level of heterogeneity underlying IEI, the sample size is limited. Although the authors clearly stated this, the analysis of the current manuscript does not add much value to describing genes affected by DNVs. The sample size is small to perform exome-wide evaluation (authors described they did "exome-wide evaluation" in Abstract - line 10 but there is no statistical evaluation to prioritize effect genes). They could go with systems biology approaches, explaining the biological pathway of affected genes or underlying cell types from immune single-cell datasets. As the authors stated that IEI constitutes a large group of heterogeneous disorders, there should be some analysis to explain the functional convergence of affected genes in disease development.

      We believe the term ‘exome-wide evaluation’ might have led to misinterpretation. We used it in the context of reviewing each DNV found in a single patient’s exome outside the diagnostic IEI gene panel (i.e. ‘exome-wide’), instead of reviewing DNVs across all exomes. We have rephrased the sentences containing this term. The main purpose of this manuscript was to identify ‘all’ coding DNVs in each case, and explore whether they include any pathogenic or novel candidate DNVs. Our main purpose was to urge the IEI field to apply trio-based WES more systematically, and share candidate DNVs with the field for further validation.

      As the reviewer points out, our sample size would be too limited to perform systems biology approaches for variant prioritization. The signal-to-noise ratio would be very low, because many genes causing inborn errors of immunity remain to be discovered and the studied group of patients with inborn errors of immunity is very heterogeneous. This means that we would not have the power to investigate potential enrichment or burden of DNVs in specific genes, nor the functional convergence of affected genes or pathways in specific phenotypes. In this study, we aimed to show the additional value of systematic DNV analysis as a method to identify and prioritize candidate variants in individual cases, but ideally we would like to answer other important research questions using computational/statistical approaches in a larger cohort in the future, as has been done in other rare disease fields. The suggestion of the reviewer is helpful, and this approach has been shown to implicate novel pathways enriched in disease for various forms of neurodevelopmental diseases for which tens of thousands of trio-based WES analyses have been performed [9, 10].

      For DNV identification, the authors filtered out variants with ExAC & gnomAD AF > 0.1% or GoNL AF > 0.5%. I think this is too lenient a cutoff for filtering for DNVs. For example, a gnomAD AF of 0.1% corresponds to approximately 200 individuals in the population. Given the filtering parameters (<5 variation reads, <20% variant allele frequency, or low coverage DNVs), they did not use specific filtering metrics to find DNVs, and there might be false-positive variants in the final DNV set. As far as I can find in the manuscript, they used the GATK pipeline from the previous study (REF 29). The GATK UnifiedGenotyper generates a range of filtering metrics to increase specificity in variant filtering. It is very surprising that the authors seem to use only three parameters (variation reads → FORMAT:AD[1]; variant allele frequency → FORMAT:AB? and low coverage → FORMAT:DP? but the authors did not state the cutoff) to filter de novo variants, which leaves the filtering vulnerable to false-positive variant calls.

      The chosen population database fraction cut-offs align with DNV filtering strategies in the literature. We have not chosen a stricter cut-off, in order to avoid missing true positives, since patients with IEI can exhibit late-onset disease, variable penetrance and postzygotic mutations, while still limiting the chance of false-positive findings. For instance, we have reduced local false positives by filtering on allele frequencies in our in-house database and the Dutch population database. Moreover, automated DNV calling required >2% alternate reads in either parent, and variants were prioritized based on prediction scores and annotated immune function. Additionally, and in accordance with this expert reviewer, we have now put a stricter cut-off in place for variation reads (from 5 to 10) to further minimize false-positive findings. Lastly, we visually inspected the final 14 candidate DNVs in IGV and/or Alamut, which supports the validity of the findings. The DNVs reported in our final DNV list (Table 2B) are therefore unlikely to contain false-positive findings.
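
      The read-level and population-frequency criteria discussed in this exchange can be sketched as a simple predicate. This is only an illustrative sketch, not the pipeline used in the study: the class and field names are hypothetical, and the thresholds are the ones quoted above (population AF cut-offs of 0.1% for ExAC/gnomAD and 0.5% for GoNL, at least 10 variant reads after the revision, and at least 20% variant allele fraction). In-house database filtering, parental read checks and visual inspection are not modelled.

```python
from dataclasses import dataclass


@dataclass
class CandidateVariant:
    """Hypothetical container for the per-variant values used by the filter."""
    alt_reads: int      # reads supporting the variant allele in the proband
    total_reads: int    # total read depth at the site in the proband
    exac_af: float      # ExAC allele frequency (fraction, not percent)
    gnomad_af: float    # gnomAD allele frequency (fraction)
    gonl_af: float      # GoNL allele frequency (fraction)


def passes_dnv_filters(v: CandidateVariant,
                       min_alt_reads: int = 10,
                       min_vaf: float = 0.20,
                       max_pop_af: float = 0.001,
                       max_gonl_af: float = 0.005) -> bool:
    """Return True if the variant survives the quoted read and frequency cut-offs."""
    if v.total_reads == 0:
        return False
    vaf = v.alt_reads / v.total_reads  # variant allele fraction in the proband
    return (v.alt_reads >= min_alt_reads
            and vaf >= min_vaf
            and v.exac_af <= max_pop_af
            and v.gnomad_af <= max_pop_af
            and v.gonl_af <= max_gonl_af)
```

      Tightening any threshold (for example `max_pop_af`) trades sensitivity for specificity, which is the trade-off debated by the reviewer and the authors.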

      Reviewer #2 (Public Review):

      The manuscript by Hebert et al., reports on the utility of TRIO-based whole-exome sequencing (WES) in patients who presented as sporadic cases and are suspected of having inborn errors of immunity (IEIs). The authors developed an in-house pipeline for data analysis and used a set of known algorithms to prioritize the impact of genetic variants located mostly in the coding region of proteins. The data analysis was done in two steps; the first step involved the routine WES diagnostic analysis that led to the identification of pathogenic (P) and likely pathogenic variants (LP) in genes already associated with IEIs. The authors claim that this analysis resulted in a likely molecular diagnosis in 19 (~15%) of patients, while an additional 14% of cases were carriers for VUSs or other risk factors in the disease causal genes. As many of these variants are either inherited from one parent or are present as heterozygous (monoallelic) variants in genes associated with recessive diseases, their clinical significance is unclear.

      In the second step, the authors focused on the identification of de novo variants (DNVs), including SNVs, CNVs, and small indels, since these variants are more likely to be deleterious to protein function. The authors identified 136 non-synonymous DNVs, which were then filtered down to the 14 best candidate variants using various in silico tools and database searches. These 14 variants included DNVs in genes previously associated with autoinflammatory diseases, such as CAPS and RELA haploinsufficiency. Three patients were found to carry de novo copy number variants (CNVs) of unknown clinical significance. Finally, several de novo loss-of-function (LoF) variants have been identified in genes that are not yet associated with any IEIs but are good functional candidates. Their potential pathogenicity is further supported by the observation that they are found in genes intolerant to loss of function. Functional validation has been performed only for the patient carrying the novel FBXW11 splice variant. The authors state that the maximum solve rate (i.e., probable molecular diagnosis) in this cohort might be as high as 23%, which is comparable to similar reports of patients with IEIs; however, the reported results do not yet support this conclusion.

      The main conclusion of this study is that TRIO-based WES analysis for DNVs could improve the diagnostic rate and can result in the identification of novel disease-causing genes. TRIO-based sequencing is also preferable when analyzing patients from populations underrepresented in gnomAD and ExAC. As the cost of WES has come down, WES has been increasingly used in the clinical diagnosis of many human disorders. Despite the major progress in the development of novel sequencing technologies and new in silico tools, the diagnostic rate is still below 50%. In summary, this study suggests that despite the identification of over 400 genes associated with IEIs, there are many more genes to be identified and that the heritability of these diseases is very complex.

      We thank the reviewer for the elaborate summary of our study and the suggestions that have helped to further improve the manuscript.

    1. Author Response

      Reviewer #1 (Public Review):

      This is a study that is aimed at understanding the binding mechanism of D-serine to the two different binding lobes of the NMDA receptor. D-serine is a known agonist and binder of the GluN1 ligand-binding domain, but its interaction with the GluN2A is unknown. Using long time-scale conventional molecular dynamics simulations, the researchers observe that D-serine interacts and associates readily with both binding domains, often via protein surface pathways referred to as a guided-diffusion mechanism. As observed previously, free-energy calculations show that D-serine stabilizes the closure of both binding domains. Finally, analysis of the effect of glycans shows that these modifications play a role in further stabilizing the closed state of the ligand-binding domains.

      Amongst this broad and careful analysis, the major finding from this work is that D-serine surprisingly associates with GluN2A, which has been known to bind glutamate to enable activation of the channel. Since the binding of D-serine to GluN2A had not been observed previously, they proposed that D-serine acts as an inhibitor for glutamate at high concentrations. This hypothesis was investigated and supported by electrophysiological experiments, yielding a novel result that presents new interpretations for the field. However, the guided-diffusion mechanism still remains hypothetical and is unclear as to whether this is in fact a driving force, or requirement, for the binding. Specifically, the following questions warrant further investigation:

      1) Specific or non-specific association? It is possible that non-specific association events of ligands to the protein could be an intrinsic artifact of the MD simulations. To investigate this, it would be informative to compare the current results with a negative control simulation where the ligand was replaced with a similar amino acid or molecule that has been verified as a non-binder for NMDAR.

      To address this, we quantified the non-specific association signal by comparing the number of successful binding events to random association (see response to Essential Revisions #4). In theory, any appropriately small amino acid could associate with the conserved arginine of each LBD through its C-terminus (as evidenced by our PMF of glycine bound to GluN2A). However, an amino acid’s ability to remain bound long enough to induce LBD closure is largely dependent on the presence of interactions with the LBD bottom lobe.

      2) Dissociation events? Further clarification is required to understand whether any dissociation events are observed in these simulations to the non-specific sites or the final binding site. If dissociation is not observed, how does this impact the interpretation of the binding mechanisms that characterize only the association events?

      Association and dissociation are both observed and documented in Datasets S2-S4. We added clarification to the text on page 5 about the nature of both processes and how pathways are defined by residues that allow the agonist to enter and leave the binding site. As illustrated in the clustering dendrograms, association (even-numbered events) and dissociation (odd-numbered events) pathways are present in all clusters.

      3) Testing the hypothesis of guided diffusion. It is proposed that guided diffusion drives serine binding to its site. This would imply that the residues on this path are important, and if mutated, would decrease the association rate and the ability to compete with glutamate. Additional electrophysiological experiments or direct binding experiments would be useful in understanding the relevance of guided diffusion in the ligand-binding mechanism of NMDARs.

      To address this point, we performed additional TEVC experiments generating D-serine dose-response curves for GluN1a Arg694Ala and Arg695Ala, and GluN2A Arg692Ala and Arg695Ala. The curves for both GluN2A mutants support our guided diffusion mechanism, as they lowered the D-serine inhibition potency. (These mutants likely also alter glutamate binding, but since D-serine and glutamate bind through the same residues, it is not possible to separate out individual contributions.) The GluN1a mutants did not show altered behavior, supporting the increased diffusiveness of D-serine binding to GluN1 compared to GluN2A. These additional findings are included in the main text on page 12 and in Fig. 4D.

      Reviewer #2 (Public Review):

      In this manuscript, Yovanno et al. did a comprehensive mechanistic study of D-serine binding to NMDAR ligand-binding domains (LBDs). The framework of the current investigation is built upon this research group's previous studies of the binding of the NMDAR agonists glutamate and glycine. Using an aggregated 51 microseconds of all-atom MD simulations of spontaneous binding, the authors applied rigorous pathway similarity analysis to cluster the paths through which D-serine enters the LBDs from the bulk solution. The most interesting and unexpected result from this study is the spontaneous binding of D-serine to the GluN2A LBD, which was previously known to be the glutamate binding site.

      By computing the overlap coefficient for all binding pathways, the authors concluded that D-serine binds to the GluN2A LBD through "guided" diffusion, but to GluN1 through random diffusion (the clustered pathways comprise random contacts rather than specific, conserved residue contacts). A "guided" binding pathway further suggests that agonist binding could be sensitive to conformational change within and around the binding pocket, and vice versa.

      To investigate whether D-serine binding events are able to modulate the GluN2A LBD conformation, the authors then computed a series of LBD conformational free energy landscapes (2D-PMF) using 2D-umbrella sampling simulations. The 2D-PMF profiles confirmed that D-serine stabilizes the closed LBD conformation, just like glutamate. Because the D-serine 2D-PMF shows a metastable state that was absent in glutamate 2D-PMF, the authors argue that D-serine may not stabilize the closed conformation to the same extent as glutamate. Likewise, based on the 2D-PMF of GluN1 LBD, the authors suggest that D-serine has a higher potency than glycine, in part due to its ability to more strongly stabilize a closed LBD conformation.

      The simulations above generated the hypothesis that D-serine could function as a competitive antagonist of glutamate at high concentrations. This computationally derived hypothesis is beautifully tested by the authors' dose-response curves and the Schild plot.

      One question that would merit further clarification is whether the binding affinity of D-serine to the two LBDs is stronger or weaker in comparison with glutamate and glycine. The difference in agonist potency could be due to the difference in binding affinity and/or efficacy. Stabilizing the closed LBD conformation may indicate the efficacy of the agonist, but affinity (Kd) will still play a role in the final potency.

      Indeed, as Reviewer 2 pointed out, affinity should play a role, since the D-serine inhibition here is attributed to the competitive binding of D-serine against glutamate, as we showed with our Schild plot. The bona fide binding site for D-serine is the GluN1 LBD, where D-serine binds more strongly than glycine (Furukawa/Gouaux 2003). In the GluN1 LBD, D-serine is a full agonist. The binding of D-serine to the GluN2A LBD (the finding here) is substantially weaker (mM range) than that of glutamate (~1 µM).
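
      To make the affinity argument concrete, the following minimal sketch uses the textbook single-site competitive binding model (Gaddum equation), which is also the model underlying a Schild analysis. It is not taken from the manuscript: the function names are hypothetical, the glutamate Kd of 1 µM is the approximate value quoted above, and the D-serine Kd of 1 mM is an assumed placeholder standing in for "mM range".

```python
def glutamate_occupancy(glu: float, dser: float,
                        kd_glu: float = 1e-6, kd_dser: float = 1e-3) -> float:
    """Fraction of sites occupied by glutamate when D-serine competes for the same site.

    All concentrations and dissociation constants are in mol/L.
    kd_dser = 1 mM is an assumed illustrative value, not a measured one.
    """
    return glu / (glu + kd_glu * (1.0 + dser / kd_dser))


def dose_ratio(dser: float, kd_dser: float = 1e-3) -> float:
    """Schild dose ratio: fold increase in glutamate needed to restore the same occupancy."""
    return 1.0 + dser / kd_dser
```

      Under these assumptions, D-serine at a concentration equal to its Kd doubles the glutamate concentration needed for half-maximal occupancy, and a plot of log(dose ratio - 1) against log[D-serine] has unit slope, which is the signature of competitive antagonism in a Schild plot.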

      While a glycosylated GluN1/GluN2A dimer was used for the majority of MD simulations, the authors also checked the "reality" by mapping the pathway residues onto the NMDAR heterotetramer structure. The role of glycans in D-serine binding pathways was further investigated by conducting an additional 30 microseconds simulations of the non-glycosylated dimer. It was found that glycans introduced small kinetic "traps" that slow down the binding process. Glycan was also found to stabilize LBD closure from 1D-PMF profiles.

      The detailed mechanistic insight and D-serine's inhibitory effect on NMDAR, unraveled by this study, may play an important role in therapeutic strategies, and thus is likely to have a broad impact in the field.

    1. Author Response

      Reviewer #2 (Public Review):

      Dr Muktupavela et al. present a novel likelihood-based method for inferring the strength of natural selection and basic demographic parameters, such as mobility rates, from time-stamped ancient DNA data in a spatially explicit framework. This is an elegant method that is, in many ways, a natural extension to a spatial setting of previous work in the field, which has focussed mainly on inferring natural selection from temporal data. In addition to the simplest scenarios of isotropic dispersal, the authors also consider models with different dispersal rates in the longitudinal and latitudinal directions, as well as biased dispersal. Selection strength, dispersal rates and bias are assumed to be constant across space and piecewise constant in time (but it would be very straightforward to relax these assumptions). The bias component of the model is an interesting addition that, in principle, allows one to broadly account for the effect of long-range dispersals, such as the spread of agriculture across Europe from the Fertile Crescent and Bronze Age migrations from the Asian steppes, on the spatiotemporal pattern of allele frequencies.

      Although the main idea is clearly communicated, there is room for improvement of the manuscript regarding investigating the properties of the model and presenting the results. Notably, the authors assume that the age of the mutation is known and correct in their assessment of the performance of the model on simulated data (which may inflate the reported accuracy of the reconstructions) and use estimates from the literature when the method is applied to empirical data. Although it is necessary to specify the age of the allele, this could easily have been treated as a free parameter in the framework. I would like to see a discussion of why the method may not be suitable for this, and a more systematic test of the sensitivity of the method to misspecification of the age (which could be very substantial, especially if the population history has been complex). In the cases where the model is run for different allele age estimates in the manuscript, such as for the lactase persistence scenario, the authors should present the (approximate maximum) likelihoods for the different scenarios in the text.

      An explanation as to why we do not infer the age of the allele (see text below) has been added to the main text under section “Parameter search” (lines 531-533). Briefly, we chose to construct our method in a way that uses the age of the allele as an input parameter rather than estimating it, since there are multiple equally possible solutions with various combinations of allele age and selection coefficient values. This is demonstrated in Appendix A3.

      We also added a description of log-likelihood values when we vary the allele ages under section “Robustness of parameters to the assumed age of the allele” in lines 324-329, the results of which are presented in supplementary Figure 6–Figure Supplement 9 and Figure 8–Figure Supplement 6.

      Briefly, we assessed the likelihood of the best fitted models by varying the ages of the rs4988235(T) and rs1042602(A) alleles. In the case of the rs4988235(T) allele, the allele age used in this study (7,441 years) gives the most likely solution among the explored ages. In the case of the rs1042602(A) allele, we found that there are multiple nearly equally likely ages among ages of at least 15,000 years.

      A further weakness of the method is that it uses the Fisher information matrix to estimate uncertainty. While this works well if the posterior distribution is narrow, it can severely underestimate the uncertainty if this is not the case, in particular if the distribution is non-Gaussian in the tails. It would be better, but perhaps computationally prohibitively expensive, to report Bayesian posterior distributions for the parameters as well as Bayes factors that could be used to formally compare the fit of different models to the data.

      We agree with the reviewer that implementing Bayesian parameter fitting would likely provide a more robust understanding of the uncertainty of the estimates as well as an opportunity to formally compare different models using Bayes factors (although at the cost of an increase of computational intensity). Changing the inference engine of our method in this manner (while keeping it computationally feasible) is something we are currently investigating and hope to release as part of a future Bayesian version of our method. In the meantime, we have added a discussion of this caveat in our manuscript (sixth paragraph).
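
      The kind of uncertainty estimate under discussion can be illustrated with a toy one-parameter example. The sketch below is not the method's implementation: it uses a simple binomial likelihood (an assumption chosen for illustration), computes the observed Fisher information as the negative second derivative of the log-likelihood at the maximum, and converts it into a standard error and a Wald-type 95% interval. All names and the example counts are hypothetical.

```python
import math


def binomial_loglik(p: float, k: int, n: int) -> float:
    """Log-likelihood of success probability p given k successes in n trials (constant dropped)."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def observed_information(loglik, theta_hat: float, h: float = 1e-4) -> float:
    """Negative second derivative of loglik at theta_hat, by central finite differences."""
    second = (loglik(theta_hat + h) - 2.0 * loglik(theta_hat) + loglik(theta_hat - h)) / h ** 2
    return -second


k, n = 30, 100                      # hypothetical data
p_hat = k / n                       # maximum-likelihood estimate
info = observed_information(lambda p: binomial_loglik(p, k, n), p_hat)
se = 1.0 / math.sqrt(info)          # asymptotic standard error
wald_ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
```

      The interval is symmetric by construction, which is why this approach can understate uncertainty when the likelihood surface is skewed or heavy-tailed, as the reviewer notes.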

      Finally, although the rationale behind the model is clearly described, the detailed descriptions of the model and the numerical implementation have some shortcomings. First, there are typos in the appendix where the continuous model is derived from a discrete approximation (the right-hand side of Eq. (8) should not contain the term p(x,y,t) for it to be consistent with Eqs. (9) and (10)). Second, any differential equation model is incomplete without specifying the boundary conditions. This is especially important here as the assumption of uniform diffusion and advection on the grid is violated by the constraints imposed by the land mask, where the population is assumed to vanish on water areas (suggesting an absorbing boundary condition). Further down in the methods, details are also missing on how Eq. (10) was solved numerically, merely that it was discretized at a certain resolution.

      Looking more closely at Eq. (8), we believe that the term p(x,y,t) should be there, since it is moved to the left-hand side of Eq. (9) by simple algebraic rearrangement of the terms of the equation.

    1. Author Response

      Evaluation Summary:

      1) The paper is well written, and its style/formatting are optimal. The baseline signature moderately predicted outcome, and the data after one cycle further improved the algorithm, though this decreases its utility as a pure predictive tool.

      We thank the editor and the reviewers for their positive feedback regarding the style and formatting of the manuscript. We concur that longitudinal sampling of blood, before and after one cycle of treatment, renders the predictive signature marginally more laborious to generate. In an ideal setting, we would be able to solely generate a predictive signature based on baseline characteristics - unfortunately such a test does not yet exist.

      In this study, we propose adding an easily obtainable blood sample after the first cycle of treatment to significantly improve our ability to predict response. Because blood is easy to sample, we believe that blood biopsies will be key as the search for predictive biomarkers expands. Since the inception of our study, numerous impactful studies assessing PBMCs have been published, mainly in the context of response to immune checkpoint blockade 1-6. Given that our risk signature is now validated in an immunotherapy trial (EACH trial NCT03494322), we are even more confident in our approach of using longitudinal sampling to develop a predictive model of response to systemic therapy. The trial design of the validation study is now included as supplementary material (Figure 2A) in the manuscript.

      2) Signatures were not prospectively validated on an independent cohort; the algorithm was developed around a first-line therapy that is no longer considered to be the standard of care for HNSCC; and, while most of the conclusions are supported by the data, some of the caveats (such as the lack of a validation cohort, key in predictive biomarker development), are not addressed.

      Thank you. We will address this comment in two parts – (a) the validation cohort and (b) the status of the EXTREME treatment regimen in the original cohort. In this revised version, we have validated our risk signature in an independent cohort of patients who received cetuximab and avelumab (anti-PD-L1) in a single-arm, phase 2 clinical trial setting. Beyond serving purely as a validation cohort, it also demonstrates the applicability of our model in predicting response to immune checkpoint blockade-based therapy, in keeping with contemporary advances in systemic treatment for HNSCC. The risk signature strongly predicted response in the new independent cohort, giving us more confidence in our model’s ability to predict outcome for systemic therapy regimens beyond cytotoxic chemotherapy and cetuximab. Figure 5B shows the strong correlation between the risk signature and disease outcome in the validation cohort (Kendall rank correlation, τ = 0.725, p = 0.0181).
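
      For readers unfamiliar with the statistic, a Kendall rank correlation of the kind reported above can be sketched as follows. This is a minimal, dependency-free illustration of the tau-b variant (which adjusts for ties); whether the authors used tau-b or another variant is not stated, and the example values are hypothetical and are not study data.

```python
from itertools import combinations
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: counted in neither denominator term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + tied_x_only)
                      * (concordant + discordant + tied_y_only))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom


if __name__ == "__main__":
    # Hypothetical risk scores and ordinal outcome grades, for illustration only.
    risk_score = [0.2, 0.5, 0.9, 1.4, 2.0]
    outcome_grade = [1, 2, 2, 3, 3]
    print(kendall_tau_b(risk_score, outcome_grade))
```

      A rank-based measure is a natural fit when one variable is an ordinal response category, since it depends only on the ordering of the values and not on their spacing.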

      Secondly, the EXTREME regimen (platinum/5-FU/cetuximab) remains a first-line standard of care treatment in the UK and European countries for HNSCC patients with negative PD-L1 status (CPS score <1) which account for around 15% of all HNSCC patients 7. While the US Food and Drug Administration (FDA) approved pembrolizumab in combination with chemotherapy as first-line treatment regardless of PD-L1 expression and pembrolizumab alone for patients with PD-L1-expressing tumours (CPS ≥1), the European Medicines Agency (EMA) approved pembrolizumab with or without chemotherapy only for patients with a CPS ≥1, and this has been highlighted in the European Society for Medical Oncology (ESMO) and the UK National Institute for Health and Care Excellence (NICE) guidelines 8 and (https://www.nice.org.uk/guidance/ta661/chapter/1-Recommendations).

      Furthermore, chemotherapy with EXTREME regimen is standard of care for patients with contraindications to immune checkpoint inhibitors such as autoimmune disease 8. It can also be considered as second-line treatment in patients who only received pembrolizumab monotherapy in the first line setting.

      3) However the overall impact in the field of this work seems limited by a number of factors, including that the authors focused on immune cell subpopulations and exosomes, which narrows the scope (no cytokines or other biomarkers were included).

      Thank you. We selected a finite number of covariates based on a few factors – (a) published literature, (b) previous data generated by the group and (c) the applicability of the findings to the clinic. Instead of an exploratory article in which we could generate an infinite number of covariates by a technique similar to RNA sequencing, we opted for a select set of covariates. This hypothesis-driven approach generated a strong signature that is now validated across two trials. The focus on immune populations is driven by our hypothesis that systemic changes in the PBMCs are indicative and reflective of the status of the intra-tumoral immune response. In the revised manuscript we used a custom immune-focused imaging mass cytometry antibody panel to probe tissue sections from 9 patients. We now show that the key populations driving the predictive model in the periphery are not only reflected at the tumoral level, but these disparate immune cell subpopulations also interact. See Figure 6, in which we use a machine learning approach to segment cells and assign them to distinct immunological subpopulations. We found that the peripheral monocyte population strongly correlated with a tumoral macrophage population having a similar marker expression pattern. We found that the peripheral central memory CD8 T cells inversely correlated with tissue resident memory T cells. The tissue presence of both these cells correlated positively with outcome. Most strikingly, these two populations were most likely to co-localize with each other at the tissue level, at a frequency of almost double the second highest co-localization. Data on the nature of the interplay between peripheral systemic immunity and intra-tumoral immunity is novel and rarely exists in the literature outside the scope of in vivo animal models. Here we describe these interactions using human patient samples treated with a clinically relevant therapy.

      Given the limited amount of patient sera collected in the trial, we opted to perform exosome analysis on markers known to impact the response to the anti-EGFR/HER3 treatment/immune responses. This was in line with our lab's work to use exosome FRET-FLIM as a surrogate for tissue FRET-FLIM, which we originally used to discover a potential dimer-dependent mechanism for anti-EGFR treatment resistance in neoadjuvant breast cancer patients 9, and more recently published on a colorectal patient sample cohort from the COIN study 10. While exosome EGFR-HER3 heterodimer failed to reach significance in our risk signature, it was close, as depicted in the Kaplan-Meier curve from Figure 3C. We of course acknowledge the potential added benefit of having serum cytokine array analysis. While that was not feasible for this study, our group now aims to ensure that extra patient serum samples are bio-banked for such analysis from ongoing and future trials.

      Reviewer 1 (Public Review):

      1) For this study to be significant, one would want to see a marked improvement over current biomarkers, in a robust and generalizable population. Unfortunately, this study falls short in these respects. First, the authors do not adequately discuss the prior literature. Even a fairly crude and old-fashioned blood-based biomarker such as neutrophil:lymphocyte ratio has quite good predictive and prognostic capability in R/M HNSCC

      Thank you for your suggestion. We have expanded the discussion to include an overview of current biomarkers. We also compared the predictive power of neutrophil:lymphocyte ratio (NLR) from two published meta-analyses to our risk signature 11,12. We used the median risk score to divide our original patient cohort into a high and low risk group. We then calculated the HRs and CI for both signatures at pre-treatment alone (HR = 4.1397 [95% CI: 1.975 - 8.676]) and for the combined signature (HR = 2.574 [95% CI: 1.336 - 4.96]). Both were higher than the published literature whilst only using the median as the cutoff. Mascarella, Mannard et al. published “NLR greater than the cutoff value was associated with poorer OS and DSS (HR 1.69; 95% CI 1.47-1.93; P < .001 and HR 1.88; 95% CI 1.20-2.95”, and Takenaka, Oya et al. published: “The combined hazard ratio for OS in patients with an elevated NLR (range 2.04-5) was 1.78 (confidence interval [CI] 1.53-2.07”. We realize that we are stratifying patients based on PFS and not overall survival, which is an inherent limitation of the study, but we humbly believe that the added predictive value of the signature relative to the existing literature is too large not to be impactful.

      2) It is not clear to me that there is a compelling need to do better -- given that existing predictive biomarkers based on clinical nomograms or NLR are actually used in practice.

      We agree that clinical nomograms (based on clinicopathological factors) have been shown to be predictors of outcomes in HNSCC 13. However, whilst these models have been validated as prognostic biomarkers for overall survival and/or disease specific survival, they are not currently recommended in the cancer treatment guidelines nor universally used in the clinic. With the further validation performed on a cohort treated with an immune-checkpoint inhibitor, our multimodal signature describes new data to help understand the range of treatment responses and predict outcomes and could be used to guide treatment intensification, continuation and/or early termination in clinical practice or incorporated into future clinical trials. Moreover, in the resubmission we extend our work from predictive biomarker research to developing a better understanding of the interplay between the peripheral immune response to intra-tumoral immunity which we discuss in this letter as part of our response to the public evaluation summary part 3. Given the recent surge in literature focused on tumor immunity with the increased use of immune checkpoint blockers, we believe our work offers a strong contribution to the few papers in circulation that have attempted to link tumor immunity from the systemic level to the tumor tissue level.

      3) A large number (31 of 87) of patients were not included due to lack of biomaterials. No analyses have been performed to examine the characteristics of these patients. It is unlikely that the collection of biomaterials has no correlation with disease characteristics, prognostic features, outcomes, or the analytes in this study. This exclusion -- akin to unequal censoring in clinical trials -- is likely to significantly impact results. Given that the population enrolled in a phase II trial, and that sub-population of patients who survive long enough and are feeling well enough to submit to large volume blood draws on trial, would not necessarily represent the real world population of R/M HNSCC patients, a broader population is needed to justify conclusions about this assay having robust predictive value.

      We appreciate the reviewer’s concern about potential skew in the data arising from the patient selection criteria. The median PFS of our 56-patient cohort used in the generation of the risk signature was 5.48 months, as shown in supplementary table 1 in the original submission. This is in line with real-world treatment outcomes for the EXTREME regimen (cetuximab with platinum-based therapy) as first-line therapy for Recurrent/Metastatic Squamous Cell Carcinoma of the Head and Neck, which was reported as 5 months by Sano et al. in 2019 14. It is also very similar to the median PFS observed in the DIRECT study 15.

      4) It is unclear why OS as a hard endpoint was not analyzed here. No explanation is provided, other than OS was not available, a statement that is difficult to understand, given that PFS was available, and overall survival is a component of PFS.

      Thank you. We admit that the absence of overall survival is an inherent limitation of the study. In the process of submitting this revision, we have once again requested this dataset from the sponsoring pharmaceutical company but were informed that they are unable to provide it. This is because reorganization of funding priorities within the company precludes them opening datasets from an already-published clinical trial. We are equally disappointed to not be able to obtain this data, but firmly believe that the ability of the signature to predict PFS (the primary endpoint of the trial, untainted by subsequent lines of treatment), as well as cross-validation against the contemporary EACH trial, is a testament to the signature’s strength.

      There is no validation set for the biomarker. The biomarker was trained and cross-validated using Bayesian techniques to reduce overfitting. This is a valid approach for training and cross-validation, but for the biomarker to be testable and interpretable, it requires assessment in an independent dataset. There is no statistical technique that I am aware of that generates informative biomarkers without an independent validation dataset.

      We completely agree with the reviewer regarding the need to obtain a validation set. Obtaining patient samples from a similar cohort was difficult, but we managed to validate the signature on a set of patients treated with an anti-PD-L1 monoclonal antibody in combination with cetuximab. Furthermore, the validation was performed using a limited number of covariates that were identified in the risk signature by the Bayesian model. These immune populations can be obtained by running a limited set of markers on flow cytometry. We were very happy to see that these limited immune-based covariates strongly correlated with a worse disease response in an independent cohort using a different treatment modality. This furthers our hypothesis that changes in the immune populations are key to understanding response to systemic therapy. Encouraged by the data from the validation cohort, we extended our analysis to the tissue from a total of 9 patients from the test cohort. Using imaging mass cytometry, we were able to identify how immune populations are mirrored at the tumoral level, opening the horizon for new research. The data for the validation set are copied into this letter in response to point 2 of the public evaluation summary.

    1. Author Response

      Reviewer #1 (Public Review):

      Tarasov and colleagues provide data that extensively phenotypes TGAC8 mice, which exhibit a cAMP-mediated increase in cardiac workload prior to developing heart failure. The authors confirm data from prior studies, showing increased cardiac output mediated by changes in heart rate with similar ejection fraction. 

      The above is slightly incorrect as stated. Our results section stated that HR and EF were increased in TGAC8, but that stroke volume did not differ by genotype. Thus, the 30% increase in cardiac output in TGAC8 was attributable to the increased HR.

      The study is overall well-planned and the amount of data presented by the authors is impressive. The work nicely incorporates animal-level physiology (echocardiography data), tests for known canonical markers of hypertrophy, and then delves into an unbiased analysis of the transcriptome and proteome of LV tissue in bulk. The techniques and analyses in the study are adequately executed and within the realm of expertise of the Lakatta laboratory. This study is a necessary and crucial first step to extensively phenotype this mouse line and generate hypotheses for further work. 

      Reviewer #2 (Public Review): 

      Tarasov et al. present an impressive amount of work in their in-depth assessment of a murine model of chronic stress in a transgenic line with constitutively active AC/cAMP/PKA/Ca2+ signaling that spans cardiac structure, function, cellular architecture, gene and protein expression, mitochondrial function, energetics and more. Exploration of multiple cellular pathways throughout the manuscript and as summarized in Figure 16 helps characterize this murine model and serves as a first step in using this model to understand the effect of chronic stress on the heart. The conclusions of the manuscript are well-supported by the data, and I have the following comments:

      Strengths: 

      1. The authors present echocardiographic, histologic, electrocardiographic, neurohormonal quantification, protein synthesis/degradation, mitochondrial, gene and protein expression profiling, and metabolism data in their assessment of this model. 

      2. The verification of increased transcripts of AC and PKA activation in this transgenic line provided validation for the model. 

      3. The pathway analyses for both gene and protein expression profiling help support the authors' claim of the importance of differences noted in the various pathways between the transgenic line and controls.

      4. The investigators posit that there is decreased wall stress and adequate energy production due to a shift in metabolism. 

      As written, this statement does not exactly reflect what we had intended to communicate in the paper. We did not posit that LV wall stress was reduced in TGAC8, but rather that it must be reduced compared to WT on the basis of Laplace’s Law, because of a substantial reduction of LV cavity volume. We also did not posit that energy production is due to a shift in metabolism, but rather that adaptations in energy metabolism resulted in adequate energy production to meet what appeared to us to be a marked increase in energy demand in TGAC8 vs WT, based on our observation that transcriptome and proteome gene ontology (GO) terms that differed in TGAC8 vs WT covered nearly all biological processes and molecular functions within nearly all compartments of the LV myocardium.
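
      The Laplace's Law argument can be made concrete with a minimal sketch. It assumes the thin-walled sphere form of the law (wall stress = pressure × radius / (2 × wall thickness)), which is a common simplification and not necessarily the formulation the authors had in mind; the numbers below are hypothetical and are not measurements from the TGAC8 or WT mice.

```python
def laplace_wall_stress(pressure, radius, wall_thickness):
    """Wall stress from the thin-walled sphere form of Laplace's Law.

    stress = pressure * radius / (2 * wall_thickness)
    Units are the caller's responsibility (e.g. mmHg and mm give mmHg).
    """
    if wall_thickness <= 0:
        raise ValueError("wall_thickness must be positive")
    if pressure < 0 or radius < 0:
        raise ValueError("pressure and radius must be non-negative")
    return pressure * radius / (2.0 * wall_thickness)


if __name__ == "__main__":
    # Hypothetical values: same pressure and wall thickness, smaller cavity radius.
    reference = laplace_wall_stress(pressure=100.0, radius=2.0, wall_thickness=1.0)
    smaller_cavity = laplace_wall_stress(pressure=100.0, radius=1.6, wall_thickness=1.0)
    print(reference, smaller_cavity)
```

      With pressure and wall thickness held fixed, stress scales linearly with cavity radius, so a smaller cavity implies lower wall stress under this simplification.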

      These findings would suggest that this model would be suitable for that of an athlete's heart, which is characterized by thickened left ventricular walls without a compromise in function. 

      Although the chronic increase in cardiac output in TGAC8 heart simulates that of an athlete’s heart during exercise, LV cavity volume at rest is larger in the endurance trained heart and this is associated with bradycardia. In these aspects, the TGAC8 heart differs from the endurance trained heart (perhaps because it does not have sufficient rest periods between bouts of exercise, as does the endurance trained heart). In the discussion section of the manuscript, we noted several features that differed between the TGAC8 vs the endurance trained heart. 

      However, the mice do develop heart failure after 1 year without a sense of mechanism despite the wealth of data provided. Are the authors able to comment on what changes described in this study of this transgenic line may be deleterious in the long run? 

      Heart failure in the long run was first described in the TGAC8 mouse by Mougenot et al. (ref 10 in our manuscript), who performed numerous biochemical and biophysical measurements in TGAC8 and WT and attributed the heart failure to a manifestation of accelerated heart aging. We are in the midst of conducting a longitudinal study of cardiac structure and function in the TGAC8 vs WT as these mice age, along with additional non-biased multi-omics analyses, in order to get an overview of which of the adaptive pathways that are activated in the TGAC8 heart at 3 months of age falter with advancing age, and of how changes in these pathways relate to the altered cardiac structure and function of the TGAC8 as age advances. Following that, we will focus on each of these pathways employing detailed mechanistic analyses. Our provisional hypothesis is that while AC8 activity will continue to be increased as age advances, its downstream signaling will begin to fail due to age-associated changes in proteostasis and in the expression of proteins, including those involved in energy metabolism.

      Weaknesses: 

      1.  As acknowledged by the investigators, this is a hypothesis-generating rather than hypothesis-testing study.

      Yes, we used a systems approach at first, in order to “open our eyes” so that we could get an overview of numerous changes that might have occurred in the TGAC8 heart in order to generate hypotheses that could later be tested by others and by us.

      2.  The investigators posit that there is decreased wall stress and adequate energy production due to a shift in metabolism. These findings would suggest that this model would be suitable for that of an athlete's heart, which is characterized by thickened left ventricular walls without a compromise in function. However, the mice do develop heart failure after 1 year without a sense of mechanism despite the wealth of data provided. Are the authors able to comment on what changes described in this study of this transgenic line may be deleterious in the long run? 

      We have addressed these comments above in our response to your comment #4 under strengths.

      3.  Figure 5B is referenced to support the claim regarding beta adrenergic receptor desensitization, but the data show catecholamine levels in tissue. I would have expected receptor expression analysis to suggest up/downregulation of receptors at the membrane to support this claim. 

      Beta adrenergic receptor desensitization can occur due to changes in molecules that inhibit signaling, acting either at the receptor or downstream of it, in the absence of changes in receptor number. Here is how we summed this up in our manuscript: “Numerous molecules that inhibit βAR signaling, (e.g. Grk5 by 2.6 fold in RNASEQ and 30% in proteome; Dab2 by 1.14 fold in RNASEQ and 18% in proteome; and β-arrestin by 1.2 fold in RNASEQ and 14% in proteome) were upregulated in the TGAC8 vs WT LV (Table S.3, S.5 and S.9), suggesting that βAR signaling is downregulated in TGAC8 vs WT, and prior studies indicate that βAR stimulation-induced contractile and HR responses are blunted in TGAC8 vs WT.8,11… A blunted response to βAR stimulation in a prior report was linked to a smaller increase in L-type Ca2+ channel current in response to βAR stimulation in the context of increased PDE activity.13, 14 WB analyses showed that PDE3A and PDE4A expression increased by 94% and 36%, respectively, in TGAC8 vs WT, whereas PDE4B and PDE4D did not differ statistically by genotype (Figure 16-supplement 1 A). In addition to mechanisms that limit cAMP signaling, the expression of endogenous PKI-inhibitor protein (PKIA), which limits signaling downstream of PKA, was increased by 93% (p<0.001) in TGAC8 vs WT (Table S.3). Protein phosphatase 1 (PP1) was increased by 50% (Figure 16-supplement 1 A). The Dopamine-DARPP-32 feedback on cAMP signaling pathway was enriched and also activated in TGAC8 vs WT (Figure 15), the LV and plasma levels of dopamine were increased, and DARPP-32 protein was increased in WB by 269% (Figure 16-supplement 1 A).

      Thus, mechanisms that limit signaling downstream of AC-PKA signaling (βAR desensitization, increased PDEs, PKI inhibitor protein, and phosphoprotein phosphatases, and increased DARPP32, cAMP (dopamine- and cAMP-regulated phosphoprotein)) are crucial components of the cardio-protection circuit that emerge in response to chronic and marked increases in AC and PKA activities (Figure 4 C, F).” 

      4. Changes in ion channel (e.g. KCNQ1 and KCNJ2) gene and protein expression were described but not validated by assessment of change in function. 

      Reviewer #3 (Public Review): 

      Tarasov et al have undertaken a very extensive series of studies in a transgenic mouse model (cardiomyocyte-specific overexpression of adenylyl cyclase type 8) that apparently resists the chronic stress of excessive cAMP signaling for around a year or so without overt heart failure. Based on the extensive analyses, including RNAseq and proteomic screening, the authors have hunted for potential "adaptive" or "protective" pathways. There is a wealth of information in this study and the experiments appear to have been carefully performed from a technical viewpoint. Many interesting pathways are identified and there is plenty of information where additional experiments could be designed. 

      General comments 

      1. Ultimately, this is a descriptive and hypothesis-generating study rather than providing directly proven mechanistic insights.

      As noted in response to Reviewer #2: “Yes, we used a systems approach at first, in order to “open our eyes” so that we could get an overview of numerous changes that might have occurred in the TGAC8 heart in order to generate hypotheses that could later be tested by others and by us.”

      -Given several prior studies reporting a detrimental effect of chronically increased cAMP signaling, what is it that is different in this model? Is it something specific about AC8? Is it to do with when (in life) the stress commences? 

      We believe it is, at least in part, due to something specific about the effects of the markedly increased activity of AC8 per se, because adenylyl cyclase signaling impacts nearly all aspects of our current knowledge of cell biology. Thus, due to the marked increase of AC and PKA activation in the TGAC8 heart, the transcriptome and proteome gene ontology (GO) terms that differ in TGAC8 vs. WT covered nearly all biological processes and molecular functions within nearly all compartments of the TGAC8 LV myocardium.

      - Is the information herein relevant to stress adaptation in general or is it just something interesting in this specific mouse model?

      In our opinion, the AC8 mouse model is very relevant to stress adaptation in general, but this broad view has hardly ever been realized previously in the literature, because of the reductionist nature (by necessity) of mainstream biomedical research. For example, reports on cardiac-specific overexpression of AC5 and AC6 never provided a broader view of these mice and were focused only on a limited number of traits, i.e., arrhythmogenesis, chronic pressure overload, contraction (Am J Physiol Heart Circ Physiol. 2015 Feb 1;308(3):H240-9; Am J Physiol Heart Circ Physiol. 2010 Sep;299(3):H707-12; Clin Transl Sci. 2008 Dec;1(3):221-7; Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9986-90; Am J Physiol Heart Circ Physiol. 2013 Jul 1;305(1):H1-8).

      None of the pathways that are apparently activated were directly perturbed so their mechanistic role requires further study.

      We agree and have entitled a section of our discussion “Opportunities for Future Scientific Inquiry Afforded by the Present Results” to address this plainly.

      Specific 

      1. The strain of the mice and their sex needs to be stated as well as the exact age at which the various assays were performed.

      All assays were performed on 3-month-old males. This information was inadvertently not directly stated in the original submission.  

      2. The hearts of the Tg mice have more cardiomyocytes but which are smaller. Since there is no observed increase in proliferation of cardiomyocytes, how (or when) did this increase in cell number occur?   

      It is likely that an increase in number of cardiomyocytes may have occurred during the embryonic stage of development (8.5 dpc), when AC8 expression begins. Since submitting our manuscript we have found that the expression level of human AC8 (the type of AC8 employed in this transgenic model) increases markedly during the embryonic period when compared to endogenous AC8 and remains elevated in both the fetal and perinatal periods. 

      3. While the mice do not show an increased mortality up to 12 months of age, HR/CO/EF are poor indices of contractile function. Data on end-systolic elastance or perhaps echo-based LV strain indices which will be relatively load-independent would be useful.

      Numerous comprehensive hemodynamic measurements have been performed previously on this mouse. For example, Mougenot et al. (Ref 10 in our manuscript), based on invasive hemodynamics analysis, concluded that contractile function in the TGAC8 heart was increased at both 2 and 12 months of age. But Doppler imaging of the heart in conscious mice unmasked myocardial dysfunction, indicated by a reduction in systolic strain rate in both old TGAC8 and WT littermates. This is why they attributed the heart failure in TGAC8 at 12 months of age to a manifestation of accelerated aging.

      We agree with your comment that end-systolic elastance ought to be measured in the TGAC8; end-diastolic elastance and effective arterial elastance should also be measured in order to quantify diastolic function and heart energetic coupling in the TGAC8.

      4.  Quite a lot of conclusions are made relating to metabolism. However, this is entirely based on gene expression or protein levels. Given the substantial role of allosteric regulation in metabolic control, as well as the interconnectedness of metabolic pathways, ultimately any robust conclusions need to be based on an assessment of activity of key pathways. 

      We concur and have described some of the types of metabolic assessments in the last section of our discussion “Opportunities for Future Scientific Inquiry Afforded by the Present Results”: “… precisely defining shifts in metabolism within the cell types that comprise the TGAC8 LV myocardium via metabolomic analyses, including fluxomics.97 It will be also important that future metabolomics studies elucidate post-translational modifications (e.g. phosphorylation, acetylation, ubiquitination and 14-3-3 binding) of specific metabolic enzymes of the TGAC8 LV, and how these modifications affect their enzymatic activity”.

    1. Author Response

      Reviewer #1 (Public Review):

      In their manuscript, these authors present a novel geostatistical framework for modelling the complex animal-environment-human interaction underlying Leptospira infections in a marginalised urban setting in Salvador, Brazil.

      In their work, the authors combine human infection data and the rattiness framework of Eyre et al. (Journal of the Royal Society Interface, 2020). They use seroconversion defined as an MAT titer increase from negative to over 1:50 or a four-fold increase in titer for either serovar between paired samples from cohort subjects. Whereas this is a commonly used measure of infection, the work would benefit from answering the question of how robust the results are to this definition of seroconversion.

      Thank you for your comment. We have acknowledged this on line 534 in the discussion by adding the following text: “A possible limitation of this study is the titre rise cut-off values used for classifying seroconversion and reinfection in the cohort that determine the sensitivity and specificity of the infection criteria. However, these criteria were used because they are the standard definitions for serological determination of infection that are commonly applied for leptospirosis and a wide range of other infections, and they enable the comparison of results with other previous leptospirosis studies.”
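
      The infection criterion discussed here can be written down as a small rule, which also makes it easy to see how a robustness check would vary the cut-offs. The sketch below is an illustration only and is not taken from the authors' published R code. Titres are assumed to be stored as reciprocal dilutions (1:50 as 50) with 0 meaning negative; reading "over 1:50" as "at least 1:50" and applying the four-fold rule to any non-negative baseline are also assumptions, as are the function names and serovar labels.

```python
def titre_rise(before, after, cutoff=50, fold=4):
    """True if one serovar's paired titres meet the infection criterion.

    Negative baseline (0): follow-up titre must reach `cutoff`.
    Positive baseline: follow-up titre must be at least `fold` times the baseline.
    """
    if before < 0 or after < 0:
        raise ValueError("titres must be non-negative reciprocal dilutions")
    if before == 0:
        return after >= cutoff
    return after >= fold * before


def infected(paired_titres, cutoff=50, fold=4):
    """True if the criterion is met for any serovar.

    `paired_titres` maps a serovar label to a (before, after) tuple.
    """
    return any(titre_rise(b, a, cutoff, fold) for b, a in paired_titres.values())


if __name__ == "__main__":
    # Hypothetical participant with two hypothetical serovar labels.
    participant = {"serovar_a": (0, 100), "serovar_b": (50, 100)}
    print(infected(participant))             # default cut-offs
    print(infected(participant, cutoff=200))  # stricter cut-off for a sensitivity check
```

      Re-running an analysis over a grid of `cutoff` and `fold` values is one simple way to examine how sensitive results are to the seroconversion definition.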

      The model framework relies on the concept of 'rattiness' previously defined by Eyre et al. (JRSI, 2020) and assumes conditional independence in its construction (equation (1)). Whereas this is a reasonable assumption, it would be good to discuss situations in which this assumption is questionable and what the implications are for applying the modelling framework to other settings.

      We have added the following text immediately after “is shown schematically in Figure 2” following equation (1) on line 225: “The conditional independence assumption in (1) is reasonable for a vector-borne disease or one that is transmitted indirectly, in which context the observed rat indices are to be considered as noisy indicators of the unobservable spatial variation in the extent to which the environment is contaminated with rat-derived pathogen. It would be more questionable for applications in which the disease of interest is spread by direct transmission from rat to human.”

      The authors provide an extensive model building exercise and investigate, in different ways, whether the model captures the necessary complexity (GAM smoothers - testing linearity, spatial correlation, etc). I believe the work would benefit from (1) a formal diagnostic investigation, if feasible; (2) providing guidelines on how model building should be performed.

      We have added a new Appendix 7 with diagnostic plots of randomized quantile residuals to check the rattiness-infection model fit with the human infection data and included the following text in Section 2.4 of the main text: “A formal diagnostic investigation of randomized quantile residuals is included in Appendix 7. We found no evidence in the diagnostic plots to suggest that there were issues with our modelling approach.”
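
      For readers unfamiliar with randomized quantile residuals, the idea can be sketched for a simple binary outcome. This is an illustrative Bernoulli case only, under the assumption that fitted infection probabilities are available; it is not the joint rattiness-infection model or the authors' R code, and the function name is hypothetical.

```python
import random
from statistics import NormalDist


def randomized_quantile_residuals(y, p, seed=0):
    """Randomized quantile residuals for binary outcomes.

    y: observed 0/1 outcomes; p: fitted probabilities that y == 1.
    For each observation, draw u uniformly between F(y - 1) and F(y),
    where F is the fitted Bernoulli CDF, then map u through the
    standard normal quantile function. If the model is correct, the
    residuals are approximately standard normal.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    residuals = []
    for yi, pi in zip(y, p):
        lower, upper = (0.0, 1.0 - pi) if yi == 0 else (1.0 - pi, 1.0)
        u = rng.uniform(lower, upper)
        # Keep u strictly inside (0, 1) so the quantile is finite.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        residuals.append(std_normal.inv_cdf(u))
    return residuals


if __name__ == "__main__":
    # Simulated data from a correctly specified model (hypothetical values).
    rng = random.Random(1)
    p = [rng.uniform(0.02, 0.30) for _ in range(1000)]
    y = [1 if rng.random() < pi else 0 for pi in p]
    r = randomized_quantile_residuals(y, p)
    print(sum(r) / len(r))
```

      In practice the residuals would be inspected with a normal Q-Q plot or plotted against covariates and spatial location to look for systematic departures.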

      To supplement the R code that is publicly available for repeating all of the steps in this analysis, we have now also included a detailed step-by-step explanation of the model building process in Appendix 8 that outlines the key steps for building the rat and infection components of the model (variable selection and evaluation of residual spatial autocorrelation) and fitting and examining the joint rattiness-infection model. We have added the following text in Section 2.6 of the main text: “We also include a step-by-step explanation of the model building process to guide future users of the rattiness-infection framework in Appendix 8.”

      The authors are to be acknowledged for providing an extensive and thorough discussion of the different aspects of their work. Whereas the discussion is complete, I wonder whether the authors can give a brief example about how this model can be applied in a different setting.

      Thank you. We have added the following text on line 551 in the discussion: “The framework may have important applications beyond the study of zoonotic spillover, with the rattiness component replaced by other exposure measures e.g. mosquito density or ecological indices (such as pollution, where there are multiple, related measures of air or groundwater quality) to model associations with human or animal health outcomes.”

      Reviewer #2 (Public Review):

      Eyre et al. developed and applied a novel geostatistical framework for joint spatial modeling of multiple indices of pathogen (Leptospira) reservoir (rats) abundance and human infection risk. This framework enabled evaluation of infection risk at a fine spatial scale and accounted for uncertainty in the pathogen reservoir abundance estimates. The authors used data collected in two different field projects: (1) a rat ecology study in which three different approaches were used to detect rat presence "rattiness", and (2) a prospective community cohort study in which individuals were sampled during two different time periods to detect recent infections via seroconversion or a four-fold increase in anti-Leptospira antibody MAT titer. Univariable and then multivariable analyses were performed on these data to identify (1) the environmental variables that best predicted "rattiness", and (2) the demographic/social, environmental (household), occupational, and behavioral variables that best predicted human risk of infection. Once identified, the best predictors from (1) and (2) were included in a final, joint model to identify the significant predictors of both 'rattiness' and human infection risk. As a result of this study, the authors were able to detect spatial heterogeneity in leptospiral transmission to humans. They found that infection risk associated with increases in reservoir abundance differed by elevation, and that increases in reservoir abundance at high elevation were associated with a much higher odds ratio for infection than at low elevation. The authors suggest that this has to do with differences in how the infectious leptospires (shed by the rat reservoir) are dispersed in the environment. At high elevations, flooding is less frequent and thus rat shed leptospires are likely to stay where the rat deposited them. 
At lower elevations, by contrast, flooding may play a large role in spreading leptospires more evenly across the landscape, reducing the importance of rat presence at smaller spatial scales. The final best model was then used to generate prediction maps of 'rattiness' as well as human infection risk at all locations within the study area (i.e. including those that lacked rat detection data and human infection data). This work represents an important advance in infection risk modeling as it explicitly incorporates estimates of reservoir abundance and the uncertainty surrounding these estimates into the infection risk assessment, and allows for modeling of infection risk at fine spatial scales. Findings from this study have important management implications at the authors' study site as it suggests that interventions directed at high elevations should be different from those designed to address infection risk at lower elevations. However, there are broader implications, as this novel approach may be applied to other systems to enable identification of differences in infection risk for other pathogens at a fine spatial scale, predict infection risk more broadly, and facilitate intervention strategies targeted for the specific epidemiological and ecological conditions experienced by a population.

      This was a well-designed study. The field sampling approach was well balanced, well described and appropriate. Broadly the modeling framework is appropriate for the questions being asked and for the data being used. The variable and model selection approaches were clearly described and appropriate. Evaluation of the more detailed mathematical approach is outside of my area of expertise, so I am unable to comment on the validity of the approach.

      For the most part, the explanatory variables assessed in the different models were well described and justified, however there were some cases for which further explanation would have been helpful. For example, how did the authors determine which occupations to evaluate? Specifically, why traveling salesperson? What is the difference between open sewer within 10 m and unprotected from sewer?

      We have added the following additional text to Section 2.3.2 on line 297 to clarify the definition and reason for inclusion for these variables: “In the household environment domain, two variables were used to capture risk due to sewer flooding close to the household: i) the presence of an open sewer within 10 metres of the household location and ii) a binary `unprotected from open sewer' variable which identified those households within 10 metres of an open sewer that did not have any physical barriers erected to prevent water overflow. Three high-risk occupations were included in the occupational exposures domain as binary variables. Construction workers and refuse collectors have direct contact with potentially contaminated soil, building materials and refuse in areas that provide harbourage and food for rats. Travelling salespeople have regular and high levels of exposure to the environment (particularly during flooding events) as they move from house to house by foot. Two other binary occupational exposure variables were included that measured whether a participant worked in an occupation that involves contact with floodwater or sewer water.”

      I also had some concerns regarding the time-period of the rat ecology study used to determine abundance, potential fluctuations in rat abundance through time, and how this might align with sampling to detect infection in humans. Depending on the time scale of population fluctuation in rats as well as fluctuations in infection prevalence in rats, the abundances calculated from data from the ecology study may not be accurately reflecting true abundance (and therefore shedding and transmission risk) during the time period that a human may have been exposed. However, the authors do a nice job of addressing some of these issues in the discussion. They mention that infection prevalence in rats is consistently around 80% and that there don't appear to be seasonal fluctuations in human exposure risk in the study area.

      Thank you.

      Reviewer #3 (Public Review):

      The goal of the authors was to test how important local rat abundance is as a driver of Leptospira infection in humans.

      The authors approached this using a strong combination of datasets on human infection risk and rat abundance, across a spatial scale that is large enough to allow simultaneous assessment of multiple potentially important drivers of infection risk. This further enables the authors to develop infection prediction maps based on the fitted models.

      This study design is a major advance towards understanding the link between rat abundance and human infection risk.

      Based on the top models tested in the study, the authors conclude that local rat abundance is indeed correlated with infection risk, and that this correlation is strongest at higher elevation.

      This is an impactful finding, but in my opinion it is not yet clear how robust and important this is, because of two reasons:

      (1) The infection risk data: while the actual infection risk data are not shown, the map shown in Figure 5B suggests that there is an infection hotspot that happens to be at high elevation. This raises the question of how strongly this single hotspot is driving the observed correlation between rat abundance and infection risk (which the authors find to be much stronger at high elevation than at lower elevations).

      We have added a new figure (Figure 4) earlier on in the article (we decided to add this here rather than to Figure 6 - formerly Figure 5 - to ensure that the map is large enough that points in Figure 4A are easily visible – please note that it is included as a larger and easier to view image in the main eLife template version) with the raw infection data overlaid on contour lines for the three elevation levels to provide the reader with a better overview of the raw data. This new Figure 4 shows that out of a total of 403 participants in the high elevation region there were 16 infections, of which only 5 (31%) were located in the large hotspot in Valley 3 (valleys are numbered 1 to 3 from west to east, see Figure 1A). In addition to the largest hotspot in the north of Valley 3, there are several other areas in the high elevation region with raised predicted infection risk values relative to their surroundings where there were also rattiness hotspots and infected participants in the raw data: five cases (red and yellow infection risk areas in Figure 5B) on the western side of Valley 2; the two cases on the eastern edge of Valley 2; the two cases on the western edge of Valley 3; and the single case in the southwest of Valley 3. Other variables are also important drivers of infection risk and at several of these locations the contribution of rattiness increases infection risk significantly relative to the low-risk surrounding area (e.g. to 10% in areas where risk is closer to 1% or 2%) without reaching the more obviously visible high infection risk values closer to 20%.
We believe that our statistical model provides a better test of whether there is a statistical association between rattiness and infection at high elevations than a visual examination, and that this is supported by the large number of observations in the high elevation area (403) and the distribution of infected and uninfected households, which demonstrates that the observed association is not only driven by the hotspot in Valley 3.

      (2) The statistical models: if I understand correctly, all tested models of infection risk include the variable rat abundance, and while the individual effect estimates for rat abundance are statistically significant (Table 3), the more important question of how the fit of a model without the rat abundance variables compares with those of the other tested models (shown in Supplementary Table S2) has not been addressed.

      These models were considered but were ranked outside of the top five models and for this reason were not reported in Table S2. We agree that showing the AIC of a model without rattiness in this table can more clearly demonstrate the improved fit of the model with rattiness. To do this we have added the highest ranked model without rattiness (M*) to Table S2 and added a note to the table explaining the reason for its inclusion (“Model M* was ranked outside of the top 5 models but is included here for reference to demonstrate the improvement in model fit when rattiness is included”). The AIC of M* was 532.13. This is substantially higher than the top five models (M1 = 523.14 and M5 = 525.04), justifying the inclusion of rattiness in this model and in the joint rattiness-infection framework.
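
      The size of the AIC gap quoted above can be made concrete with a short calculation. The three AIC values are taken from the response; the relative-likelihood formula exp(-ΔAIC / 2) is a standard way to interpret AIC differences and is offered here as an illustration, not as something the authors report. The dictionary keys are hypothetical labels.

```python
import math

# AIC values quoted in the response (only M1 and M5 of the top five are given).
aic = {"M1": 523.14, "M5": 525.04, "M_no_rattiness": 532.13}


def delta_aic(aic_values):
    """Difference between each model's AIC and the lowest AIC."""
    best = min(aic_values.values())
    return {name: value - best for name, value in aic_values.items()}


def relative_likelihood(delta):
    """Relative likelihood of a model given its AIC difference."""
    return math.exp(-delta / 2.0)


if __name__ == "__main__":
    for name, delta in delta_aic(aic).items():
        print(f"{name}: dAIC = {delta:.2f}, "
              f"relative likelihood = {relative_likelihood(delta):.3f}")
```

      The model without rattiness sits 8.99 AIC units above M1 and 7.09 above M5.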

      Regardless of whether rat abundance is an important driver of human infection risk, this study is a major step in our understanding of the role of rats in the spread of leptospirosis, due to the strong pairing of a unique combination of datasets with a spatial statistical modeling approach.

      Thank you.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript discusses evolutionary patterns of manipulation of others' allocation of investment in individual reproduction relative to group productivity. Three traits are considered: this investment, manipulation of others' investment, and resistance to this manipulation. The main result of the manuscript is that the joint evolution of these traits can lead to the maintenance of diversity through, as documented here, cyclic (or noisier) dynamics. Although there are some analytical results, this main conclusion is instead supported by individual-based simulations, which seem correctly performed (but for clonal populations, as emphasized below).

      There could be material for a good paper here but the organization of the manuscript makes it difficult to fully evaluate. The narrative is highly condensed, with the drawbacks that this often entails in terms of accurately conveying the results of a study, as illustrated here by the following issue.

      The population is apparently assumed to be clonal (more than just "haploid"), meaning that there is no recombination between the loci controlling the three traits. In the one case where this assumption is relaxed (quite artificially), the cyclic dynamics disappear (section 4.4 of the appendix). This is crucial information that cannot be appreciated in the main text.

      The paragraph at line 368 offers a simple explanation for the joint dynamics of traits. However, this explanation would hold identically for a sexual population and a clonal population, whereas these two cases seem to have completely different dynamics. Thus, there is something essential to explain these differences, that is missing from the given explanation.

      Yes, our model was asexual with no recombination. To address this comment, we carried out additional simulations in which recombination was allowed (Appendix 1—4.8). We found that recombination does not change our results (predictions), and describe this on lines 469-475. By assuming additive effects of traits and each trait having the same dispersal property, our haploid asexual model is also equivalent to a diploid sexual model (Taylor 1996; Day & Taylor 1998).

      This is especially important because the finding that the joint evolution of several traits can lead to some form of diversity maintenance is not surprising. As the discussion acknowledges (but the introduction seems to downplay), it is also well understood that manipulation and counter-adaptations to it can occur in many contexts and lead to the maintenance of diversity. For this reason, similar results in the present case are not surprising, and the main outcome of the study should be to provide a deeper understanding of the forces leading to the different outcomes in the current models.

      I do not see clearly what distinguishes "manipulative cheating" from other forms of manipulation that have been previously discussed in the literature (e.g., as cited on line 461). Couldn't this be clarified by some kind of mathematical criterion?

      Thanks for pointing out that there is room to improve the distinction between our model and previous models! We have added more description to explain the conceptual difference on lines 187-193, and a new subsection in the appendix that shows these differences by mathematically examining the fitness formulations in previous models (Appendix 1—1.3).

    1. Author Response

      Reviewer #1 (Public Review):

      This paper addresses an important question: whether the conduction velocity in white matter tracts is related to individual differences in memory performance. The authors use novel MRI techniques to estimate the "g-ratio" in vivo in humans - the ratio of the inner axon diameter to the diameter of the axon plus its outer myelin sheath. They find that autobiographical recall is positively related to the g-ratio in a specific white matter tract (the parahippocampal cingulum bundle) in a population of 217 healthy adults. This main finding is extended by showing that better memory is associated with larger inner axon diameters and lower neurite dispersion, which suggests more coherently organised neurites. The authors also argue that their results show that the magnetic resonance (MR) g-ratio can reveal novel insights into individual differences in cognition and how the human brain processes information.

      The study is exploratory in nature and the analyses were not pre-registered. The technique has not been used before to associate cognitive performance with MR estimates of conduction velocity in candidate white matter tracts. It is therefore unknown how strong any associations are likely to be and what sort of sample size might be needed to observe them. Nevertheless, if the technique proves to be reliable, then it certainly offers a valuable new tool to understand individual differences in cognitive abilities. However, brain structure to behavior associations are notoriously variable across studies and have been argued to require very large sample sizes to obtain reproducible results.

      We respectfully disagree that the study was exploratory. We had distinct aims and hypotheses from the outset. Our prime interest is in autobiographical memory, the hippocampus and its connectivity. This motivated our focus on three specific white matter tracts. We also planned from the time of study design to examine the MR g-ratio, and even contributed to refining the pre-processing pipeline for this approach, as reported in a previous paper (Clark et al., 2021, Frontiers in Neuroscience). Moreover, in the current manuscript we outlined well thought through possible outcomes and declared specific predictions.

      Regarding pre-registration, due to the scope of this work, the experiment was planned eight years ago, and data collection commenced seven years ago. At that time, formal pre-registration was not common practice. However, it has been a long-standing feature of our Centre that proposed studies and their analysis plans undergo rigorous internal peer review, including presentation to the whole Centre, before data acquisition can commence. The proposal for the research under consideration here was presented on 26th September, 2014.

      As noted in our response to the Editors’ Public Evaluation Summary above, someone has to be the first to report a novel result, and we believe that the depth and transparency of our approach permits confidence in the findings. Not least, and to reprise, because we employed the most widely-used and best-validated method of testing autobiographical memory recall that is currently available – Levine’s Autobiographical Interview. Our primary analyses were performed using the behavioural outcome measure from this test, the results of which were directly compared to those from a closely-matched control measure to test whether significantly larger effects were observed for our variable of interest. The potential for false positives was further reduced by extracting microstructure data from hypothesised tracts of interest (instead of performing whole brain voxel-wise analyses), with statistical correction performed on all structure-behaviour analyses. Moreover, we performed partial correlations with age, gender, scanner and number of voxels in a region of interest (ROI) as covariates. Complementary investigations were also conducted using other commonly-reported measures, providing supporting evidence. We report all analyses (and provide all the source data), including those finding no relationships. The consistent results throughout were associations between autobiographical memory recall ability and the microstructure of the parahippocampal cingulum bundle only. Moreover, thanks to the excellent suggestions of the Reviewers, the revised version reports additional analyses that allow us to further corroborate and interpret our findings.
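
      The partial correlations mentioned above (controlling for age, gender, scanner and number of voxels) can be illustrated with the standard recursive formula for partial correlation. This is a minimal pure-Python sketch, not the authors' analysis pipeline: the function names are hypothetical, the demo data are simulated, and categorical covariates such as gender or scanner are assumed to have been coded numerically beforehand.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, covariates):
    """Correlation of x and y after removing linear effects of covariates.

    Uses the recursion
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)),
    applied one covariate at a time.
    """
    if not covariates:
        return pearson(x, y)
    *rest, z = covariates
    r_xy = partial_correlation(x, y, rest)
    r_xz = partial_correlation(x, z, rest)
    r_yz = partial_correlation(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


if __name__ == "__main__":
    # Simulated example: x and y are related only through a shared covariate.
    rng = random.Random(0)
    age = [rng.gauss(0, 1) for _ in range(500)]
    x = [a + rng.gauss(0, 1) for a in age]
    y = [a + rng.gauss(0, 1) for a in age]
    print(pearson(x, y), partial_correlation(x, y, [age]))
```

      In the simulated example the raw correlation is clearly positive while the partial correlation is close to zero, which is the kind of confounding the covariates are meant to rule out.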

      Our sample of 217 participants allowed for sufficient power to identify medium effect sizes when conducting correlation analyses at alpha levels of 0.01 and when comparing correlations at alpha levels of 0.05 (Cohen, 1992, Psychological Bulletin). While it has recently been suggested that thousands of participants are required in order to investigate brain structure-behaviour associations (Marek et al., 2022, Nature), other, more sophisticated, analyses suggest that samples of ~200 participants can be sufficient, in line with our estimates (Cecchetti and Handjaras, https://psyarxiv.com/c8xwe; DeYoung et al., https://psyarxiv.com/sfnmk). Given that our study was principled, well-controlled, analysed appropriately and produced very specific and consistent findings, we are confident that the findings are robust.
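
      The power statement above can be checked approximately with the Fisher z transformation. This sketch assumes a two-sided test, a conventional target power of 0.80, and Cohen's convention of r = 0.30 for a medium effect; the Fisher approximation and the function names are choices made for this illustration and may differ from the calculation the authors used.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def required_n(r, alpha=0.01, power=0.80):
    """Approximate sample size needed to detect correlation r (two-sided)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)


def achieved_power(r, n, alpha=0.01):
    """Approximate power to detect correlation r with n participants."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(math.atanh(r) * math.sqrt(n - 3) - z_alpha)


if __name__ == "__main__":
    print(required_n(0.30))            # participants needed for a medium effect
    print(achieved_power(0.30, 217))   # power with the study's sample size
```

      Under these assumptions the required sample size for a medium effect is well below 217, whereas small effects (for example r = 0.10) would need far larger samples.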

      The authors decided to analyse performance on a single task - the Autobiographical Memory Interview - and identified three candidate white matter tracts that connect the hippocampal region with other brain regions. While it is clear why these three tracts were chosen, it is less obvious why the authors chose to investigate associations with the Autobiographical Memory Interview and not other memory tests that were part of the battery of tests administered to the participants. It is reasonable to assume that something as general as the conduction velocity of a white matter tract would have an effect on memory ability across a range of tasks, so to single out one seems an unnecessarily narrow focus.

      Our main interest over many years, and hence the focus of this study, is autobiographical memory recall because it directly relates to how people function in real life. As noted above, autobiographical experiences occur in dynamic, multisensory, multidimensional, non-linear, ever-changing contexts; they involve actively engaging with the environment and other people; they are embodied; they span milliseconds to decades. Many of these features cannot be captured by laboratory-based episodic memory tests. This issue is increasingly being discussed (for example, see recent reviews by Nastase et al., 2020, NeuroImage; Mobbs et al., 2021, Neuron; Miller et al., 2022, Current Biology). It is further laid bare in McDermott et al.’s (2009, Neuropsychologia) meta-analysis of functional MRI studies which showed that laboratory-based and autobiographical memory retrieval tasks differ substantially in terms of their neural substrates. Consequently, we were not surprised to find that when we analysed laboratory-based memory test performance, there were no correlations with the MR g-ratio. Recall of vivid, detailed, multimodal, autobiographical memories may rely on inter-regional connectivity to a greater degree than simpler, more constrained laboratory-based memory tests. Therefore, as well as speaking to conduction velocity, these findings also contribute to wider discussions about real-world compared to laboratory-based memory tests. We thank the Reviewer for making the excellent suggestion to include these additional data, analyses and discussion points.

      The results of the study are interesting and highlight a key role of the parahippocampal cingulum bundle in autobiographical memory recall. The results are corrected for multiple comparisons across the three fiber tracts of interest and the recall of "external details" provides a nice control compared to the "internal details" which are the measure of interest. The main findings are extended to show that it is likely to be an increase in axon diameter and an increase in neurite coherency that characterize those individuals with better autobiographical recall. Despite these positives, it remains unclear whether memory recall, in general, is better in people with higher g-ratios in this tract (as implied in the Abstract), or if this effect is specific to scores on the Autobiographical Memory Interview.

      Our interest is in autobiographical memory, and so we employed the most widely-used and best-validated method of testing autobiographical memory recall that is currently available – Levine’s Autobiographical Interview. Not only does this test include a control measure, external details (as noted by the Reviewer), but we had independent raters score the autobiographical memory descriptions, and found that the inter-class correlation coefficients were very high (see Materials and Methods). Despite using this current, gold standard approach, at the request of the Reviewer we have now analysed data from eight additional laboratory-based memory tests. These are standard memory tests that are often used in neuropsychological studies: testing recall - the immediate and delayed recall of the Logical Memory subtest of the Wechsler Memory Scale IV, the immediate and delayed recall of the Rey Auditory Verbal Learning Test, the delayed recall of the Rey–Osterrieth Complex Figure; testing recognition memory - the Warrington Recognition Memory Tests for Words and Faces; testing semantic memory - the “Dead or Alive Test”. While these tests can assess some aspects of memory recall, they cannot be regarded simply as proxies for autobiographical memory recall, for the reasons we outlined in our response to the previous point. They do not capture key aspects of autobiographical memories. It is therefore all the more interesting that we found no associations between these laboratory-based memory tasks and the MR g-ratio of the parahippocampal cingulum bundle, in contrast to the relationship identified with autobiographical memory recall ability. Recall of vivid, detailed, multimodal, autobiographical memories may rely on inter-regional connectivity to a greater degree than simpler, more constrained laboratory-based memory tests. 
Therefore, as well as speaking to conduction velocity, these findings also contribute to wider discussions about real-world compared to laboratory-based memory tests. We thank the Reviewer once again for making the excellent suggestion to include these additional data, analyses and discussion points.

      Reviewer #2 (Public Review):

      In this study, Clark and colleagues tackle a very intriguing question: how are differences in autobiographical recall abilities reflected in human brain structure and function? To answer this question, they interviewed a large cohort of subjects and proceeded to acquire MRI data, specifically diffusion-weighted imaging and magnetization transfer data, to estimate the g-ratio, a measure of myelination deeply linked to conduction velocity. Looking at three specific white matter pathways of interest, all interconnecting the hippocampus with other brain structures, they studied the relationship between the g-ratio and the autobiographical recall abilities, together with many more measures from MRI. They found a significant positive association between the g-ratio of the parahippocampal cingulum bundle and the number of internal details from the interviews. These results can provide new potential directions to further study the underlying neural features beyond memory.

      I think that this is a very interesting article, it is well written, the methods are extensively explained, and the appendix provides further details for more expert readers. The authors put an effort into providing a comprehensive context in the introduction and in the discussion, and as a result, the paper seems overall quite suitable for both general and specialist readerships.

      Thank you.

      The main issue I can currently see in the paper is that the mentioned relationship between g-ratio and recall abilities is then used to infer that better recall abilities are associated with higher conduction velocity and larger axons. The authors' line of reasoning is that given the hypothesized association, the increase in the g-ratio implies increases in myelin and axonal diameter. Despite this scenario being indeed possible given the current result, an increased g-ratio may also not indicate higher conduction velocity. In fact, the first potential inference would be that, without having any information on the axon size, the quantity of myelin can indeed be lower and as a result, the conduction velocity would decrease. I understand that the authors expected higher conduction velocity associated with better autobiographical memory recall, but it is hard to see any experimental outcome that could have disproved this hypothesis: from the possible scenarios depicted in the introduction, any change in the g-ratio (and even not any change at all) could indicate higher conduction velocity. What would be then needed to corroborate one of these scenarios is some independent or complementary measure, which unfortunately is missing.

      The mentioned issue does not mean that the paper loses relevance - I think that it should focus on the very practical result, a change in myelination and microstructure, and discuss what are the potential implications, including the one that currently dominates the discussion section.

      Thank you for these comments and the opportunity to provide further clarification.

      First, we have now provided additional background information regarding the relationship between the MR g-ratio and conduction velocity. We explicitly note that while finding a significant relationship between the MR g-ratio and autobiographical memory recall suggests the existence of an association between autobiographical memory recall and parahippocampal cingulum bundle conduction velocity, it cannot speak to the direction of this association.

      Second, we have further noted that interpretation of the parahippocampal cingulum bundle MR g-ratio in relation to the underlying microstructure requires knowledge, or an assumption, about whether the associated change in conduction velocity is faster or slower. Given that faster conduction velocity is thought to promote better cognition (e.g. Brancucci, 2012; Dicke and Roth, 2016; Miller, 1994; Reed and Jensen, 1992), we interpreted our MR g-ratio findings under the assumption of faster conduction velocity, and now explicitly note in several places in the revised manuscript that this is an assumption.

      Third, we thank the Reviewer for the excellent suggestion that a complementary measure could help to further inform the findings. Consequently, we now also include additional analyses examining the relationship between the extent of myelination and autobiographical memory recall ability. This is possible using the magnetisation transfer saturation maps, which are optimised to assess myelination. Under our assumption of faster conduction velocity when interpreting our positive MR g-ratio correlations, no relationship between parahippocampal cingulum bundle magnetisation transfer saturation and autobiographical memory recall would be expected. On the other hand, if conduction velocity is actually decreasing, then a negative correlation between magnetisation transfer saturation values and autobiographical memory recall ability would be observed. In fact, we found no relationship between parahippocampal cingulum bundle magnetisation transfer saturation and autobiographical memory recall. This suggests that myelin was not associated with autobiographical memory recall ability, supporting our assumption that relationships with the MR g-ratio were indicative of faster, rather than slower, conduction velocity.

      We have now added these new data, analyses and discussion points to the revised manuscript.

      It would also be helpful to include some paragraphs on both interpretation and methodological issues when it comes to MRI-based microstructural imaging, which at the moment is lacking. This would provide a better picture of the results for a more general readership.

      We agree, and additional consideration of interpretational and methodological limitations has now been included in the manuscript.

      As one of the first works using an MRI-based microstructural measure of myelin, the g-ratio, to study cognition in a large cohort of subjects, I think this work will be a needed and significant step towards merging the neuroscience and MRI physics community - the methodology presented here is robust and could be used in many other applications.

      Thank you.

      Reviewer #3 (Public Review):

      The manuscript adds useful information about how structural properties of the brain are related to individual differences in autobiographical memory. A novel metric is used to assess features of white matter in tracts that are important for information exchange between the hippocampus and other brain regions. In one of these, the parahippocampal bundle, a relationship between the MR g-ratio and autobiographical memory recall is identified. This represents new and interesting information. The authors interpret the results in line with the theory that speed of signal transmission is important for cognitive function.

      Thank you for this positive summary.

    1. Author Response

      Reviewer #1 (Public Review):

      Rasicci et al. have developed a FRET biosensor that is designed to light up when cardiac myosin folds. This structure is extremely important to understand, and its link to the super-relaxed (SRX) state has not been fully shown. Their study provides a comprehensive review of the literature and provides compelling data that the 15 heptad+leucine zipper+GFP construct does function well and that the DCM mutant E525K has a similar IVM velocity despite a reduced ATPase compared with HMM. They rely on the ionic strength-dependent changes in the rate of MantATP release to argue that the E525K mutation stabilizes the 'interacting heads motif' (IHM) state, which makes logical sense.

      Strengths:

      Well written and comprehensive.

      Utilizes the appropriate fluorescence-based sensor for measuring the folding of the myosin structure. Provides a detailed range of techniques to support the premise of the study.

      Weaknesses:

      Over-interpretation of the outcomes from this study means that the IHM and SRX are the same. Similar studies, e.g. Anderson 2018 and Chu 2021, support the opposite view that IHM and SRX are not necessarily the same. Anderson (and Rohde 2018) point out that S1 has some element of a reduced ATPase; this clearly cannot be due to folding of the molecule. Also, mavacamten was used in these studies to show that even S1 is inhibited, suggesting that SRX and IHM are not connected. This is not to say that with enough supporting evidence these observations cannot be over-ridden; it is just not clear that there is enough in this study to support this conclusion.

      We have revised our discussion to emphasize that our results support a model in which the SRX state is enhanced by formation of the IHM, but given the S1 and 2HP data the IHM may not be required for populating the SRX biochemical state (see page 8).

      I felt that the authors passed over the recent Chu 2021 paper too quickly, the Thomas group used a FRET sensor as well and provides a direct comparison as a technique, but with opposite conclusions. They also have supporting data in Rohde 2018 that their constructs were less ionic strength sensitive. It would be useful to understand what the authors think about this.

      We have discussed the Rohde and Chu papers in more detail in the discussion (see page 8). In the Rohde paper they used proteolytically prepared HMM and S1. Rohde found 20% SRX at all KCl concentrations in S1, while HMM shifted from 50% to 20% SRX in low and high salt conditions, respectively. Our results are different in terms of the absolute fraction of the SRX state, but the trend is similar in terms of S1 being salt-insensitive and HMM being salt-sensitive. The difference could be due to the use of proteolytic HMM, which is a longer construct, and proteolytic S1, which is prone to internal cleavage that can impact ATPase activity. Another difference could be the mixed isoform of mantATP used in previous studies and the single isoform of mantATP used in our study (see page 5).

      Reviewer #2 (Public Review):

      The paper by Rasicci et al. examines the impact of the DCM mutation E525K in beta-cardiac myosin on its function and regulation by autoinhibition. The role of the auto-inhibited state of beta-cardiac myosin in fine-tuning cardiac contractility is an active and exciting area of current research related to muscle biology and cardiomyopathies. Several studies in the past have linked the destabilization of the autoinhibited, super-relaxed (SRX) state of myosin to the pathogenesis of hypertrophic cardiomyopathy. This timely study provides one of the first few examples where the hypocontractile phenotype of a DCM mutation has been linked to the stabilization of the SRX state.

      One of the strengths here is the utilization of a wide variety of both pre-existing and novel biochemical and biophysical assays for the study. The authors have characterized a new two-headed long-tailed myosin construct containing 15-heptad repeats of the proximal S2 (15HPZ), which they show allows myosin to form the SRX state in vitro using single ATP turnover assays. The authors go on to compare the E525K and WT proteins using the 15HPZ myosin construct in terms of their steady-state actin-activated ATPase activity, in-vitro actin-sliding velocity and single ATP turnover measurements. These assays reveal that the predominant effect of this mutation is the stabilization of the SRX state which is maintained even at 150 mM salt concentration where the WT SRX is largely disrupted. This is an important observation because DCM mutations so far have been believed to only affect the force-generating capacity of myosin.

      One of the biggest strengths of this study is the attempt to develop a FRET-based approach to directly ask if the biochemical SRX state here correlates well with the structural IHM state, which is an important unresolved question in the field. The authors have designed a FRET pair (C-terminal GFP and Cy3ATP bound to the active site) that is sensitive to the relative position of the heads and the tail, allowing them to distinguish between the low-FRET closed IHM conformation and the no-FRET open conformation. Remarkably, the authors show that the salt dependence of the FRET efficiency values closely follows their results from the salt dependence of the percent SRX for both WT and E525K proteins. The authors then attempt to substantiate their FRET results by a direct visual analysis of the conformational states populated by both WT and E525K proteins at low salt using negative staining EM analysis. The authors have optimized conditions to allow the deposition of the IHM state on grids without adding the small molecule mavacamten, which was found to be necessary in an earlier study to visualize the closed state using EM. The authors conclude that the SRX state correlates well with the IHM state and that the E525K mutation indeed stabilizes the folded-back conformation of myosin.

      This study significantly strengthens the previously illustrated correlation between the SRX and IHM states and provides methodological advances (especially visualization of the IHM state by negative EM in the absence of cross-linking agents) that will be very useful to the field going forward. The observation that a DCM mutation can lead to stabilization of the folded back state is a novel insight that should spark interest in the field to test how broadly this applies to other DCM mutations. The conclusions of the paper are mostly supported by the data; however, some clarifications and qualifications are needed.

      Weaknesses:

      The extremely low enzymatic activity of the M2β 15HPZ myosins as compared to the WT S1 control (which is a historical control not assayed in parallel with the 15HPZ proteins), is concerning for the low protein quality of the 15HPZ myosins. The authors attribute the low kcat to the high proportion of SRX population in their ensembles. However, the DRX rates reported for the WT and E525K 15HPZ proteins in the single ATP turnover assay are ~3-4 fold lower than those of their S1 counterparts. These rates reflect basal turnover of ATP in the open state and thus should not be affected by the presence of the S2 tail, which leads to concerns about the 15HPZ protein activity. In addition, the very high percentage of stuck filaments in the in vitro motility assay for the 15HPZ constructs (despite the use of dark actin) is also concerning for significant amounts of enzymatically inactive protein.

      We thank the reviewer for pointing out the differences in the S1 and HMM DRX rates. We performed additional single turnover measurements with S1, adding two sets of measurements from one additional preparation (N=3), and we demonstrate that there is a significant increase in the DRX rates of WT S1 compared to WT HMM (see pages 4-5, Table 3, Figure 3 – figure supplement 3). A faster rate in S1 was also reported in Rohde et al. 2018. Indeed, the DRX rates of E525K S1 are significantly higher than those of WT S1, which we also now report in the results (see page 5, Figure 3 – figure supplement 3). We addressed the concerns about 15HPZ activity by performing NH4+ ATPase assays to demonstrate that the number of active heads was similar in S1 and 15HPZ HMM (see page 4). It is possible that the higher percentage of stuck filaments in the HMM motility is due to myosin heads in the IHM state on the motility surface, which generate a drag force by non-specifically interacting with actin, but further study is necessary to examine this question.

      The authors assert that the E525K mutation represents a new mechanism by which DCM-causing mutations lead to decreased contractility - by stabilizing the sequestered state rather than affecting motor function. However, there is no evaluation of the motor function (actin-activated ATPase activity or in vitro motility) of the E525K S1, which would reveal the effects of the mutation without confounding effects due to the sequestering of heads. Interestingly, in the single ATP turnover assay, the DRX rate of the E525K S1 is >2-fold higher than the WT control, suggesting that the mutation may have effects beyond stabilization of the SRX state. The conclusion that the E525K mutation's effect on myosin function is mediated via stabilization of the SRX state would be strengthened if the effects of the mutation on the motor domain alone were also known.

      We thank the reviewer for this suggestion. We performed actin-activated ATPase assays with WT and E525K S1 and found that E525K increases kcat and lowers KATPase, demonstrating enhanced intrinsic motor activity in the mutant S1 construct (see page 4, Figure 2B). This adds an interesting dimension to the manuscript because we report a mutant that enhances the intrinsic motor activity but stabilizes the SRX/IHM (see Discussion page 10). We did not perform in vitro motility, because this assay depends on the surface attachment strategy, and we would like to compare all constructs with the same attachment strategy using a C-terminal GFP tag (mutant and WT S1 and 15HPZ HMM). Therefore, we are making the S1 construct with a C-terminal GFP tag for this purpose, to be examined in a future study.

      While the authors show strong qualitative correlations between the SRX and IHM states using single ATP turnover, FRET, and EM experiments, attempts to quantitatively compare the fraction of heads in the IHM state using the various experimental approaches are problematic. For example, the R0 value of the FRET pair used here doesn't allow precise measurement of the distances being probed, but the distances are reported and compared to predicted distances. The authors report that the R0 for their FRET pair is 63 Å. Surprisingly the authors go on to use the steady-state FRET efficiency values to determine the average D-A distance (Fig 5B), which is 100 Å when all heads are in the IHM configuration and becomes larger than that when heads open. R0 of 63 Å allows a precise distance measurement to be made in the 31.5-94.5 Å range, which corresponds to 0.5-1.5 R0. It is therefore technically incorrect to use the steady-state FRET efficiency values to determine the D-A distance here. Besides, there are several unknown factors here, like the orientation factor (κ2), which further complicate these calculations. Similarly, the quantification of IHM state molecules from the negative stain EM experiments is significantly hampered by the disruptive effect of the grid surface on the structure of the IHM state. The authors find that limiting the contact time with the grid to ~ 5s is necessary to preserve the IHM state.

      Despite that, only ~15% WT molecules were seen in the IHM state at low salt (Fig. 6B). In contrast, ~56% E525K molecules were in the IHM state. Both these proteins have similar SRX proportions (Fig. 3C) and similar FRET efficiency values (Fig. 5A) at this salt concentration. This mismatch highlights the problem arising due to not having a measure of the populations from the FRET data. It is not clear if the hugely different proportions of the IHM state in EM experiments are indicative of the relative stability of this state in the two proteins or a random difference in the electrostatic interactions of WT vs mutant with the grid. These experiments do not provide a correct idea of the %IHM in the two proteins. In the absence of any IHM population measurement, it is important to proceed with caution when quantitatively correlating the SRX and IHM.

      We thank the reviewer for pointing out that measuring precise distances by FRET can be difficult. We agree that the low FRET efficiency makes precise distance determination even more challenging. However, FRET is quite good at measuring a change in distance given a specific donor-acceptor pair. We feel our FRET biosensor clearly demonstrates FRET efficiencies that are salt-insensitive in E525K but a clear decrease in FRET at higher salt concentrations in WT. In order to compare the trend in the predicted FRET, based on the single turnover measurements, with the actual FRET, we thought it was important to plot the two together on the same graph. We understand that this could have given the misleading impression that we were reporting actual distances. We have now plotted the FRET efficiency instead of distance as a function of KCl concentration (Figure 5B), to prevent any confusion with reporting distances. In addition, we have emphasized that the data are plotted to allow for a comparison of the trend in the single turnover and FRET data (see pages 6 and 10, Figure 5B).
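
      The distance sensitivity at issue here follows from the standard Förster relation, E = 1 / (1 + (r/R0)^6). The following is a minimal sketch, assuming only that relation and the R0 of 63 Å quoted by the reviewer; the function names are illustrative and do not come from the manuscript.

```python
def fret_efficiency(r, r0=63.0):
    """Forster relation: transfer efficiency at donor-acceptor distance r (in Å)."""
    return 1.0 / (1.0 + (r / r0) ** 6)


def fret_distance(e, r0=63.0):
    """Inverse of the Forster relation: distance (in Å) for an efficiency e in (0, 1)."""
    return r0 * (1.0 / e - 1.0) ** (1.0 / 6.0)


# Efficiencies at 0.5 R0, R0, 1.5 R0, and the 100 Å distance discussed above.
for r in (31.5, 63.0, 94.5, 100.0):
    print(f"r = {r:5.1f} Å  E = {fret_efficiency(r):.3f}")
```

      Under these assumptions the efficiency at 100 Å is only about 6%, so a small error in the measured efficiency maps onto a large error in the inferred distance. This is consistent with comparing trends in efficiency rather than reporting absolute distances.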

      We agree that it is important to proceed with caution when comparing the EM to the FRET and single turnover data. The EM does not give a quantitative estimate of the fraction of IHM molecules, due to the disruptive effect of the grid surface on protein conformation. However, it does provide direct (though qualitative) evidence that the conformation underlying SRX and enhanced FRET is the IHM, and it is consistent with our interpretation that the E525K mutation enhances FRET and SRX by stabilizing the IHM. To consolidate this result, we have now performed EM experiments with a total of 3 preparations of WT and mutant (see pages 6-7 and Figure 6D). We find that while there is variability from experiment to experiment, likely because the grid surface is slightly different each time the experiment is performed, in all cases there was a ~4-fold higher fraction of folded molecules in the mutant. Since each WT/mutant experimental pair was studied in parallel, using identically prepared grids, the results provide further evidence that the mutant stabilizes the IHM. However, we agree that a quantitative, direct visual correlation of the SRX and IHM is not possible based on the current EM data.

      Finally, the utility of the methods described in the paper to the field would be greatly enhanced if they were described in more detail. As currently written, it would be difficult for others to replicate these experiments.

      Thank you for the comment. We have made significant changes in the methods to clarify the details of the experiments (see pages 11-14). In addition, we have added details to the results and figure legends.

    1. Author Response

      Reviewer #1 (Public Review):

      “This study investigates the dynamics of brain network connectivity during sustained experimental pain in healthy human participants. To this end, capsaicin was applied to the tongues of two cohorts of participants (discovery cohort, N=48; replication cohort, N=74). This procedure resulted in pain for several minutes. During sustained pain, pain avoidance/intensity ratings and fMRI scans were obtained. The analyses (i) compare the pain state with a resting state, (ii) assess the dynamics of brain networks during sustained pain, and (iii) aim to predict pain based on the dynamics of brain networks. To this end, the analyses focus on community structures of time-evolving networks. The results show that sustained pain is associated with the emergence of a brain network including somatomotor, frontoparietal, basal ganglia and thalamic brain areas. The somatomotor area of the tongue is particularly involved in that network while this area is decoupled from other parts of the somatomotor cortex. Moreover, the network configuration changes over time with the frontoparietal network decoupling from the somatomotor network. Frontoparietal-cerebellar connections were predictive of decreases of pain. Together, the findings provide novel and convincing insights into the dynamics of brain networks during sustained pain.

      Strengths

      • The brain mechanisms of sustained pain are a timely and relevant topic with potential clinical implications.

      • Assessing the dynamics of sustained pain and relating it to the dynamics of brain networks is a timely and promising approach to further the understanding of the brain mechanisms of pain.

      • The study includes discovery and replication cohorts and pursues a cutting-edge analysis strategy.

      • The manuscript is very well-written and the results are visualized in an exemplary manner including a graphical outline and summary of the findings.”

      We thank the reviewer for the thoughtful summarization and evaluation of our study.

      “Weaknesses

      • It remains unclear whether the changes of brain networks over time simply reflect the duration of sustained pain or whether they essentially reflect different levels of pain intensity/avoidance.”

      We appreciate the editor and reviewer’s comment on this issue. With the current experimental paradigm, it is difficult to dissociate the pain duration from the level of pain because the delivery of oral capsaicin commonly induces initial bursting and then a gradual decrease of pain over time. That is, the pain duration is correlated with the pain intensity in our task.

      However, when we examined the time-course of the ratings at the individual level (as shown in Figure S2), time duration explained 53.7% of the rating variance, R² = 0.537 ± 0.315 (mean ± standard deviation). In addition, if we constrain the beta coefficient of time duration to be negative (i.e., ratings should decrease over time), the explained variance decreases to 48.2%, R² = 0.482 ± 0.457, leaving enough unexplained variance (i.e., greater than 50%) to examine the distinct effects of time duration and ratings on the patterns of functional brain reorganization.
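
      The per-participant explained variance described above can be illustrated with an ordinary least-squares fit of ratings on time. This is a sketch for illustration only, not the analysis code used in the study: the function name, the synthetic rating time-course, and the way the negative-slope constraint is handled (falling back to a flat fit, and hence R² = 0, when the fitted slope is positive) are all assumptions.

```python
import numpy as np


def r_squared_time(ratings, times, force_negative_slope=False):
    """R^2 of a straight-line fit of ratings on time for one participant."""
    ratings = np.asarray(ratings, dtype=float)
    times = np.asarray(times, dtype=float)
    slope, intercept = np.polyfit(times, ratings, 1)
    if force_negative_slope and slope > 0:
        # Assumed handling: with the slope constrained to be <= 0,
        # the best admissible line is flat at the mean rating.
        slope, intercept = 0.0, ratings.mean()
    residual = ratings - (slope * times + intercept)
    total = ratings - ratings.mean()
    return 1.0 - residual.dot(residual) / total.dot(total)


# Synthetic example: ratings that decay over 20 minutes, plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 20, 200)
y = 0.8 * np.exp(-t / 8) + 0.05 * rng.standard_normal(t.size)
r2 = r_squared_time(y, t, force_negative_slope=True)
print(f"explained by time: {r2:.2f}, not explained by time: {1 - r2:.2f}")
```

      Computing this value for each participant and then averaging would give a group summary of the form reported above (mean ± standard deviation), with 1 − R² being the share of rating variance that time alone does not account for.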

      Indeed, the two main analyses included in the manuscript—consensus community detection and predictive modeling—were designed to examine those two aspects of the task, i.e., time duration and pain avoidance ratings, respectively. First, through the consensus community detection analysis, we examined the community structure that changes over time, i.e., across the early, middle, and late periods (as shown in Figure 3). We then developed predictive models of pain avoidance ratings in the second main analysis (as shown in Figure 5).

      Though it is still a caveat that we cannot fully dissociate the effects of time duration versus pain ratings, we could interpret the first set of results to be more about time duration, while the second set of results is more about pain ratings.

      We have now added a description of the implication of predictive modeling for isolating the effects of pain ratings, as well as a discussion of the caveat of the current experimental design and relevant future directions.

      Revisions to the main manuscript:

      p. 25: Moreover, developing models to directly predict the pain ratings is helpful to complement the group-level analysis, because the changes in consensus community structure over the early, middle, and late periods only indirectly reflect the different levels of pain.

      p. 27: This study also had some limitations. First, with the current experimental paradigm, it is difficult to dissociate the pain duration from the level of pain because the delivery of oral capsaicin commonly induces initial bursting and then a gradual decrease of pain over time. Though we aimed to model the effects of pain duration and pain avoidance ratings with our two primary analyses, i.e., consensus community detection and predictive modeling, we cannot fully dissociate the impact of time duration versus pain ratings.

      “• Although the manuscript is very well-written it might benefit from an even clearer and simpler explanation of what the consensus community structure and the underlying module allegiance measure assesses.”

      Thank you for the suggestion. We have now added additional (but simple) descriptions of the module allegiance and consensus community detection methods.

      Revisions to the main manuscript:

      pp. 8-9: Here, the consensus community means the group-level representative structures of the distinct community partitions of individuals. To determine the consensus community across different individuals and times, we first obtained the module allegiance (Bassett et al., 2011) from the community assignment of each individual. Module allegiance assesses how much a pair of nodes is likely to be affiliated with the same community label, and is defined as a matrix T whose element Tij is 1 when nodes i and j are assigned to the same community and 0 when assigned to different communities. This conversion of the categorical community assignments to the continuous module allegiance values allows group-level summarization of different community structures of individuals.

      p. 14: Here, high module allegiance indicates the voxels of two regions are likely to be in the same community affiliation, and vice versa.
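
      The module allegiance definition quoted above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names and the toy partitions are hypothetical, and averaging the binary matrices across individuals is used here as one straightforward way to obtain the continuous group-level values.

```python
import numpy as np


def module_allegiance(labels):
    """Binary matrix T with T[i, j] = 1 if nodes i and j share a community label."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def mean_allegiance(partitions):
    """Average allegiance over partitions (e.g. individuals or time windows)."""
    return np.mean([module_allegiance(p) for p in partitions], axis=0)


# Three hypothetical individuals, four nodes each; label values are arbitrary.
partitions = [
    [0, 0, 1, 1],
    [2, 2, 2, 5],
    [1, 1, 0, 0],
]
A = mean_allegiance(partitions)
print(A)
```

      Because each matrix depends only on whether two nodes share a label, not on the label values themselves, partitions with different labelings (such as the first and third above) contribute identically. That is what makes a group-level summary possible; a consensus partition would then typically be derived from the averaged matrix, a step not shown here.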

      “• The added value of the assessment of the dynamics of brain networks remains unclear. Specifically, it is unclear whether the current analysis of brain networks dynamics allows for a clearer distinction between and prediction of pain and no-pain states than other measures of static or dynamic brain activity or static measures of brain connectivity.”

      The main goal (and thus, the added value) of the current study was to provide a “mechanistic” understanding of the brain processes of sustained pain, rather than the “prediction.” Even though we included the results from the predictive modeling, as in Figures 4-6, our focus was more on the interpretation of the model to quantitatively examine the functional changes in the brain, not on the maximization of the prediction performance.

      Indeed, maximizing the prediction performance was the main goal of our previous study (Lee et al., 2021), in which we developed a predictive model of sustained pain based on the patterns of dynamic functional connectivity. The model showed better prediction performances compared to the current study, but it was challenging to interpret the model because of the high dimensionality of the model and its features. In addition, functional connectivity itself provides only limited insight into how functional brain networks are structured and reconfigured over time.

      In this sense, the multi-layer community detection method has several advantages for achieving our goal. First, the community detection analysis allows us to summarize the complex, high-dimensional whole-brain connectivity patterns into neurobiologically interpretable subsystems. Second, the multi-layer community detection method allows us to study the temporal changes in community structure by connecting the same nodes across different time points.

      We have now added a description of the rationale behind the choice of the multi-layer community detection analysis over the conventional functional connectivity methods, and the added value of our study.

      Revisions to the main manuscript:

      p. 3: In this study, we examined the reconfiguration of whole-brain functional networks underlying the natural fluctuation in sustained pain to provide a mechanistic understanding of the brain responses to sustained pain.

      p. 7: In this study, we used this approach to examine the temporal changes of brain network structures during sustained pain, which cannot be done with conventional functional connectivity-based analyses (Lee et al., 2021).

      p. 27: However, the previous model provides a limited level of mechanistic understanding because of the high dimensionality of the model and its features. In addition, functional connectivity itself provides only limited insight into how functional brain networks are structured and reconfigured over time.

      Reviewer #2 (Public Review):

      “The Authors J-J Lee et al. investigated cortical and subcortical brain networks and their organization in communities over time during evoked tonic pain. The paper is well-written, and the findings are interesting and relevant for the field. Interestingly, other than confirming well known phenomena (e.g., segregation within the primary somatomotor cortex), the Authors identified an emerging "pain supersystem" during the initial increase of pain, in which subcortical and frontoparietal regions, usually more segregated, showed more interactions with the primary somatomotor cortex. Decrease of pain was instead associated with a reconfiguration of the networks that sees subcortical and frontoparietal regions connected with areas of the cerebellum. The main novelty of the proposed analysis lies in the resulting high performance of the classifier, which shows how this interesting link between the frontoparietal network and subcortical regions with the cerebellum is predictive of pain decrease. In summary, the main strengths of the present manuscript are:

      • Inclusion of subcortical regions: most of the recent papers using the Schaefer parcellation in ~200 brain areas [1] do not consider subcortical areas, ignoring possible relevant responses and behaviors of those regions. Not only did the Authors smartly address this issue, but most of their results showed how subcortical regions played a key role in the network reconfiguration over time during evoked sustained pain.

      • Robust classification results: high accuracy obtained on training dataset (internal validation), using a leave-one-out approach, and on the available independent test dataset (external validation) of relatively large sample size (N=74).

      • Clarity in the description of aim and sub-aims and exhaustive presentation of the obtained results helped by appropriate illustrations and figures (I suggest less wording in some of them).

      • Availability of continuous behavioral outcome (track ball).”

      We appreciate the reviewer’s summary and positive evaluations.

      “Even though the results are mostly cohesive with previous literature, some of the results need to be discussed in relation to recently published papers on the same topic, and some of the non-standard methodological procedures need to be justified by adding appropriate citations (or more detailed comments). The Authors do not touch upon the concept of temporal summation of pain, historically associated with tonic pain, especially as the study is aimed at better understanding brain mechanisms in chronic pain populations (chronic pain patients often exhibit increased temporal summation of pain [2]). I would suggest starting from the paper recently published by Cheng et al., which also shares most of the methodological pipeline [3], to highlight similarities and novelties and deepen the comparison with the associated literature.”

      We thank the reviewer and editor for the comment on this important topic. Temporal summation of pain refers to a progressively increased sensation of pain during prolonged noxious stimulation (Price, Hu, Dubner, & Gracely, 1977), and has been suggested as a hallmark of chronic pain disorders including fibromyalgia (Cheng et al., 2022; Price et al., 2002). In a recent study by Cheng et al. (2022), the authors induced tonic pain using constantly high cuff pressure and examined whether the participants experienced increased pain in the late period compared to the early period of pain. In contrast, in our experimental paradigm, the capsaicin liquid initially delivered into the oral cavity is gradually washed out by saliva, and thus overall pain intensity decreased over time rather than increased (Figure 1B). Therefore, the temporal summation of pain may occur in a limited period (e.g., the early period of the run), but it is difficult to examine its effect systematically in our study.

      However, it is notable that Cheng et al.’s results overlap with our findings. For example, Cheng et al. reported the intra-network segregation within the somatomotor network and the inter-network integration between the somatomotor and other networks during the temporal summation of pressure pain in patients with fibromyalgia, which were similar to the findings we reported in Figure S9 and Figure 4. Although it is unclear whether these results reflect the temporal summation of pain, these network-level features shared across the two studies are likely to be an essential component of the sustained pain processes in the brain.

      We have now added a comment on the temporal summation of pain to the main manuscript.

      Revisions to the main manuscript (p. 26):

      Interestingly, a recent fMRI study on the temporal summation of pain in fibromyalgia patients reported results similar to ours (Cheng et al., 2022), including the intra-network dissociation within the somatomotor network and the inter-network integration between the somatomotor and other networks during pain. Although we cannot directly examine whether the temporal summation of pain gave rise to these network-level changes due to the limitation of our experimental paradigm, these consistent findings between the two studies may suggest that our findings could be generalized to clinical conditions.

      We thank the reviewer and editor for the information about this recent publication. Cheng et al. (2022) was not published at the time we wrote the manuscript, and we were surprised that Cheng et al. shares many aspects with our study, e.g., both used multilayer community detection and also reported similar findings, as described above.

      However, there were some differences between the two studies as well.

      First, the focus of our study was on the brain dynamics during the natural time-course of sustained pain from its initiation to remission in healthy participants, whereas the focus of Cheng et al. was on the temporal summation of pain (TSP) and the enhanced TSP in patients with fibromyalgia. Because of this difference in research focus, our study and Cheng et al. provide many non-overlapping results and insights. For example, our study paid particular attention to the coping mechanisms of the brain (e.g., the network-level changes in the subcortical and frontoparietal network regions) and the brain systems that are correlated with the natural decrease of pain (e.g., the cerebellum in Figure 5). In contrast, Cheng et al. (2022) identified the brain connectivity and network features important for the increased TSP in fibromyalgia patients.

      Second, our great interest was in identifying and visualizing the fine-grained spatiotemporal patterns of functional brain network changes over the period of sustained pain. To utilize fine-grained brain activity information, we conducted our main analyses at a voxel-level resolution and on the native brain space, such as in Figures 2-3 and Figures S5, S7, and S8. With this fine-grained spatiotemporal mapping, we were able to identify small, but important voxel-level dynamics.

      We have now cited Cheng et al. (2022) in multiple places and revised the manuscript accordingly.

      Revisions to the main manuscript (p. 26):

      Interestingly, a recent fMRI study on the temporal summation of pain in fibromyalgia patients reported results similar to ours (Cheng et al., 2022), including the intra-network dissociation within the somatomotor network and the inter-network integration between the somatomotor and other networks during pain. Although we cannot directly examine whether the temporal summation of pain gave rise to these network-level changes due to the limitation of our experimental paradigm, these consistent findings between the two studies may suggest that our findings could be generalized to clinical conditions.

      “Here the main significant weaknesses of the study:

      • The data analysis is entirely conducted on young healthy subjects. This is not a limitation per se, but the conclusion about offering new insights into understanding mechanisms at the basis of chronic pain is too far from the results. Centralization of pain is very different from summation and habituation, especially if all the subjects in the study consistently rated increased and decreased pain in the same way (it never happens in chronic pain patients). A similar pipeline has been actually applied to chronic pain patients (fibromyalgia and chronic back pain) [3,4]. Discussing the results of the present paper in relationship to those, could offer a more robust way to connect the Authors' results to networks behavior in pathological brains.”

      We are grateful for the opportunity to discuss the clinical implication of our study. First of all, we agree with the reviewer and editor that we cannot make a definitive claim about chronic pain with the current study, and thus, we revised the last sentence of the abstract to tone down our claim.

      Revisions to the main manuscript (p. 2, in the abstract):

      This study provides new insights into how multiple brain systems dynamically interact to construct and modulate pain experience, advancing our mechanistic understanding of sustained pain.

      However, as we noted above in E-4, some of our findings were consistent with the findings from a previous clinical study (Cheng et al., 2022), suggesting the potential to generalize our study to clinical pain conditions. In addition, we previously reported that a predictive model of sustained pain derived from healthy participants performed better at predicting the pain severity of chronic pain patients than the model derived directly from chronic pain patients (Lee et al., 2021), highlighting the advantage of the “component process approach.”

      The component process approach aims to develop brain-based biomarkers for basic component processes first, which can then serve as intermediate features for the modeling of multiple clinical conditions (Woo, Chang, Lindquist, & Wager, 2017). This has been one of the core ideas of the Research Domain Criteria (RDoC) (Insel et al., 2010) and the Hierarchical Taxonomy of Psychopathology (HiTOP) (Kotov et al., 2017). If the clinical pain of a patient group is modeled as a whole, it becomes unclear what is being modeled because of the multidimensional and heterogeneous nature of clinical pain (Melzack, 1999) as well as other co-occurring health conditions (e.g., mental health issues, medication use, etc.). The component process approach, in contrast, can specify which components are being modeled and are relatively free from heterogeneity and comorbidity issues by experimentally manipulating the specific component of interest in healthy participants.

      The current study was conducted on healthy young adults based on the component process approach. We used oral capsaicin to experimentally induce sustained pain, which unfolds over protracted time periods and has been suggested to reflect some of the essential features of clinical pain (Rainville, Feine, Bushnell, & Duncan, 1992; Stohler & Kowalski, 1999). Therefore, the detailed characterization of the brain processes of sustained pain will be able to serve as an intermediate feature of multiple clinical conditions in future studies.

      We have now added a discussion of the clinical generalizability issue to the discussion section.

      Revisions to the main manuscript:

      pp. 25-26: An interesting future direction would be to examine whether the current results can be generalized to clinical pain. Experimental tonic pain has been known to share similar characteristics with clinical pain (Rainville et al., 1992; Stohler & Kowalski, 1999). In addition, in a recent study, we showed that an fMRI connectivity-based signature for capsaicin-induced orofacial tonic pain can be generalized to chronic back pain (Lee et al., 2021). Therefore, a detailed characterization of the brain responses to sustained pain has the potential to provide useful information about clinical pain.

      p. 26: Interestingly, a recent fMRI study on the temporal summation of pain in fibromyalgia patients reported results similar to ours (Cheng et al., 2022), including the intra-network dissociation within the somatomotor network and the inter-network integration between the somatomotor and other networks during pain. Although we cannot directly examine whether the temporal summation of pain gave rise to these network-level changes due to the limitation of our experimental paradigm, these consistent findings between the two studies may suggest that our findings could be generalized to clinical conditions.

      “Vice versa, the behavioral measure used to assess evoked pain perception (avoidance ratings), has been developed for chronic pain patients and never validated on healthy controls [5]. It might not be an appropriate measure considering the total absence of pain variability in the reported responses over forty-eight subjects [6,7].”

      We acknowledge that pain avoidance measures are not fully validated in the healthy population. Nevertheless, we used this measure in this study for the following two main reasons that outweigh the limitations.

      First, a pain avoidance rating provides an integrative measure that can reflect the multi-dimensional aspects of sustained pain. One of the essential functions of pain is to avoid harmful situations and promote survival, and the avoidance motivation induced by pain is composed of not only sensory-discriminative, but also cognitive components including learning, valuation, and contexts (Melzack, 1999). According to the fear-avoidance model (Vlaeyen & Linton, 2012), if the pain-induced avoidance motivation is not resolved for a long time and is maladaptively associated with innocuous environments, chronic pain is likely to develop, suggesting the importance and clinical relevance of pain avoidance measures. In addition, our experimental design is particularly suitable for the use of avoidance ratings because the oral capsaicin stimulation is accompanied by the urge to avoid the painful sensation, but, as in chronic pain, this urge cannot be resolved immediately. Moreover, capsaicin is sometimes experienced as intense but less aversive (or even appetitive), e.g., by people who crave spicy food (Stevenson & Yeomans, 1993). In this case, avoidance ratings can provide a more reasonable measure of pain than intensity ratings.

      Second, the avoidance measure provides a common scale on which we can compare different types of aversive experiences, allowing us to conduct specificity tests for a predictive model of pain. For example, a recent study successfully compared the brain representations of two types of pain and two types of aversive, but non-painful experiences (e.g., aversive auditory and visual experiences) using the same avoidance measure (Ceko, Kragel, Woo, Lopez-Sola, & Wager, 2022). These comparisons were possible because the avoidance measure provided one common scale for all the aversive experiences regardless of their types of stimuli.

      To provide a better justification for the use of the avoidance measure, we now included the specificity test results of our pain predictive models. More specifically, we tested our module allegiance-based SVM and PCR models of pain on the aversive taste and aversive odor conditions (Figure S13).

      Despite these advantages, the use of avoidance rating without thorough validation is a limitation of the current study, and thus future studies need to examine the psychometric properties of the avoidance rating, e.g., examining the relationship among pain intensity, unpleasantness, and avoidance measures. However, the current study showed that the predictive models derived with pain avoidance rating (Study 1) could be used to predict the pain intensity rating (Study 2). In addition, the overall time-course of pain avoidance ratings in Study 1 was similar to the time-course of pain intensity ratings in Study 2, providing some supporting evidence for the convergent validity of the pain avoidance measure.

      As to the following comment, “It might not be an appropriate measure considering the total absence of pain variability in the reported responses over forty-eight subjects,” there is evidence that the low between-individual variability of ratings is due to the characteristics of our experimental design, not to the use of the avoidance measure. As we discussed in more detail in our response to E-1, our experimental procedure based on capsaicin liquid commonly induces an initial burst of painful sensation and a subsequent gradual relief for most of the participants (Figure 1B, left). A similar time-course pattern of ratings was observed in Study 2 (Figure 1B, right), which used the pain “intensity” rating, not the pain avoidance rating. In addition, previous studies with a similar experimental design (i.e., intra-oral capsaicin application) (Berry & Simons, 2020; Lu, Baad-Hansen, List, Zhang, & Svensson, 2013; Ngom, Dubray, Woda, & Dallel, 2001) also showed a similar time-course of pain ratings with low between-individual variability regardless of the rating types (e.g., VAS or irritation intensity), confirming that this observation is not unique to the pain avoidance rating.

      We have now added descriptions of the small between-individual variability of pain ratings and of the use of avoidance ratings.

      Revisions to the main manuscript:

      pp. 5-7: Note that the overall trend of pain ratings over time was similar across participants because of the characteristics of our experimental design, which has also been observed in the previous studies that used oral capsaicin (Berry & Simons, 2020; Lu et al., 2013; Ngom et al., 2001). However, also note that each individual’s time-course of pain ratings was not entirely the same (Figures S2 and S3).

      p. 26: However, there are also differences between the characteristics of capsaicin-induced tonic pain versus clinical pain. For example, clinical pain continuously fluctuates over time in an idiosyncratic pattern (Apkarian, Krauss, Fredrickson, & Szeverenyi, 2001), whereas capsaicin-induced tonic pain showed a similar time-course pattern across the participants—i.e., increasing rapidly and then decreasing gradually (Figure 1B). This typical time-course of pain ratings has been reported in previous studies that used oral capsaicin (Berry & Simons, 2020; Lu et al., 2013; Ngom et al., 2001).

      pp. 26-27: Note that Study 1 used a pain avoidance measure that is not yet fully validated in healthy participants. However, we chose to use the pain avoidance measure, which can provide integrative information on the multi-dimensional aspects of pain (Melzack, 1999; Waddell, Newton, Henderson, Somerville, & Main, 1993). It also has a clinical implication considering that the maladaptive associations of pain avoidance to innocuous environments have been suggested as a putative mechanism of transition to chronic pain (Vlaeyen & Linton, 2012). Lastly, the avoidance measure can provide a common scale across different modalities of aversive experience, allowing us to compare their distinct brain representations (Ceko et al., 2022) or test the specificity of their predictive models (Lee et al., 2021) (Figure S13). Although the psychometric properties of the pain avoidance measure should be a topic of future investigation, we expect that the pain avoidance measure would have a high level of convergent validity with pain intensity given the observed similarity between pain avoidance (Study 1) and pain intensity (Study 2) in their temporal profiles. The generalizability of our PCR model across Studies 1 and 2 also supports this speculation. However, there would also be situations in which pain avoidance is dissociated from pain intensity. For example, capsaicin can be experienced as intense but less aversive or even appetitive in some contexts, such as cravings for spicy food (Stevenson & Yeomans, 1993). In addition, the gradual rise of avoidance ratings during the late period of the control condition in Study 1 would not be observed if the intensity measure were used. Future studies need to examine the relationship between pain avoidance and the other pain assessments and the advantage of using the pain avoidance measure.

      “• The dynamic measure employed by the Authors is better described from the term "windowed functional connectivity". It is often considered a measure of dynamic functional connectivity and it gives information about fluctuations of the connectivity patterns over time. Nevertheless, the entire focus of the paper, including the title, is on dynamic networks, which inaccurately leads one to think of time-varying measures with higher temporal resolution (either updating for every acquired time point, as the Authors did in their previous publication on the same dataset [4], or sliding windows involving weighting or tapering [8,9]). This allows one to follow network reorganization over time without averaging 2-min intervals in which several different brain mechanisms might play an important role [3,10,11]. In summary, the assumption of constant response throughout 2-min periods of tonic pain and the use of Pearson correlations do not mirror the idea of dynamic analysis expressed by the Authors in title and introduction. I would suggest removing "dynamic" from the title, reduce the emphasis on this concept, address possible confounds introduced by the choice of long windows and rephrase the aim of the study in terms of brain network reconfiguration over the main phases of tonic pain experience.”

      We have now removed the word ‘dynamic’ from many places in the manuscript, including the title. In addition, we added a brief discussion of why we chose to use long, non-overlapping windows for the connectivity calculation.

      Revisions to the main manuscript (p. 8):

      Although the long duration of the time window without overlaps may obscure the fine-grained temporal dynamics in functional connectivity patterns, we chose to use this long time window based on previous literature (Bassett et al., 2011; Robinson, Atlas, & Wager, 2015), which also used long time windows to obtain more reliable estimates of network structures and their transitions.
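
      As an illustration of the windowed approach discussed above, the following minimal sketch computes one Pearson correlation matrix per non-overlapping window. It is not the authors' pipeline: the function names, the input layout (one list of samples per region), and the window length in samples are assumptions made for this example, and plain Python is used only to keep the sketch dependency-free.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def windowed_connectivity(region_ts, window_len):
    """Return one region-by-region correlation matrix per window.

    region_ts  : list of time series, one per region (hypothetical layout)
    window_len : samples per non-overlapping window; trailing samples
                 that do not fill a whole window are dropped
    """
    n_windows = len(region_ts[0]) // window_len
    matrices = []
    for w in range(n_windows):
        start, stop = w * window_len, (w + 1) * window_len
        segments = [ts[start:stop] for ts in region_ts]
        matrices.append([[pearson(a, b) for b in segments] for a in segments])
    return matrices


# Two toy regions: coupled in the first window, anti-coupled in the second.
toy = [[1, 2, 3, 4, 1, 2, 3, 4],
       [1, 2, 3, 4, 4, 3, 2, 1]]
mats = windowed_connectivity(toy, window_len=4)
```

      Because each window is summarized by a single matrix, any change in coupling that occurs within a window is averaged out, which is the trade-off between temporal resolution and reliability described above.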

      “• Procedure chosen for evoking sustained pain. To the best of my knowledge, capsaicin sauce on the tongue is not a validated tonic pain procedure. In favor of this argument is the absence of inter-subject variability in the behavioral results showed in the paper, very unusual for response to painful stimulations. The procedure is well described by the Authors, and some precautions like letting the liquid drying before the start of the scan, have helped reducing confounds. Despite this, the measures in figure 1B suggest that the intensity of the painful stimulation is not constant as expected for sustained pain (probably the effect washes out with the saliva). In this case, the first six-minute interval requires particular attention because it encapsulates the real tonic pain phase, and the following ones require more appropriate labels. Ideally the Author should cite previous studies showing that tongue evoked pain elicits a very specific behavioral response (summation, habituation/decrease of pain, absence of pain perception). If those works are missing, this response needs to be treated as a finding rather than an obvious point.”

      We addressed this comment. Moreover, we found previous studies that experimentally induced tonic pain through the application of capsaicin on the tongue (Berry & Simons, 2020; Boudreau, Wang, Svensson, Sessle, & Arendt-Nielsen, 2009; Green, 1991; Ngom et al., 2001), suggesting that our experimental procedure is in line with previous literature.

      Reviewer #3 (Public Review):

      “In their manuscript, Lee and colleagues explore the dynamics of the functional community structure of the brain (as measured with fMRI) during sustained experimental pain and provide several potentially highly valuable insights into, and evaluate the predictive capacity of, the underlying dynamic processes. The applied methodology is novel but, at the same time, straightforward and has solid foundations. The findings are very interesting and, potentially, of high scientific impact as they may significantly push the boundaries of our understanding of the dynamic neural processes during sustained pain, with a (somewhat limited) potential for clinical translation.

      However (Major Issue 1), after reading the current manuscript version, not all of my doubts have been dissolved regarding the specificity of the results to pain. Moreover (Major Issue 2), some of the results (specifically, those related to the group level analysis of community differences) do not seem to be underpinned with a proper statistical inference in the current version of the manuscript and, therefore, their presentation and discussion may not be proportional to the degree of evidence. Next to these Major Issues (detailed below), some other, minor clarifications might also be needed before publications. These are detailed below or in the private part of the review ("Recommendations for the authors").

      Despite these issues, this is, in general, a high quality work with a high level of novelty and - after addressing the issues - it has a very high potential for becoming an important contribution (and a very interesting read) to the pain-research community and beyond.”

      We appreciate the reviewer’s thoughtful comments. We have revised the manuscript to address the Reviewer’s major concerns, as described below.

      “Major Issue 1:

      The main issue with the manuscript is that it remains somewhat unclear, how specific the results are to pain.

      Differences between the control resting state and the capsaicin trials might be - at least partially - driven by other factors, like:

      • motion artifacts

      • saliency, attention, anxiety, etc.

      Differences between stages over the time-course might, additionally, be driven by scanner drifts (to which the applied approach might be less sensitive, but the possibility is still there ) or other gradual processes, e.g. shifts in arousal, attention shifts, alertness, etc.

      All the above factors might emerge as confounding bias in both of the predictive models.

      This problem should be thoroughly discussed, and at least the following extra analyses are recommended, in order to attenuate concerns related to the overall specificity and neurobiological validity of the results:

      • reporting of, and testing for motion estimates (mean, max, median framewise displacement or anything similar)

      • examining whether these factors might, at least partially, drive the predictive models.

      • e.g. applying the PCR model on the resting state data and verifying of the predicted timecourse is flat (no inverse U-shape, that is characteristic to all capsaicin trials).

      Not using the additional sessions (bitter taste, aversive odor, phasic heat) feels like a missed opportunity, as they could also be very helpful in addressing this issue.”

      We thank the reviewer for this comment on the important issue regarding the specificity of our results and the potential influences of noise. The effects of head motion and physiological confounds are particularly relevant to pain studies because pain involves substantial physiological changes and often causes head motion. To address the related concerns of specificity, we conducted additional analyses assessing the independence of our predictive models (i.e., SVM and PCR models) from head movement and physiology variables and the specificity of our models to pain versus non-painful aversive conditions (i.e., bitter taste and aversive odor) in Study 1.

      First, we examined the overall changes in framewise displacement (FD) (Power, Barnes, Snyder, Schlaggar, & Petersen, 2012), heart rate (HR), and respiratory rate (RR) in the capsaicin condition (Figure S11). For the univariate comparison between the capsaicin and control conditions (Figure S11A), the results showed that, as expected, the capsaicin condition caused significant changes in head motion and autonomic responses. The mean FD and HR were significantly higher, and the RR was lower, in the capsaicin condition compared to the control condition (FD: t(47) = 5.30, P = 2.98 × 10⁻⁶; HR: t(43) = 4.98, P = 1.10 × 10⁻⁵; RR: t(43) = -1.91, P = 0.063, paired t-test). In addition, the increased motion and autonomic responses were more prominent in the early period of pain (Figure S11B). The 10-binned (2 min per time-bin) FD and HR showed a decreasing trend while the RR showed an increasing trend over time in the capsaicin condition. The comparisons between the early (bins 1-3, 0-6 min) and late (bins 8-10, 14-20 min) periods of the capsaicin condition showed significant differences for both FD and HR (FD: t(47) = 6.45, P = 8.12 × 10⁻⁸; HR: t(43) = 6.52, P = 6.41 × 10⁻⁸; RR: t(43) = -1.61, P = 0.11, paired t-test). These results suggest that while participants were experiencing capsaicin tonic pain, particularly during the early period, head motion and heart rate increased, while breathing slowed down. Note that we needed to exclude 4 participants’ data from this analysis due to technical issues with the physiological data acquisition.
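
      For readers unfamiliar with the motion metric, the sketch below shows how framewise displacement is commonly computed following Power et al. (2012), together with the paired t statistic used for the condition comparisons. It is an illustration only: the function names, the input layout, and the assumption that rotations are given in radians are ours, and the 50 mm sphere radius is the conventional value from that paper.

```python
from math import sqrt
from statistics import mean, stdev

HEAD_RADIUS_MM = 50.0  # sphere radius used to convert rotations to mm


def framewise_displacement(motion):
    """Framewise displacement per frame (first frame is defined as 0).

    motion: list of (tx, ty, tz, rx, ry, rz) realignment parameters,
            translations in mm and rotations in radians (assumed units).
    """
    fd = [0.0]
    for prev, cur in zip(motion, motion[1:]):
        translation = sum(abs(c - p) for p, c in zip(prev[:3], cur[:3]))
        # Arc length on a 50 mm sphere for each rotational change.
        rotation = sum(abs(c - p) * HEAD_RADIUS_MM
                       for p, c in zip(prev[3:], cur[3:]))
        fd.append(translation + rotation)
    return fd


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


# Toy example: one 0.1 mm shift in x plus a 0.002 rad rotation.
toy_motion = [(0, 0, 0, 0, 0, 0), (0.1, 0, 0, 0.002, 0, 0)]
toy_fd = framewise_displacement(toy_motion)

# Toy paired comparison with three hypothetical participants.
t_value, df = paired_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

      The degrees of freedom returned by `paired_t` are the number of pairs minus one, which is why the FD tests above have 47 degrees of freedom (48 participants) and the HR and RR tests have 43 (44 participants after the exclusions).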

      Next, we examined whether the changes in head motion and physiological responses influenced our predictive model performance (Figure S12). We first regressed out the mean FD, HR, and RR (concatenated across conditions and participants as we trained the SVM model) from the predicted values of the SVM model with leave-one-subject-out cross-validation (2 conditions × 44 participants = 88) and then calculated the classification accuracy again (Figure S12A). The SVM model showed a reduced, but still significant classification accuracy for the capsaicin versus control conditions in a forced-choice test (n = 44, accuracy = 89%, P = 1.41 × 10⁻⁷, binomial test, two-tailed). We also did the same analysis for the PCR model (10 time-bins × 44 participants = 440), which likewise showed a significant prediction performance (n = 44, mean prediction-outcome correlation r = 0.20, P = 0.003, bootstrap test, two-tailed, mean squared error = 0.159 ± 0.022 [mean ± s.e.m.]) (Figure S12B). These results suggest that our SVM and PCR models capture unique variance in tonic pain above and beyond the head movement and physiological changes.
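
      The forced-choice test and its two-tailed binomial P value can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, chance level is taken to be 0.5, and the count of 39 correct out of 44 is our inference from the reported 89% accuracy.

```python
from math import comb


def forced_choice_accuracy(pain_scores, control_scores):
    """Count participants whose model response is higher for pain.

    Each participant contributes one pair of scores; the pair is
    classified correctly when the pain score exceeds the control score.
    """
    correct = sum(p > c for p, c in zip(pain_scores, control_scores))
    return correct, len(pain_scores)


def binomial_p_two_tailed(k, n):
    """Exact two-tailed binomial P value against chance = 0.5."""
    tail = min(k, n - k)
    one_tail = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)


# Toy scores for three hypothetical participants.
n_correct, n_total = forced_choice_accuracy([1.0, 0.5, -0.2],
                                            [0.2, 0.6, -0.5])

# 39 of 44 correct (about 89%, an inferred count).
p_value = binomial_p_two_tailed(39, 44)
```

      With 39 of 44 pairs classified correctly, this calculation gives a P value of about 1.41 × 10⁻⁷, consistent with the value reported above.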

      Lastly, we examined the specificity of our predictive models to pain by testing the models on the non-painful but aversive conditions, including the bitter taste (induced by quinine) and aversive odor (induced by fermented skate) conditions (Figure S13). All the model responses were obtained using leave-one-participant-out cross-validation. The results showed that the overall model responses of the SVM model for the bitter taste and aversive odor conditions were higher than those for the control condition but lower than those for the capsaicin condition (Figure S13A). Classification accuracies for capsaicin vs. bitter taste and capsaicin vs. aversive odor were both significant (for capsaicin vs. bitter taste, accuracy = 79%, P = 6.17 × 10⁻⁵, binomial test, two-tailed, Figure S13C; for capsaicin vs. aversive odor, accuracy = 83%, P = 3.31 × 10⁻⁶, binomial test, two-tailed, Figure S13E), supporting the specificity of our SVM model to pain. Similarly, the model responses of the PCR model for the bitter taste and aversive odor conditions were lower than those for the capsaicin condition, and their temporal trajectories were less steep and fluctuated less than in the capsaicin condition (Figure S13B). The time-course of the model responses for the control condition was flatter than in all other conditions and did not show the inverted U-shape. Furthermore, the model responses for the bitter taste and aversive odor conditions did not show significant correlations with the actual avoidance ratings (bitter taste: mean prediction-outcome correlation r = 0.05, P = 0.41, bootstrap test, two-tailed, mean squared error = 0.036 ± 0.006 [mean ± s.e.m.], Figure S13D; aversive odor: mean prediction-outcome correlation r = 0.12, P = 0.06, bootstrap test, two-tailed, mean squared error = 0.044 ± 0.004 [mean ± s.e.m.], Figure S13F), suggesting the specificity of the PCR model to pain.
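
      The mean within-participant prediction-outcome correlation and a bootstrap test of whether it differs from zero could be computed along the following lines. The response does not specify the bootstrap variant, so the percentile-style two-tailed P value below, the resampling over participants, the number of resamples, and all function names are assumptions made for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def within_participant_r(predicted, observed):
    """One prediction-outcome correlation per participant."""
    return [pearson(p, o) for p, o in zip(predicted, observed)]


def bootstrap_p_two_tailed(values, n_boot=10000, seed=0):
    """Two-tailed bootstrap P value for mean(values) != 0.

    Participants are resampled with replacement; the P value is twice
    the smaller fraction of resampled means on either side of zero,
    with a +1 correction so that it is never exactly zero.
    """
    rng = random.Random(seed)
    n = len(values)
    at_or_below = at_or_above = 0
    for _ in range(n_boot):
        m = sum(values[rng.randrange(n)] for _ in range(n)) / n
        at_or_below += m <= 0
        at_or_above += m >= 0
    p_low = (at_or_below + 1) / (n_boot + 1)
    p_high = (at_or_above + 1) / (n_boot + 1)
    return min(1.0, 2 * min(p_low, p_high))


# Toy data: two hypothetical participants with three time-bins each.
toy_r = within_participant_r([[1, 2, 3], [3, 2, 1]],
                             [[2, 4, 6], [1, 2, 3]])
mean_r = sum(toy_r) / len(toy_r)
```

      Resampling participants rather than time-bins treats the participant as the unit of inference, which matches the reporting of a mean within-individual correlation.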

      Overall, we have provided evidence that our models can predict pain ratings above and beyond the head motion and physiological changes and that the models are more responsive to pain compared to non-painful aversive conditions.

      We have now added descriptions of the specificity tests to the main manuscript and to the Supplementary Information.

      Revisions to the main manuscript (p. 20):

      Specificity of the module allegiance-based predictive models

      To examine whether the predictive models were specific to pain and the prediction performances were not influenced by confounding variables such as head motion and physiological changes, we conducted additional analyses as shown in Figures S11-13. The SVM and PCR models showed significant prediction performances even after controlling for head motion (i.e., framewise displacement) and physiological responses (i.e., heart rate and respiratory rate) (Figures S11 and S12) and did not respond to the non-painful but aversive conditions including the bitter taste and aversive odor conditions (Figure S13), supporting the specificity of our predictive models to pain. For details, please see Supplementary Results.

      Revisions to the Supplementary Information (pp. 2-4):

      Specificity analysis (Figures S11-13) To examine whether the predictive models (i.e., SVM and PCR models) were specific to pain and not influenced by confounding noises, we conducted additional specificity analysis assessing the independence of the models from head movement and physiology variables and specificity of our models to pain versus non-painful aversive conditions (i.e., bitter taste and aversive odor) in Study 1. First, we examined the overall changes of framewise displacement (FD) (Power et al., 2012), heart rate (HR), and respiratory rate (RR) in sustained pain (Figure S11). For the univariate comparison between capsaicin vs. control conditions (Figure S11A), the results showed that, as expected, capsaicin condition caused significant changes in motion and autonomic responses. The mean FD and HR were significantly higher, and the RR was lower in the capsaicin condition compared to the control condition (FD: t47 = 5.30, P = 2.98 × 10-6; HR: t43 = 4.98, P = 1.10 × 10-5; RR: t43 = -1.91, P = 0.063, paired t-test). For the temporal changes of movement and physiology variables (Figure S11B), the results showed that the increased motion and autonomic responses are more prominent in the early period of pain. The 10-binned (2 mins per time-chunk) FD and HR showed decreasing trend while the RR showed increasing trend over time in capsaicin condition. Additional univariate comparisons between early (1-3 bins, 0-6 min) vs. late (8-10 bins, 14-20 min) period of capsaicin condition showed that differences were significant for FD and HR (FD: t47 = 6.45, P = 8.12 × 10-8; HR: t43 = 6.52, P = 6.41 × 10-8; RR: t43 = -1.61, P = 0.11, paired t-test). This suggests that while participants were experiencing tonic pain, particularly in the early period, motion and heart rate was increased but breathing was slowed. Note that we needed to exclude 4 participants’ data due to technical issues with physiological data acquisition. 
Next, we examined whether the head movement and physiological responses are the main driver of our predictive models (Figure S12). For all the original signature responses from SVM model (2 conditions × 44 participants = 88), we regressed out the mean FD, HR, and RR (concatenated across conditions and participants as the SVM model was trained) and calculated the classification accuracy (Figure S12A). Although the signature responses were controlled for movement and physiology variables, the SVM model still showed a high classification accuracy for the capsaicin versus control conditions in a forced-choice test (n = 44, accuracy = 89%, P = 1.41 × 10-7, binomial test, two-tailed). Similarly, for all the original signature responses from PCR model (10 time-bins × 44 participants = 440), we regressed out the 10-binned FD, HR, and RR (concatenated across time-bins and participants as the PCR model was trained) and calculated the within-individual prediction-outcome correlation (Figure S12B). Again, the PCR model showed a significantly high predictive performance (n = 44, mean prediction-outcome correlation r = 0.20, P = 0.003, bootstrap test, two-tailed, mean squared error = 0.159 ± 0.022 [mean ± s.e.m.]) while controlling for movement and physiology variables. These results suggest that our SVM and PCR models captures unique variance in tonic pain above and beyond the head movement and physiological changes. Lastly, we examined the specificity of our predictive models to pain, by testing the models onto the non-painful but tonic aversive conditions including bitter taste (induced by quinine) and aversive odor (induced by fermented skate) (Figure S13). All the signature responses were obtained using leave-one-participant-out cross-validation. The results showed that the overall signature responses of SVM model for bitter taste and aversive odor conditions were higher than those for control conditions, but lower than capsaicin condition (Figure S13A). 
Classification accuracies for capsaicin vs. bitter taste and capsaicin vs. aversive odor were both significantly high (capsaicin vs. bitter taste: accuracy = 79%, P = 6.17 × 10^-5, binomial test, two-tailed, Figure S13C; capsaicin vs. aversive odor: accuracy = 83%, P = 3.31 × 10^-6, binomial test, two-tailed, Figure S13E), suggesting the specificity of the SVM model to pain. Similarly, the temporal trajectories of the signature responses of the PCR model for the bitter taste and aversive odor conditions did not overlap with that of the capsaicin condition (Figure S13B). Furthermore, the signature responses of the bitter taste and aversive odor conditions did not have a significant relationship with the actual avoidance ratings (bitter taste: mean prediction-outcome correlation r = 0.05, P = 0.41, bootstrap test, two-tailed, mean squared error = 0.036 ± 0.006 [mean ± s.e.m.], Figure S13D; aversive odor: mean prediction-outcome correlation r = 0.12, P = 0.06, bootstrap test, two-tailed, mean squared error = 0.044 ± 0.004 [mean ± s.e.m.], Figure S13F), suggesting the specificity of the PCR model to pain. Overall, we have provided evidence that the module allegiance-based models can predict pain ratings above and beyond the movement and physiological changes, and are more responsive to pain than to non-painful aversive conditions, which suggests the specificity of our results to pain.
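
      As a quick arithmetic check on the forced-choice result for the covariate-adjusted SVM model (n = 44, accuracy = 89%), the two-tailed binomial P-value can be recomputed by hand. The sketch below is illustrative only and is not the authors' code; the count of 39 correct participants is an assumption inferred from the reported percentage, and chance level is taken to be 0.5.

```python
from math import comb

def binomial_two_tailed(k, n):
    """Exact two-tailed binomial P-value at chance level 0.5.

    Doubles the upper tail P(X >= k); this equals the exact two-sided
    value because the null distribution is symmetric. Assumes k >= n / 2.
    """
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

n_participants = 44
n_correct = 39  # assumption: 39/44 = 88.6%, which rounds to the reported 89%

p_value = binomial_two_tailed(n_correct, n_participants)
print(f"accuracy = {n_correct / n_participants:.1%}, P = {p_value:.3g}")
```

      Under these assumptions the result agrees with the reported P-value to three significant figures.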

      “Major Issue 2:

      Another important issue with the manuscript is the (apparent) lack of statistical inference when analyzing the differences in the group-level consensus community structures (both when comparing capsaicin to control and when analysing changes over the time-course of the capsaicin-challenge).

      Although I agree that the observed changes seem biologically plausible and fit very well to previous results, without proper statistical inference we can't determine, how likely such differences are to emerge just by chance.

      This makes all results on Figs. 2 and 3, and points 1, 4 and 5 in the discussion partially or fully speculative or weakly underpinned, comprising a large proportion of the current version of the manuscript.

      Let me note, that this issue only affects part of the results and the remaining - more solid - results may already provide a substantial scientific contribution (which might already be sufficient to be eligible for publication in eLife, in my opinion).

      Therefore I see two main ways of handling Major Issue 2:

      • enhancing (or clarifying potential misunderstandings regarding) the methodology (see my concrete, and hopefully feasible, suggestions in the "private part" of the review),

      • de-weighting the presentation and the discussion of the related results.

      I believe there are many ways to test the significance of these differences. I highlight two possible, permutation testing-based ideas.

      Idea 1: permuting the labels ctr-capsaicin, or early-mid-late, repeating the analysis, constructing the proper null distribution of e.g. the community size changes and obtain the p-values. Idea 2: "trace back" communities to the individual level and do (nonparametric) statistical inference there.”

      We appreciate this important comment. We did not conduct statistical inference when comparing the group-level consensus community affiliations of the different conditions (Figure 2) or different phases (Figure 3) because of the difficulty in matching the community affiliation values of the networks to be compared.

      For example, let us assume that 800 out of 1,000 voxels of community #1 and 1,000 out of 4,000 voxels of community #2 in the control condition are commonly affiliated with the same community #3 in the capsaicin condition. To compare the community affiliation between the two conditions, we should first match the community label of the capsaicin condition (i.e., #3) to that of the control condition (i.e., #1 or #2), and here a dilemma occurs: if we prioritize the proportion of overlapping voxels for the matching, the common community should be labeled as #1 (80% versus 25% overlap), whereas if we prioritize the number of overlapping voxels, the label of the common community should be #2 (1,000 versus 800 voxels). Although both choices look reasonable, neither is a perfect solution.

      As the example above illustrates, it is impossible to exactly match the community affiliations of different networks. We must choose an imperfect criterion for the matching procedure, which inevitably affects the comparison of network structure. This was the main reason that we limited our results of Figures 2-3 to a qualitative description based on visual inspection. Moreover, the group-level consensus community structures in Figures 2-3 are not a simple group statistic like a sample mean; they were obtained from multiple steps of analyses including permutation-based thresholding and unsupervised clustering, which could further complicate the interpretation of statistical tests.
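
      The label-matching dilemma in the 800/1,000 versus 1,000/4,000 example can be made concrete with a few lines of Python. This is only an illustration using the hypothetical voxel counts from the text; the variable names are ours.

```python
# Control-condition community sizes and their overlap with capsaicin
# community #3, using the hypothetical counts from the example.
control_sizes = {1: 1000, 2: 4000}    # voxels in control communities #1 and #2
overlap_with_3 = {1: 800, 2: 1000}    # voxels shared with capsaicin community #3

# Criterion A: match on the proportion of the control community that overlaps.
by_proportion = max(control_sizes, key=lambda c: overlap_with_3[c] / control_sizes[c])

# Criterion B: match on the absolute number of overlapping voxels.
by_count = max(control_sizes, key=lambda c: overlap_with_3[c])

print("label chosen by proportion:", by_proportion)
print("label chosen by count:", by_count)
```

      The two criteria pick different labels for the same data, which is why any label-matching rule builds an arbitrary choice into the comparison.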

      Alternatively, there is a slightly different but more rigorous approach to the comparisons of the community structures, which is the Phi-test (Alexander-Bloch et al., 2012; Lerman-Sinkoff & Barch, 2016). Instead of using the community labels directly, this method converts the community label of each voxel into a list of module allegiance values between the seed voxel and all the voxels of the brain (i.e., 1 if the seed and target voxels have the same community label and 0 otherwise). This allows quantitative comparisons of voxel-level community profiles between different conditions without an arbitrary matching of the community labels. We adopted this Phi-test for our analyses to examine whether the regional community affiliation pattern is significantly different between (i) the capsaicin vs. control conditions and (ii) the early vs. late periods of pain (Figure S6), which correspond to the main findings of Figures 2 and 3 in our manuscript, respectively.

      More specifically, to compare the group-level consensus community structures between the capsaicin vs. control conditions and the early vs. late periods, we first obtained a seed-based module allegiance map for each voxel (i.e., using each voxel as a seed). Then, we calculated a correlation coefficient of the module allegiance values between two different conditions for each voxel. This correlation coefficient can serve as an estimate of the voxel-level similarity of the consensus community profile. Because module allegiance is a binary variable, these correlation values are Phi coefficients. A small Phi coefficient means that the spatial pattern of brain regions that have the same community affiliation as the given voxel is different between the two conditions. For example, if a voxel is connected to the somatomotor-dominant community during the capsaicin condition and the default-mode-dominant community during the control condition, the brain regions that have the same community label as the voxel will be very different, and thus the Phi coefficient will become small. Moreover, the Phi coefficient can be small even if a voxel is affiliated with the same (matched) community label in both conditions, when the spatial pattern of that community differs between conditions.

      To calculate the statistical significance of the Phi coefficient, we conducted permutation tests, in which we randomly shuffled the condition labels in each participant and obtained the group-level consensus community structure for each shuffled condition. Then, we calculated the voxel-level correlations of the module allegiance values between the two shuffled conditions. We repeated this procedure 1,000 times to generate the null distribution of the Phi coefficients, and calculated the proportion of null samples that have a smaller Phi coefficient (i.e., a more dissimilar regional community structure) than the non-shuffled original data.
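
      The voxel-level Phi coefficient and the permutation scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: all function names are hypothetical, and `majority_consensus` is a toy stand-in for the real consensus step (which involves permutation-based thresholding and unsupervised clustering) that assumes labels are already comparable across participants.

```python
import random
from collections import Counter
from math import sqrt

def allegiance(labels, seed):
    """Binary module-allegiance vector: 1 where a voxel shares the seed's community."""
    return [1 if lab == labels[seed] else 0 for lab in labels]

def phi(x, y):
    """Phi coefficient, i.e. the Pearson correlation of two binary vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / sqrt(vx * vy)

def voxel_phi(labels_a, labels_b, seed):
    """Similarity of the seed voxel's community profile between two structures."""
    return phi(allegiance(labels_a, seed), allegiance(labels_b, seed))

def majority_consensus(participants):
    """Toy consensus: per-voxel majority label (hypothetical stand-in only)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*participants)]

def permutation_p(cond_a, cond_b, seed, consensus=majority_consensus,
                  n_perm=1000, rng=None):
    """One-tailed P: share of shuffles with a smaller Phi than the observed one.

    cond_a, cond_b: per-participant label lists for the two conditions.
    Condition labels are swapped at random within each participant.
    """
    rng = rng or random.Random(0)
    observed = voxel_phi(consensus(cond_a), consensus(cond_b), seed)
    smaller = 0
    for _ in range(n_perm):
        shuf_a, shuf_b = [], []
        for a, b in zip(cond_a, cond_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if voxel_phi(consensus(shuf_a), consensus(shuf_b), seed) < observed:
            smaller += 1
    return observed, smaller / n_perm

# Toy data: 6 participants, 4 voxels, two conditions.
cond_a = [[0, 0, 1, 1] for _ in range(6)]
cond_b = [[0, 0, 0, 1] for _ in range(6)]
observed, p = permutation_p(cond_a, cond_b, seed=0, n_perm=200)
print(round(observed, 3), p)
```

      Because the allegiance vectors are defined relative to the seed voxel, the comparison never requires community labels to be matched across conditions, which is the property the Phi-test is used for here.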

      Results showed that there are multiple voxels with statistical significance (permutation tests with 1,000 iterations, one-tailed) in the areas where the community affiliations of the two contrasting conditions were different (Figure S6). For example, the frontoparietal and subcortical regions for the capsaicin vs. control conditions (cf. Figure 2), and the frontoparietal, subcortical, brainstem, and cerebellar regions for the early vs. late period of pain (cf. Figure 3) contain voxels that survived thresholding at FDR-corrected q < 0.05, suggesting the robustness of our main results.

      Particularly, the somatomotor and insular cortices showed statistical significance in the permutation test, and this may reflect the large changes in other areas that are connecting to the somatomotor and insular cortices across different conditions. The statistical significance was also observed in the visual cortex, which was unexpected. We interpret that the spatial distribution of the visual network community is too stable across conditions, and thus the null distribution from permutation formed a very narrow distribution of Phi coefficients. Therefore, a small change in the community structure could achieve statistical significance.

      We have now added descriptions of the permutation tests.

      Revisions to the main manuscript:

      p. 9: Permutation tests confirmed that the community assignment in the frontoparietal and subcortical regions showed significant changes between the capsaicin versus control conditions (Figure S6A).

      p. 13: Permutation tests further confirmed that the community assignment in the frontoparietal, subcortical, and brainstem regions showed significant changes between the early versus late period of pain (Figure S6B).

      pp. 36-37: Permutation tests for regional differences in community structures. To test the statistical significance of the voxel-level difference of consensus community structures (Figures 2 and 3), we performed the following Phi-test (Alexander-Bloch et al., 2012; Lerman-Sinkoff & Barch, 2016). First, for each given voxel, we compared the community label of the voxel to the community label of all the voxels, generating a list of voxel-seed module allegiance values that allow quantitative comparison of voxel-level community profiles (e.g., [1, 0, 1, 1, 0, 0, ...], in which each element is equal to 1 if the seed and target voxels were assigned to the same community and 0 otherwise). Next, a correlation coefficient was calculated between the module allegiance values of the two different brain community structures (i.e., capsaicin versus control, and early versus late). This correlation coefficient is an estimate of the regional similarity of community profiles (here, the correlation coefficient is a Phi coefficient because module allegiance is a binary variable). To estimate the statistical significance of the Phi coefficient, we performed permutation tests, in which we randomly shuffled the labels and then obtained the group-level consensus community structures from the shuffled data. Then, the Phi coefficient between the module allegiance values of the two shuffled consensus community structures was calculated. We repeated this procedure 1,000 times to generate the null distribution of the Phi coefficient for each voxel. Lastly, we examined the probability of observing a smaller Phi coefficient (i.e., a more dissimilar community profile) than the one from the non-shuffled original data, which corresponds to the P-value of the permutation test. All the P-values were one-tailed as the hypothesis of this permutation test is unidirectional.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript by Chen et al., the authors use live-cell single-molecule imaging to dissect the role of DNA binding domains (DBD) and activation domains (AD) in transcription factor mobility in the nucleus. They focus on the family of Hypoxia-Inducible Factor isoforms, which dimerize and bind chromatin to induce a transcriptional response. The main finding is that activation domains can be involved in DNA binding as indicated by careful observations of the diffusion/reaction kinetics of transcription factors in the nucleus. For example, different bound fractions of HIF-1beta and HIF-2alpha are observed in the presence of different binding partners and chimeras. The paradigm of interchangeable parts of transcription factors has been eroded over the years (the recent work of Naama Barkai comes to mind, cited herein), so the present observations are not unexpected per se. Yet, the measurements are rigorous and well-performed and have the important benefit of being in living cells. Enthusiasm is also dampened by the exclusive use of one technique and one analysis to reach conclusions.

      In the revised manuscript we complement the single molecule imaging experiments with genomic approaches, including Cut&Run and RNA-seq, that largely confirm our main conclusions derived from the SPT results. 

      Reviewer #2 (Public Review):

      The authors raise the very important question of how different transcription factors with similar in vitro DNA sequence specificity are able to achieve distinct binding profiles associated with distinct functions. They use hypoxia inducible factors (HIF) as a model system and combine live cell single-particle tracking with comprehensive genetic and chemical perturbations to study the mechanisms underlying isoform-specific gene regulation. Their main experimental readout is the distribution of diffusion coefficients of a molecular species, extracted from a population of single-particle trajectories. From this distribution, the authors extract the fractions of immobile and mobile molecules as well as the peak diffusion coefficient of the mobile fraction. They find that in addition to the structured DNA binding domain and the dimerization interface of HIF-1a and HIF-2a, the C-terminus of those factors, which includes intrinsically disordered regions and an activation domain, contributes to modulating the bound fraction of HIF-1b and the HIF-a isoforms. In particular, the C-terminus of HIF-2a mediates a higher bound fraction than that of HIF-1a. This finding is important as it demonstrates that separating HIF into distinct domains that each have clearly defined functions is an oversimplification. Rather, a more holistic view seems suitable, in which all parts of HIF contribute to nuclear diffusion and binding.

      The conclusions drawn on the bound fractions and the nuclear dynamics of HIF isoforms are mostly backed up by data and proper controls. However, some controls are missing and some aspects of data analysis need to be clarified and extended. Moreover, the authors fail to answer their initial question, as the experimental readout does not contain information on the DNA sequences involved in the binding events.

      Experimental controls:

      For some imaging experiments, the authors use cell lines where endogenous HIF-1b or HIF-2a was fused to an N-terminal HaloTag by CRISPR/Cas editing. These cell lines are comprehensively controlled for proper functionality of the edited transcription factors, including expression levels, cellular localization and DNA binding. However, differential expression compared to unedited levels is not quantified and only Halo-HIF-2a is tested for functional gene transcription.

      To confirm that the tagged proteins still maintain normal function in driving target gene expression, we performed RNA-seq on WT cells, HaloTag-HIF-2α KIN and Halo-HIF-1β KIN cells, and show that gene expression in these edited cells does not differ significantly from that in unedited WT cells (Figure 1—figure supplement 3B, C).

      Other experiments include overexpression of exogenously expressed factors. For those, the authors give statements such as "expressed from a relatively strong ... promoter" and "weakly expressed", but do not provide any control of the amount of overexpression. Quantifying the expression levels will be important, as some of the authors' experiments demonstrate a strong dependency of results on expression level.

      We have now included Western Blot results showing L30-driven expression of all HIF variants in comparison with KIN levels (Fig 4—Figure Supplement 1). However, we note that cells stably expressing the HIF variants are polyclonal and Western Blotting is a bulk assay only able to assess the population average. As such, Western blot analysis may not reflect the actual expression level in the individual cells used in the imaging experiments. To properly control HIF expression at the individual cell level, we instead monitored the protein concentration in each cell and only chose to image cells with similar fluorescence level, as measured by localization density (Fig 4—Figure Supplement 1 and see detailed discussion in Appendix 2).

      Moreover, the authors do not provide any control for proper functionality of domain-swap mutants.

      We now include RNA-seq results demonstrating that WT cells over-expressing HIF-α WT and domain-swap variants (Halo-HIF-1α, Halo-HIF-1α/2α, Halo-HIF-2α, Halo-HIF-2α/1α) can activate their specific target genes, confirming that all these variants are also transcriptionally active. (See Figure 6A, B and Figure 6—figure supplement 2: increased binding of wild-type or domain-swapped HIF to several gene loci or neighboring regions coincides with increased transcription levels of these genes; and Figure 7: HIF-expressing cells with the same HIF-IDR co-cluster in their mRNA transcription profiles.)

      The authors further state that they use a high illumination power of 1100 mW. Such high laser power might be detrimental to cells and the authors should control whether this laser power induces any artifacts.

      We agree that a high illumination power (indispensable to achieve a high signal-to-noise ratio and detect single molecules) may be detrimental to cells in the long run. However, we only took 1 movie with < 2000 frames for each cell. With a 5-ms frame interval, the total imaging duration per cell was under 10 seconds. Cells are unlikely to respond to any stimulus/damage in such a short time. Moreover, we used stroboscopic illumination instead of continuous illumination, with only 1-ms laser exposure for each 5-ms frame. The total integrated laser exposure is thus at most 2 seconds. In addition, all imaging was done with a red laser (633 nm), which has a relatively low phototoxicity. Furthermore, the 1100 mW is the output from the laser box, but the actual laser power density used for imaging was measured to be approximately 2.3 kW/cm^2 at 633 nm (Graham et al., 2021). Such an imaging scheme is very unlikely to generate phototoxicity artifacts within the short time window of our measurements. Lastly, we are comparing results across all conditions with the exact same imaging set-up, so any artifact should be accounted for and controlled. We do consider fast SPT a terminal, end-point experiment, where each cell is only imaged once and never re-used.
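
      The exposure figures quoted above follow from simple arithmetic, spelled out here for reference. The sketch takes 2000 frames as the upper bound stated in the text; the variable names are ours.

```python
max_frames = 2000        # upper bound on frames per movie (text states < 2000)
frame_interval_ms = 5    # time from the start of one frame to the next
pulse_ms = 1             # stroboscopic laser pulse within each frame

total_duration_s = max_frames * frame_interval_ms / 1000  # wall-clock imaging time
laser_on_s = max_frames * pulse_ms / 1000                 # integrated laser exposure
duty_cycle = pulse_ms / frame_interval_ms                 # fraction of time laser is on

print(total_duration_s, laser_on_s, duty_cycle)
```

      Both values are upper bounds, since each movie contains fewer than 2000 frames.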

      Data analysis:

      Distributions of diffusion coefficients greatly vary between individual cells (e.g. Fig. 2A and B, Fig. S3A and C, Fig. S4E). Unfortunately, the authors do not explain whether this variation is a real cell-to-cell variation, or rather reflects variation of their analysis method, potentially due to a low number of single particle tracks per cell. 

      We agree with the reviewer that the cell-to-cell variation we observed could be due to a low number of trajectories collected for each cell. In fact, sampling small numbers of trajectories allows us to identify protein species with unique diffusion coefficients, which might be lost if we just looked at a large population. Also, the fact that the diffusion coefficient distribution varies between cells does not mean that a particular cell only contains the more prevalent species that was detected. Here we are not trying to determine whether proteins in each cell indeed behave differently or whether the observed variation in the diffusion coefficient distribution is simply an effect of the limited trajectories collected in each cell. We instead analyzed data collected from many cells combined to get a better estimation of the population behavior. We have modified our text to make this important point clear to the readers. 

      Moreover, the bound fraction of HIF-1b differs between two independent measurements including three biological replicates each (Fig. 5 C and F). This raises the concern that not enough data enter each biological replicate, or not enough replicates are considered.

      Unfortunately, the number of cells that could be measured in our current setup is limited. It takes approximately 1 hour to collect 20 cells per sample, including staining, washing, looking for cells with desired expression level, and acquiring movies. For experiments with multiple conditions (>12), 20 cells per sample is the upper limit that can fit into a single day. 

      To address the question of the minimum number of cells/replicates needed, we have included the result of a bootstrapping analysis in Figure 2—figure supplement 3. We used data collected from a total of 243 cells of the same cell line, from over 11 replicates, as the “population” and performed a bootstrapping analysis to identify the source of variation. We have also included Appendix 1 with a detailed discussion. Our results showed that cell-to-cell variation contributes most to the total variation of the data, followed by day-to-day (replicate-to-replicate) variation. However, sampling over 800 trajectories, from over 60 cells imaged in 3 replicates, approximates the “population value” well (bound fraction calculated from 243 cells from over 11 replicates). As a result, in each figure we always used over 60 cells from 3 replicates to generate the reported parameters. Although this approach still gives variable numbers from figure to figure, the variations seen for the same cell line are much smaller than the differences observed between different cell lines/conditions.
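
      The logic of such a bootstrapping analysis can be sketched as below. This is a simplified illustration, not the analysis in Appendix 1: it resamples cells only (no replicate level), reduces each trajectory to a hypothetical 0/1 "bound" flag rather than a diffusion-coefficient distribution, and uses synthetic data with an arbitrary per-cell spread.

```python
import random
from statistics import mean, stdev

def bootstrap_bound_fraction(cells, n_cells, n_boot=1000, seed=0):
    """Resample cells with replacement; return the pooled bound fraction per draw.

    cells: list of cells, each a list of 0/1 flags (1 = trajectory classed as bound).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(cells, k=n_cells)
        pooled = [flag for cell in sample for flag in cell]
        estimates.append(mean(pooled))
    return estimates

# Synthetic "population" of 243 cells whose bound probability varies by cell.
rng = random.Random(1)
population = []
for _ in range(243):
    p_bound = min(max(rng.gauss(0.3, 0.1), 0.0), 1.0)
    n_traj = rng.randint(5, 30)
    population.append([1 if rng.random() < p_bound else 0 for _ in range(n_traj)])

# Spread of the estimate shrinks as more cells enter each sample.
for n_cells in (10, 20, 60):
    est = bootstrap_bound_fraction(population, n_cells, n_boot=500)
    print(n_cells, round(mean(est), 3), round(stdev(est), 3))
```

      Comparing the spread of the estimates across sample sizes against the full-population value is one way to judge how many cells are enough for a stable bound fraction.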

      The authors compare the bound fractions among various mutants and experimental conditions. However, the peak diffusion is not, or only descriptively, evaluated. Thus, it is not clear whether the main effect of a mutation or chemical treatment is to change the bound fraction, or rather the diffusion coefficient of the mobile fraction. 

      Since there might be multiple mobile populations (defined as the fraction with a diffusion coefficient > 0.5 μm^2/sec), the mean diffusion coefficient can change while the mode (peak) diffusion coefficient stays the same and vice versa. Because of such complexity in the mobile population, we prefer to use descriptive words to report the trend for the change instead of reporting exact values. However, as requested, we have added peak diffusion coefficient information to relevant figures as bar plots. We have also included in Table 1 a summary of the mean and mode diffusion coefficients estimated for moving molecules in all relevant figures for the reader’s reference. Note that the diffusion coefficient estimation is on a log scale, and the larger the diffusion coefficient, the lower the resolution (e.g., there is a difference of one grid point both between 2.63 and 2.75, and between 9.55 and 10).
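
      The loss of resolution at larger diffusion coefficients can be illustrated with a log-spaced grid. The spacing of 0.02 in log10 units used below is an assumption: it is inferred from the four values quoted above and is not stated in the response.

```python
LOG10_STEP = 0.02  # assumed grid spacing in log10(D); inferred, not from the source

def grid_value(index):
    """Diffusion coefficient at a given grid index on the assumed log grid."""
    return 10 ** (index * LOG10_STEP)

# Two pairs of neighbouring grid points, one low and one high on the scale.
for low, high in ((21, 22), (49, 50)):
    d_low, d_high = grid_value(low), grid_value(high)
    print(f"{d_low:.2f} -> {d_high:.2f}: "
          f"absolute gap {d_high - d_low:.2f}, ratio {d_high / d_low:.3f}")
```

      Neighbouring grid points always differ by the same ratio, so the absolute gap between them grows with the diffusion coefficient.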

      Conclusions:

      The authors provide data that highlight a potential role of the intrinsically disordered domain of HIF in modulating the bound fraction of these transcription factors. They further claim that the intrinsically disordered domains have a main contribution to this bound fraction. However, the authors do not quantify how this contribution relates to those of the DNA binding domain or the dimerisation interface. Changes in bound fraction estimated from the data in e.g. Fig. 3C, Fig. 4C, Fig. 5C and F rather hint to a dominant effect of dimerisation, followed by DNA binding and a smaller contribution of the intrinsically disordered domain. The authors should quantify the relative changes of the bound fraction for all mutants and experimental conditions, to clarify the importance of the contribution of the intrinsically disordered domain.

      It would be ideal if we could quantify what percent of the bound fraction is contributed by dimerization interface, DBD and IDR, respectively. However, it is very likely that these different domains do not act independently of each other in terms of binding to chromatin fibers. In practice, it is very difficult to dissect and quantify these effects independently. For example, we did try to express HIF-1α and 2α with their IDR completely deleted; however, because the protein-degradation signals are within the IDRs, these deletions caused massive stabilization of these proteins, making it impossible to find cells that express these forms at similar levels as the full-length counterpart. As a result, although these IDR-deleted HIF-α show greatly reduced binding, we did not include the results in the paper because the loss of binding could also be due to the overall higher protein expression levels, leading to large unbound fractions. Regarding the DBD mutants, they only have 1 mutation, so it is hard to tell whether the remaining binding in Figure 5B is due to some residual binding affinity of HIF-α (HIF-α only partially lost its binding affinity), or is due to binding through its partner HIF-1β (HIF-α completely lost binding affinity, but can still bind through dimerization with HIF-1β). All we can safely conclude from Figure 5B is that HIF-α DBD is required for optimal binding, but we cannot determine how much exactly it contributes to binding. We thus argue that, given the interdependence of the different protein domains, the reviewer’s request is not experimentally feasible.

      The authors state that the intrinsically disordered domains of HIF determine their differential binding specificity to chromatin. However, the experiments provided do not allow for such a conclusion. In particular, measuring changes in the bound fractions is not sufficient. Such a conclusion requires a method that is able to inform about the DNA sequences involved in HIF binding, for example chromatin immunoprecipitation.

      As requested, we have included new Cut&Run and RNA-seq results in the revised manuscript showing HIF-α-IDR-specific binding and gene activation.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors report a public browser in which users can easily investigate associations between PGSs for a wide range of traits, and a large set of metabolites measured by the Nightingale platform in UKBB. This browser can potentially be used for identifying novel biomarkers for disease traits or, alternatively, for identifying novel causal pathways for traits of interest.

      Overall I have no major technical concerns about the study, but I would encourage the authors to revisit whether they can find a more compelling example that can better showcase the work that they have done. I understand that this is partly a resource paper but I think the resource itself can have more impact if the paper provides a clearer use-case for how it can drive novel biological insight.

      Many thanks for your comments. We have undertaken a new application of bi-directional Mendelian randomization to demonstrate how users may use this approach to disentangle whether associations in our atlas likely reflect either causes or consequences of PGS traits/diseases. This example is described on page 9:

      ‘For example, we applied Mendelian randomization (MR) to further evaluate associations highlighted in our atlas with triglyceride-rich very low density lipoprotein (VLDL) particles. For instance, both VLDL particle average diameter size and concentration were associated with the PGS for body mass index (BMI) (Beta=0.04, 95% CI=0.033 to 0.046, P<1x10^-300 & Beta=0.012, 95% CI=0.006 to 0.019, P=2.7x10^-4 respectively) and coronary heart disease (CHD) (Beta=0.026, 95% CI=0.019 to 0.032, P<1x10^-300 & Beta=0.035, 95% CI=0.028 to 0.042, P<1x10^-300 respectively). Conducting bi-directional MR suggested that the associations with average diameter of VLDL particles are likely attributed to a consequence of BMI and CHD liability as opposed to the size of VLDL particles having a causal influence on these outcomes (Supplementary Table 6). In contrast, MR analyses suggested that the concentration of VLDL particles increases risk of CHD (Beta=1.28 per 1-SD change in VLDL particle concentration, 95% CI=1.25 to 1.65, P=2.8x10^-7) which may explain associations between the CHD PGS and this metabolic trait within our atlas.’

      and in the Discussion on page 21:

      ‘We likewise conducted bi-directional MR to demonstrate that associations between the CHD PGS and VLDL particle size likely reflect an effect of CHD liability on this metabolic trait. In contrast, the association between the CHD PGS and VLDL concentrations are likely attributed to the causal influence of this metabolic trait on CHD risk, suggesting that it is the concentration of these triglyceride-rich particles that are important in terms of the aetiology of CHD risk as opposed to their actual size. We envisage that findings from our atlas, as well as other ongoing efforts which leverage the large-scale NMR data within UKB, should facilitate further granular insight into lipoprotein lipid biology.’

      PGS construction: It's unclear how well the PGS work. Should the reader prefer the stringent or lenient PGS? Perhaps there could be some validation with traits that have decent sample sizes in UKBB. Was there any filtering to remove traits with few GWS hits, low sample sizes, or low SNP heritability as these are unlikely to produce useful PGSs?

      An example of validation was previously included for the chronic kidney disease PGS and its association with circulating creatinine, although this has now been removed due to the feedback you provided in your comments below. However, we have now provided the weights for all of the PGS included in our web atlas should users want to use these scores for prediction purposes (page 7):

      ‘The specific weights for clumped variants used in all PGS can be found at https://tinyurl.com/PGSweights.’

      On page 8 we have noted that in this work we used a more lenient threshold to facilitate endeavours in a ‘reverse gear Mendelian randomization’ framework. However, the more stringent threshold remains available as an alternative for interested users:

      ‘In this paper, we have discussed findings using PGS that were derived using the more lenient criteria (i.e., P<0.05 & r^2<0.1), although all findings based on both thresholds can be found in the web atlas.’

      ‘Specifically, we believe our findings can facilitate a ‘reverse gear Mendelian randomization’ approach to disentangle whether associations likely reflect metabolic traits acting as a cause or consequence of disease risk (Holmes and Davey Smith, 2019) as illustrated using triglyceride-rich very low density lipoprotein (VLDL) particles in the next section.’

      We have not filtered based on other criteria such as the number of SNPs, given that certain scores, despite only being constructed using a few SNPs, may still prove useful to users. For example, our score for ‘Drinks per day’ based on the more stringent threshold (i.e. P<5x10^-8) consists of only 6 SNPs. However, one of these is rs1229984, a missense variant located at the alcohol dehydrogenase ADH1B gene region and known to be a strong predictor of alcohol use (e.g. https://pubmed.ncbi.nlm.nih.gov/31745073/).

      Reviewer #2 (Public Review):

      The authors set out to create an atlas of associations between phenome-wide polygenic scores and circulating lipids, fatty acids, and metabolites. To do so, they utilize GWAS from 129 traits available in the OpenGWAS database to derive polygenic (risk) scores (PGS) along with the recently released NMR metabolomics data containing 249 biomarkers (and ratios) in ~120,000 UK Biobank participants. The authors create a publicly available web portal containing PGS to NMR biomarker associations:

      http://mrcieu.mrsoftware.org/metabolites_PGS_atlas/.

      The strength of this study is in the comprehensive nature of the atlas, containing associations for 129 traits phenome-wide, the large sample size of the UK Biobank NMR data, and the use of PGS for prioritising molecular traits for follow-up experiments, which is an emerging area of interest (International Common Disease Alliance, 2020; Ritchie et al., 2021a). To our knowledge this study is the first to explore this for circulating metabolites.

      In its current form the atlas has several limitations, which should be straightforward to address. Notably, results in the current atlas may be confounded by (1) technical variation in the NMR data (Ritchie et al., 2021b), and (2) major biological determinants of biomarker concentrations, including body mass index, fasting time, and statin usage.

      Firstly, thank you for the suggestion to use your ‘ukbnmr’ R package to help remove technical variation from the UK Biobank NMR metabolite data. We have applied it to remove outliers and variation in the individual data due to (1) the duration between sample preparation and sample measurement, (2) the position of samples on shipment plates, and (3) the different equipment (spectrometers) used. This meant that we needed to re-run our entire analysis pipeline for this project from scratch on the updated dataset. Results have not changed drastically; nonetheless, we have updated the results of all downstream analyses in our online web atlas using the updated dataset provided by ‘ukbnmr’.

      Secondly, the reviewer is correct that biological factors, such as body mass index (BMI) and statin usage, are indeed strongly correlated with metabolite levels. However, we are not able to adjust for such biological factors directly in our analyses, given that they are potential colliders in the causal relationship between diseases/traits and metabolites. Statin usage may be caused both by high genetic liability to coronary artery disease and by abnormal lipoprotein lipid levels. Likewise, obesity (and changes in BMI) may result from a high genetic predisposition to cardiometabolic disorders and from disrupted metabolism. Thus, adjusting for statin usage and BMI would induce collider bias (https://jamanetwork.com/journals/jama/fullarticle/2790247), which creates spurious associations between the disease/trait PGS and metabolites.

      To better illustrate this issue, we have added text on page 14 to justify this study design decision, as well as a new figure (Figure 3) to demonstrate it clearly to readers. Fasting time, on the other hand, is in our view unlikely to act as a collider and was adjusted for as a covariate in all linear regression models in this work. This is mentioned on page 25.
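
      The collider argument above can be made concrete with a small simulation. The sketch below is illustrative only and uses invented parameters: a PGS and a metabolite are generated independently, statin use is made to depend on both, and the PGS-metabolite correlation is then compared in the full sample and among statin users only (restricting to users being one simple way of conditioning on the collider). All names and the threshold of 1.0 are assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    # PGS and metabolite are independent by construction (no causal link)
    pgs = [rng.gauss(0, 1) for _ in range(n)]
    metabolite = [rng.gauss(0, 1) for _ in range(n)]
    # Statin use is a collider: influenced by both the PGS and the metabolite
    statin = [p + m + rng.gauss(0, 1) > 1.0 for p, m in zip(pgs, metabolite)]

    r_all = pearson(pgs, metabolite)
    # Conditioning on the collider by restricting to statin users
    users = [(p, m) for p, m, s in zip(pgs, metabolite, statin) if s]
    r_users = pearson([p for p, _ in users], [m for _, m in users])
    return r_all, r_users


r_all, r_users = simulate()
```

      In the full sample the correlation is close to zero, as it should be, whereas among statin users a clearly negative correlation appears even though none was simulated. This is the kind of spurious association that adjusting for statin usage or BMI could introduce.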

      …Further, association results for two (of the 129) PGSs, systolic blood pressure (SBP) and diastolic blood pressure (DBP), are invalid (vastly inflated) as the GWASs used to construct these PGSs included UK Biobank samples.

      Many thanks for your suggestion. We have now removed the SBP and DBP PGS from our atlas due to overlapping samples in UKB. Furthermore, our colleagues at the University of Bristol have notified us that the Glioma GWAS data obtained from the OpenGWAS platform was uploaded with incorrect effect alleles. This PGS has also been subsequently removed from the atlas. Additionally, we removed the Alzheimer’s disease (without APOE) PGS because the pleiotropic effect of lipid associated genes is now systematically examined using lipid gene excluded PGS.

      To demonstrate how one might use these PGS to NMR biomarker associations to prioritise (or deprioritise) findings for follow-up, the authors select a biomarker of interest, glycoprotein acetyls (GlycA), to perform bi-directional Mendelian randomization to orient the direction of causal effects between GlycA and traits of associated PGS. However, the conclusions of this analysis are hampered by the heterogeneous nature of the GlycA biomarker, which captures the levels of five proteins in circulation (Otvos et al., 2015; Ritchie et al., 2019), making it a difficult target to appropriately instrument for Mendelian randomization analysis. This, however, does not detract from the broader point the authors make: that PGS can help prioritize molecular traits for experimental follow-up.

      We have now conducted further sensitivity analyses to evaluate the genetically predicted effects of each of the five proteins in the reference you have provided. This is discussed on page 11:

      ‘We also conducted further sensitivity analyses given that the NMR signal of GlycA is a composite signal contributed by the glycan N-acetylglucosamine residues on five acute-phase proteins, including alpha1-acid glycoprotein, haptoglobin, alpha1-antitrypsin, alpha1-antichymotrypsin, and transferrin (Otvos et al., 2015). Using cis-acting plasma protein (where possible) and expression quantitative trait loci (pQTLs and eQTLs) as instrumental variables for these proteins (Supplementary Table 12) did not provide convincing evidence that they play a role in disease risk for associations between PGS and GlycA (Supplementary Table 13). The only effect estimate robust to multiple testing was found for higher genetically predicted alpha1-antitrypsin levels on gamma glutamyl transferase (GGT) levels (Beta=0.05 SD change in GGT per 1 SD increase in protein levels, 95% CI=0.03 to 0.07, FDR=3.6x10-3), although this was not replicated when using estimates of genetic associations with GGT levels from a larger GWAS conducted in the UK Biobank data (Beta=1.6x10-3, 95% CI=-6.9 x10-3 to 0.01, P=0.71). For details of pleiotropy robust analysis and replication results see Supplementary Table 14.’
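
      For readers unfamiliar with single-instrument Mendelian randomization, the sketch below shows the Wald ratio, a common estimator when one cis-acting variant instruments a protein. The response does not state which estimator was used, so this choice is an assumption, and the input numbers are hypothetical rather than taken from the supplementary tables. The standard error uses the first-order approximation that ignores uncertainty in the variant-exposure estimate.

```python
def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Wald ratio estimate for a single genetic instrument.

    beta_exposure: variant effect on the exposure (e.g. SD change in protein level)
    beta_outcome:  variant effect on the outcome
    se_outcome:    standard error of beta_outcome
    Returns the causal estimate, its approximate SE and a 95% confidence interval.
    """
    estimate = beta_outcome / beta_exposure
    # First-order approximation: treats beta_exposure as measured without error
    se = se_outcome / abs(beta_exposure)
    ci = (estimate - 1.96 * se, estimate + 1.96 * se)
    return estimate, se, ci


# Hypothetical summary statistics for one cis-pQTL
estimate, se, ci = wald_ratio(beta_exposure=0.4, beta_outcome=0.04, se_outcome=0.008)
```

      The estimate is expressed per unit increase in the genetically predicted exposure, which is the same scale as the ‘per 1 SD increase in protein levels’ wording used in the quoted text.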

      There are also several important limitations to the study which cannot be addressed, which the authors discuss appropriately in the paper. First, the NMR data does not provide a comprehensive view of the metabolome - it is heavily focused on lipids and fatty acids. Many small metabolites in circulation cannot be measured by NMR spectroscopy, and further insights must wait for data from molecular profiling efforts planned or underway in UK Biobank (e.g. mass spectrometry). Second, the authors restricted analysis to participants of European ancestries. This is a pragmatic analysis choice given (1) the PGSs were derived from GWAS performed in European ancestries, (2) PGS associations are particularly susceptible to confounding from genetic stratification and differences in environment, and (3) the very small sample sizes for which NMR data is currently available in UK Biobank participants. Finally, although a large sample size, UK Biobank is not a random sample of the population: healthy adults are over-represented, meaning PGS to metabolite associations may be different in disease cases or less healthy individuals.

      Overall this study has strong potential, with straightforward to address limitations, and the resulting atlas will provide a useful characterisation of the relationships between NMR biomarkers and polygenic predisposition to various traits and diseases, which can be used by domain experts to prioritise biomarkers or traits for experimental follow-up.

      Reviewer #3 (Public Review):

      Fang et al. created an atlas for associations between the genetic liability of common risk factors or complex disorders and the abundance of small molecules as well as the characteristics of major apolipoproteins in blood. The whole study is well executed, and the statistical framework is sound. A clear strength of the study is the large array of common risk factors and diseases analyzed by means of polygenic risk scores (PGS). Further, the development of an open access platform with appealing graphical display of study results is another strength of the work. Such a reference catalog can help to identify novel biomarkers for diseases and possible causative mechanisms. The authors further show how such a systematic investigation can also help to distinguish cause from consequence. For example, an inflammatory molecule readily measured by the NMR platform and strongly associated in observational studies is likely to be a consequence rather than a cause of common complex diseases.

      However, in its current form, the study suffers from some weakness that would need to be addressed to improve the applicability of the 'atlas'. This includes a distinction of locus-specific versus real polygenic effects, that is, to what extent are findings for a PGS driven by strong single genetic variants that have been shown to have dramatic impact on small molecule concentrations in blood.

      Thank you for your suggestions to help refine our work. In line with this comment, we have 1) repeated all analyses after applying the ‘ukbnmr’ R package, as recommended by reviewer #2, to remove technical variation and outliers, and 2) conducted sensitivity analyses that remove an established list of lipid gene loci from PGS construction. Full results can be interrogated in the web atlas to evaluate whether PGS associations may be driven by locus-specific effects at these regions, which may be particularly informative given the representation of lipoprotein lipid metabolites on the NMR panel. Findings are reported on page 19:

      ‘The polygenic nature of complex traits means that the inclusion of highly weighted pleiotropic genetic variants in PGS may introduce bias into genetic associations within our atlas. To provide insight into this issue, we constructed PGS excluding variants within the regions of the genome which encode the genes for 14 major regulators of NMR lipoprotein lipids signals which captured 75% of the gene-metabolite associations in the Finnish Metabolic Syndrome In Men (METSIM) cohort (Gallois et al., 2019). For details of these genes see Supplementary Table 5.

      For PGS with these lipid loci excluded, anthropometric traits such as waist-to-hip ratio (N=209), waist circumference (N=206) and body mass index (N=205) still provided strong evidence of association with the majority of metabolic measurements on the NMR panel based on multiple testing corrections. Elsewhere however, the Alzheimer’s disease PGS, which was associated with 60 metabolic traits robust to P<0.05/19 in the initial analysis including these lipid loci (Supplementary Table 17), provided no convincing evidence of association with the 249 circulating metabolites after excluding the lipid loci based on the same multiple testing threshold (Supplementary Table 18). Further inspection suggested that the likely explanation for this attenuation of evidence was variants located within the APOE locus, which are recognised to exert their influence on phenotypic traits via horizontally pleiotropic pathways (Ferguson et al., 2020).’

      …Further, it is unclear how much NMR spectroscopy adds over and above established clinical biomarkers, such as LDL-cholesterol or total triglycerides. This is in particular important, since the authors do not adequately distinguish between small molecules, such as amino acids, and characteristics of lipoprotein particles, e.g., the cholesterol content of VLDL, LDL or HDL particles, the latter presenting the vast majority of measures provided by the NMR platform. Finally, the study would benefit from more intriguing or novel examples, how such an atlas could help to identify novel biomarkers or potential causal metabolites, or lipoprotein measures other than the long-established markers named in the manuscript, such as creatinine or lipoproteins.

      To address these comments, we have added a new example focusing on the granular measures of VLDL particles provided by the NMR data (in addition to the examples listed at the start of the response-to-reviewers document). As the reviewer points out, this granularity is one of the strengths of the measures generated by this platform over long-established biomarkers (page 21):

      ‘We likewise conducted bi-directional MR to demonstrate that associations between the CHD PGS and VLDL particle size likely reflect an effect of CHD liability on this metabolic trait. In contrast, the association between the CHD PGS and VLDL concentrations is likely attributable to the causal influence of this metabolic trait on CHD risk, suggesting that it is the concentration of these triglyceride-rich particles that is important in terms of the aetiology of CHD risk, as opposed to their actual size. We envisage that findings from our atlas, as well as other ongoing efforts which leverage the large-scale NMR data within UKB, should facilitate further granular insight into lipoprotein lipid biology.’

    1. Author Response

      We appreciate the thoughtful and thorough critique provided by the two reviewers, and generally agree with their assessment. The revised submission will address the issues they raise. In particular, we agree that the framework of the paper should be broadened to include bacteria and the deep literature associated with coincidental selection.

    1. Author Response

      Evaluation Summary:

      The work by Volante et al. studied a new plasmid partition system, in which the authors discovered that four or more contiguous ParS sequence repeats are required to assemble a stable partitioning ParAB complex and to activate the ParA ATPase. The work reveals a new plasmid partitioning mechanism in which the mechanic property of DNA and its interaction with the partition complex may drive the directional movement of the plasmid.

      Thank you for the kind evaluation. However, we wonder about the description of the pSM19035 partition system we studied here as “a new plasmid partition system”. The system itself is quite old. The editor might have meant “new” as a subject of research, but plasmid partition systems involving RHH-ParB proteins have been studied by a number of groups for some time, including the Alonso Lab, which worked on the pSM19035 partition system for a number of years prior to our current collaboration on this paper. Therefore, we wonder if the term “new” is the most appropriate.

      Reviewer #1 (Public Review):

      This is a very thorough biochemical work that investigated the ParABS system in pSM19035 by Volante et al. Volante et al showed convincingly that a specific architecture of the centromere (parS) of pSM19035 is required to assemble a stable/functional partition complex. Minimally, four consecutive parS are required for the formation of partition complex, and to efficiently activate the ATPase activity of ParA. The work is very interesting, and the discovery will allow the community to compare and contrast to the more widespread/more investigated canonical chromosomal ParABS system (where ParB is a sliding CTPase protein clamp, and a single parS site is often sufficient to assemble a working partition complex). All the main conclusions in the abstract are justified and supported by biochemical data with appropriate controls. A proposed multistep mechanism of partition complex assembly and disassembly (summarized in Fig 6) is reasonable. Perhaps the only shortcoming of this work is that the team does not yet get to the bottom of why four consecutive parS are needed.

      Thank you for the kind evaluation. The last point is an important one. We would like to continue testing our current model, either to obtain stronger supporting evidence or to come up with a better alternative model.

      Reviewer #2 (Public Review):

      ParBs come in two variations, RHH and HTH. In this study, the authors examine the in vitro behavior of the RHH system, which is less studied. Two activities were carefully monitored; ATPase activation and ParA removal from DNA. The system is quite complex, but the authors have done a good job of examining parameter space. One question concerns the physiological relevance. Can this be assessed by uncoupling ParA/ParB expression (making it inducible with IPTG from the chromosome, for example) and testing plasmids with the various constructs?

      This is an excellent point; we agree this is a shortcoming of the current study. As described in our response to “Essential Revisions”, we very much wanted to include in this paper an experiment testing in vivo plasmid stability for parSpSM sites of different sizes, and we put significant effort into it. However, we encountered technical issues with the approach we tried and failed to obtain conclusive data before we ran out of time. Although we had preliminary data that appeared consistent with the notion that shorter parS sites are non-functional and full-size parS sites are functional, the experiment had a flaw that we could not immediately rectify to our satisfaction. Therefore, we decided to postpone this part of the project and plan a broader physiological evaluation of the parSpSM sequence arrangements in the near future. In the revision, we mention at the beginning of the Discussion that the in vivo functional test of parSpSM site requirements remains to be done.

      The authors appear to suggest that the requirement for at least 4 ParB binding sites is due to the inability of ParBs of this type to spread inferring that for the ParB-HTH multiple ParBs bound to ParS are required. Has this been tested in this system?

      ParB spreading has been shown to be essential for the HTH-ParB to perform its role in partition function. We clarified the importance of HTH-ParB spreading for partition function on lines 44-45.

      In any event, another major difference between the two systems is that a peptide corresponding to the N-ter of ParB is sufficient to bind DNA indicating this type of ParB does not have to be bound to DNA to stimulate ParA. It would have been useful if the authors had commented on this.

      There seems to be a typo here. “N-ter of ParB is sufficient to bind DNA indicating ……” is incorrect. Perhaps this was meant to be “N-ter of ParB is unable to bind DNA, indicating ……”. This is not a qualitative difference between the HTH- and RHH-ParBs: the N-terminal ParA-interacting peptides of HTH-ParBs can also activate their cognate ParA ATPase without parS DNA binding, and the parS-dependency of ATPase activation for HTH-ParBs appears to be significantly less stringent than for the RHH-ParB we report here. ParBpSM1-27, which cannot bind parSpSM, could stimulate ParApSM ATPase to at most 1/10 of the level achieved by full-size ParBpSM in the presence of active parSpSM. We clarified this on lines 156-157, and also added discussion of this contrast between the HTH- and RHH-ParBs and its possible implications on lines 458-467.

      Reviewer #3 (Public Review):

      Drs. Volante, Alonso, and Mizuuchi presented a milestone experimental finding on how the distinct architecture of the centromere (ParS) on a bacterial plasmid drives the ParABS-mediated genome partition process. Rather than being driven by cytoskeletal filament pushing or pulling, as in its eukaryotic counterpart, genome partition in prokaryotes is demonstrated to operate as a burnt-bridge Brownian Ratchet, first put forward by the Mizuuchi group. To drive directed and persistent movement without linear motor proteins, this Brownian Ratchet requires two factors: 1) enough bonds (tens to hundreds) bridging the PC-bound ParB to the nucleoid-bound ParA to largely quench the diffusive motion of the PC, and 2) the PC-bound ParB "kicks off" the nucleoid-bound ParA, which can replenish the nucleoid only after a sufficient time delay, which rectifies the initial symmetry-breaking into a directed and persistent movement. Although the time delay in ParA replenishment is established as a common feature across different bacteria, the binding properties of PC-bound ParB vary greatly, which raises the question of how Brownian Ratcheting adapts to different cellular milieus to fulfill the functional fidelity.

      The finding in this work presented a new but important twist in the Brownian Ratchet paradigm. The authors showed that in the pSM19035 plasmid partition system, only four contiguous ParB-binding repeats in ParS are required for the ParA-ParB interactions that drive the PC partition. In other words, only four chemical bonds are needed for the PC partition. Crucially, the authors further demonstrated, through their state-of-the-art biochemistry and reconstitution experiments, that a distinct orientation (configuration?) of the ParB-binding repeats is required for this fidelity. The authors then elaborated on a possible mechanism of how the smaller number of PC-bound ParB can drive directed and persistent PC movement by interacting with nucleoid ParA. If I understand correctly, in their proposed scheme, due to their specific orientation (configuration?), when two of the ParS-bound ParB molecules bind to the two nucleoid-bound ParA molecules there arises a torsional/distortional stress. Consequently, the thermal fluctuations preload the forming bonds, triggering the dissociation of the two ParB molecules from the PC. The remaining PC-bound ParBs may then kick off the ParAs, which have a time delay in DNA-rebinding, while ParB molecules will replenish the ParS to initiate the next round. In this proposal, the key conceptual leap is that not only the substrate but also the cargo remodels to underlie the Brownian Ratcheting.

      We thank the reviewer for the kind evaluation of our work. The model proposed is highly speculative at this point. Although it may appear rather detailed, in order to account for the unexpected findings, we consider it only a working hypothesis to be revised or replaced by a better model in future. We are grateful for the many useful suggestions, which we will follow in our revision.

    1. Author Response

      REVIEWER #1 (PUBLIC REVIEW):

      The study by Monterisi et al. reports that loss-of-function mutations in metabolic pathways do not necessarily have a negative impact on cancer growth. The authors suggest that small solutes transferred via gap junction channels formed between wild-type cells and cells expressing mutants defective in metabolic pathways rescue the metabolically deficient cancer cells. Through the examination of multiple human cell lines with several advanced means of determining gap junction coupling, Cx26 was identified as a major connexin molecule involved in mediating gap junction coupling between colorectal cancer (CRC) cells. Mutations in three metabolic genes were investigated, covering major metabolic functions of the cell: pH regulation, glycolysis and mitochondrial function.

      Strengths: The paper tests a new hypothesis that the mutations that inactivate key metabolic pathways do not incur functional deficits in cancer cells expressing the mutants due to their gap junction coupling to wild type cells.

      From microarray data they identified multiple connexins expressed in various CRC cells. Several advanced analyses were used to assess gap junction coupling in CRC cells, including fluorescence recovery after photobleaching (FRAP). The extent of permeability at steady state was evaluated using CellTracker dyes and coupling coefficients were determined. They also used flow cytometry to study dye transfer, which provides a quantitative, dynamic means of studying cell coupling. The data showed that knocking down Cx26 could greatly reduce diffusive exchange in most of the CRC cells tested.

      The study focused on three metabolic genes: the Na+/H+ exchanger NHE1, a regulator of intracellular pH; the glycolytic gene ALDOA; and the mitochondrial respiration gene NDUFS1. These genes were knocked out in selected CRC cells that highly express them. The co-culture studies were well executed, with fluorescence markers distinguishing the WT and knockout cells and well-defined readouts such as intracellular pH, media pH, glucose/lactate levels, mitochondrial O2 consumption and glycolytic acid production.

      The experiments in general were well designed and conducted, and the data supported the conclusions. The paper is also logically written and figures were well presented providing clear graphic illustrations.

      Thank you for recognising the strengths and novelty of our findings.

      Weaknesses: Although the hypothesis is innovative, no clear justification is provided that illustrates the scenario representing the clinical situation. The remaining questions include: What kind of somatic mutations in cancer cells has little impact on their growth and progression?

      We have now added in vivo data (Fig 8) and revised the Introduction and Discussion to address this point. Briefly, the broader clinical relevance of our findings relates to the notion of essential genes and their negative selection. We show that connexin-dependent coupling can rescue a genetic deficiency, provided the mutation-carrying cell can access wild-type neighbours for the missing function. This rescue effect is limited to processes that handle solutes that can pass via connexin channels, i.e. metabolic processes. As such, sporadic loss-of-function mutations in “essential genes” may not produce a functional deficit in human cancers. We demonstrate rescue extensively in vitro, and now in a xenograft model.

      We argue that our work can explain why certain metabolic genes are essential in vitro, but not in vivo. In monolayers of mutated cells, diffusion across gap junctions cannot rescue the mutant phenotype, because there is no wild-type cell available to supply the missing function. In contrast, mutations in vivo will arise sporadically and wild-type cells are typically available to couple onto the mutation-bearing cell, providing it with functional rescue. Thus, only in the former case would the lethality of essential genes emerge.

      Indeed, many notable studies have found genes of various metabolic pathways to be essential for growth in vitro. Such genes would be expected to undergo negative selection in vivo, but this is exceedingly rare according to multiple observations. By demonstrating metabolic rescue in co-cultures (i.e. a setting closer to the tumour) and (now) in xenografts, our work provides an explanation for this apparent paradox. Indeed, cells such as NDUFS1-negative SW1222 grow very, very slowly in culture compared to wild-type cells and require regular media changes to keep pH alkaline. However, coupling onto wild-type cells can rescue knock-out cells in vitro and in vivo. We argue that this finding explains why loss-of-function mutations in NDUFS1 (and similar genes) do not undergo negative selection in human tumours (despite in vitro predictions).

      The three proteins selected for this study were chosen to represent very distinct types of solute-handling processes. We illustrate our point in a (new) summary figure in Fig 8.

      What types of WT cells, within the tumour cells or with neighbouring normal cells? Whether the current experimental design closely recapitulates the scenario in vivo?

      Indeed, we find that stromal fibroblasts may also support cancer cells via gap junctions, as this is essentially the same concept (i.e. coupling onto a cell with wild-type genes). However, we feel that expanding our present submission to fibroblasts would make the volume of data exceedingly large. Also, the methods we use for fibroblasts are different and would require a full manuscript of their own. For example, there is the issue of how to control for the radically different growth rates of fibroblasts and cancer cells. We chose co-cultures of WT and genetically altered CRC cells so that the co-cultures are of the same background, with just one element changing (i.e. the metabolite-handling gene). This makes our data easy to interpret, and thus strengthens our case. Our in vitro experiments were performed on monolayers, where cells can make contacts in 2D. In vivo, these contacts will spread in all dimensions, so connectivity is likely to be even more significant. If anything, monolayers probably under-estimate the importance of connectivity, but this preparation is more accessible for studying cell-to-cell communication.

      We recognise the importance of adding in vivo data to firm up our conclusions. To that end, we have analysed xenografts established from co-cultures of wild-type DLD1 and NDUFS1-KO SW1222 cells on one flank of a mouse, and Cx26-KO DLD1 and NDUFS1-KO SW1222 cells on the other flank. This experiment tested whether Cx26-dependent connections between mitochondrially-defective NDUFS1-KO SW1222 cells and respiring DLD1 cells (on the left flank only) are able to stimulate growth of the former (GFP-tagged). Indeed, NDUFS1-deficient cells grew faster when rescued by Cx26-expressing DLD1 cells. In contrast, their growth decelerated when DLD1 cells were Cx26-negative. We include these experiments and their controls in Fig 8.

      The readouts for co-culturing for glycolytic ALDOA and NDUFS1 knockout are only cell mass, without determining the more relevant markers, glucose/lactate and mitochondrial O2 consumption and glycolytic acid production.

      Our readouts are two-fold: total biomass and the size of the genetically altered compartment of co-cultures (GFP). We can therefore follow the relative growth of KO cells, which is essential for describing their growth (dis)advantage. We appreciate that other markers are informative. Indeed, we characterised KO and WT cells in terms of O2 consumption and acid production in Fig 7. However, it would not be possible to measure glucose consumption selectively in the GFP-positive KO cells of a co-culture, as the available assays measure ensemble rates for the entire population of cells (e.g. in a single well). Nonetheless, we believe that biomass as a readout is highly relevant to cancer, and we hope the reviewer concurs with us.

      The study needs to include cells without functional gap junctions like the characterized negative control RKO cells.

      This is an excellent suggestion, and we have added data for RKO cells to several figures. As expected, these do not form a syncytium and cannot rescue genetic defects in co-cultured cells. New data are shown in Fig 3G-H, Fig 6-supp2 and Fig 7H, adding to existing RKO controls in Fig 2A/B. Briefly, RKO cells do not exchange CellTracker dyes in monolayers (Fig 3F/G), cannot rescue cells that are ALDOA-deficient (Fig 6-supp2), and cannot rescue NDUFS1-deficient cells (Fig 7H). We also added Cx26-KO DLD1 cells to the CellTracker experiments in Fig 3.

      REVIEWER #2 (PUBLIC REVIEW):

      This paper is a logical extension of the 50 year-old concept of the "bystander effect" in tumours, wherein the effects of anti-tumour chemotherapeutics extend beyond the cells that take them up due to spread through gap junctions to adjacent cells. In this case, however, the authors have creatively realized that the reverse might also occur, and that tumour cells with otherwise fatal mutations in essential metabolic pathways can be rescued by their neighbours through passage of the missing metabolites through gap junctions. This can explain why mutations in other critical pathways, such as protein synthesis and transporters, are selected against in rapidly growing tumours, but others in equally critical pathways of glycolysis, electron transport, etc. are not, despite these genes having been demonstrated to be essential in in vitro KO studies (where all cells in the plate have the critical gene knocked out). A series of elegant experiments are used to test this proposal in several colorectal cancer (CRC) cell lines using three examples - pH regulation (defective Na+/H+ exchanger NHE1), glycolysis (defective Aldolase A (ALDOA)) and oxidative phosphorylation (defective Complex 1 - NDUFS1).

      Thank you for these positive comments. We have added key references to the bystander effect in the Introduction, and explain how our findings build on these milestones.

      The authors first determine the levels of different Cx proteins expressed in each cell line, and determine that for most Cx26 and 31 are dominant, although some lines have a subset of cells with high Cx43 expression. They then use Cell Tracker Green to pre-label cells and use FRAP as a means to measure how well the cell population is coupled. This is a useful measurement but is significantly over-interpreted by the authors as a "permeability" in uM/min. This is not really a permeability, which requires knowledge of the concentration gradient of the permeant species, relative cell volumes, etc. Rather it is a rate of fluorescent recovery that is presumably correlated with, but not quantitatively related to, levels of coupling.

      Thank you for this comment. We would like to explain why we believe our FRAP experiments are able to estimate permeability in units of um/s. The rate of recovery of a solute in a cell following its “destruction” (here, photobleaching) is given as follows:

      $$\frac{dC_{\mathrm{cell}}}{dt} = p \cdot P \cdot (C_{\mathrm{surround}} - C_{\mathrm{cell}}) \qquad [1]$$

      where subscripts ‘cell’ and ‘surround’ refer to the cell and its neighbours. $P$ is the permeability of the barrier between these two compartments, and $p$ is the ratio of the surface area of the barrier (i.e. membrane) to the volume of the bleached cell. Within a “bleached” cell, we measure fluorescence.

      Since fluorescence ($F$) is proportional to concentration ($C$), we can substitute:

      $$C = \alpha \cdot F$$

      where $\alpha$ is a constant of proportionality. Thus, the rate of recovery (L.H.S. of equation [1]) becomes:

      $$\frac{dC}{dt} = \frac{d(\alpha \cdot F)}{dt} = \alpha \cdot \frac{dF}{dt} \qquad [2]$$

      And the R.H.S. of equation [1] is re-written as:

      $$p \cdot P \cdot (C_{\mathrm{surround}} - C_{\mathrm{cell}}) = p \cdot P \cdot (\alpha \cdot F_{\mathrm{surround}} - \alpha \cdot F_{\mathrm{cell}}) = \alpha \cdot p \cdot P \cdot (F_{\mathrm{surround}} - F_{\mathrm{cell}}) \qquad [3]$$

      Putting [2] and [3] together, the constant $\alpha$ cancels:

      $$\frac{dF_{\mathrm{cell}}}{dt} = p \cdot P \cdot (F_{\mathrm{surround}} - F_{\mathrm{cell}})$$

      Prior to photobleaching, there are no (net) gradients, thus the initial $F_{\mathrm{cell}}$ and $F_{\mathrm{surround}}$ are equal (both $F_0$).

      Thus, we can re-write the equation in terms of normalized fluorescence ($f = F/F_0$):

      $$\frac{df_{\mathrm{cell}}}{dt} = p \cdot P \cdot (f_{\mathrm{surround}} - f_{\mathrm{cell}})$$

      $P$ can therefore be expressed as:

      $$P = \frac{df_{\mathrm{cell}}/dt}{p \cdot (f_{\mathrm{surround}} - f_{\mathrm{cell}})}$$

      Here, $df_{\mathrm{cell}}/dt$ is measured from the fluorescence recovery time course and $f_{\mathrm{surround}} - f_{\mathrm{cell}}$ is measured experimentally (in fact, bleaching in the cell is set to 50%, thus this takes the value of 0.5 by default). We can approximate the monolayer as a network of cuboidal cells. The cell’s volume is thus ‘area’ times ‘height’, and the cell’s surface (at which it contacts its neighbors) is the ‘cell’s perimeter’ times ‘height’. Thus, for the bleached cell,

      p = (perimeter × height) / (area × height) = perimeter / area.

      The perimeter and area can be measured from the acquired fluorescence images. Thus, we can describe permeability using data obtained from image stacks. We appreciate that this method makes certain geometrical approximations, but we believe these are not unreasonable. We explain the assumptions and calculations in Appendix 1. More information about the method is published by us in https://pubmed.ncbi.nlm.nih.gov/28368405/. Of course, we accept that these calculations are less accurate than, say, electrical conductance measurements, and to that end, we added a note of caution to the main text.
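      As a rough illustration of the calculation described above, the following minimal sketch computes a permeability estimate from an initial recovery rate and the bleached cell's outline. The function names and the example numbers (a square cell of side 20 µm and a recovery rate of 0.01 per second) are hypothetical and are not taken from the manuscript; the sketch uses the magnitude of the normalized fluorescence gradient, which is 0.5 for a 50% bleach.

```python
def surface_to_volume_ratio(perimeter_um: float, area_um2: float) -> float:
    """p = perimeter / area for a cuboidal cell (height cancels); units: 1/um."""
    return perimeter_um / area_um2


def permeability(df_dt_per_s: float, perimeter_um: float, area_um2: float,
                 gradient: float = 0.5) -> float:
    """Estimate P (um/s) from the normalized recovery rate.

    df_dt_per_s : initial slope of normalized fluorescence recovery (1/s)
    gradient    : magnitude of the normalized fluorescence difference between
                  the surrounding cells and the bleached cell (0.5 for a 50% bleach)
    """
    p = surface_to_volume_ratio(perimeter_um, area_um2)
    return df_dt_per_s / (p * gradient)


# Hypothetical example values (assumptions, not data from the study):
# a square cell with 20 um sides has perimeter 80 um and area 400 um^2.
P_example = permeability(df_dt_per_s=0.01, perimeter_um=80.0, area_um2=400.0)
print(f"Estimated permeability: {P_example:.3f} um/s")
```

      Because p has units of 1/µm and the recovery rate has units of 1/s, the result is in µm/s, which is why the estimate can be reported as a permeability rather than only as a recovery rate.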

      This fluorescent recovery is shown to be sensitive to siRNA KO of Cx expression, but strangely its reduction is only correlated with KD of Cx26 in the 5 cell lines examined. KD of Cx43 (in LOVO cells) and Cx31 in all 5 cell lines had no effect or in some cases seemed to increase the rate of recovery (DLD1 and SNU1235). This is a notable finding, yet the authors choose to completely ignore it and continue with Cx26 KDs in studies of specific metabolite transfers. Some discussion should be included as to why KD of these Cxs has no effect or causes an apparent increase in coupling of the cells.

      The effectiveness of GJB2 knockdown in ablating ensemble connectivity most likely reflects that Cx26 is the dominant conductance inherited from the parent epithelium. Other isoforms are expressed, but in most CRC cells, these do not produce major coupling, as GJB2 knockdown was sufficient to uncouple many CRC lines. These observations justify our choice of connexin for studying metabolic rescue functionally. These findings are also consistent with the good correlation between ensemble connectivity and GJB2 levels.

      Our data show a trend for GJB3 (Cx31) KD in DLD1 and SNU1235 cells, and GJA1 (Cx43) KD in LOVO cells, to produce an increase in coupling. However, when analysed by hierarchical (nested) analysis, these effects are not statistically significant, and for that reason we did not elaborate on these trends in the original submission. The apparent increase in conductivity in cells treated with GJA1 or GJB3 siRNA could reflect a compensatory response to the ablation of a specific message, closer contacts between cells allowing Cx26 to strengthen its connections, or a shift away from heterotypic channels involving Cx26 and Cx31/Cx43, towards homotypic Cx26. We did not see any consistent change in the intimacy of cell-cell contacts. We have now performed western blots for connexins to probe for compensatory changes (see Fig 2-supp1). In comparison to wild-type cells, expression of Cx31 was not changed by GJB2 (Cx26) or GJA1 (Cx43) knockdown in DLD1 cells. GJB2 KO in DLD1 cells did not induce expression of the other major isoform, Cx43. Also, in DLD1 cells, KD of GJB3 or GJA1 did not substantially change Cx26 levels. Similarly, KD of GJB3 did not affect Cx43 levels. In GJA1-high C10 cells, KD of GJB3 did not alter Cx43 levels, although GJB2 KD produced a small decrease in Cx43. Also in C10 cells, KD of GJB2 and GJA1 did not induce an increase in Cx31 levels.

      We agree that complex interactions between connexin genes are possible, but we feel that a molecular study of Cx gene regulation would fall outside the scope of the present manuscript. Our findings point to a prominent role of Cx26 in metabolic rescue, and to strengthen this point, we show that Cx26-negative cells that express other connexins (e.g. C10 cells or NCIH747 cells) cannot rescue ALDOA-deficient counterparts or NDUFS1-KO SW1222 cells (new data in Fig 6 and 7). We share the Reviewer’s enthusiasm about the interplay between connexins and will endeavour to study this further in the near future.

      Rather than just focus on acute transfer of dye between cells, the authors develop a system using 50/50 mixes of cells labelled with two junctionally permeant dyes and measured the degree of mixing at equilibrium (48 hours). This is presented as a "coupling coefficient", but how it is calculated, and its significance is not well described, and does not correlate with the historical use of this term in the literature. Nonetheless, the studies do seem to demonstrate a good degree of equilibration, although it would have been informative to determine if the cells that do not exchange dyes express Connexins. To document that this equilibration requires gap junctions, the authors employ low density cultures, which significantly decrease dye exchange. However, in at least one cell line (SW1222) dye exchange is only reduced by <50%, indicating a very high background to this assay. This is not addressed.

      Thank you for these comments. We agree that our description of the method was inadequate, and we have added the necessary information in Appendix 1. We have also added information about actual confluency and restructured the figure. We also added new data for RKO cells and DLD1-Cx26 KO cells, i.e. two negative controls (Fig 3H). We pondered about the best name for describing the numerical output of the method, and concluded that “coupling coefficient” is reasonable (provided we improve our description of it) because it is dimensionless, and like many coefficients has a finite range (here, zero to one). With further explanation, we hope this terminology is acceptable. The issue with SW1222 cells is that both low- and high-seeding densities produce clusters of cells. Even though overall cell numbers were different in high and low seeded cultures, actual connectivity within “islands” of cells remained high, hence their similar coupling coefficients (see Fig 3E). Indeed, this CRC line is unusual in this behaviour, so we only present data from the higher density.

      The most compelling part of the study is the use of reporters to directly demonstrate a role of Cx26 coupling of cells to rescue cells with mutations of the three genes mentioned above when mixed with normal neighbours. This case was most convincing in the cases of ALDOA and NDUFS1, with the data for the pH regulation requiring more explanation for full understanding of the data shown (e.g. Figs 7 G and H).

      Thank you for this comment. Studies of pHi regulation provide a unique opportunity to obtain single-cell resolution (unlike e.g. glycolytic assays). We took advantage of this, and therefore the figure on pHi presents a greater depth of analysis. Nonetheless, we agree the pH data need further explanation. We have expanded the text, and also added a bar plot of data on day 7, which now provides a clearer illustration of the rescue effect. This form of presentation was also adopted for ALDOA and NDUFS1 experiments in the subsequent figures.

      Overall, the study does a credible job of demonstrating that Cx26 coupling of CRC cells serves to rescue cells with mutations in critically necessary metabolic pathways, presumably due to transfer of metabolites from surrounding wt cells. However, some of the results indicate this is not a simple process where all connexins behave similarly, and some effort should be made to investigate if Cx31 and 43, which do not seem to play the same roles in maintaining cell coupling as Cx26, also play any role in such metabolic rescue.

      Thank you for this comment. We have addressed this by selecting three additional cell lines for study: RKO – a cell line with no major Cx expression; C10 – a cell line that expresses Cx43, but very low levels of Cx26; NCIH747 – a cell line that expresses Cx31, but low levels of Cx26. These additional experiments cover lines that are GJB2 (Cx26)-low/negative to test whether metabolic rescue is best achieved with Cx26. Our new data show that these cells are unable to rescue metabolic defects (new data provided in Fig 6H/I, Fig 7H, and Fig 6-supp2). These findings strengthen our case for a major role of Cx26, at least in CRC networks. Indeed, recent analyses by Robert Gatenby and colleagues have shown that mutations in GJB2 (Cx26) are exceedingly rare in cancer (a property not shown for other connexin genes). This is interpreted to mean that Cx26 plays a particularly prominent role, ostensibly for metabolic rescue.

      REVIEWER #3 (PUBLIC REVIEW):

      Strengths of the study include that it appears to be a careful and well thought out set of experiments. The analysis and treatment of multiplexed data is also sophisticated. For the most part, the work is clearly and logically described, as well as well illustrated. In general, the authors achieved their experimental goals, and the methods while not entirely new, do provide new twists and augmentations that should be useful to the field. A general weakness is that this is not entirely a new story. Instead, it is a variant of one of the oldest concepts in the field of gap junction biology i.e. "Metabolic cooperation". The term "Metabolic cooperation" (i.e., as mediated by gap junctions) was not mentioned by the authors, but it is a long-established and foundational concept in the field. Indeed, in a classic paper by Gilula and colleagues published in 1972, the experimental approach used was similar to that of the study in hand. These earlier authors showed how transformed cell lines with deficiencies in hypoxanthine metabolism can be "rescued" by "metabolic cooperation" in co-culture with metabolically competent cells via passing a gap junctional permeant molecule. This and other relevant papers were not cited. More importantly, the extant literature places the onus on the authors to explain and convince reviewers why this study is more than an incremental step.

      We apologise for not quoting these important and classical references. We have now added these works to our reference list (quoted in Introduction). At the time of these seminal discoveries, Loewenstein and colleagues made a case that connexins are absent in cancer, and this belief persisted for many decades. More recently, the role of gap junctions in cancers has garnered attention. With new gene manipulations (e.g. CRISPR/Cas9) and imaging techniques and improved xenografting, it is now possible to precisely study the impact of GJ on cancer metabolism. Moreover, we have a wide panel of cancer cell lines to study, and identify the prominent role of Cx26. We highlight that our study is the first to offer a mechanistic explanation for the absence of negative selection in cancer, a phenomenon which was not known in the 1970s. To strengthen our novelty, we now add in vivo data to Fig 8 that confirm in vitro findings.

    1. Author Response

      Reviewer #1 (Public Review):

      1. “The major weakness of the study is that with the interpretation of the results. The changes in tractography, behavior and TBM are what would be expected following lesions of the neostriatum”

      We appreciate this comment and would like to offer clarification. We respectfully disagree that the pattern of results presented in this manuscript is akin to what would be expected following striatal lesions. In NHPs, striatal lesions typically cause more extreme phenotypes than what we observed in our 85Q-treated animals. In macaques, bilateral putamen lesions can result in phenotypes that include seizures, inappetence, hyper-aggression, and other severe features. This strongly impacts clinical scores and can make it unfeasible to care for the animals for multiple years. For these reasons, recent NHP HD lesion models have used only unilateral putamen lesions coupled with bilateral caudate lesions to model HD (as in the recent paper by Lavisse et al, 2019). Of additional relevance is that even the cognitive effects of these striatal lesions are more severe than what we observed in our 85Q-treated animals: for example, Lavisse reported reduced performance on similar “prefrontal” cognitive tasks by ~50%, whereas our AAV-HTT model exhibited only ~10% reductions in working memory. This mild, but significant, change in cognitive performance and motor function seen in our 85Q animals is much more akin to that which is observed in the early stages of HD.

      2. “The results have been interpreted as showing a progressive model, although evidence that there is progression is limited”...“begs the question as to whether or not the 85Q-lesioned monkeys would recover to a level similar to the 10Q animals if left for another 12 months”

      At the request of Reviewer 1, we added an additional 30-month timepoint and re-ran all of the analyses to include these new data. All of the behavioral and neuroimaging data were re-analyzed with this final timepoint included (see Lines 125-141, 146-163, 173-194, 228-255, 270-294, 314-345). Additionally, due to the unidirectional nature of our hypothesis and on the advice of our bio-statistician, we applied one-tailed tests to the planned comparisons in this revision. To address the Reviewer’s point directly: 85Q-treated animals showed minimal evidence of functional recovery between the 20- and 30-month timepoints on the behavior tasks. In particular, working memory deficits measured with SDR and fine motor skills measured with Lifesaver Retrieval did not improve between 20 and 30 months (Figure 1C and 1F). Additionally, neurological rating scores in group 85Q remained consistently elevated (in the 5-7 range) between the 20- and 30-month timepoints. Taken together, we feel confident that these results do not show evidence of any significant functional recovery, out to 2.5 years (30 months). In terms of the longitudinal trajectories of the behavioral measures, we appreciate the Reviewer’s feedback regarding the use of the term ‘progressive’ and have tempered our language appropriately. We removed all instances of the word progressive/progressed except in the context of the motor rating scores, which show a significant Group x Timepoint interaction and demonstrate a clear progression.

      3. “The whole manuscript is written as though this is a genetically-relevant progressive model of HD. But the animals are normal, and so there is no genetic context relevant to HD”

      We thank Reviewer 1 for this comment. We recognize that viral-based animal models of HD, including the model characterized here, are not as genetically similar to the human condition compared to some of the other modeling approaches currently under investigation (ex. knock-in and gene editing). Limitations of the AAV-based HTT85Q model include: 1) vector packaging restrictions that prohibit expression of full-length HTT, 2) the use of a CAG promoter vs. an endogenous promoter that leads to overexpression of the transgene, 3) the use of cDNA versus genomic DNA excludes introns and therefore lacks the ability to produce alternatively spliced variants (ex, Exon 1), 4) the use of a mixed CAG-CAA repeat may preclude the possibility of somatic instability and 5) expression of HTT that is restricted to specific brain regions and cell types. All of these important limitations have been added to the discussion section in this re-submission (Lines 503-517).

      Despite these limitations, we feel that this AAV2:AAV2.retro-HTT85Q based model has some features that make it genetically-relevant to human HD including: 1) the expression of an N-terminal fragment of human HTT (N171), 2) the N-terminal fragment bears a pathological PolyQ expansion (85Q), 3) the expressed mHTT fragment forms neuronal aggregates that can be detected in the nucleus, 4) mHTT fragments are expressed in many of the same brain regions where aggregates are detected in human HD cases, with both regional and sub-regional specificity (ex. higher expression in anterior vs posterior cortical regions and expression primarily limited to deep cortical layers V/VI) and 5) expression of mHTT fragments in these regions leads to many of the same pathological and behavioral changes observed in HD patients. Importantly, expression of the N-terminal portion of HTT allows for the evaluation of HTT lowering therapeutics that target the first 3 exons (ASOs, miRNAs, zinc finger repressors, CRISPR-based therapies, etc), which cannot be evaluated in lesion-based models.

      4. “The authors state in the Abstract that the injection resulted in "robust expression of mutant huntingtin in the caudate and putamen". These data are not in the manuscript.”

      Evidence of mHTT expression in the caudate and putamen, as well as several other brain regions, via immunohistochemical and immunofluorescent staining is now included in the manuscript. Please see additions to the methods, results and discussion sections regarding these findings, as well as a new Figure 5 (see Lines 347-376, 756-788). Additionally, further details regarding an associated PET imaging study in this same cohort of animals using a mHTT aggregate-binding radioligand have been added to the discussion (see Lines 437-443). Please also see response #13 (below).

      5. “The authors chose to use a fragment of the HD gene, with a very long repeat that is seen only in juvenile patients”

      Comments regarding the need to use a fragment of the HTT gene, versus the full-length gene, due to packaging constraints of the viral vector, were added to the discussion in the context of limitations (Lines 503-517), and also discussed above in response #3. The choice to use a CAG repeat length of 85 (83 pure CAGs followed by a CAA/CAG cassette; see response #17 below for further details) was based on previous studies wherein similar CAG repeat lengths were used to create animal models of HD over the past several decades. Interestingly, while CAG repeat lengths in patients with adult-onset HD typically range from ~40-60, longer repeat lengths (>60) are typically required in animal models of HD to elicit pathological and behavioral manifestations of disease: transgenic, knock-in and viral vector-based rodent models (ranging from 72-150 CAGs), OVT73 transgenic sheep model (73 CAGs), transgenic and knock-in minipig models (ranging from 85-150 CAGs), transgenic and viral vector-based macaque models (ranging from 82-103 CAGs). See Ramaswamy et al, 2007 and Howland et al, 2021 for thorough reviews of these models.

      6. “For their cognitive testing, the authors used a task (delayed non-match to sample) that measures object recognition and familiarity. Before surgery, only 11/17 of the animals were successfully trained to complete this task. It is not clear how useful the data are when only 64% of the animals can be included.”

      We appreciate the Reviewer’s concerns and have decided to conservatively remove these data from the revised manuscript.

      7. “It is not clear how this monkey model will be useful for developing either disease biomarkers or therapeutic strategies for HD (as stated in the abstract)”. “The authors state that they hope the model will become a widely used resource. This seems an unlikely scenario, given the limitations of the current study and the challenges associated with using monkeys. They say that a major advantage of their technique is being able to generate large numbers of monkeys. But this is not a relevant argument if the usefulness of the model to investigate HD is not proven.”

      We thank the reviewer for requesting clarification on these important points. We believe that this model will be useful for developing therapeutic strategies because the HTT85Q-treated macaques express mutant HTT, along with HTT aggregates, in several key brain regions that are affected in human HD, along with undergoing regional gray matter atrophy and white matter microstructural alterations that correlate well with behavioral dysfunction. Studies currently under review elsewhere also show reduced dopamine neurotransmission and regional hypometabolism via PET imaging in this model. Together, or individually, these imaging and behavioral changes can serve as outcome measures when screening potential therapies. Possible therapeutic interventions that are amenable to screening in this model are included in the discussion.

      Regarding biomarker development, we have already engaged in PET imaging biomarker development in this model in collaboration with the CHDI foundation and the Molecular Imaging Center at the University of Antwerp, evaluating a candidate radioligand that binds to aggregated mHTT. See #13 below for a more detailed description of this PET study, including recent data showing its ability to bind to aggregated species of mHTT in several brain regions in this same cohort of HTT85Q macaques that correspond to 2B4 and em48 IHC staining (a manuscript describing these results has been prepared for submission and the PDF is included for the reviewers to peruse).

      The authors do envision this AAV-based macaque model becoming a resource for the HD research community. While this model does have certain limitations (now detailed in the Discussion), we respectfully assert that all of the HD animal models, both small and large, each have their own important limitations to consider when deciding on which to use to screen therapeutics. Selecting a specific animal model based on the individual scientific questions being asked will be required, and employing a combination of models may be an even more prudent strategy.

      While NHP research presents unique challenges (cost, housing requirements and recent challenges in availability, among them), we believe that viral vector-based NHP models could be more accessible to the HD research community compared to some of the other established large animal models, in that they may be able to be readily created at contract research organizations (CROs), in addition to various academic research institutions. There are now many CROs that exist in the US, and elsewhere around the world, that have developed specific expertise in MRI-guided, intracranial delivery of AAVs into the NHP brain (including the caudate and putamen), in the context of assessing therapeutic interventions for a variety of neurological disorders (HD, PD, and MSA, among others). Most of these same CROs also have expertise in NHP imaging (MRI/DTI) and behavioral assessments across multiple domains. It seems feasible that AAV-mediated HD macaques could be produced in sufficient numbers to appropriately power therapeutic studies, using the outcome measures established in the current study.

      Reviewer #2 (Public Review):

      1. “The major weaknesses are the manner in which the data is presented”

      We replotted all of the figures with improved color palettes and larger font sizes to make them easier to read. We also added additional details throughout the results section to aid in clarity and improve readability.

      2. “The authors would benefit from talking more about their model in the introduction and including references to some key points. For example, there has been critical new data in the field showing the importance of poly (CAG) in disease, not necessarily poly(Q), and the community will want to know (and not be required to look up), the nature of the transgene. Is it a pure CAG repeat? A mixed repeat? If it is pure, do they see or could they measure somatic expansion in the various brain regions impacted? How does that data match the phenotypes seen? Since this is a transgene, there is no possibility for the exon1/intron1 splicing variant to appear - how does this impact their interpretation”

      Further details regarding the transgene have been added to the Viral Vector Section of the Methods (Lines 531-550). The repeat is not pure and contains a single CAA interruption. The glutamine encoded repeat for HTT85Q contained 83 pure CAG repeats, followed by a single CAA/CAG cassette, while the glutamine encoded sequence for HTT10Q contained 8 pure CAG repeats followed by a single CAA/CAG cassette. Both constructs contained a proline stretch distal to the glutamine repeat in the following allelic conformation where QT represents the total glutamine length:

      HTT85Q: QT=85, (CAG)83(CAACAG)1(CCGCCA)1(CCG)7(CCT)2

      HTT10Q: QT=10, (CAG)8(CAACAG)1(CCGCCA)1(CCG)7(CCT)2
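      As a quick consistency check on the two repeat formulas above, the following minimal sketch expands each formula into its DNA sequence and counts the encoded glutamines and prolines. The helper names are hypothetical; the codon table is restricted to the five codons that appear in the formulas (CAG and CAA encode glutamine; CCG, CCA and CCT encode proline in the standard genetic code).

```python
import re

# Only the codons that occur in the repeat formulas (standard genetic code).
CODON_TO_AA = {"CAG": "Q", "CAA": "Q", "CCG": "P", "CCA": "P", "CCT": "P"}


def expand(formula: str) -> str:
    """Expand a formula such as '(CAG)83(CAACAG)1' into a DNA string."""
    units = re.findall(r"\(([ACGT]+)\)(\d+)", formula)
    return "".join(unit * int(count) for unit, count in units)


def translate(dna: str) -> str:
    """Translate in-frame codons using the restricted table above."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


CONSTRUCTS = {
    "HTT85Q": "(CAG)83(CAACAG)1(CCGCCA)1(CCG)7(CCT)2",
    "HTT10Q": "(CAG)8(CAACAG)1(CCGCCA)1(CCG)7(CCT)2",
}

for name, formula in CONSTRUCTS.items():
    protein = translate(expand(formula))
    # Glutamine total (QT) and length of the downstream proline stretch
    print(name, "Q =", protein.count("Q"), "P =", protein.count("P"))
```

      Counting this way gives the stated glutamine totals because the single CAA/CAG cassette contributes two glutamines on top of the pure CAG run (83 + 2 = 85 and 8 + 2 = 10).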

      There are plans to probe for somatic expansion in various brain regions, including the caudate and putamen, as well as several distal cortical regions. That analysis is ongoing and not in the scope of the present manuscript; however, these analyses are now mentioned in the discussion section (lines 540-560), as well as a discussion on the ability to either remove or duplicate the CAA/CAG cassette to potentially increase or decrease the rate of disease progression, respectively, based on the work of Ciosi et al. 2019. Additionally, Reviewer 2 is correct in that the lack of intronic sequences in the transgene precludes the formation of splicing variants, such as the exon1/intron1 variant, which we know is pathological based on the work of Bates et al. This drawback has been added to the discussion, along with other limitations of this viral vector-based model (Lines 503-517).

      3. “What about RAN translations? Is RAN translation noted at all in this over-expression model? How does that contribute (or not) to the progressive phenotype they see in their NHPs?”

      We are also curious regarding the assessment of toxic protein products from RAN translation of the expanded repeat sequence in this model. These studies are planned, and the results of these assays will be included in a future manuscript describing other ongoing post-mortem evaluations in this model.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript reports a new genetically encoded neuronal silencer BoNT-C. They show that it fully blocks neurotransmission in two classes of Drosophila motor neurons (Ib and Is; tonic and phasic, respectively). They also update a GCaMP postsynaptic reporter SynapGCaMP to express GCaMP8f instead of 6f. They selectively silence Ib or Is neurons to disambiguate the neurotransmission properties of each neuron. Finally, they show that silencing either Ib or Is neurons does not induce heterosynaptic structural or functional plasticity (only neuron ablation triggers plasticity). The data are convincing. The new silencing tool will be widely used.

      We thank this reviewer for his positive assessment of our study and for highlighting the utility of the new silencing tool presented in this study.

      Reviewer #2 (Public Review):

      The conclusions of this paper are properly supported by the provided data.

      Overall this work opens a new window to examine novel aspects of heterosynaptic structural and functional plasticity.

      We also thank this reviewer for his positive assessment of our study and for putting the importance of our findings in context.

      Reviewer #3 (Public Review):

      The strength of the manuscript by Han et al. is the comprehensive characterization of BoNT-C, showing that it truly abolishes all evoked and mini responses without structural alteration of the NMJ. Based on this, the authors then show that ablation of all neurotransmission in either Ib or Is does not cause any compensatory changes (neither functional nor structural) in the 'other' (i.e. looking at Is when silencing Ib or looking at Ib when silencing Is).

      The weakness of the manuscript lies in the modest gain over the previous work. Specifically, Aponte-Santiago had already shown that many parameters are not changed (in Ib when Is is perturbed, or in Is when Ib is perturbed), including that 'the Is terminal failed to show functional or structural changes following loss of the coinnervating Ib input' (quote from 2020 paper). Hence, the only major difference is that Han et al now show that Ib also does not really change when Is is silenced. Aponte-Santiago also clearly showed a ~50% EJP reduction when Is or Ib are perturbed alone, and adding these two equals wild type. The highly emphasized finding of Han et al. that (quote) ' composite values of Is and Ib neurotransmission can be fully recapitulated by isolated physiology from each input' quite obviously follows from the one key finding that one does not affect the other, as mentioned above in the strengths. The wording is a bit odd, but really adding Is (with Ib perturbed) and Ib (with Is perturbed) inputs is really not adding much over either the main finding nor the previous work.

      We thank this reviewer for his/her/their assessment of our study and for highlighting the strengths in characterizing the impact of BoNT-C expression at the NMJ. We also understand and appreciate the criticisms raised. It is important to note from the outset that the motivation and central goal of this study was not primarily to mechanistically dissect heterosynaptic plasticity between tonic and phasic motor inputs at the Drosophila NMJ. Rather, it was to develop an approach that would, for the first time, enable accurate isolation of complete neurotransmission from entire MN-Is or MN-Ib NMJs (both miniature and evoked transmission). By the reviewer’s own admission, we were entirely successful at achieving this central goal in our comprehensive characterization of BoNT-C.

      Next, the reviewer raises the valid question about whether this achievement is a significant advance over previous work, and discusses recent experimental findings regarding heterosynaptic plasticity at the fly NMJ. We want to emphasize here that having a tool that is capable, for the first time, of accurately discriminating complete transmission from Is vs Ib alone is a major advance, one that it is not clear the reviewer sufficiently appreciates. As summarized in Fig. 1, no previous attempts have been successful in accurately isolating synaptic transmission between Is vs Ib synapses. In particular, no previous approach was capable of isolating miniature activity from Is vs Ib, and as we show in our manuscript, miniature events exhibit major differences between the two inputs. Thus, without isolating miniature transmission, one cannot know baseline synaptic function in Is vs Ib nor whether heterosynaptic functional plasticity has been induced. Further, we detail major confounds with some of the previous approaches the reviewer alludes to in prior studies, including selective optogenetic stimulation.

      Finally, the reviewer discusses at length recent findings regarding heterosynaptic plasticity and questions whether the new insights revealed by BoNT-C provides a sufficient advance. In particular, the reviewer refers to previous work published in 2020 and 2021, where important initial insights into Is vs Ib structure and transmission after differential manipulations to either input was reported. The reviewer appears to believe that it was settled in these studies that no heterosynaptic functional plasticity was induced.

      However, a critical point that the reviewer appears not to appreciate is that while the two previous studies on heterosynaptic plasticity at the Drosophila NMJ were able to assess structural plasticity (Aponte-Santiago et al., 2020; Wang et al., 2021), no accurate or quantitative conclusions can be made about heterosynaptic functional plasticity from these studies. This is because the authors did not know what baseline synaptic function is at Is vs Ib (miniature frequency, miniature amplitude, and evoked transmission), so they cannot accurately determine whether any functional changes occur after their manipulations. Further complicating the interpretation of the previous studies is that at the muscle 1 NMJ (2020 study), like the muscle 4 NMJ (2021 study), ~30% of these NMJs fail to be innervated by an Is input in wild-type larvae. This major confound makes it difficult to know how or whether adaptive plasticity is induced in wild-type NMJs with or without Is innervation (since, interestingly, evoked transmission does not appear to change in wild-type m1 or m5 NMJs with or without an Is input), and then to determine whether any heterosynaptic plasticity is induced. Indeed, we have also struggled with how to accurately determine whether synaptic function changes compared to baseline throughout our studies at earlier stages, despite the fact that the muscle 6/7 NMJ we use in our study does not suffer from the variable Is innervation confounds observed at muscle 4 (Wang et al., 2021) and muscle 1 (Aponte-Santiago et al., 2020).

      Respectfully, we contend that the only way one can accurately and quantitatively determine baseline synaptic transmission (miniature amplitude, frequency, evoked, quantal content), and whether any changes are observed following manipulations to Is or Ib, is to fully and accurately recapitulate wild type (blended Is+Ib) neurotransmission from isolated Is vs Ib transmission. This is why we believe the data shown in Fig. 7 (and also Fig. S7 in the revised manuscript) is so important. It is true that numerous previous studies established relative and qualitative differences between Is vs Ib (miniature events are larger at Is relative to Ib, Is drives larger depolarization in response to single synaptic stimulation over Ib, etc). However, in no case did previous studies accurately assess baseline Is vs Ib synaptic function from entire inputs, and therefore could not conclude with certainty whether heterosynaptic functional plasticity was induced.

      On a different but somewhat similar topic, UAS-BoNT-C is not a new tool. I am a bit put off by the wording ' We have developed a botulinum neurotoxin, BoNT-C...'. More on this and the way the previous BoNT-C paper (Backhaus et al., 2016) is cited in the detail comments below in the recommendations for the authors.

      We understand these points raised by the reviewer. Our BoNT-C transgenic line is indeed a new tool, the only one in which synaptic transmission has ever been electrophysiologically characterized and shown to completely silence synaptic transmission in Drosophila. That being said, in retrospect, we can appreciate that the term “developed” might imply a level of innovation that reasonable people can disagree about. We have therefore elected to change the apparently offensive wording to “We have employed a botulinum neurotoxin, BoNT-C…” in the abstract of the revised manuscript.

      Additionally, the manuscript does not really dive into an analysis of phasic versus tonic functions (that's just a correlation with the Is and Ib dominant modes of function).

      We absolutely agree that selective silencing by BoNT-C now enables a rigorous study of tonic vs phasic neurotransmission at MN-Is vs MN-Ib NMJs, but that in the current manuscript we have not focused on this interesting question. We have adopted the convention the field has used to classify MN-Is and MN-Ib subtypes based on their apparent firing modes as “phasic” vs “tonic”, but like previous studies, we have not analyzed these functional distinctions on a deeper level. Although the focus of the current manuscript was to establish the properties of BoNT-C and highlight its utility as a tool for the field, we are now in the process of preparing an entirely new manuscript focused on just this reviewer’s question about the differences in tonic vs phasic synaptic physiology. This eight-figure manuscript will be entitled “Electrophysiological properties and nanoscale distinctions that define tonic vs phasic glutamatergic synapses” and is focused on the central question raised by the reviewer - how and why synaptic transmission differs between tonic vs phasic inputs. While this interesting question is outside the scope of the current manuscript, we will submit this new manuscript within the next few months, which is based on new experimental insights now enabled by selective BoNT-C silencing established in the current manuscript.

      Finally, since the authors show that loss of Is or Ib function does not cause any change in the other, we are left to wonder what actually DOES cause heterosynaptic plasticity. TNT or rpr DO cause some heterosynaptic plasticity and they also DO cause some structural changes - but whether the structural changes themselves are important here remains unclear. Substantial progress would have been to take the starting point that BoNT-C does not cause heterosynaptic plasticity, and then identify the signal that does (is it morphology? or signaling between Is and Ib? Or with the muscle?).

      We certainly agree with the reviewer that understanding how heterosynaptic plasticity is induced is an important question and worthy of additional investigation. As stated above, the focus of our current study was to establish the tool, BoNT-C, that will now enable a variety of fascinating and important future studies, both at understanding how and why synaptic strength differs between tonic vs phasic synapses and also how heterosynaptic plasticity signaling occurs at the NMJ. It required substantial time and experimental effort to establish that BoNT-C works to cleanly silence transmission without inducing structural and functional plasticity in the current manuscript (Figures 1-7 and several supplemental figures). Respectfully, we believe it is unreasonable to expect all of this data to be relegated to a “starting point” to then go on and probe heterosynaptic plasticity in more detail, all compressed into a single paper.

      It appears this reviewer is particularly interested in heterosynaptic plasticity, which we agree is a fascinating topic. First, we should clarify that in our experiments, TNT expression does NOT induce any heterosynaptic structural or functional plasticity (see Figures 6 and Table S2), at least in our studies at m6/7, m12/13, and m4 NMJs. Rather, TNT expression alters synaptic structure in the neuron in which it is expressed (“intrinsic structural plasticity”, Fig. 6), but does not induce any changes to the convergent input. Hence, the only evidence for actual heterosynaptic plasticity is the rather minor adaptations in synaptic structure and function observed following ablation of Is motor inputs (Fig. 6 and 8).

      In addition to the important insights revealed by BoNT-C in accurately distinguishing tonic vs phasic transmission outlined above, it appears that the reviewer does not fully appreciate the mechanistic constraints that the new BoNT-C tool reveals about heterosynaptic signaling. We would therefore like to highlight the key insights our study has revealed specifically about heterosynaptic plasticity. First, we show that at the muscle 6/7 NMJ, loss of MN-Ib completely eliminates Is innervation – this was not the finding reported in the 2020 study (Ib ablation was not reported in the 2021 study). Rather, Aponte-Santiago et al. 2020 reported that elimination of Ib did not trigger compensatory changes in active zone or bouton numbers of the Is input, nor were compensatory increases in the Is EPSP reported. This may be due to the confounding variable Is innervation at the muscle 1 and muscle 4 NMJs used in the previous studies. Second, to what extent miniature transmission changes after manipulating activity from Is vs Ib could not be accurately assessed in previous studies because spontaneous activity persists following TNT expression as does innervation following rpr.hid expression. Third, and perhaps most important, our study is the only one that can demonstrate no heterosynaptic functional plasticity is induced by the physical presence but functional silencing of neurotransmitter release between tonic vs phasic inputs at NMJs with consistent innervation by both Is and Ib inputs.

      It is clear to us now that we did not do a sufficient job of emphasizing these advances our study has now revealed about the baseline and heterosynaptic relationships between Is vs Ib. We have added additional details throughout the revised manuscript to ensure these insights are highlighted in an effort for the reader to better appreciate the importance of this study.

      Overall, while an initial reading of the manuscript sounded rather exciting, a deeper analysis of the work in context of the literature of the last few years diminishes my enthusiasm for the novelty and progress provided.

      We have responded to the major criticisms raised by this reviewer above and hope that he/she/they can more fully appreciate the importance of the new tool we developed, the impact it will have on the field in opening new studies on tonic vs. phasic transmission, and establishing the rules of heterosynaptic plasticity between convergent tonic and phasic inputs on common targets.

    1. Author Response

      Reviewer #1 (Public Review):

      It should also be noted that their immunohistochemical studies of human fetal tissue for TBX5 and PTK7 are not convincing. There appears to be widespread staining of multiple cell types, suggesting either very broad expression of both genes or poor specificity of the primary antibodies.

      We appreciate the reviewer’s comment that the immunohistochemistry staining does not provide definitive evidence for the functional importance of TBX5 and PTK7 in PUV; however, these images do confirm that the proteins are ‘in the right place at the right time’ during normal human urinary tract development. We have updated the discussion on page 19, line 441-445 to emphasise this. To further support a putative role for these proteins in urinary tract development we have added additional images from a second human embryo at the same gestation which confirm these distinct patterns of staining (Figure 8 – figure supplement 1 on page 14, lines 313-317). Even if these proteins can also be detected in other tissues or cell types, this does not detract from this idea, as in other locations the proteins may have redundant or different roles.

      PUVs have not been described as a clinical manifestation of disease associated with mutations of either gene in humans.

      The reviewer is correct that rare variants affecting TBX5 and PTK7 have not previously been associated with PUV. They have however been associated with other developmental anomalies (as stated in the discussion on page 18, line 408-411 and page 19, line 434-437) confirming a clear role in embryonic development for both these genes.

      The fact that rare variant association testing did not identify an increased burden of rare, likely deleterious variants in these two genes (although with limited power in this cohort) suggests that PUV is not driven by ultra-rare, highly penetrant alleles in these genes. However, the identification of common and low-frequency variants using GWAS suggests a complex mode of inheritance for PUV likely in combination with maternal/in utero factors. As with other complex traits, these signals provide potential insights into the underlying biology of this disease as opposed to the diagnostic implications of conventional monogenic gene discovery associated with purely Mendelian conditions. A paragraph on the Mendelian/complex trait implications of the findings of the study has been incorporated into the discussion (page 21-22, line 594-502).

      Discuss how variants in either gene or in the patterns of structural variants that they found associated with PUV intersect with sex to result in this exclusively male condition.

      The fact that PUV is a uniquely male disease is most likely the result of differences in urethra and bladder development and length differences in urethra between males and females. Sex hormones may also potentially result in tissue-specific differences in gene expression (Ober, Loisel, and Gilad 2008). We have added a paragraph into the discussion to clarify this (page 20, line 454-463) as well as clarified the results of the chromosome X and sex-specific analyses (page 7, lines 149-155; see also Reviewer 2, point 5 below) as suggested. 

      Reviewer #2 (Public Review):

      Major:

      1. The replication study is problematic given that different genotyping methods are used for cases (targeted KASP) versus controls (WGS). This may introduce differential bias. Moreover, the ancestry of the control cohort (UK-based) does not seem to be well matched to the cases (predominantly German and Polish), and the lack of genome-wide data for the cases precludes proper adjustment for population stratification. The case-control design is also imbalanced in the replication study. The authors should reconsider their replication strategy to include a more balanced cohort with ancestry-matched controls and uniform genotyping. As an alternative, genome-wide genotyping of the replication case cohort would significantly enhance the study and should be considered.

      Many thanks to the reviewer for their valuable comments regarding the replication study case-control cohort. While different sequencing technologies were used to compare allele counts at the lead variants in the replication study (KASP genotyping for cases vs WGS for controls), both techniques exhibit > 99.5% accuracy and are subjected to variant level quality control metrics. Only individuals with reliably called genotypes were included in the replication analysis. This has been clarified in the methods section (page 30, line 693).

      We were able to obtain genome-wide genotyping data for 204 of the 395 European cases in the replication cohort. While (despite sustained effort on our part) we were unable to analyze this data jointly with the control cohort in the 100KGP due to enforced limitations on data sharing, we were able to demonstrate similar ancestry of the replication study cases and controls: we performed PCA on a set of ~80,000 overlapping autosomal, high-quality, LD-pruned variants with MAF > 10% and projected the cases and controls separately onto (the same) data from the 1000 Genomes Project (Phase 3) labelled by ‘population’ (Figure 5). This clearly demonstrates that both cohorts have homogeneous European ancestry, as stated now in the results (page 8, lines 166-168).

      We note with thanks the reviewer’s comments regarding the case-control imbalance in the replication study which can sometimes result in a type 1 error. To address this, the case control ratio was reduced from 1:27 to 1:10.5 by including only the 4,151 male controls from the cancer cohort of the 100KGP. The results remained significant for both lead variants and have been updated in the manuscript (page 8, line 162-176; Table 2).

      When the number of controls was reduced to 500 males (a case:control ratio of 1:1.3), rs10774740 (TBX5 locus) remained significant, demonstrating that case-control imbalance was not driving the observed signal (P = 9.9 × 10⁻³; OR 0.77; 95% CI 0.63-0.94). rs144171242 (PTK7 locus), however, did not reach significance due to insufficient power (P = 0.06; OR 2.24; 95% CI 0.93-5.36). For a rare variant such as rs144171242 (MAF ~ 1%), a replication study with 500 controls is only powered to detect association with large effect size (OR > 3.5). A case:control ratio of ~1:10 is therefore optimal to maximize power to detect association, while minimizing unnecessary noise from excess controls. This has been added to the results section of the manuscript (page 8-9, lines 178-184).
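
      As a rough illustration of how the figures above relate to one another, the sketch below recomputes the case:control ratios from the stated counts and derives the standard error, z statistic and two-sided P value implied by each reported odds ratio and 95% CI. It assumes a Wald interval on the log-odds scale, which may not be the test actually used for the replication analysis, so the implied P values should only be expected to land near the reported ones. The function names are illustrative.

```python
import math


def controls_per_case(n_controls, n_cases):
    """Number of controls per case, i.e. the x in a 1:x case:control ratio."""
    return n_controls / n_cases


def wald_p_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Approximate SE, z and two-sided P implied by an OR and its 95% CI.

    Assumption: the interval is a Wald interval, symmetric on the log scale.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided tail probability of a standard normal deviate.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p


# Counts and estimates quoted in the response above.
print(round(controls_per_case(4151, 395), 1))  # male cancer-cohort controls
print(round(controls_per_case(500, 395), 1))   # reduced control set

for label, (or_, lo, hi) in {
    "rs10774740 (TBX5)": (0.77, 0.63, 0.94),
    "rs144171242 (PTK7)": (2.24, 0.93, 5.36),
}.items():
    se, z, p = wald_p_from_ci(or_, lo, hi)
    print(f"{label}: SE={se:.3f}, z={z:.2f}, implied P={p:.3g}")
```

      A wide interval around the PTK7 estimate translates directly into a large standard error, which is the sense in which the reduced control set is underpowered for a variant with MAF ~ 1%.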

      2. I am reassured that the TBX5 signal remains genome-wide significant in European-only analysis. However, the signal at PTK7 appears much less robust - it has borderline statistical significance (especially given that the authors test for all rare and common variants across the genome) and is represented by a single variant with a relatively rare risk allele that is differentially distributed by ancestry. Therefore, I would like to see more information for this specific signal:

      Information on the depth of coverage and the quality of the top variant

      This has been incorporated into the manuscript for both lead variants (Page 7, lines 142-145). For rs144171242 at the PTK7 locus, the meanDP was 29.34 and the meanGQ was 75.59.

      Information if the top PTK7 variant remain genome-wide significant after application of genomic control. Of note, the calculation of genomic inflation is dependent on sample size - lambda of 1.05 may represent an underestimate given low power of the cohort, and this point deserves at least a comment. Some methods correcting lambda for sample size have been proposed, and the authors should consider applying these methods.

      We appreciate the reviewer’s comments that the value of lambda may be affected by sample size and have added a comment to this in the manuscript (Page 7, line 136-137). Despite extensive searching, we were unable to find a recent published example of how to correct lambda for sample size and would be grateful if the reviewer could suggest a reference for this.

      To answer the reviewer’s specific question, application of genomic control to the lead variant at PTK7 results in P = 4.37 × 10⁻⁸, which remains below the threshold for conventional genome-wide significance. However, while the genomic inflation factor provides a reasonable indication of possible confounding by population structure, there are recognized limitations to applying it as a corrective factor as it assumes that all variants are confounded, i.e., the same correction is applied irrespective of differences in population allele frequency, which can be insufficient for some variants and lead to a loss of power in others. Furthermore, in addition to sample size, lambda can vary with heritability and disease prevalence (Yang et al. 2011) and its use for correction can therefore be too conservative and reduce power to detect significant associations. In this manuscript we therefore chose to use the mixed model approach (as part of SAIGE – detailed in the methods on page 28, lines 647-648), which has largely superseded older methods such as genomic control, to robustly correct for both population structure and cryptic relatedness and minimize false positive associations (Shin and Lee 2015).
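
      For readers unfamiliar with the procedure, a minimal sketch of genomic control as it is conventionally applied to a single variant is given below: the 1-df chi-square statistic is divided by the inflation factor lambda and the P value is recomputed. This is not the code used in the study (which relied on SAIGE's mixed model); the unadjusted P value in the example is a hypothetical placeholder, and only lambda = 1.05 is taken from the reviewer's comment.

```python
import math
from statistics import NormalDist


def genomic_control_p(p_value, lam):
    """Return the P value after dividing a 1-df chi-square statistic by lambda.

    Convention assumed here: lambda below 1 is treated as 1, so the
    correction never makes a result more significant.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    lam = max(lam, 1.0)
    # A two-sided normal test and a 1-df chi-square test are equivalent:
    # chi2 = z**2, where z is the normal quantile at p/2.
    z = NormalDist().inv_cdf(p_value / 2.0)
    chi2_adjusted = (z * z) / lam
    # Survival function of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2_adjusted / 2.0))


hypothetical_unadjusted_p = 3e-8  # placeholder, not a value from the study
adjusted = genomic_control_p(hypothetical_unadjusted_p, lam=1.05)
print(f"adjusted P = {adjusted:.3g}; below 5e-8: {adjusted < 5e-8}")
```

      Because every variant is rescaled by the same factor, the sketch also makes the limitation described above concrete: the adjustment ignores how strongly any individual variant is actually confounded.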

      This locus requires more robust replication as discussed above. If more robust replication study is not possible, additional functional studies could provide more evidence in support of this locus.

      Please refer to point 1 regarding the revised and more robust evidence of replication. 

      3. There is no validation of sensitivity and specificity of SV detection by variant size or type (e.g. inversions, deletions, duplications). Also, since burden differences are not replicated independently, the authors should stress the exploratory nature of these analyses.

      We appreciate the reviewer’s comment that there is no independent validation of SV detection (e.g., by microarray or long-read sequencing) and this was reported as a limitation of our study in the discussion (page 22-23, line 520-524). However, one of the main strengths of this study is the use of clinical-grade WGS data where all samples have been sequenced on the same platform and undergone variant calling using the same bioinformatics pipeline. This essentially eliminates confounding due to differences in data generation and processing and the sensitivity and specificity of SV detection will therefore be the same for both cases and controls.

      We agree with the reviewer that the SV analyses have not yet been replicated independently and, as they suggest, have stressed the exploratory nature of the findings in the discussion (page 21, line 491-493).

      In the discussion (especially second paragraph, but also throughout), the authors overemphasize multi-ancestry nature of their study. The reality is that the included non-Europeans are very small in numbers (18 SAS cases, 11 AFR cases, and 14 admixed cases). I would suggest for the authors to specifically state these case counts and make it clear that expanded efforts to recruit non-Europeans are still needed given these very low numbers.

      We appreciate the reviewer’s comment about the overemphasis on the multi-ancestry nature of the study and the small absolute numbers of individuals included, however as a proportion of the cohort, a third of the cases are non-European: 14% are of South Asian ancestry, 8% are of African ancestry and 11% are admixed. This breakdown comprises a greater proportion of non-white European ancestry individuals than the UK as a whole (DOI: 10.5257/census/aggregate-2001-2), where the discovery cohort was based. This provides evidence that our study eliminates at least some of the Euro-centric bias present in existing genetic and genomic literature, at least as far as the UK population is concerned. Clearly, global studies fairly representing all populations would be needed to address this issue perfectly. The case counts were reported in Table 1 but we have now referenced the low absolute numbers and included the reviewer’s suggestion about expanding efforts to recruit non-European populations in the main text (page 22, line 518-520). We have also edited paragraph two of the discussion in response to the reviewer’s comments (page 17, line 387-398).   

      Supplemental figure 2 - provide case-control counts in each ancestral group (Y axis).

      These have been added to the figure legend of Figure 6 – supplemental figure 4 (previously Figure 5 - supplemental figure 2).

      Supplemental figure 3 is misleading since allelic frequencies in the cases are pooled and are not available individually for all depicted populations.

      Figure 5 - supplemental figure 3 has been removed and replaced by Figure 6 – supplemental figure 3 to show only the individual case, control and gnomAD AF by ancestry for AFR, SAS and EUR population groups instead of using the pooled allele frequencies.

      5. I did not see details of chr. X analysis. This is important given that the case group involves only Males and control group involves both Males and Females. Also, please explain how sex was used as a fixed effect (as stated in the methods) given that the case cohort is 100% male.

      We thank the reviewer for their insightful comments. Sex was used as a covariate (or fixed effect) to control for the anatomical differences in development of the urethra (and in utero hormonal changes) between the sexes in the control cohort (clarified in the methods, page 28, lines 651-653). Given the PheWAS findings (page 13, line 292-297) reveal an association between the lead variant near TBX5 and female genital prolapse and urinary incontinence, this suggests that while women do not develop PUV (due to differences in urethral development) they may manifest other lower urinary tract phenotypes. In theory, removing the female individuals from the control cohort should therefore strengthen the association as the signal would not be diluted by ‘affected’ women (i.e., those with potentially unknown lower urinary tract phenotypes). We tested this by performing a sex-specific male-only GWAS and found that the strength of association at both lead variants increased. The results of this have been added to the manuscript (page 7, line 149-155).

      The results of the chromosome X rare variant analysis are shown on the Manhattan plot (Figure 9), with no significant genes identified. We have added chromosome X to the mixed-ancestry and European GWAS as suggested (with no significant results) and the Manhattan and Q-Q plots have been updated in Figure 2 and Figure 6. The number of analyzed variants in each analysis has also been updated accordingly.

    1. Author Response

      Reviewer #2 (Public Review):

      Feeding behaviour in C. elegans has been extensively studied over decades. Several methods of measuring feeding exist, but none can directly measure both pumping and locomotion behaviour in freely-moving worm populations. The authors have developed a new imaging-based method for automated detection of pharyngeal pumping events in freely moving C. elegans populations, and can thus simultaneously measure pumping and locomotion behaviour in tens of worms, at a single-worm, single-pump resolution that is not possible with previous methods. This user-friendly method can be applied to several research directions, such as large-scale foraging, behavioural coordination, and high throughput screening.

      The authors designed their new method to be broadly applicable and user-friendly, for easy adaptation in other research labs. However, adding direct evidence to show that "the method is relatively insensitive to the optical instrument used" will better support this claim of wider application.

      We appreciate the reviewer’s suggestion to show evidence that our method will also work on data acquired on different microscopes. We now present data obtained on a second epi-fluorescent microscope, which was downscaled and analyzed in Fig. 1H-J.

      The authors carefully benchmarked their new method against expert annotations and existing results from previous methods, to both validate their method and reveal additional advantages. They also assessed potential pitfalls of the method such as by examining the effect of fluorescence imaging on the behavioural outcome, albeit only at the timescale of minutes. The effect of longer-term fluorescence imaging should be further explored, which is relevant for large-scale foraging experiments that the authors discussed. It could be helpful to determine the maximum total exposure for the method to still be valid, both in terms of pump detection (which could be sensitive to photobleaching) and behavioural modulation (which could be sensitive to higher phototoxicity).

      We thank the reviewer for this comment. In response to their comment and related comments by the other reviewers, we have provided bleaching curves and evidence of long-term imaging to show the potential of the methods for longer scale assays. We found that with our illumination intensity (see methods), bleaching was significant at a time scale of ~1h. We then added triggered illumination and could extend the recording time to ~5 h (Methods). Additionally, we perform a supplementary control for viability of worms exposed to continuous light (not triggered) for 5 hrs. We do not observe any apparent phototoxic effect.

      Overall, the manuscript is well-written and the results are clearly presented both in terms of statistics and interpretation. Methodological details are well-documented and openly accessible.

      We thank the reviewer for their positive view of our work and their appreciation for our efforts to document both data and software.

      Reviewer #3 (Public Review):

      In this manuscript, the authors present a method for simultaneous assessment of pharyngeal pumping (feeding) and locomotion in many C. elegans simultaneously. In this technique, imaging of the fluorescent labeled pharynx provides a measure of velocity and pumping rate, through analysis of the spatial variations in fluorescence.

      The technique is clearly described, well-validated, and yields some novel results. It has the advantage that it can be performed using microscopes found in many C. elegans laboratories.

      We appreciate that the reviewer recognizes the wide applicability of the method across many C. elegans laboratories.

      Some limitations of the method include its reliance on fluorescence imaging, which is a hindrance to genetic analysis, computational intensiveness, and phototoxic effects of fluorescence excitation that are not fully explored in the manuscript.

      The authors show the utility of their method by assessing pharyngeal pumping and motor behavior (1) during development, (2) in the presence or absence of food, and (3) in the presence of two mutations affecting feeding.

      Although I understand these are proof-of-principle demonstrations, I still came away feeling underwhelmed by these examples. I did not see any results here that could not have been obtained fairly easily with conventional techniques.

      We appreciate the constructive criticism of the reviewer and highlight in the revised version the fact that using conventional techniques such studies would require tens of hours of experiment time. We would like to emphasize the comparisons in Table 1 where we show other methods and their current limitations. Obtaining a dataset such as in Figure 3 which comprises a total of 34 worm-hours of pumping observation from unrestrained animals is to our knowledge currently impractical with competing methods. We would like to remind the reviewer that, using our method we were able to reveal bimodal distributions within a population as illustrated, for instance, in Fig. 3F, 4B, and 4F. These observations are not possible when the single worm resolution is not accessible or when large statistics are not feasible as it happens with previous methods.

      Given these limitations, I feel the method's eventual impact in the field will be relatively small.

      In this study, we present a method that allows behavioral studies to be performed on worm populations at high throughput and reduced cost. Such a technique opens the door to many laboratories that cannot do EPG recordings or microfluidics due to the technical difficulties, or that want to study animals in their normal plate context. We also would like to emphasize that there are already more than 1500 strains containing a myo-2 promoter transgene available on CGC, which would be amenable to our imaging approach. These transgenic strains form broad classes of interest, such as thermotolerance, ER stress resistance, aging and neural-circuit specific genes.

      Pharyngeal pumping has also been used as a read-out for pharmacological screens; for example, bacteria pre-loaded with pharmacological agents are tested for their effect on pharyngeal pumping rate. PharaGlow offers a high-throughput and sensitive method to measure the pumping rate. This will benefit researchers who use C. elegans pumping for pharmacological screens, and pave the way for those who plan to do so but are hindered by existing techniques.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors evaluate the involvement of the hippocampus in a fast-paced time-to-contact estimation task. They find that the hippocampus is sensitive to feedback received about accuracy on each trial and has activity that tracks behavioral improvement from trial to trial. Its activity is also related to a tendency for time estimation behavior to regress to the mean. This is a novel paradigm to explore hippocampal activity and the results are thus novel and important, but the framing as well as discussion about the meaning of the findings obscures the details of the results or stretches beyond them in many places, as detailed below.

      We thank the reviewer for their constructive feedback and were happy to read that s/he considered our approach and results as novel and important. The comments led us to conduct new fMRI analyses, to clarify various unclear phrasings regarding our methods, and to carefully assess our framing of the interpretation and scope of our results. Please find our responses to the individual points below.

      1) Some of the results appear in the posterior hippocampus and others in the anterior hippocampus. The authors do not motivate predictions for anterior vs. posterior hippocampus, and they do not discuss differences found between these areas in the Discussion. The hippocampus is treated as a unitary structure carrying out learning and updating in this task, but the distinct areas involved motivate a more nuanced picture that acknowledges that the same populations of cells may not be carrying out the various discussed functions.

      We thank the reviewer for pointing this out. We split the hippocampus into anterior and posterior sections because prior work suggested a different whole-brain connectivity and function of the two. This was mentioned in the methods section (page 15) in the initial submission but unfortunately not in the main text. Moreover, when discussing the results, we did indeed refer mostly to the hippocampus as a unitary structure for simplicity and readability, and because statements about subcomponents are true for the whole. However, we agree with the reviewer that the differences between anterior and posterior sections are very interesting, and that describing these effects in more detail might help to guide future work more precisely.

      In response to the reviewer's comment, we therefore clarified at various locations throughout the manuscript whether the respective results were observed in the posterior or anterior section of the hippocampus, and we extended our discussion to reflect the idea that different functions may be carried out by distinct populations of hippocampal cells. In addition, we also now motivate the split into the different sections better in the main text. We made the following changes.

      Page 3: “Second, we demonstrate that anterior hippocampal fMRI activity and functional connectivity tracks the behavioral feedback participants received in each trial, revealing a link between hippocampal processing and timing-task performance.”

      Page 3: “Fourth, we show that these updating signals in the posterior hippocampus were independent of the specific interval that was tested and activity in the anterior hippocampus reflected the magnitude of the behavioral regression effect in each trial.”

      Page 5: “We performed both whole-brain voxel-wise analyses as well as regions-of-interest (ROI) analysis for anterior and posterior hippocampus separately, for which prior work suggested functional differences with respect to their contributions to memory-guided behavior (Poppenk et al., 2013, Strange et al. 2014).”

      Page 9: “Because anterior and posterior sections of the hippocampus differ in whole-brain connectivity as well as in their contributions to memory-guided behavior (Strange et al. 2014), we analyzed the two sections separately.”

      Page 9: “We found that anterior hippocampal activity as well as functional connectivity reflected the feedback participants received during this task, and its activity followed the performance improvements in a temporal-context-dependent manner. Its activity reflected trial-wise behavioral biases towards the mean of the sampled intervals, and activity in the posterior hippocampus signaled sensorimotor updating independent of the specific intervals tested.”

      Page 10: “Intriguingly, the mechanisms at play may build on similar temporal coding principles as those discussed for motor timing (Yin & Troger, 2011; Eichenbaum, 2014; Howard, 2017; Palombo & Verfaellie, 2017; Nobre & van Ede, 2018; Paton & Buonomano, 2018; Bellmund et al., 2020, 2021; Shikano et al., 2021; Shimbo et al., 2021), with differential contributions of the anterior and posterior hippocampus. Note that our observation of distinct activity modulations in the anterior and posterior hippocampus suggests that the functions and coding principles discussed here may be mediated by at least partially distinct populations of hippocampal cells.”

      Page 11: Interestingly, we observed that functional connectivity of the anterior hippocampus scaled negatively (Fig. 2C) with feedback valence [...]

      2) Hippocampal activity is stronger for smaller errors, which makes the interpretation more complex than the authors acknowledge. If the hippocampus is updating sensorimotor representations, why would its activity be lower when more updating is needed?

      Indeed, we found that absolute (univariate) activity of the hippocampus scaled with feedback valence, the inverse of error (Fig. 2A). We see multiple possibilities for why this might be the case, and we discussed some of them in a dedicated discussion section (“The role of feedback in timed motor actions”). For example, prior work showed that hippocampal activity reflects behavioral feedback also in other tasks, which has been linked to learning (e.g. Schönberg et al., 2007; Cohen & Ranganath, 2007; Shohamy & Wagner, 2008; Foerde & Shohamy, 2011; Wimmer et al., 2012). In our understanding, sensorimotor updating is a form of ‘learning’ in an immediate and behaviorally adaptive manner, and we therefore consider our results well consistent with this earlier work. We agree with the reviewer that in principle activity should be stronger if there was stronger sensorimotor updating, but we acknowledge that this intuition builds on an assumption about the relationship between hippocampal neural activity and the BOLD signal, which is not entirely clear. For example, prior work revealed spatially informative negative BOLD responses in the hippocampus as a function of visual stimulation (e.g. Szinte & Knapen 2020), and the effects of inhibitory activity - a leading motif in the hippocampal circuitry - on fMRI data are not fully understood. This raises the possibility that the feedback modulation we observed might also involve negative BOLD responses, which would then translate to the observed negative correlation between feedback valence and the hippocampal fMRI signal, even if the magnitude of the underlying updating mechanism was positively correlated with error. This complicates the interpretation of the direction of the effect, which is why we chose to avoid making strong conclusions about it in our manuscript. Instead, we tried discussing our results in a way that was agnostic to the direction of the feedback modulation. 

      Importantly, hippocampal connectivity with other regions did scale positively with error (Fig. 2B), which we again discussed in the dedicated discussion section.

      In response to the reviewer’s comment, we revisited this section of our manuscript and felt the latter result deserved a better discussion. We therefore took this opportunity to extend our discussion of the connectivity results (including their relationship to the univariate-activity results as well as the direction of these effects), all while still avoiding strong conclusions about directionality. The following changes were made to the manuscript.

      Page 11: Interestingly, we observed that functional connectivity of the anterior hippocampus scaled negatively (Fig. 2C) with feedback valence, unlike its absolute activity, which scaled positively with feedback valence (Fig. 2A,B), suggesting that the two measures may be sensitive to related but distinct processes.

      Page 11: Such network-wide receptive-field re-scaling likely builds on a re-weighting of functional connections between neurons and regions, which may explain why anterior hippocampal connectivity correlated negatively with feedback valence in our data. Larger errors may have led to stronger re-scaling, which may be grounded in a corresponding change in functional connectivity.

      3) Some tests were one-tailed without justification, which reduces confidence in the robustness of the results.

      We thank the reviewer for pointing us to the fact that our choice of statistical tests was not always clear in the manuscript. In the analysis the reviewer is referring to, we predicted that stronger sensorimotor updating should lead to stronger activity as well as larger behavioral improvements across the respective trials. This is because a stronger update should translate to a more accurate “internal model” of the task and therefore to a better performance. We tested this one-sided hypothesis using the appropriate test statistic (contrasting trials in which behavioral performance did improve versus trials in which it did not improve), but we did not motivate our reasoning well enough in the manuscript. The revised manuscript therefore includes the two new statements shown below to motivate our choice of test statistic more clearly.

      Page 7: [...] we contrasted trials in which participants had improved versus the ones in which they had not improved or got worse (see methods for details). Because stronger sensorimotor updating should lead to larger performance improvements, we predicted to find stronger activity for improvements vs. no improvements in these tests (one-tailed hypothesis).

      Page 18: These two regressors reflect the tests for target-TTC-independent and target-TTC-specific updating, respectively. Because we predicted to find stronger activity for improvements vs. no improvements in behavioral performance, we here performed one-tailed statistical tests, consistent with the direction of this hypothesis. Improvement in performance was defined as receiving feedback of higher valence than in the corresponding previous trial.
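
      To make the improvement definition above concrete, the following minimal Python sketch labels each trial as an improvement or not by comparing its feedback valence with that of a reference trial. It is an illustration under stated assumptions, not the analysis code used in the study: the function name `label_improvements`, the numeric coding of valence (1 = low, 2 = medium, 3 = high), and the choice of reference trial (the immediately preceding trial for the TTC-independent case, the most recent trial with the same target TTC for the TTC-specific case) are all hypothetical.

```python
def label_improvements(valence, target_ttc=None):
    """Label trials as improved (True), not improved (False), or undefined (None).

    valence    : feedback valence per trial (assumed coding: 1=low, 2=medium, 3=high)
    target_ttc : optional target TTC per trial; if given, each trial is compared
                 with the most recent trial that had the same target TTC
                 (assumed reading of "TTC-specific"); otherwise with the
                 immediately preceding trial ("TTC-independent").
    """
    labels = []
    last_valence = {}  # most recent valence seen for each comparison key
    for i, v in enumerate(valence):
        key = target_ttc[i] if target_ttc is not None else "any"
        previous = last_valence.get(key)
        # The first trial for a key has no reference trial, so it stays undefined.
        labels.append(None if previous is None else v > previous)
        last_valence[key] = v
    return labels


if __name__ == "__main__":
    valence = [1, 2, 2, 3, 1]
    print(label_improvements(valence))
    print(label_improvements(valence, target_ttc=["a", "b", "a", "b", "a"]))
```

      Trials with equal or lower valence than the reference trial fall into the "no improvement" group here, matching the contrast of improved trials against trials that did not improve or got worse.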

      4) The introduction motivates the novelty of this study based on the idea that the hippocampus has traditionally been thought to be involved in memory at the scale of days and weeks. However, as is partially acknowledged later in the Discussion, there is an enormous literature on hippocampal involvement in memory at a much shorter timescale (on the order of seconds). The novelty of this study is not in the timescale as much as in the sensorimotor nature of the task.

      We thank the reviewer for this helpful suggestion. We agree that a key part of the novelty of this study is the use of the task that is typically used to study sensorimotor integration and timing rather than hippocampal processing, along with the new insights this task enabled about the role of the hippocampus in sensorimotor updating. As mentioned in the discussion, we also agree with the reviewer that there is prior literature linking hippocampal activity to mnemonic processing on short time scales. We therefore rephrased the corresponding section in the introduction to put more weight on the sensorimotor nature of our task instead of the time scales.

      Note that the new statement still includes the time scale of the effects, but it is no longer at the center of the argument. We chose to keep it because we do think that the majority of studies on hippocampal-dependent memory functions focus on longer time scales than our study does, and we expect that many readers will be surprised by the immediacy with which hippocampal activity relates to ongoing behavioral performance (on ultrashort time scales).

      We changed the introduction to the following.

      Page 2: Here, we approach this question with a new perspective by converging two parallel lines of research centered on sensorimotor timing and hippocampal-dependent cognitive mapping. Specifically, we test how the human hippocampus, an area often implicated in episodic-memory formation (Schiller et al., 2015; Eichenbaum, 2017), may support the flexible updating of sensorimotor representations in real time and in concert with other regions. Importantly, the hippocampus is not traditionally thought to support sensorimotor functions, and its contributions to memory formation are typically discussed for longer time scales (hours, days, weeks). Here, however, we characterize in detail the relationship between hippocampal activity and real-time behavioral performance in a fast-paced timing task, which is traditionally believed to be hippocampal-independent. We propose that the capacity of the hippocampus to encode statistical regularities of our environment (Doeller et al. 2005, Shapiro et al. 2017, Behrens et al., 2018; Momennejad, 2020; Whittington et al., 2020) situates it at the core of a brain-wide network balancing specificity vs. regularization in real time as the relevant behavior is performed.

      5) The authors used three different regressors for the three feedback levels, as opposed to a parametric regressor indexing the level of feedback. The predictions are parametric, so a parametric regressor would be a better match, and would allow for the use of all the medium-accuracy data.

      The reviewer raises a good point that overlaps with question 3 by reviewer 2. In the current analysis, we model the three feedback levels with three independent regressors (high, medium, low accuracy). We then contrast high vs. low accuracy feedback, obtaining the results shown in Fig. 2AB. The beta estimates obtained for medium-accuracy feedback are ignored in this contrast. Following the reviewer’s feedback, we therefore re-ran the model, this time modeling all three feedback levels in one parametric regressor. All other regressors in the model stayed the same. Instead of contrasting high vs. low accuracy feedback, we then performed voxel-wise t-tests on the beta estimates obtained for the parametric feedback regressor.

      The results we observed were highly consistent across the two analyses, and all conclusions presented in the initial manuscript remain unchanged. While the exact t-scores differ slightly, we replicated the effects for all clusters on the voxel-wise map (on whole-brain FWE-corrected levels) as well as for the regions-of-interest analysis for anterior and posterior hippocampus. These results are presented in a new Supplementary Figure 3C.

      Note that the new Supplementary Figure 3B shows another related new analysis we conducted in response to question 4 of reviewer 2. Here, we re-ran the initial analysis with three feedback regressors, but without modeling the inter-trial interval (ITI) and the inter-session interval (ISI, i.e. the breaks participants took) to avoid model over-specification. Again, we replicated the results for all clusters and the ROI analysis, showing that the initial results we presented are robust.

      The following additions were made to the manuscript.

      Page 5: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9 × 10⁻⁴, pfwe = 0.002, d = -0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d = -0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d = -0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d = -0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Page 17: Moreover, instead of modeling the three feedback levels with three independent regressors, we repeated the analysis modeling the three feedback levels as one parametric regressor with three levels. All other regressors remained unchanged, and the model included the regressors for ITIs and ISIs. We then conducted t-tests implemented in SPM12 using the beta estimates obtained for the parametric feedback regressor (Fig. 2C). Compared to the initial analyses presented above, this has the advantage that medium-accuracy feedback trials are considered for the statistics as well.
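
      The difference between three independent regressors and a single parametric regressor, and the group-level statistics reported for the resulting beta estimates, can be illustrated with a short Python sketch. This is a simplified illustration rather than the SPM12 pipeline used in the study: the function names are hypothetical, the coding of feedback levels as 1, 2 and 3 and their mean-centering are assumptions, and only the t statistic and Cohen's d are computed (p-values and confidence intervals are omitted because they require the t distribution).

```python
import math
import statistics


def parametric_modulator(levels):
    """Mean-center ordinal feedback levels (assumed coding: 1=low, 2=medium, 3=high).

    A single regressor built from these values uses all trials, including
    medium-accuracy ones, instead of contrasting only high vs. low feedback.
    """
    mean_level = statistics.fmean(levels)
    return [level - mean_level for level in levels]


def one_sample_t_and_d(betas):
    """One-sample t statistic against zero, Cohen's d, and degrees of freedom.

    betas : one beta estimate per participant for the parametric regressor.
    """
    n = len(betas)
    mean = statistics.fmean(betas)
    sd = statistics.stdev(betas)  # sample standard deviation (n - 1 denominator)
    t = mean / (sd / math.sqrt(n))
    d = mean / sd
    return t, d, n - 1


if __name__ == "__main__":
    print(parametric_modulator([1, 2, 3, 3, 1]))
    print(one_sample_t_and_d([1.0, 2.0, 3.0]))
```

      With this definition, Cohen's d equals the t statistic divided by the square root of the number of participants, which is a convenient consistency check when reading reported one-sample statistics.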

      6) The authors claim that the results support the idea that the hippocampus is finding an "optimal trade-off between specificity and regularization". This seems overly speculative given the results presented.

      We understand the reviewer's skepticism about this statement and agree that the manuscript does not show that the hippocampus is finding the trade-off between specificity and regularization. However, this is also not exactly what the manuscript claims. Instead, it suggests that the hippocampus “may contribute” to solving this trade-off (page 3) as part of a “brain-wide network” (pages 2, 3, 9, 12). We also state that “Our [...] results suggest that this trade-off [...] is governed by many regions, updating different types of task information in parallel” (Page 11). To us, these phrasings are not equivalent, because we do not think that the role of the hippocampus in sensorimotor updating (or in any process really) can be understood independently from the rest of the brain. We do however think that our results are in line with the idea that the hippocampus contributes to solving this trade-off, and that this is exciting and surprising given the sensorimotor nature of our task, the ultrashort time scale of the underlying process, and the relationship to behavioral performance. We tried to express that some of the points discussed remain speculation, but it seems that we were not always successful in doing so in the initial submission. We apologize for the misunderstanding, have adapted the corresponding statements in the manuscript, and now express even more carefully that these ideas are speculation.

      The following changes were made to the introduction and discussion.

      Page 2: Here, we approach this question with a new perspective by converging two parallel lines of research centered on sensorimotor timing and hippocampal-dependent cognitive mapping. Specifically, we test how the human hippocampus, an area often implicated in episodic-memory formation (Schiller et al., 2015; Eichenbaum, 2017), may support the flexible updating of sensorimotor representations in real time and in concert with other regions.

      Page 12: Because hippocampal activity (Julian & Doeller, 2020) and the regression effect (Jazayeri & Shadlen, 2010) were previously linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. This may explain why hippocampal activity reflected the magnitude of the regression effect as well as behavioral improvements independently from TTC, and why it reflected feedback, which informed the updating of the internal prior.

      Page 12: This is in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      Page 13: This is in line with the notion that the hippocampus [...] supports finding an optimal trade off between specificity and regularization along with other regions. [...] Our results show that the hippocampus supports rapid and feedback-dependent updating of sensorimotor representations, suggesting that it is a central component of a brain-wide network balancing task specificity vs. regularization for flexible behavior in humans.

      Note that in response to comment 1 by reviewer 2, the revised manuscript now reports the results of additional behavioral analyses that support the notion that participants find an optimal trade-off between specificity and regularization over time (independent of whether the hippocampus was involved or not).

      7) The authors find that hippocampal activity is related to behavioral improvement from the prior trial. This seems to be a simple learning effect (participants can learn plenty about this task from a prior trial that does not have the exact same timing as the current trial) but is interpreted as sensitivity to temporal context. The temporal context framing seems too far removed from the analyses performed.

      We agree with the reviewer that our observation that hippocampal activity reflects TTC-independent behavioral improvements across trials could have multiple explanations. Critically, i) one of them is that the hippocampus encodes temporal context, ii) it is only one of multiple observations that we build our interpretation on, and iii) our interpretation builds on multiple earlier reports.

      Interval estimates regress toward the mean of the sampled intervals, an effect that is often referred to as the “regression effect”. This effect, which we observed in our data too (Fig. 1B), has been proposed to reflect the encoding of temporal context (e.g. Jazayeri & Shadlen 2010). Moreover, there is a large body of literature on how the hippocampus may support the encoding of spatial and temporal context (e.g. see Bellmund, Polti & Doeller 2020 for review).
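
      The regression effect can be quantified as the slope of a line fitted to estimated versus true intervals: a slope of 1 corresponds to veridical estimates, and a slope of 0 to estimates that fully regress to the mean of the sampled intervals. The Python sketch below shows this with an ordinary least-squares fit. It is a minimal illustration, not the model used in the study; the function name `regression_slope` and the interval values in the example are hypothetical.

```python
import statistics


def regression_slope(true_ttc, estimated_ttc):
    """Ordinary least-squares slope of estimated TTC regressed on true TTC.

    slope near 1 -> estimates track the true intervals (little regression)
    slope near 0 -> estimates collapse onto the mean (full regression)
    """
    mean_x = statistics.fmean(true_ttc)
    mean_y = statistics.fmean(estimated_ttc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(true_ttc, estimated_ttc))
    sxx = sum((x - mean_x) ** 2 for x in true_ttc)
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical target intervals in seconds (not the values used in the study).
    true_ttc = [0.5, 1.0, 1.5, 2.0]
    grand_mean = statistics.fmean(true_ttc)

    veridical = list(true_ttc)
    fully_regressed = [grand_mean] * len(true_ttc)
    # Estimates pulled halfway from each true interval toward the grand mean.
    halfway = [grand_mean + 0.5 * (x - grand_mean) for x in true_ttc]

    for name, est in [("veridical", veridical),
                      ("fully regressed", fully_regressed),
                      ("halfway", halfway)]:
        print(name, regression_slope(true_ttc, est))
```

      In this toy example, estimates pulled halfway toward the grand mean give a slope of 0.5, the value lying between veridical performance and full regression.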

      Because both hippocampal activity and the regression effect were linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. If so, one would expect that hippocampal activity should reflect behavioral improvements independently from TTC, it should reflect the magnitude of the regression effect, and it should generally reflect feedback, because it is the feedback that informs the updating of the internal prior.

      All three observations may indeed have independent explanations, but they are all also in line with the idea that the hippocampus does encode temporal context and that this explains the relationship between hippocampal activity and the regression effect. It therefore reflects a parsimonious and reasonable explanation in our opinion, even though it necessarily remains an interpretation. Of course, we want to be clear on what our results are and what our interpretations are.

      In response to the reviewer’s comment, we therefore toned down two of the statements that mention temporal context in the manuscript, and we removed an overly speculative statement from the result section. In addition, the discussion now describes more clearly how our results are in line with this interpretation.

      Abstract: This is in line with the idea that the hippocampus supports the rapid encoding of temporal context even on short time scales in a behavior-dependent manner.

      Page 13: This is in line with the notion that the hippocampus encodes temporal context in a behavior-dependent manner, and that it supports finding an optimal trade off between specificity and regularization along with other regions.

      Page 12: Because hippocampal activity (Julian & Doeller, 2020) and the regression effect (Jazayeri & Shadlen, 2010) were previously linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. This may explain why hippocampal activity reflected the magnitude of the regression effect as well as behavioral improvements independently from TTC, and why it reflected feedback, which informed the updating of the internal prior.

      The following statement was removed, overlapping with comment 2 by Reviewer 3:

      Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time.

      8) I am not sure the term "extraction of statistical regularities" is appropriate. The term is typically used for more complex forms of statistical relationships.

      We agree with the reviewer that this expression may be interpreted differently by different readers and are grateful to be pointed to this fact. We therefore removed it and instead added the following (hopefully less ambiguous) statement to the manuscript.

      Page 9: This study investigated how the human brain flexibly updates sensorimotor representations in a feedback-dependent manner in the service of timing behavior.

      Reviewer #2 (Public Review):

      The authors conducted a study involving functional magnetic resonance imaging and a time-to-contact estimation paradigm to investigate the contribution of the human hippocampus (HPC) to sensorimotor timing, with a particular focus on the involvement of this structure in specific vs. generalized learning. Suggestive of the former, it was found that HPC activity reflected time interval-specific improvements in performance while in support of the latter, HPC activity was also found to signal improvements in performance, which were not specific to the individual time intervals tested. Based on these findings, the authors suggest that the human HPC plays a key role in the statistical learning of temporal information as required in sensorimotor behaviour.

      By considering two established functions of the HPC (i.e., temporal memory and generalization) in the context of a domain that is not typically associated with this structure (i.e., sensorimotor timing), this study is potentially important, offering novel insight into the involvement of the HPC in everyday behaviour. There is much to like about this submission: the manuscript is clearly written and well-crafted, the paradigm and analyses are well thought out and creative, the methodology is generally sound, and the reported findings push us to consider HPC function from a fresh perspective. A relative weakness of the paper is that it is not entirely clear to what extent the data, at least as currently reported, reflects the involvement of the HPC in specific and generalized learning. Since the authors' conclusions centre around this observation, clarifying this issue is, in my opinion, of primary importance.

      We thank the reviewer for these positive and extremely helpful comments, which we will address in detail below. In response to these comments, the revised manuscript clarifies why the observed performance improvements are not at odds with the idea that an optimal trade-off between specificity and regularization is found, and how the time course of learning relates to those reported in previous literature. In addition, we conducted two new fMRI analyses, ensuring that our conclusions remain unchanged even if feedback is modeled with one parametric regressor, and if the number of nuisance regressors is reduced to control for overparameterization of the model. Please find our responses underneath each individual point below.

      1) Throughout the manuscript, the authors discuss the trade-off between specific and generalized learning, and point towards Figure S1D as evidence for this (i.e., participants with higher TTC accuracy exhibited a weaker regression effect). What appears to be slightly at odds with this, however, is the observation that the deviation from true TTC decreased with time (Fig S1F) as the regression line slope approached 0.5 (Fig S1E) - one would have perhaps expected the opposite i.e., for deviation from true TTC to increase as generalization increases. To gain further insight into this, it would be helpful to see the deviation from true TTC plotted for each of the four TTC intervals separately and as a signed percentage of the target TTC interval (i.e., (+) or (-) deviation) rather than the absolute value.

      We thank the reviewer for raising this important question and for the opportunity to elaborate on the relationship between the TTC error and the magnitude of the regression effect in behavior. Indeed, we see that the regression slopes approach 0.5 and that the TTC error decreases over the course of the experiment. We do not think that these two observations are at odds with each other for the following reasons:

      First, while the reviewer is correct in pointing out that the deviation from the TTC should increase as “generalization increases”, that is not what we found. It was not the magnitude of the regularization per se that increased over time; rather, the overall task performance became more optimal in the face of both objectives: specificity and generalization. This optimum is at a regression-line slope of 0.5. Generalization (or regularization, as we refer to it in the present manuscript) therefore did not increase per se at the group level.

      Second, the regression slopes approached 0.5 on the group-level, but the individual participants approached this level from different directions: Some of them started with a slope value close to 1 (high accuracy), whereas others started with a slope value close to 0 (near full regression to the mean). Irrespective of which slope value they started with, over time, they got closer to 0.5 (Rebuttal Figure 1A). This can also be seen in the fact that the group-level standard deviation in regression slopes becomes smaller over the course of the experiment (Rebuttal Figure 1B, SFig 1G). It is therefore not generally the case that the regression effect becomes stronger over time, but that it becomes more optimal for longer-term behavioral performance, which is then also reflected in an overall decrease in TTC error. Please see our response to the reviewer’s second comment for more discussion on this.

      Third, the development of task performance is a function of two behavioral factors: a) the accuracy and b) the precision in TTC estimation. Accuracy describes how similar the participant’s TTC estimates were to the true TTC, whereas precision describes how similar the participant’s TTC estimates were relative to each other (across trials). Our results reflect the fact that participants became not only more accurate over time on average, but also more precise. To demonstrate this point visually, we have now plotted the precision and the accuracy for the 8 task segments below (Rebuttal Figure 1C, SFig 1H), showing that both measures increased as time progressed and more trials were performed. This was the case for all target durations.

      In response to the reviewer’s comment, we clarified in the main text that these findings are not at odds with each other. Furthermore, we made clear that regularization per se did not increase over time on group level. We added additional supporting figures to the supplementary material to make this point. Note that in our view, these new analyses and changes more directly address the overall question the reviewer raised than the figure that was suggested, which is why we prioritized those in the manuscript.

      However, we appreciated the suggestion a lot and added the corresponding figure for the sake of completeness.

      The following additions were made.

      Page 5: In support of this, participants' regression slopes converged over time towards the optimal value of 0.5, i.e. the slope value between veridical performance and the grand mean (Fig. S1F; linear mixed-effects model with task segment as a predictor and participants as the error term, F(1) = 8.172, p = 0.005, ε² = 0.08, CI: [0.01, 0.18]), and participants' slope values became more similar (Fig. S1G; linear regression with task segment as predictor, F(1) = 6.283, p = 0.046, ε² = 0.43, CI: [0, 1]). Consequently, this also led to an improvement in task performance over time on group level (i.e. task accuracy and precision increased (Fig. S1I), and the relationship between accuracy and precision became stronger (Fig. S1H); linear mixed-effects model results for accuracy: F(1) = 15.127, p = 1.3 × 10⁻⁴, ε² = 0.06, CI: [0.02, 0.11]; precision: F(1) = 20.189, p = 6.1 × 10⁻⁵, ε² = 0.32, CI: [0.13, 1]; accuracy-precision relationship: F(1) = 8.288, p = 0.036, ε² = 0.56, CI: [0, 1]; see methods for model details).

      Page 12: This suggests that different regions encode distinct task regularities in parallel to form optimal sensorimotor representations to balance specificity and regularization. This is in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      Page 15: We also corroborated this effect by measuring the dispersion of slope values between participants across task segments using a linear regression model with task segment as a predictor and the standard deviation of slope values across participants as the dependent variable (Fig. S1G). As a measure of behavioral performance, we computed two variables for each target-TTC level: sensorimotor timing accuracy, defined as the absolute difference in estimated and true TTC, and sensorimotor timing precision, defined as coefficient of variation (standard deviation of estimated TTCs divided by the average estimated TTC). To study the interaction between these two variables for each target TTC over time, we first normalized accuracy by the average estimated TTC in order to make both variables comparable. We then used a linear mixed-effects model with precision as the dependent variable, task segment and normalized accuracy as predictors and target TTC as the error term. In addition, we tested whether accuracy and precision increased over the course of the experiment using separate linear mixed-effects models with task segment as predictor and participants as the error term.
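
To make the two behavioral measures concrete, the sketch below computes them for a single target TTC. It is an illustration only, not the analysis code used in the study: the function names and example numbers are hypothetical, per-trial absolute errors are assumed to be averaged, and the sample standard deviation is assumed for the coefficient of variation.

```python
from statistics import mean, stdev


def timing_accuracy(estimates, true_ttc):
    """Mean absolute difference between estimated and true TTC.

    Averaging over trials is an assumption; lower values mean smaller errors.
    """
    return mean(abs(e - true_ttc) for e in estimates)


def timing_precision(estimates):
    """Coefficient of variation: SD of the estimates divided by their mean.

    The sample SD (n - 1 denominator) is an assumption.
    """
    return stdev(estimates) / mean(estimates)


def normalized_accuracy(estimates, true_ttc):
    """Accuracy divided by the average estimated TTC, so that it is on a
    scale comparable to the coefficient of variation."""
    return timing_accuracy(estimates, true_ttc) / mean(estimates)


# Hypothetical estimates (in seconds) for one target TTC.
estimates = [0.50, 0.60, 0.55, 0.65, 0.70]
true_ttc = 0.55

print(timing_accuracy(estimates, true_ttc))
print(timing_precision(estimates))
print(normalized_accuracy(estimates, true_ttc))
```

Both measures are error-like quantities, so smaller values correspond to better accuracy and better precision.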

      2) Generalization relies on prior experience and can be relatively slow to develop as is the case with statistical learning. In Jazayeri and Shadlen (2010), for instance, learning a prior distribution of 11-time intervals demarcated by two briefly flashed cues (compared to 4 intervals associated with 24 possible movement trajectories in the current study) required ~500 trials. I find it somewhat surprising, therefore, that the regression line slope was already relatively close to 0.5 in the very first segment of the task. To what extent did the participants have exposure to the task and the target intervals prior to entering the scanner?

      We thank the reviewer for raising the important question about the time course of learning in our task and how our results relate to prior work on this issue. Addressing the specific reviewer question first, participants practiced the task for 2-3 minutes prior to scanning. During the practice, they were not specifically instructed to perform the task as well as they could nor to encode the intervals, but rather to familiarize themselves with the general experimental setup and to ask potential questions outside the MRI machine. While they might have indeed started encoding the prior distribution of intervals during the practice already, we have no way of knowing, and we expect the contribution of this practice on the time course of learning during scanning to be negligible (for the reasons outlined above).

      However, in addition to the specific question the reviewer asked, we feel that the comment raises two more general points: 1) How long does it take to learn the prior distribution of a set of intervals as a function of the number of intervals tested, and 2) Why are the learning slopes we report quite shallow already in the beginning of the scan?

      Regarding (1), we are not aware of published reports that answer this question directly, and we expect that this will depend on the task that is used. Regarding the comparison to Jazayeri & Shadlen (2010), we believe the learning time course is difficult to compare between our study and theirs. As the reviewer mentioned, our study featured only 4 intervals compared to 11 in their work, based on which we would expect much faster learning in our task than in theirs. We did indeed sample 24 movement directions, but these were irrelevant in terms of learning the interval distribution. Moreover, unlike Jazayeri & Shadlen (2010), our task featured moving stimuli, which may have added additional sensory, motor and proprioceptive information in our study which the participants of the prior study could not rely on.

      Regarding (2), and overlapping with the reviewer’s previous comment, the average learning slope in our study is indeed close to 0.5 already in the first task segment, but we would like to highlight that this is a group-level measure. The learning slopes of some subjects were closer to 1 (i.e. the diagonal in Fig 1B), and the one of others was closer to 0 (i.e. the mean) in the beginning of the experiment. The median slope was close to 0.65. Importantly, the slopes of most participants still approached 0.5 in the course of the experiment, and so did even the group-level slope the reviewer is referring to. This also means that participants’ slopes became more similar in the course of the experiment, and they approached 0.5, which we think reflects the optimal trade-off between regressing towards the mean and regressing towards the diagonal (in the data shown in Fig. 1B). This convergence onto the optimal trade-off value can be seen in many measures, including the mean slope (Rebuttal Figure 1A, SFig 1F), the standard deviation in slopes (Rebuttal Figure 1B, SFig 1G) as well as the Precision vs. Accuracy tradeoff (Rebuttal Figure 1C, SFig 1H). We therefore think that our results are well in line with prior literature, even though a direct comparison remains difficult due to differences in the task.
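
The meaning of the slope values discussed here can be illustrated with a plain least-squares fit of estimated TTC on target TTC. This is a simplified sketch: the reported analysis used linear mixed-effects models, and the target durations and responses below are hypothetical.

```python
def regression_slope(target_ttc, estimated_ttc):
    """Ordinary least-squares slope of estimated TTC regressed on target TTC."""
    n = len(target_ttc)
    mean_x = sum(target_ttc) / n
    mean_y = sum(estimated_ttc) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(target_ttc, estimated_ttc))
    sxx = sum((x - mean_x) ** 2 for x in target_ttc)
    return sxy / sxx


# Hypothetical target durations in seconds (not the values used in the study).
targets = [0.5, 0.75, 1.0, 1.25]
grand_mean = sum(targets) / len(targets)

veridical = list(targets)                       # responses on the diagonal
regressed = [grand_mean] * len(targets)         # always report the mean
halfway = [grand_mean + 0.5 * (t - grand_mean)  # midway between the two
           for t in targets]

print(regression_slope(targets, veridical))  # slope of 1: the diagonal
print(regression_slope(targets, regressed))  # slope of 0: the mean line
print(regression_slope(targets, halfway))    # slope of 0.5: the trade-off
```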

      In response to the reviewer’s comment, and related to their first comment, we made the following addition to the discussion section.

      Page 12: This suggests that different regions encode distinct task regularities in parallel to form optimal sensorimotor representations to balance specificity and regularization. This is well in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      3) I am curious to know whether differences between high-accuracy and medium-accuracy feedback as well as between medium-accuracy and low-accuracy feedback predicted hippocampal activity in the first GLM analysis (middle page 5). Currently, the authors only present the findings for the contrast between high-accuracy and low-accuracy feedback. Examining all feedback levels may provide additional insight into the nature of hippocampal involvement and is perhaps more consistent with the subsequent GLM analysis (bottom page 6) in which, according to my understanding, all improvements across subsequent trials were considered (i.e., from low-accuracy to medium-accuracy; medium-accuracy to high-accuracy; as well as low-accuracy to high-accuracy).

      We thank the reviewer for this thoughtful question, which relates to question 5 by reviewer 1. The reviewer is correct that the contrast shown in Fig 2 does not consider the medium-accuracy feedback levels, and that the model in itself is slightly different from the one used in the subsequent analysis presented in Fig. 3. To reply to this comment as well as to a related one by reviewer 1 together, we therefore repeated the full analysis while modeling the three feedback levels in one parametric regressor, which includes the medium-accuracy feedback trials, and is consistent with the analysis shown in Fig. 3. The results of this new analysis are presented in the new Supplementary Fig. 3B.

      In short, the model included one parametric regressor with three levels reflecting the three types of feedback, and all nuisance regressors remained unchanged. Instead of contrasting high vs. low accuracy feedback, we then performed voxel-wise t-tests on the beta estimates obtained for the parametric feedback regressor. We found that our results presented initially were very robust: Both the observed clusters in the voxel-wise analysis (on whole-brain FWE-corrected levels) as well as the ROI results replicated across the two analyses, and our conclusions therefore remain unchanged.

      We made multiple textual additions to the manuscript to include this new analysis, and we present the results of the analysis including a direct comparison to our initial results in the new Supplementary Fig. 3. The following textual additions were made.

      Page 5: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9x10-4, pfwe = 0.002, d=-0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d=-0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d=-0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d=-0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Page 17: Moreover, instead of modeling the three feedback levels with three independent regressors, we repeated the analysis modeling the three feedback levels as one parametric regressor with three levels. All other regressors remained unchanged, and the model included the regressors for ITIs and ISIs. We then conducted t-tests implemented in SPM12 using the beta estimates obtained for the parametric feedback regressor (Fig. S2C). Compared to the initial analyses presented above, this has the advantage that medium-accuracy feedback trials are considered for the statistics as well.

      4) The authors modeled the inter-trial intervals and periods of rest in their univariate GLMs. This approach of modelling all 'down time' can lead to model over-specification and inaccurate parameter estimation (e.g. Pernet, 2014). A comment on this approach as well as consideration of not modelling the inter-trial intervals would be useful.

      This is an important issue that we did not address in our initial manuscript. We are aware and agree with the reviewer’s general concern about model over-specification, which can be a big problem in regression as it leads to biased estimates. We did examine whether our model was overspecified before running it, but we did not report a formal test of it in the manuscript. We are grateful to be given the opportunity to do so now.

      In response to the reviewer’s comment, we repeated the full analysis shown in Fig. 2 while excluding the nuisance regressors for inter-trial intervals (ITI) and breaks (or inter-session intervals, ISI). All other regressors and analysis steps stayed unchanged relative to the one reported in Fig. 2. The new results are presented in a new Supplementary Figure 3B.

      Like for our previous analysis, we again see that the results we initially presented were extremely robust even on whole-brain FWE corrected levels, as well as on ROI level. Our conclusions therefore remain unchanged, and the results we presented initially are not affected by potential model overspecification. In addition to the new Supplementary Figure 3B, we made multiple textual changes to the manuscript to describe this new analysis and its implications. Note that we used the same nuisance regressors in all other GLM analyses too, meaning that it is also very unlikely that model overspecification affects any of the other results presented. We thank the reviewer for suggesting this analysis, and we feel including it in the manuscript has further strengthened the points we initially made.

      The following additions were made to the manuscript.

      Page 16: The GLM included three boxcar regressors modeling the feedback levels, one for ITIs, one for button presses and one for periods of rest (inter-session interval, ISI) [...]

      Page 16: ITIs and ISIs were modeled to reduce task-unrelated noise, but to ensure that this did not lead to over-specification of the above-described GLM, we repeated the full analysis without modeling the two. All other regressors including the main feedback regressors of interest remained unchanged, and we repeated both the voxel-wise and ROI-wise statistical tests as described above (Fig. S2B).

      Page 17: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9x10-4, pfwe = 0.002, d=-0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d=-0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d=-0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d=-0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Reviewer #3 (Public Review):

      This paper reports the results of an interesting fMRI study examining the neural correlates of time estimation with an elegant design and a sensorimotor timing task. Results show that hippocampal activity and connectivity are modulated by performance on the task as well as the valence of the feedback provided. This study addresses a very important question in the field which relates to the function of the hippocampus in sensorimotor timing. However, a lack of clarity in the description of the MRI results (and associated methods) currently prevents the evaluation of the results and the interpretations made by the authors. Specifically, the model testing for timing-specific/timing-independent effects is questionable and needs to be clarified. In the current form, several conclusions appear to not be fully supported by the data.

      We thank the reviewer for pointing us to many methodological points that needed clarification. We apologize for the confusion about our methods, which we clarify in the revised manuscript. Please find our responses to the individual points below.

      Major points

      Some methodological points lack clarity which makes it difficult to evaluate the results and the interpretation of the data.

      We really appreciate the many constructive comments below. We feel that clarifying these points improved our manuscript immensely.

      1) It is unclear how the 3 levels of accuracy and feedback (high, medium, and low performance) were computed. Please provide the performance range used for this classification. Was this adjusted to the participants' performance?

      The formula that describes how the response window was computed for the different speed levels was reported in the methods section of the original manuscript on page 13. It reads as follows:

      “The following formula was used to scale the response window width: d ± ((k ∗ d)/2) where d is the target TTC and k is a constant proportional to 0.3 and 0.6 for high and medium accuracy, respectively.“
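
As a worked illustration of this formula, the sketch below computes the window d ± (k·d)/2 and assigns a feedback level to a single estimate. It rests on assumptions: k is taken to equal 0.3 and 0.6 for high and medium accuracy, window boundaries are treated as inclusive, any response outside the medium window is labeled low accuracy, and the function names and example values are hypothetical.

```python
def response_window(d, k):
    """Return (lower, upper) bounds of the response window d ± (k * d) / 2."""
    half_width = (k * d) / 2
    return d - half_width, d + half_width


def feedback_level(estimate, d, k_high=0.3, k_medium=0.6):
    """Classify an estimated TTC relative to the target TTC d.

    Assumes k = 0.3 (high) and k = 0.6 (medium) and inclusive boundaries.
    """
    low, high = response_window(d, k_high)
    if low <= estimate <= high:
        return "high"
    low, high = response_window(d, k_medium)
    if low <= estimate <= high:
        return "medium"
    return "low"


# Hypothetical target TTC of 1.0 s.
print(response_window(1.0, 0.3))  # high-accuracy window, roughly 0.85 to 1.15 s
print(response_window(1.0, 0.6))  # medium-accuracy window, roughly 0.70 to 1.30 s
print(feedback_level(1.1, 1.0))
print(feedback_level(1.2, 1.0))
print(feedback_level(1.4, 1.0))
```

Because the half-width is proportional to d, the windows widen for longer target TTCs, which is what calibrates the feedback across durations.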

      In response to the reviewer’s comment, we now additionally report the exact ranges of the different response windows in a new Supplementary Table 1 and refer to it in the Methods section as follows.

      Page 10: To calibrate performance feedback across different TTC durations, the precise response window widths of each feedback level scaled with the speed of the fixation target (Table S1).

      2) The description of the MRI results lacks details. It is not always clear in the results section which models were used and whether parametric modulators were included or not in the model. This makes the results section difficult to follow. For example,

      a) Figure 2: According to the description in the text, it appears that panels A and B report the results of a model with 3 regressors, ie one for each accuracy/feedback level (high, medium, low) without parametric modulators included. However, the figure legend for panel B mentions a parametric modulator suggesting that feedback was modelled for each trial as a parametric modulator. The distinction between these 2 models must be clarified in the result section.

      We thank the reviewer very much for spotting this discrepancy. Indeed, Figure 2 shows the results obtained for a GLM in which we modeled the three feedback levels with separate regressors, not with one parametric regressor. Instead, the latter was the case for Figure 3. We apologize for the confusion and corrected the description in the figure caption, which now reads as follows. The description in the main text and the methods remain unchanged.

      Caption Fig. 2: We plot the beta estimates obtained for the contrast between high vs. low feedback.

      Moreover, note that in response to comment 5 by reviewer 1 and comment 3 by reviewer 2, the revised manuscript now additionally reports the results obtained for the parametric regressor in the new Supplementary Figure 3C. All conclusions remain unchanged.

      Additionally, it is unclear how Figure 2A supports the following statement: "Moreover, the voxel-wise analysis revealed similar feedback-related activity in the thalamus and the striatum (Fig. 2A), and in the hippocampus when the feedback of the current trial was modeled (Fig. S3)." This is confusing as Figure 2A reports an opposite pattern of results between the striatum/thalamus and the hippocampus. It appears that the statement highlighted above is supported by results from a model including current trial feedback as a parametric modulator (reported in Figure S3).

      We agree with the reviewer that our result description was confusing and changed it. It now reads as follows.

      Page 5: Moreover, the voxel-wise analysis revealed feedback-related activity also in the thalamus and the striatum (Fig. 2A) [...]

      Also, note that it is unclear from Figure 2A what is the direction of the contrast highlighting the hippocampal cluster (high vs. low according to the text but the figure shows negative values in the hippocampus and positive values in the thalamus). These discrepancies need to be addressed and the models used to support the statements made in the results sections need to be explicitly described.

      The description of the contrast is correct. Negative values indicate smaller errors and therefore better feedback, which is mentioned in the caption of Fig. 2 as follows:

      “Negative values indicate that smaller errors, and higher-accuracy feedback, led to stronger activity.”

      Note that the timing error determined the feedback, and that we predicted stronger updating and therefore stronger activity for larger errors (similar to a prediction error). We found the opposite. We mention the reasoning behind this analysis at various locations in the manuscript e.g. when talking about the connectivity analysis:

      “We reasoned that larger timing errors and therefore low-accuracy feedback would result in stronger updating compared to smaller timing errors and high-accuracy feedback”

      In response to the reviewer’s remark, we clarified this further by adding the following statement to the result section.

      Page 5: “Using a mass-univariate general linear model (GLM), we modeled the three feedback levels with one regressor each plus additional nuisance regressors (see methods for details). The three feedback levels (high, medium and low accuracy) corresponded to small, medium and large timing errors, respectively. We then contrasted the beta weights estimated for high-accuracy vs. low-accuracy feedback and examined the effects on group-level averaged across runs.”

      b) Connectivity analyses: It is also unclear here which model was used in the PPI analyses presented in Figure 2. As it appears that the seed region was extracted from a high vs. low contrast (without modulators), the PPI should be built using the same model. I assume this was the case as the authors mentioned "These co-fluctuations were stronger when participants performed poorly in the previous trial and therefore when they received low-accuracy feedback." if this refers to low vs. high contrast. Please clarify.

      Yes, the PPI model was built using the same model. We clarified this in the methods section by adding the following statement to the PPI description.

      Page 17: “The PPI model was built using the same model that revealed the main effects used to define the HPC sphere.”

      Yes, the reviewer is correct in thinking that the contrast shows the difference between low vs. high-accuracy feedback. We clarified this in the main text as well as in the caption of Fig. 2.

      Caption Fig 2: [...] We plot results of a psychophysiological interactions (PPI) analysis conducted using the hippocampal peak effects in (A) as a seed for low vs. high-accuracy feedback. [...]

      Page 17: The estimated beta weight corresponding to the interaction term was then tested against zero on the group-level using a t-test implemented in SPM12 (Fig. 2C). The contrast reflects the difference between low vs. high-accuracy feedback. This revealed brain areas whose activity was co-varying with the hippocampus seed ROI as a function of past-trial performance (n-1).

      c) It is unclear why the model testing TTC-specific / TTC-independent effects (results presented in Figure 3) used 2 parametric modulators (as opposed to building two separate models with a different modulator each). I wonder how the authors dealt with the orthogonalization between parametric modulators with such a model. In SPM, the orthogonalization of parametric modulators is based on the order of the modulators in the design matrix. In this case, parametric modulator #2 would be orthogonalized to the preceding modulator so that a contrast focusing on the parametric modulator #2 would highlight any modulation that is above and beyond that explained by modulator #1. In this case, modulation of brain activity that is TTC-specific would have to be above and beyond a modulation that is TTC-independent to be highlighted. I am unsure that this is what the authors wanted to test here (or whether this is how the MRI design was built). Importantly, this might bias the interpretation of their results as - by design - it is less likely to observe TTC-specific modulations in the hippocampus as there is significant TTC-independent modulation. In other words, switching the order of the modulators in the model (or building two separate models) might yield different results. This is an important point to address as this might challenge the TTC-specific/TTC-independent results described in the manuscript.

      We thank the reviewer for raising this important issue. When running the respective analysis, we made sure that the regressors were not collinear and we therefore did not expect substantial overlap in shared variance between them. However, we agree with the reviewer that orthogonalizing one regressor with respect to the other could still affect the results. To make sure that our expectations were indeed met, we therefore repeated the main analysis twice: 1) switching the order of the modulators and 2) turning orthogonalization off (which is possible in SPM12 unlike in previous versions). In all cases, our key results and conclusions remained unchanged, including the central results of the hippocampus analyses.

      Anterior (ant.) / Posterior (post.) Hippocampus ROI analysis with A) original order of modulators, B) switching the order of the modulators and C) turning orthogonalization of modulators off. ABC) Orange color corresponds to the TTC-independent condition whereas light-blue color corresponds to the TTC-specific condition. Statistics reflect p<0.05 at Bonferroni-corrected levels (*) obtained using a group-level one-tailed one-sample t-test against zero; A) pfwe = 0.017, B) pfwe = 0.039, C) pfwe = 0.039.

      Because orthogonalization did not affect the conclusions, the new manuscript simply reports the analysis for which it was turned off. Note that these new figures are extremely similar to the original figures we presented, which can be seen in the exemplary figure below showing our key results at a liberal threshold for transparency. In addition, we clarified that orthogonalization was turned off in the methods section as follows.

      Page 18: These two regressors reflect the tests for target-TTC-independent and target-TTC-specific updating, respectively, and they were not orthogonalized to each other.
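
The order dependence raised by the reviewer can be illustrated with a small numerical sketch of serial orthogonalization, in which a later modulator has its projection onto an earlier one removed. This is a generic illustration with hypothetical modulator values; it is not SPM code and does not reproduce the design matrix used in the study.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def center(v):
    """Mean-center a sequence."""
    m = sum(v) / len(v)
    return [x - m for x in v]


def orthogonalize(v, ref):
    """Remove from v its projection onto ref (both assumed mean-centered)."""
    scale = dot(v, ref) / dot(ref, ref)
    return [x - scale * r for x, r in zip(v, ref)]


# Two hypothetical, partially correlated parametric modulators.
mod1 = center([1, 2, 3, 4, 5, 6])
mod2 = center([2, 1, 4, 3, 6, 5])

# Order A: mod1 first, so mod2 keeps only the variance not shared with mod1.
mod2_orth = orthogonalize(mod2, mod1)
# Order B: mod2 first, so mod1 keeps only the variance not shared with mod2.
mod1_orth = orthogonalize(mod1, mod2)

print(dot(mod2_orth, mod1))  # close to zero: shared variance removed from mod2
print(dot(mod1_orth, mod2))  # close to zero: shared variance removed from mod1
print(mod2_orth != mod2)     # the second modulator is altered by the procedure
```

Whichever modulator enters second loses the shared variance, so the two orders test different hypotheses unless the modulators are uncorrelated or orthogonalization is switched off.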

      Comparison of old & new results: also see Fig. 3 and Fig. S5 in manuscript

      d) It is also unclear how the behavioral improvement was coded/classified "we contrasted trials in which participants had improved versus the ones in which they had not improved or got worse"- It appears that improvement computation was based on the change of feedback valence (between high, medium and low). It is unclear why performance wasn't used instead? This would provide a finer-grained modulation?

      We thank the reviewer for the opportunity to clarify this important point. First, we chose to model feedback because it is the feedback that determines whether participants update their “internal model” or not. Without feedback, they would not know how well they performed, and we would not expect to find activity related to sensorimotor updating. Second, behavioral performance and received feedback are tightly correlated, because the former determines the latter. We therefore do not expect to see major differences in results obtained between the two. Third, we did in fact model both feedback and performance in two independent GLMs, even though the way the results were reported in the initial submission made it difficult to compare the two.

      Figure 4 shows the results obtained when modeling behavioral performance in the current trial as an F-contrast, and Supplementary Fig 4 shows the results when modeling the feedback received in the current trial as a t-contrast. While the voxel-wise t-maps/F-maps are also quite similar, we now additionally report the t-contrast for the behavioral-performance GLM in a new Supplementary Figure 4C. The t-maps obtained for these two different analyses are extremely similar, confirming that the direction of the effects as well as their interpretation remain independent of whether feedback or performance is modeled.

      The revised manuscript refers to the new Supplementary Figure 4C as follows.

      Page 17: In two independent GLMs, we analyzed the time courses of all voxels in the brain as a function of behavioral performance (i.e. TTC error) in each trial, and as a function of feedback received at the end of each trial. The models included one mean-centered parametric regressor per run, modeling either the TTC error or the three feedback levels in each trial, respectively. Note that the feedback itself was a function of TTC error in each trial [...] We estimated weights for all regressors and conducted a t-test against zero using SPM12 for our feedback and performance regressors of interest on the group level (Fig. S4A). [...]

      Page 17: In addition to the voxel-wise whole-brain analyses described above, we conducted independent ROI analyses for the anterior and posterior sections of the hippocampus (Fig. S2A). Here, we tested the beta estimates obtained in our first-level analysis for the feedback and performance regressors of interest (Fig. S4B; two-tailed one-sample t tests: anterior HPC, t(33) = -5.92, p = 1.2x10-6, pfwe = 2.4x10-6, d=-1.02, CI: [-1.45, -0.6]; posterior HPC, t(33) = -4.07, p = 2.7x10-4, pfwe = 5.4x10-4, d=-0.7, CI: [-1.09, -0.32]). See section "Regions of interest definition and analysis" for more details.

      If the feedback valence was used to classify trials as improved or not, how was this modelled (one regressor for improved, one for no improvement? As opposed to a parametric modulator with performance improvement?).

      We apologize for the lack of clarity regarding our regressor design. In response to this comment, we adapted the corresponding paragraph in the methods to express more clearly that improvement trials and no-improvement trials were modeled with two separate parametric regressors - in line with the reviewer’s understanding. The new paragraph reads as follows.

      Page 18: One regressor modeled the main effect of the trial and two parametric regressors modeled the following contrasts: Parametric regressor 1: trials in which behavioral performance improved vs. parametric regressor 2: trials in which behavioral performance did not improve or got worse relative to the previous trial.

      Last, it is also unclear how ITI was modelled as a regressor. Did the authors mean a parametric modulator here? Some clarification on the events modelled would also be helpful. What was the onset of a trial in the MRI design? The start of the trial? Then end? The onset of the prediction time?

      The inter-trial intervals (ITIs) were modeled as a boxcar regressor convolved with the hemodynamic response function. They describe the time between the feedback-phase offset and the subsequent trial onset. Moreover, the start of the trial was the moment when the visual-tracking target started moving after the ITI, whereas the trial end was the offset of the feedback phase (i.e. the moment in which the feedback disappeared from the screen). The onset of the “prediction time” was the moment in which the visual-tracking target stopped moving, prompting participants to estimate the time-to-contact. We now explain this more clearly in the methods as shown below.

      Page 16: The GLM included three boxcar regressors modeling the feedback levels, one for ITIs, one for button presses and one for periods of rest (inter-session interval, ISI), which were all convolved with the canonical hemodynamic response function of SPM12. The start of the trial was considered as the trial onsets for modeling (i.e. the time when the visual-tracking target started moving). The trial end was the offset of the feedback phase (i.e. the moment in which the feedback disappeared from the screen). The ITI was the time between the offset of the feedback-phase and the subsequent trial onset.
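
For readers less familiar with this type of regressor, the sketch below builds a boxcar for two hypothetical ITIs and convolves it with a double-gamma response function. It is a simplified illustration: the response-function parameters only approximate a canonical shape and are assumptions, the onsets, durations and TR are hypothetical, and sampling is done at TR resolution rather than at the finer time resolution that SPM12 uses internally.

```python
import math


def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=1 / 6):
    """Double-gamma response function evaluated at time t (seconds).

    The shape parameters and undershoot ratio are assumed values that
    approximate a canonical hemodynamic response.
    """
    if t <= 0:
        return 0.0
    response = t ** (peak_shape - 1) * math.exp(-t) / math.gamma(peak_shape)
    undershoot = (t ** (undershoot_shape - 1) * math.exp(-t)
                  / math.gamma(undershoot_shape))
    return response - ratio * undershoot


def boxcar(onsets, durations, n_scans, tr):
    """Return 1.0 for scans that fall inside any [onset, onset + duration)."""
    box = [0.0] * n_scans
    for onset, duration in zip(onsets, durations):
        for i in range(n_scans):
            if onset <= i * tr < onset + duration:
                box[i] = 1.0
    return box


def convolve(signal, kernel):
    """Causal discrete convolution, truncated to the length of the signal."""
    out = [0.0] * len(signal)
    for i in range(len(signal)):
        for j in range(min(i + 1, len(kernel))):
            out[i] += signal[i - j] * kernel[j]
    return out


tr = 1.0  # hypothetical repetition time in seconds
kernel = [double_gamma_hrf(i * tr) for i in range(32)]
iti_boxcar = boxcar(onsets=[5.0, 25.0], durations=[3.0, 3.0], n_scans=50, tr=tr)
iti_regressor = convolve(iti_boxcar, kernel)

print(iti_regressor.index(max(iti_regressor)))  # scan index of the peak
```

The convolved regressor peaks several seconds after the boxcar onset, reflecting the delay of the hemodynamic response.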

      On a related note, in response to question 4 by reviewer 2, we now repeated one of the main analyses (Fig. 2) without modeling the ITI (as well as the Inter-session interval, ISI). We found that our key results and conclusions are independent of whether or not these time points were modeled. These new results are presented in the new Supplementary Figure 3B.

      Page 16: ITIs and ISIs were modeled to reduce task-unrelated noise, but to ensure that this did not lead to over-specification of the above-described GLM, we repeated the full analysis without modeling the two. [...]

      1. Perhaps as a result of a lack of clarity in the result section and the MRI methods, it appears that some conclusions presented in the result section are not supported by the data. E.g. "Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time." The data show that hippocampal activity is higher during and after an accurate trial. This pattern of results could be attributed to various processes such as e.g. reward or learning etc. I would recommend not providing such interpretations in the result section and addressing these points in the discussion.

      Similarly, consider statements like "These results suggest that the hippocampus updates information that is independent of the target TTC". The data show that higher hippocampal activity is linked to greater improvement across trials independent of the timing of the trial. The point about updating is rather speculative and should be presented in the discussion instead of the results section.

      The reviewer is referring to two statements in the results section that reflect our interpretation rather than a description of the results. In response to the reviewer’s comment, we therefore removed the following statement from the results.

      Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time.

      In addition, we replaced the remaining statement with the following. We feel that this new statement makes clear why we conducted the analysis described, without offering an interpretation of the results presented before it.

      Page 8: We reasoned that updating TTC-independent information may support generalization performance by means of regularizing the encoded intervals based on the temporal context in which they were encoded.

    1. Author Response

      Reviewer #1 (Public Review):

      7) Can the primary cells in Figure 2E and AML#1 and AML#2 be studied for mTORC1 activity by Western, as in 2D?

      For reasons that we do not understand, we have been unable to effectively culture primary FLT3-ITD AMLs, despite being able to culture most other AMLs for weeks. This issue has prevented us from being able to perform biochemical analyses of FLT3-ITD AMLs in response to FLT3 inhibition.

      8) Additional genetic information should be provided if possible for the primary AML cells - what other mutations in addition to FLT3 were present? Were there any mTOR pathway alterations?

      We have listed the additional mutation found in the AML#1 sample (an NPM1 mutation) in the section METHODS-Therapeutic modeling in mice, as well as in the legends of Figures 2E and 3D. There were no evident alterations in the mTOR pathway (beyond the FLT3-ITD mutation).

    1. Author Response:

      We thank the reviewers for their thoughtful critiques and helpful suggestions for how to improve our manuscript. Described below is our response clarifying a number of issues raised by the reviewers.

      We agree with the reviewer that we cannot definitively conclude that the first-division chromosome segregation defects and the later CI-induced defects at the mid-blastula transition are the result of distinct mechanisms. In fact, we raise this possibility in the discussion. However, our finding that the CI phenotype includes temporally and developmentally deferred chromosome segregation defects in the late blastoderm divisions (in addition to the well-studied first-division defect) alters the established view of the CI phenotype and must be taken into account when considering mechanisms of CI. Our current view is that the distinct early and late defects could be caused by either 1) a common mechanism (possibly a chromosome mark/defect inherited through the early blastoderm divisions causing segregation defects in the late blastoderm divisions) or 2) distinct early and late mechanisms that do not strictly “depend” upon one another. We have clarified this point in the revised manuscript.

      We disagree with the reviewer that this result is to be expected given previous studies. In D. simulans, a small percentage of embryos derived from the CI cross hatch. These embryos are thought to have bypassed the first-division defect. It is not obvious why there must be late defects in these embryos that “escape” early CI-induced defects and subsequently hatch. Previous studies interpreted embryos that exhibit late division errors as those that have lost their entire paternal complement of chromosomes as a result of strong CI-induced defects during the first mitotic division and develop as maternal haploids. These studies, including those of transgene-induced CI, have focused primarily on embryos that exhibit defects in the first mitotic division. To the best of our knowledge, no group has thoroughly examined embryos that progress normally through the pre-cortical cycles 2-9, as we do in this manuscript. Thus, it was entirely unexpected that these embryos would exhibit mitotic defects during the late blastoderm divisions and the MBT. We discuss how this finding requires modifying current models for the mechanisms of CI.

      Regarding the comment that “the primary claim of the paper that later-stage embryos die for different reasons than early-stage embryos,” we make no such claim. In fact, we provide evidence that the failure to hatch (late embryonic lethality) is, at least in part, due to haploid development, a direct result of the first-division CI defect. The focus of our studies is those CI-derived embryos that progress normally, maintain the normal complement of chromosomes through the first division, and exhibit chromosome segregation errors during the late blastoderm divisions. We do not know the fate of these embryos, and previous studies have demonstrated that embryos suffering extensive late blastoderm segregation errors are able to hatch (Sullivan, 1990, Development 110:311-323). We have clarified these points in the discussion.

      While we agree that transgenic tools have proven invaluable in the study of CI, they are not appropriate for these studies. The purpose of our study was to undertake an unbiased re-examination of the CI phenotype. Of necessity, the transgenic studies rely on exogenous host promoters rather than the natural endogenous Wolbachia/Prophage promoters. Thus, while informative, the transgenic alleles are unlikely to capture all of the complexities and nuances of the CI phenotype. In addition, the transgenic studies of which we are aware have each interrogated only a single pair of CI-inducing genes, while the Wolbachia genome contains both Cid and Cin CI-associated gene pairs and possibly other yet-to-be-identified CI/Rescue genes.

      Our unbiased re-examination of the CI phenotype induced by W. riverside in D. simulans identified a previously unsuspected, temporally and developmentally distinct set of CI-induced defects that occur during and after the mid-blastula transition. This finding must be taken into account when considering the mechanisms that cause CI. In our revisions, we clarify the above points and qualify our statements to appropriately interpret our results in the context of the nuances and uncertainties of CI and early Drosophila embryogenesis.

    1. Author Response:

      Reviewer #3 (Public Review):

      The authors revealed a novel role for the DLL-4-Notch1-NICD signaling axis in platelet activation, aggregation, and thrombus formation. They first confirmed the expression of Notch1 and DLL-4 in human platelets and demonstrated that both Notch1 and DLL-4 were upregulated in response to thrombin stimulation. Further, they confirmed that exposure of human platelets to DLL-4 leads to γ-secretase-mediated release of NICD (a calpain substrate). Stimulating platelets with DLL-4 alone triggered platelet activation, as measured by integrin αIIbβ3 activation, P-selectin translocation, granule release, enhanced platelet-neutrophil and platelet-monocyte interactions, intracellular calcium mobilization, PEV release, phosphorylation of cytosolic proteins, and PI3K and PKC activation. In addition, Susheel N. Chaurasia et al. showed that when platelets were stimulated with DLL-4 and low-dose thrombin, Notch1 signaling can operate in a juxtacrine manner to potentiate low-dose thrombin-mediated platelet activation. When the DLL-4-Notch1-NICD signaling axis was blocked by γ-secretase inhibitors, platelet responses to stimulation were attenuated, and arterial thrombosis in mice was impaired.

      This study by Susheel N. Chaurasia et al. was carefully designed and used multiple approaches to test its hypothesis. The research raises the possibility of targeting the DLL-4-Notch1-NICD signaling axis for anti-platelet and anti-thrombotic therapies. Interestingly, compared to thrombin, a potent physiological platelet agonist, the signaling cascade triggered by DLL-4 was relatively weak. Given that the upregulation of DLL-4 and Notch1 happened in response to thrombin stimulation, how much DLL-4-mediated signaling could contribute to in vivo platelet activation in the presence of thrombin is questionable. This could potentially limit the application of targeting Notch1 as an anti-thrombotic therapy. Further, the authors showed that Notch1 signaling could operate in a juxtacrine manner to potentiate low-dose thrombin-mediated platelet activation, which means that DLL-4-mediated platelet signaling can act as an accelerator of early-stage hemostasis. Again, inhibition of Notch1 may slow down the hemostasis process. But given that other platelet agonists (ADP, collagen...) are present simultaneously, blocking Notch1 signaling may not have a strong anti-platelet effect.

      We concur with the Public Reviewer that further study is needed to delineate the extent to which DLL-4 signaling contributes in thrombin-activated platelets. However, it is now amply clear that Notch signaling plays a central role in the development of the thrombin-activated phenotype of platelets. Further, DLL-4-Notch1 interaction on the surfaces of adjacent platelets within the thrombus reinforces platelet-platelet aggregate formation. This is further reflected in the significant inhibition of thrombus formation in vivo in the presence of DAPT in a mouse model of intravital thrombosis. Given the considerable redundancy in platelet stimulation by different physiological agonists (ADP, collagen, thrombin, etc.), none of the present-day drugs is fully capable of effective platelet inhibition, owing to parallel signaling pathways. Thus, the discovery of Notch signaling and its seminal role in platelet activation could help explain the redundancy that limits anti-platelet drugs, and Notch inhibition could complement existing anti-platelet regimens in achieving effective and complete platelet inhibition.