  1. Nov 2025
    1. Author Response

      Reviewer #2 (Public Review):

      Zylbertal and Bianco propose a new model of trial-to-trial neuronal variability that incorporates the spatial distance between neurons. The 7-parameter model is attractive because of its simplicity: A neuron's activity is a function of stimulus drive, neighboring neurons, and global inhibition. A neuroscientist studying almost any brain area in any model organism could make use of this model, provided that they have access to 1) simultaneously-recorded neurons and 2) the spatial locations of those neurons. I could foresee this model being the de-facto model to compare to all future models, as it is easy to code up and interpret. The paper explores the effectiveness of this distance model by modeling neural activity in the zebrafish optic tectum. They find that this distance-based model can capture 1) bursting found in spontaneous activity, 2) ongoing co-fluctuations during stimulus-evoked activity, and 3) adaptation effects during prey-catching behavior.

      Strengths:

      The main strength of the paper is the interpretability of the distance-based model. This model is agnostic to the brain area from which the population of neurons is recorded, making the model broadly applicable to many neuroscientists. I would certainly use this model for any baseline comparisons of trial-to-trial variability.

      The model is assessed in three different contexts, including spontaneous activity and behavior. That the model provides some prediction in all three contexts is a strong indicator that this model will be useful in other contexts, including other model organisms. The model could reasonably be extended to other cognitive states (e.g., spatial attention) or accounting for other neuron properties (such as feature tuning, as mentioned in the manuscript).

      The analyses and intuition to show how the distance-based model explains adaptation were insightful and concise.

      We thank the reviewer for these supportive comments.

      Weaknesses:

      Model evaluation and comparison: The paper does not fully evaluate the model or its assumptions; here, I note details in which evaluation is needed. A key assumption of the model - that correlations fall off in a Gaussian manner (Fig. 1C-E) - is not supported by Fig. 1C, which appears to have an exponential fall-off. Functions other than Gaussian may provide better fits.

      A key feature of our model is that connection strengths smoothly decrease with distance. However, we did not intend to make strong claims about the exact function parametrizing this distance relationship. In light of the reviewer’s comment, we have additionally tested an exponential function and find that it too can describe activity correlations in OT with a negligible decrease in r^2 (Figure 1 – figure supplement 1A-C). The main purpose of the analysis was to show that the correlation is maximal around the seed and decays uniformly with distance from it (i.e. no sub-networks or cliques are detected). We have emphasized this in a revised conclusion paragraph and note that while multiple functions can be used to parametrize the relationship, they are all certainly simplifications. Secondly, we also ran a version of the network simulation where the connections decay in space according to an exponential rather than a Gaussian function and show that, as expected, tectal bursting is robust to this change.

      Furthermore, it is not clear whether the r^2s in Fig. 1E are computed in a held-out manner (more details about what goes into computing r^2 are needed).

      These values are computed by fitting the 2-d Gaussian (or exponential function) to all neurons excluding the seed itself (added a short clarification in the Methods).
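
      The comparison between a Gaussian and an exponential fall-off can be illustrated with a minimal sketch. It is not the authors' procedure: it uses a simplified one-dimensional (radial) fit rather than a 2-d fit, a log-linear least-squares shortcut, and synthetic data. The positions, the seed index, the amplitude and the length constant are all hypothetical values chosen for illustration. As in the response above, the seed neuron itself is excluded from the fit.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat for data y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_falloff(dist, corr, kind):
    """Log-linear least-squares fit of corr = A * exp(b * dist**p).

    p = 2 gives a Gaussian fall-off, p = 1 an exponential one.
    Requires strictly positive correlations (an assumption of this sketch).
    """
    power = 2 if kind == "gaussian" else 1
    slope, intercept = np.polyfit(dist ** power, np.log(corr), 1)
    pred = np.exp(intercept + slope * dist ** power)
    return pred, r_squared(corr, pred)

# Synthetic example: hypothetical neuron positions and an exponential fall-off.
rng = np.random.default_rng(0)
xy = rng.uniform(-100, 100, size=(500, 2))
seed_idx = 0
dist = np.linalg.norm(xy - xy[seed_idx], axis=1)
corr = 0.6 * np.exp(-dist / 40.0) * np.exp(rng.normal(0, 0.05, size=dist.size))

# Exclude the seed neuron itself before fitting.
mask = np.arange(dist.size) != seed_idx
_, r2_gauss = fit_falloff(dist[mask], corr[mask], "gaussian")
_, r2_exp = fit_falloff(dist[mask], corr[mask], "exponential")
print(f"r^2 Gaussian: {r2_gauss:.3f}, r^2 exponential: {r2_exp:.3f}")
```

      Because the synthetic data are generated from an exponential, the exponential fit wins here by construction; on real data the two values may be close, as the response reports.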

      Assessing the model based on peak location alone (Fig. 1E) is not sufficient, as other smooth monotonically-decreasing functions may perform similarly.

      As discussed above, an exponential function indeed performs similarly to a Gaussian. However, goodness of fit is secondary to the main aim of Fig 1E, which is to show that the correlation peak tends to fall near the seed cell.

      Simulating from the model greatly improves the reader's understanding (Fig. 2D), but no explanation is given for why the simulations (Fig. 2D) have almost no background spikes and much fewer, non-co-occurring bursts than those of real data (Fig. 2E).

      In part this is because the simulation results depicted in Fig 2D were derived from the ‘baseline model’, prior to optimizing to match biological bursting statistics. It is thus expected that activity will differ from experimental observation, and this difference was our main motivation for tuning the model parameters (now emphasized in the text). However, the model will certainly not account for all aspects of tectal activity; rather, it was designed to reproduce bursting as a prominent feature of ongoing activity and in the second part of the paper we explore the extent to which it can account for other phenomena. As noted above, in the revised abstract, introduction and discussion we have tried to clarify the motivation for developing the model and how it was used to gain insight into activity-dependent changes in network excitability.

      A key assumption of the distance model (Fig. 2A) is that each neuron has the same gaussian fall-off (i.e., sigma_excitation and sigma_inhibition), but it is unclear if the data support this assumption.

      We intentionally opted for a simple model (i.e. described by few parameters), in part due to the lack of connectivity data and additionally to set a lower bound on the extent to which multiple features of tectal activity could be accounted for. More complex models with additional degrees of freedom (such as cell-specific connectivity) may well describe the data better, but likely at the cost of interpretability. We consider such extensions are beyond the scope of the present study but might be fruitful avenues for future research.
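
      To make the shared fall-off assumption concrete, the sketch below builds a connectivity matrix in which every neuron uses the same excitatory and inhibitory Gaussian widths. The difference-of-Gaussians form, the function name `distance_weights` and all parameter values are assumptions made for illustration; the actual model equations are given in the manuscript's Methods and may differ.

```python
import numpy as np

def distance_weights(xy, gain_exc, sigma_exc, gain_inh, sigma_inh):
    """Hypothetical distance-dependent weights shared by all neurons.

    xy: (n_neurons, 2) array of positions.
    Returns an (n_neurons, n_neurons) matrix with excitation minus inhibition,
    each falling off as a Gaussian of pairwise distance.
    """
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    exc = gain_exc * np.exp(-d ** 2 / (2 * sigma_exc ** 2))
    inh = gain_inh * np.exp(-d ** 2 / (2 * sigma_inh ** 2))
    w = exc - inh
    np.fill_diagonal(w, 0.0)  # no self-connections in this sketch
    return w

rng = np.random.default_rng(1)
xy = rng.uniform(0, 200, size=(100, 2))  # hypothetical positions
w = distance_weights(xy, gain_exc=1.0, sigma_exc=20.0, gain_inh=0.2, sigma_inh=80.0)
```

      Cell-specific connectivity would replace the two scalar widths with per-neuron values, which is the kind of extension the response describes as adding degrees of freedom.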

      Although an excitatory and inhibitory gain is assumed (Fig. 2A), it is not clear from the data (Fig. 1C) that an inhibitory gain is needed (no negative correlations are observed in Fig. 1C-D).

      This is now explored in the revised Figure 3A which includes the condition of zero inhibition gain. See also response to reviewer 1.

      After optimization (Fig. 3), the model is evaluated on predicting burst properties but not evaluated on predicting held-out responses (R^2s or likelihoods), and no other model (e.g., fitting a GLM or a model with only an excitatory gain) is considered. In particular, one may consider a model in which "assemblies" do exist - does such an assembly model lead to better held-out prediction performance?

      The model we developed is a mechanistic, generative model. In contrast to Pillow et al 2008, we did not fit the model to data but rather we used it to simulate network activity and tuned the seven parameters (using EMOO) to best match biological observations. Thus, rather than assessing goodness-of-fit using cross-validation, our approach involved comparison of summary statistics related to the target emergent phenomenon (tectal bursting). This was necessary as bursting appears highly stochastic. Further to the comments above, we have expanded the parameter space to include instances with only an excitatory gain (where bursting failed) and no distance-dependence (again, bursting failed). Introducing assemblies into the model will inevitably support bursting (and introduce many more free parameters), but one of our key observations is that such assemblies are not required for this aspect of spontaneous activity. Again, our aim was not to produce a detailed picture of tectal connectivity, but rather to develop a minimal model and estimate the extent to which it can account for observed features of activity. Note that the second half of the paper (Figure 4 onwards) shows the model can explain phenomena that were not considered during parameter tuning.

      It is unclear why a genetic algorithm (Fig. 1A-C) is necessary versus a grid search; it appears that solutions in Generation 2 (Fig. 3C, leftmost plot, points close to the origin) are as good as solutions in Generation 30 and that the spreads of points across generations do not shrink (as one would expect from better mutations). Given the small number of parameters (7), a grid search is reasonable, computationally tractable, and easier to understand for all readers (Fig. 3A).

      Perhaps in hindsight a grid search would have worked, but at increased computational cost (each instantiation of the model is computationally expensive). At the time we chose EMOO, and since it produced satisfactory results, we kept it. As often happens with multi-objective optimization, an improvement in one objective usually happens at the expense of other objectives, so the spread of the points does not shrink much but they move closer to the axes (i.e. reduced error). The final parameter combination is closer to the origin than any point in generation 2, though admittedly not by much. Importantly, however, optimizing the model using the training features generalized to other burst-related statistics.
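
      The trade-off between objectives can be illustrated with a short sketch of Pareto dominance, the ordering that multi-objective optimizers typically rely on. This is a generic illustration, not the EMOO implementation used in the study; the function name `pareto_mask` and the error values are hypothetical.

```python
import numpy as np

def pareto_mask(errors):
    """Return a boolean mask of non-dominated rows.

    errors: (n_candidates, n_objectives) array, lower is better.
    A candidate is dominated if another candidate is at least as good on
    every objective and strictly better on at least one.
    """
    errors = np.asarray(errors, dtype=float)
    mask = np.ones(errors.shape[0], dtype=bool)
    for i in range(errors.shape[0]):
        no_worse = np.all(errors <= errors[i], axis=1)
        strictly_better = np.any(errors < errors[i], axis=1)
        if np.any(no_worse & strictly_better):
            mask[i] = False
    return mask

# Hypothetical errors on two objectives for four parameter combinations.
errors = np.array([[1.0, 4.0], [2.0, 2.0], [4.0, 1.0], [3.0, 3.0]])
front = pareto_mask(errors)
```

      The first three candidates trade one objective against the other and all stay on the front, while the fourth is beaten on both objectives by the second. This is why a population can keep its spread across generations even as individual errors shrink.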

      It is unclear why the excitatory and inhibitory gains of the temporal profiles (Fig. 3I) appear to be gaussian but are formulated as exponential (formula for I_ij^X in Methods).

      The interactions indeed have exponential decay in time. These might appear Gaussian because the axis scale is logarithmic.
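
      A short worked substitution shows why an exponential can look bell-like on a logarithmic axis. It assumes that the logarithmic axis is the time axis and uses a generic decay constant $\tau$ rather than the symbols of the Methods.

```latex
% Exponential kernel in time:
k(t) = e^{-t/\tau}
% Re-express on a logarithmic time axis, u = \log_{10} t, so t = 10^{u}:
k(u) = \exp\!\left(-\frac{10^{u}}{\tau}\right)
% Limits:
%   u \ll \log_{10}\tau : \quad k(u) \approx 1 - \frac{10^{u}}{\tau} \to 1
%   u = \log_{10}\tau   : \quad k(u) = e^{-1}
%   u \gg \log_{10}\tau : \quad k(u) \to 0 \text{ faster than any exponential in } u
```

      Plotted against $u$, the kernel is therefore flat at short times and drops steeply around $u = \log_{10}\tau$, giving a rounded shoulder that resembles one flank of a Gaussian rather than the familiar concave-up exponential curve.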

      Overall, comparing this model to other possible (similar) models and reporting held-out prediction performance will support the claim that the distance model is a good explanation for trial-to-trial variability.

      See comments above. A key point we want to stress is that we intentionally explored a minimal network model and found that, despite obvious simplifications of the biology, it was nonetheless able to explain multiple aspects of tectal physiology and behaviour. We hope that it inspires future studies and can be extended, in parallel to experimental findings, to more accurately represent the cell-type diversity and cell-specific connectivity of the tectal network.

      Data results: Data results were clear and straightforward. However, the explanation was not given for certain results. For example, the relationship between pre-stimulus linear drive and delta R was weak; the examples in Fig. 4C do not appear to be representative of the other sessions. The example sessions in Fig. 4C have R^2=0.17 and 0.19, the two outliers in the R^2 histogram (Fig. 4D).

      The revised figure 4 is based on new data and new analysis (see below), and the presented examples no longer represent the extreme tail of the distribution (they still, however, represent strong examples, as is now explicitly indicated in the figure legend).

      The black trace in Fig. 4D has large variations (e.g., a linear drive of 25 and 30 have a change in delta R of ~0.1 - greater than the overall change of the dashed line at both ends, ~0.08) but the SEMs are very tight. This suggests that either this last fluctuation is real and a major effect of the data (although not present in Fig. 4C) or the SEM is not conservative enough. No null distribution or statistics were computed on the R^2 distribution (Fig. 4C, blue distribution) to confirm the R^2s are statistically significant and not due to random fluctuations.

      We agree that this was not sufficiently robust and in response to this comment we undertook a significant revision to figure 4 and the associated text:

      i) The revised figure is based on an entirely new dataset, allowing us to verify the results on independent data. We used 5 min ISI for all stimulus presentations, regardless of stimulus type (high or low elevation), thus ensuring that we are only examining differences in state brought about by previous ongoing activity, without risk of ‘contamination’ by evoked activity.

      ii) As per the reviewer’s suggestion, we compared model-estimated pre-stimulus state to a null estimate using randomly sampled time-points. We additionally compared the optimised model with the baseline model. Whereas the null (random times) estimates had no predictive power, both models using pre-stimulus activity were able to explain a fraction of the response residuals with the optimised model performing better.

      iii) We refined the binning process by first computing, for each response, the mean of response residuals across neurons for each bin of estimated linear drive, and then averaging across responses. This prevents the relationship being skewed by rare instances involving unusually large numbers of neurons for a particular linear drive bin, and thereby eliminates the fluctuations the reviewer was referring to.
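
      Points ii) and iii) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names, array shapes, bin edges and synthetic data are all hypothetical, and a squared Pearson correlation stands in for whatever R^2 measure the manuscript uses.

```python
import numpy as np

def binned_residual_curve(drive, resid, edges):
    """Mean response residual per linear-drive bin.

    drive, resid: (n_responses, n_neurons) arrays.
    Residuals are first averaged across neurons within each response and
    bin, and only then averaged across responses, so a response with many
    neurons in one bin does not dominate that bin.
    """
    n_resp, n_bins = drive.shape[0], len(edges) - 1
    per_resp = np.full((n_resp, n_bins), np.nan)
    for r in range(n_resp):
        idx = np.digitize(drive[r], edges) - 1
        for b in range(n_bins):
            sel = idx == b
            if sel.any():
                per_resp[r, b] = resid[r, sel].mean()
    return np.nanmean(per_resp, axis=0)

def r2_linear(x, y):
    """Squared Pearson correlation between two arrays of equal shape."""
    return np.corrcoef(x.ravel(), y.ravel())[0, 1] ** 2

# Two-step averaging on a tiny hand-checkable example.
drive = np.array([[0.5, 0.5, 1.5], [1.5, 1.5, 0.5]])
resid = np.array([[1.0, 3.0, 10.0], [2.0, 4.0, 20.0]])
curve = binned_residual_curve(drive, resid, edges=[0.0, 1.0, 2.0])

# Null comparison: drive sampled just before each stimulus vs. at random times.
rng = np.random.default_rng(2)
drive_ts = rng.normal(size=(2000, 50))       # hypothetical drive time series
onsets = np.arange(100, 2000, 100)           # hypothetical stimulus onsets
pre = drive_ts[onsets - 1]
resid_sim = 0.5 * pre + rng.normal(size=pre.shape)
candidates = np.setdiff1d(np.arange(drive_ts.shape[0]), onsets - 1)
null_times = rng.choice(candidates, size=onsets.size, replace=False)
r2_pre = r2_linear(pre, resid_sim)
r2_null = r2_linear(drive_ts[null_times], resid_sim)
```

      In the hand-checkable example, pooling all neurons would give 8 for the first bin, whereas averaging within each response first gives 11, which shows how the two-step procedure reweights responses.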

      The absence of any background activity in Fig. 6B (e.g., during the rest blocks) is confusing, given that in spontaneous activity many bursts and background activity are present (Fig. 2E).

      The raster only presents evoked responses and no background activity is shown. This has been clarified in the revised figure and legend.

      Finally, it appears that the anterior optic tectum contributes to convergent saccades (CS) (Fig. 7E) but no post-saccadic activity is shown to assess how activity changes after the saccade (e.g., plotting activity from 0 to 60).

      Activity before and after the saccade is shown in Fig 7A. Fig 7E shows the ‘linear drive’ (or ‘excitability’), and how it changes leading up to the saccade. Since we were interested in the association between pre-saccade state and saccade-associated activity, we did not plot post-saccadic linear drive. However, as can be seen in the below figure for the reviewer, linear drive is strongly suppressed by the saccade, as expected due to CS-associated activity.

      No explanation is given why activity drops ~30 seconds before a convergent saccade (Fig. 7E).

      This is no longer shown after we trimmed the history data in Fig 7E in accordance with a comment from reviewer 1. We speculate, however, that the mean linear drive of a compact population of neurons would be somewhat periodic, since a high linear drive leads to a burst, which results in prolonged inhibition (low linear drive) with a slow recovery, and so on.

      No statistical test is performed on the R^2 distribution (Fig. 7H) to confirm the R^2s (with a mean close to R^2=0.01) are meaningful and not due to random fluctuations.

      We revised the analysis in Fig 7 along the same lines as the revision of Fig 4. Model-estimated linear drive predicts CS-associated activity whereas a null estimate (random times) shows no such relationship.

      Presentation: A disjointed part of the paper is that for the first part (Figs. 1-3), the focus is on capturing burst activity, but for the second part (Figs. 4-7), the focus is on trial-to-trial variability with no mention of bursts. It is unclear how the reader should relate the two and if bursts serve a purpose for stimulus-evoked activity.

      In the first part of the paper (Figs. 1-3), we use ongoing activity to develop an understanding (formulated as a network model) of how activity modulates the network state. In the second part, we test this understanding in the context of evoked responses and show that model-estimated network state explains a fraction of visual response variability and experience-dependent changes in activity and behaviour. In the revised MS we further emphasize this idea and have edited the results text to strengthen the connections between these parts of the study. See also comments above.

      Citations: The manuscript may cite other relevant studies in electrophysiology that have investigated noise correlations, such as:

      • Luczak et al., Neuron 2009 (comparing spontaneous and evoked activity).

      • Cohen and Kohn, Nat Neuro 2011 (review on noise correlations).

      • Smith and Kohn, JNeurosci 2008 (looking at correlations over distance).

      • Lin et al., Neuron 2015 (modeling shared variability).

      • Goris et al., Nat Neuro 2014 (check out Fig. 4).

      • Umakantha et al., Neuron 2021 (links noise correlation and dim reduction; includes other recent references to noise correlations).

      We agree that the manuscript could benefit from citing some of these suggested studies and have added citations accordingly.

    1. Author Response

      Reviewer #1 (Public Review):

      It is well established that valuation and value-based decision-making is context-dependent. This manuscript presents the results of six behavioral experiments specifically designed to disentangle two prominent functional forms of value normalization during reward learning: divisive normalization and range normalization. The behavioral and modeling results are clear and convincing, showing that key features of choice behavior in the current setting are incompatible with divisive normalization but are well predicted by a non-linear transformation of range-normalized values.

      Overall, this is an excellent study with important implications for reinforcement learning and decision-making research. The manuscript could be strengthened by examining individual variability in value normalization, as outlined below.

      We thank the Reviewer for the positive appreciation of our work and for the very relevant suggestions. Please find our point-by-point answer below.

      There is a lot of individual variation in the choice data that may potentially be explained by individual differences in normalization strategies. It would be important to examine whether there are any subgroups of subjects whose behavior is better explained by a divisive vs. range normalization process. Alternatively, it may be possible to compute an index that captures how much a given subject displays behavior compatible with divisive vs. range normalization. Seeing the distribution of such an index could provide insights into individual differences in normalization strategies.

      Thank you for pointing this out; there is indeed some variability. To address this, and in line with the Reviewer’s suggestion, we extracted model attributions per participant on the individual out-of-sample log-likelihood, using the VBA_toolbox in Matlab (Daunizeau et al., 2014). In experiment 1 (presented in the main text), we found that the RANGE model accounted for 79% of the participants, while the DIVISIVE model accounted for 12%. The relative difference was even higher when including the RANGEω model in the model space: the RANGE and RANGEω models accounted for a total of 85% of the participants, while the DIVISIVE model accounted for only 5%.

      In experiment 2 (presented in the supplementary materials), the results were comparable (see Figure 3-figure supplement 3: 73% vs 10%, 83% vs 2%).

      To provide further insights into the behavioral signatures behind inter-individual differences, we plotted the transfer choice rates for each group of participants (best explained by the RANGE, DIVISIVE, or UNBIASED models), and the results are similar to our model predictions from Figure 1C:

      Author Response Image 1. Behavioral data in the transfer phase, split over participants best explained by the RANGE (left), DIVISIVE (middle) or UNBIASED (right) model in experiment 1 (A) and experiment 2 (B) (versions a, b and c were pooled together).

      To keep things concise, we did not include this last figure in the revised manuscript, but it will be available for the interested readers in the Rebuttal letter.

      One possibility currently not considered by the authors is that both forms of value normalization are at work at the same time. It would be interesting to see the results from a hybrid model.

      Thank you for the suggestion. We fitted and simulated a hybrid model as a weighted sum of both forms of normalization:

      First, the HYBRID model quantitatively wins over the DIVISIVE model (oosLL_HYB vs oosLL_DIV: t(149)=10.19, p<.0001, d=0.41) but not over the RANGE model, which produced a marginally higher log-likelihood (oosLL_HYB vs oosLL_RAN: t(149)=-1.82, p=.07, d=-0.008). Second, model simulations also suggest that the model would predict a very similar (if not worse) behavior compared to the RANGE model (see figure below). This is supported by the distribution of the weight parameter over our participants: it appears that, consistent with the model attributions presented above, most participants are best explained by a range-normalization rule (weight > 0.5, 87% of the participants, see figure below). Together, these results favor the RANGE model over the DIVISIVE model in our task.
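
      The weighted-sum idea can be sketched as below. The equation from the original response is not reproduced in this text, so the functional forms are assumptions: range normalization rescales an outcome by the minimum and maximum of its context, divisive normalization divides it by the sum of the context's outcomes, and a weight above 0.5 leans toward the range rule, matching the convention in the paragraph above. The function names and the example context are hypothetical.

```python
def range_norm(outcome, context):
    """Rescale an outcome to [0, 1] using the context's min and max."""
    lo, hi = min(context), max(context)
    return (outcome - lo) / (hi - lo)

def divisive_norm(outcome, context):
    """Divide an outcome by the sum of all outcomes in the context (assumed form)."""
    return outcome / sum(context)

def hybrid_norm(outcome, context, weight):
    """Weighted sum of the two rules; weight = 1 is pure range normalization."""
    return (weight * range_norm(outcome, context)
            + (1.0 - weight) * divisive_norm(outcome, context))

# Hypothetical three-option context using the outcome values 50, 68 and 86.
context = (50, 68, 86)
for w in (0.0, 0.5, 1.0):
    print(w, round(hybrid_norm(68, context, w), 3))
```

      A reviewer summary elsewhere in this response describes the divisive denominator as the average of the options; replacing `sum(context)` with the mean only rescales the divisive term by the number of options in the context.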

      Out of curiosity, we also implemented a hybrid model as a weighted sum between absolute (UNBIASED model) and relative (RANGE model) valuations:

      Model fitting, simulations and comparisons slightly favored this hybrid model over the UNBIASED model (oosLL_HYB vs oosLL_UNB: t(149)=2.63, p=.0094, d=0.15), but also drastically favored the range normalization account (oosLL_HYB vs oosLL_RAN: t(149)=-3.80, p=.00021, d=-0.40, see Author Response Image 2).

      Author Response Image 2. Model simulations in the transfer phase for the RANGE model (left) and the HYBRID model (middle) defined as a weighted sum between divisive and range forms of normalization (top) and between unbiased (no normalization) and range normalization (bottom). The HYBRID model features an additional weight parameter, whose distribution favors the range normalization rule (right).

      To keep things concise, we did not include this last figure in the revised manuscript, but it will be available for the interested readers in the Rebuttal letter.

      Reviewer #2 (Public Review):

      This paper studies how relative values are encoded in a learning task, and how they are subsequently used to make a decision. This is a topic that integrates multiple disciplines (psych, neuro, economics) and has generated significant interest. The experimental setting is based on previous work from this research team that has advanced the field's understanding of value coding in learning tasks. These experiments are well-designed to distinguish some predictions of different accounts for value encoding. However there is an additional treatment that would provide an additional (strong) test of these theories: RN would make an equivalent set of predictions if the range were equivalently adjusted downward instead (for example by adding a "68" option to "50" and "86", and then comparing to WB and WT). The predictions of DN would differ however because adding a low-value alternative to the normalization would not change it much. Would the behaviour of subjects be symmetric for equivalent ranges, as RN predicts? If so this would be a compelling result, because symmetry is a very strong theoretical assumption in this setting.

      We thank the Reviewer for the overall positive appraisal of our work, and for the stimulating and constructive remarks that we have addressed below. At this stage, we just want to mention that we agree with the Reviewer that a design where we add a "68" option to "50" and "86" would also represent an important test of our hypotheses. This is why we had, in fact, run this experiment. Unfortunately, its results were somewhat buried in the Supplementary Materials of our original submission and not properly highlighted in the main text. We modified the manuscript in order to make them more visible:

      Behavioral results in three experiments (N=50 each) featuring a slightly different design, where we added a mid value option (NT68) between NT50 and NT87 converge to the same broad conclusion: the behavioral pattern in the transfer phase is largely incompatible with that predicted by outcome divisive normalization during the learning phase (Figure 2-figure supplement 2).

      Reviewer #3 (Public Review):

      Bavard & Palminteri extend their research program by devising a task that enables them to dissociate two types of normalisation: range normalisation (by which outcomes are normalised by the min and max of the options) and divisive normalisation (in which outcomes are normalised by the average of the options in one's context). By providing 4 different training contexts in which the range of outcomes and number of options vary, they successfully show using 'ex ante' simulations that different learning approaches during training (unbiased, divisive, range) should lead to different patterns of choice in a subsequent probe phase during which all options from the training are paired with one another, generating novel choice pairings. These patterns are somewhat subtle but are elegantly unpacked. They then fit participants' training choices to different learning models and test how well these models predict probe phase choices. They find evidence - both in terms of quantitative (i.e. comparing out-of-sample log-likelihood scores) and qualitative (comparing the pattern of choices observed to the pattern that would be observed under each model) fit - for the range model. This fit is further improved by adding a power parameter, which suggests that alongside being relativised via range normalisation, outcomes were also transformed non-linearly.

      I thought this approach to address their research question was really successful and the methods and results were strong, credible, and robust (owing to the number of experiments conducted, the design used and combination of approaches used). I do not think the paper has any major weaknesses. The paper is very clear and well-written which aids interpretability.

      This is an important topic for understanding, predicting, and improving behaviour in a range of domains potentially. The findings will be of interest to researchers in interdisciplinary fields such as neuroeconomics and behavioural economics as well as reinforcement learning and cognitive psychology.

      We thank Prof. Garrett for his positive evaluation and supportive attitude.

    1. Author Response

      Reviewer #1 (Public Review):

      While the mechanisms underlying arms races between plants and specialist herbivores, such as the detoxification of specific secondary metabolites, have been studied, the mechanisms underlying the wider diet breadth of so-called generalist herbivores have received less attention. Because host plant species are heterogeneous, experimental validation of the phylogenetic generalism of herbivores has been hard to conduct. The authors set out two major hypotheses about large diet breadth ("metabolic generalism" and "multi-host metabolic specialism") and carefully designed an experiment using Drosophila suzukii as a model herbivore species.

      By an untargeted metabolomics approach using UHPLC-MS, authors attempted to falsify the hypotheses both in qualitative- and quantitative metabolomic profiles. Intersections of four fruit (puree) samples and each diet-based fly individual samples from the qualitative data revealed that there were few ions that occur as the specific metabolite in each diet-based fly group, which could reject the "multi-host metabolic specialism" hypothesis. Quantitative data also showed results that could support the "metabolic generalism" hypothesis. Therefore, the wide diet breadth of D. suzukii seemed to be derived from the general metabolism rather than the adaptive traits of the diverse host plant species. On the other hand, the reduction of the metabolites (ions) set using GLM seemed logical and 2-D clustering from the reduced ions set showed that quantitative aspects of diet-associated ions could classify "what the flies ate". These interesting results could enhance the understanding of the diet breadth (niche) of herbivorous insects.
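
      The qualitative intersection logic described above can be sketched with plain set operations. This is an illustration only: the diet labels, ion identifiers and function names are hypothetical, and the study's actual ion detection and filtering steps are not reproduced.

```python
def diet_specific_ions(fly_ions):
    """Ions detected in flies from one diet and in no other diet group."""
    specific = {}
    for diet, ions in fly_ions.items():
        others = set().union(*(v for k, v in fly_ions.items() if k != diet))
        specific[diet] = ions - others
    return specific

def shared_with_fruit(fly_ions, fruit_ions):
    """Ions found both in a fruit and in the flies reared on that fruit."""
    return {diet: fly_ions[diet] & fruit_ions[diet] for diet in fly_ions}

# Hypothetical presence/absence data for two diets.
fruit_ions = {"fruit_A": {"m1", "m2", "m3"}, "fruit_B": {"m3", "m4"}}
fly_ions = {"fruit_A": {"m1", "f1", "f2"}, "fruit_B": {"m4", "f1", "f2", "x9"}}

specific = diet_specific_ions(fly_ions)
shared = shared_with_fruit(fly_ions, fruit_ions)
common_to_all_flies = set.intersection(*fly_ions.values())
```

      Under "multi-host metabolic specialism" one would expect many diet-specific fly ions that are absent from the corresponding fruit; in this toy example only `x9` fits that description.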

      The authors' approach seemed clear to falsify the hypotheses based on the appropriate data processing. The intersection of shared ions from the qualitative dataset could distinguish the diet-specific metabolites in flies and commonly occurring metabolites among flies and/or fruits. Also, filtering on the diet-specific ions seemed to be a logical and appropriate way. Meanwhile, the discussion about the results seemed to be focused on different points regarding the research hypotheses which were raised in the introduction part. Discussion about the results mainly focused on the metabolism of D. suzukii itself, rather than the research hypotheses and questions that were raised from the evolution of the wide diet breadth of generalist herbivores. In particular, the conclusion seems to be far from the main context of the authors' research; e.g. frugivory. It makes the implication of the study weaker.

      We wish to thank Reviewer #1 for their appreciation of our study. As recommended, we now focus our discussion more on the general aspect of our findings (relevant to insects, herbivores, or frugivores), and less on the peculiarities of the metabolism of D. suzukii itself. Specifically, we now only mention D. suzukii in one section (two sentences) of our Discussion, to serve as an example (l.387-396). Thanks to this comment, the Discussion may interest a broader readership, on the evolution of diet breadth in generalist herbivorous species and offers a better understanding of the general implications of our findings.

      Reviewer #2 (Public Review):

      The manuscript: "Metabolic consequences of various fruit-based diets in a generalist insect species" by Olazcuaga et al., addresses an interesting question. Using an untargeted metabolomics approach, the authors study how diet generalism may have evolved versus diet specialization which is generally more commonly observed, at least in drosophila species. Using the phytophagous species Drosophila suzukii, and by directly comparing the metabolomes of fruit purees and the flies that fed on them, the authors found evidence for "metabolic generalism". Metabolic generalism means that individuals of a generalist species process all types of diet in a similar way, which is in contrast to "multi-host metabolic specialism" which entails the use of specific pathways to metabolize unique compounds of different diets. The authors find strong evidence for the first hypothesis, as they could easily detect the signature of each fruit diet in the flies. The authors then go on to speculate on the evolutionary ramifications of this for how potentially diet specializations may have evolved from diet generalism. Overall, the paper is well written, the experiments well documented, and the conclusions convincing.

      We thank Reviewer #2 for their comments and appreciation of our work.

      Reviewer #3 (Public Review):

      Laure Olazcuaga et al. investigated the metabolomes of four fruit-based diets and of the corresponding individuals of Drosophila suzukii that were reared on them, using comparative metabolomics analysis. They observed that the four fruit-based diets are metabolically dissimilar. In contrast, flies that fed on them are mostly similar in their metabolic response. From a quantitative point of view, they find that part of the fly metabolomes correlates well with that of the corresponding diet metabolomes, which is indicative of insect ingestive history. By further focusing on 71 metabolites derived from diet-specific fly ions and highly abundant fruit ions, the authors show that D. suzukii differentially accumulates diet metabolism in a compound-specific manner. The authors claim that the data support the metabolic generalism hypothesis while rejecting the multi-host metabolic specialism hypothesis. This study provides a valuable global chemical comparison of how diverse diet metabolites are processed by a generalist insect species.

      Strengths:

      The rapid advances in high-resolution mass spectrometry have recently accelerated the discovery of many novel post-ingestive compounds through comparative metabolomics analysis of insect/frass and plant samples. Untargeted metabolomics is thus a very powerful approach for the systematic comparison of global chemical shifts when diverse plant-derived specialized metabolites are further modified or quantitatively metabolized after ingestion by insects. The technique can be readily extended to a larger micro- or macro-evolutionary context for both generalist and specialist insects to systematically investigate how plant chemical diversity contributes to dietary generalism and specialism.

      We would like to thank Reviewer #3 for their insightful comments on the power of untargeted metabolomics to evaluate the fate of plant metabolites and their use by herbivores. We also agree that these techniques can be used to tackle eco-evolutionary issues, such as the origin and maintenance of dietary generalism and specialism here. We hope that our study will inspire other researchers to explore such techniques and experiments to gain a global overview of biochemistry fluxes and their evolution. We now mention it in the conclusion (L454-459).

      Weaknesses:

      The authors claim that their data support the hypothesis of metabolic generalism, however, a total analysis of insect metabolism may not generate a clean dataset for direct comparison of fruit-derived metabolites with those metabolized by D. suzukii, given that much of these metabolites would be "diluted" proportionally by insect-derived metabolites. If the insect-derived metabolites predominate, then, as the authors observed, a tight clustering of D. suzukii metabolomes in the PCA plot would be expected. It is therefore very difficult to interpret these patterns.

      We agree with Reviewer #3 that a careful examination of the different possible origins of metabolites should take place to distinguish between our two competing hypotheses.

      The only source of metabolites for insects in our experimental setup is a mixture of (i) a large proportion of fruit purees and (ii) a minor proportion of artificial medium consisting mainly of yeast. Our goal is thus to understand the fate of (i) “fruit-derived” metabolites (transformed and untransformed), while controlling for (ii) “artificial media-derived” metabolites, that constitute a nuisance signal but are necessary for a complete development in our system.

      By “fruit-derived” and “insect-derived” metabolites, it is our understanding that Reviewer #3 means “fruit” metabolites (when in insects, untransformed “fruit-derived” metabolites) and “artificial medium-derived” metabolites. It is true that we wish to avoid a predominance of “artificial medium-derived” metabolites and to focus on “fruit-derived” metabolites in insects. We also want to note that it is of primary importance in our study to distinguish between “fruit” metabolites that are carried as is (“fruit” metabolites present in insects, i.e., untransformed “fruit-derived” metabolites) and “fruit” metabolites that are used after transformation by the insect (i.e., transformed “fruit-derived” metabolites).

      We agree with Reviewer #3 that the presence of “artificial medium-derived” metabolites could be problematic in direct comparisons of fruits and insects (but not in comparisons among fruits or among insects).

      However, we took some steps to avoid such problems:

      1. We included control fly samples in our experiment: at each experimental generation, flies that developed only on artificial medium (without fruit puree) were collected and processed simultaneously with flies that developed on fruit media. Results obtained using these artificial medium-reared flies as controls (by subtracting their ion levels and removing ions that were similar, for the corresponding generation) were similar to results obtained using raw data, and the conclusions were identical (see below).

      2. We lowered the proportion of artificial medium in our fruit media so that it was kept to a minimum, compatible with larval development and adult survival.

      Consistent with the low impact of this “artificial medium” component on our conclusions, we also wish to point out the pattern of metabolites found only in flies and never in fruits when using raw data (Figure 3, yellow stack). Even under the most conservative hypothesis, in which 100% of these metabolites originate from our artificial medium (which is probably not the case), they constitute only a minor proportion of the metabolites common to all flies (15.7%).

      For your consideration, we include below the main Figures, using both raw data and artificial medium-controlled:

      Figure 2, left = raw data; right = artificial-media controlled:

      Figure 3, left = raw data; right = artificial-media controlled:

      Figure 3S1, left = raw data; right = artificial-media controlled:

      Figure 4, above = raw data; below = artificial-media controlled:

      We hope that we have convinced the Editor and Reviewers that raw data and artificial medium-controlled data provide one and the same answer in all our analyses. We chose to present only raw data, to simplify the Materials & Methods section.

      We have, however, modified the current version of the manuscript to inform the reader that proper controls were done and that their inclusion does not modify any of our conclusions (l.110-113 and l.583-589).

      We also wish to point out three additional comments:

      • As Reviewer #1 also recommended, we modified the expectations drawn in Fig1G to better consider the general comment of “insect derived” metabolites being fundamentally different from plant metabolites (even if we do show in our study that only approx. 9% of metabolites are private to flies).

      • We took care in our use of the global PCA: it follows two other analyses (global intersection and comparison of intersections among fruits and among flies) and precedes another one (fly-focused PCA). We hope that, together, these analyses give readers a comprehensive overview of the dataset and associated results and avoid reliance on a single analysis.

      • We also help readers to explore and visualize all analyses presented in our manuscript by setting up a shiny application (in addition to our available dataset and R code), at https://fruitfliesmetabo.shinyapps.io/shiny/. This is now mentioned in the main text (l.588-589).

      We thank the Reviewer for their comment that greatly improved the manuscript.

      The authors generated a qualitative dataset using the peak list produced by XCMS which contains quantitative peak areas, it is unclear how the threshold was selected to determine if a peak is present or absent in a given sample. The qualitative dataset would influence the output of their data analysis.

      The referee is right in pointing out that the threshold used to determine whether a peak is present or absent in a given sample was not clearly specified. This has now been corrected in the “Host use” section of the Materials & Methods (l.513-516). Briefly, a compound was considered present in a given replicate if the corresponding peak area following XCMS quantification was > 1000. This threshold was chosen to be close to the practical quantification threshold of the Thermo Exactive mass spectrometer used in this study, so that low-abundance compounds could be quantified, as many plant-derived diet compounds were expected to be present in trace amounts in flies. We additionally applied a stringent rule for the presence of any given compound (presence in at least 3 biological replicates).
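      The presence rule described above can be sketched as follows. This is a minimal illustration, not the authors' R code: the function name and the example peak areas are hypothetical, and only the two cutoffs (peak area > 1000, at least 3 biological replicates) come from the response.

```python
from typing import Sequence

AREA_THRESHOLD = 1000.0  # peak-area cutoff stated in the response
MIN_REPLICATES = 3       # minimum number of replicates above the cutoff


def is_present(peak_areas: Sequence[float],
               area_threshold: float = AREA_THRESHOLD,
               min_replicates: int = MIN_REPLICATES) -> bool:
    """Return True if enough replicates exceed the peak-area threshold."""
    n_above = sum(1 for area in peak_areas if area > area_threshold)
    return n_above >= min_replicates


# Hypothetical peak areas for one ion across five replicates of one diet.
example_ion = [1520.0, 980.0, 2310.0, 1001.0, 0.0]
print(is_present(example_ion))
```

      Note that the comparison is strict, so a replicate with an area of exactly 1000 does not count towards presence in this sketch.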

      The authors rely on in-source fragmentation for peak annotation when authentic standards are not available. The accuracy of the annotation thus requires further validation.

      Supplementary Table 1 was unfortunately omitted from the first submission of the manuscript. This oversight has now been corrected, and Supplementary Table 1 details all information used for metabolite annotation. In particular, comparisons of MS/MS data with mass spectral databases and with the published literature have been added to substantiate metabolite identifications. These MS/MS data were produced in response to the Reviewer's comment. We also provide four more annotations from standards, bringing the total to 30 of 71 identifications validated with chemical standards.

    1. Author Response

      Reviewer #1 (Public Review):

      Overall, the science is sound and interesting, and the results are clearly presented. However, the paper falls in-between describing a novel method and studying biology. As a consequence, it is a bit difficult to grasp the general flow, central story and focus point. The study does uncover several interesting phenomena, but none are really studied in much detail and the novel biological insight is therefore a bit limited and lost in the abundance of observations. Several interesting novel interactions are uncovered, in particular for the SPS sensor and GAPDH paralogs, but these are not followed up on in much detail. The same can be said for the more general observations, eg the fact that different types of mutations (missense vs nonsense) in different types of genes (essential vs non-essential, housekeeping vs. stress-regulated...) cause different effects.

      This is not to say that the paper has no merit - far from it even. But, in its current form, it is a bit chaotic. Maybe there is simply too much in the paper? To me, it would already help if the authors would explicitly state that the paper is a "methods" paper that describes a novel technique for studying the effects of mutations on protein abundance, and then goes on to demonstrate the possibilities of the technology by giving a few examples of the phenomena that can be studied. The discussion section ends in this way, but it may be helpful if this was moved to the end of the introduction.

      We modified the manuscript as suggested.

      Reviewer #2 (Public Review):

      Schubert et al. describe a new pooled screening strategy that combines protein abundance measurements of 11 proteins determined via FACS with genome-wide mutagenesis of stop codons and missense mutations (achieved via a base editor) in yeast. The method allows the identification of genetic perturbations that affect steady-state protein levels (vs transcript abundance), and in this way defines regulators of protein abundance. The authors find that perturbation of essential genes alters protein abundance more often than perturbation of nonessential genes, and that proteins with core cellular functions decrease in abundance in response to genetic perturbations more often than stress proteins. Genes whose knockouts affected the level of several of the 11 proteins were enriched in protein biosynthetic processes, while genes whose knockouts affected specific proteins were enriched for functions in transcriptional regulation. The authors also leverage the dataset to confirm known and identify new regulatory relationships, such as a link between the SPS amino acid sensor and the stress response gene Yhb1, or between Ras/PKA signalling and the GAPDH isoenzymes Tdh1, 2, and 3. In addition, the paper contains a section on benchmarking of the base editor in yeast, where it has not been used before.

      Strengths and weaknesses of the paper

      The authors establish the BE3 base editor as a screening tool in S. cerevisiae and very thoroughly benchmark its functionality for single edits and in different screening formats (fitness and FACS screening). This will be very beneficial for the yeast community.

      The strategy established here allows measuring the effect of genetic perturbations on protein abundances in highly complex libraries. This complements capabilities for measuring effects of genetic perturbations on transcript levels, which is important as for some proteins mRNA and protein levels do not correlate well. The ability to measure proteins directly therefore promises to close an important gap in determining all their regulatory inputs. The strategy is furthermore broadly applicable beyond the current study. All experimental procedures are very well described and plasmids and scripts are openly shared, maximizing utility for the community.

      There is a good balance between global analyses aimed at characterizing properties of the regulatory network and more detailed analyses of interesting new regulatory relationships. Some of the key conclusions are further supported by additional experimental evidence, which includes re-making specific mutations and confirming their effects on protein levels by mass spectrometry.

      The conclusions of the paper are mostly well supported, but I am missing some analyses on reproducibility and potential confounders and some of the data analysis steps should be clarified.

      The paper starts on the premise that measuring protein levels will identify regulators and regulatory principles that would not be found by measuring transcripts, but since the findings are not discussed in light of studies looking at mRNA levels it is unclear how the current study extends knowledge regarding the regulatory inputs of each protein.

      See response to Comment #10.

      Specific comments regarding data analysis, reproducibility, confounders

      1) The authors use the number of unique barcodes per guide RNA rather than barcode counts to determine fold-changes. For reliable fold changes the number of unique barcodes per gRNA should then ideally be in the 100s for each guide, is that the case? It would also be important to show the distribution of the number of barcodes per gRNA and their abundances determined from read counts. I could imagine that if the distribution of barcodes per gRNA or the abundance of these barcodes is highly skewed (particularly if there are many barcodes with only few reads) that could lead to spurious differences in unique barcode number between the high and low fluorescence pool. I imagine some skew is present as is normal in pooled library experiments. The fold-changes in the control pools could show whether spurious differences are a problem, but it is not clear to me if and how these controls are used in the protein screen.

      Because of the large number of screens performed in this study (11 proteins, with 8 replicates for each) we had to trade off sequencing depth and power against cell sorting time and sequencing cost, resulting in lower read and barcode numbers than what might be ideally aimed for. As described further in the response to Comment #5, we added a new figure to the manuscript that shows that the correlation of fold-changes between replicates is high (Figure 3–S1A). The second figure below shows that the correlation between the number of unique barcodes and the number of reads per gRNA is highly significant (p < 2.2e-16).
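      To make the quantity under discussion concrete, the sketch below counts unique barcodes per gRNA in a high- and a low-fluorescence pool and converts them to a log2 fold-change after normalizing by the total number of unique barcodes in each pool. It is only an illustration under stated assumptions: the input format (one `(gRNA, barcode)` pair per read), the pseudocount of 1 and all names are hypothetical and are not taken from the authors' scripts.

```python
import math
from collections import defaultdict


def unique_barcodes_per_grna(reads):
    """Map each gRNA to the number of distinct barcodes observed with it."""
    barcodes = defaultdict(set)
    for grna, barcode in reads:
        barcodes[grna].add(barcode)
    return {grna: len(seen) for grna, seen in barcodes.items()}


def log2_fold_change(high_reads, low_reads, pseudocount=1.0):
    """log2(high/low) of unique-barcode fractions; the pseudocount is an assumption."""
    high = unique_barcodes_per_grna(high_reads)
    low = unique_barcodes_per_grna(low_reads)
    high_total = sum(high.values())
    low_total = sum(low.values())
    result = {}
    for grna in set(high) | set(low):
        high_frac = (high.get(grna, 0) + pseudocount) / high_total
        low_frac = (low.get(grna, 0) + pseudocount) / low_total
        result[grna] = math.log2(high_frac / low_frac)
    return result


# Hypothetical reads: repeated (gRNA, barcode) pairs count only once.
high_pool = [("g1", "A"), ("g1", "A"), ("g1", "B"), ("g1", "C"), ("g2", "D")]
low_pool = [("g1", "A"), ("g2", "D"), ("g2", "E"), ("g2", "F")]
fold_changes = log2_fold_change(high_pool, low_pool)
print(fold_changes)
```

      Because duplicate reads of the same barcode collapse to one count, this measure is insensitive to read-depth differences for a given barcode, but it becomes noisy when a gRNA is represented by only a few barcodes, which is the reviewer's concern.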

      2) I like the idea of using an additional barcode (plasmid barcode) to distinguish between different cells with the same gRNA - this would directly allow to assess variability and serve as a sort of replicate within replicate. However, this information is not leveraged in the analysis. It would be nice to see an analysis of how well the different plasmid barcodes tagging the same gRNA agree (for fitness and protein abundance), to show how reproducible and reliable the findings are.

      We agree with the reviewer that this would be nice to do in principle, but our sequencing depth for the sorted cell populations was not high enough to compare the same barcode across the low/unsorted/high samples. See also our response to Comment #5 for the replicate analyses.

      3) From Fig 1 and previous research on base editors it is clear that mutation outcomes are often heterogeneous for the same gRNA and comprise a substantial fraction of wild-type alleles, alleles where only part of the Cs in the target window or where Cs outside the target window are edited, and non C-to-T edits. How does this reflect on the variability of phenotypic measurements, given that any barcode represents a genetically heterogeneous population of cells rather than a specific genotype? This would be important information for anyone planning to use the base editor in future.

      We agree with the reviewer that the heterogeneity of editing outcomes is an important point to keep in mind when working with base editors. In genetic screens, like the ones described here, often the individual edit is less important, and the overall effects of the base editor are specific/localized enough to obtain insights into the effects of mutations in the area where the gRNA targets the genome. For example, in our test screens for Canavanine resistance and fitness effects, in which we used gRNAs predicted to introduce stop codons into the CAN1 gene and into essential genes, respectively, we see the expected loss-of-function effect for a majority of the gRNAs (canavanine screen: expected effect for 67% of all gRNAs introducing stop codons into CAN1; fitness screen: expected effect for 59% of all gRNAs introducing stop codons into essential genes) (Figure 2). In the canavanine screen, we also see that gRNAs predicted to introduce missense mutations at highly conserved residues are more likely to lead to a loss-of-function effect than gRNAs predicted to introduce missense mutations at less conserved residues, further highlighting the differentiated results that can be obtained with the base editor despite the heterogeneity in editing outcomes overall. We would certainly advise anyone to confirm by sequencing the base edits in individual mutants whenever a precise mutation is desired, as we did in this study when following up on selected findings with individual mutants.

      4) How common are additional mutations in the genome of these cells and could they confound the measured effects? I can think of several sources of additional mutations, such as off-target editing, edits outside the target window, or when 2 gRNA plasmids are present in the same cell (both target windows obtain edits). Could some of these events explain the discrepancy in phenotype for two gRNAs that should make the same mutation (Fig S4)? Even though BE3 has been described in mammalian cells, an off-target analysis would be desirable as there can be substantial differences in off-target behavior between cell types and organisms.

      Generally, we are not very concerned about random off-target activity of the base editor because we would not expect this to cause a consistent signal that would be picked up in our screen as a significant effect of a particular gRNA. Reproducible off-target editing with a specific gRNA at a site other than the intended target site would be problematic, though. We limited the chance of this happening by not using gRNAs that may target similar sequences to the intended target site in the genome. Specifically, we excluded gRNAs that have more than one target in the genome when the 12 nucleotides in the seed region (directly upstream of the PAM site) are considered (DiCarlo et al., Nucleic Acids Research, 2013).
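      The seed-uniqueness filter described above can be sketched as follows. This is an illustrative assumption about how such a filter might be written, not the authors' pipeline: it assumes an NGG PAM, exact matching of the 12 PAM-proximal nucleotides, a genome supplied as a single uppercase string, and hypothetical function names.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_seed_sites(seed, genome):
    """Count seed + NGG matches on both strands (overlapping matches allowed)."""
    pattern = re.compile("(?=" + re.escape(seed) + "[ACGT]GG)")
    forward = len(pattern.findall(genome))
    reverse = len(pattern.findall(reverse_complement(genome)))
    return forward + reverse


def has_unique_seed(protospacer, genome, seed_length=12):
    """Keep a gRNA only if its PAM-proximal seed has exactly one target site."""
    seed = protospacer[-seed_length:]
    return count_seed_sites(seed, genome) == 1


# Toy example with a hypothetical 20-nt protospacer and two toy genomes.
protospacer = "ACGTACGTTTGACCATGCAA"
genome_single = "GGCC" + protospacer + "TGG" + "AATT"
genome_double = genome_single + "CC" + protospacer[-12:] + "AGG"
print(has_unique_seed(protospacer, genome_single))
print(has_unique_seed(protospacer, genome_double))
```

      A real implementation would iterate over chromosomes and might tolerate mismatches in the seed; both are outside the scope of this sketch.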

      We do observe some off-target editing right outside the target window, but generally at much lower frequency than the on-target editing in the target window (Figure 1B and Figure 1–S2). Since for most of our analyses we grouped perturbations per gene, such off-target edits should not affect our findings. In addition, we validated key findings with independent experiments. For our study, we used the Base Editor v3 (Komor et al., Nature, 2016); more recently, additional base editors have been developed that show improved accuracy and efficiency, and we would recommend these base editors when starting a new study (see, e.g., Anzalone et al., Nature Biotechnology, 2020).

      We are not concerned about cases in which one cell gets two gRNAs, since the chance that the same two gRNAs end up in one cell repeatedly is low, and such events would therefore not result in a significant signal in our screens.

      We don’t think that off-target mutations can explain the discrepancy between pairs of gRNAs that should introduce the same mutation (Figure 3–S1). The effects of the two gRNAs are actually well correlated, but, often, one of the two gRNAs doesn’t pass our significance cut-off or simply doesn’t edit efficiently (i.e., most discrepancies arise from false negatives rather than false positives). We may therefore miss the effects of some mutations, but we are unlikely to draw erroneous conclusions from significant signals.

      5) In the protein screen normalization uses the total unique barcode counts. Does this efficiently correct for differences from sequencing (rather than total read counts or other methods)? It would be nice to see some replicate plots for the analysis of the fitness as well as the protein screen to be able to judge that.

      We made a new figure that shows a replicate comparison for the protein screen (see below; in the manuscript it is Figure 3–S1A) and commented on it in the manuscript. For this analysis, the eight replicates for each protein were split into two groups of four replicates each and analyzed the same way as the eight replicates. The correlation between the two groups of replicates is highly significant (p < 2.2e-16). The second figure shows that the total number of reads and the total number of unique barcodes are well correlated.

      For the fitness screen, we used read counts rather than barcode counts for the analysis since read counts better reflect the dropout of cells due to reduced fitness. The figure below shows a replicate comparison for the fitness screen. For this analysis, the four replicates were split into two groups of two replicates each and analyzed the same way as the four replicates. The correlation between the two groups of replicates is highly significant (p < 2.2e-16).
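      The split-half comparison described in this response can be sketched as below. The sketch makes a simplifying assumption: each half is summarized by averaging per-gRNA effects, whereas the authors state that each half was "analyzed the same way" as the full set, a pipeline that is not reproduced here. All names and values are hypothetical.

```python
import math


def mean_effect(replicates):
    """Average each gRNA's effect across a group of replicates."""
    grnas = replicates[0].keys()
    return {g: sum(rep[g] for rep in replicates) / len(replicates) for g in grnas}


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def split_half_correlation(replicates):
    """Correlate per-gRNA effects between the first and second half of replicates."""
    half = len(replicates) // 2
    first = mean_effect(replicates[:half])
    second = mean_effect(replicates[half:])
    grnas = sorted(first)
    return pearson_r([first[g] for g in grnas], [second[g] for g in grnas])


# Hypothetical log2 fold-changes for three gRNAs in four replicates.
replicates = [
    {"g1": 1.0, "g2": 0.0, "g3": -1.0},
    {"g1": 1.2, "g2": 0.1, "g3": -0.9},
    {"g1": 0.9, "g2": -0.1, "g3": -1.1},
    {"g1": 1.1, "g2": 0.2, "g3": -0.8},
]
r = split_half_correlation(replicates)
print(round(r, 3))
```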

      6) In the main text the authors mention very high agreement between gRNAs introducing the same mutation but this is only based on 20 or so gRNA pairs; for many more pairs that introduce the same mutation only one reaches significance, and the correlation in their effects is lower (Fig S4). It would be better to reflect this in the text directly rather than exclusively in the supplementary information.

      We clarified this in the manuscript main text: “For 78 of these gRNA pairs, at least one gRNA had a significant effect (FDR < 0.05) on at least one of the eleven proteins; their effects were highly correlated (Pearson’s R² = 0.43, p < 2.2e-16) (Figure 3–S1B). For the 20 gRNA pairs for which both gRNAs had a significant effect, the correlation was even higher (Pearson’s R² = 0.819, p = 8.8e-13) (Figure 3–S1C). These findings show that the significant gRNA effects that we identify have a low false positive rate, but they also suggest that many real gRNA effects are not detected in the screen due to limitations in statistical power.”

      7) When the different gRNAs for a targeted gene are combined, instead of using an averaged measure of their effects the authors use the largest fold-change. This seems not ideal to me as it is sensitive to outliers (experimental error or background mutations present in that strain).

      We agree that the method we used is more sensitive to outliers than averaging per gene. However, because many gRNAs have no effect either because they are not editing efficiently or because the edit doesn’t have a phenotypic consequence, an averaging method across all gRNAs targeting the same gene would be too conservative and not properly capture the effect of a perturbation of that gene.
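      The trade-off discussed here can be illustrated with a small sketch that summarizes gRNA effects per gene in both ways. It assumes that "largest fold-change" means largest in absolute value; the gene name and effect sizes are hypothetical.

```python
from collections import defaultdict


def summarize_per_gene(grna_effects):
    """Summarize (gene, log2 fold-change) pairs by strongest effect and by mean."""
    by_gene = defaultdict(list)
    for gene, lfc in grna_effects:
        by_gene[gene].append(lfc)
    return {
        gene: {
            "strongest": max(values, key=abs),   # assumed meaning of "largest"
            "mean": sum(values) / len(values),   # the alternative the reviewer suggests
        }
        for gene, values in by_gene.items()
    }


# Hypothetical gene with one effective gRNA and three gRNAs without effect.
effects = [("GENE_A", 0.0), ("GENE_A", 0.1), ("GENE_A", -1.5), ("GENE_A", 0.0)]
summary = summarize_per_gene(effects)
print(summary["GENE_A"])
```

      In this toy case the mean (-0.35) is dominated by the three ineffective gRNAs, while the strongest effect (-1.5) reflects the single gRNA that edited; the same property makes the strongest-effect summary sensitive to a single outlier.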

      8) Phenotyping is performed directly after editing, when the base editor is still present in the cells and could still interact with target sites. I could imagine this could lead to reduced levels of the proteins targeted for mutagenesis as it could act like a CRISPRi transcriptional roadblock. Could this enhance some of the effects or alter them in case of some missense mutations?

      To reduce potential “CRISPRi-like” effects of the base editor on gene expression, we placed the base editor under a galactose-inducible promoter. For both the fitness and protein screens we grew the cultures in media without galactose for another 24 hours (fitness screen) or 8-9 hours (protein screens) before sampling. In the latter case, this recovery time corresponded to more than three cell divisions, after which we assume base editor levels to have strongly decreased, and therefore to no longer interfere with transcription. This is also supported by our ability to detect discordant effects of gRNAs targeting the same gene (e.g., the two mutations leading to loss-of-function and gain-of-function of RAS2), which would otherwise be overshadowed by a CRISPRi effect.

      9) I feel that the main text does not reflect the actual editing efficiency very well (the main numbers I noticed were 95% C to T conversion and 89% of these occurring in a specific window). More informative for interpreting the results would be to know what fraction of the alleles show an edit (vs wild-type) and how many show the 'complete' edit (as the authors assume 100% of the genotypes generated by a gRNA to be conversion of all Cs to Ts in the target window). It would be important to state in the main text how variable this is for different gRNAs and what the typical purity of editing outcomes is.

      We now show the editing efficiency and purity in a new figure (Figure 1B), and discuss it in the main text as follows: “We found that the target window and mutagenesis pattern are very similar to those described in human cells: 95% of edits are C-to-T transitions, and 89% of these occurred in a five-nucleotide window 13 to 17 base pairs upstream of the PAM sequence (Figure 1A; Figure 1–S2) (Komor et al., 2016). Editing efficiency was variable across the eight gRNAs and ranged from 4% to 64% if considering only cases where all Cs in the window are edited; percentages are higher if incomplete edits are considered, too (Figure 1B).”
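      As an illustration of the "complete edit" assumption discussed in this comment, the sketch below converts every C to T within the window 13 to 17 bp upstream of the PAM. The position convention is an assumption: the base immediately 5' of the PAM is counted as 1 bp upstream, so for a 20-nt protospacer the window covers protospacer positions 4 to 8. The function names and the example sequence are hypothetical.

```python
def edit_window_indices(protospacer_length=20, nearest=13, farthest=17):
    """0-based indices of bases lying `nearest`..`farthest` nt upstream of the PAM."""
    return [protospacer_length - d for d in range(farthest, nearest - 1, -1)]


def predict_full_edit(protospacer):
    """Return the protospacer with every C in the editing window converted to T."""
    window = set(edit_window_indices(len(protospacer)))
    return "".join(
        "T" if (i in window and base == "C") else base
        for i, base in enumerate(protospacer)
    )


# Hypothetical 20-nt protospacer written 5'->3', with the PAM following its 3' end.
protospacer = "GACCCAGCTTACGGATCCAA"
print(edit_window_indices())
print(predict_full_edit(protospacer))
```

      Cs outside the window are left untouched here, whereas the response notes that real outcomes include incomplete edits and some edits just outside the window.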

      Comments regarding findings

      10) It would be nice to see a comparison of the results to the effects of ~1500 yeast gene knockouts on cellular transcriptomes (https://doi.org/10.1016/j.cell.2014.02.054). This would show where the current study extends established knowledge regarding the regulatory inputs of each protein and highlight the importance of directly measuring protein levels. This would be particularly interesting for proteins whose abundance cannot be predicted well from mRNA abundance.

      We agree with the reviewer that it would be very interesting to compare the effect of perturbations on mRNA vs protein levels. We have compared our protein-level data to mRNA-level data from Kemmeren and colleagues (Kemmeren et al., Cell 2014), and we find very good agreement between the effects of gene perturbations on mRNA and protein levels when considering only genes with q < 0.05 and Log2FC > 0.5 in both studies (Pearson’s R = 0.79, p < 5.3e-15).

      Gene perturbations with effects detected only on mRNA but not protein levels are enriched in genes with a role in “chromatin organization” (FDR = 0.01; as a background for the analysis, only the 1098 genes covered in both studies were considered). This suggests that perturbations of genes involved in chromatin organization tend to affect mRNA levels but are then buffered and do not lead to altered protein levels. There was no enrichment of functional annotations among gene perturbations with effects on protein levels but not mRNA levels.

      We did not include these results in the manuscript because there are some limitations to the conclusions that can be drawn from these comparisons, including that our study has a relatively high number of false negatives, and that the genes perturbed in the Kemmeren et al. study were selected to play a role in gene regulation, meaning that differences in mRNA-vs-protein effects of perturbations are limited to this function, and other gene functions cannot be assessed.

      11) The finding that genes that affect only one or two proteins are enriched for roles in transcriptional regulation could be a consequence of 'only' looking at 10 proteins rather than a globally valid conclusion. Particularly as the 10 proteins were selected for diverse functions that are subject to distinct regulatory cascades. ('only' because I appreciate this was a lot of work.)

      We agree with this, and we think it is clear in the abstract and the main text of the manuscript that here we studied 11 proteins. We made this point also more explicit in the discussion, so that it is clear for readers that the findings are based on the 11 proteins and may not extrapolate to the entire yeast proteome.

      Reviewer #3 (Public Review):

      This manuscript presents two main contributions. First, the authors modified a CRISPR base editing system for use in an important model organism: budding yeast. Second, they demonstrate the utility of this system by using it to conduct an extremely high-throughput study of the effects of mutation on protein abundance. This study confirms known protein regulatory relationships and detects several important new ones. It also reveals trends in the type of mutations that influence protein abundances. Overall, the findings are of high significance and the method appears to be extremely useful. I found the conclusions to be justified by the data.

      One potential weakness is that some of the methods are not described in the main body of the paper, so the reader has to really dive into the methods section to understand particular aspects of the study, for example, how the fitness competition was conducted.

      We expanded the first section for better readability.

      Another potential weakness is the comparison of this study (of protein abundances) to previous studies (of transcript abundances) was a little cursory, and left some open questions. For example, is it remarkable that the mutations affecting protein abundance are predominantly in genes involved in translation rather than transcription, or is this an expected result of a study focusing on protein levels?

      We thank the reviewer for pointing out that this paragraph requires more explanation. We expanded it as follows: “Of these 29 genes, 21 (72%) have roles in protein translation—more specifically, in ribosome biogenesis and tRNA metabolism (FDR < 8.0e-4, Figure 5C). In contrast, perturbations that affect the abundance of only one or two of the eleven proteins mostly occur in genes with roles in transcription (e.g., GO:0006351, FDR < 1.3e-5). Protein biosynthesis entails both transcription and translation, and these results suggest that perturbations of translational machinery alter protein abundance broadly, while perturbations of transcriptional machinery can tune the abundance of individual proteins. Thus, genes with post-transcriptional functions are more likely to appear as hubs in protein regulatory networks, whereas genes with transcriptional functions are likely to show fewer connections.”

      Overall, the strengths of this study far outweigh these weaknesses. This manuscript represents a very large amount of work and demonstrates important new insights into protein regulatory networks.

    1. Author Response

      Reviewer #2 (Public Review):

      The authors seek to determine how various species combine their effects on the growth of a species of interest when part of the same community.

      To this end, the authors carry out an impressive experiment containing what I believe must be one of the largest pairwise + third-order co-culture experiments done to date, using a high-throughput co-culture system they had co-developed in previous work. The unprecedented nature of this data is a major strength of the paper. The authors also discover that species combine their effect through "dominance", i.e. the strongest effect masks the others. This is important as it calls into question the common assumption of additivity that is implicit in the choice of using Lotka-Volterra models.

      A stronger claim (i.e. in the abstract) is that joint effect of multiple species on the growth of another can be derived from the effect of individual species. Unless I am misunderstanding something, this statement may have to be qualified a little, as the authors show that a model based on pairwise dominance (i.e. the strongest pairwise) does a somewhat better job (lower RMSD, though granted, not by much, 0.57 vs 0.63) than a model based on single species dominance. This is, the effect of the strongest pair predicts better the effect of a trio than the effect of the larger species.

      This issue makes one wonder whether, had the authors included higher-order combinations of species (i.e. five-member consortia or higher), the strongest-effect trio would have predicted better than the strongest-effect pair, which in turn is a better predictor than the strongest-effect species. This is important, as it would help one determine to what extent the strongest-effect model would work in more diverse communities, such as those one typically finds in nature. Indeed, the authors find that the predictive ability of the strongest effect species is much stronger for pairs than it is for trios (RMSD of 0.28 vs 0.63). Does the predictive ability of the single species model decline faster and faster as diversity grows beyond 4-member consortia?

      Thank you for raising this important point. It is true that in our study we see that single species predict pairs better than trios, and that pairs predict trios better than single species. As we did not perform experiments on more diverse communities (n>4), we are not sure if or how these rules will scale up. We explicitly address these caveats in our revised discussion.
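The two dominance models compared in this exchange can be sketched as follows. This is only an illustration under stated assumptions: "strongest" is taken to mean the effect with the largest absolute magnitude, the function names are hypothetical, and the effect values are invented for the example rather than taken from the study.

```python
from itertools import combinations


def strongest(effects):
    """Return the effect with the largest absolute magnitude (assumed definition)."""
    return max(effects, key=abs)


def predict_trio_from_singles(trio, single_effects):
    """Single-species dominance: the trio effect equals the strongest single effect."""
    return strongest(single_effects[s] for s in trio)


def predict_trio_from_pairs(trio, pair_effects):
    """Pairwise dominance: the trio effect equals the strongest effect among its three pairs."""
    return strongest(pair_effects[frozenset(p)] for p in combinations(trio, 2))


# Hypothetical effects on one focal species (invented values, not from the study).
single_effects = {"A": -0.9, "B": -0.3, "C": 0.2}
pair_effects = {
    frozenset("AB"): -1.0,
    frozenset("AC"): -0.7,
    frozenset("BC"): -0.2,
}

trio = ("A", "B", "C")
from_singles = predict_trio_from_singles(trio, single_effects)
from_pairs = predict_trio_from_pairs(trio, pair_effects)
```

Comparing each prediction against measured trio effects, for example by RMSD, would then indicate which level of dominance predicts better. Whether the same pattern continues for larger consortia cannot be read off such a sketch.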

      Reviewer #3 (Public Review):

      A problem in synthetic ecology is that one can't brute-force complex community design because combinatorics make it basically impossible to screen all possible communities from a bank of possible species. Therefore, we need a way to predict phenomena in complex communities from phenomena in simple communities. This paper aims to improve this predictive ability by comparing a few different simple models applied to a large dataset obtained with the use of the authors' "kchip" microfluidics device. The main question they ask is whether the effect of two species on a focal species is predicted from the mean, the sum, or the max of the effect of each single "affecting" species on the focal species. They find that the max effect is often the best predictor, in the sense of minimizing the difference between predicted effect and measured effect. They also measure single-species trait data for their library of strains, including resource niche and antibiotic resistance, and then find that Pearson correlations between distance calculations generated from these metrics and the effect of added species are weak and unpredictive. This work is largely well-done, timely and likely to be of high interest to the field, as predicting ecosystem traits from species traits is a major research aim.
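The model comparison described above can be illustrated with a minimal sketch. It assumes that the "max" (strongest) model picks the single-species effect of larger absolute magnitude and that prediction error is scored as a root-mean-square error. The function names and the three observations are hypothetical and are not data from the paper.

```python
import math


def predict_mean(a, b):
    """Mean model: the pair effect is the average of the two single effects."""
    return (a + b) / 2


def predict_sum(a, b):
    """Additive model: the pair effect is the sum of the two single effects."""
    return a + b


def predict_strongest(a, b):
    """Strongest model: the single effect with the larger absolute magnitude (assumed definition)."""
    return a if abs(a) >= abs(b) else b


def rmse(predicted, measured):
    """Root-mean-square error between predicted and measured effects."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical (effect of species 1, effect of species 2, measured pair effect).
observations = [
    (-1.0, -0.2, -1.1),
    (-0.5, -0.4, -0.6),
    (0.3, -0.8, -0.6),
]

models = {"mean": predict_mean, "sum": predict_sum, "strongest": predict_strongest}
measured = [m for _, _, m in observations]
scores = {
    name: rmse([model(a, b) for a, b, _ in observations], measured)
    for name, model in models.items()
}
best_model = min(scores, key=scores.get)
```

With these invented numbers the strongest model has the lowest error, but that outcome is built into the toy data. Scoring each focal species separately, as the reviewer does below, would simply mean running the same comparison on per-species subsets.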

      My main criticism is that the main take-home from the paper (Fig. 3B), that the strongest effect is the best predictor, is oversold. While it is true that, averaged over their six focal species, the "strongest effect" was the best overall predictor, when one looks at the species-specific data (S9), we see that it is not the best predictor for 1/3 of their focal species, and this fraction grows to 1/2 if one considers a difference in nRMSE of 0.01 to be negligible.

      As suggested, we have softened our language regarding the take-home message. This matter is addressed in detail above in response to 'Essential Revisions'. Briefly, we see that the strongest model works best when both single species have qualitatively similar effects, but is slightly less accurate when effects are mixed. We also see overall less accurate predictions for positive effects. In light of these findings, we propose that cases in which the strongest model is not the most accurate for a focal species are due to the interaction types, and are not specific to the focal species.

      We made substantial changes to the manuscript, including the first paragraph of the discussion which more accurately describes these findings and emphasizes the relevant caveats:

      "By measuring thousands of simplified microbial communities, we quantified the effects of single species, pairs, and trios on multiple focal species. The most accurate model, overall and specifically when both single species effects were negative, was the strongest effect model. This is in stark contrast to models often used in antibiotic compound combinations, despite most effects being negative, where additivity is often the default model (Bollenbach 2015). The additive model performed well for mixed effects (i.e. one negative and one positive), but only slightly better than the strongest model, and poorly when both species had effects of the same sign. When both single species’ effects were positive, the strongest model was also the best, though the difference was less pronounced and all models performed worse for these interactions. This may be due to the small effect size seen with positive effects, as when we limited negative and mixed effects to a similar range of effects strength, their accuracy dropped to similar values (Figure 3–Figure supplement 5). We posit that the difference in accuracy across species is affected mainly by the effect type dominating different focal species' interactions, rather than by inherent species traits (Figure 3–Figure supplement 6)." (Lines 288-304)

      The same criticism applies to the result from Figure 2, that pairs of affecting species have more negative effects than single species. Considered across all focal species this is true (though minor in effect size, Fig 2A). But there is only a significant effect within two individual species. Again, this points to the effects being focal-species-specific, and perhaps not as generalizable as is currently being claimed.

      Upon more rigorous analysis, and with regard to changes in the dataset after filtering, we see that the more accurate statement is that effects become stronger, not necessarily more negative (in line with the accuracy of the strongest model). The overall trend is towards more negative interactions, due to the majority of interactions being negative, but as stated this is not true for each individual focal. As such the following sentence in the manuscript has been changed:

      "The median effect on each focal was more negative by 0.28 on average, though the difference was not significant in all cases; additionally, focals with mostly positive single species interactions showed a small increase in median effect (Fig. 2D)" (Lines 151-154)

      As well as the title of this section: "Joint effects of species pairs tend to be stronger than those of individual affecting species" (Lines 127-128)

      Another thing that points to a focal-species-specific response is Fig 2D, which shows the distributions of responses of each focal species to pairs. Two of these distributions are unimodal, one appears bimodal, and three appear tri-modal. This suggests to me that the focal species respond in categorically different ways to species addition.

      We believe this distribution of pair effects is related to the distribution of single species effects, and not to the way in which different focal species respond to the addition of second species. Though this may be difficult to see from the swarm plots shown in the paper, below is a split violin plot that emphasizes this point.

      Fig R1: Distribution of single species and pair effects. Distribution of the effect of single and pairs of affecting species for each focal species individually. Dashed lines represent the median, while dotted lines the interquartile range.

      These differences occur even though the focal bacteria are all from the same family. This suggests to me that the generalizability may be even less when a more phylogenetically dispersed set of focal species are used.

      We have added the following sentence to the discussion explicitly emphasizing the phylogenetic limitations of our study:

      "Lastly, it is important to note that our focal species are all from the same order (Enterobacterales), which may also limit the purview of our findings." (Lines 364-366)

      Considering these points together, I argue that the conclusion should be shifted from "strongest effect is the best" to "in 3 of our focal species, strongest effect was the best, but this was not universal, and with only 6 focal species, we can't know if it will always be the best across a set of focal species".

      As mentioned above, we have softened our language regarding the take-home message in response to these evaluations.

      My second main criticism is that it is hard to understand exactly how the trait data were used to predict effects. It seems like it was just Pearson correlation coefficients between interspecies niche distances (or antibiotic distances) and the effect. I'm not very surprised these correlations were unpredictive, because the underlying measurements don't seem to be relevant to the environment tested. What if, rather than using niche data across 20 nutrients, only the growth data on glucose (the carbon source in the experiments) was used? I understand that in a field experiment, for example, one might not know what resources are available, and so measuring niche across 20 resources may be the best thing to do. Here though it seems imperative to test using the most relevant data.

      It is true that much of the profiling data is not directly related to the experimental conditions (different carbon sources and antibiotics), but in addition to these we do use measurements from experiments carried out in the same environment as the interaction assays (i.e. growth rate and carrying capacity when growing on glucose), which also showed poor correlation with the effects on focals. Additionally, we believe that these profiles contain relevant information regarding metabolic similarity between species (similar to metabolic models often constructed computationally). To improve clarity, we added the following sentence to the figure legend of Figure 3–Figure supplement 1:

      "The growth rate, and maximum OD shown in panel A were measured only in M9 glucose, similar to conditions used in the interaction assays." (Lines 591-592)

      Additionally and relatedly, it would be valuable to show the scatterplots leading to the conclusion that trait data were uninformative. Pearson's r only works on an assumption of linearity. But there could be strong relationships between the trait data and effect that are monotonic but not linear, or even that are non-monotonic yet still strong (e.g. U-shaped). For the first case, I recommend switching to Spearman's rho over Pearson's r, because it only assumes monotonicity, not linearity. If there are observable relationships that are not monotonic, a different test should be used.

      Per your suggestion, we have changed the measurement of correlation in this analysis from Pearson's r to Spearman's rho. As we observed similar, and still mostly weak, correlations, we did not investigate these relationships further. See Figure 3–Figure supplement 1.

      Additionally, we generated heat maps including scatterplots mapping the data leading to these correlations. We found no notable dependency in these plots, and visually they were quite crowded and difficult to interpret. As this is not the central point of our study, we ultimately decided against adding this information to the plots.
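The distinction between the two coefficients raised in this exchange can be shown with a small self-contained sketch. The data are synthetic and chosen only to illustrate the statistical point; they are not the trait or effect measurements from the study, and the helper names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Synthetic example data (not from the study).
trait_distance = [1, 2, 3, 4, 5, 6]
monotonic_effect = [math.exp(d) for d in trait_distance]    # monotonic, non-linear
u_shaped_effect = [(d - 3.5) ** 2 for d in trait_distance]  # strong, non-monotonic

rho_monotonic = spearman_rho(trait_distance, monotonic_effect)
r_monotonic = pearson_r(trait_distance, monotonic_effect)
rho_u_shaped = spearman_rho(trait_distance, u_shaped_effect)
```

In this toy case Spearman's rho captures the monotonic relationship fully while Pearson's r understates it, and neither coefficient detects the U-shaped dependence, which is why the reviewer suggests a different test for non-monotonic relationships.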

      In general, I think the analyses using the trait data were too simplistic to conclude that the trait data are not predictive.

      We agree that more sophisticated analyses may help connect species traits with their effects on focal species. In fact, other members of our research group have recently used machine learning to accomplish similar predictions (https://doi.org/10.1101/2022.08.02.502471). As such, we have changed the wording to reflect that this correlation is difficult to find using simple analyses:

      "These results indicate that it may be challenging to connect the effects of single and pairs of species on a focal strain to a specific trait of the involved strains, using simple analysis." (Lines 157-159)

    1. Author Response

      Reviewer #1 (Public Review):

      Slusarczyk et al. present a very well-written manuscript focused on understanding the mechanisms underlying aging of erythrophagocytic macrophages in the spleen (RPM) and its relationship to iron loading with age. The manuscript is diffuse with a broad swath of data elements. Importantly, the manuscript demonstrates that RPM erythrophagocytic capacity is diminished with age and is restored in aged mice fed an iron-restricted diet. In addition, the mechanism for declining RPM erythrophagocytic capacity appears to be ferroptosis-mediated, insensitive to heme as it is to iron, and to occur independently of ROS generation. These are compelling findings. However, some of the conclusions rely on conjecture, and a causal association is not clearly established. The main conclusion of the manuscript points to the accumulation of unavailable insoluble forms of iron as both causing and resulting from decreased RPM erythrophagocytic capacity.

      We are proposing that intracellular iron accumulation progresses first and leads to global proteotoxic damage and increased lipid peroxidation. This eventually triggers the death of a fraction of aging RPMs, thus promoting the formation of extracellular iron-rich protein aggregates. More explanation can be found below. Besides, iron loading suppresses the erythrophagocytic activity of RPMs, hence further contributing to their functional impairment during aging.

      In addition, the finding that IR diet leads to increased TF saturation in aged mice is surprising.

      We believe that this observation implies better mobilization of splenic iron stores, and corroborates our conclusion that mice that age on an iron-reduced diet benefit from higher iron bioavailability, although these differences are relatively mild. More explanation can be found in our replies to Reviewer #2.

      Furthermore, whether the finding in RPMs is intrinsic or related to RBC-related changes with aging is not addressed.

      We have now addressed this issue and characterized both iron and ROS levels in RBCs in more detail.

      Finally, these findings in a single strain and only female mice are intriguing but warrant tempered conclusions.

      We tempered the conclusions and provided a basic characterization of the RPM aging phenotype in Balb/c female mice.

      Major points:

      1) The main concern is that there is no clear explanation of why iron increases during aging although the authors appear to be saying that iron accumulation is both the cause of and a consequence of decreased RPM erythrophagocytic capacity. This requires more clarification of the main hypothesis on Page 4, line 17-18.

      We thank the reviewer for this comment. It was previously reported that iron accumulates substantially in the spleen during aging, especially in female mice (Altamura et al., 2014). Since RPMs are the cells that process most of the iron in the spleen, we aimed to explore the relationship between iron accumulation and RPM functions during aging. This investigation led us to uncover that iron accumulation is indeed both the cause and the consequence of RPM dysfunction. Specifically, we propose that intracellular iron loading of RPMs precedes extracellular deposition of iron in the form of protein-rich aggregates, driven by RPM damage. To support this, we now show that the proteome of RPMs overlaps with those proteins that are present in the age-triggered aggregates (Fig. 3F). Furthermore, corroborating our model, we now demonstrate that transient iron loading of RPMs via iron-dextran injection (new Fig. 3G) leads to the formation of protein-rich aggregates, closely resembling those present in aged spleens (new Fig. 3H). This implies that high iron content in RPMs is indeed a major driving factor that leads to aggregation of their proteome and cell damage. Importantly, we now supported this model with studies using iRPMs. We demonstrated that iron loading and blockage of ferroportin by synthetic mini-hepcidin (PR73) (Stefanova et al., 2018) cause protein aggregation in iRPMs and lead to their decreased viability only in cells that were exposed to heat shock, a well-established trigger of proteotoxicity (new Fig. 5K and L). We propose that these two factors, namely age-triggered decrease in protein homeostasis and exposure to excessive iron levels, act in concert and render RPMs particularly sensitive to damage during aging (see also Discussion, p. 16).

      In parallel, our data imply that the increased iron content in aged RPMs drives their decreased erythrophagocytic activity, as we now better documented by more extensive in vitro experiments in iRPMs (new Fig 6E-H). We cannot exclude that some of the senescent splenic RBCs that are retained in the red pulp and evade erythrophagocytosis due to RPM defects in aging, may also contribute to the formation of the aggregates. This is supported by the fact that mice that lack RPMs as well exhibit iron loading in the spleen (Kohyama et al., 2009; Okreglicka et al., 2021), and that the proteome of aggregates overlaps to some extent with the proteome of erythrocytes (new Fig. 3F).

      We believe that during aging intracellular iron accumulation is chiefly driven by ferroportin downregulation, as also suggested by Reviewer #3. We now show that ferroportin drops significantly already in mice aged 4 and 5 months (new Fig. 4H), preceding most of the other impairments. This drop coincides with the increase in hepcidin expression, but whether this is the sole reason for ferroportin suppression during early aging would require further investigation outside the scope of the present manuscript.

      In sum, to address this comment, we modified the fragment of the introduction that refers to our hypothesis and major findings to make it clearer (p. 4), improved our manuscript by providing the new data mentioned above, and added more explanation in the corresponding sections of the Results and Discussion.

      2) It is unclear if RPMs are in limited supply. Based on the introduction (page 4, line 13-15), they have limited self-renewal capacity and are only partially replenished by blood monocytes. Fig 4D suggests that there is a decrease in RPMs from aged mice. The %RPM from CD45+ compartment suggests that there may just be relatively more neutrophils or fewer monocytes recruited. There is not enough clarity on the meaning of this data point.

      Thank you for this comment. We fully agree that %RPMs of CD45+ splenocytes, although well-accepted in literature (Kohyama et al., 2009; Okreglicka et al., 2021), is only a relative number. Hence, we now included additional data and explanations regarding the loss of RPMs during aging.

      It was reported that the proportion of RPMs derived from bone marrow monocytes increases mildly but progressively during aging (Liu et al., 2019). This implies that due to the loss of the total RPM population, as illustrated by our data, the cells of embryonic origin are likely even more affected. We could confirm this assumption by re-analysis of the data from Liu et al. that we now included in the manuscript as Fig. 5E. These data clearly show that the representation of embryonically-derived RPMs drops more drastically than the percent of total RPMs, whereas the replenishment rate from monocytes is not affected significantly during aging. Consistent with this, we have not observed any robust change in the population of monocytes (F4/80-low, CD11b-high) or pre-RPMs (F4/80-high, CD11b-high) in the spleen at the age of 10 months (Figure 5-figure supplement 2A and B). We also have detected a mild decrease, not an increase, in the number of granulocytes (new Figure 5-figure supplement 2C). Furthermore, we measured an in situ apoptosis marker and found clear signs of apoptosis in the aged spleen (especially in the red pulp area), a phenotype that is less pronounced in mice on an IR diet (new Fig. 5O). This is consistent with the observation that apoptosis markers can be elevated in tissues upon ferroptosis induction (Friedmann Angeli et al., 2014) and that the proteotoxic stress in aged RPMs, which we now emphasized better in our manuscript, may also lead to apoptosis (Brancolini & Iuliano, 2020). Taken together, we strongly believe that the functional defect of embryonically-derived RPMs chiefly contributes to their shortage during aging.

      3) Anemia of aging is complex and poorly understood mechanistically. In general, it is considered similar to anemia of chronic inflammation with increased Epo, mild drop in Hb, and erythroid expansion, similar to ineffective erythropoiesis / low Epo responsiveness. It is not surprising that IR diet did not impact this mild anemia. However, was the MCV or MCH altered in aged and IR aged mice?

      We now included the data for hematocrit, RBC counts, MCV, and MCH in Figure 1-figure supplement 5. Hematocrit shows a similar tendency as hemoglobin levels, but the values for RBC counts, MCV, and MCH seem not to be altered. We also show now that the erythropoietic activity in the bone marrow is not affected in aged versus young mice. Taken together, the anemic phenotype in female C57BL/6J mice at this age is very mild, which we emphasized in the main text, and is likely affected by other factors than serum iron levels (p. 6).

      4) Page 6, line 23 onward: the conclusion is that KC compensate for the decreased function of RPM in the spleen, based on the expansion of KC fraction in the liver. Is there evidence that KCs are engaged in more erythrophagocytosis in aged mice? Furthermore, iron accumulation in the liver with age does not demonstrate specifically enhanced erythrophagocytosis of KC. Please clarify why liver iron accumulation would not be simply a consequence of increased parenchymal iron similar to increased splenic iron with age, independent of erythrophagocytic activity in resident macrophages in either organ.

      Thanks for these questions. For the quantification of the erythrophagocytosis rate in KC, we show, as for the RPMs (Fig. 1K), the % of PKH67-positive macrophages, following transfusion of PKH67-stained stressed RBCs (Fig. 1M). The data imply a mild (not statistically significant) drop (of approx. 30%) in EP activity. We believe that it is overridden by a more pronounced (on average, 2-fold) increase in the representation of KCs (Fig. 1N). The mechanisms of iron accumulation in the spleen and the liver are very different. In the liver, we observed iron deposition in the parenchymal cells (not non-parenchymal, new Fig. 1P) that we are currently characterizing in more detail in a parallel manuscript. Our data demonstrate a drop in transferrin saturation in aged mice. Hence, it is highly unlikely that aging would be hallmarked by the presence of circulating non-transferrin-bound iron that would be sequestered by hepatocytes, as shown previously (Jenkitkasemwong et al., 2015). Thus, the iron released locally by KCs is the most likely contributor to progressive hepatocytic iron loading during aging. The mechanism of iron delivery to hepatocytes from erythrophagocytosing KCs was demonstrated by Theurl et al. (2016), and we propose that it may be operational, although on a much more prolonged time scale, during aging. We now discussed this part better in our Results sections (p. 7).

      5) Unclear whether the effect on RPMs is intrinsic or extrinsic. Would be helpful to evaluate aged iRPMs using young RBC vs. young iRPMs using old RBCs.

      We are skeptical whether the generation of iRPM cells from aged mice would be helpful: these cells are a specific type of primary macrophage culture, derived from bone marrow monocytes with MCSF1, and exposed additionally to heme and IL-33 for 4 days. We do not expect that bone marrow monocytes are heavily affected by aging, and would thus recapitulate some aspects of aged RPMs from the spleen, especially after 8-day in vitro culture. However, to address the concerns of the reviewer, we now provide additional data regarding RBC fitness. Consistent with the time life-span experiment (Fig. 2A), we show that oxidative stress in RBCs is only increased in splenic, but not circulating RBCs (new Fig. 2C, replacing the old Fig. 2B and C). In addition, we show no signs of age-triggered iron loading in RBCs, either in the spleen (new Fig. 2F) or in the circulation (new Fig. 2B). Hence, we do not envision a possibility that RPMs become iron-loaded during aging as a result of erythrophagocytosis of iron-loaded RBCs. In support of this, we also have observed that during aging first RPMs’ FPN levels drop, afterward erythrophagocytosis rate decreases, and lastly, RBCs start to exhibit significantly increased oxidative stress (presented now in new Fig. 4H, J and K).

      6) Discussion of aggregates in the spleen of aged mice (Fig 2G-2K and Fig 3) is very descriptive and non-specific. For example, if the iron-rich aggregates are hemosiderin, a hemosiderin-specific stain would be helpful. This data specifically is correlatory and difficult to extract value from.

      Thanks for these comments. To the best of our knowledge, Prussian blue Perls’ staining (Fig. 2J) is considered a hemosiderin staining. Our investigations aimed to better understand the nature and the origin of splenic iron deposits that to some extent are referred to as hemosiderin. Most importantly, as mentioned in our reply R1 Ad. 1., to assign causality to our data, we now demonstrated that iron accumulation in RPMs in response to iron-dextran (Fig. 3G) increases lipid peroxidation (Fig. 5F), tends to provoke RPM depletion (Fig. 5G) and triggers the formation of protein-rich aggregates (new Fig. 3H). Of note, we assume that the loss of embryonically-derived RPMs in this model may be masked by simultaneous replenishment of the niche from monocytes, a phenomenon that may be addressed by future studies using Ms4a3-driven reporter mice (as shown for aged mice in our new Fig. 5E).

      7) The aging phenotype in RPMs appears to be initiated sometime after 2 months of age. However, there is some reversal of the phenotype with increasing age, e.g. Fig 4B with decreased lipid peroxidation in 9 month old relative to 6 month old RPMs. What does this mean? Why is there a partial spontaneous normalization?

      Thanks for this comment and questions. Indeed, the degree of lipid peroxidation exhibits some kinetics, suggestive of partial normalization. Of note, such a tendency is not evident for other aging phenotypes of RPMs, hence, we did not emphasize this in the original manuscript. However, in a revised version of the manuscript, we now present the re-analysis of the published data which implies that the number of embryonically-derived RPMs drops substantially between mice at 20 weeks and 36 weeks (new Fig. 5E). We think that the higher proportion of monocyte-derived RPMs in total RPM population later in aging (9 months) might be responsible for the partial alleviation of lipid peroxidation. We now discussed this possibility in the Results sections (p. 12).

      8) Does the aging phenotype in RPMs respond to ferristatin? It appears that NAC, which is a glutathione generator and can reverse ferroptosis, does not reverse the decreased RPM erythrophagocytic capacity observed with age yet the authors still propose that ferroptosis is involved. A response to ferristatin is a standard and acceptable approach to evaluating ferroptosis.

      We fully agree with the Reviewer that using ferristatin or Liproxstatin-1 would be very helpful to fully characterize a mechanism of RPM depletion in mice. However, previous in vivo studies involving Liproxstatin-1 administration required daily injections of this ferroptosis inhibitor (Friedmann Angeli et al., 2014). This would be hardly feasible during aging. Regarding the experiments involving iron-dextran injection, using Liproxstatin-1 would require additional permission from the ethical committee, which takes time to be processed and received. However, to address this question we now provide data from iRPM cell cultures (new Fig. 5K-L). In essence, our results imply that both proteotoxic stress and iron overload act in concert to trigger cytotoxicity in the in vitro RPM model. Interestingly, this phenomenon does not depend solely on the increased lipid peroxidation, but when we neutralize the latter with Liproxstatin-1, the cytotoxic effect is diminished (please see also Results on p. 13 and Discussion p. 15/16).

      9) The possible central role for HO-1 in the pathophysiology of decreased RPM erythrophagocytic capacity with age is interesting. However, it is not clear how the authors arrived at this hypothesis and would be useful to evaluate in the least whether RBCs in young vs. aged mice have more hemoglobin as these changes may be primary drivers of how much HO-1 is needed during erythrophagocytosis.

      Thanks for this comment. We got interested in HO-1 levels based on the RNA sequencing data, which detected lower Hmox-1 expression in aged RPMs (Figure 3-figure supplement 1). We now show that the content of hemoglobin is not significantly altered in aged RBCs (MCH parameter, Figure 1-figure supplement 5E), hence we do not think that this is the major driver for Hmox-1 downregulation. Likewise, the levels of the Bach1 message, a gene encoding Hmox-1 transcriptional repressor, are not significantly altered according to RNAseq data. Hence, the reason for the transcriptional downregulation of Hmox-1 is not clear. Of note, HO-1 protein levels in the total spleen are higher in aged versus young mice, and we also detected a clear appearance of its nuclear truncated and enzymatically-inactive form (see a figure below, we opt not to include this in the manuscript for better clarity). The appearance of truncated HO-1 seems to be partially rescued by the IR diet. It is well established that the nuclear form of HO-1 emerges via proteolytic cleavage and migrates to the nucleus under conditions of oxidative stress (Mascaro et al., 2021). This additionally confirms that the aging spleen is hallmarked by an increased burden of ROS. Moreover, we also detected HO-1 as one of the components of the protein iron-rich aggregates. Thus, we propose that the low levels of the cytoplasmic enzymatically active form of HO-1 in RPMs (that we preferentially detect with our intracellular staining and flow cytometry) may be underlain by its nuclear translocation and sequestration in protein aggregates that evade antibody binding. This is also supported by our observation that the protein aggregates, despite the high content of ferritin (as indicated by MS analysis), are negative for L-ferritin staining. Of note, we also cannot exclude that other cell types in the aging spleen (e.g. lymphocytes) express higher levels of HO-1 in response to splenic oxidative stress.

      Fig. Total splenic levels of HO-1 in young, aged IR and aged mice.

      Reviewer #2 (Public Review):

      Slusarczyk et al. investigate the functional impairment of red pulp macrophages (RPMs) during aging. When red blood cells (RBCs) become senescent, they are recycled by RPMs via erythrophagocytosis (EP). This leads to an increase in intracellular heme and iron both of which are cytotoxic. The authors hypothesize that the continuous processing of iron by RPMs could alter their functions in an age-dependent manner. The authors used a wide variety of models: in vivo model using female mice with standard (200ppm) and restricted (25ppm) iron diet, ex vivo model using EP with splenocytes, and in vitro model with EP using iRPMs. The authors found iron accumulation in organs but markers for serum iron deficiency. They show that during aging, RPMs have a higher labile iron pool (LIP), decreased lysosomal activity with a concomitant reduction in EP. Furthermore, aging RPMs undergo ferroptosis resulting in a non-bioavailable iron deposition as intra and extracellular aggregates. Aged mice fed with an iron restricted diet restore most of the iron-recycling capacity of RPMs even though the mild-anemia remains unchanged.

      Overall, I find the manuscript to be of significant potential interest. But there are important discrepancies that need to be first resolved. The proposed model is that during aging both EP and HO-1 expression decreases in RPMs but iron and ferroportin levels are elevated. In their model, the authors show intracellular iron-rich proteinaceous aggregates. But if HO-1 levels decrease, intracellular heme levels should increase. If Fpn levels increase, intracellular iron levels should decrease. How does LIP stay high in RPMs under these conditions? I find these to be major conflicting questions in the model.

      We thank the Reviewer for her/his valuable feedback. As we mentioned in our replies, we can only assume that a small misunderstanding in the interpretation of the presented data underlies this comment. We show that ferroportin levels in RPMs (Fig. 1F) are modulated in a manner that fully reflects the iron status of these cells (both labile and total iron levels, Figs. 1H and I). FPN levels drop in aged RPMs and are rescued when mice are maintained on a reduced iron diet. As pointed out by Reviewer #3 and explained in our replies, we believe that ferroportin levels are critical for the observed phenotypes in aging. We have now described our data more clearly to avoid any potential misinterpretation (p. 6).

      Reviewer #3 (Public Review):

      This is a comprehensive study of the effects of aging on the function of red pulp macrophages (RPM) involved in iron recycling from erythrocytes. The authors document that insoluble iron accumulates in the spleen, that RPM become functionally impaired, and that these effects can be ameliorated by an iron-restricted diet. The study is well written, carefully done, extensively documented, and its conclusions are well supported. It is a useful and important addition for at least three distinct fields: aging, iron and macrophage biology.

      The authors do not explain why an iron-restricted diet has such a strong beneficial effect on RPM aging. This is not at all obvious. I assume that the number of erythrocytes that are recycled in the spleen, and are by far the largest source of splenic iron, is not changed much by iron restriction. Is the iron retention time in macrophages changed by the diet, i.e. the recycled iron is retained for a short time when diet is iron-restricted (making hepcidin low and ferroportin high), and long time when iron is sufficient (making hepcidin high and ferroportin low)? Longer iron retention could increase damage and account for the effect. Possibly, macrophages may not empty completely of iron before having to ingest another senescent erythrocyte, and so gradually accumulate iron.

      We are very grateful to this Reviewer for emphasizing the importance of the iron export capacity of RPMs as a possible driver of the observed phenotypes. Indeed, as mentioned above, we now show in the revised version of the manuscript that ferroportin drops early during aging (revised Fig. 4). Importantly, we now also observed that iron loading and limitation of iron export from iRPMs via ferroportin aggravate the impact of heat shock (a well-accepted trigger of proteotoxicity) on both protein aggregation and cell viability (new Fig. 5K and L). Physiologically, recent findings show that aging promotes a global decrease in protein solubility [BioRxiv manuscript (Sui X. et al., 2022)], and it is very likely that the constant exposure of RPMs to high iron fluxes renders these specialized cells particularly sensitive to proteome instability. This could be further aggravated by a build-up of iron due to the drop of ferroportin early during aging, ultimately leading to the appearance of the protein aggregates as early as 5 months of age in C57BL/6J females. Based on the new data, we emphasized this model in the revised version of the manuscript (please see the Discussion on p. 16).

    1. Author Response

      Reviewer #1 (Public Review):

      1) It would be helpful to include some sort of comparison in Fig. 4, e.g. the regressions shown in Fig 3, to indicate to what extent the ICCl data corresponds to the "control range" of frequency tuning.

      Figure 4 was modified to show the frequency range typically found in the ICCls. This range is based on results from Wagner et al., 2007, which extensively surveyed ICCls responses. This modification shows that our ICCls recordings in the ruff-removed owls cover the normal frequency hearing range of the owl.

      2) A central hypothesis of the study is that the frequency preference of the high-frequency neurons is lower in ruff-removed owls because of the lowered reliability caused by a lack of the ruff. Yet, while lower, the frequency range of many neurons in juvenile and ruff-removed owls seems sufficiently high to be still responsive at 7-8 kHz. I think it would be important to know to what extent neurons are still ITD sensitive at the "unreliable high frequencies" even if the CFs are lower since the "optimization" according to reliability depends not on the best frequency of each neuron per se, but whether neurons are less ITD sensitive at the higher, less reliable frequencies.

      The concern regarding the frequency range that elicits responsivity was largely addressed above. Specifically, Figure L1, which shows the frequency tuning of frontally tuned ICx neurons in ruff-removed owls, indicates that while there is some variability of tuning across neurons, there is little responsivity above 6 kHz. In contrast, the equivalent analysis in juvenile owls (Figure L3) shows much more responsiveness and variability across neurons to high and low frequencies. This evidence supports our hypothesis that the juvenile owl brain is still highly plastic, which facilitates learning during development. Although the underlying data was already reported in Figure 7 of our previously submitted manuscript, we can include Figures L1 and L2, potentially as supplemental figures, if considered useful by editors and reviewers. Nevertheless, this argumentation was further expanded in the revised text (Line 229).

      Figure L1. Frequency tuning of frontally-tuned ICx neurons in ruff-removed owls. Tuning curves are normalized by the max response. Thick black line indicates the average tuning curve. Dashed black line indicates basal response.

      Figure L2. ITD sensitivity across frequencies in ruff-removed owl. Two example neurons shown in a and b. ITD tuning for tones (colored) and broadband (black) plotted by firing rate (non-normalized). Solid colored lines indicate responses to frequencies that are within the neuron’s preferred frequency range (i.e. above the half-height, see Methods), dashed lines indicate frequencies outside of the neuron’s frequency range.

      Figure L3. Frequency tuning of frontally-tuned ICx neurons in juvenile owls. Tuning curves are normalized by the max response. Thick black line indicates the average tuning curve. Dashed black line indicates basal response.

      3) It would be interesting to have an estimate of the time scale of experience dependency that induces tuning changes. Do the authors have any data on this question? I appreciate the authors' notion that the quantifications in Fig 7 might indicate that juvenile owls are already "beginning to be shaped by ITD reliability" (line 323 in Discussion). How many days after hearing onset would this correspond to? Does this mean that a few days will already induce changes?

      While tracking changes induced by ruff-removal over development was outside of the scope of this study, many other studies have assessed experience-dependent plasticity in the barn owl. The recordings in this study were performed approximately 20 days after hearing onset, suggesting that the juveniles had ample time to begin learning. These points were expanded upon in the discussion (Lines 254, 280-283).

      Reviewer #2 (Public Review):

      1) Why is IPD variability plotted instead of ITD variability (or indeed spatial reliability)? The relationship between these measures is likely to vary across frequency, which makes it difficult to compare ITD variability across frequency when IPDs are plotted. Normalizing data across frequencies also makes it difficult to compare different locations and acoustical conditions. For example, in Fig.1a and Fig.1b, the data shown for 3 kHz at ~160 degrees seems quantitatively and visually quite different, but the difference (in Fig.1c) appears to be negligible.

      A justification for using IPD variability as an estimate of ITD variability was added to the Introduction (Lines 55-60), Results (Line 100) and Methods (Lines 371-374) sections of the manuscript. Because ITD detection is based on phase locking by auditory nerve and ITD detector neurons tuned to narrow frequency bands, the responses that ITD detector neurons forward to downstream midbrain regions are determined by IPD variability. Additionally, ITD is calculated by dividing IPD by frequency, which makes comparisons of ITD reliability across frequency mathematically uninformative.
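
As a rough illustration of the last point (not taken from the manuscript), the sketch below converts a phase difference to a time difference. It assumes the IPD is expressed in radians, so the division by frequency includes a factor of 2π; the function name `ipd_to_itd` and the example values are hypothetical.

```python
import math


def ipd_to_itd(ipd_rad, freq_hz):
    """Convert an interaural phase difference (radians) to a time difference (seconds).

    Assumes ITD = IPD / (2 * pi * f), i.e. the IPD is given in radians.
    """
    return ipd_rad / (2 * math.pi * freq_hz)


# The same phase spread maps to a smaller time spread at a higher frequency,
# purely because of the 1/f scaling.
same_ipd_spread = 0.5  # radians, hypothetical value
itd_low = ipd_to_itd(same_ipd_spread, 2000.0)   # at 2 kHz
itd_high = ipd_to_itd(same_ipd_spread, 8000.0)  # at 8 kHz
ratio = itd_low / itd_high  # equals the frequency ratio, 8000 / 2000
```

Under this assumption, an identical IPD variability yields ITD variabilities that differ by the frequency ratio alone, which is why comparing ITD variability across frequency bands adds little beyond the IPD measure.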

      2) How well do the measures of ITD reliability used reflect real-world listening? For example, the model used to calculate ITD reliability appears to assume the same (flat) spectral profile for targets and distractors, which are presented simultaneously with the same temporal envelope, and a uniform spatial distribution of sounds across space. It is therefore unclear how robust the study's results are to violations of these assumptions.

      While we agree that our analysis cannot completely capture real-world listening for the barn owl, a general analysis using similar flat spectral profiles for targets and concurrent sounds provides a broad assessment of reliability of ITD cues. While a full recapitulation of real-world listening is beyond the scope of this study (i.e. recording natural scenes from the ear canals of wild barn owls), we included additional analyses of ITD reliability in Figure 1-figure supplement 1, described above.

      3) Does facial ruff removal produce an isolated effect on ITD variability or does it also produce changes in directional gain, and the relationship between spatial cues and sound location? Although the study considers this issue in some places (e.g. Fig.2, Fig.5), a clearer presentation of the acoustical effects of facial ruff removal and their implications (for all locations, not just those to the front), as well as an attempt to understand how these acoustical changes lead to the observed changes in ITD reliability, would greatly strengthen the study. In addition, Fig.1 shows average ITD reliability across owls, but it would be helpful to know how consistent these measures are across owls, given individual variability in Head-Related Transfer Functions (HRTFs). This potentially has implications for the electrophysiological experiments, if the HRTFs of those animals were not measured. One specific question that is potentially very relevant is whether the facial ruff attenuates sounds presented behind the animal and whether it does so in a frequency-dependent way. In addition, if facial ruff removal enables ILDs to be used for azimuth, then ITDs may also become less necessary at higher frequencies, even if their reliability remains unchanged.

      Additional analysis was conducted to characterize the changes in directional gain induced by ruff removal, and the results were added as a new figure (Fig 5). This analysis shows that changes in gain following ruff-removal are largely frequency-independent: there is a de-attenuation of peripherally and rearwardly located sounds, but the highest gain remains for high frequencies in frontal space. There is an additional increase in gain for high frequencies from rearward space, but these changes would not explain the changes in frequency tuning we report. As mentioned in new additions to the manuscript, the changes at the most rearward-located auditory spatial locations are unlikely to have an effect on the auditory midbrain. No studies in the barn owl have found neurons in the ICx or optic tectum tuned to >120° (Knudsen, 1982; Knudsen, 1984; Cazettes et al., 2014). In addition, variability of IPD reliability across owls was analyzed and reported in the amended Figure 1, which shows very little change across owls. During this analysis, we realized that the file of one of the HRTFs obtained from von Campenhausen et al., 2006 was mislabeled, which explains slight differences in the revised Fig 1b. Nevertheless, the added analysis of IPD reliability across owls indicates that the pattern in ITD reliability is stable across owls (Fig. 1d,e), which supports our decision not to record HRTFs from the owls used in this study. Finally, we added text to the Discussion clarifying that the use of ILD for azimuth would not provide the same resolution as ITD would (Lines 295-303). We also do not believe that the use of ILD for azimuth would make “ITDs… less necessary at higher frequencies”, given that the ICCls is still computing ITD at these high frequencies (Fig 4), and that ILDs also have higher resolution at higher frequencies, with and without the facial ruff (Olsen et al., 1989; Keller et al., 1998; von Campenhausen et al., 2006).

      1) It is unclear why some analyses (Fig.5, Fig.7) are focused on frontal locations and frontally-tuned neurons. It is also unclear why neurons with a best ITD of 0 are described as frontally tuned, since locations behind the animal also produce an ITD of 0. Related to this, in Fig.1, facial ruff removal appears to reduce IPD variability at low frequencies for locations to the rear (~160 degrees), where the ITD is likely to be close to 0. Neurons with a best ITD of 0 might therefore be expected to adjust their frequency tuning in opposite directions depending on whether they are tuned to frontal or rearward locations.

      An extensive explanation was added to the methods detailing why we do not believe the neurons recorded in this study are tuned to the rear. Namely, studies mapping the barn owl’s ICx and optic tectum have not reported neurons tuned to locations >120°, with the number of neurons representing a given spatial location decreasing with eccentricity (Knudsen, 1982; Knudsen, 1984; Cazettes et al., 2014). While we agree that there does seem to be a change in ITD reliability at ~160° following ruff-removal, the result is largely similar to the change that occurs in frontal space (Fig 1b), which is consistent with the ruff-removed head functioning as a sphere. Thus, we wouldn’t expect rearwardly-tuned neurons, if they could be readily found, to adjust their frequency tuning to higher frequencies. Finally, we want to clarify that we focused our analyses on frontally-tuned neurons because frontal space is where we observed the largest change in ITD reliability. Text was added to the Discussion section to clarify this point (Lines 313-321).

      2) The study suggests that information about high-frequency ITDs is not passed on to the ICX if the ICX does not contain neurons that have a high best frequency. However, neurons might be sensitive to ITDs at frequencies other than the best frequency, particularly if their frequency tuning is broader. It is also unclear whether the best frequency of a neuron always corresponds to the frequency that provides the most reliable ITD information, which the study implicitly assumes.

      The concern about ITD sensitivity at non-preferred frequencies was addressed under the essential revision #3, as well as under Reviewer 1’s concerns.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript reports a systematic study of the cortical propagation patterns of human beta bursts (~13-35Hz) generated around simple finger movements (index and middle finger button presses).

      The authors deployed a sophisticated and original methodology to measure the anatomical and dynamical characteristics of the cortical propagation of these transient events. MEG data from another study (visual discrimination task) was repurposed for the present investigation. The data sample is small (8 participants). However, beta bursts were extracted over a +/- 2s time window about each button press, from single trials, yielding the detection and analysis of hundreds of such events of interest. The main finding consists of the demonstration that the cortical activity at the source of movement-related beta bursts follows two main propagation patterns: one along an anteroposterior direction (predominantly originating from precentral motor regions), and the other along a medio-lateral (i.e., dorsolateral) direction (predominantly originating from postcentral sensory regions). Some differences are reported, post-hoc, in terms of amplitude/cortical spread/propagation velocity between pre- and post-movement beta bursts. Several control tests are conducted to ascertain the veracity of those findings, accounting for expected variations of signal-to-noise ratio across participants and sessions, cortical mesh characteristics and signal leakage expected from MEG source imaging.

      One major perceived weakness is the purely descriptive nature of the reported findings: no meaningful difference was found between bursts traveling along the two different principal modes of propagation, and importantly, no relation with behavior (response time) was found. The same stands for pre vs. post motor bursts, except for the expected finding that post-motor bursts are more frequent and tend to be of greater amplitude (yielding the observation of a so-called beta rebound, on average across trials).

      Overall, and despite substantial methodological explorations and the description of two modes of propagation, the study falls short of advancing our understanding of the functional role of movement related beta bursts.

      For these reasons, the expected impact of the study on the field may be limited. The data is also relatively limited (simple button presses), in terms of behavioral features that could be related to the neurophysiological observations. One missed opportunity to explain the functional role of the distinct propagation patterns reported would have been, for instance, to measure the cortical "destination" of their respective trajectories.

      In response to this comment, we would like to highlight two important points.

      First, our work constitutes the first non-invasive human confirmation of invasive work in animals (Balasubramanian et al., 2020; Best et al., 2016; Roberts et al., 2019; Rubino et al., 2006; Rule et al., 2018; Takahashi et al., 2011, 2015) and patients (Takahashi et al., 2011). Thus, these results bridge between recordings limited to the size of multielectrode arrays (roughly 0.16 cm2; Balasubramanian et al., 2020; Best et al., 2016; Rubino et al., 2006; Takahashi et al., 2011, 2015) and human EEG recordings spanning large areas of the cortex and several functionally distinct regions (Alexander et al., 2016; Stolk et al., 2019). The ability to access these neural signatures non-invasively is important for cross-species comparison. This further enables us to provide an in-depth analysis of the spatiotemporal diversity of human MEG signals and a detailed characterisation of the two propagation directions, which significantly extends previous reports. We note that their functional role also remains undetermined in these animal studies, but being able to identify these signals now in humans can provide a stepping stone for identifying their role.

      Second, and related, the reviewers are correct that we did not observe distinct propagation directions between pre- and post-movement bursts, nor a relationship with reaction time. However, such a null result is relevant, in our view, for understanding what the functional relevance of these signals, if any, might be. Recent work in macaques indicates that the spatiotemporal patterns of high-gamma activity carry kinematic information about the upcoming movement (Liang et al., 2023). The functional role of beta may therefore be more complex and not relate to reaction times or kinematics in a straightforward manner. We believe this is a relevant observation, and in keeping with the continued efforts to identify how sensorimotor beta relates to behaviour. It is increasingly clear that spatiotemporal diversity in animal recordings and human E/MEG and intracranial recordings can constitute a substantial proportion of the measured dynamics. As such, our report is relevant in narrowing down what these signals may reflect.

      Together, we think that our work provides new insights into the multidimensional and propagating features of burst activity. This is important for the entire electrophysiology community, as it transforms how we commonly analyse and interpret these important brain signals. We anticipate that our work will guide and inspire future work on the mechanistic underpinnings of these dominant neural signals. We are confident that our article has the scope to reach out to the diverse readership of eLife.

      Reviewer #2 (Public Review):

      The authors devised novel and interesting experiments using high precision human MEG to demonstrate the propagation of beta oscillation events along two axes in the brain. Using careful analysis, they show different properties of beta events pre- and post-movement, including changes in amplitude. Due to beta's prominent role in motor system dynamics, these changes are therefore linked to behavior and offer insights into the mechanisms leading to movement. The linking of wave-like phenomena and transient dynamics in the brain offers new insight into two paradigms about neural dynamics, offering new ways to think about each phenomenon on its own.

      Although there is a substantial, and recent, body of literature supporting the conclusions that beta and other neural oscillations are transient, care must be taken when analyzing the data and the resulting conclusions about beta properties in both time and space. For example, modifying the threshold at which beta events are detected could alter their reported properties and expression in space and time. The authors should therefore perform parameter sweeps on e.g. the thresholds for detection of oscillation bursts to determine whether their conclusions on beta properties and propagation hold. If this additional analysis does not change their story, it would lend confidence in the results/conclusions.

      We thank the reviewing team for this comment. As suggested, we evaluated the effect of different burst thresholds on the burst parameters.

      The threshold in the main analysis was determined empirically from the data, as in previous work (Little et al., 2019). Specifically, trial-wise power was correlated with the burst probability across a range of different threshold values (from median to median plus seven standard deviations (std), in steps of 0.25, see Figure 6-figure supplement 1). The threshold value that retained the highest correlation between trial-wise power and burst probability was used to binarize the data.
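
The selection procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB code: "burst probability" is approximated here as the fraction of samples in a trial that exceed the threshold, the median and standard deviation are computed on the pooled data, and the names `pearson` and `select_burst_threshold` are hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation; returns None if either input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def select_burst_threshold(trials, max_k=7.0, step=0.25):
    """Pick the threshold (median + k * std) that maximises the correlation
    between trial-wise power and trial-wise burst probability.

    trials: list of per-trial amplitude envelopes (lists of floats).
    Assumption: burst probability = fraction of samples above threshold.
    """
    pooled = [v for trial in trials for v in trial]
    med = statistics.median(pooled)
    sd = statistics.pstdev(pooled)
    trial_power = [statistics.fmean(t) for t in trials]

    best_k, best_r = None, None
    n_steps = int(round(max_k / step))
    for i in range(n_steps + 1):  # k = 0, 0.25, ..., 7.0
        k = i * step
        thr = med + k * sd
        burst_prob = [sum(v > thr for v in t) / len(t) for t in trials]
        r = pearson(trial_power, burst_prob)
        if r is not None and (best_r is None or r > best_r):
            best_k, best_r = k, r
    if best_k is None:
        raise ValueError("no threshold produced a defined correlation")
    return med + best_k * sd, best_k, best_r


# Synthetic example: trials differ in overall amplitude scale.
random.seed(0)
trials = [[random.expovariate(1.0) * (1 + 0.5 * (i % 5)) for _ in range(200)]
          for i in range(40)]
threshold, k, r = select_burst_threshold(trials)
```

A robustness check of the kind requested by the reviewer would then rerun the downstream burst analysis at thresholds offset from the selected `k` (for example by ±0.25 and ±0.5 standard deviations).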

      We repeated our original analysis using four additional thresholds, i.e., original threshold - 0.5 std, -0.25 std, +0.25 std, +0.5 std. As one would expect, burst threshold is negatively related to the number of bursts (i.e., higher thresholds yield fewer bursts, Figure R4a [top]), and positively related to burst amplitude (i.e., higher thresholds yield higher burst amplitudes, Figure R4a [bottom]).

      Similarly, the temporal duration of bursts and apparent spatial width are modulated by the burst threshold: lowering the threshold leads to longer temporal duration and larger apparent spatial width, while increasing the threshold leads to shorter temporal duration and smaller apparent spatial width (Figure R4b). Note that for the temporal and spectral burst characteristics, the difference to the original threshold can be numerically zero, i.e., changing the burst threshold did not lead to changes exceeding the temporal and spectral resolution of the applied time-frequency transformation (i.e., 200ms and 1Hz respectively).

      Importantly, across these threshold values, the propagation direction and propagation speed remain comparable.

      We now include this result as Figure 6-figure supplement 2 and refer to this analysis in the manuscript (page 28 line 717).

      “To explore the robustness of the results analyses were repeated using a range of thresholds (Figure 6-figure supplement 2).”

      Determining the generators of beta events at different locations is a tricky issue. The authors mentioned a single generator that is responsible for propagating beta along the two axes described. However, it is not clear through what mechanism the beta events could travel along the neural substrate without additional local generators along the way. Previous work on beta events examined how a sequence of synaptic inputs to supra and infragranular layers would contribute to a typical beta event waveform. Although it is possible other mechanisms exist, how might this work as the beta events propagate through space? Some further explanation/investigation on these issues is therefore warranted.

      Based on this and other comments (i.e., comments 7 and 8) we re-evaluated the use of the term ‘generator’ in this manuscript.

      While the term generator can be used across scales, from micro- to macroscale, for the purpose of the present paper, we believe one should differentiate at least two concepts: a) generator of beta bursts, and b) generator of travelling waves.

      We realised that in the previous version of the manuscript the term ‘generator’ was at times used without context. We removed the term where no longer necessary.

      Further, the previous version of the manuscript discussed putative generators of travelling waves (page 19f.) but not generators of beta bursts. We now address this as follows:

      “Studies using biophysical modelling have proposed that beta bursts are generated by a broad infragranular excitatory synaptic drive temporally aligned with a strong supragranular synaptic drive (Law et al., 2022; Neymotin et al., 2020; Sherman et al., 2016; Shin et al., 2017) whereby layer specific inhibition acts to stabilise beta bursts in the temporal domain (West et al., 2023). The supragranular drive is thought to originate in the thalamus (E. G. Jones, 1998, 2001; Mo & Sherman, 2019; Seedat et al., 2020), indicating thalamocortical mechanisms (page 22f).”

      Once the mechanisms have been better understood, a question arises of how much the results generalize to other oscillation frequencies and other brain areas. On the first question of other oscillation frequencies, the authors could easily test whether nearby frequency bands (alpha and low gamma) have similar properties. This would help to determine whether the observations/conclusions are unique to beta, or more generally applicable to transient bursts/waves in the brain. On the second issue of applicability to other brain areas, the authors could relate their work to transient bursts and waves recorded using ECoG and/or iEEG. Some recent work on traveling waves at the brain-wide level would be relevant for such comparisons.

      We appreciate the enthusiasm and the suggestions. To comment on the frequency specificity of the observed effects we conducted the same analysis focusing on the gamma frequency range (60-90 Hz). For computational reasons, we limited this analysis to one subject. Figure R1 shows the polar probability histogram for the beta frequency range (left) and the gamma frequency range (right). In contrast to the beta frequency range, no dominant directions were observed for the gamma range and von Mises functions did not converge. These preliminary results suggest some frequency specificity of the spatiotemporal pattern in sensorimotor beta activity. We believe this paves the way for future analysis mapping propagation direction across frequency and space.
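
For readers unfamiliar with this kind of summary, a polar probability histogram of propagation directions can be sketched as below. This is an illustrative assumption about how such a histogram might be built (equal-width angular bins, counts normalised to sum to 1); the function name, the bin count and the example angles are hypothetical, and the von Mises fitting step is not reproduced here.

```python
import math


def polar_probability_histogram(angles_rad, n_bins=36):
    """Bin directions (radians) into equal-width angular bins.

    Returns (bin_centres, probabilities); probabilities sum to 1.
    """
    if not angles_rad:
        raise ValueError("need at least one direction")
    width = 2 * math.pi / n_bins
    counts = [0] * n_bins
    for a in angles_rad:
        wrapped = a % (2 * math.pi)  # map any angle onto [0, 2*pi)
        idx = min(int(wrapped / width), n_bins - 1)  # guard against rounding up to 2*pi
        counts[idx] += 1
    total = len(angles_rad)
    probs = [c / total for c in counts]
    centres = [(i + 0.5) * width for i in range(n_bins)]
    return centres, probs


# Hypothetical directions: three near 0-1 rad, one pointing the opposite way.
centres, probs = polar_probability_histogram([0.1, 1.0, 1.1, -1.0], n_bins=4)
```

A distribution with dominant propagation directions would show clear peaks in `probs`, whereas a roughly flat histogram would be consistent with the absence of a preferred direction reported for the gamma range.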

      Here we did not investigate the spatial specificity of the effects, as the beta frequency range is dominant in sensorimotor areas. Investigating beta bursts in other cortical areas would have likely resulted in very few bursts. We discuss our results across spatial scales in the section: Distinct anatomical propagation axes of sensorimotor beta activity. However, please note that most of the previous literature operates on a different spatial scale (roughly 4mm; Balasubramanian et al., 2020; Best et al., 2016; Rubino et al., 2006; Rule et al., 2018; Takahashi et al., 2011, 2015) and different species (e.g., non-human primates). Non-invasive recordings in humans capture temporospatial patterns of a very different scale, i.e., often across the whole cortex (Alexander et al., 2016; Roberts et al., 2019). Comparing spatiotemporal patterns across different spatial scales is inherently difficult. Work investigating different spatial scales simultaneously, such as Sreekumar et al. 2020, is required to fully unpack the relationship between mesoscopic and macroscopic spatiotemporal patterns.

      Figure R1: Spatiotemporal organisation for the beta (β, 13-30 Hz) and gamma (γ, 60-90 Hz) frequency range for one exemplar subject. Same as Figure 4a, but for one exemplar subject.

      If the source code could be provided on github along with documentation and a standard "notebook" on use other researchers would benefit greatly.

      All analyses are performed using freely available tools in MATLAB. The code carrying out the analysis in this paper can be found here: [link provided upon acceptance]. The 3D burst analyses can be very computationally intensive even on a modern computer system. The analyses in this paper were computed on a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 and 32 GB of RAM. Details on the installation and setup of the dependencies can be found in the README.md file in the main study repository.

      This information has been added to the paper in the methods section on page 35.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript provides a comprehensive investigation of the effects of the genetic ablation of three different transcription factors (Srf, Mrtfa, and Mrtfb) in the inner ear hair cells. Based on the published data, the authors hypothesized that these transcription factors may be involved in the regulation of the genes essential for building the actin-rich structures at the apex of hair cells, the mechanosensory stereocilia and their mechanical support - the cuticular plate. Indeed, the authors found that two of these transcription factors (Srf and Mrtfb) are essential for the proper formation and/or maintenance of these structures in the auditory hair cells. Surprisingly, Srf- and Mrtfb-deficient hair cells exhibited somewhat similar abnormalities in the stereocilia and in the cuticular plates even though these transcription factors have very different effects on the hair cell transcriptome. Another interesting finding of this study is that the hair cell abnormalities in Srf-deficient mice could be rescued by AAV-mediated delivery of Cnn2, one of the downstream targets of Srf. However, despite a rather comprehensive assessment of the novel mouse models, the authors do not yet have any experimentally testable mechanistic model of how exactly Srf and Mrtfb contribute to the formation of actin cytoskeleton in the hair cells. The lack of any specific working model linking Srf and/or Mrtfb with stereocilia formation decreases the potential impact of this study.

      Major comments:

      Figures 1 & 3: The conclusion on abnormalities in the actin meshwork of the cuticular plate was based largely on the comparison of the intensities of phalloidin staining in separate samples from different groups. In general, any comparison of the intensity of fluorescence between different samples is unreliable, no matter how carefully one could try matching sample preparation and imaging conditions. In this case, two other techniques would be more convincing: 1) quantification of the volume of the cuticular plates from fluorescent images; and 2) direct examination of the cuticular plates by transmission electron microscopy (TEM).

      In fact, the manuscript does not provide a single TEM image of the F-actin abnormalities either in the cuticular plate or in the stereocilia, even though these abnormalities seem to be the major focus of the study. Overall, it is still unclear what exactly Srf or Mrtfb deficiencies do to F-actin in the hair cells.

      Yes, we agree. As suggested by the reviewer, to directly examine the defects in F-actin organization within the cuticular plate of mutant mice, we conducted Transmission Electron Microscopy (TEM) analyses. The results, as presented in the revised Figures 1 and 4 (panels F, G, and E, F, respectively), provide crucial insights into the structural changes in the cuticular plate. In addition, we compared the volume of the phalloidin-labeled cuticular plate after 3-D reconstruction using Imaris software, as shown in Author response image 1. The cuticular plate (CP) volume results were consistent with the relative F-actin intensity change of the cuticular plate in the revised Figures 1B and 4B. For the TEM analysis of the stereocilia, we regret that due to time constraints, we were unable to collect TEM images of stereocilia with sufficient quality for a meaningful comparison. However, we believe that the data we have presented sufficiently address the primary concerns, and we appreciate the reviewers’ understanding of these limitations.

      Author response image 1.

      Figures 2 & 4 represent another example of how deceiving a simple comparison of the intensity of fluorescence between the genotypes can be. It is not clear whether the reduced immunofluorescence of the investigated molecules (ESPN1, EPS8, GNAI3, or FSCN2) results from their mis-localization or represents a simple consequence of the fact that a thinner stereocilium would always have a smaller signal of the protein of interest, even though the ratio of this protein to the number of actin filaments remains unchanged. According to my examination of the representative images of these figures, loss of Srf produces mis-localization of the investigated proteins and irregular labeling in different stereocilia of the same bundle, while loss of Mrtfb does not. Obviously, a simple quantification of the intensity of fluorescence conceals these important differences.

      Yes, we agree. In addition to the quantification of tip protein intensity, we have added a few more analyses in the revised Figure 3 and Figure 6, such as the percentage of row 1 stereocilia with tip protein staining and the percentage of IHCs with tip protein staining on row 2 tips. With these results, the differences in expression level, row-specific distribution and irregular labeling of tip proteins between the control and the mutants can be analyzed more thoroughly.

      Reviewer #2 (Public Review):

      The analysis of bundle morphology using both confocal and SEM imaging is a strength of the paper and the authors have some nice images, especially with SEM. Still, the main weakness is that it is unclear how significant their findings are in terms of understanding bundle development; the mouse phenotypes are not distinct enough to make it clear that they serve different functions so the reader is left wondering what the main takeaway is.

      Based on the reviewer’s comments, in this revised manuscript, we put more emphasis on describing the effects of SRF and MRTFB on key tip proteins’ localization pattern during stereocilia development, represented by ESPN1, EPS8 and GNAI3, as well as the effects of SRF and MRTFB on the F-actin organization of cuticular plate using TEM. We have made substantial efforts to interpret the mechanistic underpinnings of the roles of SRF and MRTFB in hair cells. This is reflected in the revised Figures 1, 3, 4, 6, and 10, where we provide more comprehensive insights into the mechanisms at play.

      We interpret our data as indicating that SRF and MRTF regulate the development and maintenance of the hair cell’s actin cytoskeleton in a complementary manner. Deletion of either gene thus results in somewhat similar phenotypes in hair cell morphology, despite the surprising lack of overlap of SRF and MRTFB downstream targets in the hair cell.

      In Figures 1 and 3, changes in bundle morphology clearly don't occur until after P5. Widening still occurs to some extent but lengthening does not and instead the stereocilia appear to shrink in length. EPS8 levels appear to be the most reduced of all the tip proteins (Srf mutants) so I wonder if these mutants are just similar to an EPS8 KO if the loss of EPS8 occurred postnatally (P0-P5).

      To address this question, we performed EPS8 staining on the control and Srf cKO hair cells at P4 and P10. We found that the dramatic decrease of the row 1 tip signal for EPS8 started at P4 in Srf cKO IHCs. Although the major hair bundle phenotypes of Eps8 KO, including the defects of row 1 stereocilia lengthening and additional rows of short stereocilia, also appeared in Srf cKO IHCs, there are still some bundle morphology differences between Eps8 KO and Srf cKO. First, both Eps8 KO OHCs and IHCs showed additional rows of short stereocilia, but we only observed additional rows of short stereocilia in Srf cKO IHCs. Second, in the study by Zampini et al., SEM and TEM images did not show an obvious reduction of row 2 stereocilia widening (P18-P35), while our analysis of SEM images confirmed that the width of row 2 IHC stereocilia was drastically reduced by 40% in Srf cKO (P15). Overall, we think that although Srf cKO hair bundles are somewhat similar to those of Eps8 KO, the Srf cKO hair bundle phenotype might be governed by multiple candidate genes acting cooperatively.

      Reference:

      Valeria Zampini, et al. Eps8 regulates hair bundle length and functional maturation of mammalian auditory hair cells. PLoS Biol. 2011 Apr;9(4): e1001048.

      A major shortcoming is that there are few details on how the image analyses were done. Were SEM images corrected for shrinkage? How was each of the immunocytochemistry quantitations (e.g., cuticular plates for phalloidin and tip staining for antibodies) done? There are multiple ways of doing this but there are few indications in the manuscript.

      We apologize for not describing the image analysis procedures clearly enough. As described in a study from Nicolas Grillet’s group, live and mildly-fixed IHC stereocilia have similar dimensions, while SEM preparation results in a hair bundle at a 2:3 scale compared to the live preparation. In our study, the hair cells selected for SEM imaging and measurements were located in the basal turn (30-32kHz), while the hair cells selected for fluorescence-based imaging and measurements were located in the middle turn (20-24kHz) or the basal turn (32-36kHz). Although our SEM imaging and fluorescence-based imaging of basal-turn hair bundles were not from exactly the same area, the control hair bundles imaged by SEM showed a 10%-20% reduction in row 1 stereocilia length compared to the control hair bundles imaged by fluorescence (revised Figure 2 and Figure 5). Overall, our stereocilia dimension data showed appropriate shrinkage caused by the SEM preparation.

      Recognizing the need for clarity, we have provided a detailed description of our image quantification and analysis procedures in the “Materials and Methods” section, specifically under “Immunocytochemistry.” This will aid readers in understanding our methodologies and ensure transparency in our approach.

      Reference:

      Katharine K Miller, et al. Dimensions of a Living Cochlear Hair Bundle. Front Cell Dev Biol. 2021 Nov 25;9:742529.
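
      The 2:3 SEM-to-live scale cited in the response above can be applied as a simple conversion. The sketch below is only an illustration, assuming uniform shrinkage along the measured dimension; the function names and the example length are hypothetical and are not taken from the manuscript.

```python
# Minimal sketch: converting an SEM-measured length to the live scale,
# assuming the 2:3 (SEM:live) ratio quoted above applies uniformly.
# All names and the example value are hypothetical.

SEM_TO_LIVE_SCALE = 3 / 2  # live length = SEM length * 3/2


def sem_to_live(length_sem_um: float) -> float:
    """Estimate the live length (um) from an SEM-measured length (um)."""
    return length_sem_um * SEM_TO_LIVE_SCALE


def percent_reduction(reference: float, measured: float) -> float:
    """Percentage by which `measured` is smaller than `reference`."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return 100.0 * (reference - measured) / reference


# Hypothetical example: a stereocilium measured at 2.0 um by SEM.
live_estimate = sem_to_live(2.0)
shrinkage = percent_reduction(live_estimate, 2.0)
```

      Under this ratio alone, an SEM measurement is about one third shorter than the corresponding live dimension.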

      The tip protein analysis in Figs 2 and 4 is nice, but it would help if the authors showed the protein staining separately from the phalloidin so you could see how restricted to the tips it is (each in grayscale). This is especially true for the CNN2 labeling in Fig 7 as it does not look particularly tip specific in the x-y panels. It would be especially important to see the antibody staining in the reslices separate from phalloidin.

      Thank you for the suggestions. We now show tip protein staining in grayscale separately from the phalloidin in the revised Figure 3 and Figure 6. To clearly show the tip-specific localization of CNN2, we conducted CNN2 staining at different ages during hair bundle development and show CNN2 labeling in grayscale and in reslices in the revised Figure 9-figure supplement 1B.

      In Fig 6, why was the transcriptome analysis at P2 given that the phenotype in these mice occurs much later? While redoing the transcriptome analysis is probably not an option, an alternative would be to show more examples of EPS8/GNAI/CNN2 staining in the KO, but at younger ages closer to the time of PCR analysis, such as at P5. Pinpointing when the tip protein intensities start to decrease in the KOs would be useful rather than just showing one age (P10).

      We agree with the reviewer. To address this question, we have performed ESPN1, EPS8 and GNAI3 staining on the control and mutant hair cells at P4, P10 and P15 (the revised Figures 3 and 6). According to the new results, the dramatic decreases of the row 1 tip signal for ESPN1 and EPS8 started at P4 in Srf cKO IHCs, which is consistent with the appearance of the mild reduction of row 1 stereocilia length in P5 Srf cKO IHCs. For Mrtfb cKO hair cells, an obvious reduction of the row 1 tip signal for ESPN1 was not observed until P10. However, a few genes related to cell adhesion and regulation of the actin cytoskeleton were significantly down-regulated in the P2 Mrtfb-deficient hair cell transcriptome. We think that in hair cells MRTFB may not play a major role in the regulation of stereocilia development, so the morphological defects of stereocilia appeared much later in the Mrtfb mutant than in the Srf mutant.

      While it is certainly interesting if it turns out CNN2 is indeed at tips in this phase, the experiments do not tell us that much about what role CNN2 may be playing. It is notable that in Fig 7E in the control+GFP panel, CNN2 does not appear to be at the tips. Those images are at P11 whereas the images in panel A are at P6 so perhaps CNN2 decreases after the widening phase. An important missing control is the Anc80L65-Cnn2 AAV in a wild-type cochlea.

      We agree with the reviewer. We have conducted more immunostaining experiments to confirm the expression pattern of CNN2 during stereocilia development, from P0 to P11. The results are included in the revised Figure 9-figure supplement 1B. As the reviewer suggested, the CNN2 expression pattern in control cochlea injected with Anc80L65-Cnn2 AAV is also provided in the revised Figure 9E.

    1. Author Response

      Reviewer #1 (Public Review):

      This is an awesome comprehensive manuscript. Authors start by sorting putative stromal cell-containing BM non-hematopoietic (CD235a-/CD45-) plus additional CD271+/CD235a-/CD45- populations to identify nine individual stromal identities by scRNA-seq. The dual sorting strategy is a clever trick as it enriches for rare stromal (progenitor) cell signals but may suffer a certain bias towards CD271+ stromal progenitors. The lack of readable signatures already among CD45-/CD45- sorts might argue against this fear. This reviewer would appreciate a brief discussion on number & phenotype of putative additional MSSC phenotypes in light of the fact that the majority of 'blood lineage(s)'-negative scRNA-seq signatures identified blood cell progenitor identities (glycophorin A-negative & leukocyte common antigen-negative). The nine stromal cell entities share the CXCL12, VCAN, LEPR main signature. Perhaps the authors could speculate if future studies using VCAN or LEPR-based sort strategies could identify additional stromal progenitor identities?

      We would like to thank the reviewer for critically evaluating our work and for the generally positive evaluation of the paper. We apologize for the delayed resubmission; it took a long time for a specific antibody needed to complete the confocal microscopy analyses to arrive.

      The reviewer asks for a brief discussion on the cell numbers and phenotypes of MSSCs. The cell numbers and percentages of MSSCs in sorted CD45low/-CD235a- and CD45low/-CD235a-CD271+ cells can be found in Supplementary File 3, and we have added a summary of the MSSC phenotypes in the new Supplementary File 7.

      Due to the extremely low frequency of stromal cells in human bone marrow, we chose a sorting strategy that also included CD45low cells (Fig 1A) to ensure that no stromal cells were excluded from the analysis. Although stromal elements are certainly enriched using this approach, the CD45low population contains several different hematopoietic cell types. These include CD34+ HSPCs which are characterized by low CD45 expression2, as well as the CD45low-expressing fractions of other hematopoietic cell populations such as B cells, T cells, NK cells, megakaryocytes, monocytes, dendritic cells, and granulocytes. Furthermore, CD235a- late-stage erythroid progenitors, which are negative for CD45, are represented as well. Of note, our data are consistent with previously reported murine studies showing the presence of a number of hematopoietic populations in CD45- cells, which accounted for the majority of CD45-Ter119-CD31- murine BM cells3,4. However, despite a certain enrichment of stromal elements in the CD45low cell fraction, frequencies were still too low to allow for a detailed analysis of this important bone marrow compartment. This prompted us to adopt the stromal cell-enrichment strategy as described in the manuscript to achieve a better resolution of the stromal compartment. In fact, sorting based on CD45low/-CD235a-CD271+ allowed us to sufficiently enrich bone marrow stromal cells to be clearly detectable in scRNAseq analysis. According to the reviewer’s suggestion, a brief discussion on this issue is now included in the Discussion (page 28, lines 10-15).

      The reviewer also suggested using VCAN or LEPR-based sorting strategy to identify additional stromal identities in future studies.

      However, because VCAN is an extracellular matrix protein, FACS analysis of cellular VCAN expression can only be achieved based on its intracellular expression after fixation and permeabilization5,6. Additionally, while VCAN is highly and ubiquitously expressed by stromal clusters, VCAN is also expressed by monocytes (cluster 36). Therefore, VCAN is not an optimal marker to isolate viable stromal cells.

      LEPR is the marker that was reported to identify the majority of colony-forming cells in adult murine bone marrow7. We have previously reported that the majority of human adult bone marrow CFU-Fs is contained in the LEPR+ fraction8. In our current scRNAseq surface marker profiling analysis, group A cells showed high expression of several canonical stromal markers including VCAM1, PDGFRB, ENG (CD105), as well as LEPR (Fig. 4A). However, the four stromal clusters in Group A could not be separated based on the expression of LEPR. Therefore, we chose not to use LEPR as a marker to prospectively isolate the different stromal cell types.

      The authors furthermore localized CD271+, CD81+ and NCAM/CD56+ cells in BM sections in situ. Finally, referring to the strong background of the group in HSC research, in silico prediction by CellPhoneDB identified a wide range of interactions between stromal cells and hematopoietic cells. Evidence for functional interdependence of CFU-F forming cells completes the novel and clearer picture of bone marrow stromal cells.

      We thank the reviewer for the positive comments.

      An illustrative abstract naming the top9 stromal identities in their top4 clusters by their "top10 markers" + functions would be highly appreciated.

      We thank the reviewer for the suggestion. A summary of the characteristics of stromal clusters is now shown in the new Supplementary File 7, which we hope matches the reviewer’s expectations.

      Reviewer #2 (Public Review):

      Knowledge about composition and function of the different subpopulations of the hematopoietic niche of the BM is limited. Although such knowledge about the mouse BM has been accumulating in recent years, a thorough study of the human BM still needs to be performed. The present manuscript of Li and coworkers fills this gap by performing single cell RNA sequencing (scRNAseq) on control BM as well as CD271+ BM cells enriched for non-hematopoietic niche cells.

      We apologize for the delayed resubmission; it took a long time for a specific antibody needed to complete the confocal microscopy analyses to arrive. We thank the reviewer for the critical expert review and overall positive comments.

      Based on their scRNAseq, the authors propose 41 different BM cell populations, ten of which represented non-hematopoietic cells, including one endothelial cell cluster. The nine remaining skeletal subpopulations were subdivided into multipotent stromal stem cells (MSSC), four distinct populations of osteoprogenitors, one cluster of osteoblasts and three clusters of pre-fibroblasts. Using bioinformatic tools, the authors then compare their results and divisions of subpopulations to some previously published work from others and attempt to delineate lineage relationships using RNA velocity analyses. From these, they propose different paths from which MSSC enter the progenitor stages, and might differentiate into pre-osteoblasts and -fibroblasts.

      It is of interest to note that apparently adipo-primed cells may also differentiate into osteolineage cells, something that should be further explored or validated. Furthermore, although this analysis yields a large adipo-primed population, pre-adipocytes and mature adipocytes appear not to be included in the data set the authors used, which should also be explained.

      We thank the reviewer for this comment. We chose to annotate Cluster 5 as an adipo-primed cluster based on the higher expression of adipogenic differentiation markers as well as a group of stress-related transcription factors (FOS, FOSB, JUNB, EGR1) (Fig. 2B-C, Figure 2-figure supplement 1C), some of which had been shown to mark bone marrow adipogenic progenitors1. Although at considerably lower levels compared to adipogenic genes, osteogenic genes were also expressed in cluster 5 cells (Fig. 2B and D), indicating the multi-potent potential of this cluster. Therefore, our initial annotation of these cells as adipo-primed progenitors was too narrow as it did not include the possible osteogenic differentiation potential. We apologize for the confusion caused by the inappropriate annotation and, in order to avoid any further confusion, cluster 5 has now been re-annotated as ‘highly adipocytic gene-expressing progenitors’ (HAGEPs), which we believe is a better representation of the cells. We furthermore agree with the reviewer that in-vivo differentiation experiments need to be performed to address potential differentiation capacities in future studies.

      With regard to the lack of adipocytes in our data set, we described in the Materials and Methods section that human bone marrow cells were isolated based on density gradient centrifugation. After centrifugation, the mononuclear cell-containing monolayers were harvested for further analysis. However, the resulting supernatant containing mature adipocytic cells was discarded14. Therefore, adipocyte clusters were not identified in our dataset. We have amended the manuscript accordingly (page 5, line 7).

      Regarding the pre-adipocytes, we are not aware of any specific markers for pre-adipocytes in the bone marrow. We examined the only known markers (ICAM1, PPARG, FABP4) that have been shown to mark committed pre-adipocytes in human adipose tissue15. As illustrated in Fig. R1 (below), low expression of all three markers was not restricted to a single distinct cluster but could be found in almost all stromal clusters. These data thus allow us neither to confirm nor to exclude the presence of pre-adipocytes in the dataset. Due to the lack of specific markers for pre-adipocytes and the absence of mature adipocytes in the current dataset, it is therefore difficult to identify a well-defined pre-adipocyte cluster.

      Figure R1. UMAP illustration of the normalized expression of the markers for pre-adipocytes in stromal clusters.

      In addition, based on a separate analysis of surface molecules, the authors propose new markers that could be used to prospectively isolate different human subpopulations of BM niche cells by using CD52, CD81 and NCAM1 (=CD56). Indeed, these analyses yield six different populations with differential abilities to form fibroblast-like colonies and differentiate into adipo-, osteo-, and chondrogenic lineages. To explore how the scRNAseq data may help to understand regulatory processes within the BM, the authors predict possible interactions between hematopoietic and non-hematopoietic subpopulations in the BM. These should be further validated to support statements such as the suggestion in the abstract that separate CXCL12- and SPP1-regulated BM niches might exist.

      We agree with the reviewer that functional validation of the CellPhoneDB results using, for example, in vivo humanized mouse models would be needed to demonstrate the presence of different niches in the bone marrow. At this point in time we only put forward the hypothesis that different niche types exist, and we will work on providing experimental proof in our future studies.

      The scRNAseq analysis is indeed a strong and important resource, also for later studies meant to increase knowledge about the hematopoietic niche of the BM. Although the analyses using different bioinformatic tools are very helpful, they remain mostly speculative, since validatory experiments, as already mentioned, are missing. As such, I feel the authors did not succeed in achieving their goals of understanding how non-hematopoietic cells of the BM regulate the different hematopoietic processes within the BM. Nevertheless, they have created valuable resources, both in the scRNAseq data they generated, as well as the different predictions about different cell populations, their lineage relationships, and how they might interact with hematopoietic cells.

      We thank the reviewer for the appreciation of the value of this dataset. We agree with the reviewer that it is of great importance to validate the contribution of potential driver genes for stromal cell differentiation and verify the in vitro data and in-silico prediction using in-vivo models. As the main goal of the current study was to formulate hypotheses based on the scRNAseq data for future studies, we believe that in vivo validation experiments using engineered human bone marrow models or humanized bone marrow ossicles are out of the scope of the current study, but certainly need to be performed in the future.

      The impact of this work is difficult to envision, since validations still need to be performed. Also, it has to be borne in mind that humans are not mice, which can be studied in neat homogeneous inbred populations. Human populations, on the other hand, are quite diverse, so that the data generated in this manuscript and others will probably have to be combined to extrapolate data relevant to the whole of the human population. However, as it is equally difficult to generate reliable scRNAseq data from human BM, it seems likely that the data will indeed be an important resource when more data from different donors become available.

      We thank the reviewer for the generally positive evaluation of this study.

      Taken at face value, the authors provide evidence that human counterparts exist to several BM populations described in mice. In my opinion, the lineage relationships predicted using the RNA velocity analyses need more substance, as it seems the differentiation paths may diverge from what is known from mice. If so, this issue should be studied more stringently. Similarly, the paper would have been strengthened considerably if a relevant experimental validation had been attempted, perhaps by using genetically modified (knockdown) MSSC, similar to Battula et al. (doi: 10.1182/blood-2012-06-437988).

      In the study from Welner’s group, the stromal differentiation trajectory was inferred based on scRNAseq analysis of murine bone marrow cells using Velocyto16. Velocyto identified MSCs as the ‘source’ cell state with pre-adipocytes, pro-osteoblasts, and pro-chondrocytes being end states. In our study, the MSSC population was predicted to be at the apex of the trajectory and the pre-osteoblast cluster was placed close to the terminal state of differentiation, which is consistent with the murine study. However, different stromal cell types were identified in mice compared with humans. For example, we have identified pre-fibroblasts in our dataset which are absent in the murine study, while a well-defined murine pre-adipocyte population was not identified in our human dataset. Therefore, it is not surprising to find some discrepancies between human and murine stromal differentiation trajectories. Of course, and as mentioned before, critical in-vivo functional validations need to be carried out to address these important issues in the future.

      In summary, this is a very interesting but also descriptive paper with highly important resources. However, to prospectively identify or isolate human non-hematopoietic/non-endothelial niche populations, more stringent validations should have been performed to strengthen the validity of the different analyses that have been performed. As such, it remains an open question which niche subpopulations have the most impact on the different hematopoietic processes important for normal and stress hematopoiesis, as well as malignancies.

      Thank you for this comment. We completely agree that more stringent validations are necessary but are outside of the aim of our current hypothesis-generating study. Accordingly, we are planning functional verification studies using genetically manipulated stromal cells in combination with in-vivo humanized ossicles. Furthermore, other groups will hopefully use our database and contribute with functional studies in model systems that are currently not available to us, e.g. iPS-derived bone marrow in-vitro proxies.

      Specific remarks

      • Since CD45, CD235a, and CD271 are used as distinguishing markers in the sample preparation of the scRNAseq, it would be helpful to highlight these markers in the different analyses (Figures 1D, 2B, 2C-F, and 4A), and restrict the analyses to those cells that do not express CD45 and CD235a (why use CD71?) and highly express CD271.

      Thank you for this comment. As shown in Fig. R2, we have modified Fig. 1D, 2B, and 4A, which now also show the expression of PTPRC (CD45), GYPA (CD235a), and NGFR (CD271) in the top (Fig. 1D and 2B) or right (Fig. 4A) panel of the figures. To complement Fig. 2C-F, we have generated new stacked violin plots showing the expression level of the three markers in all 9 stromal clusters (Fig. R2B). As we believe that including these three markers in the figures does not improve the analyses, we decided to leave the original figures in the manuscript unchanged in this respect.

      Figure R2. (A) Modified Fig. 1D, 2B and 4A with PTPRC (CD45), GYPA (CD235a) and NGFR (CD271) expression. (B) Stacked violin plots of PTPRC, GYPA and NGFR expressed by stromal clusters to complement Fig. 2C-F.

      With regard to cell exclusion based on CD45, as shown in the modified Figure corresponding to Fig 1A in the manuscript (Fig R2A), CD45 gene expression is also observed in the endothelial cluster, basal cluster, and neuronal cluster (Fig. R2A). These clusters represent non-hematopoietic clusters that we would like to keep in our dataset for further analysis, such as cell-cell interaction. Therefore, we chose not to restrict the analysis solely to CD45 non-expressing cells.

      With regard to CD235a (GYPA), expression of CD235a is not detected in any of the non-hematopoietic clusters. Thus, CD235a-expressing cell exclusion is not necessary.

      For CD271, according to our previous results (own unpublished data, belonging to a dataset of which only significantly expressed genes were reported in Li et al.8), protein expression of CD271 is not necessarily reflected by gene expression. In other words, stromal cells with CD271 protein expression do not always have high mRNA expression. A significant fraction of stromal cells would be excluded if we restricted the analyses only to those cells that show high CD271 gene expression, which would not reflect the real cellular composition of human bone marrow stroma. In order not to risk losing stromal cells, we therefore kept our previous analyses, which included stromal cells with various CD271 expression levels.
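
      The effect of an mRNA-based cut-off described above can be illustrated with a small calculation. The sketch below is a toy example only: the counts and the threshold are invented for illustration and are not data from the study, and the function name is hypothetical.

```python
# Toy illustration: how many cells survive a minimum-expression filter
# on a single gene (here standing in for NGFR/CD271 mRNA counts).
# The counts and threshold are invented, not taken from the dataset.

def fraction_retained(gene_counts, min_count):
    """Return the fraction of cells whose count is >= min_count."""
    if not gene_counts:
        raise ValueError("no cells supplied")
    kept = sum(1 for count in gene_counts if count >= min_count)
    return kept / len(gene_counts)


# Five hypothetical protein-positive (sorted) cells with variable mRNA counts.
toy_counts = [0, 0, 1, 5, 8]
retained = fraction_retained(toy_counts, min_count=3)
```

      In this toy case only two of the five sorted cells pass the filter, which is the kind of loss the authors wish to avoid by not gating on CD271 gene expression.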

      With regard to using CD71 as an exclusion marker, please see also the comments to reviewer 1. Briefly, according to our data, CD71 (TFRC)-expressing erythroid precursors could still be found after excluding CD45 and CD235a positive cells (Figure 1-figure supplement 1B and R3). As furthermore shown in Figure 1-figure supplement 1G and R2, CD71 expression in the stromal clusters is negligible. Therefore, we believe that this justifies the use of CD71 as an additional marker to exclude erythroid cells. We have amended the discussion to address this issue (page 19, lines 7-8).

      Figure R3. FACS plots illustrating the expression of (A) CD71 (TFRC) vs CD271 in CD45-CD235a- cells and (B) FSC-A vs CD81 in CD45-CD235a-CD271+CD71+ cells following exclusion of doublets and dead cells.

      • Despite a distinct neuronal cluster (39), there does not seem to be a distinctive marker for these cells. Is this true?

      Yes, the reviewer is correct that there is no significantly-expressed distinctive marker for neuronal cells. Multiple markers indicating the presence of different cell types were identified in cluster 39 (Supplementary File 4). Among them, several neuronal markers (NEUROD1, CHGB, ELAVL2, ELAVL3, ELAVL4, STMN2, INSM1, ZIC2, NNAT) were found to be enriched in this cluster (Supplementary File 4 and Fig. 1D) with higher fold changes compared to other identified genes. However, the expression of these genes was not statistically significant, which is mainly due to the heterogeneity of the cluster and thus does not allow us to draw any firm conclusions.

      Several genes including MALAT1, HNRNPH1, AC010970.1, and AD000090.1 were identified to be statistically highly expressed by cluster 39 (Supplementary File 4). The expression of these genes is not restricted to any specific cell type. It is therefore impossible to annotate the cluster based on this and our data thus indicated that cluster 39 is a heterogeneous population containing multiple cell types. Based on the expression of neuronal markers, we nevertheless chose to annotate Cluster 39 as “neuronal” as the prominent expression of neuronal markers indicated the presence of neurons in this cluster. To be more accurate, the annotation of cluster 39 has been changed to ‘neuronal cell-containing cluster’ to correctly reflect the presence of non-neuronal gene expressing cells as well (page 29, lines 3-8).

      • Since, based on 2C and 2D, the authors are unable to distinguish adipo- from osteogenic cells, would the authors use the same molecules to distinguish different populations of 2C-D, or would they use other markers, and if so, which and why?

      We agree with the reviewer that at first glance adipo-primed (cluster 5, now annotated as “highly adipocytic gene-expressing progenitors”, HAGEPs), balanced progenitors (cluster 16), and pre-osteoblasts (cluster 38) shared a similar expression pattern according to the violin plots in Fig. 2C and 2D. However, as illustrated in the heatmap (Fig. 2B), the expression patterns of adipo-primed (HAGEP) and balanced progenitors were quite different in terms of their expression of adipogenic and osteogenic markers. Both adipogenic and osteogenic marker expression was detected in HAGEPs, balanced progenitors, and pre-osteoblasts. Thus, as violin plots summarize the overall expression levels of a certain marker in a certain cluster, these plots tend to make it more difficult to detect differential expression patterns between different clusters. In this case, the heatmap shown in Fig. 2B is a good complement to the violin plots as it shows the different expression patterns of every cell in the different stromal clusters.

      Additionally, cluster 5 showed the expression of a group of stress-related transcription factors (FOS, FOSB, JUNB, EGR1) (Fig. 2B and Figure 2-figure supplement 1C), some of which had been shown to mark bone marrow adipogenic progenitors1. The expression of the abovementioned stress-related transcription factors (putative adipogenic progenitor markers) was generally lower in cluster 38 compared to cluster 5, further demonstrating that clusters were different.

      Furthermore, there was a gradual upregulation of more mature osteogenic markers such as RUNX1, CDH11, EBF1, and EBF3 from cluster 5 to cluster 16 and finally cluster 38. As shown in Fig. 2D, the expression of these markers was higher in cluster 38 compared to cluster 5. Therefore, cluster 38 was annotated as pre-osteoblasts.

      Most of the stromal clusters form a continuum (Fig. 2A), which correlates very well with the gradual transition between cellular states during stromal cell development. It is highly unlikely that abrupt and dramatic gene expression changes would occur during the cellular state transition of cells of the same lineage. Therefore, it is not surprising that the gene expression profiles of the stromal clusters share a certain level of similarity.

      In summary, we rely on several factors to distinguish different stromal clusters, which include canonical adipo-, osteo- and chondrogenic markers, stress markers, heatmap, violin plots, and the gradual up-regulation of certain lineage-specific markers.

      To directly answer the reviewer’s question, we believe that we are able to distinguish different stromal clusters based on our data.

      • In de Jong et al., an inflammatory MSC population (iMSC) is defined. Since the Schneider group showed that inflammatory S100A8 and A9 are expressed by inflamed MSC, is it possible that some of the designated pre-fibroblasts actually correspond to these S100A8/A9-expressing iMSC?

      We thank the reviewer for raising this interesting question.

      First of all, we would like to point out that scRNAseq was performed using viably frozen bone marrow aspirates in de Jong’s study while freshly isolated bone marrows were used in our study. There might be discrepancies between frozen and fresh bone marrow samples in terms of cellular composition including stromal composition and, importantly, processing-induced stress-related gene expression profiles.

      To investigate if designated pre-fibroblasts actually correspond to iMSCs as suggested by the reviewer, we have re-examined the expression of some of the key iMSC genes as reported by de Jong et al 17. As shown in Fig. R6, the markers that can distinguish iMSC from other MSC clusters in de Jong et al. study were not exclusively expressed by pre-fibroblasts, but also by other stromal cell types including HAGEPs, balanced progenitors, and pre-osteoblasts.

      In the study by R. Schneider’s group18, significant upregulation of S100A8/S100A9 was observed in stromal cells from patients with myelofibrosis. Furthermore, base-line expression of S100A8/A9 was also observed in the fibroblast clusters in the control group, which correlates very well with our data of S100A8/9 expression in pre-fibroblasts in normal donors (Fig. 2F). Our data thus indicate – in line with Schneider’s findings - that there is a baseline level expression of S100A8/9 in fibroblasts in hematologically normal samples and that the expression of S100A8/9 is not restricted to inflamed MSC.

      In summary, the gene expression profiles observed in our study do not indicate the presence of iMSC in the healthy bone marrow.

      • Figure 3A: Do human adipo-primed cells (cluster 5) indeed differentiate into osteogenic cells (clusters 6, 38, and 39). This would be highly unexpected. Can the authors substantiate this "reliable outcome of the RNA velocity analysis"?

      Please refer to our previous responses regarding this topic. Briefly, as shown in Fig. 2B and D, both osteogenic and adipogenic genes are expressed in cluster 5, indicating the multi-potent potentials of this cluster. Although the cluster was initially annotated as adipo-primed progenitors, this was not intended to exclude the osteogenic differentiation potential of these progenitors. Nevertheless, this annotation did not correctly reflect the differentiation potential and might thus have caused confusion, for which we apologize. In order to more correctly describe the characteristics of these cells, cluster 5 has now been reannotated as ‘highly adipocytic gene-expressing progenitors (HAGEPs)’.

      In general, the outcome of the RNA velocity analysis needs to be corroborated by in-vivo differentiation experiments. But we believe that functional verification, which would be extensive, is out of the scope of the current study and we will address these questions in future studies.

      • How statistically certain are the authors, that the populations in Figure 4B as defined by flow cytometry, correspond to MSSC, adipo-primed cells, osteoprogenitors, etc., as defined by scRNAseq?

      To address this question, we sorted the A1-A4 populations and performed RT-PCR to examine the CD81 expression level in each cluster. As shown in Figure 4-figure supplement 1B, CD81 expression levels were higher in A1 and A2 compared with A3 and A4, which is consistent with the scRNAseq data that showed the highest CD81 expression in MSSCs compared to other clusters (Supplementary File 4).

      The phenotypes defined in this study allowed us to isolate different stromal cell types which demonstrated significant functional differences as described in the manuscript (page 19, lines 17-25; page 20, lines 1-11). These results, in combination with the quantitative real-time PCR results (Figure 4-figure supplement 1B), demonstrated that the A1-A4 subsets in FACS are functionally distinct populations and are likely to be – at least in large parts – identical or equivalent to the transcriptionally identified clusters in group A stromal cells. However, at this point, we have not performed the experiments (scRNAseq of sorted cells) that would be required to confirm this statement statistically.

      • The immunohistochemistry results shown do not allow distinct conclusions as the colors give unequivocal mix-colors, and surface expression cannot be distinguished from intracellular expression. Please use a 3D (confocal) method for such statements.

      We thank the reviewer for the suggestion and we have performed additional confocal microscopy analysis of human bone marrow biopsies as suggested by the reviewer. Representative confocal images are now presented in the middle and right panel of Fig. 6E. We also include a separate file (Supplemental confocal image file). Here, confocal scans of all marker combinations are shown as ortho views in addition to detailed intensity profile analyses of the cells of interest, clearly distinguishing surface staining from intracellular staining.

      Confocal analysis of bone marrow biopsies confirmed our findings presented in the manuscript. As observed in the scanning images, CD271-expressing cells were negative for CD45 and were located in perivascular, endosteal, and peri-adipocytic regions. CD271/CD81 double-positive cells could be found either in the peri-adipocytic regions or perivascular regions while CD271/NCAM1 double-positive cells were exclusively situated at the bone-lining endosteal regions. The results of the confocal analysis have been added to the revised manuscript (page 21, lines 15-17).

      • Figure 5A: as all cells seem to interact with all other cells, this figure does not convey relevant information about BM regions using for instance CXCL12 or SPP1. Please reanalyze to show specificity of the interactions of the single clusters. Also, since it is unlikely the CellPhoneDB2-predicted interactions are restricted to hematopoietic responders, please also describe the possible interactions between non-hematopoietic cells.

      Fig. 5A was used to demonstrate the complexity of the interactions between hematopoietic cells and stromal cells.

      To gain a more detailed understanding of the interactions, we also performed an analysis with the top-listed ligand-receptor pairs as shown in Fig. 5B-C and Figure 5-figure supplement 1B. Here, each dot represents the interaction of a specific ligand-receptor pair listed on the x-axis between the two individual clusters indicated in the y-axis, which we believe shows what the reviewer is asking for.

      The specificity of the interactions between single clusters was shown in Fig. 5B-C and Figure 5-figure supplement 1B. The CXCL12- and SPP1-mediated interactions between MSSC/OC and hematopoietic clusters clearly suggested stromal cell type-specific interactions.

      Regarding non-hematopoietic cells, both inter- and intra-stromal interactions were identified to be operative between different stromal subsets as well as within the same stromal cell population as shown in Figure 5-figure supplement 3B. In addition, we have also analyzed the interaction pattern between endothelial cells and hematopoietic cells as shown in Fig. 7A, and thus we believe that we have sufficiently described these interactions as requested by the reviewer.

    1. Author Response

      Reviewer #2 (Public Review):

      Point 1: The transcriptomic analysis of E12.5 endocardial cushion cells in the various mouse models is informative in the extraction of Igf2- and H19-specific gene functions. In Fig. 6D, a huge sex effect is obvious with many more DEGs in female embryos compared to males. How can this be explained given that Igf2/H19 reside on Chr7 and do not primarily affect gene expression on the X chromosome? Is any chromosomal bias observed in the genomic distribution of DEGs?

      We examined the chromosomal distribution of DEGs between WT and +/hIC1 (Supplemental Figure 6D) and did not see any bias on the X chromosome. We described this result on lines 278-280: “Although the number of +/hIC1-specific DEGs largely differed between males and females, there was no sex-specific bias on the X chromosome (Supplemental Figure 6D).” Additionally, we agree with the reviewer that it is noteworthy that the dysregulated H19/Igf2 expression affected the transcriptome in a sex-specific manner, especially when the mutation is located on an autosome. Although investigating the role of hormones versus sex chromosomes in these effects would be quite interesting, it is beyond the scope of the current study.
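      A check of this kind can be sketched as follows. This is a minimal illustration, not the analysis behind Supplemental Figure 6D: the function names and the use of all tested genes as the background set are assumptions made for the example.

```python
from collections import Counter


def x_fraction(chromosomes):
    """Fraction of genes in the list that are annotated to the X chromosome."""
    counts = Counter(chromosomes)
    total = sum(counts.values())
    return counts.get("X", 0) / total if total else 0.0


def x_enrichment(deg_chromosomes, background_chromosomes):
    """Ratio of the X fraction among DEGs to the X fraction in the background.

    A value near 1 suggests no X-chromosome bias. The background here is
    assumed to be the chromosome labels of all genes that were tested.
    """
    background = x_fraction(background_chromosomes)
    if background == 0:
        return float("nan")
    return x_fraction(deg_chromosomes) / background
```

      Computing the ratio separately for male and female DEG lists would show whether the larger number of female DEGs is concentrated on the X chromosome; a formal test (for example a hypergeometric test) would be needed to attach a p-value.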

      Point 2: A separate issue is raised by Fig. 6E that shows a most dramatic dysregulation of a single gene in the delta3.8/hIC1 "rescue" model. Interestingly, this gene is Shh. Hence, these embryos should exhibit some dramatic skeletal abnormalities or other defects linked to sonic hedgehog function.

      The reason why Shh appeared to be differentially expressed between wild-type and d3.8/hIC1 samples was that Shh expression was 0 across all the samples except for two wild-type samples. In order to detect all the DEGs that might be lowly expressed, we did not want to filter DEGs based on the level of total expression. As a result, Shh was represented as significantly differentially expressed in d3.8/hIC1 samples, although its expression in our samples appears to be too low to have any significant effect on development. This explanation was added to lines 310-312. To confirm that this was an exceptional case, we analyzed the expression of DEGs obtained from other pairwise comparisons. In the volcano plots below, genes whose expression is not statistically different between two groups are marked grey. Genes whose expression is statistically different and detected in both groups are marked red. Genes whose expression is statistically different but not detected at all in one group, such as Shh, are marked blue (Figure G). It is clear that almost all of our DEGs are expressed consistently across the groups, and genes with no expression detected in one group are very rare.
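      The three-way colour coding described above can be written as a small rule. The sketch below is illustrative only: the `GeneResult` container, the field names, and the 0.05 significance threshold are assumptions, not details taken from the analysis pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneResult:
    name: str
    padj: float             # adjusted p-value from the pairwise comparison
    counts_a: List[float]   # per-sample expression in group A
    counts_b: List[float]   # per-sample expression in group B


def volcano_colour(gene: GeneResult, alpha: float = 0.05) -> str:
    """Return 'grey', 'red', or 'blue' following the scheme in the text.

    alpha is a hypothetical significance cutoff chosen for this example.
    """
    if gene.padj >= alpha:
        return "grey"   # not statistically different
    undetected_a = all(c == 0 for c in gene.counts_a)
    undetected_b = all(c == 0 for c in gene.counts_b)
    if undetected_a or undetected_b:
        return "blue"   # significant, but absent from one group entirely
    return "red"        # significant and detected in both groups
```

      A gene with the Shh pattern (zero in every sample of one group, non-zero in only a couple of samples of the other) falls into the blue class, which makes such low-expression cases easy to count and inspect separately.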

      Point 3: The placental analysis needs to be strengthened. Placentas should be consistently positioned with the decidua facing up, and the chorionic plate down. The placentas in Fig. 3F are sectioned at an angle and the chorionic plate is missing. These images must be replaced with better histological sections.

      As requested, we have replaced placental images with better representative sections (Figure 3F and 4E). In addition, we have improved alignment of placental histology figures.

      Point 4: The CD34 staining has not worked and does not show any fetal vasculature, in particular not in the WT sample.

      As requested, we have replaced the CD34 vascular stained images with those that better represent fetal vasculature (Figure 3G).

      Point 5: The "thrombi" highlighted in Fig. 4E are well within the normal range, to make the point that these are persistent abnormalities more thorough measurements would need to be performed (number, size, etc).

      As requested, we measured the number and relative size of the thrombi that are found in dH19/hIC1 placentas with lesions. No thrombi were found in wild-type placentas, whereas an average of 1.3 thrombi were found in six dH19/hIC1 placentas. The size of the thrombi varied widely, but they occupied an average of 2.58% of the labyrinth zone where these lesions were found (Supplemental Figure 4D). Additionally, we replaced the image in Figure 4E with a section that better represents the lesion.

      Point 6: The statement that H19 is disproportionately contributing to the labyrinth phenotype (lines 154/155) is not warranted as Igf2 expression is reduced to virtually nothing in these mice. Even though there is more H19 in the labyrinth than in the junctional zone, the phenotype may still be driven by a loss of Igf2. Given the quasi Igf2-null situation in +/hIC1 mice, is the glycogen cell type phenotype recapitulated in these mice, and how do glycogen numbers compare in the other mouse models?

      The sentence was edited in line 157. We performed Periodic acid Schiff (PAS) staining on +/hIC1 placentas to address whether glycogen cells are affected by abnormal H19/Igf2 expression (Supplemental Figure 1E). In contrast to previous reports where Igf2-null mice had lower placental glycogen concentration (Lopez et al., 1996) and H19 deletion led to increased placental glycogen storage (Esquiliano et al., 2009), our quantification of PAS-stained images showed that the glycogen content is not significantly different between wild-type and +/hIC1 placentas. We have described this result in lines 166-168.

      Point 7: How do delta3.8/+ and delta3.8/hIC1 mice with a VSD survive? Is it resolved some time after birth such that heart function is compatible with postnatal viability? And more importantly, do H19 expression levels correlate with phenotype severity on an individual basis?

      Our study was limited to phenotypes prior to birth, thus postnatal/adult phenotypes were not examined. Because the VSD showed only partial penetrance in these mice, we cannot state that the d3.8/+ or d3.8/hIC1 mice with VSDs survive. It has also been previously reported in another mouse model with incomplete penetrance of a VSD that the mice which survived to adulthood did not have VSDs (Sakata et al., 2002). We find it highly unlikely that either mouse model would survive significantly past the postnatal timepoint with a VSD. We have examined two PN0 d3.8/hIC1 neonates, and neither had a VSD.

      Regarding the second point, the only way to quantitatively address this question would be to do qPCR or RNA-seq on individual hearts, which then makes it impossible for those hearts to be examined by histology to confirm the VSD. Thus, hearts used to identify VSDs via histology could not also be used for quantitative H19 measurements. One thing to note is that the H19/Igf2 expression in independent replicates of d3.8/hIC1 cardiac ECs used in our RNA-seq experiment is quite variable, not clustering together, in contrast to other mouse models used in this study (Fig. 6A). Such a wide range of variability in the extent of H19/Igf2 dysregulation suggests that H19/Igf2 levels could have an impact on the penetrance or the severity of the VSD phenotype in d3.8/hIC1 embryos.

    1. Author Response

      Reviewer #2 (Public Review):

      Weaknesses:

      1) The relevance of the LPS-induced calvarial osteolysis model is not clear. Calvaria is mostly composed of cortical bone-like structures lacking marrow space, though small marrow space exists near the suture. Osteolysis appears to occur in areas apart from where marrow is located. The authors did not show in the manuscript which cells Adipoq-Cre marks in the calvaria.

      We have shown in a recent publication that MALPs exist in the calvarial bone marrow (2). As shown in Fig. R1A, Td+ cells are present in the calvarial bone marrow, which is surrounded by a thin layer of cortical bone (Fig. R1B, blue arrows). In WT mice, after LPS injection, the normal bone structure, including suture and cortical bone, was mostly eroded and filled with inflammatory cells (green arrows). Thus, osteolysis does occur in the area where bone marrow is originally located. In contrast, calvarial bone structure was preserved in the CKO mice, demonstrating that Csf1 deficiency in MALPs suppresses LPS-induced osteolysis. We included the H&E staining data in the revised manuscript:

      "H&E staining showed that calvarial bone marrow is surrounded by a thin layer of cortical bone (Fig. 5C). After the LPS injection, normal calvarial structure, including suture and cortical bone, were mostly eroded and filled with inflammatory cells in WT mice, but unaltered in CKO mice."

      Figure R1. Calvarial bone marrow structure. (A) Representative coronal section of 1.5-month-old Adipoq/Td mouse calvaria. Bone surfaces are outlined by dashed lines. Boxed areas in the low magnification image (top) are enlarged to show periosteum (bottom left), suture (bottom middle), and bone marrow (BM, bottom right) regions. Red: Td; Blue: DAPI. Adapted from our previous publication (2). (B) H&E staining of coronal sections of WT and Csf1 CKOAdipoq mice after LPS injection. Blue arrows point to bone marrow space close to suture (indicated by *). Green arrows point to the osteolytic lesion where cortical bone was eroded and the space was filled with inflammatory cells.

      2) Although the contrast between the two Csf1 conditional deletion models (Adipoq-Cre and Prx1-Cre) is very interesting, the relationship between these two cell populations are not well described. The authors did not clarify if MALPs are also targeted by Prx1-Cre, or these two cell types are from different cell lineages. "Other mesenchymal lineage cells" in the subtitle is not extremely helpful to place this finding in context.

      We thank the Reviewer for this comment. The original article describing the construction of the Prx1-Cre mouse line demonstrates that Prx1-Cre targets all mesenchymal cells in the limb bud as early as 10.5 dpc (10). This early expression pattern ensures that all bone marrow mesenchymal lineage cells, including MALPs, are targeted by Prx1-Cre. In addition, based on our scRNA-seq data (1), Adipoq is mainly expressed in MALPs, while Prrx1 (Prx1) is highly expressed not only in MALPs but also in EMPs, IMPs, LMPs, LCPs, and OBs (Fig. R2). Thus, the fact that Prx1-Cre driven CKO mice have much more severe bone phenotypes than Adipoq-Cre driven CKO mice indicates that mesenchymal lineage cells other than MALPs also contribute Csf1 to regulate bone resorption. To avoid confusion, we changed the title and the first sentence in the Results section about Prx1 mice to the following:

      "Csf1 from mesenchymal lineage cells other than MALPs regulate bone structure.

      To explore whether Csf1 from MALPs plays a dominant role in regulating bone structure, we generated Prx1-Cre Csf1flox/flox (Csf1 CKOPrx1) mice to knockout Csf1 in all mesenchymal lineage cells in bone (10), including MALPs."

      Figure R2. Dotplot of Prrx1 and Adipoq expression in bone marrow mesenchymal lineage cells based on our scRNA-seq analysis of 1-month-old mice.

      3) The data supporting defective bone marrow hematopoiesis in Csf1 CKO mice are not particularly strong. They observed a reduction in bone marrow cellularity, but this was only associated with an expected reduction in macrophages and a mild reduction in overall HSPC populations. More in-depth analyses might be required to define mechanisms underlying reduced bone marrow cellularity in CKO mice.

      We thank the Reviewer for this constructive comment. Accordingly, we performed a thorough analysis of bone marrow hematopoietic compartments and observed significant decreases of monocytes and erythroid progenitors in CKO mice compared to WT mice. These results are now included as Fig. 6E.

      4) Some of the phenotypic analyses are still incomplete. The authors did not report whether CHet (Adipoq-Cre Csf1(flox/+)) showed any bone phenotype. Further, the authors did not report whether Csf1 mRNA or M-Csf protein is indeed expressed by MALPs, with current evidence solely reliant on scRNAseq and qPCR data of bulk-isolated cells. More specific histological methods will be helpful to support the premise of the study.

      A pilot microCT study revealed the same femoral trabecular bone structure in WT and Adipoq-Cre Csf1flox/+ (Csf1 Het) mice at 3 months of age (Fig. R3). While the sample number for Het is low, we are confident about this conclusion.

      Figure R3. MicroCT measurement of trabecular bone structural parameters from WT and Csf1 Het mice. BV/TV: bone volume fraction; BMD: bone mineral density; Tb.N: trabecular number; Tb.Th: trabecular thickness; Tb.Sp: trabecular separation; SMI: structural model index. n=3-8 mice/group.

    1. Author response:

      Reviewer #1 (Public Review):

      In this paper, Tompary & Davachi present work looking at how memories become integrated over time in the brain, and relating those mechanisms to responses on a priming task as a behavioral measure of memory linkage. They find that remotely but not recently formed memories are behaviorally linked and that this is associated with a change in the neural representation in mPFC. They also find that the same behavioral outcomes are associated with the increased coupling of the posterior hippocampus with category-sensitive parts of the neocortex (LOC) during a post-learning rest period-again only for remotely learned information. There was also correspondence in rest connectivity (posterior hippocampus-LOC) and representational change (mPFC) such that for remote memories specifically, the initial post-learning connectivity enhancement during rest related to longer-term mPFC representational change.

      This work has many strengths. The topic of this paper is very interesting, and the data provide a really nice package in terms of providing a mechanistic account of how memories become integrated over a delay. The paper is also exceptionally well-written and a pleasure to read. There are two studies, including one large behavioral study, and the findings replicate in the smaller fMRI sample. I do however have two fairly substantive concerns about the analytic approach, where more data will be required before we can know whether the interpretations are an appropriate reflection of the findings. These and other concerns are described below.

      Thank you for the positive comments! We are proud of this work, and we feel that the paper is greatly strengthened by the revisions we made in response to your feedback. Please see below for specific changes that we’ve made.

      1) One major concern relates to the lack of a pre-encoding baseline scan prior to recent learning.

      a) First, I think it would be helpful if the authors could clarify why there was no pre-learning rest scan dedicated to the recent condition. Was this simply a feasibility consideration, or were there theoretical reasons why this would be less "clean"? Including this information in the paper would be helpful for context. Apologies if I missed this detail in the paper.

      This is a great point and something that we struggled with when developing this experiment. We considered several factors when deciding whether to include a pre-learning baseline on day two. First, the day 2 scan session was longer than that of day 1 because it included the recognition priming and explicit memory tasks, and the addition of a baseline scan would have made the session longer than a typical scan session – about 2 hours in the scanner in total – and we were concerned that participant engagement would be difficult to sustain across a longer session. Second, we anticipated that the pre-learning scan would not have been a ‘clean’ measure of baseline processing, but rather would include signal related to post-learning processing of the day 1 sequences, as multi-variate reactivation of learned stimuli has been observed in rest scans collected 24 hours after learning (Schlichting & Preston, 2014). We have added these considerations to the Discussion (page 39, lines 1047-1070).

      b) Second, I was hoping the authors could speak to what they think is reflected in the post-encoding "recent" scan. Is it possible that these data could also reflect the processing of the remote memories? I think, though am not positive, that the authors may be alluding to this in the penultimate paragraph of the discussion (p. 33) when noting the LOC-mPFC connectivity findings. Could there be the reinstatement of the old memories due to being back in the same experimental context and so forth? I wonder the extent to which the authors think the data from this scan can be reflected as strictly reflecting recent memories, particularly given it is relative to the pre-encoding baseline from before the remote memories, as well (and therefore in theory could reflect both the remote + recent). (I should also acknowledge that, if it is the case that the authors think there might be some remote memory processing during the recent learning session in general, a pre-learning rest scan might not have been "clean" either, in that it could have reflected some processing of the remote memories-i.e., perhaps a clean pre-learning scan for the recent learning session related to point 1a is simply not possible.)

      We propose that, theoretically, the post-learning recent scan could indeed reflect a mixture of remote and recent sequences. This is one of the drawbacks of splitting encoding into two sessions rather than combining encoding into one session and splitting retrieval into an immediate and delayed session; any rest scans that are collected on Day 2 may have signal that relates to processing of the Day 1 remote sequences, which is why we decided against the pre-learning baseline for Day 2, as you had noted.

      You are correct that we alluded to this in our original submission when discussing the LOC-mPFC coupling result, and we have taken steps to discuss this more explicitly. In brief, we find greater LOC-mPFC connectivity only after recent learning relative to the pre-learning baseline, and cortical-cortical connectivity could be indicative of processing memories that already have undergone some consolidation (Takashima et al., 2009; Smith et al., 2010). From another vantage point, the mPFC representation of Day 1 learning may have led to increased connectivity with LOC on Day 2 due to Day 1 learning beginning to resemble consolidated prior knowledge (van Kesteren et al., 2010). While this effect is consistent with prior literature and theory, it's unclear why we would find evidence of processing of the remote memories and not the recent memories. Furthermore, the change in LOC-mPFC connectivity in this scan did not correlate with memory behaviors from either learning session, which could be because signal from this scan reflects a mix of processing of the two different learning sessions. With these ideas in mind, we have fleshed out the discussion of the post-encoding ‘recent’ scan in the Discussion (page 38-39, lines 1039-1044).

      c) Third, I am thinking about how both of the above issues might relate to the authors' findings, and would love to see more added to the paper to address this point. Specifically, I assume there are fluctuations in baseline connectivity profile across days within a person, such that the pre-learning connectivity on day 1 might be different from on day 2. Given that, and the lack of a pre-learning connectivity measure on day 2, it would logically follow that the measure of connectivity change from pre- to post-learning is going to be cleaner for the remote memories. In other words, could the lack of connectivity change observed for the recent scan simply be due to the lack of a within-day baseline? Given that otherwise, the post-learning rest should be the same in that it is an immediate reflection of how connectivity changes as a function of learning (depending on whether the authors think that the "recent" scan is actually reflecting "recent + remote"), it seems odd that they both don't show the same corresponding increase in connectivity-which makes me think it may be a baseline difference. I am not sure if this is what the authors are implying when they talk about how day 1 is most similar to prior investigation on p. 20, but if so it might be helpful to state that directly.

      We agree that it is puzzling that hippocampal-LOC connectivity does not also increase after recent learning, as it does after remote learning. However, the fact that there is an increase from baseline rest to post-recent rest in mPFC – LOC connectivity suggests that it’s not an issue with baseline, but rather that the post-recent learning scan is reflecting processing of the remote memories (although as a caveat, there is no relationship with priming).

      On what is now page 23, we were referring to the notion that the Day 1 procedure (baseline rest, learning, post-learning rest) is the most straightforward replication of past work that finds a relationship between hippocampal-cortical coupling and later memory. In contrast, the Day 2 learning and rest scan are less ‘clean’ of a replication in that they are taking place in the shadow of Day 1 learning. We have clarified this in the Results (page 23, lines 597-598).

      d) Fourth and very related to my point 1c, I wonder if the lack of correlations for the recent scan with behavior is interpretable, or if it might just be that this is a noisy measure due to imperfect baseline correction. Do the authors have any data or logic they might be able to provide that could speak to these points? One thing that comes to mind is seeing whether the raw post-learning connectivity values (separately for both recent and remote) show the same pattern as the different scores. However, the authors may come up with other clever ways to address this point. If not, it might be worth acknowledging this interpretive challenge in the Discussion.

      We thought of three different approaches that could help us to understand whether the lack of correlations between coupling and behavior in the recent scan was due to noise. First, we correlated recognition priming with raw hippocampal-LOC coupling separately for pre- and post-learning scans, as in Author response image 1:

      Author response image 1.

      Note that the post-learning chart depicts the relationship between post-remote coupling and remote priming and between post-recent coupling and recent priming (middle). Essentially, post-recent learning coupling did not relate to priming of recently learned sequences (middle; green) while there remains a trend for a relationship between post-remote coupling and priming for remotely learned sequences (middle; blue). However, the significant relationship between coupling and priming that we reported in the paper (right, blue) is driven both by the initial negative relationship that is observed in the pre-learning scan and the positive relationship in the post-remote learning scan. This highlights the importance of using a change score, as there may be spurious initial relationships between connectivity profiles and to-be-learned information that would then mask any learning- and consolidation-related changes.
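      The logic of the change score can be illustrated with a toy calculation. The sketch below is a minimal example with made-up numbers; the function names are hypothetical, and no preprocessing of the coupling values (for example a Fisher transform) is assumed.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coupling_change(pre, post):
    """Per-participant change in coupling: post-learning minus pre-learning."""
    return [b - a for a, b in zip(pre, post)]


def change_vs_priming(pre, post, priming):
    """Correlate the change score with the behavioral priming measure."""
    return pearson_r(coupling_change(pre, post), priming)
```

      Correlating `pre` and `post` with priming separately, and then the change score, shows how an initial negative relationship and a later positive one both feed into the change-score correlation, which is the pattern described for the remote condition.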

      We also reasoned that if comparisons between the post-recent learning scan and the baseline scan are noisier than between the post-remote learning and baseline scan, there may be differences in the variance of the change scores across participants, such that changes in coupling from baseline to post-recent rest may be more variable than changes in coupling from baseline to post-remote rest. We conducted an F-test to compare the variance of the change in these two hippocampal-LOC correlations and found no reliable difference (ratio of variances: F(22, 22) = 0.811, p = .63).
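
      As a minimal sketch of how a two-sided p-value can be obtained from a variance ratio such as the one above (this is not the code used for the reported analysis), the following integrates the F density numerically using only the Python standard library. The function names and the choice of Simpson's rule are assumptions made for illustration; the only inputs taken from the text are the reported ratio and its degrees of freedom.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x <= 0.0:
        return 0.0  # correct at x = 0 whenever d1 > 2, which holds here
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = (0.5 * (d1 * math.log(d1 * x) + d2 * math.log(d2)
                      - (d1 + d2) * math.log(d1 * x + d2))
               - math.log(x) - log_beta)
    return math.exp(log_pdf)


def f_cdf(x, d1, d2, steps=2000):
    """P(F <= x), approximated with composite Simpson's rule (steps must be even)."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3


def two_sided_variance_ratio_p(f_ratio, d1, d2):
    """Two-sided p-value: twice the smaller tail probability, capped at 1."""
    lower_tail = f_cdf(f_ratio, d1, d2)
    return min(1.0, 2 * min(lower_tail, 1 - lower_tail))


# Reported statistic from the text: F(22, 22) = 0.811
p_value = two_sided_variance_ratio_p(0.811, 22, 22)
print(round(p_value, 2))
```

      With equal degrees of freedom the F distribution has a median of 1, which gives a quick sanity check on the integration.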

      Finally, we evaluated whether hippocampal-LOC coupling was more stable across participants when compared across two rest scans within the same imaging session (baseline and post-remote) versus across two scans from separate sessions (baseline and post-recent; see Author response image 2). We reasoned that if such coupling was more correlated across baseline and post-remote scans relative to baseline and post-recent scans, that would indicate a within-session stability of participants’ connectivity profiles. At the same time, less correlation of coupling across baseline and post-recent scans would be an indication of a noisier change measure, as the measure would additionally include a change in individuals’ connectivity profile over time. We found no difference in the correlation of hippocampal-LOC coupling across sessions, and the correlation was not reliable for either session (baseline/post-remote: r = 0.03, p = 0.89; baseline/post-recent: r = 0.07, p = .74; difference: Steiger’s t = 0.12, p = 0.9).
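
      For readers unfamiliar with comparing two dependent correlations that share one variable (here, baseline coupling), the sketch below shows one common form of the test, the Williams t statistic recommended by Steiger (1980). It is an illustration only and may differ from the exact variant used for the value reported above. The correlation between the two post-learning scans (`r23`) is not reported in this response, so the value passed in the example is a hypothetical placeholder, and `n = 23` is inferred from the degrees of freedom of the F-test rather than stated here.

```python
import math


def williams_t(r12, r13, r23, n):
    """Williams' t for H0: rho12 == rho13, where variable 1 is shared.

    r12, r13: the two correlations being compared
    r23:      correlation between the two non-shared variables
    n:        number of participants
    Returns (t, degrees_of_freedom) with degrees_of_freedom = n - 3.
    """
    # Determinant of the 3 x 3 correlation matrix
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denominator = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denominator)
    return t, n - 3


# r12 and r13 are the reported correlations; r23 = 0.0 is a HYPOTHETICAL value.
t_stat, df = williams_t(r12=0.07, r13=0.03, r23=0.0, n=23)
print(t_stat, df)
```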

      Author response image 2.

      We have included the raw correlations with priming (page 25, lines 654-661, Supplemental Figure 6) as well as text describing the comparison of variances (page 25, lines 642-653). We did not add the comparison of hippocampal-LOC coupling across scans to the current manuscript, as an evaluation of stability of such coupling in the context of learning and reactivation seems out of scope of the current focus of the experiment, but we find this result to be worthy of follow-up in future work.

      In summary, further analysis of our data did not reveal any indication that a comparison of rest connectivity across scan sessions inserted noise into the change score between baseline and post-recent learning scans. However, these analyses cannot fully rule that possibility out, and the current analyses do not provide concrete evidence that the post-recent learning scan comprises signals that are a mixture of processing of recent and remote sequences. We discuss these drawbacks in the Discussion (page 39, lines 1047-1070).

      2) My second major concern is how the authors have operationalized integration and differentiation. The pattern similarity analysis uses an overall correspondence between the neural similarity and a predicted model as the main metric. In the predicted model, C items that are indirectly associated are more similar to one another than they are C items that are entirely unrelated. The authors are then looking at a change in correspondence (correlation) between the neural data and that prediction model from pre- to post-learning. However, a change in the degree of correspondence with the predicted matrix could be driven by either the unrelated items becoming less similar or the related ones becoming more similar (or both!). Since the interpretation in the paper focuses on change to indirectly related C items, it would be important to report those values directly. For instance, as evidence of differentiation, it would be important to show that there is a greater decrease in similarity for indirectly associated C items than it is for unrelated C items (or even a smaller increase) from pre to post, or that C items that are indirectly related are less similar than are unrelated C items post but not pre-learning. Performing this analysis would confirm that the pattern of results matches the authors' interpretation. This would also impact the interpretation of the subsequent analyses that involve the neural integration measures (e.g., correlation analyses like those on p. 16, which may or may not be driven by increased similarity among overlapping C pairs). I should add that given the specificity to the remote learning in mPFC versus recent in LOC and anterior hippocampus, it is clearly the case that something interesting is going on. However, I think we need more data to understand fully what that "something" is.

      We recognize the importance of understanding whether model fits (and changes to them) are driven by similarity of overlapping pairs or non-overlapping pairs. We have modified all figures that visualize model fits to the neural integration model to separately show fits for pre- and post-learning (Figure 3 for mPFC, Supp. Figure 5 for LOC, Supp. Figure 9 for AB similarity in anterior hippocampus & LOC). We have additionally added supplemental figures to show the complete breakdown of similarity in each region in a 2 (pre/post) x 2 (overlapping/non-overlapping sequence) x 2 (recent/remote) chart. We decided against replacing the model fits with these latter charts, since the model fits strike a good balance between information and readability. We have also modified text in various sections to focus on these new results.

      In brief, the decrease in model fit for mPFC for the remote sequences was driven primarily by a decrease in similarity for the overlapping C items and not the non-overlapping ones (Supplementary Figure 3, page 18, lines 468-472).

      Interestingly, in LOC, all C items grew more similar after learning, regardless of their overlap or learning session, but the increase in model fit for C items in the recent condition was driven by a larger increase in similarity for overlapping pairs relative to non-overlapping ones (Supp. Figure 5, page 21, lines 533-536).

      We also visualized AB similarity in the anterior hippocampus and LOC in a similar fashion (Supplementary Figure 9).

      We have also edited the Methods section with updated details of these analyses (page 52, lines 1392-1397). We think that including these results considerably strengthens our claims and we are pleased to have them included.
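
      To make the logic of this breakdown concrete, the sketch below separates the pre-to-post change in pattern similarity for overlapping and non-overlapping C pairs, which a single model-fit value cannot distinguish. The item labels and similarity values are invented for illustration and are not data from the study; the function name is likewise hypothetical.

```python
from itertools import combinations
from statistics import mean


def mean_similarity_change(pre, post, overlapping_pairs):
    """Average post-minus-pre similarity, split by pair type.

    pre, post:         dicts mapping frozenset({item_a, item_b}) -> similarity
    overlapping_pairs: set of frozensets whose items share the same A-B cues
    """
    changes = {"overlapping": [], "non_overlapping": []}
    for pair, pre_value in pre.items():
        key = "overlapping" if pair in overlapping_pairs else "non_overlapping"
        changes[key].append(post[pair] - pre_value)
    return {key: mean(values) for key, values in changes.items()}


# HYPOTHETICAL toy example: two overlapping sequence pairs ("a" and "b").
items = ["C1a", "C2a", "C1b", "C2b"]
overlapping = {frozenset({"C1a", "C2a"}), frozenset({"C1b", "C2b"})}
all_pairs = [frozenset(pair) for pair in combinations(items, 2)]

pre_similarity = {pair: 0.10 for pair in all_pairs}
post_similarity = {pair: (0.02 if pair in overlapping else 0.09)
                   for pair in all_pairs}

change = mean_similarity_change(pre_similarity, post_similarity, overlapping)
print(change)
```

      In this toy case both pair types decrease in similarity, but the decrease is larger for overlapping pairs, which is the pattern that would be read as differentiation.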

      3) The priming task occurred before the post-learning exposure phase and could have impacted the representations. More consideration of this in the paper would be useful. Most critically, since the priming task involves seeing the related C items back-to-back, it would be important to consider whether this experience could have conceivably impacted the neural integration indices. I believe it never would have been the case that unrelated C items were presented sequentially during the priming task, i.e., that related C items always appeared together in this task. I think again the specificity of the remote condition is key and perhaps the authors can leverage this to support their interpretation. Can the authors consider this possibility in the Discussion?

      It's true that only C items from the same sequence were presented back-to-back during the priming task, and that this presentation may interfere with observations from the post-learning exposure scan that followed it. We agree that it is worth considering this caveat and have added language in the Discussion (page 40, lines 1071-1086). When designing the study, we reasoned that it was more important for the behavioral priming task to come before the exposure scans, as all items were shown only once in that task, whereas they were shown 4-5 times in a random order in the post-learning exposure phase. Because of this difference in presentation times, and because behavioral priming findings tend to be very sensitive, we concluded that it was more important to protect the priming task from the exposure scan instead of the reverse.

      We reasoned, however, that the additional presentation of the C items in the recognition priming task would not substantially override the sequence learning, as C items were each presented 16 times in their sequence (ABC1 and ABC2 16 times each). Furthermore, as this reviewer suggests, the order of C items during recognition was the same for recent and remote conditions, so the fact that we find a selective change in neural representation for the remote condition and don’t also see that change for the recent condition is additional assurance that the recognition priming order did not substantially impact the representations.

      4) For the priming task, based on the Figure 2A caption it seems as though every sequence contributes to both the control and primed conditions, but (I believe) this means that the control transition always happens first (and they are always back-to-back). Is this a concern? If RTs are changing over time (getting faster), it would be helpful to know whether the priming effects hold after controlling for trial numbers. I do not think this is a big issue because if it were, you would not expect to see the specificity of the remotely learned information. However, it would be helpful to know given the order of these conditions has to be fixed in their design.

      This is a correct understanding of the trial orders in the recognition priming task. We chose to involve the baseline items in the control condition to boost power – this way, priming of each sequence could be tested, while only presenting each item once in this task, as repetition in the recognition phase would have further facilitated response times and potentially masked any priming effects. We agree that accounting for trial order would be useful here, so we ran a mixed-effects linear model to examine response times both as a function of trial number and of priming condition (primed/control). While there is indeed a large effect of trial number such that participants got faster over time, the priming effect originally observed in the remote condition still holds after accounting for trial number. We now report this analysis in the Results section (page 14, lines 337-349 for Expt 1 and pages 14-15, lines 360-362 for Expt 2).
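
      The idea of estimating the priming effect while controlling for trial number can be illustrated with a simplified two-stage approach: fit an ordinary least-squares regression per participant, then average the coefficients across participants. This is a stand-in for, not a reproduction of, the mixed-effects model reported in the manuscript, and it uses only the Python standard library. All data below are synthetic and noise-free, and every function and variable name is hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_participant(trials):
    """OLS of RT on trial number and priming condition.

    trials: list of (trial_number, primed (0 or 1), response_time_ms)
    Returns [intercept, trial_slope, priming_effect].
    """
    rows = [[1.0, float(trial), float(primed)] for trial, primed, _ in trials]
    rts = [rt for _, _, rt in trials]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * rt for r, rt in zip(rows, rts)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def group_effects(data):
    """Average the per-participant coefficients (second stage)."""
    fits = [fit_participant(trials) for trials in data.values()]
    return [sum(fit[k] for fit in fits) / len(fits) for k in range(3)]


def make_participant(baseline_rt):
    """SYNTHETIC data: control trials precede primed trials, RTs speed up."""
    trials = []
    for trial in range(1, 9):
        primed = 1 if trial % 2 == 0 else 0
        trials.append((trial, primed, baseline_rt - 2.0 * trial - 30.0 * primed))
    return trials


toy_data = {"sub-01": make_participant(900.0), "sub-02": make_participant(1000.0)}
intercept, trial_slope, priming_effect = group_effects(toy_data)
print(intercept, trial_slope, priming_effect)
```

      Because the primed trial always follows its control trial, a speed-up over trials would be mistaken for priming if trial number were left out of the model; including it as a predictor separates the two.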

      5) The authors should be cautious about the general conclusion that memories with overlapping temporal regularities become neurally integrated - given their findings in MPFC are more consistent with overall differentiation (though as noted above, I think we need more data on this to know for sure what is going on).

      We realize this conclusion was overly simplistic and, in several places, have revised the general conclusions to be more specific about the nuanced similarity findings.

      6) It would be worth stating a few more details and perhaps providing additional logic or justification in the main text about how the pre- and post-exposure phases were set up and why. How many times was each object presented pre and post, and how was the sequencing determined (were any constraints put in place e.g., such that C1 and C2 did not appear close in time?). What was the cover task (I think this is important to the interpretation & so belongs in the main paper)? Were there considerations involving the fact that this is a different sequence of the same objects the participants would later be learning - e.g., interference, etc.?

      These details can be found in the Methods section (pages 50-51, lines 1337-1353) and we’ve added a new summary of that section in the Results (page 17, lines 424- 425 and 432-435). In brief, a visual hash tag appeared on a small subset of images and participants pressed a button when this occurred, and C1 and C2 objects were presented in separate scans (as were A and B objects) to minimize inflated neural similarity due to temporal proximity.

      Reviewer #2 (Public Review):

      The manuscript by Tompary & Davachi presents results from two experiments, one behavior only and one fMRI plus behavior. They examine the important question of how two separate object memories (C1 and C2) that are never experienced together in time become linked by shared predictive cues in a sequence (A followed by B followed by one of the C items). The authors developed an implicit priming task that provides a novel behavioral metric for such integration. They find significant C1-C2 priming for sequences that were learned 24h prior to the test, but not for recently learned sequences, suggesting that associative links between the two originally separate memories emerge over an extended period of consolidation. The fMRI study relates this behavioral integration effect to two neural metrics: pattern similarity changes in the medial prefrontal cortex (mPFC) as a measure of neural integration, and changes in hippocampal-LOC connectivity as a measure of post-learning consolidation. While fMRI patterns in mPFC overall show differentiation rather than integration (i.e., C1-C2 representational distances become larger), the authors find a robust correlation such that increasing pattern similarity in mPFC relates to stronger integration in the priming test, and this relationship is again specific to remote memories. Moreover, connectivity between the posterior hippocampus and LOC during post-learning rest is positively related to the behavioral integration effect as well as the mPFC neural similarity index, again specifically for remote memories. Overall, this is a coherent set of findings with interesting theoretical implications for consolidation theories, which will be of broad interest to the memory, learning, and predictive coding communities.

      Strengths:

      1) The implicit associative priming task designed for this study provides a promising new tool for assessing the formation of mnemonic links that influence behavior without explicit retrieval demands. The authors find an interesting dissociation between this implicit measure of memory integration and more commonly used explicit inference measures: a priming effect on the implicit task only evolved after a 24h consolidation period, while the ability to explicitly link the two critical object memories is present immediately after learning. While speculative at this point, these two measures thus appear to tap into neocortical and hippocampal learning processes, respectively, and this potential dissociation will be of interest to future studies investigating time-dependent integration processes in memory.

      2) The experimental task is well designed for isolating pre- vs post-learning changes in neural similarity and connectivity, including important controls of baseline neural similarity and connectivity.

      3) The main claim of a consolidation-dependent effect is supported by a coherent set of findings that relate behavioral integration to neural changes. The specificity of the effects on remote memories makes the results particularly interesting and compelling.

      4) The authors are transparent about unexpected results, for example, the finding that overall similarity in mPFC is consistent with a differentiation rather than an integration model.

      Thank you for the positive comments!

      Weaknesses:

      1) The sequence learning and recognition priming tasks are cleverly designed to isolate the effects of interest while controlling for potential order effects. However, due to the complex nature of the task, it is difficult for the reader to infer all the transition probabilities between item types and how they may influence the behavioral priming results. For example, baseline items (BL) are interspersed between repeated sequences during learning, and thus presumably can only occur before an A item or after a C item. This seems to create non-random predictive relationships such that C is often followed by BL, and BL by A items. If this relationship is reversed during the recognition priming task, where the sequence is always BL-C1-C2, this violation of expectations might slow down reaction times and deflate the baseline measure. It would be helpful if the manuscript explicitly reported transition probabilities for each relevant item type in the priming task relative to the sequence learning task and discussed how a match vs mismatch may influence the observed priming effects.

      We have added a table of transition probabilities across the learning, recognition priming, and exposure scans (now Table 1, page 48). We have also included some additional description of the change in transition probabilities across different tasks in the Methods section. Specifically, if participants are indeed learning item types and rules about their order, then both the control and the primed conditions would violate that order. Since C1 and C2 items never appeared together, viewing C1 would give rise to an expectation of seeing a BL item, which would also be violated. This suggests that our priming effects are driven by sequence-specific relationships rather than learning of the probabilities of different item types. We’ve added this consideration to the Methods section (page 45, lines 1212-1221).

      Another critical point to consider (and that the transition probabilities do not reflect) is that during learning, while each C item is followed by either an A or a BL item, the specific A or BL item that follows it varies. In contrast, a given A is always followed by the same B object, which is always followed by one of two C objects. While the order of item types is semi-predictable, the order of the objects (specific items) themselves is not. This can be seen in the response times during learning, such that response times for A and BL items are always slower than for B and C items. We have explained this nuance in the figure text for Table 1.
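
      As an illustration of how type-level transition probabilities of the kind summarized in Table 1 can be tabulated, the sketch below counts transitions between item types in a stream of labels. The two streams are short hypothetical examples constructed to follow the ordering rules described above; they are not the actual trial orders, and the resulting probabilities are not those in Table 1.

```python
from collections import Counter, defaultdict


def transition_probabilities(item_types):
    """Return P(next type | current type) estimated from one label stream."""
    counts = defaultdict(Counter)
    for current, following in zip(item_types, item_types[1:]):
        counts[current][following] += 1
    return {
        current: {nxt: n / sum(followers.values())
                  for nxt, n in followers.items()}
        for current, followers in counts.items()
    }


# HYPOTHETICAL learning stream: A-B-C triplets with baseline (BL) items between.
learning_stream = ["A", "B", "C", "BL", "A", "B", "C",
                   "A", "B", "C", "BL", "A", "B", "C"]
# HYPOTHETICAL priming stream: baseline item, then C1, then C2.
priming_stream = ["BL", "C", "C"]

learning_probs = transition_probabilities(learning_stream)
priming_probs = transition_probabilities(priming_stream)
print(learning_probs)
print(priming_probs)
```

      In the toy learning stream a C item is never followed by another C item, whereas in the toy priming stream it always is, which mirrors the mismatch in type-level expectations discussed above.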

      2) The choice of what regions of interest to include in the different sets of analyses could be better motivated. For example, even though briefly discussed in the intro, it remains unclear why the posterior but not the anterior hippocampus is of interest for the connectivity analyses, and why the main target is LOC, not mPFC, given past results including from this group (Tompary & Davachi, 2017). Moreover, for readers not familiar with this literature, it would help if references were provided to suggest that a predictable > unpredictable contrast is well suited for functionally defining mPFC, as done in the present study.

      We have clarified our reasoning for each of these choices throughout the manuscript and believe that our logic is now much more transparent. For an expanded reasoning of why we were motivated to look at posterior and not anterior hippocampus, see pages 6-7, lines 135-159, and our response to R2. In brief, past research focusing on post-encoding connectivity with the hippocampus suggests that the posterior aspect is more likely to couple with category-selective cortex after learning neutral, non-rewarded objects much like the stimuli used in the present study.

      We also clarify our reasoning for LOC over mPFC. While theoretically, mPFC is thought to be a candidate region for coupling with the hippocampus during consolidation, the bulk of empirical work to date has revealed post-encoding connectivity between the hippocampus and category-selective cortex in the ventral and occipital lobes (page 6, lines 123-134).

      As for the use of the predictable > unpredictable contrast for functionally defining cortical regions, we reasoned that cortical regions that were sensitive to the temporal regularities generated by the sequences may be further involved in their offline consolidation and long-term storage (Danker & Anderson, 2010; Davachi & Danker, 2013; McClelland et al., 1995). We have added this justification to the Methods section (page 18, lines 454-460).

      3) Relatedly, multiple comparison corrections should be applied in the fMRI integration and connectivity analyses whenever the same contrast is performed on multiple regions in an exploratory manner.

      We now correct for multiple comparisons using Bonferroni correction, and this correction depends on the number of regions in which each analysis is conducted. Please see page 55, lines 1483-1490, in the Methods section for details of each analysis.
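
      For reference, a Bonferroni correction over a set of regions can be sketched as follows. The region names are taken from this response, but the p-values are hypothetical placeholders rather than results from the manuscript, and the function name is our own.

```python
def bonferroni(p_values, alpha=0.05):
    """Apply a Bonferroni correction across the regions tested.

    p_values: dict mapping region name -> uncorrected p-value
    Returns a dict with the adjusted p-value and a significance flag per region.
    """
    n_tests = len(p_values)
    threshold = alpha / n_tests  # per-test alpha after correction
    return {
        region: {
            "p_adjusted": min(1.0, p * n_tests),
            "significant": p < threshold,
        }
        for region, p in p_values.items()
    }


# HYPOTHETICAL uncorrected p-values for three regions of interest.
corrected = bonferroni({"mPFC": 0.01, "LOC": 0.04, "anterior hippocampus": 0.5})
print(corrected)
```

      With three regions the per-test threshold becomes 0.05 / 3, so a p-value of 0.04 that would pass uncorrected no longer does.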

      Reviewer #3 (Public Review):

      The authors of this manuscript sought to illuminate a link between a behavioral measure of integration and neural markers of cortical integration associated with systems consolidation (post-encoding connectivity, change in representational neural overlap). To that aim, participants incidentally encoded sequences of objects in the fMRI scanner. Unbeknownst to participants, the first two objects of the presented ABC triplet sequences overlapped for a given pair of sequences. This allowed the authors to probe the integration of unique C objects that were never directly presented in the same sequence, but which shared the same preceding A and B objects. They encoded one set of objects on Day 1 (remote condition), another set of objects 24 hours later (recent condition) and tested implicit and explicit memory for the learned sequences on Day 2. They additionally collected baseline and post-encoding resting-state scans. As their measure of behavioral integration, the authors examined reaction time during an Old/New judgement task for C objects depending on if they were preceded by a C object from an overlapping sequence (primed condition) versus a baseline object. They found faster reaction times for the primed objects compared to the control condition for remote but not recently learned objects, suggesting that the C objects from overlapping sequences became integrated over time. They then examined pattern similarity in a priori ROIs as a measure of neural integration and found that participants showing evidence of integration of C objects from overlapping sequences in the medial prefrontal cortex for remotely learned objects also showed a stronger implicit priming effect between those C objects over time. When they examined the change in connectivity between their ROIs after encoding, they also found that connectivity between the posterior hippocampus and lateral occipital cortex correlated with larger priming effects for remotely learned objects, and that lateral occipital connectivity with the medial prefrontal cortex was related to neural integration of remote objects from overlapping sequences.

      The authors' aim to provide evidence of a relationship between behavioral and neural measures of integration with consolidation is interesting, important, and difficult to achieve given the longitudinal nature of studies required to answer this question. Strengths of this study include a creative behavioral task, and solid modelling approaches for fMRI data with careful control for several known confounds such as BOLD activation on pattern analysis results, motion, and physiological noise. The authors replicate their behavioral observations across two separate experiments, one of which included a large sample size, and found similar results that speak to the reliability of the observed behavioral phenomenon. In addition, they document several correlations between neural measures and task performance, lending functional significance to their neural findings.

      Thank you for this positive assessment of our study!

      However, this study is not without notable weaknesses that limit the strength of the manuscript. The authors report a behavioral priming effect suggestive of integration of remote but not recent memories, leading to the interpretation that the priming effect emerges with consolidation. However, they did not observe a reliable interaction between the priming condition and learning session (recent/remote) on reaction times, meaning that the priming effect for remote memories was not reliably greater than that observed for recent. In addition, the emergence of a priming effect for remote memories does not appear to be due to faster reaction times for primed targets over time (the condition of interest), but rather, slower reaction times for control items in the remote condition compared to recent. These issues limit the strength of the claim that the priming effect observed is due to C items of interest being integrated in a consolidation-dependent manner.

      We acknowledge that the lack of a day by condition interaction in the behavioral priming effect should be discussed, and we now discuss these data in a more nuanced manner. While it’s true that the priming effect emerges due to a slowing of the control items over time, this slowing is consistent with classic time-dependent effects demonstrating slower response times for more delayed memories. The fact that response times in the primed condition do not show this slowing can be interpreted as a protection against the slowing that would otherwise occur. Please see page 29, lines 758-766, for this added discussion.

      Similarly, the interactions between neural variables of interest and learning session needed to strongly show a significant consolidation-related effect in the brain were sometimes tenuous. There was no reliable difference in neural representational pattern analysis fit to a model of neural integration between the short and long delays in the medial prefrontal cortex or lateral occipital cortex, nor was the posterior hippocampus-lateral occipital cortex post-encoding connectivity correlation with subsequent priming significantly different for recent and remote memories. While the relationship between integration model fit in the medial prefrontal cortex and subsequent priming (which was significantly different from that occurring for recent memories) was one of the stronger findings of the paper in favor of a consolidation-related effect on behavior, is it possible that lack of a behavioral priming effect for recent memories due to possible issues with the control condition could mask a correlation between neural and behavioral integration in the recent memory condition?

      While we acknowledge the lack of a statistically reliable interaction between neural measures and behavioral priming in many cases, we are heartened by the reliable difference in the relationship between mPFC similarity and priming over time, which was our main planned prediction. In addition to adding caveats in the discussion about the neural measures and behavioral findings in the recent condition (see our response to R1.1 and R1.4 for more details), we have added language throughout the manuscript noting the need to interpret these data with caution.

      These limitations are especially notable when one considers that priming does not classically require a period of prolonged consolidation to occur, and prominent models of systems consolidation rather pertain to explicit memory. While the authors have provided evidence that neural integration in the medial prefrontal cortex, as well as post-encoding coupling between the lateral occipital cortex and posterior hippocampus, are related to faster reaction times for primed objects of overlapping sequences compared to their control condition, more work is needed to verify that the observed findings indeed reflect consolidation dependent integration as proposed.

      We agree that more work is needed to provide converging evidence for these novel findings. However, we wish to counter the notion that systems consolidation models are relevant only for explicit memories. Although models of systems consolidation often mention transformations from episodic to semantic memory, the critical mechanisms that define the models involve changes in the neural ensembles of a memory that is initially laid down in the hippocampus and is taught to cortex over time. This transformation of neural traces is not specific to explicit/declarative forms of memory. For example, implicit statistical learning initially depends on intact hippocampal function (Schapiro et al., 2014) and improves over consolidation (Durrant et al., 2011, 2013; Kóbor et al., 2017).

      Second, while there are many classical findings of priming during or immediately after learning, there are several instances of priming used to measure consolidation-related changes to newly learned information. For instance, priming has been used as a measure of lexical integration, demonstrating that new word learning benefits from a night of sleep (Wang et al., 2017; Gaskell et al., 2019) or a 1-week delay (Tamminen & Gaskell, 2013). The issue is not whether priming can occur immediately, it is whether priming increases with a delay.

      Finally, it is helpful to think about models of memory systems that divide memory representations not by their explicit/implicit nature, but along other important dimensions such as their neural bases, their flexibility vs rigidity, and their capacity for rapid vs slow learning (Henke, 2010). Considering this evidence, we suggest that systems consolidation models are most useful when considering how transformations in the underlying neural memory representation affect its behavioral expression, rather than focusing on the extent to which the memory representation is explicit or implicit.

      With all this said, we have added text to the discussion reminding the reader that there was no statistically significant difference in priming as a function of the delay (page 29, lines 764 - 766). However, we are encouraged by the fact that the relationship between priming and mPFC neural similarity was significantly stronger for remotely learned objects relative to recently learned ones, as this is directly in line with systems consolidation theories.

      References

      Abolghasem, Z., Teng, T. H.-T., Nexha, E., Zhu, C., Jean, C. S., Castrillon, M., Che, E., Di Nallo, E. V., & Schlichting, M. L. (2023). Learning strategy differentially impacts memory connections in children and adults. Developmental Science, 26(4), e13371. https://doi.org/10.1111/desc.13371

      Dobbins, I. G., Schnyer, D. M., Verfaellie, M., & Schacter, D. L. (2004). Cortical activity reductions during repetition priming can result from rapid response learning. Nature, 428(6980), 316–319. https://doi.org/10.1038/nature02400

      Durrant, S. J., Cairney, S. A., & Lewis, P. A. (2013). Overnight consolidation aids the transfer of statistical knowledge from the medial temporal lobe to the striatum. Cerebral Cortex, 23(10), 2467–2478. https://doi.org/10.1093/cercor/bhs244

      Durrant, S. J., Taylor, C., Cairney, S., & Lewis, P. A. (2011). Sleep-dependent consolidation of statistical learning. Neuropsychologia, 49(5), 1322–1331. https://doi.org/10.1016/j.neuropsychologia.2011.02.015

      Gaskell, M. G., Cairney, S. A., & Rodd, J. M. (2019). Contextual priming of word meanings is stabilized over sleep. Cognition, 182, 109–126. https://doi.org/10.1016/j.cognition.2018.09.007

      Henke, K. (2010). A model for memory systems based on processing modes rather than consciousness. Nature Reviews Neuroscience, 11(7), 523–532. https://doi.org/10.1038/nrn2850

      Kóbor, A., Janacsek, K., Takács, Á., & Nemeth, D. (2017). Statistical learning leads to persistent memory: Evidence for one-year consolidation. Scientific Reports, 7(1), 760. https://doi.org/10.1038/s41598-017-00807-3

      Kuhl, B. A., & Chun, M. M. (2014). Successful remembering elicits event-specific activity patterns in lateral parietal cortex. The Journal of Neuroscience, 34(23), 8051–8060. https://doi.org/10.1523/JNEUROSCI.4328-13.2014

      Richter, F. R., Chanales, A. J. H., & Kuhl, B. A. (2016). Predicting the integration of overlapping memories by decoding mnemonic processing states during learning. NeuroImage, 124, Part A, 323–335. https://doi.org/10.1016/j.neuroimage.2015.08.051

      Schapiro, A. C., Gregory, E., Landau, B., McCloskey, M., & Turk-Browne, N. B. (2014). The necessity of the medial-temporal lobe for statistical learning. Journal of Cognitive Neuroscience, 1–12. https://doi.org/10.1162/jocn_a_00578

      Schlichting, M. L., & Preston, A. R. (2014). Memory reactivation during rest supports upcoming learning of related content. Proceedings of the National Academy of Sciences, 111(44), 15845–15850. https://doi.org/10.1073/pnas.1404396111

      Smith, J. F., Alexander, G. E., Chen, K., Husain, F. T., Kim, J., Pajor, N., & Horwitz, B. (2010). Imaging systems level consolidation of novel associate memories: A longitudinal neuroimaging study. NeuroImage, 50(2), 826–836. https://doi.org/10.1016/j.neuroimage.2009.11.053

      Takashima, A., Nieuwenhuis, I. L. C., Jensen, O., Talamini, L. M., Rijpkema, M., & Fernández, G. (2009). Shift from hippocampal to neocortical centered retrieval network with consolidation. The Journal of Neuroscience, 29(32), 10087–10093. https://doi.org/10.1523/JNEUROSCI.0799-09.2009

      Tamminen, J., & Gaskell, M. G. (2013). Novel word integration in the mental lexicon: Evidence from unmasked and masked semantic priming. The Quarterly Journal of Experimental Psychology, 66(5), 1001–1025. https://doi.org/10.1080/17470218.2012.724694

      van Kesteren, M. T. R., Fernández, G., Norris, D. G., & Hermans, E. J. (2010). Persistent schema-dependent hippocampal-neocortical connectivity during memory encoding and postencoding rest in humans. Proceedings of the National Academy of Sciences, 107(16), 7550–7555. https://doi.org/10.1073/pnas.0914892107

      Wang, H.-C., Savage, G., Gaskell, M. G., Paulin, T., Robidoux, S., & Castles, A. (2017). Bedding down new words: Sleep promotes the emergence of lexical competition in visual word recognition. Psychonomic Bulletin & Review, 24(4), 1186–1193. https://doi.org/10.3758/s13423-016-1182-7

    1. Author Response

      Reviewer #1 (Public Review):

      This study used a multi-day learning paradigm combined with fMRI to reveal neural changes reflecting the learning of new (arbitrary) shape-sound associations. In the scanner, the shapes and sounds are presented separately and together, both before and after learning. When they are presented together, they can be either consistent or inconsistent with the learned associations. The analyses focus on auditory and visual cortices, as well as the object-selective cortex (LOC) and anterior temporal lobe regions (temporal pole (TP) and perirhinal cortex (PRC)). Results revealed several learning-induced changes, particularly in the anterior temporal lobe regions. First, the LOC and PRC showed a reduced bias to shapes vs sounds (presented separately) after learning. Second, the TP responded more strongly to incongruent than congruent shape-sound pairs after learning. Third, the similarity of TP activity patterns to sounds and shapes (presented separately) was increased for non-matching shape-sound comparisons after learning. Fourth, when comparing the pattern similarity of individual features to combined shape-sound stimuli, the PRC showed a reduced bias towards visual features after learning. Finally, comparing patterns to combined shape-sound stimuli before and after learning revealed a reduced (and negative) similarity for incongruent combinations in PRC. These results are all interpreted as evidence for an explicit integrative code of newly learned multimodal objects, in which the whole is different from the sum of the parts.

      The study has many strengths. It addresses a fundamental question that is of broad interest, the learning paradigm is well-designed and controlled, and the stimuli are real 3D stimuli that participants interact with. The manuscript is well written and the figures are very informative, clearly illustrating the analyses performed.

      There are also some weaknesses. The sample size (N=17) is small for detecting the subtle effects of learning. Most of the statistical analyses are not corrected for multiple comparisons (ROIs), and the specificity of the key results to specific regions is also not tested. Furthermore, the evidence for an integrative representation is rather indirect, and alternative interpretations for these results are not considered.

      We thank the reviewer for their careful reading and the positive comments on our manuscript. As suggested, we have conducted additional analyses of theoretically-motivated ROIs and have found that temporal pole and perirhinal cortex are the only regions to show the key experience-dependent transformations. We are much more cautious with respect to multiple comparisons, and have removed a series of post hoc across-ROI comparisons that were irrelevant to the key questions of the present manuscript. The revised manuscript now includes much more discussion about alternative interpretations as suggested by the reviewer (and also by the other reviewers).

      Additionally, we looked into scanning more participants, but our scanner has since had a full upgrade and the sequence used in the current study is no longer supported. However, we note that while most analyses contain 17 participants, we employed a within-subject learning design that is not typically used in fMRI experiments and that increases our power to detect an effect. This is supported by the robust effect size of the behavioural data, whereby 17 out of 18 participants showed a learning effect (Cohen’s d = 1.28), a result that was replicated in a follow-up experiment with a larger sample size.
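
      As an illustration of the effect size mentioned above, the following is a minimal sketch of one common within-subject variant of Cohen's d (the mean of the paired difference scores divided by their standard deviation). The response does not state which variant was used, so this formula is an assumption, and the function name `cohens_d_paired` and the toy scores are hypothetical.

```python
from statistics import mean, stdev


def cohens_d_paired(pre, post):
    """Within-subject effect size: mean paired difference / SD of differences.

    Assumption: this is the d_z variant; the source does not say which
    definition of Cohen's d was used.
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [after - before for before, after in zip(pre, post)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("difference scores have zero variance")
    return mean(diffs) / sd


# Hypothetical toy ratings for four participants, before and after learning.
toy_pre = [1.0, 2.0, 3.0, 4.0]
toy_post = [2.0, 4.0, 4.0, 6.0]
toy_d = cohens_d_paired(toy_pre, toy_post)
```

      Because every participant serves as their own control, between-participant variability drops out of the denominator, which is one reason a within-subject design can detect an effect with a modest sample.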

      We address the other reviewer comments point by point below.

      Reviewer #2 (Public Review):

      Li et al. used a four-day fMRI design to investigate how unimodal feature information is combined, integrated, or abstracted to form a multimodal object representation. The experimental question is of great interest and understanding how the human brain combines featural information to form complex representations is relevant for a wide range of researchers in neuroscience, cognitive science, and AI. While most fMRI research on object representations is limited to visual information, the authors examined how visual and auditory information is integrated to form a multimodal object representation. The experimental design is elegant and clever. Three visual shapes and three auditory sounds were used as the unimodal features; the visual shapes were used to create 3D-printed objects. On Day 1, the participants interacted with the 3D objects to learn the visual features, but the objects were not paired with the auditory features, which were played separately. On Day 2, participants were scanned with fMRI while they were exposed to the unimodal visual and auditory features as well as pairs of visual-auditory cues. On Day 3, participants again interacted with the 3D objects but now each was paired with one of the three sounds that played from an internal speaker. On Day 4, participants completed the same fMRI scanning runs they completed on Day 2, except now some visual-auditory feature pairs corresponded with Congruent (learned) objects, and some with Incongruent (unlearned) objects. Using the same fMRI design on Days 2 and 4 enables a well-controlled comparison between feature- and object-evoked neural representations before and after learning. The notable results corresponded to findings in the perirhinal cortex and temporal pole. The authors report (1) that a visual bias on Day 2 for unimodal features in the perirhinal cortex was attenuated after learning on Day 4, (2) a decreased univariate response to congruent vs. incongruent visual-auditory objects in the temporal pole on Day 4, (3) decreased pattern similarity between congruent vs. incongruent pairs of visual and auditory unimodal features in the temporal pole on Day 4, (4) in the perirhinal cortex, visual unimodal features on Day 2 do not correlate with their respective visual-auditory objects on Day 4, and (5) in the perirhinal cortex, multimodal object representations across Days 2 and 4 are uncorrelated for congruent objects and anticorrelated for incongruent. The authors claim that each of these results supports the theory that multimodal objects are represented in an "explicit integrative" code separate from feature representations. While these data are valuable and the results are interesting, the authors' claims are not well supported by their findings.

      We thank the reviewer for the careful reading of our manuscript and positive comments. Overall, we now stay closer to the data when describing the results and provide our interpretation of these results in the discussion section while remaining open to alternative interpretations (as also suggested by Reviewer 1).

      (1) In the introduction, the authors contrast two theories: (a) multimodal objects are represented in the co-activation of unimodal features, and (b) multimodal objects are represented in an explicit integrative code such that the whole is different than the sum of its parts. However, the distinction between these two theories is not straightforward. An explanation of what is precisely meant by "explicit" and "integrative" would clarify the authors' theoretical stance. Perhaps we can assume that an "explicit" representation is a new representation that is created to represent a multimodal object. What is meant by "integrative" is more ambiguous: unimodal features could be integrated within a representation in a manner that preserves the decodability of the unimodal features, or alternatively the multimodal representation could be completely abstracted away from the constituent features such that the features are no longer decodable. Even if the object representation is "explicit" and distinct from the unimodal feature representations, it can in theory still contain featural information, though perhaps warped or transformed. The authors do not clearly commit to a degree of featural abstraction in their theory of "explicit integrative" multimodal object representations, which makes it difficult to assess the validity of their claims.

      Due to its ambiguity, we removed the term “explicit” and now make it clear that our central question was whether crossmodal object representations require only unimodal feature-level representations (e.g., frogs are created from only the combination of shape and sound) or whether crossmodal object representations also rely on an integrative code distinct from the unimodal features (e.g., there is something more to “frog” than its original shape and sound). We now clarify this in the revised manuscript.

      “One theoretical view from the cognitive sciences suggests that crossmodal objects are built from component unimodal features represented across distributed sensory regions.8 Under this view, when a child thinks about “frog”, the visual cortex represents the appearance of the shape of the frog whereas the auditory cortex represents the croaking sound. Alternatively, other theoretical views predict that multisensory objects are not only built from their component unimodal sensory features, but that there is also a crossmodal integrative code that is different from the sum of these parts.9,10,11,12,13 These latter views propose that anterior temporal lobe structures can act as a polymodal “hub” that combines separate features into integrated wholes.9,11,14,15” – pg. 4

      For this reason, we designed our paradigm to equate the unimodal representations, such that neural differences between the congruent and incongruent conditions provide evidence for a crossmodal integrative code different from the unimodal features (because the unimodal features are equated by default in the design).

      “Critically, our four-day learning task allowed us to isolate any neural activity associated with integrative coding in anterior temporal lobe structures that emerges with experience and differs from the neural patterns recorded at baseline. The learned and non-learned crossmodal objects were constructed from the same set of three validated shape and sound features, ensuring that factors such as familiarity with the unimodal features, subjective similarity, and feature identity were tightly controlled (Figure 2). If the mind represented crossmodal objects entirely as the reactivation of unimodal shapes and sounds (i.e., objects are constructed from their parts), then there should be no difference between the learned and non-learned objects (because they were created from the same three shapes and sounds). By contrast, if the mind represented crossmodal objects as something over and above their component features (i.e., representations for crossmodal objects rely on integrative coding that is different from the sum of their parts), then there should be behavioral and neural differences between learned and non-learned crossmodal objects (because the only difference across the objects is the learned relationship between the parts). Furthermore, this design allowed us to determine the relationship between the object representation acquired after crossmodal learning and the unimodal feature representations acquired before crossmodal learning. That is, we could examine whether learning led to abstraction of the object representations such that it no longer resembled the unimodal feature representations.” – pg. 5

      Furthermore, we agree with the reviewer that our definition and methodological design do not directly capture the structure of the integrative code. With experience, the unimodal feature representations may be completely abstracted away, warped, or changed in a nonlinear transformation. We suggest that crossmodal learning forms an integrative code that is different from the original unimodal representations in the anterior temporal lobes; however, we agree that future work is needed to more directly capture the structure of the integrative code that emerges with experience.

      “In our task, participants had to differentiate congruent and incongruent objects constructed from the same three shape and sound features (Figure 2). An efficient way to solve this task would be to form distinct object-level outputs from the overlapping unimodal feature-level inputs such that congruent objects are made to be orthogonal from the representations before learning (i.e., measured as pattern similarity equal to 0 in the perirhinal cortex; Figure 5b, 6, Supplemental Figure S5), whereas non-learned incongruent objects could be made to be dissimilar from the representations before learning (i.e., anticorrelation, measured as pattern similarity less than 0 in the perirhinal cortex; Figure 6). Because our paradigm could decouple neural responses to the learned object representations (on Day 4) from the original component unimodal features at baseline (on Day 2), these results could be taken as evidence of pattern separation in the human perirhinal cortex.11,12 However, our pattern of results could also be explained by other types of crossmodal integrative coding. For example, incongruent object representations may be less stable than congruent object representations, such that incongruent object representations are warped to a greater extent than congruent object representations (Figure 6).” – pg. 18

      “As one solution to the crossmodal binding problem, we suggest that the temporal pole and perirhinal cortex form unique crossmodal object representations that are different from the distributed features in sensory cortex (Figure 4, 5, 6, Supplemental Figure S5). However, the nature by which the integrative code is structured and formed in the temporal pole and perirhinal cortex following crossmodal experience – such as through transformations, warping, or other factors – is an open question and an important area for future investigation.” – pg. 18

      (2) After participants learned the multimodal objects, the authors report a decreased univariate response to congruent visual-auditory objects relative to incongruent objects in the temporal pole. This is claimed to support the existence of an explicit, integrative code for multimodal objects. Given the number of alternative explanations for this finding, this claim seems unwarranted. A simpler interpretation of these results is that the temporal pole is responding to the novelty of the incongruent visual-auditory objects. If there is in fact an explicit, integrative multimodal object representation in the temporal pole, it is unclear why this would manifest in a decreased univariate response.

      We thank the reviewer for identifying this issue. Our behavioural design controls unimodal feature-level novelty but allows object-level novelty to differ. Thus, neural differences between the congruent and incongruent conditions reflect sensitivity to the object-level differences between the combinations of shape and sound. However, we agree that there are multiple interpretations regarding the nature of how the integrative code is structured in the temporal pole and perirhinal cortex. We have removed the interpretation highlighted by the reviewer from the results. Instead, we now provide our preferred interpretation in the discussion, while acknowledging the other possibilities that the reviewer mentions.

      As one possibility, these results in temporal pole may reflect “conceptual combination”. For example, “hummingbird” – a congruent pairing – may require fewer neural resources than an incongruent pairing such as “bark-frog”.

      “Furthermore, these distinct anterior temporal lobe structures may be involved with integrative coding in different ways. For example, the crossmodal object representations measured after learning were found to be related to the component unimodal feature representations measured before learning in the temporal pole but not the perirhinal cortex (Figure 5, 6, Supplemental Figure S5). Moreover, pattern similarity for congruent shape-sound pairs was lower than the pattern similarity for incongruent shape-sound pairs after crossmodal learning in the temporal pole but not the perirhinal cortex (Figure 4b, Supplemental Figure S3a). As one interpretation of this pattern of results, the temporal pole may represent new crossmodal objects by combining previously learned knowledge.8,9,10,11,13,14,15,33 Specifically, research into conceptual combination has linked the anterior temporal lobes to compound object concepts such as “hummingbird”.34,35,36 For example, participants during our task may have represented the sound-based “humming” concept and visually-based “bird” concept on Day 1, forming the crossmodal “hummingbird” concept on Day 3 (Figure 1, 2), which may recruit less activity in temporal pole than an incongruent pairing such as “barking-frog”. For these reasons, the temporal pole may form a crossmodal object code based on pre-existing knowledge, resulting in reduced neural activity (Figure 3d) and pattern similarity towards features associated with learned objects (Figure 4b).” – pg. 18

      (3) The authors ran a neural pattern similarity analysis on the unimodal features before and after multimodal object learning. They found that the similarity between visual and auditory features that composed congruent objects decreased in the temporal pole after multimodal object learning. This was interpreted to reflect an explicit integrative code for multimodal objects, though it is not clear why. First, behavioral data show that participants reported increased similarity between the visual and auditory unimodal features within congruent objects after learning, the opposite of what was found in the temporal pole. Second, it is unclear why an analysis of the unimodal features would be interpreted to reflect the nature of the multimodal object representations. Since the same features corresponded with both congruent and incongruent objects, the nature of the feature representations cannot be interpreted to reflect the nature of the object representations per se. Third, using unimodal feature representations to make claims about object representations seems to contradict the theoretical claim that explicit, integrative object representations are distinct from unimodal features. If the learned multimodal object representation exists separately from the unimodal feature representations, there is no reason why the unimodal features themselves would be influenced by the formation of the object representation. Instead, these results seem to more strongly support the theory that multimodal object learning results in a transformation or warping of feature space.

      We apologize for the lack of clarity. We have now overhauled this aspect of our manuscript in an attempt to better highlight key aspects of our experimental design. In particular, because the unimodal features composing the congruent and incongruent objects were equated, neural differences between these conditions would provide evidence for an experience-dependent crossmodal integrative code that is different from its component unimodal features.

      Related to the second and third points, we were looking at the extent to which the original unimodal representations change with crossmodal learning. Before crossmodal learning, we found that the perirhinal cortex tracked the similarity between the individual visual shape features and the crossmodal objects that were composed of those visual shapes – however, there was no evidence that perirhinal cortex was tracking the unimodal sound features on those crossmodal objects. After crossmodal learning, we see that this visual shape bias in perirhinal cortex was no longer present – that is, the representation in perirhinal cortex started to look less like the visual features that comprise the objects. Thus, crossmodal learning transformed the perirhinal representations so that they were no longer predominantly grounded in a single visual modality, which may be a mechanism by which object concepts gain their abstraction. We have now tried to be clearer about this interpretation throughout the paper.

      Notably, we suggest that experience may change both the crossmodal object representations, as well as the unimodal feature representations. For example, we have previously shown that unimodal visual features are influenced by experience in parallel with the representation of the conjunction (e.g., Liang et al., 2020; Cerebral Cortex). Nevertheless, we remain open to the myriad possible structures of the integrative code that might emerge with experience.

      We now clarify these points throughout the manuscript. For example:

      “We then examined whether the original representations would change after participants learned how the features were paired together to make specific crossmodal objects, conducting the same analysis described above after crossmodal learning had taken place (Figure 5b). With this analysis, we sought to measure the relationship between the representation for the learned crossmodal object and the original baseline representation for the unimodal features. More specifically, the voxel-wise activity for unimodal feature runs before crossmodal learning was correlated to the voxel-wise activity for crossmodal object runs after crossmodal learning (Figure 5b). Another linear mixed model which included modality as a fixed factor within each ROI revealed that the perirhinal cortex was no longer biased towards visual shape after crossmodal learning (F1,32 = 0.12, p = 0.73), whereas the temporal pole, LOC, V1, and A1 remained biased towards either visual shape or sound (F1,30-32 between 16.20 and 73.42, all p < 0.001, η2 between 0.35 and 0.70).” – pg. 14

      “To investigate this effect in perirhinal cortex more specifically, we conducted a linear mixed model to directly compare the change in the visual bias of perirhinal representations from before crossmodal learning to after crossmodal learning (green regions in Figure 5a vs. 5b). Specifically, the linear mixed model included learning day (before vs. after crossmodal learning) and modality (visual feature match to crossmodal object vs. sound feature match to crossmodal object). Results revealed a significant interaction between learning day and modality in the perirhinal cortex (F1,775 = 5.56, p = 0.019, η2 = 0.071), meaning that the baseline visual shape bias observed in perirhinal cortex (green region of Figure 5a) was significantly attenuated with experience (green region of Figure 5b). After crossmodal learning, a given shape no longer invoked significant pattern similarity between objects that had the same shape but differed in terms of what they sounded like. Taken together, these results suggest that prior to learning the crossmodal objects, the perirhinal cortex had a default bias toward representing the visual shape information and was not representing sound information of the crossmodal objects. After crossmodal learning, however, the visual shape bias in perirhinal cortex was no longer present. That is, with crossmodal learning, the representations within perirhinal cortex started to look less like the visual features that comprised the crossmodal objects, providing evidence that the perirhinal representations were no longer predominantly grounded in the visual modality.” – pg. 13
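
      The learning day by modality interaction described above can be read as a difference of differences: the visual-minus-sound bias before learning compared with the same bias after learning. The sketch below computes only that descriptive contrast from cell means. It is not the linear mixed model reported in the manuscript, and the function name `interaction_contrast` and all similarity values are hypothetical.

```python
from statistics import mean


def interaction_contrast(cells):
    """Return (bias_before, bias_after, bias_before - bias_after).

    `cells` maps (learning_day, modality) to a list of pattern-similarity
    values. The bias is the visual mean minus the sound mean on a given day.
    """
    bias_before = mean(cells[("before", "visual")]) - mean(cells[("before", "sound")])
    bias_after = mean(cells[("after", "visual")]) - mean(cells[("after", "sound")])
    return bias_before, bias_after, bias_before - bias_after


# Hypothetical similarity values chosen only to illustrate an attenuated bias.
toy_cells = {
    ("before", "visual"): [0.12, 0.10, 0.15, 0.09],
    ("before", "sound"): [0.01, -0.02, 0.03, 0.00],
    ("after", "visual"): [0.02, 0.00, 0.03, -0.01],
    ("after", "sound"): [0.01, 0.02, -0.01, 0.00],
}
bias_before, bias_after, contrast = interaction_contrast(toy_cells)
```

      A positive contrast means the visual bias shrank from before to after learning. Whether that shrinkage is statistically reliable is what the interaction term in the mixed model tests.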

      “Importantly, the initial visual shape bias observed in the perirhinal cortex was attenuated by experience (Figure 5, Supplemental Figure S5), suggesting that the perirhinal representations had become abstracted and were no longer predominantly grounded in a single modality after crossmodal learning. One possibility may be that the perirhinal cortex is by default visually driven as an extension to the ventral visual stream,10,11,12 but can act as a polymodal “hub” region for additional crossmodal input following learning.” – pg. 19

      (4) The most compelling evidence the authors provide for their theoretical claims is the finding that, in the perirhinal cortex, the unimodal feature representations on Day 2 do not correlate with the multimodal objects they comprise on Day 4. This suggests that the learned multimodal object representations are not combinations of their unimodal features. If unimodal features are not decodable within the congruent object representations, this would support the authors' explicit integrative hypothesis. However, the analyses provided do not go all the way in convincing the reader of this claim. First, the analyses reported do not differentiate between congruent and incongruent objects. If this result in the perirhinal cortex reflects the formation of new multimodal object representations, it should only be true for congruent objects but not incongruent objects. Since the analyses combine congruent and incongruent objects it is not possible to know whether this was the case. Second, just because feature representations on Day 2 do not correlate with multimodal object patterns on Day 4 does not mean that the object representations on Day 4 do not contain featural information. This could be directly tested by correlating feature representations on Day 4 with congruent vs. incongruent object representations on Day 4. It could be that representations in the perirhinal cortex are not stable over time and all representations, including unimodal feature representations, shift between sessions, which could explain these results yet not entail the existence of abstracted object representations.

      We thank the reviewer for this suggestion and have conducted the two additional analyses. Specifically, we split the congruent and incongruent conditions and also investigated correlations between unimodal representations on Day 4 with crossmodal object representations on Day 4. There was no significant interaction between modality and congruency in any ROI across or within learning days. One possible explanation for these findings is that both congruent and incongruent crossmodal objects are represented differently from their underlying unimodal features, and all of these representations can transform with experience.

      However, the new analyses also revealed that perirhinal cortex was the only region without a modality-specific bias after crossmodal learning (e.g., Day 4 Unimodal Feature runs x Day 4 Crossmodal Object runs; now shown in Supplemental Figure S5). Overall, these results are consistent with the notion of a crossmodal integrative code in perirhinal cortex that has changed with experience and is different from the component unimodal features. Nevertheless, we explore alternative interpretations for how the crossmodal code emerges with experience in the discussion.

      “To examine whether these results differed by congruency (i.e., whether any modality-specific biases differed as a function of whether the object was congruent or incongruent), we conducted exploratory linear mixed models for each of the five a priori ROIs across learning days. More specifically, we correlated: 1) the voxel-wise activity for Unimodal Feature Runs before crossmodal learning to the voxel-wise activity for Crossmodal Object Runs before crossmodal learning (Day 2 vs. Day 2), 2) the voxel-wise activity for Unimodal Feature Runs before crossmodal learning to the voxel-wise activity for Crossmodal Object Runs after crossmodal learning (Day 2 vs. Day 4), and 3) the voxel-wise activity for Unimodal Feature Runs after crossmodal learning to the voxel-wise activity for Crossmodal Object Runs after crossmodal learning (Day 4 vs. Day 4). For each of the three analyses described, we then conducted separate linear mixed models which included modality (visual feature match to crossmodal object vs. sound feature match to crossmodal object) and congruency (congruent vs. incongruent)… There was no significant relationship between modality and congruency in any ROI between Day 2 and Day 2 (F1,346-368 between 0.00 and 1.06, p between 0.30 and 0.99), between Day 2 and Day 4 (F1,346-368 between 0.021 and 0.91, p between 0.34 and 0.89), or between Day 4 and Day 4 (F1,346-368 between 0.01 and 3.05, p between 0.082 and 0.93). However, exploratory analyses revealed that perirhinal cortex was the only region without a modality-specific bias and where the unimodal feature runs were not significantly correlated to the crossmodal object runs after crossmodal learning (Supplemental Figure S5).” – pg. 14

      “Taken together, the overall pattern of results suggests that representations of the crossmodal objects in perirhinal cortex were heavily influenced by their consistent visual features before crossmodal learning. However, the crossmodal object representations were no longer influenced by the component visual features after crossmodal learning (Figure 5, Supplemental Figure S5). Additional exploratory analyses did not find evidence of experience-dependent changes in the hippocampus or inferior parietal lobes (Supplemental Figure S4c-e).” – pg. 14

      “The voxel-wise matrix for Unimodal Feature runs on Day 4 was correlated to the voxel-wise matrix for Crossmodal Object runs on Day 4 (see Figure 5 in the main text for an example). We compared the average pattern similarity (z-transformed Pearson correlation) between shape (blue) and sound (orange) features specifically after crossmodal learning. Consistent with Figure 5b, perirhinal cortex was the only region without a modality-specific bias. Furthermore, perirhinal cortex was the only region where the representations of both the visual and sound features were not significantly correlated to the crossmodal objects. By contrast, every other region maintained a modality-specific bias for either the visual or sound features. These results suggest that perirhinal cortex representations were transformed with experience, such that the initial visual shape representations (Figure 5a) were no longer grounded in a single modality after crossmodal learning. Furthermore, these results suggest that crossmodal learning formed an integrative code different from the unimodal features in perirhinal cortex, as the visual and sound features were not significantly correlated with the crossmodal objects. * p < 0.05, ** p < 0.01, *** p < 0.001. Horizontal lines within brain regions indicate a significant main effect of modality. Vertical asterisks denote pattern similarity comparisons relative to 0.” – Supplemental Figure S5
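
      The pattern similarity measure named in this caption (a z-transformed Pearson correlation) can be sketched as follows. This is a minimal illustration that assumes two equal-length voxel patterns from a single ROI and uses the standard Fisher transform (the inverse hyperbolic tangent of r). The function names and toy patterns are hypothetical, and the manuscript's actual pipeline (preprocessing, run averaging, matrix construction) is not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length voxel patterns."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("patterns must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("a constant pattern has no defined correlation")
    return sum(a * b for a, b in zip(dx, dy)) / denom


def pattern_similarity(x, y):
    """Fisher z-transformed Pearson correlation between two patterns."""
    r = pearson(x, y)
    # Clamp r slightly inside (-1, 1) so that atanh stays finite.
    r = max(min(r, 0.999999), -0.999999)
    return math.atanh(r)


# Hypothetical toy patterns over six voxels.
feature_pattern = [0.2, 1.1, -0.4, 0.9, 0.0, -1.3]
similar_object = [0.3, 1.0, -0.5, 1.2, 0.1, -1.0]
opposite_object = [-v for v in feature_pattern]

z_similar = pattern_similarity(feature_pattern, similar_object)
z_opposite = pattern_similarity(feature_pattern, opposite_object)
r_orthogonal = pearson([1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0])
```

      In these terms, a similarity near 0 corresponds to patterns that are uncorrelated (orthogonal after mean-centering), while a similarity below 0 corresponds to anticorrelated patterns.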

      “We found that the temporal pole and perirhinal cortex – two anterior temporal lobe structures – came to represent new crossmodal object concepts with learning, such that the acquired crossmodal object representations were different from the representation of the constituent unimodal features (Figure 5, 6). Intriguingly, the perirhinal cortex was by default biased towards visual shape, but this initial visual bias was attenuated with experience (Figure 3c, 5, Supplemental Figure S5). Within the perirhinal cortex, the acquired crossmodal object concepts (measured after crossmodal learning) became less similar to their original component unimodal features (measured at baseline before crossmodal learning); Figure 5, 6, Supplemental Figure S5. This is consistent with the idea that object representations in perirhinal cortex integrate the component sensory features into a whole that is different from the sum of the component parts, which might be a mechanism by which object concepts obtain their abstraction…. As one solution to the crossmodal binding problem, we suggest that the temporal pole and perirhinal cortex form unique crossmodal object representations that are different from the distributed features in sensory cortex (Figure 4, 5, 6, Supplemental Figure S5). However, the nature by which the integrative code is structured and formed in the temporal pole and perirhinal cortex following crossmodal experience – such as through transformations, warping, or other factors – is an open question and an important area for future investigation.” – pg. 18

      In sum, the authors have collected a fantastic dataset that has the potential to answer questions about the formation of multimodal object representations in the brain. A more precise delineation of different theoretical accounts and additional analyses are needed to provide convincing support for the theory that “explicit integrative” multimodal object representations are formed during learning.

      We thank the reviewer for the positive comments and helpful feedback. We hope that our changes to our wording and clarifications to our methodology now more clearly support the central goal of our study: to find evidence of crossmodal integrative coding different from the original unimodal feature parts in anterior temporal lobe structures. We furthermore agree that future research is needed to delineate the structure of the integrative code that emerges with experience in the anterior temporal lobes.

      Reviewer #3 (Public Review):

      This paper uses behavior and functional brain imaging to understand how neural and cognitive representations of visual and auditory stimuli change as participants learn associations among them. Prior work suggests that areas in the anterior temporal (ATL) and perirhinal cortex play an important role in learning/representing cross-modal associations, but the hypothesis has not been directly tested by evaluating behavior and functional imaging before and after learning cross-modal associations. The results show that such learning changes both the perceived similarities amongst stimuli and the neural responses generated within ATL and perirhinal regions, providing novel support for the view that cross-modal learning leads to a representational change in these regions.

      This work has several strengths. It tackles an important question for current theories of object representation in the mind and brain in a novel and quite direct fashion, by studying how these representations change with cross-modal learning. As the authors note, little work has directly assessed representational change in ATL following such learning, despite the widespread view that ATL is critical for such representation. Indeed, such direct assessment poses several methodological challenges, which the authors have met with an ingenious experimental design. The experiment allows the authors to maintain tight control over both the familiarity and the perceived similarities amongst the shapes and sounds that comprise their stimuli so that the observed changes across sessions must reflect learned cross-modal associations among these. I especially appreciated the creation of physical objects that participants can explore and the approach to learning in which shapes and sounds are initially experienced independently and later in an associated fashion. In using multi-echo MRI to resolve signals in ventral ATL, the authors have minimized a key challenge facing much work in this area (namely the poor SNR yielded by standard acquisition sequences in ventral ATL). The use of both univariate and multivariate techniques was well-motivated and helpful in testing the central questions. The manuscript is, for the most part, clearly written, and nicely connects the current work to important questions in two literatures, specifically (1) the hypothesized role of the perirhinal cortex in representing/learning complex conjunctions of features and (2) the tension between purely embodied approaches to semantic representation vs the view that ATL regions encode important amodal/crossmodal structure.

      There are some places in the manuscript that would benefit from further explanation and methodological detail. I also had some questions about the results themselves and what they signify about the roles of ATL and the perirhinal cortex in object representation.

      We thank the reviewer for their positive feedback and address the comments in the below point-by-point responses.

      (A) I found the terms "features" and "objects" to be confusing as used throughout the manuscript, and sometimes inconsistent. I think by "features" the authors mean the shape and sound stimuli in their experiment. I think by "object" the authors usually mean the conjunction of a shape with a sound---for instance, when a shape and sound are simultaneously experienced in the scanner, or when the participant presses a button on the shape and hears the sound. The confusion comes partly because shapes are often described as being composed of features, not features in and of themselves. (The same is sometimes true of sounds). So when reading "features" I kept thinking the paper referred to the elements that went together to comprise a shape. It also comes from ambiguous use of the word object, which might refer to (a) the 3D-printed item that people play with, which is an object, or (b) a visually-presented shape (for instance, the localizer involved comparing an "object" to a "phase-scrambled" stimulus---here I assume "object" refers to an intact visual stimulus and not the joint presentation of visual and auditory items). I think the design, stimuli, and results would be easier for a naive reader to follow if the authors used the terms "unimodal representation" to refer to cases where only visual or auditory input is presented, and "cross-modal" or "conjoint" representation when both are present.

      We thank the reviewer for this suggestion and agree. We have replaced the terms “features” and “objects” with “unimodal” and “crossmodal” in the title, text, and figures throughout the manuscript for consistency (i.e., “crossmodal binding problem”). To simplify the terminology, we have also removed the localizer results.

      (B) There are a few places where I wasn't sure what exactly was done, and where the methods lacked sufficient detail for another scientist to replicate what was done. Specifically:

      (1) The behavioral study assessing perceptual similarity between visual and auditory stimuli was unclear. The procedure, stimuli, number of trials, etc, should be explained in sufficient detail in methods to allow replication. The results of the study should also minimally be reported in the supplementary information. Without an understanding of how these studies were carried out, it was very difficult to understand the observed pattern of behavioral change. For instance, I initially thought separate behavioral blocks were carried out for visual versus auditory stimuli, each presented in isolation; however, the effects contrast congruent and incongruent stimuli, which suggests these decisions must have been made for the conjoint presentation of both modalities. I'm still not sure how this worked. Additionally, the manuscript makes a brief mention that similarity judgments were made in the context of "all stimuli," but I didn't understand what that meant. Similarity ratings are hugely sensitive to the contrast set with which items appear, so clarity on these points is pretty important. A strength of the design is the contention that shape and sound stimuli were psychophysically matched, so it is important to show the reader how this was done and what the results were.

      We agree and apologize for the lack of sufficient detail in the original manuscript. We now include much more detail about the similarity rating task. The methodology and results of the behavioral rating experiments are now shown in Supplemental Figure S1. In Figure S1a, the similarity ratings are visualized on a multidimensional scaling plot. The triangular geometry for shape (blue) and sound (orange) indicates that the subjective similarity was equated within each unimodal feature across individual participants. Quantitatively, there was no difference in similarity between the congruent and incongruent pairings in Figure S1b and Figure S1c prior to crossmodal learning. In addition to providing more information on these methods in the Supplemental Information, we also now provide a more detailed description of the task in the manuscript itself. For convenience, we reproduce these sections below.

      “Pairwise Similarity Task. Using the same task as the stimulus validation procedure (Supplemental Figure S1a), participants provided similarity ratings for all combinations of the 3 validated shapes and 3 validated sounds (each of the six features was rated in the context of every other feature in the set, with 4 repeats of the same feature, for a total of 72 trials). More specifically, three stimuli were displayed on each trial, with one at the top and two at the bottom of the screen in the same procedure as we have used previously27. The 3D shapes were visually displayed as a photo, whereas sounds were displayed on screen in a box that could be played over headphones when clicked with the mouse. The participant made an initial judgment by selecting the more similar stimulus on the bottom relative to the stimulus on the top. Afterwards, the participant made a similarity rating between each bottom stimulus with the top stimulus from 0 being no similarity to 5 being identical. This procedure ensured that ratings were made relative to all other stimuli in the set.” – pg. 28

      “Pairwise similarity task and results. In the initial stimulus validation experiment, participants provided pairwise ratings for 5 sounds and 3 shapes. The shapes, which had been selected from a well-characterized perceptually uniform stimulus space27, were equated in their subjective similarity, and the pairwise ratings followed the same procedure as described in ref 27. Based on this initial experiment, we then selected the 3 sounds that were most closely equated in their subjective similarity. (a) 3D-printed shapes were displayed as images, whereas sounds were displayed in a box that could be played when clicked by the participant. Ratings were averaged to produce a similarity matrix for each participant, and then averaged to produce a group-level similarity matrix. Shown as triangular representational geometries recovered from multidimensional scaling in the above, shapes (blue) and sounds (orange) were approximately equated in their subjective similarity. These features were then used in the four-day crossmodal learning task. (b) Behavioral results from the four-day crossmodal learning task paired with multi-echo fMRI described in the main text. Before crossmodal learning, there was no difference in similarity between shape and sound features associated with congruent objects compared to incongruent objects – indicating that similarity was controlled at the unimodal feature-level. After crossmodal learning, we observed a robust shift in the magnitude of similarity. The shape and sound features associated with congruent objects were now significantly more similar than the same shape and sound features associated with incongruent objects (p < 0.001), evidence that crossmodal learning changed how participants experienced the unimodal features (observed in 17/18 participants). (c) We replicated this learning-related shift in pattern similarity with a larger sample size (n = 44; observed in 38/44 participants). *** denotes p < 0.001. Horizontal lines denote the comparison of congruent vs. incongruent conditions.” – Supplemental Figure S1

      (2) The experiences through which participants learned/experienced the shapes and sounds were unclear. The methods mention that they had one minute to explore/palpate each shape and that these experiences were interleaved with other tasks, but it is not clear what the other tasks were, how many such exploration experiences occurred, or how long the total learning time was. The manuscript also mentions that participants learn the shape-sound associations with 100% accuracy but it isn't clear how that was assessed. These details are important partly b/c it seems like very minimal experience to change neural representations in the cortex.

      We apologize for the lack of detail and agree with the reviewer’s suggestions – we now include much more information in the methods section. Each behavioral day required about 1 hour of total time to complete, and indeed, participants rapidly learned their associations with minimal experience. For example:

      “Behavioral Tasks. On each behavioral day (Day 1 and Day 3; Figure 2), participants completed the following tasks, in this order: Exploration Phase, one Unimodal Feature 1-back run (26 trials), Exploration Phase, one Crossmodal 1-back run (26 trials), Exploration Phase, Pairwise Similarity Task (24 trials), Exploration Phase, Pairwise Similarity Task (24 trials), Exploration Phase, Pairwise Similarity Task (24 trials), and finally, Exploration Phase. To verify learning on Day 3, participants also additionally completed a Learning Verification Task at the end of the session.” – pg. 27

      “The overall procedure ensured that participants extensively explored the unimodal features on Day 1 and the crossmodal objects on Day 3. The Unimodal Feature and the Crossmodal Object 1-back runs administered on Day 1 and Day 3 served as practice for the neuroimaging sessions on Day 2 and Day 4, during which these 1-back tasks were completed. Each behavioral session required less than 1 hour of total time to complete.” – pg. 27

      “Learning Verification Task (Day 3 only). As the final task on Day 3, participants completed a task to ensure that participants successfully formed their crossmodal pairing. All three shapes and sounds were randomly displayed in 6 boxes on a display. Photos of the 3D shapes were shown, and sounds were played by clicking the box with the mouse cursor. The participant was cued with either a shape or sound, and then selected the corresponding paired feature. At the end of Day 3, we found that all participants reached 100% accuracy on this task (10 trials).” – pg. 29

      (3) I didn't understand the similarity metric used in the multivariate imaging analyses. The manuscript mentions Z-scored Pearson's r, but I didn't know if this meant (a) many Pearson coefficients were computed and these were then Z-scored, so that 0 indicates a value equal to the mean Pearson correlation and 1 is equal to the standard deviation of the correlations, or (b) whether a Fisher Z transform was applied to each r (so that 0 means r was also around 0). From the interpretation of some results, I think the latter is the approach taken, but in general, it would be helpful to see, in Methods or Supplementary information, exactly how similarity scores were computed, and why that approach was adopted. This is particularly important since it is hard to understand the direction of some key effects.

      The reviewer is correct that the Fisher Z transform was applied to each individual r before averaging the correlations. This approach is generally recommended when averaging correlations (see Corey, Dunlap, & Burke, 1998). We are now clearer on this point in the manuscript:

      “The z-transformed Pearson’s correlation coefficient was used as the distance metric for all pattern similarity analyses. More specifically, each individual Pearson correlation was Fisher z-transformed and then averaged (see 61).” – pg. 32
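
      The averaging step described above can be illustrated with a minimal sketch. It assumes the similarity score is the mean of the Fisher z values, so that a score of 0 corresponds to r of about 0. The function names and the example correlations are hypothetical and are not taken from the authors' analysis code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length activity patterns."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("patterns must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r):
    """Fisher z-transform: z = atanh(r) = 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)


def mean_fisher_z(correlations):
    """Transform each r individually, then average the z values."""
    z_values = [fisher_z(r) for r in correlations]
    return sum(z_values) / len(z_values)


# Hypothetical per-run correlations for one participant and one ROI.
example_rs = [0.10, 0.25, -0.05]
similarity = mean_fisher_z(example_rs)
```

      Because each r is transformed before averaging, a mean of 0 indicates no net correlation, which matches reading (b) in the reviewer's question rather than a z-score taken across many coefficients.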

      (C) From Figure 3D, the temporal pole mask appears to exclude the anterior fusiform cortex (or the ventral surface of the ATL generally). If so, this is a shame, since that appears to be the locus most important to cross-modal integration in the "hub and spokes" model of semantic representation in the brain. The observation in the paper that the perirhinal cortex seems initially biased toward visual structure while more superior ATL is biased toward auditory structure appears generally consistent with the "graded hub" view expressed, for instance, in our group's 2017 review paper (Lambon Ralph et al., Nature Reviews Neuroscience). The balance of visual- versus auditory-sensitivity in that work appears balanced in the anterior fusiform, just a little lateral to the anterior perirhinal cortex. It would be helpful to know if the same pattern is observed for this area specifically in the current dataset.

      We thank the reviewer for this suggestion. After close inspection of Lambon Ralph et al. (2017), we believe that our perirhinal cortex mask appears to be overlapping with the ventral ATL/anterior fusiform region that the reviewer mentions. See Author response image 1 for a visual comparison:

      Author response image 1.

      The top four figures are sampled from Lambon Ralph et al (2017), whereas the bottom two figures visualize our perirhinal cortex mask (white) and temporal pole mask (dark green) relative to the fusiform cortex. The ROIs visualized were defined from the Harvard-Oxford atlas.

      We now mention this area of overlap in our manuscript and link it to the hub and spokes model:

      “Notably, our perirhinal cortex mask overlaps with a key region of the ventral anterior temporal lobe thought to be the central locus of crossmodal integration in the “hub and spokes” model of semantic representations.9,50” – pg. 20

      (D) While most effects seem robust from the information presented, I'm not so sure about the analysis of the perirhinal cortex shown in Figure 5. This compares (I think) the neural similarity evoked by a unimodal stimulus ("feature") to that evoked by the same stimulus when paired with its congruent stimulus in the other modality ("object"). These similarities show an interaction with modality prior to cross-modal association, but no interaction afterward, leading the authors to suggest that the perirhinal cortex has become less biased toward visual structure following learning. But the plots in Figures 4a and b are shown against different scales on the y-axes, obscuring the fact that all of the similarities are smaller in the after-learning comparison. Since the perirhinal interaction was already the smallest effect in the pre-learning analysis, it isn't really surprising that it drops below significance when all the effects diminish in the second comparison. A more rigorous test would assess the reliability of the interaction of comparison (pre- or post-learning) with modality. The possibility that perirhinal representations become less "visual" following cross-modal learning is potentially important so a post hoc contrast of that kind would be helpful.

      We apologize for the lack of clarity. We conducted a linear mixed model to assess the interaction between modality and crossmodal learning day (before and after crossmodal learning) in the perirhinal cortex as described by the reviewer. The critical interaction was significant, which is now clarified in the text as well as in the rescaled figure plots.

      “To investigate this effect in perirhinal cortex more specifically, we conducted a linear mixed model to directly compare the change in the visual bias of perirhinal representations from before crossmodal learning to after crossmodal learning (green regions in Figure 5a vs. 5b). Specifically, the linear mixed model included learning day (before vs. after crossmodal learning) and modality (visual feature match to crossmodal object vs. sound feature match to crossmodal object). Results revealed a significant interaction between learning day and modality in the perirhinal cortex (F1,775 = 5.56, p = 0.019, η2 = 0.071), meaning that the baseline visual shape bias observed in perirhinal cortex (green region of Figure 5a) was significantly attenuated with experience (green region of Figure 5b). After crossmodal learning, a given shape no longer invoked significant pattern similarity between objects that had the same shape but differed in terms of what they sounded like. Taken together, these results suggest that prior to learning the crossmodal objects, the perirhinal cortex had a default bias toward representing the visual shape information and was not representing sound information of the crossmodal objects. After crossmodal learning, however, the visual shape bias in perirhinal cortex was no longer present. That is, with crossmodal learning, the representations within perirhinal cortex started to look less like the visual features that comprised the crossmodal objects, providing evidence that the perirhinal representations were no longer predominantly grounded in the visual modality.” – pg. 13

      We note that not all effects drop in Figure 5b (even in regions with a similar numerical pattern similarity to PRC, like the hippocampus – also see Supplemental Figure S5 for a comparison for patterns only on Day 4), suggesting that the change in visual bias in PRC is not simply due to noise.

      “Importantly, the change in pattern similarity in the perirhinal cortex across learning days (Figure 5) is unlikely to be driven by noise, poor alignment of patterns across sessions, or generally reduced responses. Other regions with numerically similar pattern similarity to perirhinal cortex did not change across learning days (e.g., visual features x crossmodal objects in A1 in Figure 5; the exploratory ROI hippocampus with numerically similar pattern similarity to perirhinal cortex also did not change in Supplemental Figure S4c-d).” – pg. 14

      (E) Is there a reason the authors did not look at representation and change in the hippocampus? As a rapid-learning, widely-connected feature-binding mechanism, and given the fairly minimal amount of learning experience, it seems like the hippocampus would be a key area of potential import for the cross-modal association. It also looks as though the hippocampus is implicated in the localizer scan (Figure 3c).

      We thank the reviewer for this suggestion and now include additional analyses for the hippocampus. We found no evidence of crossmodal integrative coding different from the unimodal features. Rather, the hippocampus seems to represent the convergence of unimodal features, as evidenced by …[can you give some pithy description for what is meant by “convergence” vs “integration”?]. We provide these results in the Supplemental Information and describe them in the main text:

      “Analyses for the hippocampus (HPC) and inferior parietal lobe (IPL). (a) In the visual vs. auditory univariate analysis, there was no visual or sound bias in HPC, but there was a bias towards sounds that increased numerically after crossmodal learning in the IPL. (b) Pattern similarity analyses between unimodal features associated with congruent objects and incongruent objects. Similar to Supplemental Figure S3, there was no main effect of congruency in either region. (c) When we looked at the pattern similarity between Unimodal Feature runs on Day 2 to Crossmodal Object runs on Day 2, we found that there was significant pattern similarity when there was a match between the unimodal feature and the crossmodal object (e.g., pattern similarity > 0). This pattern of results held when (d) correlating the Unimodal Feature runs on Day 2 to Crossmodal Object runs on Day 4, and (e) correlating the Unimodal Feature runs on Day 4 to Crossmodal Object runs on Day 4. Finally, (f) there was no significant pattern similarity between Crossmodal Object runs before learning correlated to Crossmodal Object after learning in HPC, but there was significant pattern similarity in IPL (p < 0.001). Taken together, these results suggest that both HPC and IPL are sensitive to visual and sound content, as the (c, d, e) unimodal feature-level representations were correlated to the crossmodal object representations irrespective of learning day. However, there was no difference between congruent and incongruent pairings in any analysis, suggesting that HPC and IPL did not represent crossmodal objects differently from the component unimodal features. 
For these reasons, HPC and IPL may represent the convergence of unimodal feature representations (i.e., because HPC and IPL were sensitive to both visual and sound features), but our results do not seem to support these regions in forming crossmodal integrative coding distinct from the unimodal features (i.e., because representations in HPC and IPL did not differentiate the congruent and incongruent conditions and did not change with experience). * p < 0.05, ** p < 0.01, *** p < 0.001. Asterisks above or below bars indicate a significant difference from zero. Horizontal lines within brain regions in (a) reflect an interaction between modality and learning day, whereas horizontal lines within brain regions in the other panels reflect main effects of (b) learning day, (c-e) modality, or (f) congruency.” – Supplemental Figure S4.

      “Notably, our perirhinal cortex mask overlaps with a key region of the ventral anterior temporal lobe thought to be the central locus of crossmodal integration in the “hub and spokes” model of semantic representations.9,50 However, additional work has also linked other brain regions to the convergence of unimodal representations, such as the hippocampus51,52,53 and inferior parietal lobes.54,55 This past work on the hippocampus and inferior parietal lobe does not necessarily address the crossmodal binding problem that was the main focus of our present study, as previous findings often do not differentiate between crossmodal integrative coding and the convergence of unimodal feature representations per se. Furthermore, previous studies in the literature typically do not control for stimulus-based factors such as experience with unimodal features, subjective similarity, or feature identity that may complicate the interpretation of results when determining regions important for crossmodal integration. Indeed, we found evidence consistent with the convergence of unimodal feature-based representations in both the hippocampus and inferior parietal lobes (Supplemental Figure S4), but no evidence of crossmodal integrative coding different from the unimodal features. The hippocampus and inferior parietal lobes were both sensitive to visual and sound features before and after crossmodal learning (see Supplemental Figure S4c-e). Yet the hippocampus and inferior parietal lobes did not differentiate between the congruent and incongruent conditions or change with experience (see Supplemental Figure S4).” – pg. 20

      (F) The direction of the neural effects was difficult to track and understand. I think the key observation is that TP and PRh both show changes related to cross-modal congruency - but still it would be helpful if the authors could articulate, perhaps via a schematic illustration, how they think representations in each key area are changing with the cross-modal association. Why does the temporal pole come to activate less for congruent than incongruent stimuli (Figure 3)? And why do TP responses grow less similar to one another for congruent relative to incongruent stimuli after learning (Figure 4)? Why are incongruent stimulus similarities anticorrelated in their perirhinal responses following cross-modal learning (Figure 6)?

      We thank the reviewer for identifying this issue, which was also raised by the other reviewers. The reviewer is correct that the key observation is that the TP and PRC both show changes related to crossmodal congruency (given that the unimodal features were equated in the methodological design). However, the structure of the integrative code is less clear, which we now emphasize in the main text. Our findings provide evidence of a crossmodal integrative code that is different from the unimodal features, and future studies are needed to better understand the structure of how such a code might emerge. We now more clearly highlight this distinction throughout the paper:

      “By contrast, perirhinal cortex may be involved in pattern separation following crossmodal experience. In our task, participants had to differentiate congruent and incongruent objects constructed from the same three shape and sound features (Figure 2). An efficient way to solve this task would be to form distinct object-level outputs from the overlapping unimodal feature-level inputs such that congruent objects are made to be orthogonal from the representations before learning (i.e., measured as pattern similarity equal to 0 in the perirhinal cortex; Figure 5b, 6, Supplemental Figure S5), whereas non-learned incongruent objects could be made to be dissimilar from the representations before learning (i.e., anticorrelation, measured as pattern similarity less than 0 in the perirhinal cortex; Figure 6). Because our paradigm could decouple neural responses to the learned object representations (on Day 4) from the original component unimodal features at baseline (on Day 2), these results could be taken as evidence of pattern separation in the human perirhinal cortex.11,12 However, our pattern of results could also be explained by other types of crossmodal integrative coding. For example, incongruent object representations may be less stable than congruent object representations, such that incongruent object representations are warped to a greater extent than congruent object representations (Figure 6).” – pg. 18

      “As one solution to the crossmodal binding problem, we suggest that the temporal pole and perirhinal cortex form unique crossmodal object representations that are different from the distributed features in sensory cortex (Figure 4, 5, 6, Supplemental Figure S5). However, the nature by which the integrative code is structured and formed in the temporal pole and perirhinal cortex following crossmodal experience – such as through transformations, warping, or other factors – is an open question and an important area for future investigation. Furthermore, these anterior temporal lobe structures may be involved with integrative coding in different ways. For example, the crossmodal object representations measured after learning were found to be related to the component unimodal feature representations measured before learning in the temporal pole but not the perirhinal cortex (Figure 5, 6, Supplemental Figure S5). Moreover, pattern similarity for congruent shape-sound pairs was lower than the pattern similarity for incongruent shape-sound pairs after crossmodal learning in the temporal pole but not the perirhinal cortex (Figure 4b, Supplemental Figure S3a). As one interpretation of this pattern of results, the temporal pole may represent new crossmodal objects by combining previously learned knowledge.8,9,10,11,13,14,15,33 Specifically, research into conceptual combination has linked the anterior temporal lobes to compound object concepts such as “hummingbird”.34,35,36 For example, participants during our task may have represented the sound-based “humming” concept and visually-based “bird” concept on Day 1, forming the crossmodal “hummingbird” concept on Day 3 (Figure 1, 2), which may recruit less activity in temporal pole than an incongruent pairing such as “barking-frog”. 
For these reasons, the temporal pole may form a crossmodal object code based on pre-existing knowledge, resulting in reduced neural activity (Figure 3d) and pattern similarity towards features associated with learned objects (Figure 4b).” – pg. 18

      This work represents a key step in our advancing understanding of object representations in the brain. The experimental design provides a useful template for studying neural change related to the cross-modal association that may prove useful to others in the field. Given the broad variety of open questions and potential alternative analyses, an open dataset from this study would also likely be a considerable contribution to the field.

    1. Author Response

      Reviewer #1 (Public Review):

      This is a well performed study to demonstrate the antiviral function and viral antagonism of the dynein activating adapter NINL. The results are clearly presented to support the conclusions.

      This reviewer has only one minor suggestion to improve the manuscript.

      Add a discussion of (1) why the fold reductions among VSV, SinV and CVB3 were different in the NINL KO cells and (2) why the fold reductions of VSV were different between the NINL KO A549 and U-2 OS cells.

      Thank you for this suggestion. We have amended the results section to include additional information about these observations and possible explanations for these results.

      Reviewer #2 (Public Review):

      This manuscript is of interest to readers interested in host-viral co-evolution. This study has identified a novel human-virus interaction point, NINL-viral 3C protease, where NINL is actively evolving under the selection pressure of viral infection and viral 3Cpro cleavage. This study demonstrates that the viral 3Cpro-mediated cleavage of host NINL disrupts its adaptor function in dynein motor-mediated cargo transportation to the centrosome, and this disruption is both host- and virus-specific. In addition, this paper indicates the role of NINL in the IFN signaling pathway. Data shown in this manuscript support the major claims.

      In this paper, the authors have identified a novel host-viral interaction, where viral 3C proteases (3Cpro) cleave at specific sites on a host activating adaptor of dynein intracellular transportation machinery, ninein-like protein (NINL or NLP in short) and inhibit its role in the antiviral innate immune response.

      The authors first found that, unlike other activating adaptors of dynein intracellular transportation machinery, NINL (or NLP) is rapidly evolving. Thus, the authors hypothesized that this rapid evolution of NINL was caused by its interaction with viral infection. The authors found that viruses replicated to higher levels in NINL knock-out (KO) cells than in wild-type (WT) cells and that the replication level was not attenuated upon IFNa treatment in NINL KO cells, unlike in WT cells. Next, the authors investigated the role of NINL in type I IFN-mediated immune response and found that the induction of Janus kinase/signal transducer and activator of transcription (JAK/STAT) genes was attenuated in NINL KO cells upon IFNa treatment. The authors further showed that the reduction in replication of an IFNa-sensitive Vaccinia virus mutant upon IFNa treatment was decreased in NINL KO A549 cells compared to WT cells. The authors further showed that the virus antagonized NINL function by cleaving it with viral 3Cpro at its specific cleavage sites. A NINL-peroxisome ligation-based cargo trafficking visualization assay showed that the redistribution of immobile membrane-bound peroxisomes was disrupted by cleavage of NINL or viral infection.

      This paper has revealed a novel host-virus interaction, and an antiviral function of a rapidly evolving activating adaptor of dynein intracellular transportation machinery, NINL. The major conclusions of this paper are well supported by data, but several aspects can be improved.

      1) It would be necessary to include a couple of other pathways involved in innate immune response besides JAK/STAT pathway.

      We are very interested in this question as well. Our RNAseq data (Supplementary file 4 and Figure 3 – Figure supplement 4) suggest that there are several transcriptional changes that result from NINL KO. Our goal in this manuscript was to focus on IFN signaling in order to understand this specific effect of NINL KO since it might have wide-ranging consequences on viral replication. While we agree that broadening our studies to other signaling pathways, including other pathways involved in innate immune response, is a good idea, we feel that those experiments would take longer than two months to perform and therefore fall outside of the scope of this paper.

      2) The in-cell cleavages of NINL by viral 3Cpros were well demonstrated and supported by data of high quality. A direct biochemical demonstration of the cleavage is needed with purified proteins.

      We agree with the reviewer that a direct biochemical cleavage assay would further demonstrate that viral 3Cpros cleave NINL specifically. However, our attempts to purify full-length NINL have been unsuccessful due to solubility issues (see example gel below), which is not surprising given that NINL is a >150 kDa human protein that has multiple surfaces that bind to other human proteins. As such, we focused our efforts on in-cell cleavage assays using specificity controls for cleavage. Specifically, we used catalytically inactive CVB3 3Cpro to show a dependence on protease catalytic activity and a variety of NINL constructs in which the glutamine in the P1 position is replaced by an arginine to show site specificity of cleavage. Notably, the cleavage sites in NINL that we mapped using this mutagenesis were predicted bioinformatically from known sites of 3Cpro cleavage in viral polyproteins, further indicating that cleavage is 3Cpro-dependent. We believe these results thus demonstrate that cleavage of NINL is dependent on viral protease activity and occurs in a sequence-specific manner. In light of the difficulty of purifying full-length NINL that would make biochemical experiments very challenging and likely take longer than two months to perform, we believe that our in cell data should be sufficient to demonstrate activity-dependent site-specific cleavage of NINL by viral 3Cpros.

      Sypro stained SDS-PAGE gel showing supernatant (S) and insoluble pellet (P) fractions across multiple purifications with altered buffer conditions.

      3) The author used different cell types in different assays. Explain the rationale with a sentence for each assay.

      Throughout this work, we chose to use a variety of cell lines for specific purposes. A549 cells were chosen as our main cell line as they are widely used in virology, are susceptible to the viruses we used, are responsive to interferon, and express both NINL and our control NIN at moderate levels. In the case of our virology and ISG expression data, we performed the same experiments with NINL KOs in other cell lines to confirm that the phenotypes we observed in A549 cells could be attributed to the absence of NINL rather than off-target CRISPR perturbations or cell-line specific effects. All cleavage experiments were performed in HEK293T cells for their ease of transfection and protein expression. The inducible peroxisome trafficking assays were performed in U-2 OS cells as their morphology is ideal for observing the spatial organization of peroxisomes via confocal microscopy, and because we had recapitulated the virology results and ISG expression results in those cells. At the suggestion of the reviewer, we have amended the text to include rationales where appropriate.

      4) While cell-based assays well support the conclusions in this paper, further demonstration in vivo would be helpful to provide an implication on the pathogenicity impact of NINL.

      We agree. However, we believe that examining the impact of the loss of or antagonism of NINL on the pathogenesis of infectious diseases in an in vivo model is outside the scope of this study.

      In summary, this manuscript contributes to a novel antiviral target. In addition, it is important to understand the host-virus co-evolution. The use of the evolution signatures to identify the "conflict point" between host and virus is novel.

    1. Author Response

      Reviewer #1 (Public Review):

      In the article "Neuroendocrinology of the lung revealed by single cell RNA sequencing", Kuo et al. described various aspects of pulmonary neuroendocrine cells (PNECs) including the scRNA-seq profile of one human lung carcinoid sample. Overall, although this manuscript does not have any specific storyline, it is informative and would be an asset for researchers exploring various new roles of PNECs.

      Thank you for appreciating the significance of the data presented. Our storyline focuses on the newly uncovered molecular diversity of PNECs and the extraordinary repertoire of peptidergic signals they express and cell types these signals can directly target in (and outside) the lung, in mice and human, and in health and disease (human carcinoid tumor).

      Major comments:

      The major concern about the work is most results are preliminary, and at a descriptive level, conclusions or sub-conclusions are derived from scRNA-seq analysis only, lacking in-depth functional analysis and validation in other methods or systems. There are many open-end results that have been predicted by the authors based on their scRNA-seq data analysis without functional validation. In order to give them a constructive roadmap, it would be better to investigate literature and put them in a potential or probable hypothesis by citing the available literature. This should be done in each section of the result part. The paper lacks a main theme or specific biology question to address. In addition, the description about the human lung carcinoid by scRNA-seq is somehow disconnected from the main study line. Also, these results are derived from the study on only one single patient, lacking statistical power.

      We agree that much of the data and analysis presented in the paper is descriptive and hypothesis-generating for PNECs, however we do not consider it preliminary. We focused on validating two key conclusions from the scRNA-seq analysis: PNECs are extraordinarily diverse molecularly (as validated by multiplex in situ hybridization and immunostaining) and they express many different combinations of peptidergic signals (and appear to package them in separate vesicles). From the lung expression profiles of the cognate receptors, we also predicted the direct lung targets of the dozens of new PNEC peptidergic signals we uncovered, and validated the cell target (PSN4, a recently identified subtype of pulmonary sensory neuron) of one of the newly identified PNEC signals (the classic hormone angiotensin) by confirming expression of the cognate receptor gene in PSN4 neurons that innervate PNECs and showing that the hormone can directly activate PSN4 neurons. The characterized human carcinoid provided evidence that during tumorigenesis, the amplified PNECs retain a memory (albeit imperfect) of the molecular subtype of PNEC from which they originated. As suggested by the Reviewer, we have provided more background in Results by adding additional citations from the literature to clarify the rationale for each analysis and what was known prior to the analysis. We feel that our paper provides a broad foundation for exploring the diversity and signaling functions of PNECs, and although each molecular type of PNEC and new PNEC peptidergic signal we uncovered and potential target cell in (and outside) the lung warrants follow up (as do the sensory and other properties of PNECs we inferred from their expression profiles), such studies will require the effort of many individuals in many labs studying both normal and disease physiology in mouse and human, and exploiting the data, hypotheses, approaches, and framework we provide.

      Reviewer #2 (Public Review):

      Pulmonary neuroendocrine cells (PNECs) are known to monitor oxygen levels in the airway and can serve as stem cells that repair the lung epithelium after injury. Due to their rarity, however, their functions are still poorly understood. To identify potential sensory functions of PNECs, the authors have used single-cell RNA-sequencing (scRNA-seq) to profile hundreds of mouse and human PNECs. They report that PNECs express over 40 distinct peptidergic genes, and over 150 distinct combinations of these genes can be detected. Receptors for these neuropeptides and peptide hormones are expressed in a wide range of lung cell types, suggesting that PNECs may have mechanical, thermal, acid, and oxygen sensory roles, among others. However, since some of these cognate receptors are not expressed in the lung, PNECs may also have systemic endocrine functions. Although these data are largely descriptive, the results represent a significant resource for understanding the potential roles of PNECs in normal biology as well as in pulmonary diseases and cancer and are likely to be relevant for understanding neuroendocrine cells in other tissue contexts.

      However, there are several aspects of the data analysis that are unclear and require clarification, most notably the definition of a neuroendocrine cell (points #1 and #2 below).

      1) Figure S1 shows the sorting strategy used for isolation of putative PNECs from Ascl1CreER/+; Rosa26ZsGreen/+ mice, and distinguishes neuroendocrine cells defined as ZsGreen+ EpCAM+ and "neural" cells defined as ZsGreen+ EpCAM-; the figure legend also refers to the ZsGreen+ EpCAM- cells as "control" cells. However, the table shown in panel D indicates that the NE population combines 112 ZsGreen+ EpCAM+ cells together with 64 ZsGreen+ EpCAM- cells to generate the 176 cells used for subsequent analyses. Why are these ZsGreen+ EpCAM- cells initially labeled as neural or control, but are then defined as neuroendocrine? If these do not express an epithelial marker, can they be rigorously considered as neuroendocrine?

      As explained above in the response to Essential Revision point 1, we define pulmonary neuroendocrine cells (PNECs) throughout the paper by their transcriptomic clustering and signatures, which includes the dozens of newly identified PNEC markers as well as the few extant marker genes available before this study (listed in Table S2). The confusion here arises from the two previously known markers (Ascl1 lineage marker ZsGreen, EpCAM) we used for flow sorting to enrich for these rare cells for transcriptomic profiling (Fig. S1). Although most of the cells with PNEC transcriptomic profiles were from the ZsGreenhi EpCAMhi sorted population (as expected), some were from the ZsGreenhi EpCAMlo sorted population. The latter resulted from the high EpCAM gating threshold we used during flow sorting, which excluded some PNECs with intermediate levels of surface EpCAM. Indeed, nearly all PNECs (> 95%) expressed EpCAM by scRNAseq, and there was no difference in EpCAM transcript levels or transcriptomic clustering of PNECs that were from the ZsGreenhi EpCAMhi vs. ZsGreenhi EpCAMlo sorted populations, as we now show in the new panels (C', C'') added to Fig S1C. This point is now clarified in the legend to Fig. S1C, and it nicely demonstrates that transcriptomic profiling is a more robust method of identifying PNECs than flow sorting based on two classical markers.

      2) Similarly, in the human scRNA-seq analysis, how were PNECs defined? The methods description states that these cells were identified by their expression of CALCA and ASCL1, but does not indicate whether they also expressed epithelial markers.

      Human PNECs were identified in the single cell transcriptomic analysis by the same strategy described above for mouse PNECs: by their transcriptomic clustering and signatures, which includes the dozens of newly identified PNEC markers as well as the few extant marker genes available before this study (listed in Table S2). In addition to expression of classic and new markers, the human PNEC cluster defined by scRNA-seq indeed showed the expected expression of epithelial markers (e.g., EPCAM, see dotplot below), like other epithelial cells.

      3) The presentation of sensitivity and specificity in Figure 1 is confusing and potentially misleading. According to Figure 1B, Pcsk1 and Nov are two of the top-ranked differentially expressed genes in PNECs with respect to both sensitivity and specificity. However, the specificity of these two genes appears to be lower than that of Scg5, Chgb, and several other genes, as suggested in Figure 1C and Figure S1E. In contrast, Chgb appears to have higher specificity and sensitivity than Pcsk1 in Figures 1C and E but is not shown in the list of markers in Figure 1B.

      As explained above in the response to Essential Revision point 2, because different marker features are important for different applications, we have provided several different graphical formats (Figs. 1B,C, Fig. S1E) and a table (Table S1) to aid in selection of the optimal markers for each application. Fig. 1B shows the most sensitive and specific PNEC markers identified by the ratio of the natural logs of the average expression of the marker in PNECs vs. non-PNEC epithelial cells (Table S1), and we have added a two-dimensional plot of this sensitivity and specificity for a large set of PNEC markers (new panel E of Fig. S1). The violin plots in Fig. 1C allow visual comparison of expression of selected markers across PNECs and 40 other lung cell types including non-epithelial cells (from our extensive mouse lung atlas in Travaglini, Nabhan et al, Nature 2020). Pcsk1 and Nov score high in the analysis of Fig. 1B because they are highly sensitive and specific markers within the pulmonary epithelium, and they are also valuable markers because they are highly expressed in PNECs. However, they appear slightly less specific in the violin plots of Fig. 1C (Pcsk1) and Fig. S1F (Nov) because of expression (though at much lower levels) in individual lung cell types outside the epithelium: Pcsk1 is also expressed at low levels in some Alox5+ lymphocytes, and Nov is expressed at low levels in some smooth muscle cells. Chgb is a new PNEC marker that did not make the cutoff for the list in Fig. 1B because it is expressed in a slightly higher percentage of non-PNEC epithelial cells than the markers shown, which ranked slightly above it by this metric (see Table S1).
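
      The conventional definitions behind such marker rankings can be sketched as follows. This is a minimal illustration, assuming that sensitivity means the fraction of PNECs with detected expression and specificity means the fraction of non-PNEC cells without it. The expression values, the detection threshold, and the function name are hypothetical, and this is not the ranking metric used for Fig. 1B or Table S1.

```python
def sensitivity_specificity(target_expr, other_expr, threshold=0.0):
    """Return (sensitivity, specificity) of a marker gene.

    target_expr: expression values in the cell type of interest (e.g. PNECs)
    other_expr:  expression values in all comparison cells
    threshold:   a value above this counts as "detected" (assumed cutoff)
    """
    detected_target = sum(v > threshold for v in target_expr)
    detected_other = sum(v > threshold for v in other_expr)
    sensitivity = detected_target / len(target_expr)
    specificity = 1 - detected_other / len(other_expr)
    return sensitivity, specificity


# Hypothetical expression values, for illustration only.
pnec_values = [2.1, 0.0, 3.4, 1.2, 0.8]
non_pnec_values = [0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

sens, spec = sensitivity_specificity(pnec_values, non_pnec_values)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

      Under these definitions a marker detected in slightly more comparison cells loses specificity even when its expression in the target cells is high, which is the situation described for Chgb.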

      4) The expression of serotonin biosynthetic genes in mouse versus human PNECs deserves some comment. The authors fail to detect the expression of Tph1 and Tph2 in any of the mouse PNECs analyzed, but TPH1 is expressed in 76% of the human PNECs (Table S8). Is it possible that Tph1 and Tph2 are not detected in the mouse scRNA-seq data due to gene drop-out? If serotonin signaling by mouse PNECs is due to protein reuptake, as implied on p. 5, is there a discrepancy between serotonin expression as detected by smFISH versus immunostaining?

      It is always possible that the failure to detect expression of Tph1 and Tph2 in the mouse scRNA-seq dataset is due to technical dropout; however, when we analyzed this in our other mouse PNEC scRNA-seq dataset, obtained using a microfluidic platform and also deeply sequenced (Ouadah et al, Cell 2019), we found similar values as in the previously analyzed dataset: no Tph2 expression was detected and only 3% (3 of 92) of PNECs had detected Tph1 expression, whereas 24% (22 of 92) had detected expression of the serotonin re-uptake transporter Slc6a4. Because our mouse and human scRNA-seq datasets were prepared similarly and sequenced to a similar depth (10^5 to 10^6 reads/cell), the difference observed in Tph1/TPH1 expression between mouse (0-3% PNECs) and human (76% PNECs) is more likely a true biological difference. We also analyzed serotonin levels in mouse PNECs by immunohistochemistry (not shown) and detected serotonin in nearly all (~90%) embryonic PNECs but only ~10% of adult PNECs. Systematic follow-up studies will be necessary to resolve the mechanism of serotonin biogenesis and uptake in PNECs, and the potential stage- and species-specific differences in these processes suggested by these initial data.

      5) The smFISH and immunostaining analyses are often presented without any indication of the number of independent replicate samples analyzed (e.g., Figure 2B, Figure 3F, G).

      The numbers of samples analyzed have been added (the values for Fig. 2B are given in the legend to Fig. 2C, the quantification of Fig. 2B).

      6) It would be helpful to provide a statistical analysis of the similarities and differences shown in the graphs in Figures 1E and G.

      We added a statistical analysis (Fisher's exact test, two-sided) of Fig. 1E comparing expression of each examined gene in the two scRNA-seq datasets (Table S4). We added a similar statistical analysis of Fig. 1G comparing the expression values of each examined gene by scRNA-seq vs smFISH (see Fig. 1G legend).
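
      For readers unfamiliar with the test, a two-sided Fisher's exact test on a 2x2 table of expressing vs. non-expressing cells in two datasets can be sketched as below. This is a minimal pure-Python illustration rather than the analysis code behind Table S4; the function name and the cell counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are datasets, columns are (gene detected, gene not detected).
    The p-value sums the hypergeometric probabilities of every table with
    the same margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability that x of the col1 "detected" cells fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Hypothetical counts: gene detected in 40 of 100 cells in one dataset
# and in 35 of 90 cells in the other.
p = fisher_exact_two_sided(40, 60, 35, 55)
print(f"two-sided p = {p:.3f}")
```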

    1. Author Response

      Reviewer #1 (Public Review):

      This paper tests the hypothesis that 1/f exponent of LFP power spectrum reflects E-I balance in a rodent model and Parkinson's patients. The authors suggest that their findings fit with this hypothesis, but there are concerns about confirmation bias (elaborated on below) and potential methodological issues, despite the strength of incorporating data from both animal model and neurological patients.

      First, the frequency band used to fit the 1/f exponent varies between experiments and analyses, inviting concerns about potentially cherry-picking the data to fit with the prior hypothesis. The frequency band used for fitting the exponent was 30-100 Hz in Experiment 1 (rodent model), 40-90 Hz in Experiment 2 (PD, levodopa), and 10-50 Hz in Experiment 3 (PD, DBS). Ad hoc reasons were given to justify these choices, such as "to avoid a spectral plateau starting > 50 Hz" in Experiment 3. However, at least in Experiment 3 (Fig. 3), if the frequency range was shifted to 1-10 Hz, the authors would have uncovered the opposite effect, where the exponent is smaller for the DBS-on condition.

      We agree that parameter choice is crucial, in particular, choice of the fitting range. In addition to the 40-90 Hz range (Figure 2C), we have performed aperiodic fitting for five other frequency ranges to test to what extent the reported results are sensitive to the selected frequency range (Figure S2A). This analysis showed that the results are robust when a broad frequency range from 30 to 95 Hz was chosen, which is consistent with what has been suggested by Gao et al., 2017 to make inferences on the E/I ratio.

      Accordingly, we have now repeated the analyses for the animal data with the same fitting range used for the ON-OFF medication comparison in humans. Along with Figure S2A where different frequency ranges were tested for data used in Figure 2, this shows that the results in Figure 1 and 2 hold up with higher aperiodic exponents when STN spiking is low and vice versa. Therefore, a broad fitting range from 30 to 90 Hz (excluding harmonics of mains interference) generates consistent results for both human and animal data.

      We opted against a fitting range from 1-10 Hz because of two constraints highlighted in Gerster et al., 2022. First, a fitting range starting at 1 Hz could have a larger y-intercept due to the presence of low-frequency oscillations. This could lead to a larger aperiodic exponent and could be misinterpreted as stronger neural inhibition. Therefore, the lower fitting bound should be chosen to best avoid known oscillations in the delta/theta range (Gerster et al., 2022). Second, frequencies should be chosen to avoid oscillations crossing fitting range limits. In Figure 3A, oscillations in the theta/alpha band both ON and OFF stimulation would complicate parameterisation and would likely result in spurious fits.
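
      The first of these points can be illustrated with a synthetic spectrum. The sketch below is not the FOOOF algorithm: it assumes a plain least-squares line in log-log space with no peak removal, and the spectrum (a 1/f component with exponent 1.5 plus a delta-band peak at 2 Hz) is invented for illustration. The high-frequency fit recovers the true exponent, whereas the 1-10 Hz fit is steepened by the low-frequency peak.

```python
import math


def fit_fixed_aperiodic(freqs, log_power, f_min, f_max):
    """Least-squares fit of log10(P) = offset - exponent * log10(f) on [f_min, f_max]."""
    pts = [(math.log10(f), p) for f, p in zip(freqs, log_power) if f_min <= f <= f_max]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return offset, -slope  # the exponent is the negative slope


TRUE_EXPONENT = 1.5  # assumed ground truth for the synthetic spectrum
freqs = [0.5 * i for i in range(2, 201)]  # 1.0 to 100.0 Hz in 0.5 Hz steps


def synthetic_log_power(f):
    aperiodic = 2.0 - TRUE_EXPONENT * math.log10(f)
    # Gaussian peak in log power: centre 2 Hz, height 0.5, s.d. 0.5 Hz.
    delta_peak = 0.5 * math.exp(-((f - 2.0) ** 2) / (2 * 0.5 ** 2))
    return aperiodic + delta_peak


log_power = [synthetic_log_power(f) for f in freqs]

_, exponent_low = fit_fixed_aperiodic(freqs, log_power, 1.0, 10.0)
_, exponent_high = fit_fixed_aperiodic(freqs, log_power, 40.0, 90.0)
print(f"1-10 Hz fit:  exponent = {exponent_low:.2f}")
print(f"40-90 Hz fit: exponent = {exponent_high:.2f}")
```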

      We also tested the effect of changing the peak threshold, peak width limits and the aperiodic fitting mode on FOOOF parameterisation. Increasing and decreasing the peak threshold from its default value (at 2 standard deviations) did not change results (Figure S2B). Similarly, adapting the peak width limits did not affect the exponent difference between medication states (Figure S2C). Finally, choosing the ‘knee’ mode instead of ‘fixed’ resulted in fundamentally different aperiodic fits that did not differ anymore with medication (Figure S2D). This is most likely a consequence of the near linear PSD in log-log space from 40 to 90 Hz (Figure 2B). If there is no bend in the PSD, the FOOOF algorithm will be forced to assign a ‘random’ knee and the aperiodic fit will then mostly reflect the slope of the spectrum above the knee point.
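
      For reference, the two aperiodic modes compared here differ only in the knee parameter. The forms below follow the parameterisation described by Donoghue et al., 2020; the symbol names are ours.

```latex
% L(f): log10 power of the aperiodic fit at frequency f
% b: offset, \chi: aperiodic exponent, k: knee parameter
\begin{align}
  L_{\text{knee}}(f)  &= b - \log_{10}\!\left(k + f^{\chi}\right) \\
  L_{\text{fixed}}(f) &= b - \chi \log_{10} f \qquad (k = 0) \\
  f_{\text{knee}}     &= k^{1/\chi}
\end{align}
```

      When the spectrum is close to a straight line in log-log space over the fitted range, the data do not constrain k, which is consistent with the unstable knee fits described above.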

      Second, there are important, fine-grained features in the spectra that are ignored in the analyses, which confounds the interpretation.

      One salient example of this is Fig. 2, where based on the plots in B, one would expect that the power of beta-band oscillations to be higher in the Med-On condition, as the oscillatory peaks rise higher above the 1/f floor and reach the same amplitude level as the Med-OFF condition (in other words, similar total power is subtracted by a smaller 1/f power in the Med-ON condition). But this impression is opposite to the model-fitting results in C, where beta power is lower in the Med-ON condition.

      We agree that PSDs over a broad frequency range (e.g. 5-90 Hz) typically do not have a single 1/f property. Instead, there can be multiple oscillatory peaks and ‘knees/bends’ in the aperiodic component. For these cases, fitting should be performed using the knee mode. To extract periodic beta power, we parameterise the PSD between 5 and 90 Hz and select the largest oscillatory component between 8 and 35 Hz (this range was extended to include the large oscillatory peaks in hemispheres 27 and 28 at ~ 10 Hz, see Figure R1). We now use the knee mode, to model the aperiodic component between 5 and 90 Hz when periodic beta power is calculated (see our previous comments). Figure R1 provides an overview of all PSDs ON and OFF medication, the aperiodic fits (5-90 Hz (knee) and 40-90 Hz (fixed)) and the detected beta peaks. In spite of this modification in our pipeline, periodic beta power is still larger OFF medication (Figure 2C), in keeping with previous studies (Kim et al., 2022; Kühn et al., 2006; Neumann et al., 2017; Ray et al., 2008). We acknowledge the reviewer’s point that the average spectra in Figure 2B are misleading in that respect and for clarity provide here all 30 spectra in both conditions. Note that the calculation of aperiodic exponents between 40 and 90 Hz is not affected by this change in our pipeline. Figures 2B, D+E were revised accordingly.
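
      The peak-selection step described above can be sketched as follows. This is a minimal illustration, assuming each fitted peak is available as a (centre frequency, power above the aperiodic fit, bandwidth) tuple; the function name and the peak values are hypothetical.

```python
def largest_peak_in_band(peaks, f_low=8.0, f_high=35.0):
    """Return the peak with the highest power whose centre frequency lies in [f_low, f_high].

    peaks: iterable of (centre_frequency_hz, power, bandwidth_hz) tuples
    Returns None if no peak falls inside the band.
    """
    in_band = [peak for peak in peaks if f_low <= peak[0] <= f_high]
    if not in_band:
        return None
    return max(in_band, key=lambda peak: peak[1])


# Hypothetical peaks from a spectrum parameterised between 5 and 90 Hz.
fitted_peaks = [
    (6.0, 0.9, 2.0),   # theta peak, outside the 8-35 Hz band
    (10.2, 0.7, 3.0),  # low-frequency peak inside the extended band
    (21.5, 0.4, 5.0),  # beta peak
    (70.0, 0.3, 8.0),  # gamma peak, outside the band
]

beta_peak = largest_peak_in_band(fitted_peaks)
print(beta_peak)
```

      Extending the lower bound to 8 Hz is what allows a large peak near 10 Hz to be selected instead of a smaller peak at classical beta frequencies.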

      We have repeated the analysis of our animal data using the ‘knee mode’ with a fitting range from 30 to 100 Hz. However, using the knee mode did not improve the goodness of fit or fitting error and, in fact, made them slightly worse (Figure S5). Based on this, we think the fixed mode would provide a more holistic model for the PSDs used in this analysis. We have now added this comparison in Figure S5 to justify the choice of the fixed mode.

      Figure R1. PSDs from all 30 hemispheres ON and OFF medication. Aperiodic fits are shown between 5-90 Hz (knee mode), which was used to calculate the power of beta peaks, and between 40-90 Hz (fixed mode), which was used to estimate the aperiodic exponent of the spectrum.

      Another example is Fig. 1C, where the spectra for high and low STN spiking epochs are identical between 10 and 20 Hz, and the difference in higher frequency range could be well-explained by an overall increase of broadband gamma power (e.g. as observed in Manning et al., J Neurosci 2012, Ray & Maunsell PLoS Biol 2011). This increase of broadband gamma power is trivially expected, as broadband gamma power is tightly coupled with population spiking rate, which was used to define the two conditions.

      We agree with the reviewer that in Figure 1C, high and low STN spiking states could well be separated by average gamma power (Figure 1E), too. However, the difference of aperiodic exponents is more prominent between both conditions (Figure 1D+E, based on p-values). What is more, in human LFP data recorded from clinical macroelectrodes, medication states can be reasonably well distinguished using the aperiodic exponent between 40-90 Hz (Figure 2C), but average gamma power does not separate both states (Figure S3A). This suggests that the aperiodic exponent reflects more than just power differences in the high gamma regions. In addition, power changes do not inevitably change the aperiodic exponent and vice versa as elaborated in (Donoghue et al., 2020).
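
      The statement that power changes need not change the exponent can be checked on a toy spectrum. The sketch below assumes an idealised 1/f spectrum between 40 and 90 Hz with invented values: doubling power at every frequency doubles the mean band power but leaves the log-log slope untouched, because a uniform scaling only moves the offset.

```python
import math


def loglog_slope(freqs, power):
    """Least-squares slope of log10(power) against log10(frequency)."""
    xs = [math.log10(f) for f in freqs]
    ys = [math.log10(p) for p in power]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


freqs = list(range(40, 91))                      # 40-90 Hz in 1 Hz steps
baseline = [100.0 / f ** 1.5 for f in freqs]     # idealised spectrum, exponent 1.5
shifted = [2.0 * p for p in baseline]            # uniform broadband power increase

exponent_baseline = -loglog_slope(freqs, baseline)
exponent_shifted = -loglog_slope(freqs, shifted)
mean_power_baseline = sum(baseline) / len(baseline)
mean_power_shifted = sum(shifted) / len(shifted)

print(f"exponent: {exponent_baseline:.3f} vs {exponent_shifted:.3f}")
print(f"mean power ratio: {mean_power_shifted / mean_power_baseline:.2f}")
```

      A power increase confined to part of the fitted range would change the slope, so this toy case covers only the uniform shift.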

      Manning et al., 2009 show that the power spectrum is shifted to higher power values at all observed frequencies (2-150 Hz) as firing rates increase. As the reviewer points out, power spectra of our data are almost identical between 10-20 Hz (despite the marked spiking differences) and only drift apart from > 20 Hz (Figure 1C). This is a relevant difference between our study and Manning et al., 2009 and suggests that power differences in the gamma range are not solely explained by differences in spiking. This is confirmed when cortical activity at different spikes/sec is modelled (Miller et al., 2009). The entire spectrum is shifted to higher power values if spiking rates increase.

      Ray & Maunsell, 2011 reported low (30-80 Hz) and high (> 80 Hz) gamma activity in the macaque visual cortex, with a positive correlation between spiking activity and high gamma activity. However, activity in the low gamma range (30-80 Hz), which largely overlaps with the frequency range in our study, does not necessarily correlate with firing rates.

      In conclusion, the link between gamma power and spiking activity is not as strong as alluded. Even if the change in spiking activities can lead to changes of both gamma power and the aperiodic exponent, the aperiodic exponent would still constitute a measure to separate E/I levels and medication states.

      The above consideration also speaks to a major weakness of the general approach of considering the 1/f spectrum a monolithic spectrum that can be captured by a single exponent. As the authors' Fig. 1C shows, there are distinct frequency regions within the 1/f spectrum that have different slopes. Indeed, this tripartite shape of the 1/f spectrum, including a "knee" feature around 40-70 Hz which is well visible here, was described in multiple previous papers (Miller et al., PLoS Comput Biol 2009; He et al., Neuron 2010), and has been successfully modeled with a neural network model using biologically plausible mechanisms (Chaudhuri et al., Cereb Cortex, 2017). The neglect of these fine-grained features confounds the authors' model fitting, because an overall increase in the broadband gamma power - which can be explained straightforwardly by the change in population firing rates - can cause the exponent, fit over a larger spectral frequency region, to decrease. However, this is not due to the exponent actually changing, but to the overall increase of power in a specific sub-frequency-region of the broadband 1/f activity.

      We have now used the knee mode for aperiodic fits between 5 and 90 Hz when periodic beta power is calculated. We agree that this broad frequency range is unlikely to have a single 1/f component.

      We have also repeated the analysis of our animal data using the knee mode for aperiodic fits between 30 and 100 Hz (Figure S5). However, the goodness of fit barely changed. In fact, the R^2 and error became slightly worse. In addition, the knee parameter complicates interpretation of the aperiodic exponent and has to be considered along with the knee frequency. What is more, we do not see this bend around 40-70 Hz in all subjects. We show PSDs of representative LFP channels in Figure R2 and need to assert that the knee around 40-70 Hz is not a robust finding in our data set. Therefore, we chose the fixed mode for parameterisation within this frequency band.

      Please see our answer to the previous comment regarding the link between broad gamma power and changes in population firing rates.

      Figure R2. PSDs of representative LFP channels for each animal (data used in Figure 1C). The knee around 40-70 Hz is not a robust finding in all PSDs.

    1. Author Response

      Reviewer #1 (Public Review):

      Iyer et al. address the problem of how cells exposed to a graded but noisy morphogen concentration are able to infer their position reliably, in other words how the positional information of a realistic morphogen gradient is decoded through cell-autonomous ligand processing. The authors introduce a model of a ligand processing network involving multiple “branches” (receptor types) and “tiers” (compartments where ligand-bound receptors can be located). Receptor levels are allowed to vary with distance from the source independently of the morphogen concentration. All rates, except for the ligand binding and unbinding rates, are potentially under feedback control. The authors assume that the cells can infer their position from the output of the signalling network in an optimal way. The resulting parameter space is then explored to identify optimal “network architectures” and parameters, i.e. those that maximise the fidelity of the positional inference. The analysis shows how the presence of both specific and non-specific receptors, graded receptor expression and feedback loops can contribute to improving positional inference. These results are compared with known features of the Wnt signalling system in the Drosophila wing imaginal disc.

      The authors are doing an interesting study of how feedback control of the signalling network reading a morphogen gradient can influence the precision of the read-out. The main strength of this work is the attention to the development of the mathematical framework. While the family of network architectures introduced here is not completely generic, there is enough flexibility to explore various features of realistic signalling systems. It is exciting to find that some network topologies are particularly efficient at reducing the noise in the morphogen gradient. The comparison with the Wnt system in Drosophila is also promising.

      Major comments:

      1) The authors assume that the cell estimates its position through the maximum a posteriori estimate, Eq.(5), which is a well-defined mathematical object; it seems to us however that whether the cell is actually capable of performing this measurement is uncertain (it is an optimal measurement in some sense, but there is no guarantee that the cell is optimal in that respect). Notably, this entails evaluating p(theta), which is a probability distribution over the entire tissue, so this estimate can not be done with purely local measurements. Can the authors comment on this and how the conclusions would change if a different position measurement was performed?

      This is indeed an important question. Our approach is to ask: if the cells were to use a maximum a posteriori (MAP) estimate (Eq. 5) to decode their positions, what features of the channel architecture would lead to small errors in positional inference? Whether the cell employs the maximum a posteriori estimate or some other estimate is an important but difficult question to address. Our choice has been motivated by how this estimate has allowed the precise determination of developmental fates in the context of gap gene expression in the Drosophila embryo [1, 2, 3]. We had earlier computed the inference error with a different estimate, one which computes the mean squared deviation of the inferred positions from the true position for each x, taking into account the entire distribution p(x∗|x). While the qualitative results are the same, the inference errors showed spurious jitters from outliers in sampling the noisy morphogen input distribution. This consistency might suggest that our qualitative results are insensitive to the choice of the estimate.

      Further, when evaluating the MAP estimate, the term p(θ) in the denominator serves as a normalisation factor to ensure that p(x|θ) is a probability density. This is not strictly necessary for MAP estimation. Since p(θ) does not depend on x, the MAP estimate can be written as

      $$x^{*}_{\mathrm{MAP}}(\theta) = \arg\max_{x} \, p(\theta|x)\, p(x)$$

      without the need for evaluating p(θ). In the case of a uniform prior, it would be equivalent to the maximum likelihood estimate (MLE), i.e.

      $$x^{*}_{\mathrm{MLE}}(\theta) = \arg\max_{x} \, p(\theta|x).$$
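      The argument that the normalisation p(θ) does not affect the MAP estimate can be checked numerically. The sketch below is only an illustration and not the model of the manuscript: it assumes a one-dimensional tissue of unit length discretised into 51 positions, an exponentially decaying mean signal with Gaussian readout noise, and hypothetical values for the decay length and noise level.

```python
import math

DECAY = 0.2   # assumed decay length of the mean signal (tissue length = 1)
NOISE = 0.05  # assumed standard deviation of the readout noise

def likelihood(theta, x):
    """p(theta | x): Gaussian around an exponentially decaying mean (assumed form)."""
    mean = math.exp(-x / DECAY)
    z = (theta - mean) / NOISE
    return math.exp(-0.5 * z * z) / (NOISE * math.sqrt(2.0 * math.pi))

def map_estimate(theta, positions, prior, normalise=False):
    """Return argmax_x p(theta|x) p(x); optionally divide by p(theta) first."""
    scores = [likelihood(theta, x) * p for x, p in zip(positions, prior)]
    if normalise:
        evidence = sum(scores)  # discrete stand-in for p(theta)
        scores = [s / evidence for s in scores]
    best = max(range(len(positions)), key=lambda i: scores[i])
    return positions[best]

def mle_estimate(theta, positions):
    """Return argmax_x p(theta|x), i.e. the MAP estimate under a uniform prior."""
    best = max(range(len(positions)),
               key=lambda i: likelihood(theta, positions[i]))
    return positions[best]

positions = [i / 50 for i in range(51)]
uniform_prior = [1.0 / len(positions)] * len(positions)
theta_obs = math.exp(-0.3 / DECAY)  # noiseless readout at x = 0.3

print(map_estimate(theta_obs, positions, uniform_prior))
print(map_estimate(theta_obs, positions, uniform_prior, normalise=True))
print(mle_estimate(theta_obs, positions))
```

      Dividing every score by the same positive constant leaves the location of the maximum unchanged, so the three estimates coincide in this example.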

      2) One of the features of the signalling networks studied in the manuscript is the ability of the system to form a complex (termed a conjugated state, Q) made of two ligands L, one receptor and one non-signalling receptor. While there are clear examples of a single ligand binding to two signalling receptors (e.g. Bmps), are there also known situations where such a complex with two ligands, one receptor, and one non-signalling receptor can form? In the Wnt example (Fig. 10a), it is not clear what this complex would be. In general, it would be great to have a more extended discussion of how the model hypothesis for the signalling networks could relate to real systems.

      This is a good suggestion. We have now added a discussion on the various possible realisations of the “conjugate state” Q in Section 3.6. We have also discussed the various states in different signalling contexts, such as Dpp, Hh and Fgf, in the Discussion section.

      The conjugated state ‘Q’ represents a combination of the readings from the two branches, i.e. receptor types. This could be realised through processes like ligand exchange or complex formation, both in a shared spatial location such as a compartment. As discussed in the original manuscript (Section 3.6 of the revised manuscript), the ligand Wg in the Wg signalling pathway is internalised through two separate endocytic pathways associated with the receptor types: the signalling receptor Frizzled (via clathrin-mediated endocytosis, CME) and the non-signalling receptor HSPGs (via the CLIC/GEEC pathway; CLIC: clathrin-independent carriers, GEEC: GPI-anchored protein-enriched early endosomal compartments). Both pathways meet in a common early endosomal compartment where the ligands may be exchanged between the two receptors [4]. In previous work by Hemalatha et al. [4], we showed that there are more Wg-DFz2 interactions in the endosomal compartment (measured through FRET) than on the cell surface. Therefore, the non-signalling receptors directing Wg through the CLIC/GEEC pathway titrate the amount of Wg interaction with the signalling receptor, DFz2.

      As mentioned in the original manuscript (Section 3.3 and subsection 4.2 of the Discussion in the revised manuscript), apart from Wg signalling, non-signalling receptors such as the HSPGs have also been proposed to act as co-receptors for Dpp, Hh, FGF (reviewed in [5, 6]). Although some ligands bind to the core protein of HSPG, the majority of the ligands bind to the negatively charged HS chains [7, 8]. Here, the coreceptors HSPGs aid in capturing diffusible ligands and presenting the same to signalling receptors (either on the cell surface or within endosomes).

      3) The authors consider feedback on reaction rates - it would seem natural to also consider feedback on the total number of receptors; notably, since there are known examples of receptors transcriptionally down-regulated by their ligands (e.g. Dpp/Tkv)? Also it is not clear in insets such as in Fig. 7b, if the concentration plotted corresponds to the concentration of receptors bound to ligands?

      As mentioned in the original manuscript (Section 2.2 of the revised manuscript), we have indeed considered control on reaction rates and receptors, although the control on the latter is done with the constraint of receptor profiles being monotonic. Further, while the control on reaction rates is considered via feedbacks explicitly, the control on receptors is done via an approach akin to the open-loop control used in control theory. In reality, cellular control on receptors will involve transcriptional up- or down-regulation of receptors and thus warrants a feedback control approach; however, the timescales involved in such a control are different from the binding-unbinding and signalling timescales.

      Therefore, in the current work, we take the morphogen profile to be given i.e. independent of receptor concentrations, and we ask for the receptor concentrations that would help reduce the inference errors.

      Our predictions of increasing signalling receptor and decreasing non-signalling receptors in a two-branch channel architecture are consistent with the known transcriptional up-regulation of Dally/Dlp and down-regulation of Fz by Wg signalling [9].

      In a future work, we will extend the control on receptors to include feedbacks explicitly. Furthermore, the explicit feedback control on receptors may need to be considered concomitantly with the effect of receptors on morphogen dynamics (i.e. morphogen sculpting by receptors) along with the possibility of spatial correlations in receptor concentrations through neighbouring cell-cell interactions.

      As mentioned in the original manuscript (Section 2.2 of the revised manuscript), the variables ψ and φ stand for the total (bound + unbound) surface receptor concentrations of the signalling and the non-signalling receptors respectively. Therefore, the insets showing receptor profiles such as in Fig. 6b, 7b, and Appendix H Fig.8b,e correspond to the total surface receptor concentrations.

      4) The authors are clear about the fact that they consider the morphogen gradient to be fixed independently of the reaction network; however, that seems like a very strong assumption; in the Dpp morphogen gradient for instance over expression of the Tkv receptor leads to gradient shortening. Can the authors comment on this?

      This point is related to the earlier question 3. As discussed in the Discussion of the original manuscript (subsection 4.3 of the revised manuscript), we focus on finding the optimal receptor concentration profiles and reaction networks that enable precision and robustness in positional information from a given noisy morphogen profile. The framework and the optimisation scheme within it will prescribe different receptor profiles and reaction networks for different monotonically behaving, noisy morphogen profiles. It is possible that cells may achieve the optimal receptor concentrations via feedback control on production of the receptors.

      Broadly, morphogen dynamics depends on cell surface receptors, which could participate in both the inference and the sculpting of the morphogen profile, and factors independent of them such as extracellular degradation, transport and production, etc. In our present work, we have taken the receptors involved in sculpting and inference as being independent.

      In a more general case, feedback control on receptors will change the receptor concentrations as well as the morphogen profile. We are currently working on realising such a feedback control on receptors within the same broader information theoretic framework proposed in the current work.

      5) Fig. 10f is showing an exciting result on the change in endocytic gradient CV in the WT and in DN mutant of Garz. Can the authors check that the Wg morphogen gradient is not changing in these two conditions? And can they also show the original gradient, and not only its CV?

      The reviewer raises a legitimate concern – could the observed changes in CV upon perturbation of endocytic machinery be attributed to a systematic change in the mean levels of the endocytosed Wg alone? In the original manuscript (Appendix O Fig.17b,c of the revised manuscript), we show the normalised profiles of endocytic Wg in control and myr-Garz-DN cases. Here, in Fig.1 below, we show a comparison between the mean Wg concentrations (measured as fluorescence intensity) in control wing discs and discs wherein CLIC/GEEC endocytic pathway is removed using UAS-myr-Garz-DN. For clarity, we show the discs with largest and smallest fluorescence intensities from the control and myr-Garz-DN discs. It is hard to conclude that the mean concentrations are significantly different in the two cases.
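      The distinction between a change in the mean level and a change in the CV can be made concrete with a small sketch. The numbers below are invented for illustration only and are not measurements from the wing discs; the function name and data layout (one intensity profile per disc, sampled at the same positions) are assumptions.

```python
import statistics

def mean_and_cv_profiles(discs):
    """Per-position mean and coefficient of variation (std / mean) across discs.

    `discs` is a list of intensity profiles, one list per disc, all sampled
    at the same positions.
    """
    means, cvs = [], []
    for values in zip(*discs):  # iterate over positions
        m = statistics.mean(values)
        s = statistics.stdev(values)
        means.append(m)
        cvs.append(s / m)
    return means, cvs

# Hypothetical fluorescence profiles (arbitrary units), three discs each.
control = [[10.0, 8.0, 5.0, 3.0],
           [12.0, 7.0, 6.0, 2.0],
           [11.0, 9.0, 4.0, 4.0]]
# Same profiles at twice the overall intensity: the mean changes, the CV does not.
brighter = [[2.0 * v for v in disc] for disc in control]

mean_c, cv_c = mean_and_cv_profiles(control)
mean_b, cv_b = mean_and_cv_profiles(brighter)
print(mean_c, cv_c)
print(mean_b, cv_b)
```

      Because the CV is unchanged by a uniform rescaling of intensity, the mean profiles and the CV profiles carry separate information and are best compared side by side.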

      Reviewer #2 (Public Review):

      The work of Iyer et al. uses a computational approach to investigate how cells using multiple tiers of processing and multiple parallel receptor types allow more accurate reading of position from a noisy signal. Authors find that combining signaling and non-signaling types of receptors together with additional feedback increases the accuracy of positional readout against extrinsic noise that is conveyed in the morphogen signal. Further, extending the number of layers of signal processing counteracts the intrinsic stochasticity of the signal reading and processing steps. The mathematical formulation of the model is general but comprehensive in the way it handles the difference between branches and tiers for the processing of channels with feedbacks. The results of the model are presented from simple one-branch and one-tier architecture to two-branch and two-tier architecture with feedbacks. Interestingly authors find that adding more tiers results in only very small improvements in the accuracy of positional readout. The model is tested against a perturbation experiment that impairs one of the signaling branches in the Drosophila wing disc, but the comparison is only qualitative as further experiment-oriented work is planned in a separate paper.

      Strengths

      There is a clear statement of objectives, model, and how the model is evaluated. In particular, the objective is to find what number of receptor types and their concentrations, for a given number of tiers and feedback types, results in the most accurate positional readout. The employed optimization procedure is capable of finding signalling architectures that result in one cell diameter positional precision for most of the tissue, with 3-4 cells at the tissue end that is most distant to the morphogen source. This demonstrates that employing additional complexity in signal processing results in a very accurate positional readout, which is comparable with estimates of positional precision obtained in other developmental systems (Petkova et al., Cell 2019, Zagorski et al., Science 2017).

      The optimal signalling architectures indicate that both signalling (specific) and non-signalling (nonspecific) receptors affect the precision of positional readout, but the contributions of each type of these receptors are qualitatively different. Even slight perturbation of signalling receptors drives the system out of optimum, resulting in a decrease in positional precision. In contrast, the non-signalling receptors could accommodate much larger perturbations. This observation could provide a biophysical explanation for how cross-talk between different morphogen species could be realized in a way that positional precision is kept at the optimum when morphogen signaling undergoes extrinsic and intrinsic perturbations.

      Last, the model formulation allows to specifically address perturbations of signalling and feedbacks, that could be explored to validate model predictions experimentally in Drosophila wing disc, but also in other developmental tissues. The authors present a proof-of-concept by obtaining consistent results of variation of output profiles in two-tier two-branch architectures with non-signaling branch removed and intensity profiles of Wg in wing disc where the CLIC/GEEC endocytic pathway was perturbed.

      Weaknesses

      The list of model parameters is long, including more than 20 entries for two-tier two-branch architectures. This is expected, as the aim of the model is to describe the sophisticated signalling architecture mimicking the biological system. However, this also makes it very challenging or impossible to provide guiding principles or understanding of the system behaviour for the complete space of signalling architectures that optimize positional readout. Although the employed optimization procedure finds solutions that exhibit very high positional accuracy, there is only a very limited notion of how these solutions depend on variation of different parameters. The authors do not address whether these solutions correspond to broad global optima in the space of all solutions, or were rather fine-tuned by the optimization procedure and are quite rare.

      It is unclear how contributions from the intrinsic noise affect the system behaviour compared to contributions from extrinsic noise. In principle, the two-branch one-tier architecture results in an already very accurate positional readout across the tissue. The adding of another tier seems to provide only a very weak improvement over a one-tier solution. It is possible that contributions from intrinsic noise for the investigated signalling architectures are only mildly affecting the system compared with contributions from extrinsic noise. Hence, it is difficult to assess whether the claim of reducing intrinsic noise by adding another tier is supported by the presented data, as the contributions from intrinsic noise could overall very weakly affect the positional readout.

      The optimal response of the channel to extrinsic and intrinsic noise is very distinct. As noted correctly by the reviewer, an additional tier provides only a marginal improvement in inference error due to extrinsic noise (compare Fig.7 and Fig.8 in the revised manuscript). However, as shown in Fig.9c of the revised manuscript (same as in the original manuscript), adding an extra tier provides a substantial improvement in inference errors due to intrinsic noise.

      References

      [1] Gasper Tkacik, Julien O Dubuis, Mariela D Petkova, and Thomas Gregor. Positional information, positional error, and readout precision in morphogenesis: a mathematical framework. Genetics, 199:39–59, 2015.

      [2] Mariela D Petkova, Gasper Tkacik, William Bialek, Eric F Wieschaus, and Thomas Gregor. Optimal decoding of cellular identities in a genetic network. Cell, 176:844–855, 2019.

      [3] Julien O Dubuis, Gašper Tkačik, Eric F Wieschaus, Thomas Gregor, and William Bialek. Positional information, in bits. Proceedings of the National Academy of Sciences, 110:16301–16308, 2013.

      [4] Anupama Hemalatha, Chaitra Prabhakara, and Satyajit Mayor. Endocytosis of wingless via a dynamin-independent pathway is necessary for signaling in drosophila wing discs. Proceedings of the National Academy of Sciences, 113:E6993–E7002, 2016.

      [5] Xinhua Lin. Functions of heparan sulfate proteoglycans in cell signaling during development. Development, 131:6009–6021, 2004.

      [6] Stephane Sarrazin, William C Lamanna, and Jeffrey D Esko. Heparan sulfate proteoglycans. Cold Spring Harbor perspectives in biology, 3(7):a004952, 2011.

      [7] Catherine A Kirkpatrick, Sarah M Knox, William D Staatz, Bethany Fox, Daniel M Lercher, and Scott B Selleck. The function of a drosophila glypican does not depend entirely on heparan sulfate modification. Developmental biology, 300(2):570–582, 2006.

      [8] Mariana I Capurro, Ping Xu, Wen Shi, Fuchuan Li, Angela Jia, and Jorge Filmus. Glypican-3 inhibits hedgehog signaling during development by competing with patched for hedgehog binding. Developmental cell, 14(5):700–711, 2008.

      [9] Kenneth M Cadigan, Matthew P Fish, Eric J Rulifson, and Roel Nusse. Wingless repression of drosophila frizzled 2 expression shapes the wingless morphogen gradient in the wing. Cell, 93(5):767–777, 1998.

    1. Author Response

      Reviewer #1 (Public Review):

      Strength: The study is summarizing a large cohort of human samples of blood, nasal swabs and nasopharyngeal aspirates. This is very uncommon as most of the time studies focus on the blood and serum of patients. Within the study, 3 monocyte and 3 DC subsets have been followed in healthy and Influenza A virus-infected persons. The study also includes functional data on the responsiveness of Influenza A virus-infected DC and monocyte populations. The authors achieved their aims in that they were able to show that the tissue microenvironment is important to understand subset specific migration and activation behavior in Influenza A virus infection and in addition that it matters with which kind of agent a person is infected. Thus, this study also impacts a better understanding of vaccine design for respiratory viruses.

      We thank Reviewer 1 for highlighting what we believe to be the greatest strengths of our study. The key feature of this study was to generate a comprehensive description of monocytes and dendritic cells (DC) in the human nasopharynx during influenza A virus infection, and to provide a comparison with healthy and convalescent individuals. Further, we wished to emphasize the value of studying the nasopharynx during respiratory viral infections, particularly in light of the ongoing COVID-19 pandemic. We describe a non-invasive method to (longitudinally) sample this anatomical compartment that allows retrieval of intact immune cells as well as mucosal fluid for soluble marker analysis. We also believe that the addition of proteomic profiles in the different compartments (new Figure 7) further highlights the importance of the tissue microenvironment.

      Weakness: In the described study, the authors used a different nomenclature to introduce the DC subsets. This is confusing and the authors should stick to the nomenclature introduced by Guilliams et al., 2014 (doi.org/10.1038/nri3712) and commented in Ginhoux et al., 2022 (DOI: 10.1038/s41577-022-00675-7 ) or at least should introduce the alternative names (cDC1, cDC2, expression markers XCR1, CD172a/Sirpa). Further, Segura et al., 2013 (doi: 10.1084/jem.20121103) showed that all three DC subpopulations were able to perform cross-presentation when directly isolated. Overall, a more up-to-date introduction would be useful.

      Reviewer 1 commented on the DC nomenclature used in the manuscript. We agree that our manuscript would benefit from appropriately updating the DC nomenclature. We therefore revised the text, and now we refer to the subsets previously described as CD1c+ and CD141+ myeloid DCs (MDC) as cDC2 and cDC1 subsets, respectively. We have also modified the text in the Introduction of the revised manuscript to reflect the same and give a more up-to-date introduction of DC subsets (marked-up version lines 75-81).

      As the data of this was already obtained in 2016-2018 it is clear that the FACS panel was not developed to study DC3. If possible, the authors might be able to speculate about the role of this subset in their data set. Moreover, there were other studies on SARS-CoV-2 infection and DC subset analyses in blood (line 87, and line 489) e.g. Winheim et al., (DOI: 10.1371/journal.ppat.1009742 ), which the authors should introduce and discuss in regard to their own data.

      As reviewer 1 accurately pointed out, the flow cytometry panel used in this study was indeed not developed to study the DC3 subset. The data were obtained in 2016-2018, and lack the typical markers used to identify the DC3 subset, such as CD163, BTLA and CD5 (Cytlak et al, https://doi.org/10.1016/j.immuni.2020.07.003, Villani et al, https://doi.org/10.1126/science.aah4573). Due to the constraints of the panel, we would not be able to accurately identify DC3s. However, in an attempt to dig deeper into the data that is available, we re-analyzed the data to identify CD14+CD1c+ cells among the lineage–HLADR+CD16–CD14+ cells, here collectively called “mo-DC”. This population is likely a combination of monocytes upregulating CD1c and bona fide DC3 expressing CD14. Accordingly, the gating strategy was updated in Supplementary figure 1 (marked-up version lines 192-194), and a new data plot in Figure 2H (marked-up version lines 208-220) summarizes the changes observed in mo-DC numbers in IAV patients between blood and the nasopharynx. Parallel to the pattern seen in other DC subsets, mo-DC frequencies are reduced in blood and we observed an increase (not significant) in the nasopharynx.

      As CD88 was not included in the original panel, it was not possible to discriminate between bona fide monocytes and DC3s. We performed a staining of PBMCs (buffy coat) with CD88 (FITC) added to the original flow panel used in the study, to assess if CD88 can be helpful for future studies (Reviewer figure 1). The staining showed that some cells in the mo-DC population are CD88 positive, indicating a bona fide monocyte origin, whereas some are negative, indicating that they are bona fide DC3 expressing CD14 (Bourdely et al, https://doi.org/10.1016/j.immuni.2020.06.002).

      Reviewer figure 1. Expression of CD88 in the “mo-DC” population. Cells from a buffy coat were stained with the flow cytometry panel used in the manuscript, with the addition of CD88 (FITC). Within the CD14+CD1c+ population, the “mo-DC” population, we identified both CD88+ and CD88- cells.
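      The gating logic described above can be summarised as a sequence of boolean gates. The sketch below is a deliberate simplification and not the analysis pipeline used in the study: real gates are drawn on continuous fluorescence intensities, whereas here each marker is assumed to have already been called positive or negative, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    """Marker calls for one cell; True means positive for that marker."""
    lineage: bool
    hla_dr: bool
    cd16: bool
    cd14: bool
    cd1c: bool
    cd88: bool

def is_mo_dc(cell):
    """Lineage-negative, HLA-DR+, CD16-, CD14+ cells that also express CD1c."""
    return (not cell.lineage and cell.hla_dr and not cell.cd16
            and cell.cd14 and cell.cd1c)

def classify(cell):
    """Split the mo-DC gate by CD88, as in the buffy coat staining."""
    if not is_mo_dc(cell):
        return "outside mo-DC gate"
    return "monocyte-like (CD88+)" if cell.cd88 else "DC3-like (CD88-)"

# Three made-up cells to exercise the gates.
cells = [
    Cell(lineage=False, hla_dr=True, cd16=False, cd14=True, cd1c=True, cd88=True),
    Cell(lineage=False, hla_dr=True, cd16=False, cd14=True, cd1c=True, cd88=False),
    Cell(lineage=False, hla_dr=True, cd16=True, cd14=True, cd1c=False, cd88=True),
]
for cell in cells:
    print(classify(cell))
```

      Without the CD88 call, the first two cells fall into the same gate, which is the limitation of the original panel noted above.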

      Reviewer 1 also suggested citing Winheim et al (https://doi.org/10.1371/journal.ppat.1009742), and we thank them for their suggestion. We have now cited Winheim et al, and two additional reports (Kvedaraite et al, https://doi.org/10.1073/pnas.2018587118 and Affandi et al, https://doi.org/10.3389/fimmu.2021.697840) describing a depletion of DC3s (and other DC subsets) from circulation, and functional impairment of DCs following SARS-CoV-2 infection. Further, Winheim et al observed an increased frequency of a CD163+CD14+ subpopulation within the DC3s, which correlated with systemic inflammatory responses in SARS-CoV-2 infection. We speculate that perhaps in IAV infection too, DC3s may follow the trend of other DC subsets and be found in increased numbers in the nasopharynx (marked-up version lines 75-81 and 543-552).

      Taken together, although the data are very important and very interesting, my overall impression of the manuscript is that in the era of RNA seq and scRNA seq analyses the study lacks a bit of comprehensiveness.

      The final comment from reviewer 1 is well taken, in that our study does not include RNA-seq analyses. Again, we ask Reviewer 1 to take into consideration the challenging material we worked with in our study in combination with the COVID-19 pandemic that subsequently has excluded recruitment of new influenza patients to the study. The cell numbers and viability in the nasopharyngeal aspirates limit what experimental approaches can be done simultaneously, and flow cytometry seemed to be the best approach for the study. However, we agree that future studies, both our own and those of others in the field, will greatly benefit from single cell analysis of nasopharyngeal immune cells, and from generating transcriptomic or epigenetic profiles of these cells. Unfortunately, it is a limitation that we are currently unable to overcome within the scope of this revision. Despite this weakness, we agree with Reviewer 1 that the methods we developed and the data we generated are important and interesting.

      Moreover, we have added proteomics data from both NPA and plasma from influenza and COVID-19 patients, using the SomaScan platform (new Figure 7) (marked-up version lines 472-511, 738-755 and 768-792). We also included a supplementary table listing enriched pathway data from gProfiler. Briefly, our data showed sizeable changes within the blood and nasopharyngeal proteome during respiratory virus infection (IAV or SARS-CoV-2), as compared to healthy controls. Importantly, we found several differentially expressed proteins unique to the nasopharynx that were not seen in blood, and pathway analysis highlighted “host immune responses” and “innate immunity” pathways, containing TNF, IL-6, ISG15, IL-18R, CCL7, CXCL10 (IP-10), CXCL11, GZMB, SEMA4A, S100A8, S100A9. These findings are in line with our flow cytometry data, and support our hypothesis that the immunological response to viral infection in the upper airways differs from that in matching plasma samples. One of the main messages in this manuscript is the importance of looking at the site of infection, and not only at systemic immune responses, to better understand respiratory viral infections in humans. We believe that the addition of the proteomics data serves to further highlight this point.

      Reviewer #2 (Public Review):

      This study aims to describe the distribution and functional status of monocytes and dendritic cells in the blood and nasopharyngeal aspirate (NPA) after respiratory viral infection in more than 50 patients affected by influenza A, B, RSV and SARS-CoV2. The authors use flow cytometry to define HLA-DR+ lineage negative cells, and within this gate, classical, intermediate and non-classical monocytes and CD1c+, CD141+, and CD123+ dendritic cells (DC). They show a large increase in classical monocytes in NPA and an increase in intermediate monocytes in blood and NPA, with more subtle changes in non-classical monocytes. Changes in intermediate monocytes were age-dependent and resolution was seen with convalescence. While blood monocytes tended to increase in blood and NPA, DC frequency was reduced in blood but also increased in NPA. There were signs of maturation in monocytes and DC in NPA compared with blood as judged by expression of HLA-DR and CD86. Cytokine levels in NPA were increased in infection in association with enrichment of cytokine-producing cells. Various patterns were observed in different viral infections suggesting some specificity of pathogen response. The work did not fully document the diversity of human myeloid cells that have arisen from single-cell transcriptomics over the last 5 years, notably the classification of monocytes which shows only two distinct subsets (intermediate cannot be distinguished from classical), distinct populations of DC1, DC2 and DC3 (DC2 and 3 both having CD1c, but different levels of monocyte antigens), and the lack of distinction provided by CD123 which also includes a precursor population of AXL+SIGLEC6+ myeloid cells in addition to plasmacytoid DC. Furthermore, some greater precision of the gating could have been achieved for the subsets presented. Specifically, CD34+ cells were not excluded from the HLA-DR+ lineage- gate, and the threshold of CD11c may have excluded some DC1 owing to the low expression of this antigen. Overall, the work shows that interesting results can be obtained by comparing myeloid populations of blood and NPA during viral infection and that lineage, viral and age-specific patterns are observed. However, the mechanistic insights for host defense provided by these observations remain relatively modest.

      We thank Reviewer 2 for their assessment of our manuscript and for summarizing our key findings in their public review. As reviewer 2 noted, our study describes changes in frequencies of monocytes and DCs during acute IAV infection, in blood and in the nasopharynx. Additionally, we also demonstrate pathogen-specific changes in both compartments. Reviewer 2 also highlighted a drawback of our study: the approach did not fully capture the breadth of monocyte and DC diversity as it currently stands. Despite this, the findings we presented here laid the groundwork for continued research and led to significant progress, including mechanistic insights (Falck-Jones et al, https://doi.org/10.1172/JCI144734 and Cagigi et al, https://doi.org/10.1172/jci.insight.151463, Havervall et al. https://doi.org/10.1056/nejmc2209651 and Marking et al. Lancet Infectious Diseases in press), in understanding the role of myeloid cells in the human airways during viral infections.

    1. Author Response

      We thank the reviewers for their positive feedback and thoughtful suggestions that will improve our manuscript. Here we summarise our plan for immediate action. We will resubmit our manuscript once additional experiments have been performed to clarify all the major and minor concerns of the reviewers and the manuscript has been revised. At that point, we will respond to all of the reviewers’ points and highlight the changes made in the text.

      Reviewer #1 (Public Review):

      The authors have tried to correlate changes in the cellular environment by means of altering temperature, the expression of key cellular factors involved in the viral replication cycle, and small molecules known to affect key viral protein-protein interactions with some physical properties of the liquid condensates of viral origin. The ideas and experiments are extremely interesting as they provide a framework to study viral replication and assembly from a thermodynamic point of view in live cells.

      The major strengths of this article are the extremely thoughtful and detailed experimental approach; although this data collection and analysis are most likely extremely time-consuming, the techniques used here are so simple that the main goal and idea of the article become elegant. A second major strength is that in order to understand some of the physicochemical properties of the viral liquid inclusion, they used stimuli that have been very well studied, and thus one can really focus on a relatively easy interpretation of most of the data presented here.

      There are three major weaknesses in this article. The way it is written, especially at the beginning, is extremely confusing. First, I would suggest authors should check and review extensively for improvements to the use of English. In particular, the abstract and introduction are extremely hard to understand. Second, in the abstract and introduction, the authors use terms such as "hardening", "perturbing the type/strength of interactions", "stabilization", and "material properties", to cite just a few. It is clear that the authors do know exactly what they are referring to, but the definitions come so late in the text that it all becomes confusing. The second major weakness is that there is a lack of deep discussion of the physical meaning of some of the measured parameters like "C dense vs inclusion", and "nuclear density and supersaturation". There is a need to explain further the physical consequences of all the graphs. Most of them are discussed in a very superficial manner. The third major weakness is a lack of analysis of phase separations. Some of their data suggest phase transition and/or phase separation, thus, a more in-depth analysis is required. For example, could they calculate the change of entropy and enthalpy of some of these processes? Could they find some boundaries for these transitions between the "hard" (whatever that means) and the liquid?

      The authors have achieved almost all their goals, with the caveat of the third weakness I mentioned before. Their work presented in this article is of significant interest and can become extremely important if a more detailed analysis of the thermodynamics parameters is assessed and a better description of the physical phenomenon is provided.

      We thank reviewer 1 for the comments and, in particular, for being so positive regarding the strengths of our manuscript and for raising concerns that will surely improve the manuscript. At this point, we propose the following actions to address the concerns of Reviewer 1:

      1) We will extensively revise the use of English, particularly, in the abstract and introduction, defining key terms as they come along in the text to make the argument clearer.

      2) We acknowledge the importance of discussing our data in more detail and we propose the following. We will discuss the graphs and what they mean as exemplified in the paragraph below.

      Regarding Figure 3 - As the concentration of vRNPs increases, we observe an increase in supersaturation until 12 hpi. This means that, contrary to what is observed in a binary mixture, in which Cdilute is constant (Klosin et al., 2020), the Cdilute in our system increases with concentration. It has been reported that Cdilute increases with bulk concentration in a multi-component system (Riback et al., 2020). Our findings have important implications for how we think about the condensates formed during influenza infection. As the 8 different genomic vRNPs have a similar overall structure, they could, in theory, behave as a binary system between units of vRNPs and Rab11a. However, a change in Cdilute with concentration shows that our system behaves as a multi-component system. This means that the differences in length, RNA sequence and valency that each vRNP has are key for the integrity of condensates.
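
      The binary versus multi-component distinction above can be illustrated with a minimal sketch. It fits a straight line to Cdilute against bulk concentration and checks whether the slope is distinguishable from zero. This is not the analysis used in the manuscript: the function names, the tolerance `tol`, and all numbers are hypothetical and chosen only to show the logic.

```python
# Illustrative sketch only: values below are invented, not measured data.

def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def classify_system(bulk, c_dilute, tol=0.05):
    """Label the trend of Cdilute against bulk concentration.

    A flat trend (|slope| <= tol) is what a binary mixture would show;
    anything else is treated as multi-component-like behaviour.
    `tol` is an arbitrary, hypothetical threshold.
    """
    slope = ols_slope(bulk, c_dilute)
    return "binary-like" if abs(slope) <= tol else "multi-component-like"


# Hypothetical bulk concentrations (arbitrary units) and two Cdilute trends.
bulk = [1.0, 2.0, 3.0, 4.0]
flat_c_dilute = [0.5, 0.5, 0.5, 0.5]      # Cdilute stays constant
rising_c_dilute = [0.5, 0.8, 1.1, 1.4]    # Cdilute rises with bulk

print(classify_system(bulk, flat_c_dilute))
print(classify_system(bulk, rising_c_dilute))
```

      With real data, a confidence interval on the slope would be a more defensible criterion than a fixed tolerance.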

      3) The reviewer calls our attention to the lack of analysis of phase separations. We think that phase separation (or percolation coupled to phase separation) governs the formation of influenza A virus condensates. However, we think we ought to exert caution at this point because the condensates we are working with are very complex, and the physics of our system in cells may not be sufficient to claim phase separation without an in vitro reconstitution system. In fact, IAV inclusions contain cellular membranes, different vRNPs and Rab11a. So far, we can only speculate that the liquid character of IAV inclusions may arise from a network of interacting vRNPs that bridge several cognate vRNP-Rab11 units on flexible membranes, similarly to what happens in phase-separated vesicles in neurological synapses. However, this speculative model for our system, although supported by correlative light and electron microscopy, currently lacks formal experimental validation.

      For this reason, we thought of developing the current work as an alternative to explore the importance of the liquid material properties of IAV inclusions. By finding an efficient method to alter the material properties of IAV inclusions, we provide proof of principle that it is possible to impose controlled phase transitions that reduce the dynamics of vRNPs in cells and negatively impact progeny virion production. Despite having discussed these issues in the limitations of the study, we will make our point clearer.

      We are currently establishing an in vitro reconstitution system to formally demonstrate, in an independent publication, that IAV inclusions are formed by phase separation. For this future work, we teamed up with Pablo Sartori, a theoretical physicist, to derive an in-depth analysis of the thermodynamics of the viral liquid condensates. Collectively, we think that cells have too many variables to derive meaningful physics parameters (such as entropy and enthalpy) or models, and that cell-based data need to be complemented by in vitro systems. For example, increasing the concentration inside a cell is not a simple endeavour as it relies on cellular pathways to deliver material to a specific place. At the same time, the 8 vRNPs, as mentioned above, have different size, valency and RNA sequence and can behave very differently in the formation of condensates and maintenance of their material properties. Ideally, they should be analysed individually or in selected combinations. For the future, we will combine data from in vitro reconstitution systems and cells to address this very important point raised by the reviewer.

      From the paper on the section Limitations of the study: “Understanding condensate biology in living cells is physiologically relevant but complex because the systems are heterotypic and away from equilibria. This is especially challenging for influenza A liquid inclusions that are formed by 8 different vRNP complexes, which although sharing the same structure, vary in length, valency, and RNA sequence. In addition, liquid inclusions result from an incompletely understood interactome where vRNPs engage in multiple and distinct intersegment interactions bridging cognate vRNP-Rab11 units on flexible membranes (Chou et al., 2013; Gavazzi et al., 2013; Haralampiev et al., 2020; Le Sage et al., 2020; Shafiuddin & Boon, 2019; Sugita, Sagara, Noda, & Kawaoka, 2013). At present, we lack an in vitro reconstitution system to understand the underlying mechanism governing demixing of vRNP-Rab11a-host membranes from the cytosol. This in vitro system would be useful to explore how the different segments independently modulate the material properties of inclusions, explore if condensates are sites of IAV genome assembly, determine thermodynamic values, thresholds accurately, perform rheological measurements for viscosity and elasticity and validate our findings”.

      Reviewer #2 (Public Review):

      During Influenza virus infection, newly synthesized viral ribonucleoproteins (vRNPs) form cytosolic condensates, postulated as viral genome assembly sites and having liquid properties. vRNP accumulation in liquid viral inclusions requires its association with the cellular protein Rab11a directly via the viral polymerase subunit PB2. Etibor et al. investigate and compare the contributions of entropy, concentration, and valency/strength/type of interactions, on the properties of the vRNP condensates. For this, they subjected infected cells to the following perturbations: temperature variation (4, 37, and 42 °C), the concentration of viral inclusion drivers (vRNPs and Rab11a), and the number or strength of interactions between vRNPs using nucleozin, a well-characterized vRNP sticker. Lowering the temperature (i.e. decreasing the entropic contribution) leads to a mild growth of condensates that does not significantly impact their stability. Altering the concentration of drivers of IAV inclusions impacts their size but not their material properties. The most spectacular effect on condensates was observed using nucleozin. The drug dramatically stabilizes vRNP inclusions acting as a condensate hardener. Using a mouse model of influenza infection, the authors provide evidence that the activity of nucleozin is retained in vivo. Finally, using a mass spectrometry approach, they show that the drug affects vRNP solubility in a Rab11a-dependent manner without altering the host proteome profile.

      The data are compelling and support the idea that drugs that affect the material properties of viral condensates could constitute a new family of antiviral molecules as already described for the respiratory syncytial virus (Risso Ballester et al. Nature. 2021).

      Nevertheless, there are some limitations in the study. Several of them are mentioned in a dedicated paragraph at the end of a discussion. This includes the heterogeneity of the system (vRNP of different sizes, interactions between viral and cellular partners far from being understood), which is far from equilibrium, and the absence of minimal in vitro systems that would be useful to further characterize the thermodynamic and the material properties of the condensates.

      We thank reviewer 2 for highlighting specific details that need improving and raising such interesting questions to validate our findings. We will address all the minor comments of Reviewer 2. To address the comments of Reviewer 2, we propose the actions described in blue below each point raised that is written in italics.

      1) The concentrations are mostly evaluated using antibodies. This may be correct for Cdilute. However, measurement of Cdense should be viewed with caution as the antibodies may have some difficulty accessing the inner of the condensates (as already shown in other systems), and this access may depend on some condensate properties (which may evolve along the infection). This might induce artifactual trends in some graphs (as seen in panel 2c), which could, in turn, affect the calculation of some thermodynamic parameters.

      The concern about using antibodies to calculate Cdense is valid. We will address this concern by validating our results using a fluorescently tagged virus that has mNeonGreen fused to the viral polymerase PA (PA-mNeonGreen PR8 virus). Like NP, PA is a component of vRNPs and labels viral inclusions, colocalising with Rab11 when vRNPs are in the cytosol, without the need for antibodies.

      This virus would be the best to evaluate inclusion thermodynamics, were it not an attenuated virus (Figure 1A below) with a delayed infection, as demonstrated by the reduced levels of viral proteins (Figure 1B below). Consistently, it shows differences in the accumulation of vRNPs in the cytosol, and viral inclusions form later in infection. After their emergence, inclusions behave as in the wild-type virus (PR8-WT), fusing and dividing (Figure 1C below) and displaying liquid properties. The differences in concentration may shift or alter thermodynamic parameters such as time of nucleation, nucleation density, inclusion maturation rate, Cdense and Cdilute. This is the reason why we performed the thermodynamics profiling using antibodies upon PR8-WT infection. For validating our results, and taking into account possible delayed kinetics and differences that may occur because of reduced vRNP accumulation in the cytosol, this virus will be useful and therefore we will repeat the thermodynamics analyses using it.

      As a side note, vRNPs are composed of viral RNA coated with several molecules of NP and each vRNP also contains 1 copy of the trimeric RNA dependent RNA polymerase formed by PA, PB1 and PB2. It is well documented that in the cytosol the vast majority of PA (and other components of the polymerase) is in the form of vRNPs (Avilov, Moisy, Munier, et al., 2012; Avilov, Moisy, Naffakh, & Cusack, 2012; Bhagwat et al., 2020; Lakdawala et al., 2014), and thus we can use this virus to label vRNPs on condensates to corroborate our studies using antibodies.

      Figure 1 – The PA-mNeonGreen virus is attenuated in comparison to the WT virus. A. Cells (A549) were infected or mock-infected with PR8 WT or PA-mNeonGreen (PA-mNG) viruses, at a multiplicity of infection (MOI) of 3, for the indicated times. Viral production was determined by plaque assay and plotted as plaque forming units (PFU) per milliliter (mL) ± standard error of the mean (SEM). Data are a pool from 2 independent experiments. B. The levels of viral PA, NP and M2 proteins and actin in cell lysates at the indicated time points were determined by western blotting. C. Cells (A549) were transfected with a plasmid encoding mCherry-NP and co-infected with PA-mNeonGreen virus for 16 h, at an MOI of 10. Cells were imaged under time-lapse conditions starting at 16 hpi. White boxes highlight vRNPs/viral inclusions in the cytoplasm in the individual frames. The dashed white and yellow lines mark the cell nucleus and the cell periphery, respectively. The yellow arrows indicate the fission/fusion events and movement of vRNPs/viral inclusions. Bar = 10 µm. Bar in insets = 2 µm.

      2) Although the authors have demonstrated that vRNP condensates exhibit several key characteristics of liquid condensates (they fuse and divide, they dissolve upon hypotonic shock or upon incubation with 1,6-hexanediol, FRAP experiments are consistent with a liquid nature), their aspect ratio (with a median above 1.4) is much higher than the aspect ratio observed for other cellular or viral liquid compartments. This is intriguing and might be discussed.

      IAV inclusions have been shown to interact with microtubules and the endoplasmic reticulum, which confer movement, and also to undergo fusion and fission events. We propose that these interactions and movement impose forces that deform inclusions, making them less spherical. To validate this assumption, we compared the aspect ratio of viral inclusions in the absence and presence of nocodazole (which abrogates microtubule-based movement). The data in Figure 2 show that in the presence of nocodazole, the aspect ratio decreases from 1.42 ± 0.36 to 1.26 ± 0.17, supporting our assumption.

      Figure 2 – Treatment with nocodazole reduces the aspect ratio of influenza A virus inclusions. Cells (A549) were infected with PR8 WT and treated with nocodazole (10 µg/mL) for 2 h, after which the movement of influenza A virus inclusions was captured by live cell imaging. Viral inclusions were segmented, and the aspect ratio was measured in ImageJ, then analysed and plotted in R.
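
      For readers unfamiliar with the measurement, the following minimal sketch shows one common way to obtain an aspect ratio from a segmented inclusion: the ratio of the major to the minor axis of the ellipse that matches the second moments of the object's pixels. It assumes a binary mask as input and is only an approximation of what an image-analysis package reports; the function name `aspect_ratio` and the toy masks are hypothetical and are not taken from our ImageJ/R pipeline.

```python
import math


def aspect_ratio(mask):
    """Major/minor axis ratio of the moment-matched ellipse of a binary mask.

    `mask` is a list of rows, each a list of 0/1 values (1 = object pixel).
    A perfectly round or square object gives 1.0; elongated objects give > 1.
    """
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask contains no object pixels")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    # Second central moments (covariance of the pixel coordinates).
    sxx = sum((x - mean_x) ** 2 for x, _ in pts) / n
    syy = sum((y - mean_y) ** 2 for _, y in pts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts) / n
    # Eigenvalues of the 2x2 covariance matrix, in closed form.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_max = half_trace + root
    lam_min = half_trace - root
    if lam_min <= 0:
        raise ValueError("object is degenerate (a single pixel or a 1-pixel line)")
    # Axis lengths scale with the square root of the eigenvalues.
    return math.sqrt(lam_max / lam_min)


# Toy masks: a 4x4 square and an 8x4 rectangle of object pixels.
square = [[1] * 4 for _ in range(4)]
rectangle = [[1] * 8 for _ in range(4)]

print(aspect_ratio(square))
print(aspect_ratio(rectangle))
```

      Under this definition, a shift in the median from about 1.4 towards about 1.3 corresponds to objects becoming measurably closer to round, which is the direction of the nocodazole effect reported above.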

      3) Similarly, the fusion event presented at the bottom of figure 3I is dubious. It might as well be an aggregation of condensates without fusion.

      We will change this, thank you for the suggestion.

      4) The authors could have more systematically performed FRAP/FLAPh experiments on cells expressing fluorescent versions of both NP and Rab11a to investigate the influence of condensate size, time after infection, or global concentrations of Rab11a in the cell (using the total fluorescence of overexpressed GFP-Rab11a as a proxy) on condensate properties.

      We will try our best to be able to comply with this suggestion as we think it is important.

      Reviewer #3 (Public Review):

      This study aims to define the factors that regulate the material properties of the viral inclusion bodies of influenza A virus (IAV). In a cellular model, it shows that the material properties were not affected by lowering the temperature nor by altering the concentration of the factors that drive their formation. Impressively, the study shows that IAV inclusions may be hardened by targeting vRNP interactions via the known pharmacological modulator (also an IAV antiviral), nucleozin, both in vitro and in vivo. The study employs current state-of-the-art methodology in both influenza virology and condensate biology, and the conclusions are well-supported by data and proper data analysis. This study is an important starting point for understanding how to pharmacologically modulate the material properties of IAV viral inclusion bodies.

      We thank this reviewer for all the positive comments. We will fully address the minor issues brought to our attention, including changing the title of the manuscript, and we will investigate the formation and material properties of IAV inclusions in the presence and absence of nucleozin for the nucleozin escape mutant NP-Y289H.

      References

      Avilov, S. V., Moisy, D., Munier, S., Schraidt, O., Naffakh, N., & Cusack, S. (2012). Replication-competent influenza A virus that encodes a split-green fluorescent protein-tagged PB2 polymerase subunit allows live-cell imaging of the virus life cycle. J Virol, 86(3), 1433-1448. doi:10.1128/JVI.05820-11

      Avilov, S. V., Moisy, D., Naffakh, N., & Cusack, S. (2012). Influenza A virus progeny vRNP trafficking in live infected cells studied with the virus-encoded fluorescently tagged PB2 protein. Vaccine, 30(51), 7411-7417. doi:10.1016/j.vaccine.2012.09.077

      Bhagwat, A. R., Le Sage, V., Nturibi, E., Kulej, K., Jones, J., Guo, M., . . . Lakdawala, S. S. (2020). Quantitative live cell imaging reveals influenza virus manipulation of Rab11A transport through reduced dynein association. Nat Commun, 11(1), 23. doi:10.1038/s41467-019-13838-3

      Chou, Y. Y., Heaton, N. S., Gao, Q., Palese, P., Singer, R. H., & Lionnet, T. (2013). Colocalization of different influenza viral RNA segments in the cytoplasm before viral budding as shown by single-molecule sensitivity FISH analysis. PLoS Pathog, 9(5), e1003358. doi:10.1371/journal.ppat.1003358

      Gavazzi, C., Yver, M., Isel, C., Smyth, R. P., Rosa-Calatrava, M., Lina, B., . . . Marquet, R. (2013). A functional sequence-specific interaction between influenza A virus genomic RNA segments. Proc Natl Acad Sci U S A, 110(41), 16604-16609. doi:10.1073/pnas.1314419110

      Haralampiev, I., Prisner, S., Nitzan, M., Schade, M., Jolmes, F., Schreiber, M., . . . Herrmann, A. (2020). Selective flexible packaging pathways of the segmented genome of influenza A virus. Nat Commun, 11(1), 4355. doi:10.1038/s41467-020-18108-1

      Klosin, A., Oltsch, F., Harmon, T., Honigmann, A., Julicher, F., Hyman, A. A., & Zechner, C. (2020). Phase separation provides a mechanism to reduce noise in cells. Science, 367(6476), 464-468. doi:10.1126/science.aav6691

      Lakdawala, S. S., Wu, Y., Wawrzusin, P., Kabat, J., Broadbent, A. J., Lamirande, E. W., . . . Subbarao, K. (2014). Influenza a virus assembly intermediates fuse in the cytoplasm. PLoS Pathog, 10(3), e1003971. doi:10.1371/journal.ppat.1003971

      Le Sage, V., Kanarek, J. P., Snyder, D. J., Cooper, V. S., Lakdawala, S. S., & Lee, N. (2020). Mapping of Influenza Virus RNA-RNA Interactions Reveals a Flexible Network. Cell Rep, 31(13), 107823. doi:10.1016/j.celrep.2020.107823

      Riback, J. A., Zhu, L., Ferrolino, M. C., Tolbert, M., Mitrea, D. M., Sanders, D. W., . . . Brangwynne, C. P. (2020). Composition-dependent thermodynamics of intracellular phase separation. Nature, 581(7807), 209-214. doi:10.1038/s41586-020-2256-2

      Shafiuddin, M., & Boon, A. C. M. (2019). RNA Sequence Features Are at the Core of Influenza a Virus Genome Packaging. J Mol Biol. doi:10.1016/j.jmb.2019.03.018

      Sugita, Y., Sagara, H., Noda, T., & Kawaoka, Y. (2013). Configuration of viral ribonucleoprotein complexes within the influenza A virion. J Virol, 87(23), 12879-12884. doi:10.1128/JVI.02096-13

    2. Author Response

      Reviewer #1 (Public Review):

      The authors have tried to correlate changes in the cellular environment by means of altering temperature, the expression of key cellular factors involved in the viral replication cycle, and small molecules known to affect key viral protein-protein interactions with some physical properties of the liquid condensates of viral origin. The ideas and experiments are extremely interesting as they provide a framework to study viral replication and assembly from a thermodynamic point of view in live cells.

      The major strengths of this article are the extremely thoughtful and detailed experimental approach; although this data collection and analysis are most likely extremely time-consuming, the techniques used here are so simple that the main goal and idea of the article become elegant. A second major strength is that in order to understand some of the physicochemical properties of the viral liquid inclusion, they used stimuli that have been very well studied, and thus one can really focus on a relatively easy interpretation of most of the data presented here.

      There are three major weaknesses in this article. The way it is written, especially at the beginning, is extremely confusing. First, I would suggest authors should check and review extensively for improvements to the use of English. In particular, the abstract and introduction are extremely hard to understand. Second, in the abstract and introduction, the authors use terms such as "hardening", "perturbing the type/strength of interactions", "stabilization", and "material properties", to cite just a few. It is clear that the authors do know exactly what they are referring to, but the definitions come so late in the text that it all becomes confusing. The second major weakness is that there is a lack of deep discussion of the physical meaning of some of the measured parameters like "C dense vs inclusion", and "nuclear density and supersaturation". There is a need to explain further the physical consequences of all the graphs. Most of them are discussed in a very superficial manner. The third major weakness is a lack of analysis of phase separations. Some of their data suggest phase transition and/or phase separation, thus, a more in-depth analysis is required. For example, could they calculate the change of entropy and enthalpy of some of these processes? Could they find some boundaries for these transitions between the "hard" (whatever that means) and the liquid?

      The authors have achieved almost all their goals, with the caveat of the third weakness I mentioned before. Their work presented in this article is of significant interest and can become extremely important if a more detailed analysis of the thermodynamics parameters is assessed and a better description of the physical phenomenon is provided.

      We thank you for the comments and, in particular, for being so positive regarding the strengths of our manuscript and for raising concerns that will surely improve it. We have taken the following actions to address your concerns:

      1) Extensive revisions have been made to the use of English, particularly in the abstract and introduction. Key terms are defined as they are introduced in the text to enhance the clarity of the argument. This is a significant revision that is highlighted within the text, but it is too extensive to detail here.

      2) In the results section, we improved and extended the discussion of our graphs to the extent possible. However, we found that attempting to explain the graphs' meanings more thoroughly would detract from our manuscript's main focus: identifying thermodynamic changes that could potentially lead to alterations in material properties, specifically aspect ratio, size, and Gibbs free energy. As a result, we introduced the type of information we could obtain from our analyses in the introduction (Lines 112-125) and briefly commented on it in the ‘results’ section (Lines 304-306, sentences below).

      From introduction – lines 112-125:

      “In addition, other parameters like nucleation density determine how many viral condensates are formed per area of cytosol. Overall, the data will inform us if changing one parameter, e.g. the concentration, drives the system towards larger condensates with the same or more stable properties, or more abundant condensates that are forced to maintain the initial or a different size on account of available nucleation centres (Riback et al., 2020; Snead, 2022). It will also inform us if liquid viral inclusions behave like a binary or a multi-component system. In a binary mixture, Cdilute is constant (Klosin et al., 2020). However, in multi-component systems, Cdilute increases with bulk concentration (Riback et al., 2020). This type of information could have direct implications about the condensates formed during influenza infection. As the 8 different genomic vRNPs have a similar overall structure, they could, in theory, behave as a binary system between units of vRNPs and Rab11a. However, a change in Cdilute with concentration would mean that the system behaves as a multi-component system. This could raise the hypothesis that the differences in length, RNA sequence and valency that each vRNP has may be relevant for the integrity and behaviour of condensates.”

      From results lines 304-306:

      This indicates that the liquid inclusions behave as a multi-component system and allows us to speculate that the differences in length, RNA sequence and valency that each vRNP has may be key for the integrity and behaviour of condensates.

      3) The reviewer has drawn our attention to the absence of phase separation analysis in our study. We believe that the formation of influenza A virus condensates is governed by phase separation (or percolation coupled to phase separation). However, we must exercise caution at this point because the condensates we are studying are highly complex, and the physics of our cellular system may not be adequate to claim phase separation without being validated by an in vitro reconstitution system. IAV inclusions contain a variety of cellular membranes, different vRNPs, and Rab11a. While we have robust data to propose a model in which the liquid-like properties of IAV inclusions arise from a network of interacting vRNPs that bridge multiple cognate vRNP-Rab11 units on flexible membranes, similar to what occurs in phase-separated vesicles in neurological synapses, our model for this system still lacks formal experimental validation. As a note, the data supporting our model include the demonstration of the liquid properties of our liquid inclusions (Alenquer et al. 2019, Nat Commun, 10, 1629) and the impairment of recycling endocytic activity during IAV infection (Bhagwat et al. 2020, Nat Commun, 11, 23; Kawaguchi et al. 2012, J Virol, 86, 11086-95; Vale-Costa et al. 2016, J Cell Sci, 129, 1697-710). This leads to aggregated vesicles seen by correlative light and electron microscopy (Vale-Costa et al. 2016, J Cell Sci, 129, 1697-710) and by immunofluorescence and FISH (Amorim et al. 2011, J Virol, 85, 4143-4156; Avilov et al. 2012, Vaccine, 30, 7411-7417; Chou et al. 2013, PLoS Pathog, 9, e1003358; Eisfeld et al. 2011, J Virol, 85, 6117-6126; and Lakdawala et al. 2014, PLoS Pathog, 10, e1003971).

      To be able to explore the significance of the liquid material properties of IAV inclusions, we used the strategy described in this current work. By developing an effective method to manipulate the material properties of IAV inclusions, we provide evidence that controlled phase transitions can be induced, resulting in decreased vRNP dynamics in cells and a negative impact on progeny virion production. This suggests that the liquid character of liquid inclusions is important for their function in IAV infection. We have improved our explanation addressing this concern in the limitations of our study (as outlined below in the box and in manuscript in lines 857-872).

      We are currently establishing an in vitro reconstitution system to formally demonstrate, in an independent publication, that IAV inclusions are formed by phase separation (or percolation coupled to phase separation). For this future work, we teamed up with Pablo Sartori, a theoretical physicist, to derive an in-depth analysis of the thermodynamics of the viral liquid condensates in the in vitro reconstituted system and compare it to results obtained in the cell. This will provide a basis for comparison. We think that cells have too many variables to derive meaningful physics parameters (such as entropy and enthalpy) and models, and that cell-based data need to be complemented by in vitro systems. For example, increasing the concentration inside a cell is not a simple endeavour as it relies on cellular pathways to deliver material to a specific place. At the same time, the 8 vRNPs, as mentioned above, have different size, valency and RNA sequence and can behave very differently in the formation of condensates and maintenance of their material properties. Ideally, they should be analysed individually or in selected combinations. For the future, we will combine data from in vitro reconstitution systems and cells to address this very important point raised by the reviewer.

      From the paper on the section ‘Limitations of the study’:

      “Understanding condensate biology in living cells is physiologically relevant but complex because the systems are heterotypic and away from equilibria. This is especially challenging for influenza A liquid inclusions that are formed by 8 different vRNP complexes, which although sharing the same structure, vary in length, valency, and RNA sequence. In addition, liquid inclusions result from an incompletely understood interactome where vRNPs engage in multiple and distinct intersegment interactions bridging cognate vRNP-Rab11 units on flexible membranes (Chou et al., 2013, Gavazzi et al., 2013, Sugita et al., 2013, Shafiuddin and Boon, 2019, Haralampiev et al., 2020, Le Sage et al., 2020). At present, we lack an in vitro reconstitution system to understand the underlying mechanism governing demixing of vRNP-Rab11a-host membranes from the cytosol. This in vitro system would be useful to explore how the different segments independently modulate the material properties of inclusions, explore if condensates are sites of IAV genome assembly, determine thermodynamic values and thresholds accurately, perform rheological measurements for viscosity and elasticity and validate our findings. The results could be compared to those obtained in cell systems to derive thermodynamic principles happening in a complex system away from equilibrium. Using cells to map how liquid inclusions respond to different perturbations provides the answer of how the system adapts in vivo, but has limitations.”

      Reviewer #2 (Public Review):

      During Influenza virus infection, newly synthesized viral ribonucleoproteins (vRNPs) form cytosolic condensates, postulated as viral genome assembly sites and having liquid properties. vRNP accumulation in liquid viral inclusions requires its association with the cellular protein Rab11a directly via the viral polymerase subunit PB2. Etibor et al. investigate and compare the contributions of entropy, concentration, and valency/strength/type of interactions, on the properties of the vRNP condensates. For this, they subjected infected cells to the following perturbations: temperature variation (4, 37, and 42{degree sign}C), the concentration of viral inclusion drivers (vRNPs and Rab11a), and the number or strength of interactions between vRNPs using nucleozin a well-characterized vRNP sticker. Lowering the temperature (i.e. decreasing the entropic contribution) leads to a mild growth of condensates that does not significantly impact their stability. Altering the concentration of drivers of IAV inclusions impact their size but not their material properties. The most spectacular effect on condensates was observed using nucleozin. The drug dramatically stabilizes vRNP inclusions acting as a condensate hardener. Using a mouse model of influenza infection, the authors provide evidence that the activity of nucleozin is retained in vivo. Finally, using a mass spectrometry approach, they show that the drug affects vRNP solubility in a Rab11a-dependent manner without altering the host proteome profile

      The data are compelling and support the idea that drugs that affect the material properties of viral condensates could constitute a new family of antiviral molecules, as already described for the respiratory syncytial virus (Risso Ballester et al. Nature. 2021).

      Nevertheless, there are some limitations in the study. Several of them are mentioned in a dedicated paragraph at the end of a discussion. This includes the heterogeneity of the system (vRNP of different sizes, interactions between viral and cellular partners far from being understood), which is far from equilibrium, and the absence of minimal in vitro systems that would be useful to further characterize the thermodynamic and the material properties of the condensates.

      There are other ones.

      We thank reviewer 2 for highlighting specific details that need improving and for raising such interesting questions to validate our findings. We have addressed the comments of Reviewer 2 and performed the experiments described (in blue) below each point raised.

      1) The concentrations are mostly evaluated using antibodies. This may be correct for Cdilute. However, measurement of Cdense should be viewed with caution as the antibodies may have some difficulty accessing the inner of the condensates (as already shown in other systems), and this access may depend on some condensate properties (which may evolve along the infection). This might induce artifactual trends in some graphs (as seen in panel 2c), which could, in turn, affect the calculation of some thermodynamic parameters.

      The concern about using antibodies to calculate Cdense is valid, and we considered it very important. We addressed this concern by performing the same analyses using a fluorescently tagged virus that has mNeonGreen fused to the viral polymerase PA (PA-mNeonGreen PR8 virus). Like NP, PA is a component of vRNPs and labels viral inclusions, colocalising with Rab11 when vRNPs are in the cytosol. However, per vRNP there is only one molecule of PA, whilst of NP there are 37-96 depending on the size of vRNPs. As predicted, we did observe changes in the Cdilute, Cdense and nucleation density. However, the measurements and values obtained for Gibbs free energy, size and aspect ratio, whether viral inclusions were detected with fluorescently tagged vRNPs or by antibody staining, followed the same trend and allow us to validate our conclusion that major changes in Gibbs free energy occur solely when there is a change in the valency/strength of interactions but not in temperature or concentration (Figure 1 below). Given the extent of these data, we show the results here but, in the manuscript, we will describe the limitations of using antibodies in our study within the section ‘Limitations of the study’ from lines 881-894. Given the importance of the question regarding the pros and cons of the different systems for analysing thermodynamic parameters, we have decided to systematically assess and explore these differences in detail in a future manuscript.
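      To make the relationship between Cdense, Cdilute and the Gibbs free energy concrete, the following is a minimal sketch. It assumes the common convention ΔG = −RT ln K with partition coefficient K = Cdense/Cdilute; the response does not restate the exact formula used in the manuscript, so the function name, the default temperature and the example intensities are illustrative assumptions only.

```python
import math

R_GAS = 8.314  # gas constant in J/(mol*K)


def gibbs_free_energy(c_dense, c_dilute, temp_celsius=37.0):
    """Free energy of partitioning (J/mol), assuming dG = -R*T*ln(K).

    K = c_dense / c_dilute is the partition coefficient. Both inputs are
    hypothetical concentration proxies (e.g. mean fluorescence intensities)
    and must share the same units.
    """
    if c_dense <= 0 or c_dilute <= 0:
        raise ValueError("concentrations must be positive")
    temp_kelvin = temp_celsius + 273.15
    partition_coefficient = c_dense / c_dilute
    return -R_GAS * temp_kelvin * math.log(partition_coefficient)


# Illustrative values only: a ten-fold enrichment in the dense phase at 37 °C.
example_dg = gibbs_free_energy(c_dense=10.0, c_dilute=1.0)
```

      Under this convention a label that underestimates Cdense (for example, an antibody with poor access to the inclusion interior) shifts K downwards and ΔG towards zero, which is why comparing trends across the two labelling systems is informative.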

      For more information: this reviewer may be asking why we did not use the PA-fluorescent virus in the first place to evaluate inclusion thermodynamics and avoid the accessibility problems that antibodies may have in getting deep into large inclusions. Our answer is that no system is perfect. In the case of the PA-fluorescent virus, the caveats revolve around the fact that the virus is attenuated (Figure 1a below), exhibiting a delayed infection as demonstrated by reduced levels of viral proteins (Figure 1b below). Consistently, it shows differences in the accumulation of vRNPs in the cytosol: viral inclusions form later in infection, and the amount of vRNPs in the cytosol does not reach the levels observed with the PR8-WT virus. After their emergence, inclusions behave as in the wild-type virus (PR8-WT), fusing and dividing (Figure 1c below) and displaying liquid properties.

      As the overarching goal of this manuscript is to evaluate the best strategies to harden liquid IAV inclusions and given that one of the parameters we were testing is concentration, we reasoned that using PR8-WT virus for our analyses would be reasonable.

      In conclusion, both systems have caveats that are important to assess systematically, and these differences may shift or alter thermodynamic parameters such as nucleation density, inclusion maturation rate, Cdense and Cdilute, in particular by varying the total concentration. As a note, to validate all our results using the PA-mNeonGreen PR8 virus, we considered the delayed kinetics and applied our thermodynamic analyses up to 20 hpi rather than 16 hpi.

      However, because of the question raised by this reviewer about the best solution for mitigating errors induced by using antibodies, we re-checked all our data. Not only did we compare the data obtained with the attenuated fluorescently tagged virus against our original data, but we also compared images acquired from Z stacks (as used for concentration and for type/strength of interactions) with those acquired from 2D images. Our analysis revealed a very good match between antibody staining and the fluorescent vRNP virus when images are acquired as Z-stacks and analysed as Z projections. Therefore, we re-analysed all our thermodynamic data for temperature using images acquired from Z stacks and entirely altered Figure 2. We believe that all these comparisons and analyses have greatly improved the manuscript and hence we thank all reviewers for their input.

      Figure 1 – The PA-mNeonGreen virus is attenuated in comparison to the WT virus, and the data obtained for Gibbs free energy are consistent with analyses of images of antibody-stained vRNPs. A. Representation of the PA-mNeonGreen virus (PA-mNG; Abbreviations: NCR: non coding region). B. Cells (A549) were transfected with a plasmid encoding mCherry-NP and co-infected with PA-mNeonGreen virus for 16h, at an MOI of 10. Cells were imaged under time-lapse conditions starting at 16 hpi. White boxes highlight vRNPs/viral inclusions in the cytoplasm in the individual frames. The dashed white and yellow lines mark the cell nucleus and the cell periphery, respectively. The yellow arrows indicate the fission/fusion events and movement of vRNPs/viral inclusions. Bar = 10 µm. Bar in insets = 2 µm. C-D. Cells (A549) were infected or mock-infected with PR8 WT or PA-mNG viruses, at a multiplicity of infection (MOI) of 3, for the indicated times. C. Viral production was determined by plaque assay and plotted as plaque forming units (PFU) per milliliter (mL) ± standard error of the mean (SEM). Data are a pool from 2 independent experiments. D. The levels of viral PA, NP and M2 proteins and actin in cell lysates at the indicated time points were determined by western blotting. (E-G) Biophysical calculations in cells infected with the PA-mNeonGreen virus upon altering temperature (at 10 hpi), evaluating the concentration of vRNPs (over a time course) in conditions expressing native amounts of Rab11a or overexpressing low levels of Rab11a, and upon altering the type/strength of vRNP interactions by adding nucleozin at 10 hpi during the indicated time periods. All data (Ccytoplasm/Cnucleus, Cdense, Cdilute, area, aspect ratio and Gibbs free energy) are represented as boxplots. Above each boxplot, same letters indicate no significant difference between them, while different letters indicate a statistical significance at α = 0.05 using one-way ANOVA, followed by Tukey multiple comparisons of means for parametric analysis, or Kruskal-Wallis test with Bonferroni correction for non-parametric analysis.

      2) Although the authors have demonstrated that vRNP condensates exhibit several key characteristics of liquid condensates (they fuse and divide, they dissolve upon hypotonic shock or upon incubation with 1,6-hexanediol, FRAP experiments are consistent with a liquid nature), their aspect ratio (with a median above 1.4) is much higher than the aspect ratio observed for other cellular or viral liquid compartments. This is intriguing and might be discussed.

      IAV inclusions have been shown to interact with microtubules and the endoplasmic reticulum, which confer movement, and to undergo fusion and fission events. We propose that these interactions and movements exert forces that deform inclusions, making them less spherical. To validate this assumption, we compared the aspect ratio of viral inclusions in the absence and presence of nocodazole (which abrogates microtubule-based movement). The data in figure 2 show that in the presence of nocodazole, the aspect ratio decreases from 1.42 ± 0.36 to 1.26 ± 0.17, supporting our assumption.

      Figure 2 – Treatment with nocodazole reduces the aspect ratio of influenza A virus inclusions. Cells (A549) were infected with PR8 WT for 8 h and treated with nocodazole (10 µg/mL) for 2 h, after which the movement of influenza A virus inclusions was captured by live cell imaging. Viral inclusions were segmented, and the aspect ratio was measured in ImageJ, then analysed and plotted in R.

      3) Similarly, the fusion event presented at the bottom of figure 3I is dubious. It might as well be an aggregation of condensates without fusion.

      We have changed this (see Fig. 5A and B in the manuscript). Thank you for the suggestion.

      4) The authors could have more systematically performed FRAP/FLAPh experiments on cells expressing fluorescent versions of both NP and Rab11a to investigate the influence of condensate size, time after infection, or global concentrations of Rab11a in the cell (using the total fluorescence of overexpressed GFP-Rab11a as a proxy) on condensate properties.

      We have included a new figure (Figure 5) with the suggested data.

    1. Author Response

      Reviewer #2 (Public Review):

      1) The main limitation of this study is that the results are primarily descriptive in nature, and thus, do not provide mechanistic insight into how Ryr1 disease mutations lead to the muscle-specific changes observed in the EDL, soleus and EOM proteomes.

      An intrinsic feature of the high-throughput proteomic analysis technology is the generation of lists of differentially expressed proteins (DEP) in different muscles from WT and mutated mice. Although the definition of mechanistic insights related to changes of dozens of proteins is very interesting, it is a difficult task to accomplish and goes beyond the goal of the high-throughput proteomic analysis presented here. Nevertheless, the analysis of DEPs may indeed provide arguments to speculate on the pathogenesis of the phenotype linked to recessive RyR1 mutations. In the unrevised manuscript, we pointed out that the fiber type I predominance observed in congenital myopathies linked to recessive Ryr1 mutations is consistent with the high expression level of heat shock proteins in slow twitch muscles. However, as suggested by Reviewer 3, we have removed "vague statements" from the text of the revised manuscript, concerning major insights into pathophysiological mechanisms, since we are aware that the mechanistic information, if any, that we can extract from the data set cannot go beyond the intrinsic limitations of the high-throughput proteomic technology.

      b) Results comparing fast twitch (EDL) and slow twitch (soleus) muscles from WT mice confirmed several known differences between the two muscle types. Similar analyses between EOM/EDL and EOM/soleus muscles from WT mice were not conducted.

      We agree with the point raised by the Reviewer. In the revised manuscript we have changed Figure 2. The new Figure 2 shows the analysis of differentially expressed proteins in EDL, soleus and EOMs from WT mice. We have also added 2 new Tables (new Supplementary Table 2 and 3) and have inserted our findings in the revised Results section (page 7, lines 157-176; pages 8 and 9).

      c) While a reactome pathway analysis for proteins changes observed in EDL is shown in Supplemental Figure 1, the authors do not fully discuss the nature of the proteins and corresponding pathways impacted in the other two muscle groups analyzed.

      We have now included in the revised manuscript a new Figure 2 which includes the Reactome pathway analysis comparing EDL with soleus, EDL with EOM and soleus with EOM (panels C, F and I, respectively). We have also inserted into the revised manuscript a brief description of the pathways showing the greatest changes in protein content (page 7, lines 156-175; pages 8 and 9). We agree that the data showing changes in protein content between the 3 muscle groups of the WT mice are important, also because they validate the results of the proteomic approach. Indeed, the present results confirm that many proteins including MyHCIIb, calsequestrin 1, SERCA1, parvalbumin etc. are more abundantly expressed in fast twitch EDL muscles compared to soleus. Similarly, our results confirm that EOMs are enriched in MyHC-EO as well as cardiac isoforms of ECC proteins. This point has been clarified in the revised version of the manuscript (page 8, lines 198-213; page 9, lines 214-228). Nevertheless, we would like to point out that the main focus of our study is to compare the changes of protein content induced by the presence of recessive RyR1 mutations.

      Reviewer #3 (Public Review):

      a) it would be useful to determine whether changes in protein levels correlated with changes in mRNA levels …….

      We performed qPCR analysis of Stac3 and Cacna1s in EDL, Soleus and EOM from WT mice (see Figure 1 below). The expression of transcripts encoding Cacna1s and Stac3 is approximately 9-fold higher in EDL compared to Soleus. The fold change of Stac3 and Cacna1s transcripts in EDL muscles is higher compared to the differences we observed by mass spectrometry at the protein level between EDL and Soleus. Indeed, we found that the content of the Stac3 protein in EDL is 3-fold higher compared to that in soleus. Although there is no apparent linear correlation between mRNA and protein levels, we believe that a few plausible conclusions can be drawn, namely: (i) the expression level of both transcripts and proteins is higher in EDL compared to EOM and soleus muscles, respectively, (ii) the expression level of transcripts encoding Stac3 correlates with that of transcripts encoding Cacna1s and confirms the proteomic data. In addition, the level of Stac3 transcript does not change between WT and dHT, confirming our proteomic data which show that Stac3 protein content in muscles from dHT is similar to that found in WT littermates. Altogether these results support the concept that the differences in Stac3 content between EDL and soleus occur at both the protein and transcript levels, namely high Stac3 mRNA levels correlate with higher protein content (EDL) and low mRNA levels correlate with low Stac3 protein content in Soleus muscles (see Figure 1 below).

      Figure 2: qPCR of Cacna1s and Stac3 in muscles from WT mice. The expression levels of the transcripts encoding Cacna1s and Stac3 are the highest in EDL muscles and the lowest in soleus muscles (top panels). There are no significant changes in their relative expression levels in dHT vs WT. Each symbol represents the value from a single mouse. * p=0.028, Mann-Whitney test. qPCR was performed as described in Elbaz et al., 2019 (Hum Mol Genet 28, 2987-2999).

      ….and whether or not the protein present was functional, and whether Stac3 was in fact stoichiometrically depleted in relation to Cacna1s.

      We thought about this point but think that there are no plausible arguments to believe that Stac3 is not functional, one simple reason being that our WT mice do not have a phenotype which would be associated with the absence of Stac3 (Reinholt et al., PLoS One 8, e62760 2013, Nelson et al. Proc. Natl. Acad. Sci. USA 110:11881 2013).

      b) In the abstract, the authors stated that skeletal muscle is responsible for voluntary movement. It is also responsible for non-voluntary. The abstract needs to be refocused on the mutation and on what we learn from this study. Please avoid vague statements like "we provide important insights to the pathophysiological mechanisms..." mainly when the study is descriptive and not mechanistic.

      The abstract of the revised manuscript has been rewritten. In particular, we removed statements referring to important “pathophysiological mechanistic insight”.

      c) The author should bring up the mutation name, location and phenotype early in the introduction.

      In the revised manuscript we provide the information requested by the Reviewer (page 2 lines 36-38 and page 4, lines 98-102).

      d) This reviewer also suggests that the authors refocus the introduction on the mutation location in the 3D RyR1 structure (available cryo-EM structure), if there is any nearby ligand binding site, protomers junction or any other known interacting protein partners. This will help the reader to understand how this mutation could be important for the channel's function

      The residue Ala4329 is present inside the TMx (Auxiliary transmembrane helices) domain which spans from residue 4322 to 4370 and interposes structurally (des Georges A et al. 2016 Cell 167,145-57; Chen W, et al. 2020 EMBO Rep. 21, e49891). Although the structural resolution of the region has been improved (des Georges et al, 2016), parts of the domain still remain with no defined atomic coordinates, especially the region encompassing a.a. E4253 – F4540. Because of such undefined atomic coordinates of the region E4253-F4540, we are not able to determine the real orientation and the disposition of the amino acids in this region, including the A4329 residue. As reference, structure PDB: 5TAL of des Georges et al, 2016 was analyzed with UCSF Chimera (production version 1.16) (Pettersen et al. J. Comput. Chem. 25: 1605-1612. doi: 10.1002/jcc.20084).

    1. Author Response

      Public Evaluation Summary:

      The authors re-analyzed a previously published dataset and identified patterns suggesting that increased bacterial biodiversity in the gut may create new niches that lead to gene loss in a focal species and promote the generation of more diversity. Two limitations are (i) that sequencing depth may not be sufficient to analyze strain-level diversity and (ii) that the evidence is exclusively based on correlations, and the observed patterns could also be explained by other eco-evolutionary processes. The claims should be supported by a more detailed analysis, and alternative hypotheses that the results do not fully exclude should be discussed. Understanding drivers of diversity in natural microbial communities is an important question that is of central interest to biomedically oriented microbiome scientists, microbial ecologists and evolutionary biologists.

      We agree that understanding the drivers of diversity in natural communities is an important and challenging question to address. We believe that our analysis of metagenomes from the gut microbiomes is complementary to controlled laboratory experiments and modeling studies. While these other studies are better able to establish causal relationships, we rely on correlations – a caveat which we make clear, and offer different mechanistic explanations for the patterns we observe.

      We also mention the caveat that we are only able to measure sub-species genetic diversity in relatively abundant species with high sequencing depth in metagenomes. These relatively abundant species include dozens of species in two metagenomic datasets, and we see no reason why they would not generalize to other members of the microbiome. Nonetheless, further work will be required to extend our results to rarer species.

      Our revised manuscript includes two major new analyses. First, we extend the analysis of within-species nucleotide diversity to non-synonymous sites, with generally similar results. This suggests that evolutionarily older, less selectively constrained synonymous mutations and more recent non-synonymous mutations that affect protein structure both track similarly with measures of community diversity – with some subtle differences described in the manuscript.

      Second, we extend our analysis of dense time series data from one individual stool donor and one deeply covered species (B. vulgatus) to four donors and 15 species. This allowed us to reinforce the pattern of gene loss in more diverse communities with greater statistical support. Our correlational results are broadly consistent with the predictions of DBD from modeling and experimental studies, and they open up new lines of inquiry for microbiome scientists, ecologists, and evolutionary biologists.

      Reviewer #1 (Public Review):

      This paper makes an important contribution to the current debate on whether the diversity of a microbial community has a positive or negative effect on its own diversity at a later time point. In my view, the main contribution is linking the diversity-begets-diversity patterns, already observed by the same authors and others, to genomic signatures of gene loss that would be expected from the Black Queen Hypothesis, establishing an eco-evolutionary link. In addition, they test this hypothesis at a more fine-grained scale (strain-level variation and SNP) and do so in human microbiome data, which adds relevance from the biomedical standpoint. The paper is a well-written and rigorous analysis using state-of-the-art methods, and the results suggest multiple new experiments and testable hypotheses (see below), which is a very valuable contribution.

      We thank the reviewer for their generous comments.

      That being said, I do have some concerns that I believe should be addressed. First of all, I am wondering whether gene loss could also occur because of environmental selection that is independent of other organisms or the diversity of the community. An alternative hypothesis to the Black Queen is that there might have been a migration of new species from outside and then loss of genes could have occurred because of the nature of the abiotic environment in the new host, without relationship to the community diversity. Telling the difference between these two hypotheses is hard and would require extensive additional experiments, which I don't think is necessary. But I do think the authors should acknowledge and discuss this alternative possibility and adjust the wording of their claims accordingly.

      We concur with the reviewer that the drivers of the correlation between community diversity and gene loss are unclear. Therefore, we have now added the following text to the Discussion:

      “Here we report that genome reduction in the gut is higher in more diverse gut communities. This could be due to de novo gene loss, preferential establishment of migrant strains encoding fewer genes, or a combination of the two. The mechanisms underlying this correlation remain unclear and could be due to biotic interactions – including metabolic cross-feeding as posited by some models (Estrela et al., 2022; San Roman and Wagner, 2021, 2018) but not others (Good and Rosenfeld, 2022) – or due to unknown abiotic drivers of both community diversity and gene loss.”

      Additionally, we have revised Figure 1 to show that strain invasions/replacements, in addition to evolutionary change, could be an important driver of changes in intra-species diversity in the microbiome.

      Another issue is that gene loss is happening in some of the most abundant species in the gut. Under Black Queen though, we would expect these species to be most likely "donors" in cross-feeding interactions. Authors should also discuss the implications, limitations, and possible alternative hypotheses of this result, which I think also stimulates future work and experiments.

      We thank the reviewer for raising this point. It is unclear to us whether the more abundant species would be donors in cross-feeding interactions. If we understand correctly, the reviewer is suggesting that more abundant donors will contribute more total biomass of shared metabolites to the community. This idea makes sense under the assumption that the abundant species are involved in cross-feeding interactions in the first place, which may or may not be the case. As our work heavily relies on a dataset that we previously analyzed (HMP), we wish to cite Figure S20 in Garud, Good et al. 2019 PLoS Biology in which we found there are comparable rates of gene changes across the ~30 most abundant species analyzed in the HMP. This suggests that among the most abundant species analyzed, there is no relationship between their abundance and gene change rate.

      That being said, we acknowledge that our study is limited to the relatively abundant focal species and state now in the Discussion: “Deeper or more targeted sequencing may permit us to determine whether the same patterns hold for rarer members of the microbiome.”

      Regarding Figure 5B, there are a couple of questions I believe the authors should clarify. First, how is it possible that many species have close to 0 pathways? Second, besides the overall negative correlation, the data show some very conspicuous regularities, e.g. many different "lines" of points with identical linear negative slope but different intercept. My guess is that this is due to some constraints in the pathway detection methods, but I struggle to understand it. I think the authors should discuss these patterns in more detail.

      We sincerely thank the reviewer for raising this issue, as it prompted us to investigate more deeply the patterns observed at the pathway level. In short, we decided to remove this analysis from the paper because of a number of bioinformatics issues that we realized were contributing to the signal. However, in support of BQH-like mechanisms at play, we do find evidence for gene loss in more diverse communities across multiple species in both the HMP and Poyet datasets. Below we detail our investigation into Figure 5B and how we arrived at the conclusion that it should be removed:

      (1) Regarding data points in Figure 5B where many focal species have “zero pathways”, we first clarify how we compute pathway presence and richness. Pathway abundance data per species were downloaded from the HMP1-2 database, and these pathway abundances were computed using HUMAnN (HMP Unified Metabolic Analysis Network). According to HUMAnN documentation, pathway abundance is proportional to the number of complete copies of the pathway in the community; this means that if at least one component reaction in a certain pathway is missing coverage (for a sample-species pair), the pathway abundance may be zero (note that HUMAnN also employs “gap filling” to allow no more than one required reaction to have zero abundance). As such, it is likely that insufficient coverage, especially for low-abundance species, causes many pathways to report zero abundance in many species in many samples. Indeed, 556 of the 649 species considered had zero “present” pathways (i.e. having nonzero abundance) in at least 400 of the 469 samples (see figure below).

      (2) We thank the reviewer for pointing out the “conspicuous regularities” in Figure 5B, particularly “parallel lines” of data points that we discovered are an artifact of the flawed way in which we computed “community pathway richness [excluding the focal species].” Each diagonal line of points corresponds to different species in the same sample, and because community pathway richness is computed as the total number of pathways [across all species in the sample] minus the number of pathways in the focal species, the current Figure 5B is really plotting y against X-y for each sample (where X is a sample’s total community pathway richness, and y is the pathway richness of an individual species in that sample). This computation fails to account for the possibility that a pathway in an excluded focal species will still be present in the community due to redundancy, and indeed BQH tests for whether this redundancy is kept low in diverse communities due to mechanisms such as gene loss.
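      The origin of the parallel lines can be illustrated with a short sketch. All numbers below are made up for illustration and are not taken from the HMP data; the point is only that plotting y against X-y forces every sample onto a line of slope -1 whose intercept is that sample's total richness X.

```python
def points_for_sample(total_richness, focal_richness_values):
    """Return (community richness excluding focal, focal richness) pairs
    using the flawed definition: community = X - y."""
    return [(total_richness - y, y) for y in focal_richness_values]


def least_squares_line(points):
    """Ordinary least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical samples: total richness X and per-species richness values y.
samples = {
    "sample_1": (120, [10, 25, 40, 60]),
    "sample_2": (200, [15, 30, 80, 110]),
}

fits = {
    name: least_squares_line(points_for_sample(total, focal_values))
    for name, (total, focal_values) in samples.items()
}
```

      Each fitted line has slope -1 and an intercept equal to the sample's X regardless of the biology, so the negative trend within a sample carries no information about gene loss.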

      We attempted to instead plot community pathway richness defined as the number of unique pathways covered by all species other than the focal species. This is equivalent to [number of unique pathways across all species in a sample] minus the [number of pathways that are ONLY present in the focal species and not any other species in the sample]. However, when we recomputed community pathway richness this way, it is rare that a pathway is present in only one species in a sample. Moreover, we find that with the exception of E. coli, focal species pathway richness tended to be very similar across the 469 samples, often reaching an upper limit of focal species pathway richness observed. (It is unclear to what extent lower pathway richnesses are due to low species abundance/low sample coverage versus gene loss). This new plot reveals even more regularities and is difficult to interpret with respect to BQH. (Note that points are colored by species; the cluster of black dots with outlying high focal pathway richness corresponds to the “unclassified” stratum which can be considered a group of many different species.)
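      The difference between the original and the corrected definitions of community pathway richness can be sketched with sets. The species and pathway labels below are hypothetical placeholders, not entries from the HMP1-2 tables.

```python
def richness_naive(sample, focal):
    """Original (flawed) definition: all unique pathways minus every
    pathway found in the focal species, shared or not."""
    all_pathways = set().union(*sample.values())
    return len(all_pathways) - len(sample[focal])


def richness_excluding_focal(sample, focal):
    """Corrected definition: unique pathways covered by all species
    other than the focal species."""
    others = set().union(
        *(paths for species, paths in sample.items() if species != focal)
    )
    return len(others)


def richness_via_focal_only(sample, focal):
    """Equivalent form from the text: all unique pathways minus those
    present ONLY in the focal species."""
    all_pathways = set().union(*sample.values())
    others = set().union(
        *(paths for species, paths in sample.items() if species != focal)
    )
    return len(all_pathways) - len(sample[focal] - others)


# Hypothetical sample: species mapped to their detected pathways.
sample = {
    "species_A": {"P1", "P2", "P3"},
    "species_B": {"P2", "P3", "P4"},
    "species_C": {"P3"},
}
```

      For `species_A` the naive definition gives 1, whereas both corrected forms give 3, because P2 and P3 remain available in the community through other species. When almost no pathway is unique to one species, the corrected value is nearly constant within a sample, which matches the behaviour described above.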

      Overall, because community pathway richness (excluding a focal species) seems to primarily vary with sample rather than focal species in this dataset when using the most simple/strict definition of community pathway richness as described above, it is difficult to probe the Black Queen Hypothesis using a plot like Figure 5B. As pointed out by reviewers, lack of sequencing depth to analyze strain-level diversity and accurately quantify pathway abundance, irrespective of species abundance, seems to be a major barrier to this analysis. As such, we have decided to remove Figure 5B from the paper and rewrite some of our conclusions accordingly.

      Finally, I also have some conceptual concerns regarding the genomic analysis. Namely, genes can be used for biosynthesis of e.g. building blocks, but also for consumption of nutrients. Under the Black Queen Hypothesis, we would expect the adaptive loss of biosynthetic genes, as those nutrients become provided by the community. However, for catabolic genes or pathways, I would expect the opposite pattern, i.e. the gain of catabolic genes that would allow taking advantage of a more rich environment resulting from a more diverse community (or at least, the absence of pathway loss). These two opposing forces for catabolic and biosynthetic genes/pathways might obscure the trends if all genes are pooled together for the analysis. I believe this can be easily checked with the data the authors already have, and could allow the authors to discuss more in detail the functional implications of the trends they see and possibly even make a stronger case for their claims.

      We thank the reviewer for their suggestion. As explained above, we have removed the pathway analysis from the paper due to technical reasons. However, we did investigate catabolic and biosynthetic pathways separately as suggested by the reviewer as we describe below:

      We obtained subsets of biosynthetic pathways and catabolic pathways by searching for keywords (such as “degradation” for catabolic) in the MetaCyc pathway database. After excluding the “unclassified” species stratum, we observe a total of 279 biosynthetic and 167 catabolic pathways present in the HMP1-2 pathway abundance dataset. Using the corrected definition of community pathway richness excluding a focal species, for each pathway type—either biosynthetic or catabolic—we plotted focal species pathway richness against community pathway richness including all pathways regardless of type:

      We observe the same problem where, within a sample, community pathway richness excluding the focal species hardly varies no matter which focal species it is, due to nearly all of its detected pathways being present in at least one other species; this makes the plots difficult to interpret.
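      The keyword-based split of pathways into biosynthetic and catabolic subsets described above could be sketched as follows. Only the keyword “degradation” is named in the response; the “biosynthesis” keyword, the handling of ambiguous names and the example pathway names are assumptions made for illustration.

```python
CATABOLIC_KEYWORDS = ("degradation",)     # keyword named in the response
BIOSYNTHETIC_KEYWORDS = ("biosynthesis",)  # assumed keyword, not from the source


def classify_pathway(name):
    """Label a pathway name as 'catabolic', 'biosynthetic' or 'other'.

    Names matching both or neither keyword set are left as 'other'
    rather than guessed.
    """
    lowered = name.lower()
    is_catabolic = any(key in lowered for key in CATABOLIC_KEYWORDS)
    is_biosynthetic = any(key in lowered for key in BIOSYNTHETIC_KEYWORDS)
    if is_catabolic and not is_biosynthetic:
        return "catabolic"
    if is_biosynthetic and not is_catabolic:
        return "biosynthetic"
    return "other"


# Illustrative pathway names only.
example_names = [
    "L-arginine biosynthesis I",
    "starch degradation V",
    "glycolysis I",
]
labels = {name: classify_pathway(name) for name in example_names}
```

      A simple substring match of this kind is easy to audit but will miss pathways whose names use other terms, so the resulting counts depend on the keyword list chosen.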

      Reviewer #2 (Public Review):

      The authors re-analysed two previously published metagenomic datasets to test how diversity at the community level is associated with diversity at the strain level in the human gut microbiota. The overall idea was to test if the observed patterns would be in agreement with the "diversity begets diversity" (DBD) model, which states that more diversity creates more niches and thereby promotes further increase of diversity (here measured at the strain-level). The authors have previously shown evidence for DBD in microbiomes using a similar approach but focusing on 16S rRNA level diversity (which does not provide strain-level insights) and on microbiomes from diverse environments.

      One of the datasets analysed here is a subset of a cross-sectional cohort from the Human Microbiome Project. The other dataset comes from a single individual sampled longitudinally over 18 months. This second dataset allowed the authors to not only assess the links between different levels of diversity at single timepoints, but test if high diversity at a given timepoint is associated with increased strain-level diversity at future timepoints.

      Understanding eco-evolutionary dynamics of diversity in natural microbial communities is an important question that remains challenging to address. The paper is well-written and the detailed description of the methodological approaches and statistical analyses is exemplary. Most of the analyses carried out in this study seem to be technically sound.

      We thank the reviewer for their kind words, comments, and suggestions.

      The major limitation of this study comes with the fact that only correlations are presented, some of which are rather weak, contrast each other, or are based on a small number of data points. In addition, finding that diversity at a given taxonomic rank is associated with diversity within a given taxon is a pattern that can be explained by many different underlying processes, e.g. species-area relationships, nutrient (diet) diversity, stressor diversity, immigration rate, and niche creation by other microbes (i.e. DBD). Without experiments, it remains vague if DBD is the underlying process that acts in these communities based on the observed patterns.

      We thank the reviewer for their comments. First, regarding the issue of this being a correlative study, we now more clearly acknowledge that mechanistic studies (perhaps in experimental settings) are required to fully elucidate DBD and BQH dynamics. However, we note that our correlational study from natural communities is complementary to experimental and modeling studies, to test the extent to which their predictions hold in more complex, realistic settings. This is now mentioned throughout the manuscript, most explicitly at the end of the Introduction:

      “Although such analyses of natural diversity cannot fully control for unmeasured confounding environmental factors, they are an important complement to controlled experimental and theoretical studies which lack real-world complexity.”

      Second, to increase the number of data points analyzed in the Poyet study, we now include 15 species and four different hosts (new Figure 5). The association between community diversity and gene loss is now much more statistically robust, and consistent across the Poyet and HMP time series.

      Third, we acknowledge more clearly in the Discussion that other processes, including diet and other environmental factors can generate the DBD pattern. We also now stress more prominently the possibility that strain migration across hosts may be responsible for the patterns observed. For example, in Figure 1, we illustrate the possibility of strain migration generating the patterns we observe.

      Below we quote a paragraph that we have now added in the Discussion:

      "Second, we cannot establish causal relationships without controlled experiments. We are therefore careful to conclude that positive diversity slopes are consistent with the predictions of DBD, and negative slopes with EC, but unmeasured environmental drivers could be at play. For example, increased dietary diversity could simultaneously select for higher community diversity and also higher intra-species diversity. In our previous study, we found that positive diversity slopes persisted even after controlling for potential abiotic drivers such as pH and temperature (Madi et al., 2020), but a similar analysis was not possible here due to a lack of metadata. Neutral processes can account for several ecological patterns such as species-area relationships (Hubbell, 2001), and must be rejected in favor of niche-centric models like DBD or EC. Using neutral models without DBD or EC, we found generally flat or negative diversity slopes due to sampling processes alone and that positive slopes were hard to explain with a neutral model (Madi et al., 2020). These models were intended mainly for 16S rRNA gene sequence data, but we expect the general conclusions to extend to metagenomic data. Nevertheless, further modeling and experimental work will be required to fully exclude a neutral explanation for the diversity slopes we report in the human gut microbiome.”

      Finally, we now put more emphasis on the importance of migration (strain invasion) as a non-exclusive alternative to de novo mutation and gene gain/loss. This is mentioned in the Abstract and is also illustrated in the revised Figure 1.

      Another limitation is that the total number of reads (5 mio for the longitudinal dataset and 20 mio for the cross-sectional dataset) is low for assessing strain-level diversity in complex communities such as the human gut microbiota. This is probably the reason why the authors only looked at one species with sufficient coverage in the longitudinal dataset.

      Indeed, this is a caveat which means we can only consider sub-species diversity in relatively abundant species. Nevertheless, this allows us to study dozens of species in the HMP and 15 in the more frequent Poyet time series. As more deeply sequenced metagenomes become available, future studies will be able to access the rarer species to test whether the same patterns hold or not. This is now mentioned prominently as a caveat of our study in the second Discussion paragraph:

      “First, using metagenomic data from human microbiomes allowed us to study genetic diversity, but limited us to considering only relatively abundant species with genomes that were well-covered by short sequence reads. Deeper or more targeted sequencing may permit us to determine whether the same patterns hold for rarer members of the microbiome. However, it is notable that the majority of the dozens of species across the two datasets analyzed support DBD, suggesting that the phenomenon may generalize.”

      We also note that rarefaction was only applied to calculate community richness, not to estimate sub-species diversity. We apologize for this confusion, which is now clarified in the Methods as follows:

      “SNV and gene content variation within a focal species were ascertained only from the full dataset and not the rarefied dataset.”

      Analyzing the effect of diversity at a given timepoint on strain-level diversity at a later timepoint adds an important new dimension to this study which was not assessed in the previous study about the DBD in microbiomes by some of the authors. However, only a single species was analysed in the longitudinal dataset and comparisons of diversity were only done between two consecutive timepoints. This dataset could be further exploited to provide more insights into the prevailing patterns of diversity.

      We thank the reviewer for raising this point. We now have considered all 15 species for which there was sufficient coverage from the Poyet dataset, which included four different stool donors. Additionally, in the HMP dataset, we analyze 54 species across 154 hosts, with both datasets showing the same correlation between community diversity and gene loss.

      Additionally, we followed the suggestion of the reviewer of examining additional time lags, and in Figure 5 we do observe a dependency on time. This is now described in the Results as follows:

      “Using the Poyet dataset, we asked whether community diversity in the gut microbiome at one time point could predict polymorphism change at a future time point by fitting GAMs with the change in polymorphism rate as a function of the interaction between community diversity at the first time point and the number of days between the two time points. Shannon diversity at the earlier time point was correlated with increases in polymorphism (consistent with DBD) up to ~150 days (~4.5 months) into the future (Figure S4), but this relationship became weaker and then inverted (consistent with EC) at longer time lags (Fig 5A, Table S8, GAM, P=0.023, Chi-square test). The diversity slope is approximately flat for time lags between four and six months, which could explain why no significant relationship was found in HMP, where samples were collected every ~6 months. No relationship was observed between community richness and changes in polymorphism (Table S8, GAM, P>0.05).”

      Finally, the evidence that gene loss follows increase in diversity is weak, as very few genes were found to be lost between two consecutive timepoints, and the analysis is based on only a single species. Moreover, while positive correlations were found between overall community diversity and gene family diversity in single species, the opposite trend was observed when focusing on pathway diversity. A more detailed analysis (of e.g. the functions of the genes and pathways lost/gained) to explain these seemingly contrasting results and a more critical discussion of the limitations of this study would be desirable.

      We agree that our previous analysis of one species in one host provided weak support for gene loss following increases in diversity. As described in the response above, we have now expanded this analysis to 15 focal species and 4 independent hosts with extensive time series. We now analyze this larger dataset and report the more statistically robust results as follows:

      “We found that community Shannon diversity predicted future gene loss in a focal species, and this effect became stronger with longer time lags (Fig 5B, Table S9, GLMM, P=0.006, LRT for the effect of the interaction between the initial Shannon diversity and time lag on the number of genes lost). The model predicts that increasing Shannon diversity from its minimum to its maximum would result in the loss of 0.075 genes from a focal species after 250 days. In other words, about one of the 15 focal species considered would be expected to lose a gene in this time frame.

      Higher Shannon diversity was also associated with fewer gene gains, and this relationship also became stronger over time (Fig 5C, Table S9, GLMM, P=1.11e-09, LRT). We found a similar relationship between community species richness and gene gains, although the relationship was slightly positive at shorter time lags (Fig 5D, Table S9, GLMM, P=3.41e-04, LRT). No significant relationship was observed between richness and gene loss (Table S9, GLMM, P>0.05). Taken together with the HMP results (Fig 4), these longer time series reveal how the sign of the diversity slope can vary over time and how community diversity is generally predictive of reduced focal species gene content.”
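
The statement above that about one of the 15 focal species would be expected to lose a gene follows from simple arithmetic on the quoted model prediction. The short check below is only a sketch that uses the two numbers given in the quoted text; the variable names are illustrative.

```python
# Values quoted in the response above.
expected_loss_per_species = 0.075  # genes lost per focal species after 250 days
n_focal_species = 15

# Expectations add across species, so the total is a simple product.
expected_total_loss = expected_loss_per_species * n_focal_species
print(f"Expected genes lost across all focal species: {expected_total_loss:.3f}")
```

The product is slightly above one, which is consistent with the "about one gene" reading in the quoted Results text.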

      As described in detail in the response to Reviewer 1 above, we found that the HUMAnN2 pathway analyses previously described suffered from technical challenges and we deemed them inconclusive. We have therefore removed the pathway results from the manuscript.

      Reviewer #3 (Public Review):

      This work provides a series of tests of hypothesis, which are not mutually exclusive, on how genomic diversity is structured within human microbiomes and how community diversity may influence the evolution of a focal species.

      Strengths:

      The paper leverages on existing metagenomic data to look at many focal species at the same time to test for the importance of broad eco-evolutionary hypothesis, which is a novelty in the field.

      Thank you for the succinct summary and recognition of the strengths of our work.

      Weaknesses:

      It is not very clear if the existing metagenomic data has sufficient power to test these models.

      It is not clear, neither in the introduction nor in the analysis what precise mechanisms are expected to lead to DBD.

      The conclusion that data support DBD appears to depend on which statistics are used to measure community diversity. Also, performing a test to reject a null neutral model would have been welcome either in the results or in the discussion.

      In our revised manuscript, we emphasize several caveats – including that we only have power to test these hypotheses in focal species with sufficient metagenomic coverage to measure sub-species diversity. We also describe more in the Introduction how the processes of competition and niche construction can lead to DBD. We also acknowledge that unmeasured abiotic drivers of both community diversity and sub-species diversity could also lead to the observed patterns. Throughout the manuscript, we attempt to describe the results and acknowledge multiple possible interpretations, including DBD and EC acting with different strengths on different species and time scales. Our previous manuscript assessing the evidence for DBD using 16S rRNA gene amplicon data from the Earth Microbiome Project (Madi et al., eLife 2020) assessed null models based on neutral ecological theory, and found it difficult to explain the observation of generally positive diversity slopes without invoking a non-neutral mechanism like DBD. While a new null model tailored to metagenomic data might provide additional nuance, we think developing one is beyond the scope of the manuscript – which is in the format of a short ‘Research Advance’ to expand on our previous eLife paper, and we expect that the general results of our previously reported null model provide a reasonable intuition for our new metagenomic analysis. This is now mentioned in the Discussion as follows:

      “In our previous study, we found that positive diversity slopes persisted even after controlling for potential abiotic drivers such as pH and temperature (Madi et al., 2020), but a similar analysis was not possible here due to a lack of metadata. Neutral processes can account for several ecological patterns such as species-area relationships (Hubbell, 2001), and must be rejected in favor of niche-centric models like DBD or EC. Using neutral models without DBD or EC, we found generally flat or negative diversity slopes due to sampling processes alone and that positive slopes were hard to explain with a neutral model (Madi et al., 2020). These models were intended mainly for 16S rRNA gene sequence data, but we expect the general conclusions to extend to metagenomic data. Nevertheless, further modeling and experimental work will be required to fully exclude a neutral explanation for the diversity slopes we report in the human gut microbiome.”

    1. Author Response

      Reviewer #2 (Public Review):

      Zou et al. presented a comprehensive study where they generated single-cell RNA profiling of 138,982 cells from 13 samples of six patients including AK, squamous cell carcinoma in situ (SCCIS), cSCC, and their matched normal tissues, covering comprehensive clinical courses of cSCC. Using bioinformatics analysis, they identified keratinocytes, CAFs, immune cells, and their subpopulations. The authors further compared signatures within subpopulations of keratinocytes along with the clinical progression, especially basal cells, and identified many interesting genes. They also further validate some of the markers in an independent cohort using IHC, followed by some knockdown experiments using cSCC cell lines.

      The strength of this study is the unique data set they have created, providing the community with invaluable resources to study and validate their findings. However, a lot of analyses were not robust enough to support the claims and conclusions in the paper. More clarification and cross-comparison with polished data are needed to further strengthen the study and claims.

      1) Stemness markers were used. The authors used COL17A1, TP63, ITGB1, and ITGA3 to represent stemness markers. However, these were not common classic stemness markers used in cSCC. What is the source claiming these genes were stemness markers in cSCC? TP63 is a master regulator and early driver event in SCC, while COL17A1, ITGB1, and ITGA3 are all ECM genes. The authors need to use commonly well-known stem cell markers in cSCC, e.g., LGR5, to mark stem-like cells.

      Thanks for raising this good point. We may not have provided a clear description of the markers COL17A1, TP63, ITGB1, and ITGA3 in the previous texts. We would like to clarify that these genes were used as the markers of epidermal stem cells in normal skin samples rather than tumor stem cells in cSCC. To avoid any possible misunderstanding, we revised the main text accordingly and added the references [4-11].

      2) Cell proportion analysis. The authors used the mean proportions to compare different clinical groups for subpopulations of keratinocytes, e.g., Figure 2B, and Figure 5B. This is not robust, as no statistics can be derived from this. For example, from Fig 2A, it is clearly shown there is a high level of heterogeneity of cellular compositions for normal samples. One cannot say which group is higher or lower simply based on mean not variance as well.

      We replotted the proportion analysis with statistics and presented the new graphs in Figure 2-figure supplement 1 for Figure 2B and Figure 5-figure supplement 1 for Figure 5B.

      3) Basal tumour cells in SCCIS and SCC. To make the findings valid, authors need to compare these cells/populations with the keratinocyte cell populations defined by Ji et al. Cell 2020. Do basal-SCCIS-tumour cells, also in SCC samples, resemble any of the populations defined in Ji et al.? Ji et al. also had 10 matched normal samples, thus the authors need to validate their findings of SCC vs normal analysis using the Ji et al. dataset.

      Thanks for this valuable suggestion. We compared the basal tumor cells in our study with the cell populations defined in the Ji et al. Cell 2020 data using SingleCellNet [1]. The results showed that both the basal-SCCIS-tumor cells of SCCIS and the basal tumor cells of cSCC in our study closely resemble the Tumor_KC_Basal subcluster defined in Ji et al.’s paper (Figure 4-figure supplement 4, C and D). Tumor_KC_Basal highly expressed CCL2, CXCL14, FTH1, and MT2A, which is consistent with our findings in basal tumor cells.

      4) Copy number analysis. Authors used inferCNV to perform copy number analysis using scRNA-seq data and identified CNVs in subpopulations of keratinocytes in SCCIS and SCC. To ensure these CNVs were not artefacts, were some of the CNVs identified by inferCNV well-known copy number changes previously reported in cSCC?

      In the poorly-differentiated cSCC sample, we detected significant gains on chromosomes 7 and 9 and a deletion on chromosome 10, all of which were reported in a previous study, indicating the reliability of the CNV analysis results (Figure 5-figure supplement 2) [12].

      5) Pseudotime analysis lines 308-313. Not sure the pseudotime analysis added much, as it is unclear whether two distinct subgroups were identified from this analysis. Suggest removing this to keep it neater.

      Thank you for this suggestion. We have deleted the result of pseudotime analysis.

      6) Selection of candidate genes for validation using IHC and cell line work. For example, lines 205-206, lines 352-356 and lines 437-441, authors selected several genes associated with AK and SCC to further validate using IHC and cell line knockdown work. What are the criteria for selecting those genes for validation? It is unclear to readers how these were selected. It reads like a fishing experiment, then followed by a knockdown. Clear rationale/criteria need to be elaborated.

      The first consideration in candidate gene selection was the fold change of expression. We have provided the statistical results of DEGs in Supplementary file 1b, 1h, 1j-1m. We then selected the most strongly changed genes and conducted an extensive literature search on them. We prioritized genes that, although not directly associated with cSCC development, have a close relationship with related pathways, as determined through functional enrichment analysis. These genes were taken forward for further verification experiments. We have added more details in the main text and Methods section.

      7) TME. Compared to keratinocytes populations, the investigation of TME cells was weak. (a) can authors produce UMAP files just for T cells, DC cells, and fibroblasts separately? Figure 7B is not easy to see those subclusters. (b) similar to what was done for keratinocytes, can authors find differentially expressed clusters and genes among the different clinical groups, associated with disease progression? (c) where are the myeloid cell populations, also B cells?

      Thank you for your suggestions. (a) We have added the UMAP files for T cells, DC cells and stromal cells separately in new Figure 7A. (b) We identified DEGs in TME cells among the different groups. Several key genes showed monotonically changing trends associated with disease progression. For example, with the increase of malignancy, FOS shows down-regulation while S100A8 and S100A9 monotonically increase in all three types of TME cells (Figure 7C). (c) We identified two types of myeloid cell populations, macrophages and monocyte-derived DCs (MoDC). We didn’t find other myeloid cells, such as neutrophils. For B cells, there were only 28 B cells in the poorly-differentiated cSCC sample, which didn’t meet the threshold for further cell-cell communication analysis.

      8) Heat shock protein genes line 327-329. HSP signature was well-known to be induced via tissue dissociation and library prep during the scRNA experiment. How could the authors be sure these were not artefacts induced by the experiment? If authors regress their gene expression against HSP gene signatures, would this cluster still be identified?

      Thank you for this valuable suggestion. It is important to note that the Basal-SCCIS-tumor cluster was identified through CNV analysis, rather than the HSP signature. To address this concern and further validate this result, the “AddModuleScore” function in the Seurat package was used to regress gene expression against HSP gene signatures for the retrieved basal cells. Our results showed that the Basal-SCCIS-tumor population can still be identified after regression, even more clearly (Author response image 1).

      Author response image 1.

      The identity of Basal-SCCIS-tumor cluster considering regression against HSP signatures.

      9) Cell-cell communication analysis. The authors claimed that cell-to-cell interaction was significantly enhanced in poorly-differentiated cSCC, and multiple interaction pathways were significantly active. How was this kind of analysis carried out? How did the authors define significance? What statistical method was used? These were all unclear. Furthermore, it is difficult to judge the robustness of the cell-cell communication analysis. Were these findings also supported by another method, such as celltalker, and cellphoneDB?

      To determine the significance of the increased overall cell-to-cell interaction strength between two groups, we utilized CellChat to obtain the communication strength in different samples. We combined the communication strength based on cell type pairs, where missing values were set to 0. We performed a paired Wilcoxon test to determine whether the enhancement of cell-to-cell interaction between samples was significant.

      For the comparison of outgoing or incoming interaction strength of the same cell types between two groups, we first extracted the communication strength of each signaling pathway contributing to outgoing or incoming strength, and then merged the strengths of signaling pathways among samples, where the strength of non-shared pathways (missing values) was set to 0. Subsequently, we performed a paired Wilcoxon test to define the significance.

      For multiple-group comparisons, the Kruskal-Wallis rank sum test was first performed. If the p-value was less than 0.1, the pairwise Wilcoxon test was used for subsequent pairwise comparisons. The comparison of individual signaling pathways between groups is similar to the above. We defined p-value < 0.1 as the significance threshold. We have added the significance test method in the figure legends for Figure 7 and Figure 8, as well as detailed statistical data in new Supplementary file 1q-1u.
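
The zero-filling step that precedes the paired test described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes communication strengths have already been exported from CellChat into plain dictionaries keyed by (sender, receiver) cell type pairs, and the helper name `align_with_zero_fill`, the cell type labels, and all numeric values are hypothetical.

```python
def align_with_zero_fill(strengths_a, strengths_b):
    """Pair two {key: strength} mappings over the union of their keys.

    Keys present in only one mapping get a strength of 0 in the other,
    so both returned lists have the same length and order.
    """
    keys = sorted(set(strengths_a) | set(strengths_b))
    a = [strengths_a.get(k, 0.0) for k in keys]
    b = [strengths_b.get(k, 0.0) for k in keys]
    return keys, a, b


# Hypothetical strengths keyed by (sender, receiver) cell type pair.
normal = {
    ("Fibroblast", "T cell"): 0.12,
    ("Macrophage", "T cell"): 0.30,
}
tumor = {
    ("Fibroblast", "T cell"): 0.25,
    ("Macrophage", "T cell"): 0.41,
    ("DC", "T cell"): 0.08,  # absent from the normal sample
}

keys, x, y = align_with_zero_fill(normal, tumor)
for key, xi, yi in zip(keys, x, y):
    print(key, xi, yi)
```

The two aligned lists could then be passed to any paired signed-rank implementation, for example `scipy.stats.wilcoxon(x, y)` if SciPy is available. The same alignment would apply when the keys are signaling pathways rather than cell type pairs.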

      As suggested, we also used the CellPhoneDB approach, based on the CellChatDB database, to verify our cell-cell communication results. Between 55% and 58% of the ligand-receptor interactions predicted by CellChat were also predicted by CellPhoneDB (Author response image 2). The enhancement of cell interaction through the MHC-II, Laminin and TNF signaling pathways in the poorly-differentiated cSCC sample compared to the normal sample was consistent in both CellChat and CellPhoneDB (Figure 8C and Figure 8-figure supplement 1B).

      Author response image 2.

      The overlap of the predicted ligand-receptor interactions between CellChat and CellPhoneDB.
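
An overlap percentage of the kind reported above can be computed as the share of one tool's predicted ligand-receptor pairs that the other tool also predicts. The sketch below assumes each tool's predictions are available as a set of (ligand, receptor) tuples; the function name `percent_also_predicted` and the placeholder pair names are hypothetical and do not correspond to real interactions or to the authors' data.

```python
def percent_also_predicted(reference_pairs, other_pairs):
    """Percentage of `reference_pairs` that also appear in `other_pairs`."""
    reference = set(reference_pairs)
    if not reference:
        return 0.0  # avoid division by zero when nothing was predicted
    shared = reference & set(other_pairs)
    return 100.0 * len(shared) / len(reference)


# Placeholder predictions; real input would be exported from each tool.
cellchat_pairs = {
    ("ligand_1", "receptor_1"),
    ("ligand_2", "receptor_2"),
    ("ligand_3", "receptor_3"),
    ("ligand_4", "receptor_4"),
}
cellphonedb_pairs = {
    ("ligand_1", "receptor_1"),
    ("ligand_3", "receptor_3"),
    ("ligand_5", "receptor_5"),
}

print(percent_also_predicted(cellchat_pairs, cellphonedb_pairs))
```

Note that the measure is asymmetric: swapping the two arguments reports the share of the second tool's predictions recovered by the first, which generally gives a different number.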

      10) Statistics and significance. In general, the detail of statistics and significance was lacking throughout the paper. Authors need to specify what statistical tests were used, and the p-values. It is difficult to judge the correctness of the test, and robustness without seeing the stats.

      We have included all statistics and significance values in the figure legend and supplemental tables, and described the statistical tests in the methods section. In this revision, we have added the necessary details of statistics and significance in the main text and figures.

      11) Overall, this manuscript needs a lot of re-writing. A lot of discussion was also included in the results, making it really difficult to read overall. The authors should simplify the results sections, remove the discussion bits, and further highlight and streamline with the key results of this paper.

      Thanks a lot for this advice. We have revised the paper thoroughly and removed discussion from the Results section to make the manuscript easier to read.

    1. Author Response

      Reviewer #3 (Public Review):

      This manuscript by Pendse et al aimed to identify the role of the complement component C1q in intestinal homeostasis, expecting to find a role in mucosal immunity. Instead, however, they discovered an unexpected role for C1qa in regulating gut motility. First, using RNA-Seq and qPCR of cell populations isolated either by mechanical separation or flow cytometry, the authors found that the genes encoding the subunits of C1q are expressed predominantly in a sub-epithelial population of cells in the gut that are Cd11b+MHCII+F4/80high, presumably macrophages. They support this conclusion by analyzing mice in which intestinal macrophages are depleted with anti-CSF1R antibody treatment and show substantial loss of C1qa, b and c transcripts. Then, they generate Lyz2Cre-C1qaflx/flx mice to genetically deplete C1qa in macrophages and assess the consequences on the fecal microbiome, transcript levels of cytokines, macromolecular permeability of the epithelial barrier, and immune cell populations, finding no major effects. Furthermore, provoking intestinal injury with chemical colitis or infection (Citrobacter) did not reveal macrophage C1qa-dependent changes in body weight or pathogen burden.

      Then, they analyzed C1q expression by IHC of cross-sections of small and large intestine and find that C1q immunoreactivity is detectable adjacent to, but not colocalizing with, TUBB3+ nerve fibers and CD169+ cells in the submucosa. Interestingly, they find little C1q immunoreactivity in the muscularis externa. Nevertheless, they perform RNA-sequencing of LMMP preparations (longitudinal muscle with adherent myenteric plexus) and find a number of changes in gene ontology pathways associated with neuronal function. Finally, they perform GI motility testing on the conditional knockout mice and find that they have accelerated GI transit times manifesting with subtle changes in small intestinal transit and more profound changes in measures of colonic motility.

      Overall, the manuscript is very well-written and the observation that macrophages are the major source of C1q in the intestine is well supported by the data, derived from multiple approaches. The observations on C1q localization in tissue and the strength of the conclusions that can be drawn from their conditional genetic model of C1qa depletion, however, would benefit from more rigorous validation.

      1) Interpretation of the majority of the findings in the paper rest on the specificity of the Lyz2 Cre for macrophages. While the specificity of this Cre to macrophages and some dendritic cells has been characterized in the literature in circulating immune cells, it is not clear if this has been characterized at the tissue level in the gut. Evidence demonstrating the selectivity of Cre activity in the gut would strengthen the conclusions that can be drawn.

      As indicated by the reviewer, Cre expression driven by the Lyz2 promoter is restricted to macrophages and some myeloid cells in the circulation (Clausen et al., 1999). To better understand intestinal Lyz2 expression at a cellular level, we analyzed Lyz2 transcripts from a published single cell RNAseq analysis of intestinal cells (Xu et al., 2019; see Figure below). These data show that intestinal Lyz2 is also predominantly expressed in gut macrophages with limited expression in dendritic cells and neutrophils.

      Figure. Lyz2 expression from single cell RNAseq analysis of mouse intestinal cells. Data are from Xu et al., Immunity 51, 696-708 (2019). Analysis was done through the Single Cell Portal, a repository of scRNAseq data at the Broad Institute.

      Additionally, our study shows that intestinal C1q expression is restricted to macrophages (CD11b+MHCII+F4/80hi) and is absent from other gut myeloid cell lineages (Figure 1E-H). This conclusion is supported by our finding that macrophage depletion via anti-CSF1R treatment also depletes most intestinal C1q (Figure 2A-C). Importantly, we found that the C1qaΔMφ mice retain C1q expression in the central nervous system (Figure 2 – figure supplement 1). Thus, the C1qaΔMφ mice allow us to assess the function of macrophage C1q in the gut and uncouple the functions of macrophage C1q from those of C1q in the central nervous system.

      2) Infectious and inflammatory colitis models were used to suggest that C1qa depletion in Lyz2+ lineage cells does not alter gut mucosal inflammation or immune response. However, the phenotyping of the mice in these models was somewhat cursory. For example, in DSS only body weight was shown without other typical and informative read-outs including colon length, histological changes, and disease activity scoring. Similarly, in Citrobacter only fecal cfu were measured. Especially if GI motility is accelerated in the KO mice, pathogen burden may not reflect efficiency of immune-mediated clearance alone.

      We have added additional results which support our conclusion that C1qaΔMφ mice do not show a heightened sensitivity to acute chemically induced colitis. In Figure 3 – figure supplement 1 we now show a histological analysis of the small intestines of DSS-treated C1qafl/fl and C1qaΔMφ mice. This analysis shows that C1qaΔMφ mice have similar histopathology, colon lengths, and histopathology scores following DSS treatment. Likewise, our revised manuscript includes histological images of the colons of Citrobacter rodentium-infected C1qafl/fl and C1qaΔMφ mice showing similar pathology (Figure 3 – figure supplement 2).

      3) The evidence for C1q expression being restricted to nerve-associated macrophages in the submucosal plexus was insufficient. Localization was shown at low magnification on merged single-planar images taken from cross-sections. The data shown in Figure 4C is not of sufficient resolution to support the claims made - C1q immunoreactivity, for example, is very difficult to even see. Furthermore, nerve fibers closely approximate virtually every type of macrophage in the gut, from those in the lamina propria to those in the muscularis… Finally, the resolution is too low to rule out C1q immunoreactivity in the muscularis externa.

      Similar points were raised by Reviewer 2. Our original manuscript claimed that C1q-expressing macrophages were mostly located near enteric neurons in the submucosal plexus but were largely absent from the myenteric plexus. However, as both Reviewers have pointed out, this conclusion was based solely on our immunofluorescence analysis of tissue cross-sections.

      To address this concern, we further characterized C1q+ macrophage localization by performing a flow cytometry analysis on macrophages isolated from the mucosa (encompassing both the lamina propria and submucosa) and the muscularis, finding similar levels of C1q expression in macrophages from both tissues (Figure 4 – figure supplement 1 in the revised manuscript). Although the mucosal macrophage fraction encompasses both lamina propria and submucosal macrophages, our immunofluorescence analysis (Figure 4B and C) suggests that the mucosal C1q-expressing macrophages are mostly from the submucosal plexus. This observation is consistent with the immunofluorescence studies of CD169+ macrophages shown in Asano et al., which suggest that most CD169+ macrophages are located in or near the submucosal region, with fewer near the villus tips (Fig. 1e, Nat. Commun. 6, 7802).

      Most importantly, our flow cytometry analysis indicates that the muscularis/myenteric plexus harbors C1q-expressing macrophages. To further characterize C1q expression in the muscularis, we performed RNAscope analysis by confocal microscopy of the myenteric plexus from mouse small intestine and colon (Figure 4D). The results show numerous C1q-expressing macrophages positioned close to myenteric plexus neurons, thus supporting the flow cytometry analysis. We note that although the majority of C1q immunofluorescence in our tissue cross-sections was observed in the submucosal plexus, we did observe some C1q expression in the muscularis by immunofluorescence (Figure 4B and C). We have rewritten the Results section to take these new findings into account.

      Is the 5 μm average on the proximity analysis any different for other macrophage populations to support the idea of a special relationship between C1q-expressing macrophages and neurons?

      We agree that the proximity analysis lacks context and have therefore removed it from the figure. The other data in the figure better support the idea that C1q+ macrophages are found predominantly in the submucosal and myenteric plexuses and that they are closely associated with neurons at these tissue sites.

      There are many vessels in the submucosa and many associated perivascular nerve fibers - could the proximity simply reflect that both cell types are near vessels containing C1q in circulation?

      Our revised manuscript includes RNAscope analysis showing C1q transcript expression by macrophages that are closely associated with enteric neurons (Figure 4D). These findings support the idea that the C1q close to enteric neurons is derived from macrophages rather than from the circulation.

      4) A major disconnect was between the observation that C1q expression is in the submucosa and the performance of RNA-seq studies on LMMP preparations. This makes it challenging to draw conclusions from the RNA-Seq data, and makes it particularly important to clarify the specificity of Lyz2-Cre activity.

      Our revised manuscript provides flow cytometry data (Figure 4 – figure supplement 1) and RNAscope analysis (Figure 4D) showing that C1q is expressed in macrophages localized to the myenteric plexus. This accords with the results of our RNAseq analysis, which indicates altered LMMP neuronal function in C1qa∆Mφ mice (Figure 6A and B). Since neurons in the myenteric plexus are known to govern gut motility, it also helps to explain our finding that gut motility is accelerated in C1qa∆Mφ mice.

      Finally, the pathways identified could reflect a loss of neurons or nerve fibers. No assessment of ENS health in terms of neuronal number or nerve fiber density is provided in either plexus.

      Reviewers 1 and 2 also raised this point. Our revised manuscript includes a comparison of the numbers of enteric neurons in C1qafl/fl and C1qaΔMφ mice. There were no marked differences in neuron numbers in C1qaΔMφ mice when compared to C1qafl/fl controls (Figure 5A and B). There were also similar numbers of inhibitory (nitrergic) and excitatory (cholinergic) neuronal subsets and a similar enteric glial network (Figure 5C-E). Thus, our data suggest that the altered gut motility in the C1qaΔMφ mice arises from altered neuronal function rather than from an overt loss of neurons or nerve fibers. This conclusion is further supported by increased neurogenic activity of peristalsis (Figure 6H and I), and the expression of the C1q receptor BAI1 on enteric neurons (Figure 6 – figure supplement 4).

      5) To my knowledge, there is limited evidence that the submucosal plexus has an effect on GI motility. A recent publication suggests that even when mice lack 90% of their submucosal neurons, they are well-appearing without overt deficits (PMID: 29666241). Submucosal neurons, however, are well known to be involved in the secretomotor reflex and fluid flux across the epithelium. Assessment of these ENS functions in the knockout mice would be important and valuable.

      Our revised manuscript provides new data showing C1q expression by muscularis macrophages in the myenteric plexus. We analyzed muscularis macrophages by flow cytometry and found that they express C1q (Figure 4 – figure supplement 1). These findings are further supported by RNAscope analysis of C1q expression in wholemounts of LMMP from small intestine and colon (Figure 4D and E). These results are thus consistent with the increased CMMC activity and accelerated gut motility in the C1qaΔMφ mice. As suggested by the reviewer, our finding of C1q+ macrophages in the submucosal plexus indicates that C1q may also have a role controlling the function of submucosal plexus neurons. We are further exploring this idea through extensive additional experimentation. Given the expanded scope of these studies, we are planning to include them in a follow-up manuscript.

      6) Immune function and GI motility can be highly sex-dependent - in all experiments mice of both sexes were reportedly used but it is not clear if sex effects were assessed.

      This is a great point, and as suggested by the reviewer we did indeed encounter differences between male and female mice in our preliminary assays of gut motility. We therefore conducted our quantitative comparisons of gut motility between C1qafl/fl and C1qaΔMφ mice in male mice and now clearly indicate this point in the Materials and Methods.

    1. Author Response

      Reviewer #1 (Public Review):

      The manuscript by Royall et al. builds on previous work in the mouse that indicates that neural progenitor cells (NPCs) undergo asymmetric inheritance of centrosomes and provides evidence that a similar process occurs in human NPCs, which was previously unknown.

      The authors use hESC-derived forebrain organoids and develop a novel recombination tag-induced genetic tool to birthdate and track the segregation of centrosomes in NPCs over multiple divisions. The thoughtful experiments yield data that are concise and well-controlled, and the data support the asymmetric segregation of centrosomes in NPCs. These data indicate that at least apical NPCs in humans undergo asymmetric centrosome inheritance. The authors attempt to disrupt the process and present some data that there may be differences in cell fate, but this conclusion would be better supported by a more thorough assessment of the fate of these different NPCs (e.g. NPCs versus new neurons), which would also support the conclusion that the younger centriole is inherited by new neurons.

      We thank the reviewer for their supportive comments (“…thoughtful experiments yield data that are concise and well-controlled…”).

      Reviewer #2 (Public Review):

      Royall et al. examine the asymmetric inheritance of centrosomes during human brain development. In agreement with previous studies in mice, their data suggest that the older centrosome is inherited by the self-renewing daughter cell, whereas the younger centrosome is inherited by the differentiating daughter cell. The key importance of this study is to show that this phenomenon takes place during human brain development, which the authors achieved by utilizing forebrain organoids as a model system and applying the recombination-induced tag exchange (RITE) technology to birthdate and track the centrosomes.

      Overall, the study is well executed and brings new insights of general interest for cell and developmental biology with particular relevance to developmental neurobiology. The Discussion is excellent, it brings this study into the context of previous work and proposes very appealing suggestions on the evolutionary relevance and underlying mechanisms of the asymmetric inheritance of centrosomes. The main weakness of the study is that it tackles asymmetric inheritance only using fixed organoid samples. Although the authors developed a reasonable mode to assign the clonal relationships in their images, this study would be much stronger if the authors could apply time-lapse microscopy to show the asymmetric inheritance of centrosomes.

      We thank the reviewer for their constructive and supportive comments (“…the study is well executed and brings new insights of general interest for cell and developmental biology with particular relevance to developmental neurobiology….”). We understand the request for clonal data or dynamic analyses in organoids (e.g., using time-lapse microscopy). We also agree that such data would certainly strengthen our findings. However, as outlined above (please refer to point #1 of the editorial summary), this is unfortunately not currently feasible. We have explicitly discussed this shortcoming in our revised manuscript and explained why future studies (with more advanced methodology) will be needed to perform these experiments.

      Reviewer #3 (Public Review):

      In this manuscript, the authors report that human cortical radial glia asymmetrically segregates newly produced or old centrosomes after mitosis, depending on the fate of the daughter cell, similar to what was previously demonstrated for mouse neocortical radial glia (Wang et al. 2009). To do this, the authors develop a novel centrosome labelling strategy in human ESCs that allows recombination-dependent switching of tagged fluorescent reporters from old to newly produced centrosome protein, centriolin. The authors then generate human cortical organoids from these hESCs to show that radial glia in the ventricular zone retains older centrosomes whereas differentiated cells, i.e. neurons, inherit the newly produced centrosome after mitosis. The authors then knock down a critical regulator of asymmetric centrosome inheritance called Ninein, which leads to a randomization of this process, similar to what was observed in mouse cortical radial glia.

      A major strength of the study is the combined use of the centrosome labelling strategy with human cortical organoids to address an important biological question in human tissue. This study is similarly presented as the one performed in mice (Wang et al. 2009) and the existence of the asymmetric inheritance mechanism of centrosomes in another species grants strength to the main claim proposed by the authors. It is a well-written, concise article, and the experiments are well-designed. The authors achieve the aims they set out in the beginning, and this is one of the perfect examples of the right use of human cortical organoids to study an important phenomenon. However, there are some key controls that would elevate the main conclusions considerably.

      We thank the reviewer for their overall support of our findings (“..authors achieve the aims they set out in the beginning, and this is one of the perfect examples of the right use of human cortical organoids to study an important phenomenon…”). We also understand the reviewer’s request for additional experiments/controls that “…would elevate the main conclusions considerably.”

      1) The lack of clonal resolution or timelapse imaging makes it hard to assess whether the inheritance of centrosomes occurs as the authors claim. The authors show that there is an increase in newly made non-ventricular centrosomes at a population level, but labelling clones and demonstrating that a new or old centrosome is inherited asymmetrically in a dividing radial glia would grant additional credence to the central conclusion of the paper. These experiments will put away any doubt about the existence of this mechanism in human radial glia, especially if it is demonstrated using timelapse imaging. Additionally, knowing the proportions of symmetric vs asymmetrically dividing cells generating old/new centrosomes will provide important insights pertinent to the conclusions of the paper. Alternatively, the authors could soften their conclusions, especially for Fig 2.

      We understand the reviewer’s request. As outlined above (please refer to point #1 of the editorial summary), we had previously tried to add data using single-cell timelapse imaging. However, due to the small size and therefore weakness of the fluorescent signal, we failed despite extensive efforts. Following the reviewer’s suggestion, we have now explicitly discussed this shortcoming and softened our conclusions.

      2) Some critical controls are missing. In Fig. 1B, there is a green dot that does not colocalize with Pericentrin. This is worrying and providing rigorous quantifications of the number of green and tdTom dots with Pericentrin would be very helpful to validate the labelling strategy. Quantifications would put these doubts to rest. Additionally, an example pericentrin staining with the GFP/TdTom signal in figure 4 would also give confidence to the reader. For figure 4, having a control for the retroviral infection is important. Although the authors show a convincing phenotype, the effect might be underestimated due to the incomplete infection of all the analyzed cells.

      We have included more rigorous quantifications in our revised manuscript.

      For Figure 1: There are indeed some green speckles that might be misinterpreted as a green centrosome. However, the speckles are usually smaller, and by applying a strict size requirement we exclude them. To check whether the classifier might interpret any speckles as centrosomes, we manually checked 60 green “dots” that were annotated as centrosomes. In these images, all green spots detected as centrosomes co-localized with Pericentrin signal (images shown in Author response image 1).

      For Figure 4: As we are comparing cells that were infected with a retrovirus expressing either scrambled or Ninein-targeting shRNA, we compare cells that experienced a similar treatment. In addition, only cells infected with the virus express Cre-ERT2, so only the centrosomes of targeted cells were analyzed. Accordingly, we only compare cells expressing scrambled or Ninein-targeting shRNA; surrounding “wt” cells are not considered.

      Author response image 1.

      Pictures used to test the classifier. Each of the green “dots” recognized by the classifier as a Centriolin-NeonGreen-containing centrosome (green) co-localized with Pericentrin signal (white).

      3) It would be helpful if the authors expand on the presence of old centrosomes in apical radial glia vs outer radial glia. Currently, in figure 3, the authors only focus on Sox2+ cells but this could be complemented with the inclusion of markers for outer radial glia and whether older centrosomes are also inherited by oRGCs. This would have important implications on whether symmetric/asymmetric division influences the segregation of new/old centrosomes.

      That is an interesting question, and we agree that additional analyses stratified by ventricular RGCs vs. oRGCs would be valuable. However, at the time points analysed there are only very few oRGCs present (if any) in human ESC-derived organoids (Qian et al., Cell, 2016). We have now added this point to our discussion as a direction for future experiments.

    1. Author Response

      Reviewer #1 (Public Review):

      The study by Akter et al demonstrates that astrocyte-derived L-lactate plays a key role in schema memory formation and promotes mitochondrial biogenesis in the Anterior Cingulate Cortex (ACC).

      The main tool used by the authors is the DREADD technology, which allows receptors to be pharmacologically activated in a cell-specific manner. In the study, the authors used the DREADD technique to activate, in appropriately transfected astrocytes, a subtype of muscarinic receptor that is not normally present in cells. Because this receptor is coupled to a Gi-mediated signal transduction pathway that inhibits cAMP formation, the authors could demonstrate cell-specific (astrocyte) decreases in cAMP levels that result in decreased L-lactate production by astrocytes.

      Behaviorally this pharmacological manipulation results in impairments of schema memory formation and retrieval in the ACC in flavor-place paired associate paradigms. Such impairments are prevented by co-administration of L-lactate.

      The authors also show that activation of Gi signaling, which results in decreased L-lactate release by astrocytes, impairs mitochondrial biogenesis in neurons in an L-lactate-reversible manner.

      By using MCT 2 inhibitors and an NMDAR antagonist the authors conclude that the molecular mechanisms underlying the observed effects are mediated by L-lactate entering neurons through MCT2 transporters and involve NMDAR.

      Overall, the article's conclusions are warranted by the experimental evidence, but some weak points could be addressed which would make the conclusions even stronger.

      The number of animals in some of the experiments is on the low side (4 to 6).

      In the revised manuscript, we have increased the animal numbers in two key experimental groups (hM4Di-CNO and Control groups) of behavioral experiments. Now the animal numbers in different groups are as follows:

      • 15 rats in hM4Di-CNO group

      o Further divided into two subgroups for probe tests (PT1-4) conducted during flavor-place paired associate training; 8 rats in the hM4Di-CNO (saline) and 7 rats in the hM4Di-CNO (CNO) subgroups receiving I.P. saline or I.P. CNO, respectively, before these PTs.

      • 8 rats in the Control group

      • 7 rats in the Rescue group (hM4Di-CNO+L-lactate)

      • 4 rats in the Control-CNO group. Animal number in this group was not increased as it was apparent from these 4 rats that CNO alone was not impairing the PA learning and memory retrieval in these rats (AAV8-GFAP-mCherry injected). Their result was very similar to the control group. Additionally, in a previous study (Liu et al., 2022), we showed that CNO administration in the rats injected with AAV8-GFAP-mCherry into the hippocampus does not show any impairments in schema.

      Also, in the newly added open field test experiments investigating locomotor activity, as suggested by Reviewer #2, 8 rats were used in each group.

      The use of CIN to inhibit MCT2 is not optimal. Authors may want to decrease MCT2 expression by using antisense oligonucleotides.

      In the revised manuscript, we have conducted the experiment using MCT2 antisense oligodeoxynucleotide (ODN) as suggested.

      To test whether the L-lactate-induced neuronal mitochondrial biogenesis is dependent on MCT2, we bilaterally injected MCT2 antisense oligodeoxynucleotide (MCT2-ODN, n=8 rats, 2 nmol in 1 μl PBS per ACC) or scrambled ODN (SC-ODN, n=8 rats, 2 nmol in 1 μl PBS per ACC) into the ACC. After 11 hours, bilateral infusion of L-lactate (10 nmol, 1 μl) or ACSF (1 μl) was given into the ACC and the rats were kept in the PA event arena. After 60 mins (12 hours from MCT2-ODN or SC-ODN administration), the rats were sacrificed. As shown in Author response image 1B, SC-ODN+L-lactate group showed significantly increased relative mtDNA copy number compared to the SC-ODN+ACSF group (p<0.001, ANOVA followed by Tukey's multiple comparisons test). However, this effect was completely abolished in MCT2-ODN+L-lactate group, suggesting that MCT2 is required for the L-lactate-induced mitochondrial biogenesis in the ACC.

      We have integrated this new data and results in the revised manuscript.

      Author response image 1.

      Mitochondrial biogenesis by L-lactate is dependent on MCT2 and NMDAR. A. Experimental design to investigate whether MCT2 and NMDAR activity are required for L-lactate-induced mitochondrial biogenesis. B and C. mtDNA copy number abundance in the ACC of different rat groups relative to nDNA. Data shown as mean ± SD (n=4 rats in each group). ***p<0.001, ANOVA followed by Tukey's multiple comparisons test.

      The experiment using APV to block NMDAR only partially supports the conclusions. Indeed, blocking NMDAR will knock down any response that involves these receptors, whether L-lactate is necessary or not.

      In the current study, we found that astrocytic Gi activation in the ACC reduced the L-lactate level in the ECF of the ACC, which was also associated with decreased PGC-1α/SIRT3/ATPB/mtDNA abundance, suggesting downregulation of the mitochondrial biogenesis pathway. We also found that exogenous administration of L-lactate into the ACC of astrocytic Gi-activated rats rescued this downregulation. In line with this, in a recently published study (Akter et al., 2023), we found upregulation of the mitochondrial biogenesis pathway in the hippocampal neurons of exogenous L-lactate-treated anesthetized rats. Another recent study has demonstrated that exercise-induced L-lactate release from skeletal muscle or I.P. injection of L-lactate can induce hippocampal PGC-1α expression (PGC-1α is a master regulator of mitochondrial biogenesis) and mitochondrial biogenesis in mice (Park et al., 2021). Together, these results provide compelling evidence that L-lactate promotes mitochondrial biogenesis.

      L-lactate is known to promote expression of synaptic plasticity genes like Arc, c-Fos, and Zif268 in neurons (Yang et al., 2014). After entry into the neuronal cytoplasm, mainly through MCT2, it is converted into pyruvate by lactate dehydrogenase 1 (LDH1). This conversion also produces NADH, affecting the redox state of the neuron. NADH positively modulates the activity of NMDAR resulting in enhanced Ca2+ currents, the activation of intracellular signaling cascades, and the induction of the expression of plasticity-associated genes (Yang et al., 2014; Magistretti & Allaman, 2018). The study demonstrated that L-lactate–induced plasticity gene expression was abolished in the presence of NMDAR antagonists including D-APV (Yang et al., 2014). These results suggested that the MCT2 and NMDAR are key players in the regulation of L-lactate induced plasticity gene expression.

      In the current study, we investigated whether similar mechanisms might be involved in L-lactate-induced neuronal mitochondrial biogenesis. We now used MCT2 antisense oligodeoxynucleotide to decrease the expression of MCT2 (as mentioned in the previous response and Author response image 1B) and showed that MCT2 is necessary for L-lactate-induced mitochondrial biogenesis to manifest, indicating that L-lactate’s entry into the neuron is required. As mentioned before, after entry into the neuron, L-lactate is converted into pyruvate by LDH, which also produces NADH, which in turn potentiates NMDAR activity. Therefore, we investigated whether NMDAR activity is required for L-lactate-induced mitochondrial biogenesis. We used D-APV to inhibit NMDAR (Author response image 1C) and found that L-lactate does not increase mtDNA copy number abundance if D-APV is given, suggesting that NMDAR activity is required for L-lactate to promote mitochondrial biogenesis.

      NMDAR serves diverse functions. Therefore, as mentioned by the reviewer, blocking NMDAR may knock down many such functions. While our current data only suggests the involvement of MCT2 and NMDAR in the upregulation of mitochondrial biogenesis by L-lactate, we have not investigated other mechanisms and pathways modulating mitochondrial biogenesis that are either dependent or independent of MCT2 and NMDAR activity. Further studies are needed in future to dissect and better understand this interesting observation. We have now clarified this in the discussion section of the manuscript.

      Is inhibition of glycogenolysis involved in the observed effects mediated by Gi signaling? Indeed, L-lactate is formed both by glycolysis and glycogenolysis. The authors could test whether the glycogen metabolism-inhibiting drug DAB would mimic the effects of Gi activation.

      In this study we have shown that astrocytic Gi activation in the ACC leads to a decrease in the cAMP and L-lactate. L-lactate is produced by glycogenolysis and glycolysis. cAMP in astrocytes acts as a trigger for L-lactate production (Choi et al., 2012; Horvat, Muhič, et al., 2021; Horvat, Zorec, et al., 2021; Zhou et al., 2021) by promoting glycogenolysis and glycolysis (Vardjan et al., 2018; Horvat, Muhič, et al., 2021; Horvat, Zorec, et al., 2021). Therefore, one promising explanation of reduced L-lactate level observed in our study is the reduction of L-lactate production in the astrocyte due to decreased glycogen metabolism as a result of decreased cAMP. We have now mentioned this in the discussion.

      DAB is an inhibitor of glycogen phosphorylase that suppresses L-lactate production. It was shown to impair memory by decreasing L-lactate (Newman et al., 2011; Suzuki et al., 2011; Iqbal et al., 2023). As we found that the impairment in the schema memory and mitochondrial biogenesis was associated with decreased L-lactate level in the ACC and that the exogenous L-lactate administration can rescue the impairments, it is likely that DAB will mimic the effect of Gi activation in terms of schema memory and mitochondrial biogenesis. However, further study is needed to confirm this.  

      Reviewer #2 (Public Review):

      The manuscript of Akter et al is an important study that investigates the role of astrocytic Gi signaling in the anterior cingulate cortex in the modulation of extracellular L-lactate level and consequently impairment in flavor-place associates (PA) learning. However, whereas some of the behavioral observations and signaling mechanism data are compelling, the conclusions about the effect on memory are inadequate as they rely on an experimental design that does not allow to differentiate acute or learning effect from the effect outlasting pharmacological treatments, i.e. effect on memory retention. With the addition of a few experiments, this paper would be of interest to the larger group of researchers interested in neuron-glia interactions during complex behavior.

      • Largely, I agree with the authors' conclusion that activating Gi signaling in astrocytes impairs PA learning, however, the effect on memory retrieval is not that obvious. All behavioral and molecular signaling effects described in this study are obtained with the continuous presence of CNO, therefore it is not possible to exclude the acute effect of Gi pathway activation in astrocytes. What will happen with memory on retrieval test when CNO is omitted selectively during early, middle, or late session blocks of PA learning?

      We have now added 8 more rats to the hM4Di-CNO group (i.e., the group with astrocytic Gi activation) to clarify the effect on memory retrieval. These rats underwent flavor-place paired associate (PA) training similar to the previously described rats (n=7) of this group; that is, they received CNO 30 minutes before and 30 minutes after the PA training sessions (S1-2, S4-8, S10-17). However, in contrast to the previous rats of this group, which received CNO before the PTs (PT1, PT2, PT3), we omitted CNO (administering I.P. saline instead) selectively on these PTs conducted at the early, middle, and late stages of PA training, as suggested by the reviewer. These newly added rats did not show memory retrieval in these PTs, suggesting that the rats were not learning the PAs from the PA training sessions. See Author response image 2C-E, where this subgroup is denoted as hM4Di-CNO (Saline).

      We then continued more PA training sessions (S21 onwards, Author response image 2B) for these rats without CNO. They gradually learned the PAs. PTs (PT5, PT6, PT7; Author response image 2G-I) were done during this continuation phase of PA training; once without CNO (i.e., with I.P. saline instead), and another one with CNO. As seen in the Author response image 2H and 2I, they retrieved the memory when PT6 and PT7 were done without CNO. However, if these PTs were done with CNO, they could not retrieve the memory. Together these results suggest that ACC astrocytic Gi activation by CNO during PT can impair memory retrieval in rats which have already learned the PAs.

      As shown in the Author response image 2B, we replaced two original PAs with two new PAs (NPA 9 and 10) at S34. This was followed by PT8 (S35). As seen in Author response image 2J, these rats retrieved the NPA memory if the PT is done without CNO. However, they could not retrieve the NPA memory if the PT was done with CNO. This result suggests that ACC astrocytic Gi activation by CNO during PT can impair NPA memory retrieval.

      In summary, these data show that astrocytic Gi activation in the ACC can impair PA memory retrieval. We have integrated this new data and results in the revised manuscript.

      Author response image 2.

      A. PI (mean ± SD) during the acquisition of the six original PAs (OPAs) (S1-2, 4-8, 10-17) and new PAs (NPAs) (S19) of the control (n=8), hM4Di-CNO (n=15), and rescue (hM4Di-CNO+L-lactate) (n=7) groups. From S6 onwards, the hM4Di-CNO group consistently showed a lower PI than the control group. However, concurrent L-lactate administration into the ACC (rescue group) rescued this impairment. B. PI (mean ± SD) of the hM4Di-CNO group (n=8) from S21 onwards, showing a gradual increase in PI when CNO was withdrawn. C, D, and E. Non-rewarded PTs (PT1, PT2, and PT3 conducted on S3, S9, and S18, respectively) to test memory retrieval of OPAs for the control, hM4Di-CNO, and rescue groups. The percentage of digging time at the cued location relative to that at the non-cued locations is shown (mean ± SD). In both PT2 and PT3, the control group spent significantly more time digging the cued sand well above the chance level, indicating that the rats learned the OPAs and could retrieve them. In contrast, the hM4Di-CNO group did not spend more time digging the cued sand well above the chance level, irrespective of CNO administration before the PTs. The rescue group showed results similar to the hM4Di-CNO group if CNO was given without L-lactate. On the other hand, they showed results similar to the control group if L-lactate was given concurrently with CNO, indicating that this group learned the OPAs and could retrieve them. p < 0.05, p < 0.01, p < 0.001, one-sample t-test comparing the proportion of digging time at the cued sand well with the chance level of 16.67%. F. Non-rewarded PT4 (S20), which was conducted after replacing two OPAs with two NPAs (NPA 7 & 8) in S19 for the control, hM4Di-CNO, and rescue groups. Results show that the control group spent significantly more time digging the new cued sand well above the chance level, indicating that the rats learned the NPAs from S19 and could retrieve them in this PT. 
In contrast, the hM4Di-CNO group did not spend more time digging the new cued sand well above the chance level, irrespective of CNO administration before the PT. The rescue group showed results similar to the hM4Di-CNO group if CNO was given without L-lactate. On the other hand, they showed results similar to the control group if L-lactate was given concurrently with CNO, indicating that this group learned the NPAs from S19 and could retrieve them. p < 0.001, one-sample t-test comparing the proportion of digging time at the new cued sand well with the chance level of 16.67%. G, H, and I. Non-rewarded PTs (PT5, PT6, and PT7 conducted on S23, S27, and S33, respectively) to test memory retrieval of OPAs for the hM4Di-CNO group. In both PT6 and PT7, the rats spent significantly more time digging the cued sand well above the chance level if the tests were done without CNO, indicating that the rats learned the OPAs and could retrieve them. However, CNO prevented memory retrieval during these PTs. p < 0.001, one-sample t-test comparing the proportion of digging time at the cued sand well with the chance level of 16.67%. J. Non-rewarded PT4 (S35), which was conducted after replacing two OPAs with two NPAs (NPA 9 & 10) in S34 for the hM4Di-CNO group. Results show that the rats spent significantly more time digging the new cued sand well above the chance level if CNO was not given before the PT, indicating that the rats learned the NPAs from S34 and could retrieve them in this PT. However, if CNO was given before the PT, retrieval was impaired. *p < 0.001, one-sample t-test comparing the proportion of digging time at the new cued sand well with the chance level of 16.67%.
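
      The chance-level comparison named in the legend above can be illustrated with a minimal sketch. It assumes that the 16.67% chance level corresponds to six equally likely sand wells (100/6), and it uses made-up digging-time percentages rather than any values from the study. The function name and the example numbers are hypothetical. Only the t statistic and degrees of freedom are computed here; a p-value would additionally require the t distribution, for example via scipy.stats.ttest_1samp.

```python
import math
import statistics

# Assumption: chance level is 100/6 percent, i.e. six equally likely sand wells.
CHANCE_LEVEL = 100.0 / 6  # approximately 16.67 %


def one_sample_t(values, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean(values) == mu0."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("sample has zero variance; t is undefined")
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical percentages of digging time at the cued well (NOT study data).
example_group = [30.0, 42.0, 25.0, 38.0, 35.0, 28.0, 40.0, 33.0]
t_stat, df = one_sample_t(example_group, CHANCE_LEVEL)
print(f"t({df}) = {t_stat:.2f}")
```

      A positive t statistic indicates a group mean above chance; whether it is significant depends on the degrees of freedom and the chosen threshold.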

      • I found it truly exciting that the administration of exogenous L-lactate is capable of rescuing CNO-induced PA learning impairment when co-applied. Would it be possible that this treatment is sensitive to a particular stage of learning (acquisition, consolidation, or memory retrieval), at which L-lactate administration would be the most efficacious?

      The hM4Di-CNO group, when PA training was continued without CNO (S21-S32) (Author response image 2B), was able to learn the six original PAs (OPAs). In PT7, conducted at S33 (Author response image 2I), this group of rats was able to retrieve the memory if the test was done without CNO but not if CNO was given. Similarly, the Rescue group (hM4Di-CNO+L-lactate) (Author response image 2A), which received both CNO and L-lactate during PA training sessions (S1-S17), was able to learn the OPAs. At PT3, conducted at S18 (Author response image 2E), these rats were able to retrieve the memory when the test was done with CNO+L-lactate but not when it was done with CNO alone. Together, these results show that ACC astrocytic Gi activation with CNO impairs memory retrieval and that exogenous L-lactate can rescue the impairment. We therefore conclude that memory retrieval is sensitive to L-lactate.

      PA learning is hippocampus-dependent. Over the course of repeated PA training, systems consolidation occurs in the ACC, after which the already learned PA memory (schema) becomes hippocampus-independent (Tse et al., 2007; Tse et al., 2011). In our previous study, we observed higher activation (indicated by expression of c-Fos) in the hippocampus relative to the ACC during the early period of schema development, and the reverse at the late stage (Liu et al., 2022). However, rapid assimilation of new PAs into the ACC requires simultaneous activation/retrieval of the previous schema from the ACC and hippocampus-dependent new PA learning (Tse et al., 2007; Tse et al., 2011). During new PA learning, an increase in c-Fos-positive neurons in both CA1 and ACC was detected (Liu et al., 2022).

      Our hM4Di-CNO group received CNO 30 mins before and after each PA training session in S1-S17 (Author response image 2A). Also, the Rescue group similarly received CNO+L-lactate before and after each PA training session in S1-S17. Therefore, while this study design allowed us to conclude that ACC astrocytic Gi activation impairs PA learning and that exogenous L-lactate can rescue the impairment, it does not allow clear differentiation of the effects of these treatments on memory acquisition and consolidation. Further studies are needed to investigate this.

      • The hypothesis that the observed learning impairments could be associated with diminished mitochondrial biogenesis caused by decreased L-lactate as a result of astrocytic Gi-DREADD stimulation is very appealing, but a few key pieces of evidence are missing. So far, the hypothesis is supported by experiments demonstrating reduced expression of several components of mitochondrial membrane ATP synthase and a decrease in relative mtDNA copy numbers in the ACC of rats injected with Gi-DREADDs. L-lactate injections into the ACC restored and even further increased the expression of the above-mentioned markers. Co-administration of the NMDAR antagonist D-APV or the MCT-2 (mostly neuronal) blocker 4-CIN with L-lactate prevented the L-lactate-induced increase in relative mtDNA copy number. I am wondering how the interference with mitochondrial biogenesis affects neuronal physiology and whether it would result in impaired PA learning or schema memory.

      The observation of diminished mitochondrial biogenesis in the astrocytic Gi-activated rats that showed impaired PA learning is exciting. However, our study does not provide experimental data on how mitochondrial biogenesis could be associated with impaired PA learning and schema memory. Results from several previous studies linked mitochondrial biogenesis and its regulators, such as PGC-1α and SIRT3, to diverse neuronal and cognitive functions, as described in the discussion section of the manuscript. In the revised manuscript, we have added the following discussion of potential mechanisms:

      “In this study, we have demonstrated that ACC astrocytic Gi activation impairs PA learning and schema formation, PA memory retrieval, and NPA learning and retrieval by decreasing L-lactate level in the ACC. Although we have shown that these impairments are associated with diminished expression of proteins of mitochondrial biogenesis, the precise mechanisms of how astrocytic Gi activation affects neuronal functions and schema memory remain to be elucidated. We previously demonstrated that neuronal inhibition in either the hippocampus or the ACC impairs PA learning and schema formation (Hasan et al., 2019). In another recent study (Liu et al., 2022), we showed that astrocytic Gi activation in the CA1 impaired PA training-associated CA1-ACC projecting neuronal activation. Yao et al. recently showed that reduction of astrocytic lactate dehydrogenase A (an enzyme that reversibly catalyzes L-lactate production from pyruvate) in the dorsomedial prefrontal cortex reduces L-lactate levels and neuronal firing frequencies, promoting depressive-like behaviors in mice (Yao et al., 2023). These impairments could be rescued by L-lactate infusion. It is possible that the impairment in PA learning and schema observed in our study might have involved a similar functional consequence of reduced neuronal activity in the ACC neurons upon astrocytic Gi activation.

      Schema consolidation is associated with synaptic plasticity-related gene expression (such as Zif268, Arc) in the ACC (Tse et al., 2011). L-lactate, after entry into neurons, can be converted to pyruvate during which NADH is also produced, promoting synaptic plasticity-related gene expression by potentiating NMDA signaling in neurons (Yang et al., 2014; Margineanu et al., 2018). Furthermore, L-lactate acts as an energy substrate to fuel learning-induced de novo neuronal translation critical for long-term memory (Descalzi et al., 2019). On the other hand, mitochondria play a crucial role in fueling local translation during synaptic plasticity (Rangaraju et al., 2019). Therefore, it could be hypothesized that the rescue of astrocytic Gi activation-mediated impairment of schema by exogenous L-lactate could have been mediated by facilitating synaptic plasticity-related gene expression by directly fueling the protein translation, potentiating NMDA signaling, as well as increasing mitochondrial capacity for ATP production by promoting mitochondrial biogenesis. Furthermore, the potential involvement of HCAR1, a receptor for L-lactate that may regulate neuronal activity (Bozzo et al., 2013; Tang et al., 2014; Herrera-López & Galván, 2018; Abrantes et al., 2019), cannot be excluded. Future research could explore these potential mechanisms, examining the interactions among them, and determining their relative contributions to schema. Our previous study also showed that ACC myelination is necessary for PA learning and schema formation, and that repeated PA training is associated with oligodendrogenesis in the ACC (Hasan et al., 2019). Oligodendrocytes facilitate fast, synchronized, and energy-efficient transfer of information by wrapping axons in a myelin sheath. Furthermore, they supply axons with glycolysis products, such as L-lactate, to offer metabolic support (Fünfschilling et al., 2012; Lee et al., 2012). 
The association of oligodendrogenesis and myelination with schema memory may suggest an adaptive response of oligodendrocytes to enhance metabolic support and neuronal energy efficiency during PA learning. Given the impairments in PA learning observed in the ACC astrocytic Gi-activated rats in the current study, it is reasonable to conclude that the direct metabolic support to axons provided by oligodendrocytes is not sufficient to rescue the schema impairments caused by decreased L-lactate levels upon astrocytic Gi activation. On the other hand, L-lactate was shown to be important for oligodendrogenesis and myelination (Sánchez-Abarca et al., 2001; Rinholm et al., 2011; Ichihara et al., 2017). Therefore, it is tempting to speculate that a decrease in L-lactate level may also impede oligodendrogenesis and myelination, consequently preventing the enhanced axonal support provided by oligodendrocytes and myelin during schema learning. Recently, a study has demonstrated that upon demyelination, mitochondria move from the neuronal cell body to the demyelinated axon (Licht-Mayer et al., 2020). Enhancement of this axonal response of mitochondria to demyelination, by targeting mitochondrial biogenesis and mitochondrial transport from the cell body to axon, protects acutely demyelinated axons from degeneration. Given the connection between schema and increased myelination, it remains an open question whether L-lactate-induced mitochondrial biogenesis plays a beneficial role in schema through a similar mechanism. Nevertheless, our results contribute to the mounting evidence of the glial role in cognitive functions and underscore the new paradigm in which glial cells are considered as integral players in cognitive functions alongside neurons. Disruption of neurons, myelin, or astrocytes in the ACC can disrupt PA learning and schema memory.”

      Reviewer #3 (Public Review):

      Akter et al. investigated how the astroglial Gi signaling pathway in the rat anterior cingulate cortex (ACC) affects cognitive functions, in particular schema memory formation. Using a stereotactic approach they intracranially introduced AAV8 vectors carrying mCherry-tagged hM4Di DREADD (Designer Receptor Exclusively Activated by Designer Drugs) under the astrocyte-selective GFAP promoter (AAV8-GFAP-hM4Di-mCherry) into the ACC region of the rat brain. hM4Di DREADD is a genetically modified form of the human M4 muscarinic (hM4) receptor that is insensitive to endogenous acetylcholine but is activated by the inert clozapine metabolite clozapine-N-oxide (CNO), triggering the Gi signaling pathway. The authors confirmed that hM4Di DREADD is selectively expressed in astrocytes after the application of the AAV8 vector by analysing the mCherry signals and immunolabeling of astrocytes and neurons in the ACC region of the rat brain. They activated hM4Di DREADD (Gi signalling) in astrocytes by intraperitoneal administration of CNO and measured cognitive functions in animals after CNO administration. Activation of Gi signaling in astrocytes by CNO application decreased paired-associate (PA) learning, schema formation, and memory retrieval in tested animals. This was associated with a decrease in cAMP in astrocytes and L-lactate in extracellular fluid as measured by immunohistochemistry in situ and in awake rats by microdialysis, respectively. Administration of exogenous L-lactate rescued the astroglial Gi-mediated deficits in PA learning, memory retrieval, and schema formation, suggesting that activation of astroglial Gi signalling downregulates L-lactate production in astrocytes and its transport to neurons, affecting memory formation. 
The authors also show that the expression level of proteins involved in mitochondrial biogenesis, which is associated with cognitive functions, is decreased in neurons when Gi signalling is activated in astrocytes and is rescued when exogenous L-lactate is applied, suggesting the involvement of astrocyte-derived L-lactate in the maintenance of mitochondrial biogenesis in neurons. The latter depended on lactate MCT2 transporter activity and glutamate NMDA receptor activity.

      The paper is very well written and discussed. The conclusions of this paper are well supported by the data. Although this is a study that uses established and previously published methodologies, it provides new insights into L-lactate signalling in the brain, particularly in the ACC, and further confirms the role of astroglial L-lactate in learning and memory formation. It also raises new questions about the molecular mechanisms underlying astrocyte-derived L-lactate-mediated mitochondrial biogenesis in neurons and its contribution to schema memory formation.

      • The authors discuss astrocytic L-lactate signalling without considering the recently discovered L-lactate-sensitive Gs and Gi protein-coupled receptors in the brain, which are present in both astrocytes and neurons. The use of nonendogenous L-lactate receptor agonists (Compound 2, 3-chloro-5-hydroxybenzoic acid) would clarify the implication of L-lactate receptor signalling in schema memory formation.

      In the revised manuscript, we have included this point in the discussion section to mention the potential role of HCAR1 in schema memory as follows:

      “Schema consolidation is associated with synaptic plasticity-related gene expression (such as Zif268, Arc) in the ACC (Tse et al., 2011). L-lactate, after entry into neurons, can be converted to pyruvate during which NADH is also produced, promoting synaptic plasticity-related gene expression by potentiating NMDA signaling in neurons (Yang et al., 2014; Margineanu et al., 2018). Furthermore, L-lactate acts as an energy substrate to fuel learning-induced de novo neuronal translation critical for long-term memory (Descalzi et al., 2019). On the other hand, mitochondria play a crucial role in fueling local translation during synaptic plasticity (Rangaraju et al., 2019). Therefore, it could be hypothesized that the rescue of astrocytic Gi activation-mediated impairment of schema by exogenous L-lactate could have been mediated by facilitating synaptic plasticity-related gene expression by directly fueling the protein translation, potentiating NMDA signaling, as well as increasing mitochondrial capacity for ATP production by promoting mitochondrial biogenesis. Furthermore, the potential involvement of HCAR1, a receptor for L-lactate that may regulate neuronal activity (Bozzo et al., 2013; Tang et al., 2014; Herrera-López & Galván, 2018; Abrantes et al., 2019), cannot be excluded. Future research could explore these potential mechanisms, examining the interactions among them, and determining their relative contributions to schema.”

      • The use of control animals transduced with an "empty" AAV8 vector (AAV8-GFAP-mCherry) compared with animals transduced with AAV8-GFAP-hM4Di-mCherry throughout the study would strengthen the results of this study, since transfection itself, as well as overexpression of the mCherry protein, may affect cell function.

      We thank the reviewer for pointing this out. The schema experiment includes a control group (Control-CNO group) of rats injected with AAV8-GFAP-mCherry bilaterally into the ACC. As shown in Author response image 3, after habituation and pretraining, these rats were trained for PA learning in the same way as the other groups. They received I.P. CNO 30 mins before and 30 mins after each PA training session. Their PA learning, schema formation, memory retrieval, NPA learning and retrieval, and latency (time needed to commence digging at the correct well) were similar to those of the control group of rats. This result is consistent with our previous study, in which rats bilaterally injected with AAV8-GFAP-mCherry into CA1 of the hippocampus did not show impairments in PA learning and schema formation upon CNO treatment (Liu et al., 2022).

      Author response image 3.

      A. PI (mean ± SD) during the acquisition of the original six PAs (OPAs) (S1-2, 4-8, 10-17) and new PAs (NPAs) (S19) of the control (n=6) and control-CNO (n=4) groups. B. Non-rewarded PTs (PT1, PT2, and PT3 done on S3, S9, and S18, respectively) to test memory retrieval of OPAs for the control-CNO group. C. Non-rewarded PT4 (S20) which was done after replacing two OPAs with two NPAs (NPA 7 & 8) in S19 for the control-CNO group. D. Latency (in seconds) before commencing digging at the correct well for control and control-CNO groups. Data shown as mean ± SD.

      References

      Abrantes, H. d. C., Briquet, M., Schmuziger, C., Restivo, L., Puyal, J., Rosenberg, N., Rocher, A.-B., Offermanns, S., & Chatton, J.-Y. (2019). The Lactate Receptor HCAR1 Modulates Neuronal Network Activity through the Activation of Gα and Gβγ Subunits. The Journal of Neuroscience, 39(23), 4422-4433. https://doi.org/10.1523/jneurosci.2092-18.2019

      Akter, M., Ma, H., Hasan, M., Karim, A., Zhu, X., Zhang, L., & Li, Y. (2023). Exogenous L-lactate administration in rat hippocampus increases expression of key regulators of mitochondrial biogenesis and antioxidant defense [Original Research]. Frontiers in Molecular Neuroscience, 16. https://doi.org/10.3389/fnmol.2023.1117146

      Bozzo, L., Puyal, J., & Chatton, J.-Y. (2013). Lactate Modulates the Activity of Primary Cortical Neurons through a Receptor-Mediated Pathway. PLoS One, 8(8), e71721. https://doi.org/10.1371/journal.pone.0071721

      Choi, H. B., Gordon, G. R., Zhou, N., Tai, C., Rungta, R. L., Martinez, J., Milner, T. A., Ryu, J. K., McLarnon, J. G., Tresguerres, M., Levin, L. R., Buck, J., & MacVicar, B. A. (2012). Metabolic communication between astrocytes and neurons via bicarbonate-responsive soluble adenylyl cyclase. Neuron, 75(6), 1094-1104. https://doi.org/10.1016/j.neuron.2012.08.032

      Covelo, A., Eraso-Pichot, A., Fernández-Moncada, I., Serrat, R., & Marsicano, G. (2021). CB1R-dependent regulation of astrocyte physiology and astrocyte-neuron interactions. Neuropharmacology, 195, 108678. https://doi.org/10.1016/j.neuropharm.2021.108678

      Descalzi, G., Gao, V., Steinman, M. Q., Suzuki, A., & Alberini, C. M. (2019). Lactate from astrocytes fuels learning-induced mRNA translation in excitatory and inhibitory neurons. Communications Biology, 2(1), 247. https://doi.org/10.1038/s42003-019-0495-2

      Endo, F., Kasai, A., Soto, J. S., Yu, X., Qu, Z., Hashimoto, H., Gradinaru, V., Kawaguchi, R., & Khakh, B. S. (2022). Molecular basis of astrocyte diversity and morphology across the CNS in health and disease. Science, 378(6619), eadc9020. https://doi.org/10.1126/science.adc9020

      Fünfschilling, U., Supplie, L. M., Mahad, D., Boretius, S., Saab, A. S., Edgar, J., Brinkmann, B. G., Kassmann, C. M., Tzvetanova, I. D., Möbius, W., Diaz, F., Meijer, D., Suter, U., Hamprecht, B., Sereda, M. W., Moraes, C. T., Frahm, J., Goebbels, S., & Nave, K.-A. (2012). Glycolytic oligodendrocytes maintain myelin and long-term axonal integrity. Nature, 485(7399), 517-521. https://doi.org/10.1038/nature11007

      Harris, R. A., Lone, A., Lim, H., Martinez, F., Frame, A. K., Scholl, T. J., & Cumming, R. C. (2019). Aerobic Glycolysis Is Required for Spatial Memory Acquisition But Not Memory Retrieval in Mice. eNeuro, 6(1). https://doi.org/10.1523/ENEURO.0389-18.2019

      Hasan, M., Kanna, M. S., Jun, W., Ramkrishnan, A. S., Iqbal, Z., Lee, Y., & Li, Y. (2019). Schema-like learning and memory consolidation acting through myelination. FASEB J, 33(11), 11758-11775. https://doi.org/10.1096/fj.201900910R

      Herrera-López, G., & Galván, E. J. (2018). Modulation of hippocampal excitability via the hydroxycarboxylic acid receptor 1. Hippocampus, 28(8), 557-567. https://doi.org/10.1002/hipo.22958

      Horvat, A., Muhič, M., Smolič, T., Begić, E., Zorec, R., Kreft, M., & Vardjan, N. (2021). Ca2+ as the prime trigger of aerobic glycolysis in astrocytes. Cell Calcium, 95, 102368. https://doi.org/10.1016/j.ceca.2021.102368

      Horvat, A., Zorec, R., & Vardjan, N. (2021). Lactate as an Astroglial Signal Augmenting Aerobic Glycolysis and Lipid Metabolism [Review]. Frontiers in Physiology, 12. https://doi.org/10.3389/fphys.2021.735532

      Ichihara, Y., Doi, T., Ryu, Y., Nagao, M., Sawada, Y., & Ogata, T. (2017). Oligodendrocyte Progenitor Cells Directly Utilize Lactate for Promoting Cell Cycling and Differentiation. J Cell Physiol, 232(5), 986-995. https://doi.org/10.1002/jcp.25690

      Iqbal, Z., Liu, S., Lei, Z., Ramkrishnan, A. S., Akter, M., & Li, Y. (2023). Astrocyte L-Lactate Signaling in the ACC Regulates Visceral Pain Aversive Memory in Rats. Cells, 12(1), 26. https://www.mdpi.com/2073-4409/12/1/26

      Jourdain, P., Rothenfusser, K., Ben-Adiba, C., Allaman, I., Marquet, P., & Magistretti, P. J. (2018). Dual action of L-Lactate on the activity of NR2B-containing NMDA receptors: from potentiation to neuroprotection. Sci Rep, 8(1), 13472. https://doi.org/10.1038/s41598-018-31534-y

      Kofuji, P., & Araque, A. (2021). G-Protein-Coupled Receptors in Astrocyte-Neuron Communication. Neuroscience, 456, 71-84. https://doi.org/10.1016/j.neuroscience.2020.03.025

      Lee, Y., Morrison, B. M., Li, Y., Lengacher, S., Farah, M. H., Hoffman, P. N., Liu, Y., Tsingalia, A., Jin, L., Zhang, P. W., Pellerin, L., Magistretti, P. J., & Rothstein, J. D. (2012). Oligodendroglia metabolically support axons and contribute to neurodegeneration. Nature, 487(7408), 443-448. https://doi.org/10.1038/nature11314

      Licht-Mayer, S., Campbell, G. R., Canizares, M., Mehta, A. R., Gane, A. B., McGill, K., Ghosh, A., Fullerton, A., Menezes, N., Dean, J., Dunham, J., Al-Azki, S., Pryce, G., Zandee, S., Zhao, C., Kipp, M., Smith, K. J., Baker, D., Altmann, D., Anderton, S. M., Kap, Y. S., Laman, J. D., Hart, B. A. t., Rodriguez, M., Watzlawick, R., Schwab, J. M., Carter, R., Morton, N., Zagnoni, M., Franklin, R. J. M., Mitchell, R., Fleetwood-Walker, S., Lyons, D. A., Chandran, S., Lassmann, H., Trapp, B. D., & Mahad, D. J. (2020). Enhanced axonal response of mitochondria to demyelination offers neuroprotection: implications for multiple sclerosis. Acta Neuropathologica, 140(2), 143-167. https://doi.org/10.1007/s00401-020-02179-x

      Liu, S., Wong, H. Y., Xie, L., Iqbal, Z., Lei, Z., Fu, Z., Lam, Y. Y., Ramkrishnan, A. S., & Li, Y. (2022). Astrocytes in CA1 modulate schema establishment in the hippocampal-cortical neuron network. BMC Biol, 20(1), 250. https://doi.org/10.1186/s12915-022-01445-6

      Magistretti, P. J., & Allaman, I. (2018). Lactate in the brain: from metabolic end-product to signalling molecule. Nat Rev Neurosci, 19(4), 235-249. https://doi.org/10.1038/nrn.2018.19

      Margineanu, M. B., Mahmood, H., Fiumelli, H., & Magistretti, P. J. (2018). L-Lactate Regulates the Expression of Synaptic Plasticity and Neuroprotection Genes in Cortical Neurons: A Transcriptome Analysis. Front Mol Neurosci, 11, 375. https://doi.org/10.3389/fnmol.2018.00375

      Netzahualcoyotzi, C., & Pellerin, L. (2020). Neuronal and astroglial monocarboxylate transporters play key but distinct roles in hippocampus-dependent learning and memory formation. Progress in Neurobiology, 194, 101888. https://doi.org/10.1016/j.pneurobio.2020.101888

      Newman, L. A., Korol, D. L., & Gold, P. E. (2011). Lactate produced by glycogenolysis in astrocytes regulates memory processing. PLoS One, 6(12), e28427. https://doi.org/10.1371/journal.pone.0028427

      Park, J., Kim, J., & Mikami, T. (2021). Exercise-Induced Lactate Release Mediates Mitochondrial Biogenesis in the Hippocampus of Mice via Monocarboxylate Transporters. Front Physiol, 12, 736905. https://doi.org/10.3389/fphys.2021.736905

      Peterson, S. M., Pack, T. F., & Caron, M. G. (2015). Receptor, Ligand and Transducer Contributions to Dopamine D2 Receptor Functional Selectivity. PLoS One, 10(10), e0141637. https://doi.org/10.1371/journal.pone.0141637

      Rangaraju, V., Lauterbach, M., & Schuman, E. M. (2019). Spatially Stable Mitochondrial Compartments Fuel Local Translation during Plasticity. Cell, 176(1), 73-84.e15. https://doi.org/10.1016/j.cell.2018.12.013

      Rinholm, J. E., Hamilton, N. B., Kessaris, N., Richardson, W. D., Bergersen, L. H., & Attwell, D. (2011). Regulation of oligodendrocyte development and myelination by glucose and lactate. J Neurosci, 31(2), 538-548. https://doi.org/10.1523/JNEUROSCI.3516-10.2011

      Sánchez-Abarca, L. I., Tabernero, A., & Medina, J. M. (2001). Oligodendrocytes use lactate as a source of energy and as a precursor of lipids. Glia, 36(3), 321-329. https://doi.org/10.1002/glia.1119

      Suzuki, A., Stern, S. A., Bozdagi, O., Huntley, G. W., Walker, R. H., Magistretti, P. J., & Alberini, C. M. (2011). Astrocyte-neuron lactate transport is required for long-term memory formation. Cell, 144(5), 810-823.

      Tang, F., Lane, S., Korsak, A., Paton, J. F. R., Gourine, A. V., Kasparov, S., & Teschemacher, A. G. (2014). Lactate-mediated glia-neuronal signalling in the mammalian brain. Nature Communications, 5(1), 3284. https://doi.org/10.1038/ncomms4284

      Tauffenberger, A., Fiumelli, H., Almustafa, S., & Magistretti, P. J. (2019). Lactate and pyruvate promote oxidative stress resistance through hormetic ROS signaling. Cell Death Dis, 10(9), 653. https://doi.org/10.1038/s41419-019-1877-6

      Tse, D., Langston, R. F., Kakeyama, M., Bethus, I., Spooner, P. A., Wood, E. R., Witter, M. P., & Morris, R. G. (2007). Schemas and memory consolidation. Science, 316(5821), 76-82. https://doi.org/10.1126/science.1135935

      Tse, D., Takeuchi, T., Kakeyama, M., Kajii, Y., Okuno, H., Tohyama, C., Bito, H., & Morris, R. G. (2011). Schema-dependent gene activation and memory encoding in neocortex. Science, 333(6044), 891-895. https://doi.org/10.1126/science.1205274

      Vardjan, N., Chowdhury, H. H., Horvat, A., Velebit, J., Malnar, M., Muhič, M., Kreft, M., Krivec, Š. G., Bobnar, S. T., Miš, K., Pirkmajer, S., Offermanns, S., Henriksen, G., Storm-Mathisen, J., Bergersen, L. H., & Zorec, R. (2018). Enhancement of Astroglial Aerobic Glycolysis by Extracellular Lactate-Mediated Increase in cAMP [Original Research]. Frontiers in Molecular Neuroscience, 11. https://doi.org/10.3389/fnmol.2018.00148

      Vezzoli, E., Cali, C., De Roo, M., Ponzoni, L., Sogne, E., Gagnon, N., Francolini, M., Braida, D., Sala, M., Muller, D., Falqui, A., & Magistretti, P. J. (2020). Ultrastructural Evidence for a Role of Astrocytes and Glycogen-Derived Lactate in Learning-Dependent Synaptic Stabilization. Cereb Cortex, 30(4), 2114-2127. https://doi.org/10.1093/cercor/bhz226

      Wang, J., Tu, J., Cao, B., Mu, L., Yang, X., Cong, M., Ramkrishnan, A. S., Chan, R. H. M., Wang, L., & Li, Y. (2017). Astrocytic l-Lactate Signaling Facilitates Amygdala-Anterior Cingulate Cortex Synchrony and Decision Making in Rats. Cell Rep, 21(9), 2407-2418. https://doi.org/10.1016/j.celrep.2017.11.012

      Yang, J., Ruchti, E., Petit, J. M., Jourdain, P., Grenningloh, G., Allaman, I., & Magistretti, P. J. (2014). Lactate promotes plasticity gene expression by potentiating NMDA signaling in neurons. Proc Natl Acad Sci U S A, 111(33), 12228-12233. https://doi.org/10.1073/pnas.1322912111

      Yao, S., Xu, M.-D., Wang, Y., Zhao, S.-T., Wang, J., Chen, G.-F., Chen, W.-B., Liu, J., Huang, G.-B., Sun, W.-J., Zhang, Y.-Y., Hou, H.-L., Li, L., & Sun, X.-D. (2023). Astrocytic lactate dehydrogenase A regulates neuronal excitability and depressive-like behaviors through lactate homeostasis in mice. Nature Communications, 14(1), 729. https://doi.org/10.1038/s41467-023-36209-5

      Yu, X., Zhang, R., Wei, C., Gao, Y., Yu, Y., Wang, L., Jiang, J., Zhang, X., Li, J., & Chen, X. (2021). MCT2 overexpression promotes recovery of cognitive function by increasing mitochondrial biogenesis in a rat model of stroke. Anim Cells Syst (Seoul), 25(2), 93-101. https://doi.org/10.1080/19768354.2021.1915379

      Zhou, Z., Okamoto, K., Onodera, J., Hiragi, T., Andoh, M., Ikawa, M., Tanaka, K. F., Ikegaya, Y., & Koyama, R. (2021). Astrocytic cAMP modulates memory via synaptic plasticity. Proc Natl Acad Sci U S A, 118(3), e2016584118. https://doi.org/10.1073/pnas.2016584118

      Zhu, J., Hu, Z., Han, X., Wang, D., Jiang, Q., Ding, J., Xiao, M., Wang, C., Lu, M., & Hu, G. (2018). Dopamine D2 receptor restricts astrocytic NLRP3 inflammasome activation via enhancing the interaction of β-arrestin2 and NLRP3. Cell Death Differ, 25(11), 2037-2049. https://doi.org/10.1038/s41418-018-0127-2

    1. Author Response

      Reviewer #1 (Public Review):

      This thorough study expands our understanding of BMP signaling, a conserved developmental pathway involved in processes as diverse as body patterning and neurogenesis. The authors applied multiple state-of-the-art strategies to the anthozoan Nematostella vectensis in order to first identify the direct BMP signaling targets - bound by the activated pSMAD1/5 protein - and then dissect the role of a novel pSMAD1/5 gradient modulator, zswim4-6. The list of target genes features multiple developmental regulators, many of which are bilaterally expressed, and which are notably shared between Drosophila and Xenopus. The analysis identified in particular zswim4-6, a novel nuclear modulator of the BMP pathway that is also conserved in vertebrates. A combination of loss-of-function (injection of antisense morpholino oligonucleotide, CRISPR/Cas9 knockout, expression of dominant negative) and gain-of-function assays, together with transcriptome sequencing, showed that zswim4-6 acts as a transcriptional repressor of BMP signaling. Functional manipulation of zswim5 in zebrafish shows a conserved role in modulating BMP signaling in a vertebrate.

      The particular strength of the study lies in the careful and thorough analysis performed. This is solid developmental work, where one clear biological question is progressively dissected, with the most appropriate tools. The functional results are further validated by alternative approaches. Data is clearly presented and methods are detailed. I have a couple of comments.

      1) I was intrigued - as were the authors - by the fact that the ChIP-Seq did not identify any known BMP ligand bound by pSMAD1/5. Are these genes found in the published ChIP-Seq data of the other species used for the comparative analysis? One hypothesis could be that there is a change in the regulatory interactions and that the initial set-up of the gradient does indeed require a feedback loop, which is then turned off at late gastrula. In this case, immunoprecipitation at early gastrula, prior to the set-up of the pSMAD1/5 gradient, could reveal a different scenario. Alternatively, the regulation could be indirect, for example, through RGM, an additional regulator of BMP signaling expressed on the side of lower BMP activity, which is among the targets of the ChIP-Seq. This aspect could be discussed. Additionally, even if this is perhaps outside the scope of this study, I think it would be informative to further assess the effect of ZSWIM manipulation on RGM (and vice versa).

      Indeed, BMP genes are direct BMP signaling targets in Drosophila (dpp) (Deignan et al., 2016, https://doi.org/10.1371/journal.pgen.1006164) and frog (bmp2, bmp4, bmp5, bmp7) (Stevens et al., 2021, https://doi.org/10.1242/dev.145789). Of all these ligands, only the dorsally expressed Xenopus bmp2 is repressed by BMP signaling, while another dorsally expressed Xenopus BMP gene admp is not among the direct targets. All other BMP genes listed here are expressed in the pMad/pSMAD1/5/8-positive domain and are activated by BMP signaling.

      In Nematostella, we do not find BMP genes among the ChIP-Seq targets, but this is not that surprising considering the dynamics of the bmp2/4, bmp5-8 and chordin expression, as well as the location of the pSMAD1/5-positive cells. In late gastrulae/early planulae, Chordin appears to be shuttling BMP2/4 and BMP5-8 away from their production source and over to the gdf5-like side of the directive axis (Genikhovich et al., 2015; Leclere and Rentsch, 2014). By 4 dpf, chordin expression stops, and BMP2/4 and BMP5-8 start to be both expressed AND signal in the mesenteries. If bmp2/4 and bmp5-8 expression were directly suppressed by pSMAD1/5 (as is the case for chordin or rgm expression), this mesenterial expression would not be possible. Therefore, in our opinion, it is most likely that at late gastrula and early planula the regulation of bmp2/4 and bmp5-8 expression by BMP signaling is indirect. We do not have an explanation for why gdf5-like (another BMP gene expressed on the “high pSMAD1/5” side) is not retrieved as a direct BMP target in our ChIP data. Since we do not understand well enough how BMP gene expression is regulated, we do not discuss this at length in the manuscript.

      As the Reviewer suggested, we analyzed the effect of ZSWIM4-6 KD on the expression of rgm. As expected, since rgm is expressed on the “low BMP side”, its expression was strongly expanded (Figure 6 - Figure Supplement 4).

      2) I do not fully understand the rationale behind the choice of performing the comparative assays in zebrafish: as the conservation was initially identified in Xenopus, I would have expected the experiment to be performed in frog. Furthermore, reading the phylogeny (Figure 4A), it is not obvious to me why ZSWIM5 was chosen for the assay (over the other paralog ZSWIM6). Could the Authors comment on this experiment further?

      The comparison was done in zebrafish because we were planning to generate zswim5 mutants, whose analysis is currently in progress. ZSWIM6 is not expressed at the developmental stages we were interested in, while ZSWIM5 was, based on available zebrafish expression data (White et al., 2017).

      Reviewer #2 (Public Review):

      The authors provide a nice resource of putative direct BMP target genes in Nematostella vectensis by performing ChIP-seq with an anti-pSmad1/5 antibody, while also performing bulk RNA-seq with BMP2/4 or GDF5 knockdown embryos. Genes that exhibit pSmad1/5 binding and have changes in transcription levels after BMP signaling loss were further annotated to identify those with conserved BMP response elements (BREs). Further characterization of one of the direct BMP target genes (zswim4-6) was performed by examining how expression changed following BMP receptor or ligand loss of function, as well as how loss or gain of function of zswim4-6 affected development and BMP signaling. The authors concluded that zswim4-6 modulates BMP signaling activity and likely acts as a pSMAD1/5 dependent co-repressor. However, the mechanism by which zswim4-6 affects the BMP gradient or interacts with pSMAD1/5 to repress target genes is not clear. The authors test the activity of a zswim4-6 homologue in zebrafish (zswim5) by over-expressing mRNA and find that pSMAD1/5/9 labeling is reduced and that embryos have a phenotype suggesting loss of BMP signaling, and conclude that zswim4-6 is a conserved regulator of BMP signaling. This conclusion needs further support to confirm BMP loss of function phenotypes in zswim5 over-expression embryos.

      Major comments

      1) The BMP direct target comparison was performed between Nematostella, Drosophila, and Xenopus, but not with existing data from zebrafish (Greenfeld 2021, Plos Biol). Given the functional analysis with zebrafish later in the paper, it would be nice to see if there are conserved direct target genes in zebrafish and, in particular, whether zswim5 (or other zswim genes) is a direct target. Since conservation of zswim4-6 as a direct BMP target between Nematostella and Xenopus seemed to be part of the rationale for further functional analysis, it would also be nice to know if this is a conserved target in zebrafish.

      Thank you for the suggestion. In the paper by Greenfeld et al., 2021, zebrafish zswim5 was downregulated approximately 2.4x in the bmp7 mutant at 6 hpf, while zswim6 was barely expressed and not affected at this stage. We added this information to the text of the manuscript. Expression of several other zebrafish zswim genes was also affected in the bmp7 mutant, but these genes do not appear relevant for our study since their corresponding orthologs are not identified as pSMAD1/5 ChIP-Seq targets in Nematostella. Notably, zebrafish zswim5 is not clearly differentially expressed in BMP or Chd overexpression conditions (See Supplementary file 1 in Rogers et al. 2020). Importantly, in the paper, we wanted to compare ChIP-Seq data with ChIP-Seq data; unfortunately, no ChIP-Seq data for pSMAD1/5/8 are currently available for zebrafish, which precludes such a comparison.

      Related to this, in the discussion it is mentioned that zswim4/6 is also a direct BMP target in mouse hair follicle cells, but it wasn't obvious from looking at the supplemental data in that paper where this was drawn from.

      Please see Supplementary Table 1, second Excel sheet labeled “Mx ChIP_Seq” in Genander et al., 2014, https://doi.org/10.1016/j.stem.2014.09.009. Zswim4 has a single pSMAD1 peak associated with it, Zswim6 has two.

      2) The loss of zswim4-6 function via MO injection results in changes to pSmad1/5 staining, including a reduction in intensity in the endoderm and gain of intensity in the ectoderm, while over-expression results in a loss of intensity in the ectoderm and no apparent change in the endoderm. While this is interesting, it is not clear how zswim4-6 is functioning to modify BMP signaling, and how this might explain differential effects in ectoderm vs. endoderm. Is the assumption that the mechanism involves repression of chordin? If so, one could test the double knockdown of zswim4-6 and chordin and look for the rescue of pSmad1/5 levels or morphological phenotype.

      We do not think that the mechanism of the ZSWIM4-6 action is via repression of Chordin. As loss of chordin leads to the loss of pSMAD1/5 in Nematostella (Genikhovich et al., 2015), the proposed experiment is, unfortunately, not feasible to test this hypothesis. Currently, we see two distinct effects of the modulation of zswim4-6 expression. First, it affects the pSMAD1/5 gradient, possibly by destabilizing nuclear SMAD1/5, as has been proposed by Wang et al., 2022 for the vertebrate Zswim4. This is in line with our results shown in Fig. 6C-F’ and Fig. 6-Figure supplement 3. In our opinion, the reaction of the genes expressed on the “high BMP” side of the directive axis to the overexpression or KD of ZSWIM4-6 (Fig. 6I-K’, 6N-P’) can be explained by these changes in the pSMAD1/5 signaling intensity. Secondly, zswim4-6 appears to promote pSMAD1/5-mediated gene repression. This is in line with the reaction of the genes expressed on the “low BMP” side of the directive axis (Fig. 6G-H’, 6L-M’, Fig. 6-Figure Supplement 4). These genes are repressed by BMP signaling, but they expand their expression upon zswim4-6 KD in spite of the increased pSMAD1/5. Our ChIP experiment (Fig. 6Q) supports this view.

      3) Several experiments are done to determine how zswim4-6 expression responds to the loss of function of different BMP ligands and receptors, with the conclusion being that zswim4-6 is a BMP2/4 target but not a GDF5 target, with a lot of the discussion dedicated to this as well. However, the authors show a binary response to the loss of BMP2/4 function, where zswim4-6 is expressed normally until pSmad1/5 levels drop low enough, at which point expression is lost. Since the authors also show that GDF5 morphants do not have as strong a reduction in pSmad1/5 levels compared to BMP2/4 morphants, perhaps GDF5 plays a positive but redundant role in zswim4-6 expression. To test this possibility the authors could inject suboptimal doses of BMP2/4 MO with GDF5 MO and look for synergy in the loss of zswim4-6 expression.

      Thanks for this great suggestion! We performed this experiment (Fig. 5H’’-L) and indeed, a suboptimal dose of BMP2/4MO + GDF5lMO results in a complete radialization of the embryo and abolishes zswim4-6 expression, similar to the effect of a high dose of BMP2/4MO. This result suggests that, rather than reflecting a ligand-specific signaling function, GDF5-like signaling alone still provides sufficiently high pSmad1/5 levels to activate zswim4-6 expression to apparent wildtype levels, demonstrating the sensitivity of this gene to even very low amounts of BMP signaling.

      4) The zswim4-6 morphant embryos show increased expression of zswim4-6 mRNA, which is said to indicate that zswim4-6 negatively regulates its own expression. However, in zebrafish, translation-blocking MOs can sometimes stabilize target transcripts, causing an artifact that can be mistakenly assumed to be increased transcription (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7162184/). Some additional controls here would be warranted for making this conclusion.

      Thanks for raising this important experimental consideration. To date, we do not have any evidence for MO-mediated transcript stabilization in Nematostella, and we have not found such data in the literature on models other than zebrafish. mRNA stabilization by the MO also seemed unlikely because we were unable to KD zswim4-6 using several independent shRNAs - an effect we frequently observe with genes whose activity negatively regulates their own expression. However, to test the possibility that zswim4-6MO binding stabilizes zswim4-6 mRNA, we injected mRNA containing the zswim4-6MO recognition sequence followed by the mCherry coding sequence (zswim4-6MO-mCherry) with either zswim4-6MO or control MO. We could clearly detect mCherry fluorescence at 1 dpf if control MO was co-injected with the mRNA, but not if zswim4-6MO was co-injected with the mRNA. At 2 dpf (the stage at which we showed upregulation of zswim4-6 upon zswim4-6MO injection in Fig. 6I-I’), zswim4-6MO-mCherry mRNA was undetectable by in situ hybridization with our standard FITC-labeled mCherry probe independent of whether zswim4-6MO-mCherry mRNA was co-injected with the control MO or ZSWIM4-6MO, while hybridization with the FITC-labeled FoxA probe worked perfectly.

      Author response image 1.

      We are currently offering two alternative hypotheses for the observed increase in zswim4-6 levels in the paper rather than stating explicitly that ZSWIM4-6 negatively regulates its own expression: “The KD of zswim4-6 translation resulted in a strong upregulation of zswim4-6 transcription, especially in the ectoderm, suggesting that ZSWIM4-6 might either act as its own transcriptional repressor or that zswim4-6 transcription reacts to the increased ectodermal pSMAD1/5 (Fig. 6I-I’).” Given the sensitivity of zswim4-6 to even the weakest pSMAD1/5 signal (zswim4-6 is expressed upon GDF5-like KD, which drastically reduces pSMAD1/5 signaling intensity; see Fig. 1 and 2 in Genikhovich et al., 2015, http://doi.org/10.1016/j.celrep.2015.02.035 and Fig. 6-Figure supplement 3 of this paper), the latter option (that it reacts to the increased ectodermal pSMAD1/5) is, in our opinion, clearly the more probable one.

      5) Zswim4-6 is proposed to be a co-repressor of pSmad1/5 targets based on the occupancy of zswim4-6 at the chordin BRE (which is normally repressed by BMP signaling) and lack of occupancy at the gremlin BRE (normally activated by BMP signaling). This is a promising preliminary result but is based only on the analysis of two genes. Since the authors identified BREs in other direct target genes, examining more genes would better support the model.

      We suggest that ZSWIM4-6 may be a co-repressor of pSMAD1/5 targets because it is a nuclear protein (Fig. 4G), whose knockdown results in the expansion of the ectodermal expression of several genes repressed by pSMAD1/5 in spite of the expansion of pSMAD1/5 itself (Fig. 6G-H’, 6L-M’, Fig. 6-Figure Supplement 4). Our limited ChIP analysis supports this idea by showing that ZSWIM4-6 is bound to the pSMAD1/5 site of chordin (repressed by pSMAD1/5) but not to that of gremlin (activated by pSMAD1/5). We agree that adding the analysis of more targets in order to challenge our hypothesis would be good. However, given technical limitations (having to inject many thousands of eggs with the EF1a::ZSWIM4-6-GFP plasmid in order to get enough nuclei to extract sufficient immunoprecipitated chromatin for qPCR on 3 genes (chordin, gremlin, GAPDH) for each biological replicate), it is currently unfortunately not feasible to test more genes. It will be of great interest for follow-up studies to generate a knock-in line with tagged zswim4-6 to analyze target binding on a genome-wide scale. We stress in the discussion that currently the power of our conclusion is low.

      6) The rationale for further examination of zswim4-6 function in Nematostella was based in part on it being a conserved direct BMP target in Nematostella and Xenopus. The analysis of zebrafish zswim5 function however does not examine whether zswim5 is a BMP target gene (direct or indirect). BMP inhibition followed by an in situ hybridization for zswim5 would establish whether its expression is activated downstream of BMP.

      In the paper by Greenfeld et al., 2021, zebrafish zswim5 was downregulated approximately 2.4x in the bmp7 mutant at 6 hpf. However, this gene was not among the 57 genes that were considered to be direct BMP targets because their expression was affected by bmp7 mRNA injection into cycloheximide-treated bmp7 mutants (Greenfeld et al., 2021). We added this information to the text of the manuscript.

      7) Although there is a reduction in pSmad1/5/9 staining in zebrafish injected with zswim5 mRNA, it is difficult to tell whether the resulting morphological phenotypes closely resemble zebrafish with BMP pathway mutations (such as bmp2b). More analysis is warranted here to determine whether stereotypical BMP loss of function phenotypes are observed, such as dorsalization of the mesoderm and loss of ventral tail fin.

      We agree, and we have toned down all zebrafish arguments. Analyses of zswim5 mutants are currently ongoing.

    1. Author Response

      Reviewer #3 (Public Review):

      1) Validation of reagents: The authors generated a pY1230 Afadin antibody claiming that (page 6) "this new antibody is specific to tyrosine phosphorylated Afadin, and that pY1230 is targeted for dephosphorylation by PTPRK, in a D2-domain dependent manner". The WB in Fig 1B shows a lot of background, two main bands are visible which both diminish in intensity in ICT WT pervanadate-treated MCF10A cell lysates. The claim that the developed peptide antibody is selective for pY1230 in Afadin would need to be substantiated, for instance by pull down studies analysed by pY-MS to substantiate a claim of antibody specificity for this site. However, for the current study it would be sufficient to demonstrate that pY1230 is indeed the dephosphorylated site. I suggest therefore including a site directed mutant (Y1230F) that would confirm dephosphorylation at this site and the ability of the antibody recognizing the phosphorylation state at this position.

      We would like this antibody to be a useful and freely accessible tool in the field and have taken on board the request for additional validation. To this end we have significantly expanded Supplementary Figure 2 (now Figure 1 - figure supplement 2) and included a dedicated section of the results as follows:

      1. We have now included information about all of the Afadin antibodies used in this study, since Afadin(BD) appears to be sensitive to phosphorylation (Figure 1 - figure supplement 2A).
      2. We have demonstrated that the Afadin pY1230 antibody detects an upregulated band in PTPRK KO MCF10A cells, consistent with our previous tyrosine phosphoproteomics (Figure 1 - figure supplement 2B). This indicates that the antibody can be used to detect endogenous Afadin phosphorylation.
      3. We have included two new knock down experiments demonstrating the recognition of Afadin by our antibody (Figure 1 - figure supplement 2C). There appear to be two Afadin isoforms recognised in HEK293T cells by both the BD and pY1230 antibody, consistent with previous reports (Umeda et al. MBoC, 2015). We have highlighted these in the figure.
      4. We have performed mutagenesis to demonstrate the specificity of the antibody. We tagged Afadin with a fluorescent protein tag, reasoning that it would cause a shift in molecular weight that could be resolved by SDS PAGE, as is the case. We noted that the phosphopeptide used spans an additional tyrosine, Y1226, which has been detected as phosphorylated (although to a much lower extent than Y1230) on Phosphosite plus. The data clearly show that Afadin cannot be phosphorylated when Y1230 is mutated to a phenylalanine (compared to CIP control), indicating that this is the predominant site recognised by the antibody. In addition, the endogenous pervanadate-stimulated signal is completely abolished by CIP treatment (Figure 1 - figure supplement 2D).
      5. We have included densitometric quantification of the dephosphorylation assay shown in Figure 1B, which was part of a time course and shows preferential dephosphorylation by the PTPRK ICD compared to the PTPRK D1. The signal stops declining with time, which could indicate antibody background, or an inaccessible pool of Afadin-pY1230 (Figure 1 - figure supplement 2E).
      6. To further demonstrate that this site is modulated by PTPRK in post-confluent cells, we have used doxycycline (dox)-inducible cell lines generated in Fearnley et al, 2019. Upon treatment with 500 ng/ml Dox for 48 hours, PTPRK is induced to lower levels than wildtype; however, normalized quantification of the Afadin pY1230 against the Afadin (CST) signal clearly indicates downregulation by PTPRK WT, but not the catalytically inactive mutant (Figure 1 - figure supplement 2F and 2G).

      Together these data strengthen our assertion that this antibody recognises endogenously phosphorylated Afadin at site Y1230, which is modulated in vitro and in cells by PTPRK phosphatase activity. For clarity, we have highlighted and annotated the relevant bands in figures. We have also included identifiers indicating which total Afadin antibody was used in each experiment.
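
      As an illustration only, the normalized quantification described above (phospho-signal divided by total-protein signal, then expressed relative to a control lane) can be sketched as follows. The function names and band intensities are hypothetical and are not taken from the study; the actual densitometry workflow used by the authors is not specified here.

```python
def normalise_phospho_signal(phospho, total):
    """Divide each phospho-band intensity by the matching total-protein band."""
    if len(phospho) != len(total):
        raise ValueError("phospho and total must have one value per lane")
    return [p / t for p, t in zip(phospho, total)]


def relative_to_control(ratios, control_index=0):
    """Express every lane's ratio relative to the chosen control lane."""
    reference = ratios[control_index]
    return [r / reference for r in ratios]


# Hypothetical band intensities (arbitrary units), one value per lane.
py1230 = [10.0, 5.0, 9.0]   # e.g. uninduced, WT-induced, inactive-mutant-induced
afadin = [20.0, 20.0, 18.0]

ratios = normalise_phospho_signal(py1230, afadin)
print(relative_to_control(ratios))
```

      Dividing by the total Afadin signal first means that lane-to-lane loading differences do not masquerade as changes in phosphorylation.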

      2) The authors claim that a short, 63-residue predicted coiled coil (CC) region, is both necessary and sufficient for binding to the PTPRK-ICD. The region is predicted to have alpha-helical structure and as a consequence, a helical structure has been used in the docking model. Considering that the authors recombinantly expressed this region in bacteria, it would be experimentally simple confirming the alpha-helical structure of the segment by CD or NMR spectroscopy.

      To clarify, the helical structure in the docking model was independently predicted by several sequence and structural analysis programmes including AlphaFold2, RoseTTAFold and NetSurfP, and the region is annotated in Uniprot as a coiled coil. We did not stipulate prior to the AF2 prediction that it was helical. Isolated short peptides frequently adopt helical structure, therefore prediction of a helix within the context of the full Afadin sequence is, in our opinion, stronger evidence than CD of an isolated fragment.

      3) Only two mutants have been introduced into PTPRK-ICD to map the Afadin interaction site. One of the mutations changes a possibly structurally important residue (glycine) into a histidine. Even though this residue is present in PTPRM, it does not exclude that the D2 domain no longer functionally folds. Also, the second mutation represents a large change in chemical properties, and the other 2 predicted residues have not been investigated.

      The residues that were selected for mutation are all localised to the protein surface and therefore are unlikely to be involved in stable folding of PTPRK. In support of the correct folding of the mutated PTPRK, we include in Figure 1 below SEC elution traces for wild-type and mutant D2 showing that they elute as single symmetric peaks at the same elution volume as the WT protein. This is consistent with them having a similar shape and size, and not being aggregated or unfolded.

      Figure 1. PTPRK-D2 wild-type and mutant preparative SEC elution profiles. A280nm has been normalised to help illustrate that the different proteins elute at the same volume. The main peak from these samples was used for binding assays in the main paper.
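
      For readers who want to reproduce this kind of comparison, the normalisation mentioned in the caption can be sketched as below: each A280 trace is scaled to its own maximum, and the elution volume at the peak is read off. The traces shown are made-up numbers, and the helper names are hypothetical; this is a minimal sketch, not the analysis used for the figure.

```python
def normalise_trace(a280):
    """Scale an A280 trace so that its maximum equals 1."""
    peak = max(a280)
    return [value / peak for value in a280]


def peak_volume(volumes, a280):
    """Return the elution volume at which the trace reaches its maximum."""
    index = max(range(len(a280)), key=a280.__getitem__)
    return volumes[index]


# Hypothetical elution volumes (ml) and A280 readings (mAU).
volumes = [14.0, 15.0, 16.0, 17.0, 18.0]
wild_type = [2.0, 10.0, 40.0, 12.0, 3.0]
mutant = [5.0, 30.0, 120.0, 35.0, 8.0]

print(normalise_trace(wild_type), peak_volume(volumes, wild_type))
print(normalise_trace(mutant), peak_volume(volumes, mutant))
```

      After normalisation, traces from preparations with very different yields can be overlaid, so that a shift in elution volume (rather than a difference in peak height) is what stands out.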

      Furthermore, the yield for the double mutant was very high (4 mg of pure protein from a 2 L culture, see A280 value in graph below), whereas poorly folded proteins tend to have significantly reduced yields. This protein was also very stable over time whereas unfolded proteins tend to degrade during or following purification.

      Figure 2. Analytical SEC elution profile for the PTPRK-D2 DM construct showing the very high yield consistent with a well-folded, stable protein.

      Finally, we have carried out thermal melt curves of the WT and mutant PTPRK D2 domains showing that they all possess melting temperatures between 39.3°C and 41.7°C, supporting that they are all equivalently folded. We include these data as an additional Supplementary Figure (Figure 4 - figure supplement 3) in the paper.

      4) The interface on the Afadin substrate has not been investigated apart from deleting the entire CC or a central charge cluster. Based on the docking model the authors must have identified key positions of this interaction that could be mutated to confirm the proposed interaction site.

      We have now made and tested several additional mutations within both the Afadin-CC and PTPRK-D2 domains to further validate the AF2 predicted model of the complex.

      For Afadin-CC we introduced several single and double mutations along the helix including residues predicted to be in the interface and residues distal from the interface. These mutations and the pulldown with PTPRK are described in the text and are included as additional panels to a modified Figure 3. All mutations have the expected effect on the interaction based on the predicted complex structure. To help illustrate the positions of these mutations we have also included a figure of the interface with the residues highlighted.

      For the PTPRK-D2 we have also introduced two new mutations, one buried in the interface (F1225A) and one on the edge of the interface encompassing a loop that is different in PTPRM (labelled the M-loop). GST-Afadin WT protein was bound to GSH beads and tested for its ability to pull down WT and mutated PTPRK. These new mutations (illustrated in the new Figure 4 – figure supplement 2) further support the model prediction. F1225A almost completely abolishes binding as predicted, while the M-loop mutant retains binding. These mutations and their effects are now described in the main text and the pull-down data, including controls and retesting of the original DM mutant, are included as panel H in a newly modified Figure 4 focussed solely on the PTPRK interface.

      5) A minor point is that ITC experiments have not been run long enough to determine the baseline of interaction heats. In addition, as large and polar proteins were used in this experiment, a blank titration would be required to rule out that dilution heats affect the determined affinities.

      All control experiments including buffer into buffer, Afadin into buffer and buffer into PTPRK were carried out at the same time as the main binding experiment and are shown below overlaid with the binding curve. These demonstrate the very small dilution heats consistent with excellent buffer matching of the samples.

      We were able to obtain excellent fits to the titration curves by fitting 1:1 binding with a calculated linear baseline (see Figure 2B,D). Very similar results were obtained by fitting to the sum (‘composite’) of fitted linear baselines obtained for the three control experiments for each titration.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors convincingly show in this study the effects of the fas5 gene on changes in the CHC profile and the importance of these changes toward sexual attractiveness.

      The main strength of this study lies in its holistic approach (from genes to behaviour) showing a full and convincing picture of the stated conclusions. The authors succeeded in putting a very interdisciplinary set of experiments together to support the main claims of this manuscript.

      We appreciate the kind comments from the reviewer.

      The main weakness stems from the lack of transparency behind the statistical analyses conducted in the study. Detailed statistical results are never mentioned in the text, nor is it always clear what was compared to what. I also believe that some tests that were conducted are not adequate for the given data. I am therefore unable to properly assess the significance of the results from the presented information. Nevertheless, the graphical representations are convincing enough for me to believe that a revision of the statistics would not significantly affect the main conclusions of this manuscript.

      We apologize for neglecting a detailed description of the statistical tests that were performed. We wrote additional paragraphs in the Methods section specifically explaining the statistical analyses (lines 435-445; 489-502; 559-561; 586-591).

      The second major problem I had with the study was how it brushes over the somewhat contradicting results they found in males (Fig S2). These are only mentioned twice in the main text and in both cases as being "similarly affected", even though their own stats seem to indicate otherwise for many of the analysed compound groups. This also should affect the main conclusion concerning the effects of fas5 genes in the discussion, a more careful wording when interpreting the results is therefore necessary.

      Thank you for pointing this out. Our focus clearly lay on the female CHC profiles, as a function in sexual signaling has so far only been described for them; nevertheless, we have now elaborated on the results and discussion for the fas5 RNAi males (lines 167-178; 258-268).

      Reviewer #2 (Public Review):

      Insects have long been known to use cuticular hydrocarbons for communication. While the general pathways for hydrocarbon synthesis have been worked out, their specificity and in particular the specificity of the different enzymes involved is surprisingly little understood. Here, the authors convincingly demonstrate that a single fatty acid synthase gene is responsible for a shift in the positions of methyl groups across the entire alkane spectrum of a wasp, and that male wasps recognize females specifically based on these methyl group positions. The strength of the study is the combination of gene expression manipulations with behavioural observations evaluating the effect of the associated changes in the cuticular hydrocarbon profiles. The authors make sure that the behavioural effect is indeed due to the chemical changes by not only testing live animals, but also dead animals and corpses with manipulated cuticular hydrocarbons.

      I find the evidence that the hydrocarbon changes do not affect survival and desiccation resistance less convincing (due to the limited set of conditions and relatively small sample size), but the data presented are certainly congruent with the idea that the methyl alkane changes do not have large effects on desiccation.

      We appreciate the kind comments from the reviewer.

      Reviewer #3 (Public Review):

      In this manuscript, the authors are aiming to demonstrate that a fatty-acyl synthase gene (fas5) is involved in the composition of the blend of surface hydrocarbons of a parasitoid wasp and that it affects the sexual attractiveness of females for males. Overall, the manuscript reads very well, it is very streamlined, and the authors' claims are mostly supported by their experiments and observations.

      We appreciate the kind comments from the reviewer.

      However, I find that some experiments, information and/or discussion are absent to assess how the effects they observe are, at least in part, not due to other factors than fas5 and the methyl-branched (MB) alkanes. I'm also wondering if what the authors observe is only a change in the sexual attractiveness of females and not related to species recognition as well.

      We appreciate the interesting point that the reviewer raises in sexual attractiveness and species recognition and now expand upon this potential aspect in the discussion (lines 327-330). However, in this manuscript, we very much focused on the effect of fas5 knockdown on the conveyance of female sexual attractiveness in a single species (Nasonia vitripennis). Therefore, we argue that species recognition constitutes a different communication modality here, and we currently cannot infer whether and how species recognition is exactly encoded in Nasonia CHC profiles despite some circumstantial evidence for species-specificity (Buellesbach et al. 2013; Mair et al. 2017). Thus, we would like to refrain from any further speculation on species recognition before this can be unambiguously demonstrated, and remain within the mechanism of sexual attractiveness within a single species which we clearly show is mediated by the female MB-alkane fraction governed by the fatty acid synthase genes. We however still consider potential alternative explanations (e.g., n-alkenes acting as a deterrent of homosexual mating attempts).

      The authors explore the function of cuticular hydrocarbons (CHCs) and a fatty-acyl synthase in Nasonia vitripennis, a parasitic wasp. Using RNAi, they successfully knock down the expression of the fas5 gene in wasps. The authors do not justify their choice of fatty-acyl synthase candidate gene. It would have been interesting to know if that is one of many genes they studied or if there was some evidence that drove them to focus their interest on fas5.

      In a previous study, 5 fas candidate genes orthologous to Drosophila melanogaster fas genes were identified and mapped in the genome of Nasonia vitripennis (Buellesbach et al. 2022). We actually investigated the effects of all of these fas genes on CHC variation, but only fas5 led to such a striking, traceable pattern shift. We are currently preparing another manuscript discussing the effects of the other fas genes, but decided to focus exclusively on fas5 here, due to its significance for revealing how sexual attractiveness can be encoded and conveyed in complex chemical profiles, maintained and governed by a surprisingly simple genetic basis.

      The authors observe large changes in the cuticular hydrocarbon (CHC) profiles of males and females. These changes are mostly a reduction of some MB alkanes and an increase in others, as well as an increase of n-alkenes in fas5 knockdown females. For male fas5 knockdowns, the overall quantity of CHC is increased and consequently, multiple types of compounds are increased compared to wild-type, with only one compound appearing to decrease compared to wild-type. Insects are known to rely on ratios of compounds in blends to recognize odors. The authors address this by showing a plot of the relative ratios, but it seems to me that they do not show statistical tests of those changes in the proportions of the different types of compounds. In the results section, the authors give percentages while referring to figures showing the absolute amount of CHCs. They should also test if the ratios are significantly different or not between experimental conditions. Similar data should be displayed for the males as well.

      We appreciate your suggestions. We kindly refer you to our response to reviewer 1, where we addressed the statistical tests. Specifically, we generated separate subplots to display the proportions of different compound classes and performed statistical tests to compare these proportions between different treatments for both males and females. Additionally, we have revised the results section to replace relative abundances with absolute quantity, as depicted in Figure 2C-G.

      Furthermore, the authors didn't use an internal standard to measure the quantity of CHCs in the extracts, which, to me, is the gold standard in the field. If I understood correctly, the authors check the abundance measured for known quantities of n-alkanes. I'm sure this method is fine, but I would have liked to be reassured that the quantities measured through this method are good by either testing some samples with an internal standard, or referring to work that demonstrates that this method is always accurate to assess the quantities of CHC in extracts of known volumes.

      We actually did include 7.5 ng/μl dodecane (C12) as an “internal” standard in the hexane resuspensions of all of our processed samples (line 456, Materials and Methods). This was primarily done to allow for visually inspecting and comparing the congruence of all chromatograms in the subsequent data analysis and to immediately detect any variation arising from sample preparation, the injection process and instrument fluctuation. In our study, we use a very elaborate and standardized CHC extraction method in which the volume of solvent and the duration of extraction are strictly controlled to minimize variation from the sample preparation steps. Furthermore, we calibrated each individual CHC compound quantity with a dilution series of external standards (C21-C40) of known concentration. By constructing a calibration curve based on this dilution series, we achieved the most accurate compound quantification, also taking into account and counteracting the generally diminishing quantities of compounds with higher chain lengths.
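
      The external-standard calibration described above can be sketched as follows. This is only an illustrative sketch, assuming a simple linear relationship between peak area and concentration for a single standard; the concentrations, peak areas and the helper name `area_to_conc` are hypothetical and are not taken from the study.

```python
# Hypothetical dilution series for ONE external standard (illustrative values only):
# known concentrations (ng/ul) and the peak areas measured for them (arbitrary units).
known_conc = [1.0, 2.5, 5.0, 10.0, 20.0]
peak_area = [210.0, 520.0, 1040.0, 2080.0, 4150.0]

# Ordinary least-squares fit of: peak_area = slope * concentration + intercept
n = len(known_conc)
mean_x = sum(known_conc) / n
mean_y = sum(peak_area) / n
sxx = sum((x - mean_x) ** 2 for x in known_conc)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(known_conc, peak_area))
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def area_to_conc(area):
    """Invert the fitted calibration line to estimate a concentration from a peak area."""
    return (area - intercept) / slope


# Estimate the concentration behind a (hypothetical) sample peak area.
estimate = area_to_conc(1500.0)
print(f"estimated concentration: {estimate:.2f} ng/ul")
```

      Fitting one such curve per standard is what allows the lower detector response at higher chain lengths to be compensated for compound by compound.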

      The authors provide a sensible control for their RNAi experiments: targeting an unrelated gene, absent in N. vitripennis (the GFP). This allows us to see if the injection of RNAi might affect CHC profiles, which it appears to do in some cases in males, but not in females. The authors also show to the reader that their RNAi experiments do reduce the expression of the target gene. However, one of the caveats of their experiments, is that the authors don't provide evidence or information to allow the (non-expert) reader to assess whether the fas5 RNAi experiments did affect the expression of other fatty-acyl synthase genes. I'm not an expert in RNAi, so maybe this suggestion is not relevant, but it should, at least, be addressed somewhere in the manuscript that such off-target effects are very unlikely or impossible, in that case, or more generally.

      We acknowledge the reviewer’s concern about potential off-target effects of the fas5 knockdown. We actually did check initially for off-target effects on the other four previously published fas genes in N. vitripennis (Lammers et al. 2019; Buellesbach et al. 2022) and did not find any effects on their respective expressions. We now include these results as supplementary data (Figure 2-figure supplement 1). However, as mentioned in the cover letter to the editor, we discovered a previously uncharacterized fas gene in the most recent N. vitripennis genome assembly (NC_045761.1), fas6, most likely constituting a tandem gene duplication of fas5. These two genes turned out to have such high sequence similarity (> 90 %, Figure 2-figure supplement 2) that both were simultaneously downregulated by our fas5 dsRNAi construct, which we confirmed with qPCR and have now incorporated into our manuscript (Fig. 2H). Therefore, we now explicitly mention that the knockdown affects both genes, and that either one or both could have the observed phenotypic effects. Recognizing this RNAi off-target effect, we have now also incorporated a discussion of this issue in the appropriate section of the manuscript (line 364-377), as well as of the potential off-target effects of our GFP dsRNAi controls (line 262-274).

      The authors observe that the modified CHCs profiles of RNAi females reduce courtship and copulation attempts, but not antennation, by males toward live and (dead) dummy females. They show that the MB alkanes of the CHC profile are sufficient to elicit sexual behaviors from males towards dummy females and that the same fraction from extracts of fas5 knockdown females does so significantly less. From the previous data, it seems that dummy females with fas5 female's MB alkanes profile elicit more antennation than CHC-cleared dummy females, but the authors do not display data for this type of target on the figure for MB alkane behavioral experiments.

      Actually, similar proportions of males performed antennation behavior towards female dummies with the MB alkane fraction of fas5 RNAi females and towards CHC-cleared female dummies (55% and 50%, respectively, see Author response image 1 for the corresponding parts of the sub-figures 3 E and 4 D). We did not deem it necessary to show the same data on CHC-cleared female dummies in Figure 3 as well.

      Author response image 1.

      Unfortunately, the authors don't present experiments testing the effect of the non-MB alkane fractions of the CHC extracts on male behavior toward females. As such, they are not able to (and didn't) conclude that the MB-alkane is necessary to trigger the sexual behaviors of males. I believe testing this would have significantly enhanced the significance of this work. I would also have found it interesting for the authors to comment on whether they observe aggressive behavior of males towards females (live or dead) and/or whether such behavior is expected or not in inter-individual interactions in parasitoid wasps.

      In our experiment, we focus on the function of the MB-alkane fraction in female CHC profiles, and we comprehensively demonstrate in Figure 4 that the MB-alkane fraction from WT females alone is sufficient to trigger mating behavior coherent with that on alive and untreated female dummies. Therefore, we do not completely understand the reviewer’s concern about us not being “able to (and didn't) conclude that the MB-alkane is necessary to trigger the sexual behaviors of males”. We appreciate the reviewer’s suggestion of testing the non-MB alkanes (n-alkanes and n-alkenes). However, due to the experimental procedure of separating the CHC compound class fractions through elution with molecular sieves, it was not possible for us to retrieve either the whole n-alkane or n-alkene fraction (remaining bound to the sieves after separation). The role of n-alkenes in N. vitripennis is, however, considered in the discussion, as a deterrent for homosexual interactions between males (Wang et al. 2022a). Moreover, we did not observe aggressive behavior of males towards live or dead females.

      CHCs are used by insects to signal and/or recognize various traits of targets of interest, including species or groups of origin, fertility, etc. The authors claim that their experiments show the sexual attractiveness of females can be encoded in the specific ratio of MB alkanes. While I understand how they come to this conclusion, I am somewhat concerned. The authors very quickly discuss their results in light of the literature about the role of CHCs (and notably MB alkanes) in various recognition behaviors in Hymenoptera, including conspecific recognition. Previous work (cited by the authors) has shown that males recognize males from females using an alkene (Z9C31). As such, it remains possible that the "sexual attractiveness" of N. vitripennis females for males relies on them not being males and being from the right species as well. The authors do not address the question of whether the CHCs (and the MB alkanes in particular) of females signal their sex or their species. While I acknowledge that responding to this question is beyond the scope of this work, I also strongly believe that it should be discussed in the manuscript. Otherwise, non-specialist readers would not be able to understand what I believe is one of the points that could temper the conclusions from this work.

      We acknowledge the reviewer’s insight about the MB alkanes in signaling sex or species in N. vitripennis, and now include this aspect in our revised discussion (line 324-330). Moreover, we clearly demonstrate that n-alkenes have been reduced to minute trace components after our compound class separation, and the males still do not display courtship and copulation behaviors similar to WT females, thus strongly indicating that the n-alkenes do not play a role when relying solely on the changed MB-alkane patterns, further strengthening our main argument.

      References

      Benjamini, Y. and D. Yekutieli. 2001. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29:1165-1188.

      Buellesbach, J., J. Gadau, L. W. Beukeboom, F. Echinger, R. Raychoudhury, J. H. Werren, and T. Schmitt. 2013. Cuticular hydrocarbon divergence in the jewel wasp Nasonia: Evolutionary shifts in chemical communication channels? J. Evol. Biol. 26:2467-2478.

      Buellesbach, J., C. Greim, and T. Schmitt. 2014. Asymmetric interspecific mating behavior reflects incomplete prezygotic isolation in the jewel wasp genus Nasonia. Ethology 120:834-843.

      Buellesbach, J., H. Holze, L. Schrader, J. Liebig, T. Schmitt, J. Gadau, and O. Niehuis. 2022. Genetic and genomic architecture of species-specific cuticular hydrocarbon variation in parasitoid wasps. Proc. R. Soc. B 289:20220336.

      Engl, T., N. Eberl, C. Gorse, T. Krüger, T. H. P. Schmidt, R. Plarre, C. Adler, and M. Kaltenpoth. 2018. Ancient symbiosis confers desiccation resistance to stored grain pest beetles. Mol. Ecol. 27:2095-2108.

      Ferveur, J. F., J. Cortot, K. Rihani, M. Cobb, and C. Everaerts. 2018. Desiccation resistance: effect of cuticular hydrocarbons and water content in Drosophila melanogaster adults. PeerJ 6.

      Lammers, M., K. Kraaijeveld, J. Mariën, and J. Ellers. 2019. Gene expression changes associated with the evolutionary loss of a metabolic trait: lack of lipogenesis in parasitoids. BMC Genom. 20:309.

      Mair, M. M., V. Kmezic, S. Huber, B. A. Pannebakker, and J. Ruther. 2017. The chemical basis of mate recognition in two parasitoid wasp species of the genus Nasonia. Entomol. Exp. Appl. 164:1-15.

      Wang, Y., W. Sun, S. Fleischmann, J. G. Millar, J. Ruther, and E. C. Verhulst. 2022a. Silencing Doublesex expression triggers three-level pheromonal feminization in Nasonia vitripennis males. Proc. R. Soc. B 289:20212002.

      Wang, Z., J. P. Receveur, J. Pu, H. Cong, C. Richards, M. Liang, and H. Chung. 2022b. Desiccation resistance differences in Drosophila species can be largely explained by variations in cuticular hydrocarbons. eLife 11:e80859.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, the authors investigate the genes involved in the retention of eggs in Aedes aegypti females. They do so by identifying two candidate genes that are differentially expressed across the different reproductive phases and also show that the transcripts of those two genes are present in ovaries and in the proteome. Overall, I think this is interesting and impressive work that characterizes the function of those two specific protein-coding genes thoroughly. I also really enjoyed the figures. Although they were a bit packed, the visuals made it easy to follow the authors' arguments. I have a few concerns and suggested changes, listed below.

      1) These two genes/loci are definitely rapidly evolving. However, that does not automatically imply that positive selection has occurred in these genes. Clearly, you have demonstrated that these gene sequences might be important for fitness in Aedes aegypti. However, if these happen to be disordered proteins, then they would evolve rapidly, i.e., under fewer sequence constraints. In such a scenario, dN/dS values are likely to be high. Another possibility is that as these are expressed only in one tissue and most likely not expressed constitutively, they could be under relaxed constraints relative to all other genes in the genome. For instance, we know that average expression levels of protein-coding genes are highly correlated with their rate of molecular evolution (Drummond et al., 2005). Moreover, there have clearly been genome rearrangements and/or insertions/deletions in the studied gene sequences between closely-related species (as you have nicely shown), thus again dN/dS values will naturally be high. Thus, high values of dN/dS are neither surprising nor do they directly imply positive selection in this case. If the authors really want to investigate this further, they can use the McDonald Kreitman test (McDonald and Kreitman 1991) to ask if non-synonymous divergence is higher than expected. However, this test would require population-level data. Alternatively, the authors can simply discuss adaptation as a possibility along with the others suggested above. A discussion of alternative hypotheses is extremely important and must be clearly laid out.

      We agree with the reviewer’s point that rapid evolution is not the same as positive selection. We also agree with the reviewer’s point that the McDonald-Kreitman test (MK test) is more powerful than dN/dS analysis. We took advantage of a large population dataset from Rose et al. 2020. After filtering the data, we kept 454 genomes for MK tests. We found that the results for both genes lie close to the significance threshold (tweedledee p = 0.068; tweedledum p = 0.048), even though these are small genes with low Pn values. This suggests that the genes likely evolve under positive selection.
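
      For readers less familiar with the MK test, its logic can be sketched as a 2x2 contingency test on fixed differences (Dn, Ds) versus polymorphisms (Pn, Ps). The sketch below is illustrative only: it assumes a two-sided Fisher's exact test (the response does not state which statistic was used), and the counts are hypothetical, not the values from this analysis.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """Return (neutrality index, p-value) for an MK table; counts must be non-zero."""
    neutrality_index = (pn / ps) / (dn / ds)
    return neutrality_index, fisher_exact_two_sided(dn, ds, pn, ps)


# Hypothetical counts: nonsynonymous/synonymous fixed differences and polymorphisms.
ni, p_value = mk_test(dn=12, ds=10, pn=3, ps=15)
print(f"neutrality index = {ni:.3f}, p = {p_value:.4f}")
```

      A neutrality index below 1 indicates an excess of nonsynonymous divergence relative to polymorphism, which is the pattern expected under positive selection; with few polymorphic sites the test has limited power, which is why small genes with low Pn values are hard cases.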

      In line with the reviewer’s suggestion, we performed another analysis using a large amount of population data. We asked if the SNP frequencies of tweedledee and tweedledum are correlated with environmental variables. We found that when compared to a distribution of 10,000 simulated genes with randomly-sampled genetic variants, both tweedledee and tweedledum showed significant correlation to multiple ecological variables reflecting climate variability, such as mean diurnal range, temperature seasonality, and precipitation seasonality (p<0.05). These results are now incorporated into the manuscript in Figure 5 and Figure 5 – Figure supplement 1.
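
      The comparison against simulated genes can be read as an empirical p-value. The sketch below is an assumption about how such a p-value might be computed, not a description of the actual pipeline: the null values are random placeholders standing in for the statistics of the 10,000 simulated genes, and `empirical_p` is a hypothetical helper name.

```python
import random


def empirical_p(observed, null_values):
    """One-sided empirical p-value: share of null statistics at least as large as
    the observed one, with a +1 correction so the estimate is never exactly zero."""
    at_least_as_extreme = sum(1 for value in null_values if value >= observed)
    return (1 + at_least_as_extreme) / (1 + len(null_values))


# Placeholder null distribution: in a real analysis each value would be the
# gene-environment association statistic of one simulated gene built from
# randomly sampled genetic variants.
rng = random.Random(0)
null_stats = [abs(rng.gauss(0.0, 0.1)) for _ in range(10_000)]

observed_stat = 0.25  # hypothetical association statistic for the gene of interest
print(empirical_p(observed_stat, null_stats))
```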

      2) The authors show that the two genes under study are important for the retention of viable eggs. However, as these genes are close to two other conserved genes (scratch and peritrophin-like gene), it is unclear to me how it is possible to rule out the contribution of the conserved genes to the same phenotype. Is it possible that the CRISPR deletion leads to the disruption of expression of one of the other important genes nearby (i.e., in a scratch or peritrophin-like gene) as the deleted region could have included a promoter region for instance, which is causing the phenotype you observe? Since all of these genes are so close to each other, it is possible that they are co-regulated and that tweedledee and tweedledum are expressed and translated along with the scratch and peritrophin-like gene. Do we know whether their expression patterns diverge and that scratch and peritrophin-like genes do not play a role in the retention of viable eggs?

      This is a fair criticism; however, we think the chance that the phenotypes are caused by interrupting nearby genes is very low. First, peritrophin-like acts in the immune response, and scratch is a brain-biased transcription factor. Neither of the genes shows expression in the ovary before or after blood feeding (genes with TPM < 1 or 2 are generally considered unexpressed, while scratch and peritrophin-like expression levels are overall lower than 0.1 TPM).

      This suggests that peritrophin-like and scratch are not likely to function in the ovary. Thus, although we cannot completely rule out that the gene knockout impacts regulation of very distant genes, it is unlikely. Given the mounting evidence we show in this manuscript that tweedledee and tweedledum are highly translated in the ovary after blood feeding, we expect, under the principle of parsimony, that the phenotypes came from knocking out the highly expressed and translated genes.

      Reviewer #2 (Public Review):

      This manuscript is overall quite convincing, presenting a well-thought-out approach to candidate gene detection and systemic follow-ups on two genes that meet their candidate gene criteria. There are several major claims made by the authors, and some have more compelling evidence than others, but in general, the conclusions are quite sound. My main issues stem from how the strategy to identify genes playing a role in egg retention success has led to very particular genes being examined, and so I question some of the elements of the discussion focusing on the rapid evolution and taxon-uniqueness of the identified genes. In short, while I believe the authors have demonstrated that tweedledee and tweedledum play an important role in egg retention, I'm not sure whether this study should be taken as evidence that taxon-specific or rapidly evolving genes, in general, are responsible for this adaptation, or simply play an important role in it.

      We have revised the paper to make it clearer that the focus is indeed on these two genes and not on the greater question of taxon-specific or rapidly-evolving genes.

      First, the authors present evidence that Aedes aegypti females can retain eggs when a source of fresh water is lacking, confirming that females are not attracted to human forearms while retaining eggs and that up to 70% of the retained eggs hatch after retaining them for nearly a week. This ability is likely an important adaptation that allows Aedes aegypti to thrive in a broad range of conditions. The data here seem fairly compelling.

      Based on this observation, the authors reason that genes responsible for the ability to retain eggs must: 1) be highly expressed in ovaries during retention, but not before or after. 2) be taxon-specific (as this behavior seems limited to Aedes aegypti). While this approach to enriching candidate genes has proven fruitful in this particular case, I'm not sure I agree with the authors' rationale. First, even genes at a low expression in the ovaries may be crucial to egg retention. Second, while egg-laying behavior is vastly varied in insects, I'm not sure focusing on taxon-restricted genes is necessary. It is entirely possible that many of the genes identified in Figure 2E play a crucial role in egg retention evolution. These are minor issues, but they are relevant to some later points made by the authors.

      We regret framing the discovery of tweedledee and tweedledum in the original submission using this somewhat artificial set of filtering criteria. The reality is that the genes caught our attention for their novel sequence, tight genetic linkage, and interesting expression profile. That really is the focus of the paper, not these other peripheral questions that have been the focus of attention of the reviews. We really do apologize for all of the confusion about what this paper is about.

      Nonetheless, the authors provide very compelling evidence that the two genes meeting their criteria - tweedledee and tweedledum, play an important role in egg retention. The genes seem to be expressed primarily in ovaries during egg retention (some observed expression in brain/testes is expected for any gene), and the proteins they code seem to be found in elevated quantities in both ovaries and hemolymph during and immediately after egg retention. RNA for the genes is detected in follicles within the ovary, and CRISPR knockouts of both the genes lead to a large decrease in egg viability post retention.

      My earlier qualms about their search strategy relay into some issues with Figure 4, which describes how the two genes are 1) taxon-restricted and 2) have evolved very rapidly. Neither of the two statements is unexpected given the authors' search strategy. Of course, the genes examined precisely for their lack of homologs do not have any homologs. Similarly, by limiting themselves to genes that show a lack of homology (i.e. low sequence similarity) to other genes as well as genes with high expression levels in the ovaries, a higher rate of evolution is almost inevitable to infer (as ovary expressed genes tend to evolve more rapidly in mosquitoes). I agree with the authors that inferences of the evolutionary history of these genes are quite difficult because of their uniqueness, and I especially appreciate their attempts to identify homologs (although I really dislike the term "conceptualog").

      We have removed our term “conceptualog” and replaced it with the more conventional “putative ortholog”.

      This leads to my main (fairly minor) issue of the paper - the discussion on the evolutionary history of these genes and its implications (sections "Taxon-restricted genes underlie tailored adaptations in a diverse world" and "Evolutionary histories and catering to different natural histories"). As noted, inferring this history is very difficult because the authors have focused on two rapidly evolving, taxon-restricted genes. The analyses they have performed here definitely demonstrate that the genes play an important role in egg retention, however, they do not show that taxon-restricted genes play a disproportionate role in egg retention evolution. Indeed, the only data relevant to this point would be the proportion of genes in Figure 2E that are taxon-restricted (3/9), but I'm not sure what the null expectation for this proportion for highly expressed ovary genes is to begin with. Furthermore, the extremely rapid evolution of this gene makes it hard to judge how truly taxon-restricted it is. My own search of tweedle homologs identified multiple as previously having been predicted to be "Knr4/Smi1-like", and while no similar genes are located in a similar location in melanogaster, there is generally little synteny conservation in Drosophila (for instance Bhutkar et al 2008), so I'm unsure what can really be said about their evolutionary origins/lack of homologs in Drosophila.

      In short - the manuscript makes clear that tweedledee and tweedledum play an important role in egg retention in A. aegypti, nonetheless, it is not clear that this is a demonstration of how important taxon-restricted genes are to understanding the evolution of life-history strategies.

      Again, we should have never framed the paper the way we did in the original version. We make no claims whatsoever that taxon-restricted genes in general should play a role in this biology, only that the two candidate genes under study influence egg viability after extended retention. We hope that the framing is clearer in this revision.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors evaluate the involvement of the hippocampus in a fast-paced time-to-contact estimation task. They find that the hippocampus is sensitive to feedback received about accuracy on each trial and has activity that tracks behavioral improvement from trial to trial. Its activity is also related to a tendency for time estimation behavior to regress to the mean. This is a novel paradigm to explore hippocampal activity and the results are thus novel and important, but the framing as well as discussion about the meaning of the findings obscures the details of the results or stretches beyond them in many places, as detailed below.

      We thank the reviewer for their constructive feedback and were happy to read that s/he considered our approach and results as novel and important. The comments led us to conduct new fMRI analyses, to clarify various unclear phrasings regarding our methods, and to carefully assess our framing of the interpretation and scope of our results. Please find our responses to the individual points below.

      1) Some of the results appear in the posterior hippocampus and others in the anterior hippocampus. The authors do not motivate predictions for anterior vs. posterior hippocampus, and they do not discuss differences found between these areas in the Discussion. The hippocampus is treated as a unitary structure carrying out learning and updating in this task, but the distinct areas involved motivate a more nuanced picture that acknowledges that the same populations of cells may not be carrying out the various discussed functions.

      We thank the reviewer for pointing this out. We split the hippocampus into anterior and posterior sections because prior work suggested a different whole-brain connectivity and function of the two. This was mentioned in the methods section (page 15) in the initial submission but unfortunately not in the main text. Moreover, when discussing the results, we did indeed refer mostly to the hippocampus as a unitary structure for simplicity and readability, and because statements about subcomponents are true for the whole. However, we agree with the reviewer that the differences between anterior and posterior sections are very interesting, and that describing these effects in more detail might help to guide future work more precisely.

      In response to the reviewer's comment, we therefore clarified at various locations throughout the manuscript whether the respective results were observed in the posterior or anterior section of the hippocampus, and we extended our discussion to reflect the idea that different functions may be carried out by distinct populations of hippocampal cells. In addition, we also now motivate the split into the different sections better in the main text. We made the following changes.

      Page 3: “Second, we demonstrate that anterior hippocampal fMRI activity and functional connectivity tracks the behavioral feedback participants received in each trial, revealing a link between hippocampal processing and timing-task performance.”

      Page 3: “Fourth, we show that these updating signals in the posterior hippocampus were independent of the specific interval that was tested and activity in the anterior hippocampus reflected the magnitude of the behavioral regression effect in each trial.”

      Page 5: “We performed both whole-brain voxel-wise analyses as well as regions-of-interest (ROI) analysis for anterior and posterior hippocampus separately, for which prior work suggested functional differences with respect to their contributions to memory-guided behavior (Poppenk et al., 2013, Strange et al. 2014).”

      Page 9: “Because anterior and posterior sections of the hippocampus differ in whole-brain connectivity as well as in their contributions to memory-guided behavior (Strange et al. 2014), we analyzed the two sections separately.”

      Page 9: “We found that anterior hippocampal activity as well as functional connectivity reflected the feedback participants received during this task, and its activity followed the performance improvements in a temporal-context-dependent manner. Its activity reflected trial-wise behavioral biases towards the mean of the sampled intervals, and activity in the posterior hippocampus signaled sensorimotor updating independent of the specific intervals tested.”

      Page 10: “Intriguingly, the mechanisms at play may build on similar temporal coding principles as those discussed for motor timing (Yin & Troger, 2011; Eichenbaum, 2014; Howard, 2017; Palombo & Verfaellie, 2017; Nobre & van Ede, 2018; Paton & Buonomano, 2018; Bellmund et al., 2020, 2021; Shikano et al., 2021; Shimbo et al., 2021), with differential contributions of the anterior and posterior hippocampus. Note that our observation of distinct activity modulations in the anterior and posterior hippocampus suggests that the functions and coding principles discussed here may be mediated by at least partially distinct populations of hippocampal cells.”

      Page 11: Interestingly, we observed that functional connectivity of the anterior hippocampus scaled negatively (Fig. 2C) with feedback valence [...]

      2) Hippocampal activity is stronger for smaller errors, which makes the interpretation more complex than the authors acknowledge. If the hippocampus is updating sensorimotor representations, why would its activity be lower when more updating is needed?

      Indeed, we found that absolute (univariate) activity of the hippocampus scaled with feedback valence, the inverse of error (Fig. 2A). We see multiple possibilities for why this might be the case, and we discussed some of them in a dedicated discussion section (“The role of feedback in timed motor actions”). For example, prior work showed that hippocampal activity reflects behavioral feedback also in other tasks, which has been linked to learning (e.g. Schönberg et al., 2007; Cohen & Ranganath, 2007; Shohamy & Wagner, 2008; Foerde & Shohamy, 2011; Wimmer et al., 2012). In our understanding, sensorimotor updating is a form of ‘learning’ in an immediate and behaviorally adaptive manner, and we therefore consider our results consistent with this earlier work. We agree with the reviewer that in principle activity should be stronger if there was stronger sensorimotor updating, but we acknowledge that this intuition builds on an assumption about the relationship between hippocampal neural activity and the BOLD signal, which is not entirely clear. For example, prior work revealed spatially informative negative BOLD responses in the hippocampus as a function of visual stimulation (e.g. Szinte & Knapen 2020), and the effects of inhibitory activity - a leading motif in the hippocampal circuitry - on fMRI data are not fully understood. This raises the possibility that the feedback modulation we observed might also involve negative BOLD responses, which would then translate to the observed negative correlation between feedback valence and the hippocampal fMRI signal, even if the magnitude of the underlying updating mechanism was positively correlated with error. This complicates the interpretation of the direction of the effect, which is why we chose to avoid making strong conclusions about it in our manuscript. Instead, we tried discussing our results in a way that was agnostic to the direction of the feedback modulation. Importantly, hippocampal connectivity with other regions did scale positively with error (Fig. 2B), which we again discussed in the dedicated discussion section.

      In response to the reviewer’s comment, we revisited this section of our manuscript and felt the latter result deserved a better discussion. We therefore took this opportunity to extend our discussion of the connectivity results (including their relationship to the univariate-activity results as well as the direction of these effects), all while still avoiding strong conclusions about directionality. The following changes were made to the manuscript.

      Page 11: Interestingly, we observed that functional connectivity of the anterior hippocampus scaled negatively (Fig. 2C) with feedback valence, unlike its absolute activity, which scaled positively with feedback valence (Fig. 2A,B), suggesting that the two measures may be sensitive to related but distinct processes.

      Page 11: Such network-wide receptive-field re-scaling likely builds on a re-weighting of functional connections between neurons and regions, which may explain why anterior hippocampal connectivity correlated negatively with feedback valence in our data. Larger errors may have led to stronger re-scaling, which may be grounded in a corresponding change in functional connectivity.

      3) Some tests were one-tailed without justification, which reduces confidence in the robustness of the results.

      We thank the reviewer for pointing us to the fact that our choice of statistical tests was not always clear in the manuscript. In the analysis the reviewer is referring to, we predicted that stronger sensorimotor updating should lead to stronger activity as well as larger behavioral improvements across the respective trials. This is because a stronger update should translate to a more accurate “internal model” of the task and therefore to a better performance. We tested this one-sided hypothesis using the appropriate test statistic (contrasting trials in which behavioral performance did improve versus trials in which it did not improve), but we did not motivate our reasoning well enough in the manuscript. The revised manuscript therefore includes the two new statements shown below to motivate our choice of test statistic more clearly.

      Page 7: [...] we contrasted trials in which participants had improved versus the ones in which they had not improved or got worse (see methods for details). Because stronger sensorimotor updating should lead to larger performance improvements, we predicted to find stronger activity for improvements vs. no improvements in these tests (one-tailed hypothesis).

      Page 18: These two regressors reflect the tests for target-TTC-independent and target-TTC-specific updating, respectively. Because we predicted to find stronger activity for improvements vs. no improvements in behavioral performance, we here performed one-tailed statistical tests, consistent with the direction of this hypothesis. Improvement in performance was defined as receiving feedback of higher valence than in the corresponding previous trial.
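      The improvement definition quoted above can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name `improvement_flags`, the integer coding of feedback valence, and the reading of "corresponding previous trial" as either the immediately preceding trial (target-TTC-independent) or the most recent earlier trial with the same target TTC (target-TTC-specific) are assumptions made for illustration.

```python
def improvement_flags(valence, target_ttc=None):
    """Flag each trial as improved (True), not improved (False), or undefined (None).

    valence:    per-trial feedback valence, higher = better (hypothetical coding).
    target_ttc: optional per-trial target-TTC labels. If given, each trial is
                compared with the most recent earlier trial of the same target
                TTC (assumed reading of "target-TTC-specific"); otherwise with
                the immediately preceding trial.
    """
    flags = []
    last_overall = None   # valence of the immediately preceding trial
    last_by_ttc = {}      # valence of the last trial seen for each target TTC
    for i, v in enumerate(valence):
        if target_ttc is None:
            prev = last_overall
        else:
            prev = last_by_ttc.get(target_ttc[i])
        # No reference trial yet: improvement is undefined for this trial.
        flags.append(None if prev is None else v > prev)
        last_overall = v
        if target_ttc is not None:
            last_by_ttc[target_ttc[i]] = v
    return flags


# Hypothetical example: 0 = low, 1 = medium, 2 = high accuracy feedback.
valence = [0, 1, 1, 2, 0]
independent = improvement_flags(valence)
specific = improvement_flags(valence, target_ttc=["a", "b", "a", "b", "a"])
```

      Trials flagged `True` would enter the "improved" condition and trials flagged `False` the "not improved or got worse" condition of the contrast described above.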

      4) The introduction motivates the novelty of this study based on the idea that the hippocampus has traditionally been thought to be involved in memory at the scale of days and weeks. However, as is partially acknowledged later in the Discussion, there is an enormous literature on hippocampal involvement in memory at a much shorter timescale (on the order of seconds). The novelty of this study is not in the timescale as much as in the sensorimotor nature of the task.

      We thank the reviewer for this helpful suggestion. We agree that a key part of the novelty of this study is the use of the task that is typically used to study sensorimotor integration and timing rather than hippocampal processing, along with the new insights this task enabled about the role of the hippocampus in sensorimotor updating. As mentioned in the discussion, we also agree with the reviewer that there is prior literature linking hippocampal activity to mnemonic processing on short time scales. We therefore rephrased the corresponding section in the introduction to put more weight on the sensorimotor nature of our task instead of the time scales.

      Note that the new statement still includes the time scale of the effects, but it is no longer at the center of the argument. We chose to keep it in because we do think that the majority of studies on hippocampal-dependent memory functions focus on longer time scales than our study does, and we expect that many readers will be surprised by the immediacy with which hippocampal activity relates to ongoing behavioral performance (on ultrashort time scales).

      We changed the introduction to the following.

      Page 2: Here, we approach this question with a new perspective by converging two parallel lines of research centered on sensorimotor timing and hippocampal-dependent cognitive mapping. Specifically, we test how the human hippocampus, an area often implicated in episodic-memory formation (Schiller et al., 2015; Eichenbaum, 2017), may support the flexible updating of sensorimotor representations in real time and in concert with other regions. Importantly, the hippocampus is not traditionally thought to support sensorimotor functions, and its contributions to memory formation are typically discussed for longer time scales (hours, days, weeks). Here, however, we characterize in detail the relationship between hippocampal activity and real-time behavioral performance in a fast-paced timing task, which is traditionally believed to be hippocampal-independent. We propose that the capacity of the hippocampus to encode statistical regularities of our environment (Doeller et al. 2005, Shapiro et al. 2017, Behrens et al., 2018; Momennejad, 2020; Whittington et al., 2020) situates it at the core of a brain-wide network balancing specificity vs. regularization in real time as the relevant behavior is performed.

      5) The authors used three different regressors for the three feedback levels, as opposed to a parametric regressor indexing the level of feedback. The predictions are parametric, so a parametric regressor would be a better match, and would allow for the use of all the medium-accuracy data.

      The reviewer raises a good point that overlaps with question 3 by reviewer 2. In the current analysis, we model the three feedback levels with three independent regressors (high, medium, low accuracy). We then contrast high vs. low accuracy feedback, obtaining the results shown in Fig. 2AB. The beta estimates obtained for medium-accuracy feedback are ignored in this contrast. Following the reviewer’s feedback, we therefore re-ran the model, this time modeling all three feedback levels in one parametric regressor. All other regressors in the model stayed the same. Instead of contrasting high vs. low accuracy feedback, we then performed voxel-wise t-tests on the beta estimates obtained for the parametric feedback regressor.

      The results we observed were highly consistent across the two analyses, and all conclusions presented in the initial manuscript remain unchanged. While the exact t-scores differ slightly, we replicated the effects for all clusters on the voxel-wise map (on whole-brain FWE-corrected levels) as well as for the regions-of-interest analysis for anterior and posterior hippocampus. These results are presented in a new Supplementary Figure 3C.

      Note that the new Supplementary Figure 3B shows another related new analysis we conducted in response to question 4 of reviewer 2. Here, we re-ran the initial analysis with three feedback regressors, but without modeling the inter-trial interval (ITI) and the inter-session interval (ISI, i.e. the breaks participants took) to avoid model over-specification. Again, we replicated the results for all clusters and the ROI analysis, showing that the initial results we presented are robust.

      The following additions were made to the manuscript.

      Page 5: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9 × 10⁻⁴, pfwe = 0.002, d = -0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d = -0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d = -0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d = -0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Page 17: Moreover, instead of modeling the three feedback levels with three independent regressors, we repeated the analysis modeling the three feedback levels as one parametric regressor with three levels. All other regressors remained unchanged, and the model included the regressors for ITIs and ISIs. We then conducted t-tests implemented in SPM12 using the beta estimates obtained for the parametric feedback regressor (Fig. 2C). Compared to the initial analyses presented above, this has the advantage that medium-accuracy feedback trials are considered for the statistics as well.
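      The difference between the two modeling choices described above can be sketched as follows. This is an illustration only, not the SPM12 model used in the study: the level codes (-1, 0, 1), the mean-centring step, and the function names are assumptions, and convolution with a hemodynamic response function is omitted.

```python
# Hypothetical numeric codes for the three feedback levels.
LEVEL_CODES = {"low": -1, "medium": 0, "high": 1}


def indicator_regressors(feedback):
    """Three independent 0/1 regressors, one per feedback level."""
    return {
        level: [1 if f == level else 0 for f in feedback]
        for level in LEVEL_CODES
    }


def parametric_regressor(feedback):
    """One mean-centred regressor that orders all three feedback levels."""
    codes = [LEVEL_CODES[f] for f in feedback]
    centre = sum(codes) / len(codes)
    return [c - centre for c in codes]


# Hypothetical trial sequence.
feedback = ["high", "low", "medium", "high"]
indicators = indicator_regressors(feedback)
parametric = parametric_regressor(feedback)
```

      With indicator regressors, a high-vs-low contrast leaves the medium-accuracy column unused; the single parametric column assigns every trial, including medium-accuracy ones, a value that enters the estimate.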

      6) The authors claim that the results support the idea that the hippocampus is finding an "optimal trade-off between specificity and regularization". This seems overly speculative given the results presented.

      We understand the reviewer's skepticism about this statement and agree that the manuscript does not show that the hippocampus is finding the trade-off between specificity and regularization. However, this is also not exactly what the manuscript claims. Instead, it suggests that the hippocampus “may contribute” to solving this trade-off (page 3) as part of a “brain-wide network” (pages 2, 3, 9, 12). We also state that “Our [...] results suggest that this trade-off [...] is governed by many regions, updating different types of task information in parallel” (Page 11). To us, these phrasings are not equivalent, because we do not think that the role of the hippocampus in sensorimotor updating (or in any process really) can be understood independently from the rest of the brain. We do however think that our results are in line with the idea that the hippocampus contributes to solving this trade-off, and that this is exciting and surprising given the sensorimotor nature of our task, the ultrashort time scale of the underlying process, and the relationship to behavioral performance. We tried to express that some of the points discussed remain speculation, but it seems that we were not always successful in doing so in the initial submission. We apologize for the misunderstanding, have adapted the corresponding statements in the manuscript, and now express even more carefully that these ideas are speculation.

      The following changes were made to the introduction and discussion.

      Page 2: Here, we approach this question with a new perspective by converging two parallel lines of research centered on sensorimotor timing and hippocampal-dependent cognitive mapping. Specifically, we test how the human hippocampus, an area often implicated in episodic-memory formation (Schiller et al., 2015; Eichenbaum, 2017), may support the flexible updating of sensorimotor representations in real time and in concert with other regions.

      Page 12: Because hippocampal activity (Julian & Doeller, 2020) and the regression effect (Jazayeri & Shadlen, 2010) were previously linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. This may explain why hippocampal activity reflected the magnitude of the regression effect as well as behavioral improvements independently from TTC, and why it reflected feedback, which informed the updating of the internal prior.

      Page 12: This is in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      Page 13: This is in line with the notion that the hippocampus [...] supports finding an optimal trade off between specificity and regularization along with other regions. [...] Our results show that the hippocampus supports rapid and feedback-dependent updating of sensorimotor representations, suggesting that it is a central component of a brain-wide network balancing task specificity vs. regularization for flexible behavior in humans.

      Note that in response to comment 1 by reviewer 2, the revised manuscript now reports the results of additional behavioral analyses that support the notion that participants find an optimal trade-off between specificity and regularization over time (independent of whether the hippocampus was involved or not).

      7) The authors find that hippocampal activity is related to behavioral improvement from the prior trial. This seems to be a simple learning effect (participants can learn plenty about this task from a prior trial that does not have the exact same timing as the current trial) but is interpreted as sensitivity to temporal context. The temporal context framing seems too far removed from the analyses performed.

      We agree with the reviewer that our observation that hippocampal activity reflects TTC-independent behavioral improvements across trials could have multiple explanations. Critically, i) one of them is that the hippocampus encodes temporal context, ii) it is only one of multiple observations that we build our interpretation on, and iii) our interpretation builds on multiple earlier reports.

      Interval estimates regress toward the mean of the sampled intervals, an effect that is often referred to as the “regression effect”. This effect, which we observed in our data too (Fig. 1B), has been proposed to reflect the encoding of temporal context (e.g. Jazayeri & Shadlen 2010). Moreover, there is a large body of literature on how the hippocampus may support the encoding of spatial and temporal context (e.g. see Bellmund, Polti & Doeller 2020 for review).

      Because both hippocampal activity and the regression effect were linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. If so, one would expect that hippocampal activity should reflect behavioral improvements independently from TTC, it should reflect the magnitude of the regression effect, and it should generally reflect feedback, because it is the feedback that informs the updating of the internal prior.

      All three observations may have independent explanations indeed, but they are all also in line with the idea that the hippocampus does encode temporal context and that this explains the relationship between hippocampal activity and the regression effect. It therefore reflects a sparse and reasonable explanation in our opinion, even though it necessarily remains an interpretation. Of course, we want to be clear on what our results are and what our interpretations are.

      In response to the reviewer’s comment, we therefore toned down two of the statements that mention temporal context in the manuscript, and we removed an overly speculative statement from the results section. In addition, the discussion now describes more clearly how our results are in line with this interpretation.

      Abstract: This is in line with the idea that the hippocampus supports the rapid encoding of temporal context even on short time scales in a behavior-dependent manner.

      Page 13: This is in line with the notion that the hippocampus encodes temporal context in a behavior-dependent manner, and that it supports finding an optimal trade off between specificity and regularization along with other regions.

      Page 12: Because hippocampal activity (Julian & Doeller, 2020) and the regression effect (Jazayeri & Shadlen, 2010) were previously linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. This may explain why hippocampal activity reflected the magnitude of the regression effect as well as behavioral improvements independently from TTC, and why it reflected feedback, which informed the updating of the internal prior.

      The following statement was removed, overlapping with comment 2 by Reviewer 3:

      Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time.

      8) I am not sure the term "extraction of statistical regularities" is appropriate. The term is typically used for more complex forms of statistical relationships.

      We agree with the reviewer that this expression may be interpreted differently by different readers and are grateful to be pointed to this fact. We therefore removed it and instead added the following (hopefully less ambiguous) statement to the manuscript.

      Page 9: This study investigated how the human brain flexibly updates sensorimotor representations in a feedback-dependent manner in the service of timing behavior.

      Reviewer #2 (Public Review):

      The authors conducted a study involving functional magnetic resonance imaging and a time-to-contact estimation paradigm to investigate the contribution of the human hippocampus (HPC) to sensorimotor timing, with a particular focus on the involvement of this structure in specific vs. generalized learning. Suggestive of the former, it was found that HPC activity reflected time interval-specific improvements in performance while in support of the latter, HPC activity was also found to signal improvements in performance, which were not specific to the individual time intervals tested. Based on these findings, the authors suggest that the human HPC plays a key role in the statistical learning of temporal information as required in sensorimotor behaviour.

      By considering two established functions of the HPC (i.e., temporal memory and generalization) in the context of a domain that is not typically associated with this structure (i.e., sensorimotor timing), this study is potentially important, offering novel insight into the involvement of the HPC in everyday behaviour. There is much to like about this submission: the manuscript is clearly written and well-crafted, the paradigm and analyses are well thought out and creative, the methodology is generally sound, and the reported findings push us to consider HPC function from a fresh perspective. A relative weakness of the paper is that it is not entirely clear to what extent the data, at least as currently reported, reflects the involvement of the HPC in specific and generalized learning. Since the authors' conclusions centre around this observation, clarifying this issue is, in my opinion, of primary importance.

      We thank the reviewer for these positive and extremely helpful comments, which we will address in detail below. In response to these comments, the revised manuscript clarifies why the observed performance improvements are not at odds with the idea that an optimal trade-off between specificity and regularization is found, and how the time course of learning relates to those reported in previous literature. In addition, we conducted two new fMRI analyses, ensuring that our conclusions remain unchanged even if feedback is modeled with one parametric regressor, and if the number of nuisance regressors is reduced to control for overparameterization of the model. Please find our responses underneath each individual point below.

      1) Throughout the manuscript, the authors discuss the trade-off between specific and generalized learning, and point towards Figure S1D as evidence for this (i.e., participants with higher TTC accuracy exhibited a weaker regression effect). What appears to be slightly at odds with this, however, is the observation that the deviation from true TTC decreased with time (Fig S1F) as the regression line slope approached 0.5 (Fig S1E) - one would have perhaps expected the opposite i.e., for deviation from true TTC to increase as generalization increases. To gain further insight into this, it would be helpful to see the deviation from true TTC plotted for each of the four TTC intervals separately and as a signed percentage of the target TTC interval (i.e., (+) or (-) deviation) rather than the absolute value.

      We thank the reviewer for raising this important question and for the opportunity to elaborate on the relationship between the TTC error and the magnitude of the regression effect in behavior. Indeed, we see that the regression slopes approach 0.5 and that the TTC error decreases over the course of the experiment. We do not think that these two observations are at odds with each other for the following reasons:

      First, while the reviewer is correct in pointing out that the deviation from the TTC should increase as “generalization increases”, that is not what we found. It was not the magnitude of the regularization per se that increased over time; rather, overall task performance became more optimal in the face of both objectives: specificity and generalization. This optimum is at a regression-line slope of 0.5. Generalization (or regularization, as we refer to it in the present manuscript) therefore did not increase per se at the group level.

      Second, the regression slopes approached 0.5 on the group-level, but the individual participants approached this level from different directions: Some of them started with a slope value close to 1 (high accuracy), whereas others started with a slope value close to 0 (near full regression to the mean). Irrespective of which slope value they started with, over time, they got closer to 0.5 (Rebuttal Figure 1A). This can also be seen in the fact that the group-level standard deviation in regression slopes becomes smaller over the course of the experiment (Rebuttal Figure 1B, SFig 1G). It is therefore not generally the case that the regression effect becomes stronger over time, but that it becomes more optimal for longer-term behavioral performance, which is then also reflected in an overall decrease in TTC error. Please see our response to the reviewer’s second comment for more discussion on this.

      Third, the development of task performance is a function of two behavioral factors: a) the accuracy and b) the precision in TTC estimation. Accuracy describes how similar the participant’s TTC estimates were to the true TTC, whereas precision describes how similar the participant’s TTC estimates were relative to each other (across trials). Our results reflect the fact that participants became not only more accurate over time on average, but also more precise. To demonstrate this point visually, we now plotted the Precision and the Accuracy for the 8 task segments below (Rebuttal Figure 1C, SFig 1H), showing that both measures increased as time progressed and more trials were performed. This was the case for all target durations.

      In response to the reviewer’s comment, we clarified in the main text that these findings are not at odds with each other. Furthermore, we made clear that regularization per se did not increase over time on group level. We added additional supporting figures to the supplementary material to make this point. Note that in our view, these new analyses and changes more directly address the overall question the reviewer raised than the figure that was suggested, which is why we prioritized those in the manuscript.

      However, we appreciated the suggestion a lot and added the corresponding figure for the sake of completeness.

      The following additions were made.

      Page 5: In support of this, participants' regression slopes converged over time towards the optimal value of 0.5, i.e. the slope value between veridical performance and the grand mean (Fig. S1F; linear mixed-effects model with task segment as a predictor and participants as the error term, F(1) = 8.172, p = 0.005, ε² = 0.08, CI: [0.01, 0.18]), and participants' slope values became more similar (Fig. S1G; linear regression with task segment as predictor, F(1) = 6.283, p = 0.046, ε² = 0.43, CI: [0, 1]). Consequently, this also led to an improvement in task performance over time on group level: task accuracy and precision increased (Fig. S1I), and the relationship between accuracy and precision became stronger (Fig. S1H; linear mixed-effects model results for accuracy: F(1) = 15.127, p = 1.3 × 10⁻⁴, ε² = 0.06, CI: [0.02, 0.11]; precision: F(1) = 20.189, p = 6.1 × 10⁻⁵, ε² = 0.32, CI: [0.13, 1]; accuracy-precision relationship: F(1) = 8.288, p = 0.036, ε² = 0.56, CI: [0, 1]; see methods for model details).

      Page 12: This suggests that different regions encode distinct task regularities in parallel to form optimal sensorimotor representations to balance specificity and regularization. This is in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      Page 15: We also corroborated this effect by measuring the dispersion of slope values between participants across task segments using a linear regression model with task segment as a predictor and the standard deviation of slope values across participants as the dependent variable (Fig. S1G). As a measure of behavioral performance, we computed two variables for each target-TTC level: sensorimotor timing accuracy, defined as the absolute difference in estimated and true TTC, and sensorimotor timing precision, defined as coefficient of variation (standard deviation of estimated TTCs divided by the average estimated TTC). To study the interaction between these two variables for each target TTC over time, we first normalized accuracy by the average estimated TTC in order to make both variables comparable. We then used a linear mixed-effects model with precision as the dependent variable, task segment and normalized accuracy as predictors and target TTC as the error term. In addition, we tested whether accuracy and precision increased over the course of the experiment using separate linear mixed-effects models with task segment as predictor and participants as the error term.
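      The two behavioral measures defined above can be written out as a short sketch. It follows the verbal definitions only and is not the authors' code: the function name `timing_metrics`, the averaging of absolute errors across trials, and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev


def timing_metrics(estimated_ttcs, true_ttc):
    """Accuracy and precision for one target-TTC level.

    accuracy:            mean absolute difference between estimated and true
                         TTC (smaller values mean better performance).
    precision:           coefficient of variation of the estimates, i.e. their
                         standard deviation divided by their mean.
    normalized_accuracy: accuracy divided by the mean estimate, which puts it
                         on a scale comparable to precision.
    """
    avg_estimate = mean(estimated_ttcs)
    accuracy = mean(abs(e - true_ttc) for e in estimated_ttcs)
    precision = stdev(estimated_ttcs) / avg_estimate  # sample SD (assumption)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "normalized_accuracy": accuracy / avg_estimate,
    }


# Hypothetical estimates (in seconds) for a single target TTC of 0.55 s.
metrics = timing_metrics([0.5, 0.7, 0.6], true_ttc=0.55)
```

      Computed per participant, target TTC, and task segment, such values would form the dependent variables and predictors of the mixed-effects models described in the quoted passage.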

      2) Generalization relies on prior experience and can be relatively slow to develop as is the case with statistical learning. In Jazayeri and Shadlen (2010), for instance, learning a prior distribution of 11-time intervals demarcated by two briefly flashed cues (compared to 4 intervals associated with 24 possible movement trajectories in the current study) required ~500 trials. I find it somewhat surprising, therefore, that the regression line slope was already relatively close to 0.5 in the very first segment of the task. To what extent did the participants have exposure to the task and the target intervals prior to entering the scanner?

      We thank the reviewer for raising the important question about the time course of learning in our task and how our results relate to prior work on this issue. Addressing the specific reviewer question first, participants practiced the task for 2-3 minutes prior to scanning. During the practice, they were not specifically instructed to perform the task as well as they could nor to encode the intervals, but rather to familiarize themselves with the general experimental setup and to ask potential questions outside the MRI machine. While they might have indeed started encoding the prior distribution of intervals during the practice already, we have no way of knowing, and we expect the contribution of this practice on the time course of learning during scanning to be negligible (for the reasons outlined above).

      However, in addition to the specific question the reviewer asked, we feel that the comment raises two more general points: 1) How long does it take to learn the prior distribution of a set of intervals as a function of the number of intervals tested, and 2) Why are the learning slopes we report quite shallow already in the beginning of the scan?

      Regarding (1), we are not aware of published reports that answer this question directly, and we expect that this will depend on the task that is used. Regarding the comparison to Jazayeri & Shadlen (2010), we believe the learning time course is difficult to compare between our study and theirs. As the reviewer mentioned, our study featured only 4 intervals compared to 11 in their work, based on which we would expect much faster learning in our task than in theirs. We did indeed sample 24 movement directions, but these were irrelevant in terms of learning the interval distribution. Moreover, unlike Jazayeri & Shadlen (2010), our task featured moving stimuli, which may have added additional sensory, motor and proprioceptive information in our study which the participants of the prior study could not rely on.

      Regarding (2), and overlapping with the reviewer’s previous comment, the average learning slope in our study is indeed close to 0.5 already in the first task segment, but we would like to highlight that this is a group-level measure. The learning slopes of some subjects were closer to 1 (i.e. the diagonal in Fig 1B), and those of others were closer to 0 (i.e. the mean) in the beginning of the experiment. The median slope was close to 0.65. Importantly, the slopes of most participants still approached 0.5 in the course of the experiment, and so did the group-level slope the reviewer is referring to. This also means that participants’ slopes became more similar in the course of the experiment, and they approached 0.5, which we think reflects the optimal trade-off between regressing towards the mean and regressing towards the diagonal (in the data shown in Fig. 1B). This convergence onto the optimal trade-off value can be seen in many measures, including the mean slope (Rebuttal Figure 1A, SFig 1F), the standard deviation in slopes (Rebuttal Figure 1B, SFig 1G) as well as the Precision vs. Accuracy tradeoff (Rebuttal Figure 1C, SFig 1H). We therefore think that our results are well in line with prior literature, even though a direct comparison remains difficult due to differences in the task.
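      To make the slope values discussed above concrete, the sketch below computes an ordinary least-squares slope of estimated TTC on true TTC for three idealized response profiles. The target values and the function name `regression_slope` are hypothetical, and the mixed-effects modeling used in the manuscript is not reproduced here.

```python
from statistics import mean


def regression_slope(true_ttc, estimated_ttc):
    """Ordinary least-squares slope of estimated TTC regressed on true TTC."""
    mx = mean(true_ttc)
    my = mean(estimated_ttc)
    sxy = sum((x - mx) * (y - my) for x, y in zip(true_ttc, estimated_ttc))
    sxx = sum((x - mx) ** 2 for x in true_ttc)
    return sxy / sxx


# Hypothetical target TTCs in arbitrary units (not the values used in the study).
true_ttc = [1.0, 2.0, 3.0, 4.0]
grand_mean = mean(true_ttc)

veridical = list(true_ttc)                                  # estimates on the diagonal
halfway = [0.5 * t + 0.5 * grand_mean for t in true_ttc]    # midway between diagonal and mean
full_regression = [grand_mean] * len(true_ttc)              # every estimate equals the mean

slope_veridical = regression_slope(true_ttc, veridical)
slope_halfway = regression_slope(true_ttc, halfway)
slope_full = regression_slope(true_ttc, full_regression)
```

      A slope of 1 corresponds to responses on the diagonal, a slope of 0 to full regression to the mean, and 0.5 to responses lying halfway between the two.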

      In response to the reviewer’s comment, and related to their first comment, we made the following addition to the discussion section.

      Page 12: This suggests that different regions encode distinct task regularities in parallel to form optimal sensorimotor representations to balance specificity and regularization. This is well in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      3) I am curious to know whether differences between high-accuracy and medium-accuracy feedback as well as between medium-accuracy and low-accuracy feedback predicted hippocampal activity in the first GLM analysis (middle page 5). Currently, the authors only present the findings for the contrast between high-accuracy and low-accuracy feedback. Examining all feedback levels may provide additional insight into the nature of hippocampal involvement and is perhaps more consistent with the subsequent GLM analysis (bottom page 6) in which, according to my understanding, all improvements across subsequent trials were considered (i.e., from low-accuracy to medium-accuracy; medium-accuracy to high-accuracy; as well as low-accuracy to high-accuracy).

      We thank the reviewer for this thoughtful question, which relates to question 5 by reviewer 1. The reviewer is correct that the contrast shown in Fig 2 does not consider the medium-accuracy feedback levels, and that the model in itself is slightly different from the one used in the subsequent analysis presented in Fig. 3. To reply to this comment as well as to a related one by reviewer 1 together, we therefore repeated the full analysis while modeling the three feedback levels in one parametric regressor, which includes the medium-accuracy feedback trials, and is consistent with the analysis shown in Fig. 3. The results of this new analysis are presented in the new Supplementary Fig. 3C.

      In short, the model included one parametric regressor with three levels reflecting the three types of feedback, and all nuisance regressors remained unchanged. Instead of contrasting high vs. low accuracy feedback, we then performed voxel-wise t-tests on the beta estimates obtained for the parametric feedback regressor. We found that our results presented initially were very robust: Both the observed clusters in the voxel-wise analysis (on whole-brain FWE-corrected levels) as well as the ROI results replicated across the two analyses, and our conclusions therefore remain unchanged.
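      As an illustration only (not the code used in the study), a three-level parametric feedback regressor can be sketched as below. The numeric coding of the levels (1 = low, 2 = medium, 3 = high) and the mean-centering step are assumptions made for the example; the actual values would then be passed to the GLM software as a parametric modulator.

```python
import numpy as np

# Hypothetical coding of the three feedback levels (an assumption for this sketch).
LEVEL_CODES = {"low": 1.0, "medium": 2.0, "high": 3.0}

def parametric_feedback_modulator(feedback_per_trial):
    """Return one mean-centered modulator value per trial."""
    codes = np.array([LEVEL_CODES[level] for level in feedback_per_trial])
    # Mean-centering keeps the modulator separable from the main trial regressor.
    return codes - codes.mean()

# Made-up feedback sequence for a short run.
trials = ["high", "low", "medium", "high", "medium", "low"]
modulator = parametric_feedback_modulator(trials)
print(modulator)
```

      With this coding, a single beta estimate per voxel summarizes the linear relationship between feedback level and activity, so medium-accuracy trials contribute to the statistic as well.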

      We made multiple textual additions to the manuscript to include this new analysis, and we present the results of the analysis, including a direct comparison to our initial results, in the new Supplementary Fig. 3. The following textual additions were made.

      Page 5: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9 × 10⁻⁴, pfwe = 0.002, d=-0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d=-0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d=-0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d=-0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Page 17: Moreover, instead of modeling the three feedback levels with three independent regressors, we repeated the analysis modeling the three feedback levels as one parametric regressor with three levels. All other regressors remained unchanged, and the model included the regressors for ITIs and ISIs. We then conducted t-tests implemented in SPM12 using the beta estimates obtained for the parametric feedback regressor (Fig. S2C). Compared to the initial analyses presented above, this has the advantage that medium-accuracy feedback trials are considered for the statistics as well.

      4) The authors modeled the inter-trial intervals and periods of rest in their univariate GLMs. This approach of modelling all 'down time' can lead to model over-specification and inaccurate parameter estimation (e.g. Pernet, 2014). A comment on this approach as well as consideration of not modelling the inter-trial intervals would be useful.

      This is an important issue that we did not address in our initial manuscript. We are aware and agree with the reviewer’s general concern about model over-specification, which can be a big problem in regression as it leads to biased estimates. We did examine whether our model was overspecified before running it, but we did not report a formal test of it in the manuscript. We are grateful to be given the opportunity to do so now.

      In response to the reviewer’s comment, we repeated the full analysis shown in Fig. 2 while excluding the nuisance regressors for inter-trial intervals (ITI) and breaks (or inter-session intervals, ISI). All other regressors and analysis steps stayed unchanged relative to the one reported in Fig. 2. The new results are presented in a new Supplementary Figure 3B.

      Like for our previous analysis, we again see that the results we initially presented were extremely robust even on whole-brain FWE corrected levels, as well as on ROI level. Our conclusions therefore remain unchanged, and the results we presented initially are not affected by potential model overspecification. In addition to the new Supplementary Figure 3B, we made multiple textual changes to the manuscript to describe this new analysis and its implications. Note that we used the same nuisance regressors in all other GLM analyses too, meaning that it is also very unlikely that model overspecification affects any of the other results presented. We thank the reviewer for suggesting this analysis, and we feel including it in the manuscript has further strengthened the points we initially made.

      The following additions were made to the manuscript.

      Page 16: The GLM included three boxcar regressors modeling the feedback levels, one for ITIs, one for button presses and one for periods of rest (inter-session interval, ISI) [...]

      Page 16: ITIs and ISIs were modeled to reduce task-unrelated noise, but to ensure that this did not lead to over-specification of the above-described GLM, we repeated the full analysis without modeling the two. All other regressors including the main feedback regressors of interest remained unchanged, and we repeated both the voxel-wise and ROI-wise statistical tests as described above (Fig. S2B).

      Page 17: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9 × 10⁻⁴, pfwe = 0.002, d=-0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d=-0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d=-0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d=-0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Reviewer #3 (Public Review):

      This paper reports the results of an interesting fMRI study examining the neural correlates of time estimation with an elegant design and a sensorimotor timing task. Results show that hippocampal activity and connectivity are modulated by performance on the task as well as the valence of the feedback provided. This study addresses a very important question in the field which relates to the function of the hippocampus in sensorimotor timing. However, a lack of clarity in the description of the MRI results (and associated methods) currently prevents the evaluation of the results and the interpretations made by the authors. Specifically, the model testing for timing-specific/timing-independent effects is questionable and needs to be clarified. In the current form, several conclusions appear to not be fully supported by the data.

      We thank the reviewer for pointing us to many methodological points that needed clarification. We apologize for the confusion about our methods, which we clarify in the revised manuscript. Please find our responses to the individual points below.

      Major points

      Some methodological points lack clarity which makes it difficult to evaluate the results and the interpretation of the data.

      We really appreciate the many constructive comments below. We feel that clarifying these points improved our manuscript immensely.

      1) It is unclear how the 3 levels of accuracy and feedback (high, medium, and low performance) were computed. Please provide the performance range used for this classification. Was this adjusted to the participants' performance?

      The formula that describes how the response window was computed for the different speed levels was reported in the methods section of the original manuscript on page 13. It reads as follows:

      “The following formula was used to scale the response window width: d ± ((k ∗ d)/2) where d is the target TTC and k is a constant proportional to 0.3 and 0.6 for high and medium accuracy, respectively.”

      In response to the reviewer’s comment, we now additionally report the exact ranges of the different response windows in a new Supplementary Table 1 and refer to it in the Methods section as follows.

      Page 10: To calibrate performance feedback across different TTC durations, the precise response window widths of each feedback level scaled with the speed of the fixation target (Table S1).
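      To make the window formula concrete, the following minimal Python sketch computes d ± ((k ∗ d)/2) and assigns a feedback level to a response. It is an illustration, not the experiment code: it assumes k is exactly 0.3 and 0.6 (the text only says "proportional to"), treats the window bounds as inclusive, and uses a made-up target TTC of 1.0 s.

```python
def response_window(target_ttc, k):
    """Return (lower, upper) bounds of the window d ± ((k * d) / 2)."""
    half_width = (k * target_ttc) / 2
    return target_ttc - half_width, target_ttc + half_width

def feedback_level(response_time, target_ttc, k_high=0.3, k_medium=0.6):
    """Classify a response as 'high', 'medium' or 'low' accuracy.

    k_high and k_medium are assumed values for this sketch.
    """
    lower, upper = response_window(target_ttc, k_high)
    if lower <= response_time <= upper:
        return "high"
    lower, upper = response_window(target_ttc, k_medium)
    if lower <= response_time <= upper:
        return "medium"
    return "low"

# Hypothetical target TTC of 1.0 s.
print(response_window(1.0, 0.3))
print(feedback_level(1.10, 1.0), feedback_level(1.25, 1.0), feedback_level(1.40, 1.0))
```

      Because the half-width is proportional to d, the windows widen for longer target TTCs, which is what calibrates the feedback across speed levels.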

      2) The description of the MRI results lacks details. It is not always clear in the results section which models were used and whether parametric modulators were included or not in the model. This makes the results section difficult to follow. For example,

      a) Figure 2: According to the description in the text, it appears that panels A and B report the results of a model with 3 regressors, ie one for each accuracy/feedback level (high, medium, low) without parametric modulators included. However, the figure legend for panel B mentions a parametric modulator suggesting that feedback was modelled for each trial as a parametric modulator. The distinction between these 2 models must be clarified in the result section.

      We thank the reviewer very much for spotting this discrepancy. Indeed, Figure 2 shows the results obtained for a GLM in which we modeled the three feedback levels with separate regressors, not with one parametric regressor. Instead, the latter was the case for Figure 3. We apologize for the confusion and corrected the description in the figure caption, which now reads as follows. The description in the main text and the methods remain unchanged.

      Caption Fig. 2: We plot the beta estimates obtained for the contrast between high vs. low feedback.

      Moreover, note that in response to comment 5 by reviewer 1 and comment 3 by reviewer 2, the revised manuscript now additionally reports the results obtained for the parametric regressor in the new Supplementary Figure 3C. All conclusions remain unchanged.

      Additionally, it is unclear how Figure 2A supports the following statement: "Moreover, the voxel-wise analysis revealed similar feedback-related activity in the thalamus and the striatum (Fig. 2A), and in the hippocampus when the feedback of the current trial was modeled (Fig. S3)." This is confusing as Figure 2A reports an opposite pattern of results between the striatum/thalamus and the hippocampus. It appears that the statement highlighted above is supported by results from a model including current trial feedback as a parametric modulator (reported in Figure S3).

      We agree with the reviewer that our result description was confusing and changed it. It now reads as follows.

      Page 5: Moreover, the voxel-wise analysis revealed feedback-related activity also in the thalamus and the striatum (Fig. 2A) [...]

      Also, note that it is unclear from Figure 2A what is the direction of the contrast highlighting the hippocampal cluster (high vs. low according to the text but the figure shows negative values in the hippocampus and positive values in the thalamus). These discrepancies need to be addressed and the models used to support the statements made in the results sections need to be explicitly described.

      The description of the contrast is correct. Negative values indicate smaller errors and therefore better feedback, which is mentioned in the caption of Fig. 2 as follows:

      “Negative values indicate that smaller errors, and higher-accuracy feedback, led to stronger activity.”

      Note that the timing error determined the feedback, and that we predicted stronger updating and therefore stronger activity for larger errors (similar to a prediction error). We found the opposite. We mention the reasoning behind this analysis at various locations in the manuscript e.g. when talking about the connectivity analysis:

      “We reasoned that larger timing errors and therefore low-accuracy feedback would result in stronger updating compared to smaller timing errors and high-accuracy feedback”

      In response to the reviewer’s remark, we clarified this further by adding the following statement to the result section.

      Page 5: “Using a mass-univariate general linear model (GLM), we modeled the three feedback levels with one regressor each plus additional nuisance regressors (see methods for details). The three feedback levels (high, medium and low accuracy) corresponded to small, medium and large timing errors, respectively. We then contrasted the beta weights estimated for high-accuracy vs. low-accuracy feedback and examined the effects on group-level averaged across runs.”

      b) Connectivity analyses: It is also unclear here which model was used in the PPI analyses presented in Figure 2. As it appears that the seed region was extracted from a high vs. low contrast (without modulators), the PPI should be built using the same model. I assume this was the case as the authors mentioned "These co-fluctuations were stronger when participants performed poorly in the previous trial and therefore when they received low-accuracy feedback." if this refers to low vs. high contrast. Please clarify.

      Yes, the PPI model was built using the same model. We clarified this in the methods section by adding the following statement to the PPI description.

      Page 17: “The PPI model was built using the same model that revealed the main effects used to define the HPC sphere.”

      Yes, the reviewer is correct in thinking that the contrast shows the difference between low vs. high-accuracy feedback. We clarified this in the main text as well as in the caption of Fig. 2.

      Caption Fig 2: [...] We plot results of a psychophysiological interactions (PPI) analysis conducted using the hippocampal peak effects in (A) as a seed for low vs. high-accuracy feedback. [...]

      Page 17: The estimated beta weight corresponding to the interaction term was then tested against zero on the group-level using a t-test implemented in SPM12 (Fig. 2C). The contrast reflects the difference between low vs. high-accuracy feedback. This revealed brain areas whose activity was co-varying with the hippocampus seed ROI as a function of past-trial performance (n-1).
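      For readers unfamiliar with PPI, the simulated sketch below shows what the interaction term looks like in a regression. It is a simplification under stated assumptions: all time courses are synthetic, the psychological variable is coded +1 for low-accuracy and -1 for high-accuracy feedback, and the interaction is formed directly on the observed signals, whereas SPM forms it at the estimated neural level after deconvolution.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 300

# Synthetic seed time course and psychological variable (assumptions of this sketch).
seed = rng.normal(size=n_samples)
psych = np.where(rng.random(n_samples) < 0.5, 1.0, -1.0)  # +1 = low, -1 = high accuracy

# Interaction term: product of the mean-centered seed and psychological variable.
ppi = (seed - seed.mean()) * (psych - psych.mean())

# Simulated target region whose coupling with the seed depends on feedback.
target = 0.2 * psych + 0.5 * seed + 0.8 * ppi + rng.normal(scale=0.1, size=n_samples)

# GLM with both main effects, the interaction and a constant.
X = np.column_stack([psych, seed, ppi, np.ones(n_samples)])
beta, *_ = np.linalg.lstsq(X, target, rcond=None)
print("interaction beta:", beta[2])
```

      A positive interaction beta in this coding means stronger seed-target coupling after low-accuracy than after high-accuracy feedback, which is the quantity tested against zero at the group level.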

      c) It is unclear why the model testing TTC-specific / TTC-independent effects (results presented in Figure 3) used 2 parametric modulators (as opposed to building two separate models with a different modulator each). I wonder how the authors dealt with the orthogonalization between parametric modulators with such a model. In SPM, the orthogonalization of parametric modulators is based on the order of the modulators in the design matrix. In this case, parametric modulator #2 would be orthogonalized to the preceding modulator so that a contrast focusing on the parametric modulator #2 would highlight any modulation that is above and beyond that explained by modulator #1. In this case, modulation of brain activity that is TTC-specific would have to be above and beyond a modulation that is TTC-independent to be highlighted. I am unsure that this is what the authors wanted to test here (or whether this is how the MRI design was built). Importantly, this might bias the interpretation of their results as - by design - it is less likely to observe TTC-specific modulations in the hippocampus as there is significant TTC-independent modulation. In other words, switching the order of the modulators in the model (or building two separate models) might yield different results. This is an important point to address as this might challenge the TTC-specific/TTC-independent results described in the manuscript.

      We thank the reviewer for raising this important issue. When running the respective analysis, we made sure that the regressors were not collinear and we therefore did not expect substantial overlap in shared variance between them. However, we agree with the reviewer that orthogonalizing one regressor with respect to the other could still affect the results. To make sure that our expectations were indeed met, we therefore repeated the main analysis twice: 1) switching the order of the modulators and 2) turning orthogonalization off (which is possible in SPM12 unlike in previous versions). In all cases, our key results and conclusions remained unchanged, including the central results of the hippocampus analyses.
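      The effect of modulator order can be illustrated with a small simulation. The sketch below is not the analysis code: the two modulators, their correlation and the simulated signal are all made up, and serial orthogonalization is implemented as a plain projection of one mean-centered modulator out of the other.

```python
import numpy as np

def orthogonalize(target, reference):
    """Remove from `target` the component explained by `reference` (both mean-centered)."""
    reference = reference - reference.mean()
    target = target - target.mean()
    return target - reference * (target @ reference) / (reference @ reference)

rng = np.random.default_rng(0)
n_trials = 200
mod1 = rng.normal(size=n_trials)                 # hypothetical modulator #1
mod2 = 0.5 * mod1 + rng.normal(size=n_trials)    # hypothetical, partly correlated modulator #2
signal = 1.0 * mod1 + 1.0 * mod2 + rng.normal(scale=0.1, size=n_trials)

def fit(columns):
    """Ordinary least squares with a constant; returns the modulator betas."""
    X = np.column_stack(columns + [np.ones(n_trials)])
    beta, *_ = np.linalg.lstsq(X, signal, rcond=None)
    return beta[:len(columns)]

beta_none = fit([mod1, mod2])                         # orthogonalization off
beta_serial = fit([mod1, orthogonalize(mod2, mod1)])  # mod2 orthogonalized to mod1
beta_swapped = fit([orthogonalize(mod1, mod2), mod2]) # order switched
print(beta_none, beta_serial, beta_swapped)
```

      In this toy case the estimate for the orthogonalized (later) modulator matches the estimate obtained without orthogonalization, while the earlier modulator absorbs the shared variance. The order therefore matters most when the modulators are correlated; when they are nearly uncorrelated, all three fits agree closely.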

      Anterior (ant.) / Posterior (post.) Hippocampus ROI analysis with A) original order of modulators, B) switching the order of the modulators and C) turning orthogonalization of modulators off. A-C) Orange color corresponds to the TTC-independent condition whereas light-blue color corresponds to the TTC-specific condition. Statistics reflect p<0.05 at Bonferroni corrected levels (*) obtained using a group-level one-tailed one-sample t-test against zero; A) pfwe = 0.017, B) pfwe = 0.039, C) pfwe = 0.039.

      Because orthogonalization did not affect the conclusions, the new manuscript simply reports the analysis for which it was turned off. Note that these new figures are extremely similar to the original figures we presented, which can be seen in the exemplary figure below showing our key results at a liberal threshold for transparency. In addition, we clarified that orthogonalization was turned off in the methods section as follows.

      Page 18: These two regressors reflect the tests for target-TTC-independent and target-TTC-specific updating, respectively, and they were not orthogonalized to each other.

      Comparison of old & new results: also see Fig. 3 and Fig. S5 in manuscript

      d) It is also unclear how the behavioral improvement was coded/classified "we contrasted trials in which participants had improved versus the ones in which they had not improved or got worse"- It appears that improvement computation was based on the change of feedback valence (between high, medium and low). It is unclear why performance wasn't used instead? This would provide a finer-grained modulation?

      We thank the reviewer for the opportunity to clarify this important point. First, we chose to model feedback because it is the feedback that determines whether participants update their “internal model” or not. Without feedback, they would not know how well they performed, and we would not expect to find activity related to sensorimotor updating. Second, behavioral performance and received feedback are tightly correlated, because the former determines the latter. We therefore do not expect to see major differences in results obtained between the two. Third, we did in fact model both feedback and performance in two independent GLMs, even though the way the results were reported in the initial submission made it difficult to compare the two.

      Figure 4 shows the results obtained when modeling behavioral performance in the current trial as an F-contrast, and Supplementary Fig 4 shows the results when modeling the feedback received in the current trial as a t-contrast. While the voxel-wise t-maps/F-maps are also quite similar, we now additionally report the t-contrast for the behavioral-performance GLM in a new Supplementary Figure 4C. The t-maps obtained for these two different analyses are extremely similar, confirming that the direction of the effects as well as their interpretation remain independent of whether feedback or performance is modeled.

      The revised manuscript refers to the new Supplementary Figure 4C as follows.

      Page 17: In two independent GLMs, we analyzed the time courses of all voxels in the brain as a function of behavioral performance (i.e. TTC error) in each trial, and as a function of feedback received at the end of each trial. The models included one mean-centered parametric regressor per run, modeling either the TTC error or the three feedback levels in each trial, respectively. Note that the feedback itself was a function of TTC error in each trial [...] We estimated weights for all regressors and conducted a t-test against zero using SPM12 for our feedback and performance regressors of interest on the group level (Fig. S4A). [...]

      Page 17: In addition to the voxel-wise whole-brain analyses described above, we conducted independent ROI analyses for the anterior and posterior sections of the hippocampus (Fig. S2A). Here, we tested the beta estimates obtained in our first-level analysis for the feedback and performance regressors of interest (Fig. S4B; two-tailed one-sample t tests: anterior HPC, t(33) = -5.92, p = 1.2 × 10⁻⁶, pfwe = 2.4 × 10⁻⁶, d=-1.02, CI: [-1.45, -0.6]; posterior HPC, t(33) = -4.07, p = 2.7 × 10⁻⁴, pfwe = 5.4 × 10⁻⁴, d=-0.7, CI: [-1.09, -0.32]). See section "Regions of interest definition and analysis" for more details.
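      The statistics reported for Fig. S4B can be related to each other with simple arithmetic. The sketch below assumes that the effect size is a one-sample Cohen's d computed as t divided by the square root of n (with n = df + 1 = 34), and that pfwe is a Bonferroni correction over the two hippocampal ROIs. Neither assumption is stated in the text, but both reproduce the reported values after rounding.

```python
import math

def cohens_d_from_t(t_value, df):
    """One-sample Cohen's d from a t statistic, assuming n = df + 1."""
    return t_value / math.sqrt(df + 1)

def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Values taken from the Fig. S4B statistics; two ROIs is an assumption of this sketch.
anterior_d = cohens_d_from_t(-5.92, 33)
posterior_d = cohens_d_from_t(-4.07, 33)
anterior_pfwe = bonferroni(1.2e-6, 2)
posterior_pfwe = bonferroni(2.7e-4, 2)

print(round(anterior_d, 2), round(posterior_d, 2))
print(anterior_pfwe, posterior_pfwe)
```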

      If the feedback valence was used to classify trials as improved or not, how was this modelled (one regressor for improved, one for no improvement? As opposed to a parametric modulator with performance improvement?).

      We apologize for the lack of clarity regarding our regressor design. In response to this comment, we adapted the corresponding paragraph in the methods to express more clearly that improvement trials and no-improvement trials were modeled with two separate parametric regressors - in line with the reviewer’s understanding. The new paragraph reads as follows.

      Page 18: One regressor modeled the main effect of the trial and two parametric regressors modeled the following contrasts: Parametric regressor 1: trials in which behavioral performance improved vs. parametric regressor 2: trials in which behavioral performance did not improve or got worse relative to the previous trial.

      Last, it is also unclear how ITI was modelled as a regressor. Did the authors mean a parametric modulator here? Some clarification on the events modelled would also be helpful. What was the onset of a trial in the MRI design? The start of the trial? Then end? The onset of the prediction time?

      The inter-trial intervals (ITIs) were modeled as a boxcar regressor convolved with the hemodynamic response function. They cover the time between the feedback-phase offset and the subsequent trial onset. Moreover, the start of the trial was the moment when the visual-tracking target started moving after the ITI, whereas the trial end was the offset of the feedback phase (i.e. the moment in which the feedback disappeared from the screen). The onset of the “prediction time” was the moment in which the visual-tracking target stopped moving, prompting participants to estimate the time-to-contact. We now explain this more clearly in the methods as shown below.

      Page 16: The GLM included three boxcar regressors modeling the feedback levels, one for ITIs, one for button presses and one for periods of rest (inter-session interval, ISI), which were all convolved with the canonical hemodynamic response function of SPM12. The start of the trial was considered as the trial onsets for modeling (i.e. the time when the visual-tracking target started moving). The trial end was the offset of the feedback phase (i.e. the moment in which the feedback disappeared from the screen). The ITI was the time between the offset of the feedback-phase and the subsequent trial onset.
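      As a rough illustration of what "a boxcar regressor convolved with the hemodynamic response function" means, the sketch below builds a boxcar from onsets and durations and convolves it with a double-gamma HRF. The HRF here is a common approximation (response peaking near 5 s, undershoot scaled by 1/6) and not SPM12's exact implementation; the onsets, durations and 1 s sampling interval are made up.

```python
import math
import numpy as np

def gamma_pdf(t, shape):
    """Gamma density with unit scale, evaluated on an array of times (seconds)."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    pos = t > 0
    out[pos] = np.exp((shape - 1) * np.log(t[pos]) - t[pos] - math.lgamma(shape))
    return out

def double_gamma_hrf(dt, duration=32.0):
    """Approximate canonical HRF: response gamma minus a scaled undershoot gamma."""
    t = np.arange(0, duration, dt)
    hrf = gamma_pdf(t, 6.0) - gamma_pdf(t, 16.0) / 6.0
    return hrf / hrf.sum()

def boxcar(onsets, durations, n_samples, dt):
    """1 during each event (onset <= time < onset + duration), 0 elsewhere."""
    times = np.arange(n_samples) * dt
    box = np.zeros(n_samples)
    for onset, duration in zip(onsets, durations):
        box[(times >= onset) & (times < onset + duration)] = 1.0
    return box

dt = 1.0  # hypothetical sampling interval in seconds
box = boxcar(onsets=[10.0, 40.0], durations=[4.0, 4.0], n_samples=80, dt=dt)
regressor = np.convolve(box, double_gamma_hrf(dt))[:len(box)]
print(np.argmax(regressor[:40]))
```

      The convolved regressor is zero before the first onset and peaks several seconds after it, reflecting the delay of the hemodynamic response relative to the modeled event.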

      On a related note, in response to question 4 by reviewer 2, we now repeated one of the main analyses (Fig. 2) without modeling the ITI (as well as the Inter-session interval, ISI). We found that our key results and conclusions are independent of whether or not these time points were modeled. These new results are presented in the new Supplementary Figure 3B.

      Page 16: ITIs and ISIs were modeled to reduce task-unrelated noise, but to ensure that this did not lead to over-specification of the above-described GLM, we repeated the full analysis without modeling the two. [...]

      1. Perhaps as a result of a lack of clarity in the result section and the MRI methods, it appears that some conclusions presented in the result section are not supported by the data. E.g. "Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time." The data show that hippocampal activity is higher during and after an accurate trial. This pattern of results could be attributed to various processes such as e.g. reward or learning etc. I would recommend not providing such interpretations in the result section and addressing these points in the discussion.

      Similar to above, statements like "These results suggest that the hippocampus updates information that is independent of the target TTC". The data show that higher hippocampal activity is linked to greater improvement across trials independent of the timing of the trial. The point about updating is rather speculative and should be presented in the discussion instead of the result section.

      The reviewer is referring to two statements in the results section that reflect our interpretation rather than a description of the results. In response to the reviewer’s comment, we therefore removed the following statement from the results.

      Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time.

      In addition, we replaced the remaining statement by the following. We feel this new statement makes clear why we conducted the analysis that is described without offering an interpretation of the results that were presented before.

      Page 8: We reasoned that updating TTC-independent information may support generalization performance by means of regularizing the encoded intervals based on the temporal context in which they were encoded.

    1. Author Response

      Reviewer #2 (Public Review):

      In this MEG work employing two types of bistable perception test and unique regression analyses, the authors identified different neural frequencies to different components of visual perception: its content and stability.

      Strengths:

      This study has a nice set of three different experiments to clarify neural differences between content, memory and stability of visual perception.

      The state space analysis appears to be powerful to identify such different neural signatures for different cognitive components as well.

      Weaknesses:

      Despite such strengths, this work may have the somewhat critical weakness specified in the recommendations for the authors.

      First, in the analysis to identify content-specific neural frequency, the authors concluded that the SCP is more relevant to the visual perceptual content compared to the neural activity in the alpha and beta-band frequencies. In my impression, to claim this, it would be necessary to show statistically significant differences in the prediction accuracy between the SCP and the other frequencies. Given the not-so-high prediction accuracy seen in the SCP-based analysis, such statistical supports appear essential.

      We have now directly compared decoding accuracy for SCP and alpha/beta oscillations, which showed statistically significant differences in both the ambiguous and unambiguous conditions for both ambiguous images. We have added these results as a supplementary figure (new Figure 2—figure supplement 1).

      Second, two behavioural metrics in the neural state space analysis-i.e., Switch and Direction-may be too arbitrary. As suggested by the power-law distribution of the percept duration, the neural dynamics during seemingly stable percept may not be able to be described in linear functions. Instead, the brain may go back and forth between several neural states even when we are thinking we're experiencing stable visual consciousness. If so, the current definition of the Switch metric and Direction index, which seems to be based on the behaviour of the Switch index, may be arbitrary. In other words, I feel the authors may have to elaborate the rationale for the definitions of such metrics.

      First, we note it is generally accepted in the field that the distribution of percept durations follows a gamma distribution instead of a power-law distribution (e.g., Sterzer et al., TiCS 2009; Blake & Logothetis Nature Rev. Neurosci 2002; Kleinschmidt et al., 1998; Leopold et al., TiCS 1999), and microswitches have not been reported either using the more classic task as that employed here or the more recently developed ‘no-report’ task of using eye-tracking statistics to deduce perceptual switches without overt report (e.g., Frassle et al., J Neurosci 2014).

      Second, while brain activity may fluctuate during these time periods, it never crosses the threshold of evoking a conscious report, and thus we would expect that such fluctuations, if they do occur, would be of a lower magnitude than those that do produce a conscious report.

      Most importantly, our goal here is to define behavioral metrics in order to identify components of neural dynamics underpinning the relevant aspect of behavior. As such, our definition of the behavioral metric should not be directly informed by observed spontaneous dynamics of brain activity (especially those that may be observed in the data but are of unclear relevance to perceptual switching); otherwise the analysis would be prone to circularity and spurious correlations (i.e., using observed brain dynamics to inform construction of behavioral metrics might pick up aspect of brain dynamics not really relevant to behavior in the analysis results).

      Finally, the timing characteristics of ‘Switch’ and ‘Direction’ behavioral metrics are not arbitrary; instead they are the simplest behavioral functions that allow a comparison of pre- and post-switching periods (or when the percepts might be in the ‘stabilizing’ phase vs. the ‘destabilizing’ phase). Nevertheless, the regression analysis can pick up on other temporal patterns of changes not exactly the same as our defined behavioral metric. This can be seen for SCP and beta activity projected onto the Direction axis, where it has the lowest value at around the 20th percentile of the trial (not the 50th percentile as assumed by the behavioral metric). To confirm that the analysis is not highly dependent on the precise timing definition of the behavioral metrics, we ran a control analysis, where the switching point was set at the 30th percentile (rather than the 50th percentile as in the original analysis). This control analysis resulted in a similar pattern of neural results (Figure R1).

      Figure R1: Changing temporal behavior definition (switching point moved from 50th percentile to 30th percentile of percept duration) does not significantly alter the neural results. Compare to Figure 4—figure supplement 1, ‘Switch’ and ‘Direction’ Columns.

    1. Author Response

      Public Evaluation Summary

      The authors aim to tackle a fundamental question with their study: whether there is a direct age-associated increase of transcriptional noise. To investigate this question, they develop tools to analyze single-cell sequencing data from mouse and human aging datasets. Ultimately, application of their novel tool (Scallop) suggests that transcriptional noise does not change with age; rather, changes in transcriptional noise can be attributed to other sources such as subtle shifts in cell identity. This study is in principle of broad interest, but it currently lacks a definitive demonstration of the robustness of Scallop. Systematic testing of this new package would ultimately strengthen the key conclusion of the work and give additional users more confidence when using the tool to estimate expression noise.

      We have now attempted to further demonstrate the robustness of Scallop by performing a more systematic analysis and a side-by-side comparison to other existing methods using a set of artificially generated datasets. These analyses have resulted in the inclusion of six supplementary figures that are presented in the subsections Scallop membership score accurately identifies transcriptionally noisy cells, Ability to detect noisy cells within cell types, Effect of cellular composition, Effect of dataset size, Effect of feature expression and Effect of cell type marker expression within the Results section of the revised manuscript.

      We have also included a supplementary figure showing an in-depth analysis of a dataset where an age-associated increase in transcriptional noise was detected using alternative methods, but whose closer dissection has revealed that the difference in noise is due to a single donor and to the choice of methods. We discuss this in the subsection Distance-to-centroid methods detect transcriptionally stable cell subtypes as transcriptional noise within the Results section.

      Finally, we have revised the manuscript to clarify the main points raised by the reviewers: the definition of transcriptional noise, the reasoning behind the choice of the single-cell aging datasets and the rationale behind the Leiden algorithm. Also, we have expanded the description of the method to make the definition of membership score clearer to the readers, and discussed the implications of our main findings (a lack of evidence for age-related transcriptional noise) in the broader context of theories of aging.

      Reviewer #1 (Public Review):

      In the present study, Ibanez-Sole et al evaluate transcriptional noise across aging and tissues in several publicly available mouse and human datasets. Initially, the authors compare 4 generalized approaches to quantify transcriptional noise across cell types and later implement a new approach which uses iterative clustering to assess cellular noise. Based on implementation of this approach (scallop), the authors survey noise across seven sc-seq datasets relevant for aging. Here, the authors conclude that enhanced transcriptional noise is not a hallmark of aging, rather changes in cell identity and abundances, namely immune and endothelial cells. The development of new tools to quantify transcriptional noise from sc-seq data presents appeal, as these datasets are increasing exponentially. Further, the conclusion that increased transcriptional noise is not a defined aspect of aging is clearly an important contribution; however, given the provocative nature of this claim, more comprehensive and systematic analyses should be performed. In particular, the robustness and appeal of scallop is still not sufficiently demonstrated and given the complexity (multiple tissues, species and diverse relative age ranges) of datasets analyzed, a more thorough comparison should be performed. I list a few thoughts below:

      Initially, the authors develop Decibel, which centralizes noise quantification methods. The authors provide schematics shown in Fig 1, and compare noise estimates with aging in Fig 2 - Supplement 2. Since the authors emphasize the necessary use of scallop as a ”better” pipeline, more systematic comparisons to the other methods should be made side-by-side.

      We thank the reviewer for their positive assessment of the manuscript and their suggestions. We agree that side-by-side benchmarking of Scallop against the methods implemented in Decibel, together with a more thorough analysis of the effect that features such as dataset size and cellular composition have on the output of Scallop, will reinforce the main points of the manuscript. To address these requests, we took advantage of a set of four artificial datasets that we had previously generated with the R package splatter (v1.10.1; as described in Ascensión et al. [1]). In the present work, we first ran a side-by-side comparison between Scallop and two distance-to-centroid (DTC) methods on the four artificial datasets, which contain increasing degrees of transcriptional noise (the novel data are included as Figure 1 – Figure supplement 1 in the revised manuscript). Then, we compared Scallop to one DTC method regarding their ability to detect noisy cells in different cell types (Figure 1 – Figure supplement 2). Finally, we implemented four simulations to test the effect of the following features on the performance of Scallop: cellular composition (Figure 1 – Figure supplement 3), dataset size (Figure 1 – Figure supplement 4), number of genes (Figure 1 – Figure supplement 5) and marker gene expression (Figure 1 – Figure supplement 6). A summary of these results follows.

      Side-by-side comparison of Scallop vs DTC methods

      Each of the four artificial datasets used consists of 10K cells from 9 populations, named Group1 to Group9, with the following relative abundances: 25, 20, 15, 10, 10, 7, 5.5, 4, and 3.5%, respectively. The four datasets differ only in the de.prob parameter used in their generation. The de.prob parameter determines the probability that a gene is differentially expressed between subpopulations within the dataset. The greater the de.prob value, the more differentially expressed genes there will be between clusters, meaning that the different cell types present in the dataset will cluster more robustly. Decreasing the value of de.prob results in datasets with noisy cells, whose populations do not have such a strong transcriptional signature. In order to study how Scallop can capture the degree of robustness with which cells of the same cell type cluster together, we selected four de.prob values (0.05, 0.016, 0.01 and 0.005) and measured transcriptional noise using Scallop and two DTC methods: the whole transcriptome-based Euclidean distance to cell type mean and the invariant gene-based Euclidean distance to tissue mean expression. These two methods were selected because GCL does not yield a transcriptional noise measure per cell, so no comparisons can be made with respect to the number of noisy cells the method is able to detect within a cluster. Similarly, comparing Scallop to the ERCC spike-in-based method was not possible for artificial datasets. Importantly, these analyses showed that Scallop, unlike DTC methods, was able to distinguish the core transcriptionally stable cells within each cell type cluster from the noisier cells that lie in between clusters (provided in Figure 1 - Supplement 1 of the revised manuscript).
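
      As an illustration of what a distance-to-centroid measure computes, the following is a minimal sketch of the Euclidean distance from each cell to the mean expression profile of its own cell type. It is not the Decibel implementation: the function name, the toy expression matrix and the labels are all hypothetical, and no normalization or gene selection is applied.

```python
import numpy as np

def distance_to_celltype_mean(expr, labels):
    """Euclidean distance from each cell to the centroid of its cell type.

    expr:   (n_cells, n_genes) array of expression values (hypothetical input).
    labels: length-n_cells sequence of cell type labels.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    dist = np.empty(expr.shape[0])
    for cell_type in np.unique(labels):
        mask = labels == cell_type
        centroid = expr[mask].mean(axis=0)  # mean profile of this cell type
        dist[mask] = np.linalg.norm(expr[mask] - centroid, axis=1)
    return dist

# Toy data: two cell types with two cells each (illustrative values only).
expr = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 14.0]])
labels = ["A", "A", "B", "B"]
dtc = distance_to_celltype_mean(expr, labels)
```

      Note that a measure of this kind grows with the spread of a cluster around a single centroid, so a cell type made of several tight subtypes would score as noisy even if every cell clusters consistently.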

      Effect of dataset features on the performance of Scallop

      We simulated five artificial datasets with the same nine cell type populations but whose relative abundances were different between datasets. We used the imbalance degree (ID) to measure class imbalance in each of them and to make sure that the selected cell compositions represented a wide range of imbalance degrees (to this end, we explored ID values between 1.2 and 5.3). The ID provides a normalized summary of the extent of class imbalance in a dataset in so-called ”multiclass” settings, that is to say, where more than two classes are present. It was specifically developed to improve the commonly used imbalance ratio (IR) measurement, whose calculation only considers the abundance of the most and the least popular classes and which gives the same summary for datasets with different numbers of minority classes. The presence of multiple minority classes is not uncommon in single-cell RNAseq datasets, as tissues might contain several rare cell types. We observed that the transcriptional noise measurements provided by Scallop were very robust to changes in imbalance degree (see Figure 1 - Supplement 3), both in qualitative and in quantitative terms. For instance, Group2 and Group8 were always detected as the most stable and noisiest cell types, respectively, regardless of their relative abundance in the dataset, and their average percentage of noise had little variation between different ID values: it ranged between 0-0.14% (Group2) and 16-18% (Group8).
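
      The limitation of the imbalance ratio mentioned above can be seen in a short sketch. The nine abundances are those of the artificial datasets described earlier; the two four-class compositions are hypothetical and chosen only to show that IR ignores how many minority classes exist. The imbalance degree itself is not implemented here.

```python
def imbalance_ratio(abundances):
    """Imbalance ratio: abundance of the largest class over the smallest."""
    return max(abundances) / min(abundances)

# Relative abundances (%) of Group1..Group9 in the artificial datasets.
composition = [25, 20, 15, 10, 10, 7, 5.5, 4, 3.5]
ir_composition = imbalance_ratio(composition)

# Hypothetical cell counts: one rare class versus three rare classes.
one_rare = [250, 250, 250, 35]
many_rare = [250, 35, 35, 35]
same_summary = imbalance_ratio(one_rare) == imbalance_ratio(many_rare)
```

      Both hypothetical compositions get the same IR even though the second has three minority classes, which is the situation a multiclass measure such as ID is meant to tell apart.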

      The effect of dataset size (number of cells) and the number of genes was evaluated by generating versions of an artificial dataset where cells/genes had been subsampled from an original artificial dataset (the one generated with de.prob=0.001). We tested datasets sized 1,000-10,000 cells and with a number of genes between 5,000 and 14,000. Dataset size had nearly no impact on the transcriptional noise measurements provided by Scallop (Figure 1 - Supplement 4 of the revised manuscript). The average percentage of transcriptional noise per cell type remained within a narrow range as we implemented a ten-fold increase in dataset size. Perhaps more strikingly, removing the expression of most genes did not substantially impact transcriptional noise measurements per cell type (Figure 1 - Supplement 5). The variation when removing half of the genes (7,000 genes) was minimal, and we did not see important changes in transcriptional noise measurements unless over 60% of the genes from the original dataset were removed. For example, Figure 1 - Supplement 5C shows that noise measurements suffer important variations when removing 8,000 and 9,000 genes (and therefore keeping 6,000 and 5,000 genes, respectively), but only some cell types (Groups 4, 7, 8 and 9) were affected by these variations.

      In order to measure the effect marker gene expression has on the membership with which cells are assigned to their cell type cluster, we ran a simulation where the top 10 markers for a cell type were removed from the dataset one by one, so that the first simulation lacked the expression of the Top1 marker, the second simulation had the effect of the first 2 markers removed (Top1 and Top2), and so on. Then, we ran Scallop on each of the resulting datasets and observed a steady increase in transcriptional noise associated with that cell type. This provided evidence that the strength of cell type marker expression in a cluster is directly related to its transcriptional stability (or lack of transcriptional noise). We included the result of this experiment in the revised version of the manuscript (Figure 1 - Supplement 6).
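
      The cumulative marker removal described above can be sketched as follows. This is only an outline of the data manipulation, under the assumption that the expression matrix is a dense cells-by-genes array; the gene names, the ranking and the helper function are hypothetical, and the step of running Scallop on each resulting dataset is not shown.

```python
import numpy as np

def remove_top_markers(expr, gene_names, ranked_markers, k):
    """Drop the k highest-ranked marker genes from a cells-by-genes matrix."""
    to_drop = set(ranked_markers[:k])
    keep = np.array([gene not in to_drop for gene in gene_names])
    kept_genes = [gene for gene in gene_names if gene not in to_drop]
    return expr[:, keep], kept_genes

# Toy dataset: 3 cells, 5 genes; 'ranked' lists markers from Top1 downwards.
genes = ["g1", "g2", "g3", "g4", "g5"]
expr = np.arange(15).reshape(3, 5)
ranked = ["g3", "g1", "g5"]

# Dataset k lacks the Top1..Topk markers, mirroring the simulation design.
datasets = [remove_top_markers(expr, genes, ranked, k)
            for k in range(1, len(ranked) + 1)]
```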

      In conclusion, by using artificially generated datasets where the ground truth (cell type labels, degree of noise, etc) was known, the newly provided systematic analyses showed that Scallop had a remarkably robust response to said changes in dataset features, further reinforcing the manuscript conclusions.

      For example, scallop noise estimates (Fig 2) compared to other euclidean distance-based measures (Fig 2 supplement 2) looks fairly similar.

      It is true that some datasets show similar trends regardless of the transcriptional noise quantification method. For instance, the murine brain dataset by Ximerakis et al. shows no overall change in noise between the age groups across different methods. However, we do observe important differences in other examples. This is the case of the human pancreas dataset by Enge et al. and the human skin dataset by Solé-Boldo et al., where not only the magnitude but also the directionality of the trend are different depending on the method used to measure noise. In the former, three methods (Scallop, invariant gene-based Euclidean distance to average tissue expression and GCL) show an age-related increase in noise, whereas one method (whole transcriptome-based Euclidean distance to the cell type mean) shows a decrease in noise. In the latter, two methods (Scallop and GCL) yield a decrease in noise and the two DTC methods measure a mild increase in noise. These inconsistencies can now be reconciled with our proposed explanation that said ”noise” may actually be referring to substantially different biology in the diverse experimental settings.

      Are downstream observations (ex lung immune composition changes more than noise) supported from these methods as well? If so, this would strengthen the overall conclusion on noise with age, but if not, it would be relevant to understand why.

      Studying changes in cell type composition in the lung and other aged tissues would be highly pertinent. Nevertheless, we have measured changes in cell type composition using only one method that is based on Generalized Linear Models, covered in the subsection Age-related cell type enrichment of the Methods. The methods that we have compared in our study (DTC methods, ERCC-based methods, GCL, etc.) were all designed to measure transcriptional noise, but not changes in cell type composition.

      Whether the effects of cell type composition changes are bigger than changes in noise for the rest of the methods used to measure noise was probably not clear enough in the original manuscript. We found no evidence for an increase in noise associated with aging, regardless of the method used. Although not included in the manuscript, we did generate heatmaps similar to the one shown in Figure 3B for each of the noise quantification methods. However, as the heatmap on the right side (the one showing cell type enrichment) was identical in each figure, we considered them to be redundant and decided not to include them, since they did not provide any additional insight besides giving more examples of lack of evidence for transcriptional noise, this time at the cell type level. We consider that the lack of evidence was already well demonstrated in the previous analyses (Figure 2 and Figure 2 - Supplement 2).

      Similarly, the validation of scallop seems mostly based on the ability to localize noisy vs stable cells in Fig 1 supplement 1 and relative robustness within dataset to input parameters (Fig 1 supplement 2). A more systematic analysis should be performed to robustly establish this method. For example, noise cell clustering comparisons across the 7 datasets used. In addition, Levy et al. 2020 implemented a pathway-based approach to validate. Specifically, surrogate genes were derived from GCL value where KEGG preservation was used as an output. Similar additional types of analyses should be performed in scallop.

      We believe that this legitimate concern is now addressed by the newly included data, in particular by the systematic comparison between Scallop and DTC methods on three artificially generated datasets with different degrees of transcriptional noise provided in Figure 1 - Supplement 2. The ability of Scallop to detect cells that are particularly noisy within a cell type, or cells that lie between cell types, may represent its biggest advantage with respect to other methods. DTC methods fail to discern between stable and noisy cells within cell types. Also, in our analysis, DTC methods were unable to distinguish between cell types that have a marked transcriptional program (which systematically cluster together) and those that have a less clear transcriptomic identity (which have at least part of their cells assigned to other cell types across bootstrap iterations). In contrast, running Scallop on the same datasets showed that our method was able to distinguish between the two cases.

      The conclusion that immune and endothelial cell transcriptional shifts associate more with age than noise are quite compelling, but seem entirely restricted to the mouse and human lung datasets. It would be interesting to know if pan-tissues these same cell types enrich age-related effects or whether this phenomenon is localized.

      We agree with the reviewer that it would be very interesting to see whether a change in cell type composition (and particularly, an increase in abundance of immune cell types) is observed in aged tissues other than the lung. Qualitative cell type composition changes in the aging lung have been described in the literature [5]. Specifically, the higher abundance of immune cell types was observed in a single-nucleus RNAseq dataset of cardiopulmonary cells in Macaca fascicularis [6]. However, we believe that trying to answer the question whether this phenomenon holds in other tissues would require a systematic analysis of several datasets for each tissue with a sufficient number of donors/individuals in each of them. This is because our approach to measure age-associated cell type enrichment using generalized linear models relies heavily on having multiple biological replicates for each age group. Unfortunately, this is not the case for most published single-cell RNAseq datasets of aging. In any case, we have toned down the last sentence in the subsection Changes in the abundance of the immune and endothelial cell repertoires characterize the human aging lung by making it more clear that our claim regarding changes in the cellular composition of aged tissues is based on lung datasets (the text in italics represents what was added in the revised version of the manuscript):

      "Even though the evidence for changes in tissue composition are based on a single tissue, we hypothesize that these facts may have influenced previous analyses of transcriptional noise associated with aging."

      As discussed in the original manuscript, there is evidence published by other groups pointing to pan-tissue changes in cellular composition with age, which undoubtedly will influence those analyses that did not pay attention to cellular composition changes in the datasets that they compared. Cellular composition is a very important aspect that has been greatly overlooked. In fact, only one [7] out of the seven articles that had measured transcriptional noise in aging (the datasets used in Figure 2) had attempted to remove its effect by subsampling cells to balance compositions between age groups prior to their noise analysis. In any case, we do not believe this is the only phenomenon underlying the purported increase in transcriptional noise associated with age. Each dataset will most probably have different issues that the authors originally misread as an increase in noise or loss of cellular identity of a particular organ or tissue. As an additional example of such phenomena, we have now included a re-analysis of the data by Enge et al. [3] on “noisy” β-cells in the aged human pancreas (Figure 5–Figure supplement 2 of the revised manuscript). In this case, rather than an age-dependent pattern, we observe that the 21-year-old donor presents much lower transcriptional noise values than the rest of the donors. However, there is no significant difference between the 22-year-old donor and the rest of the donors. We conclude that the statistically significant differences between the “young” and “old” age categories can be attributed to the abnormal noise values obtained for the 21-year-old donor, of uncertain origin. Finding out all causes of apparent transcriptional noise in other organs and tissues would be too lengthy, and certainly out of scope for the present manuscript.

      Related to these, there does not seem to be a specific rationale for why these datasets (the seven used in total or the lung for deep-dive), were selected. Clearly, many mouse and human sc-RNA-seq datasets exist with large variations in age so expanding the datasets analyzed and/or providing sufficient rationale as to why these ones are appearing for noise analyses would be helpful. For example, querying “aging” across sc-seq datasets in Single cell portal yields 79 available datasets: https://singlecell.broadinstitute.org/single_cell?type=study&page=1&terms=aging&facets=organism_age%3A0%7C103%7Cyears.

      We now realize that the reasoning behind our selection of aging datasets was not sufficiently clear in the original manuscript. We thank the reviewer for pointing out this omission. We have made a more explicit reference to Appendices 2, 3, 4 and 6 in the revised manuscript. The seven selected scRNAseq datasets are those where transcriptional noise had originally been measured by the authors, using the computational methods that we later implemented in Decibel. Our aim was to first recapitulate previous reports of transcriptional noise using our novel method (Scallop). Thus, we downloaded all publicly available scRNAseq datasets of aged tissues where transcriptional noise had explicitly been measured. Some of them had reported an increase in transcriptional noise only in some cell types (for instance, the human aged pancreas dataset by Enge et al. [3]), whereas others found an increase in most cell types [7]. Appendix 2 summarizes the main features of those seven datasets (tissue, organism and number of cells) and provides information on whether an increase in transcriptional noise was observed in the original article where they were published. Additionally, the “scope” column indicates where that increase was found (in which cell types), and the “Method” column briefly describes the computational method used to measure transcriptional noise in that article. Appendix 3 provides information on the final datasets that were used in our analysis (Figure 2). Not every sample from the original dataset was included, so the inclusion criteria are specified there, as well as the number of cells, individuals and age of each of the cohorts. Appendix 4 shows the abnormal count distribution of two samples that were discarded from the Kimmel lung dataset. As for the selection of lung for the deep dive, the reason was that this was the organ with most datasets available, both for mouse and human. Appendix 6 provides information on the number of cells and donors per age cohort in the human lung datasets included in this study.

      We have included the following sentence in the Increased transcriptional noise is not a universal hallmark of aging subsection in the Results:

      "We provide a summary of the main characteristics of each dataset, as well as the findings regarding transcriptional noise obtained in each of the original studies, whether changes in transcriptional noise were restricted to particular cell types, and the computational method used to measure noise (see Appendix 2)."

      The analysis that noise is indistinguishable from cell fate shifts is compelling, but again relies on one specific example where alternative surfactant genes are used as markers. The same question arises if this observation holds up to other cell types within other organs. For example the human cell atlas contains dozens of tissues with large variations in age (https://www.science.org/doi/10.1126/science.abl4290).

      We sympathize with this comment but hope that the reviewer will agree with us that providing an additional example of a different phenomenon originally reported as “transcriptional noise” (in this case in aged human pancreas; see Figure 5 – Figure supplement 2), but actually reflecting something else, may be sufficient to alert interested readers. In our opinion, it is likely that diverse phenomena will underlie the purported increases in transcriptional noise, and a re-analysis should be made case-by-case. We can only hope that researchers in the field re-analyze the available aging datasets in this new light.

      Reviewer #2 (Public Review):

      In this manuscript, Ibanez-Sole et al. focus on an important open question in ageing research; ”how does transcriptional noise increase at the cellular level?”. They developed two python toolkits, one for comparison of previously described methods to measure transcriptional noise, Decibel, and another one implementing a new method of variability measure based on cluster memberships, Scallop. Using published datasets and comparing multiple methods, they suggest that increased transcriptional noise is not a fundamental property of ageing, but instead, previous reports might have been driven by age-related changes in cell type compositions.

      I would like to congratulate the authors on openly providing all code and data associated with the manuscript. The authors did not restrict their paper to one dataset or one approach but instead provided a comprehensive analysis of diverse biology across murine and human tissues.

      While the results support their main conclusions, the lack of robustness/sensitivity measures for the methods used makes it difficult to judge the biology. The authors use real data to compare between methods but using synthetic data with known artificial ’variability’ across cell clusters can first establish the methods, which would make the results more convincing and easier to interpret. Despite the comprehensive analysis of biological data, a detailed prior description of how the methods behave against e.g. the number of cells in each cell type cluster, the number of cell types in the dataset, and % feature expression, would make the paper more convincing. Once the details of the method are provided, the python toolkit can be widely used, not limited to the ageing research community. I am also concerned that a definition of ’transcriptional noise’ (e.g. genome-wide noise, transcriptional dysregulation in cell-type-specific genes, noise in certain pathways) and its interpretation with regard to the biology of ageing is missing. Differences in different methods could be explained by the different biology they capture. Moreover, the interpretation of a lack of different types of variability may not be the same for the biology of ageing.

      Increased transcriptional noise is compatible with genomic instability, loss of proteostasis and epigenetic regulation. Showing a lack of consistent transcriptional noise can challenge the widespread assumptions about how these hallmarks affect the organism. Overall, I found the paper very interesting and central to the field of ageing biology. However, I believe it requires a more detailed description of the methods and interpretations in the context of biology and theories of ageing.

      We thank the reviewer for their positive assessment of the manuscript and their suggestions. We respond to each of the specific comments below.

      Major comments

      1) The concept of transcriptional noise is central to the manuscript; however, what the authors consider as transcriptional noise and why is not clear. Genome-wide vs. function or cell-type specific noise could have different implications for the biology of ageing. In line with this, a discussion of the findings in the context of theories of ageing is necessary to understand its implications.

      We thank the reviewer for pointing out the lack of clarity in this key point. The use of the ”transcriptional noise” term in the literature is quite heterogeneous, and we agree that the lack of a consensus definition may be confusing to the reader. For this reason, we adopted in the introduction the definition by Raser and O’Shea [8] as ”the measured level of variation in gene expression among cells supposed to be identical”, i.e. the sum of both intrinsic and extrinsic noise as previously defined by Swain and colleagues [9, 10]. In our opinion, this is generally what the literature of age-associated transcriptional noise is referring to.

      With Scallop, we aimed to translate this concept to the context of single-cell RNAseq datasets, where clusters obtained using a community detection algorithm are typically annotated as distinct cell types.

      Therefore, we aimed to measure transcriptional noise here defined as ”lack of membership to cell type clusters”. When running a clustering algorithm iteratively, if a cell is not unambiguously assigned to the same cluster, we consider it to be noisy. Conversely, when a cell consistently clusters with the same group of cells, we consider it to be stable. The membership score we use as a measure of stability is the frequency with which any given cell was assigned to the same cluster across all iterations.

      We have included in the Results section an explicit reference to the Methods subsection that explains how Scallop works in detail, so that the readers can easily find that information:

      "A detailed description of the three steps of the method (bootstrapping, cluster relabeling and computation of the membership score) is provided in the Scallop subsection in the Methods."

      Additionally, we have now realized that the formula to compute the membership score might be more easily understood if we renamed the freq_score as freq_score(c), to make it clear that each cell is assigned a score. Also, we have used n and m instead of i and j in this notation, to avoid confusion with the notation used in the previous section, where i and j represented the i-th and j-th bootstrap iterations. Finally, we have included a small paragraph to clarify what each component of the formula refers to. Below we show the formula and text included in the Methods section of the revised manuscript:

      "$$\mathrm{freq\_score}(c) = \frac{\max_{n \in \mathrm{clusters}} |c_n|}{\sum_{m \in \mathrm{clusters}} |c_m|}$$

      Where $|c_n|$ is the number of times cell c was assigned to the n-th cluster, and $\sum_{m \in \mathrm{clusters}} |c_m|$ is the sum of all assignments made on cell c, which is the same as the number of times cell c was clustered across bootstrap iterations."

      Thus, and in order to accommodate this reviewer’s concerns, we have now included this exact definition of how we measure noise plus a statement making clear that we refer to the sum of both intrinsic and extrinsic noise aspects, with no distinction among them.
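
      To make the membership score concrete, here is a minimal sketch of a frequency score computed from a table of cluster assignments. It is not the Scallop code. It assumes that cluster labels have already been relabeled so that they are comparable across bootstrap iterations, that -1 marks iterations in which a cell was not sampled, and that "the same cluster" means the most frequently assigned label; the function and variable names are hypothetical.

```python
import numpy as np

def freq_score(assignments):
    """Per-cell frequency of the most commonly assigned cluster label.

    assignments: (n_cells, n_iterations) integer array of relabeled cluster
    ids, with -1 where the cell was not part of that bootstrap sample
    (an assumed encoding, used here only for illustration).
    """
    assignments = np.asarray(assignments)
    scores = np.full(assignments.shape[0], np.nan)
    for c, row in enumerate(assignments):
        seen = row[row >= 0]               # iterations where cell c was clustered
        if seen.size:
            counts = np.bincount(seen)     # counts[n] = times assigned to cluster n
            scores[c] = counts.max() / counts.sum()
    return scores

# Toy example: 3 cells, 4 bootstrap iterations.
assignments = np.array([
    [0, 0, 0, 0],    # always in cluster 0: stable
    [1, 1, 2, -1],   # sampled 3 times, twice in cluster 1
    [2, 0, 2, 1],    # spread over three clusters: noisy
])
scores = freq_score(assignments)
```

      In this toy table the first cell gets a score of 1, while the third cell, which is assigned to its most frequent cluster in only half of the iterations, gets 0.5.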

      Similarly, we had discussed our findings in the framework of different theories of aging, such as their potential relationship to some of the established hallmarks of aging (genomic instability, epigenetic deregulation and loss of proteostasis), as well as with more recent theories of aging such as cell type imbalance in aged organs [11] and inter-tissue convergence [12]. However, it is now clear to us that this was not enough, so we have expanded these paragraphs to better convey the implications of our work. More specifically:

      "Our results suggest that transcriptional noise is not a bona fide hallmark of aging. Instead, we posit that previous analyses of noise in aging scRNAseq datasets have been confounded by a number of factors, including both computational methods used for analysis as well as other biology-driven sources of variability."

      2) While I found the suggested method, Scallop, quite exciting and valuable, I would suggest including a number of performance/robustness measures (primarily based on simulations) on how sensitive the method is to the number of cells in each cell type (cellular composition), misannotations, % feature expression (number of 0s) etc.:

      We have analyzed the effect of cellular composition and the percentage of feature expression by using artificially generated datasets (see Figure 1 - Supplements 3 and 5, respectively; and section Effect of dataset features on the performance of Scallop in the response to reviewer #1). Although studying the effect of misannotations on downstream analysis is important, we believe that Scallop was already designed so that its effects could be avoided, since the membership is measured for each cluster (and not for each cell type label). That is to say, a reference clustering is obtained at the beginning of the pipeline and memberships are computed using that output as a reference, which means Scallop noise values attributed to each cell are not affected by the original labeling of the dataset.

      The output of these analyses reinforced our original conclusions, and it is now included in the Results section:

      "In order to characterize and validate our method for transcriptional noise quantification, we conducted three types of analyses. First, we used artificially generated datasets containing various degrees of transcriptional noise to compare the performance of Scallop and DTC methods side-by-side, regarding their ability to measure transcriptional noise and detect noisy cells within cell types. Next, we ran simulations using artificial datasets in order to study the effect of a number of dataset features on the performance of Scallop: cellular composition, dataset size, number of genes and marker expression. Finally, we graphically evaluated the output of Scallop on a dataset of human T cells, we analyzed its robustness to its input parameters, and we studied the relationship between membership and robust marker expression, using a PBMC dataset."

      2.1) Most importantly, knowing that cell-type composition changes with age, it is important to know how sensitive community detection is to the number of cells in each cell type. While the average can be robust, I wonder if the size of the cell-type cluster affects membership (voting).

      We have included an analysis on a set of artificial datasets with different cellular compositions to evaluate the performance of Scallop in the presence of different degrees of class imbalance (see Figure 1 - Supplement 3). We explain the output of this analysis, which reinforces the algorithm’s robustness, in the Results section:

      "Next, we ran a series of simulations on artificially generated datasets to evaluate the performance of Scallop in the presence of different levels of class imbalance, dataset size, number of genes, and different degrees of expression of cell type markers. Our analysis showed that Scallop was remarkably robust to changes in cellular composition (see Figure 1 - Supplement 3). Both the average percentage of noise and the distribution remained unchanged for a wide range of class imbalance degrees. Similarly, altering the dataset size (number of cells) and the number of genes of an artificial dataset did not cause any major changes on the transcriptional noise values attributed to each cell type (see Figure 1 - Supplements 4 and 5). Additionally, we conducted an analysis where we identified the 10 most differentially expressed gene markers for a cell type and measured the transcriptional noise associated with that cell type as we removed the expression of those genes from the dataset (Figure 1 - Supplement 5). Transcriptional noise steadily increased as we removed the effect of the top marker genes that defined the cell type under study (see Figure 1 - Supplement 5B). This experiment provides further evidence on how strong marker expression is related to robust cell type identity and how the lack of it results in transcriptional noise."

      3) Although the Leiden algorithm is widely used by many single-cell clustering methods, since the proposed methodology is heavily dependent on clustering, I suggest including a description of the Leiden algorithm.

      We agree that understanding how community detection algorithms in general –and Leiden in particular– work is crucial to understand the core of the paper, so we have included a brief introduction to these methods in the Methods section, at the beginning of the Scallop subsection:

      Leiden is a graph-based community detection algorithm that was designed to improve on the popular Louvain method [13]. Graph-based community detection methods take a graph representation of a dataset as input. In the context of single-cell RNAseq data, shared nearest neighbor (SNN) graphs are commonly used. These are graphs whose nodes represent individual cells and whose edges connect pairs of cells that are among each other's K nearest neighbors under some distance metric. The aim of community detection algorithms like Leiden is to find groups of nodes that are densely connected to one another, by optimizing modularity. For a graph with C communities, the modularity (Q) is computed by taking, for each community (group of cells), the difference between the actual number of edges in that community ($e_i$) and the expected number of edges in that community ($K_i^2/2m$):

      $$Q = \frac{1}{2m} \sum_{i=1}^{C} \left( e_i - r\,\frac{K_i^2}{2m} \right)$$

      where $K_i$ is the sum of the degrees of the nodes in community i, m is the total number of edges in the graph, and r is a resolution parameter (r > 0) that controls the number of communities: a greater resolution parameter gives more communities, whereas a lower one gives fewer. Since maximizing the modularity of a graph is an NP-hard problem, different heuristics are used, and Leiden has been shown to outperform Louvain in this task in terms of both quality and speed [14]. However, users can choose to run the Louvain method instead by setting the parameter clustering="louvain" in the initialization of the Bootstrap object.
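
      As a rough illustration of the modularity score described above, the following minimal sketch computes it for a toy graph. It is not Leiden itself (no optimization is performed) and it is not part of Scallop; the function name and the toy graph are hypothetical. It uses the standard per-community form, L_i/m - r*(K_i/2m)^2, with L_i the number of within-community edges, which matches the expression above when each within-community edge is counted from both of its endpoints.

```python
from collections import defaultdict


def modularity(edges, community, resolution=1.0):
    """Modularity of a fixed partition of an undirected, unweighted graph.

    edges      -- list of (u, v) pairs, each edge listed once
    community  -- dict mapping node -> community label
    resolution -- the resolution parameter r (> 0)
    """
    m = len(edges)
    intra_edges = defaultdict(int)  # L_i: edges with both ends in community i
    degree_sum = defaultdict(int)   # K_i: sum of node degrees in community i
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            intra_edges[community[u]] += 1
    return sum(
        intra_edges[c] / m - resolution * (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

q_split = modularity(edges, two_groups)    # 5/14, about 0.357
q_merged = modularity(edges, one_group)    # 0.0
```

      Splitting the toy graph into its two triangles scores higher than lumping all nodes together, which is the behavior a modularity-optimizing heuristic exploits. Raising `resolution` penalizes large communities more strongly, which is why higher values tend to yield more, smaller clusters.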

      3.1) Most importantly, the authors comment that they found stronger expression of cell-type specific markers in the cells with high membership values - is it already a product of the Leiden algorithm that it weighs highly variable (thus cell-type specific) features higher - resulting in better prediction of cell-types for cells with strong cell-marker expression? It is important to make a description of transcriptional noise at this stage as it could be genome-wide or more specific to cell-type markers. Can authors provide any support that their method can capture both?

      We agree with the reviewer that finding a stronger expression of cell-type markers in cells with high membership values is indeed something we expected. The graph representation of the dataset taken as input by Leiden is built after running highly variable gene detection and PCA. The neighbors of each cell are detected based on the expression of genes that are highly variable, as the reviewer pointed out, so genes that are differentially expressed between cells are more likely to contribute to the clusters found by Leiden.

      Whether Scallop measures genome-wide or cell type-specific noise (or a mixture of both) is a very interesting question. Clusters in single-cell RNA sequencing datasets are often mainly driven by the presence/absence of a few cell type markers, rather than changes in expression levels of broader sets of genes. Moreover, it has been shown that single-cell RNAseq datasets generally preserve the same population structure even after data binarization [15]. This is a consequence of the sparsity of single-cell RNAseq datasets. In our case, any difference in expression between one cluster vs the rest of the cells in the dataset –be it the expression of a gene that was not detected in the rest of the cells or a higher expression of a gene whose presence is weaker in other clusters– will certainly have an impact on the output of every downstream analysis, from clustering to dimensionality reduction. The influence of the expression of cell type-specific markers on Scallop membership has been demonstrated in several analyses. The first is the simulation where we measured the impact of removing the 10 most defining markers for a particular cell type on transcriptional noise measurements (included in Figure 1 - Supplement 6 of the revised manuscript). In addition, Figure 5 provides evidence that the differential expression of a handful of genes (in this case, genes coding for surfactant proteins) can have an impact on the clustering solutions obtained for a set of human alveolar macrophages, and this in turn influences the membership scores obtained with Scallop. In essence, Scallop merely provides a measure of the robustness of clustering at the single-cell level, so any type of transcriptional noise might have an impact on Scallop memberships, provided it is sufficiently strong to influence the output of the clustering algorithm used. In other words, the fact that Scallop membership captures a mixture of both types of noise (genome-wide and that associated with cell type-specific markers) is a consequence of the influence both types of noise have on clustering.

      4) The authors conclude that Scallop outperforms other methods through the analysis of biological data, where there is no positive and negative control. I suggest creating synthetic datasets (which could be based on real data), introducing different levels of noise artificially (considering biological constraints like max/min expression levels) and then testing the performance where the truth about each dataset is known. Otherwise, the definitions of noisy and stable cells, regardless of the method, are arbitrary.

      Our initial focus was on biological datasets, where no positive and negative controls for transcriptional noise were available, but we agree on the need to include an analysis using simulations on artificial datasets. We analyzed artificially generated datasets with known degrees of transcriptional noise in order to evaluate the performance of Scallop in a setting where the ground truth is known beforehand. We modeled transcriptional noise by tuning the de.prob parameter, which determines the probability that a gene will be differentially expressed between clusters. The creation of these datasets is explained in detail in the Methods section of the revised manuscript, specifically in the subsections “Performance of Scallop and two DTC methods on four artificial datasets with increasing transcriptional noise” and “Ability to detect noisy cells within cell types”.

      We have now included the following section in the Results:

      "We compared the output of Scallop and two DTC methods (the whole transcriptome-based Euclidean distance to average cell type expression and the invariant gene-based Euclidean distance to average tissue expression) on four artificially generated datasets containing various levels of transcriptional noise. The analysis showed that Scallop, unlike DTC methods, was able to distinguish the core transcriptionally stable cells within each cell type cluster from the noisier cells that lie in between clusters (see Figure 1 - Supplement 1). We then compared one of the DTC methods to Scallop regarding their ability to detect noisy cells within each of the cell types, by plotting the top 10% noisiest and top 10% most stable cells (see Figure 1 - Supplement 2A). Analyzing the distribution of noise values for each cell type separately revealed that Scallop can distinguish clusters that mainly consist of transcriptionally stable cells from noisier clusters that do not have such a distinct transcriptional signature (Figure 1 - Supplement 2B)."

      Reviewer #3 (Public Review):

      In this manuscript, Ibáñez-Solé et al. aim to clarify the answer to a very basic and important question that has gained a lot of attention in the past ∼5 years due to the fast-increasing pace of research in the aging field and the development/optimization of single-cell gene expression quantification techniques: how does noise in gene expression change during the course of cellular/tissue aging? As the authors clearly describe, there have been multiple datasets available in the literature but one could not say the same for the number of available analysis pipelines, especially a pipeline that quantifies membership of single cells to their assigned cell type cluster. To address these needs, Ibáñez-Solé et al. developed: 1. a toolkit (named Decibel) to implement the common methods for the quantification of age-related noise in scRNAseq data; and 2. a method (named Scallop) for obtaining membership information for single cells regarding their assigned cell type cluster. Their analyses showed that previously published aging datasets had large variability between tissues and datasets, and importantly the authors’ results show that noise increase in aging could not be claimed as a universal phenotype (as previously suggested by various studies).

      We thank the reviewer for their positive assessment of the manuscript and their suggestions.

      Comments:

      1) In two relevant papers (doi.org/10.1038/s41467-017-00752-9 and doi.org/10.1016/j.isci.2018.08.011), previous work had already shown what haploid/diploid genetic backgrounds could show in terms of intercellular/intracellular noise. Due to the direct nature of age/noise quantification in these papers, one cannot blame any computational pipeline-related issues for the “unconventional” results. The authors should cite and sufficiently discuss the noise-related results of these papers in their Discussion section. These two papers collectively show how the specific gene, its protein half-life and ploidy can lead to similar/different noise outcomes.

      We agree that we have failed to mention and sufficiently discuss the effects of measuring transcriptional noise from data generated via destructive experimentation, where no longitudinal analyses are possible. As mentioned in the responses to other reviewers, the body of literature on transcriptional noise is quite wide and based on heterogeneous assumptions. We have focused our efforts on measuring actual noise in scRNAseq aging datasets, which by definition imply sampling of different cells and thus make assumptions at the population level. We believe our results provide a different and interesting perspective on transcriptional noise and aging, but we agree with this reviewer on the need to discuss our findings in the context of other attempts to measure transcriptional noise in a more direct way. We have now included a brief discussion of the work by Sarnoski et al. and Liu et al. This point is explained in more detail later in the letter.

      2) While the authors correctly put a lot of emphasis on studying the same cell type or tissue for a faithful interpretation of noise-related results, they ignore another important factor: tracking the same cell over time instead of calculating noise from single-cell populations at supposedly-different age points. Obviously, scRNAseq cannot analyze the same cell twice, but inability to assess noise-in-aging in the same cell over time is still an important concern. Noise could/does affect the generation durations and therefore neighboring cells in the same cluster may not have experienced the same amount of mitotic aging, for example. Also, perhaps a cell has already entered senescence at early age in the same tissue. This caveat should be properly discussed.

      The distinction between intrinsic and extrinsic noise and the impossibility of telling the two apart in destructive experiments is a relevant point that we have now included in the Discussion (the newly added text is shown in italics):

      "Transcriptional noise could be related to genomic instability [18], epigenetic deregulation [19, 20] or loss of proteostasis [21], all established hallmarks of aging. Some authors consider transcriptional noise to be a hallmark of aging in and of itself [22]. In any case, the origin of transcriptional noise is unclear, as it could arise from many different sources. Most importantly, it is not possible to distinguish between intrinsic and extrinsic noise from a snapshot of cellular states, i.e., one cannot tell whether the observed differences between cells in a single-cell RNA experiment reflect time-dependent variations in gene expression or differences between cells across a population [23]. Interestingly, recent work by Liu et al. measuring intrinsic noise in S. cerevisiae showed that aging is associated with a steady decrease in noise, with a sudden increase in soon-to-die cells. Another longitudinal study found an increase in extrinsic noise and a lack of change in intrinsic noise in diploid yeast [16]."

      Regarding the caveat of cells of individuals in the Young groups showing signs of aging, we can only agree that this is correct: there will be cells sampled that already show signs of cellular damage in the absence of chronological aging. However, this applies to every study of aging that samples cells in a destructive manner, and it is generally assumed by the field that this is a discrete phenomenon that does not affect the overall results in a meaningful way.

      3) Another weakness of this study is that the authors did not show the source/cause of decreasing/stable/increasing noise during aging. Understanding the source of loss of cell type identity is also important but this manuscript was about noise in aging, so it would have been nice if there could be some attempts to explain why noise is having this/that trend in differentially aged cell types in specific tissues.

      The reviewer raises here a very important point that we would like to discuss in detail. The papers that we have re-analyzed generally assume that an increase in transcriptional noise and a loss in cell type identity are equivalent terms. However, as this reviewer points out, you could theoretically have cells that lose their cell type identity without a concomitant increase in transcriptional noise, for instance by a sharp decrease in a limited number of marker genes that collectively define that cell within a given cell type/cluster. Thus, transcriptional noise can certainly arise from different sources and several mechanisms have been proposed to explain its presence in the context of cellular aging. We agree with the reviewer that discussing how transcriptional noise could be related to aging is of interest to the readers. However, as pointed out in the responses to similar concerns by the other reviewers, our main finding is that we don’t detect meaningful and reliable increases in transcriptional noise associated with cell aging. Instead, what we see is a number of different technical and biological issues/phenomena that have been interpreted as transcriptional noise. We hope this reviewer will agree that the manuscript now presents a full and robust story and that finding the causes of up/down “noise” trends in the different datasets may be more appropriately tackled by follow-up studies.

      4) In the discussion section, the authors say that “Most importantly, Scallop measures transcriptional noise by membership to cell type-specific clusters which is a re-definition of the original formulation of noise by Raser and O’Shea.” It is not clear what the authors refer to by “the original formulation of noise by Raser and O’Shea”. Intrinsic/extrinsic noise formulations? Please be more specific.

      We thank the reviewer for pointing this out, since we agree that the sentence needed to be reformulated for the sake of clarity. What we meant by the definition by Raser and O’Shea was “the measured level of variation in gene expression among cells supposed to be identical”, which does not make any distinction between intrinsic and extrinsic noise. Since their definition predates the development of single-cell technologies, we meant to state our attempt to bring this classic concept to the context of single-cell RNAseq. Nowadays, cell clusters produced by a community detection algorithm are given cell type annotations depending on their expression of known cell type markers. What Scallop aims to measure is the extent of membership each individual cell has for its cluster as evidence of its transcriptional stability. In order to make this point clearer, we have now rewritten the paragraph as follows:

      Most importantly, Scallop measures transcriptional noise by membership to cell type-specific clusters which is a re-definition of the original formulation of noise by Raser and O’Shea: measurable variation among cells that should share the same transcriptome. This is in stark contrast to measurements of noise including other phenomena (as demonstrated in Figure 5) by the distance-to-centroid methods prevalent in the literature.
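
      To make the contrast between the two kinds of measurement concrete, here is a minimal sketch under stated assumptions; it is not the Scallop or Decibel implementation, and all function names and toy values are hypothetical. The membership score is taken to be the fraction of repeated clustering runs in which a cell receives its most frequent label, assuming labels have already been matched across runs (a step not described in this response). The distance-to-centroid score is the Euclidean distance from each cell to the mean expression of its annotated cell type.

```python
import math
from collections import Counter, defaultdict


def membership_scores(runs):
    """Per-cell fraction of runs agreeing with the cell's most frequent label.

    runs -- list of label lists, one per clustering run, all in the same
            cell order; labels are assumed to be matched across runs.
    """
    scores = []
    for votes in zip(*runs):  # votes = labels one cell received across runs
        top_count = Counter(votes).most_common(1)[0][1]
        scores.append(top_count / len(votes))
    return scores


def distance_to_centroid(expression, cell_types):
    """Euclidean distance from each cell to its cell type's mean expression."""
    groups = defaultdict(list)
    for vector, label in zip(expression, cell_types):
        groups[label].append(vector)
    centroids = {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in groups.items()
    }
    return [
        math.dist(vector, centroids[label])
        for vector, label in zip(expression, cell_types)
    ]


# Toy data (hypothetical): four cells, four clustering runs, two genes.
runs = [
    ["A", "A", "B", "B"],
    ["A", "B", "B", "B"],
    ["A", "A", "B", "A"],
    ["A", "A", "B", "B"],
]
membership = membership_scores(runs)  # [1.0, 0.75, 1.0, 0.75]

expression = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 14.0]]
cell_types = ["A", "A", "B", "B"]
dtc = distance_to_centroid(expression, cell_types)  # [1.0, 1.0, 2.0, 2.0]
```

      In this toy case the membership score singles out the two cells whose cluster assignment is unstable, whereas the distance-to-centroid score only reflects how spread out each cell type is, regardless of whether the cells sit between clusters.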

      References

      [1] M. Alex Ascensión, Olga Ibáñez-Solé, Iñaki Inza, Ander Izeta, and Marcos J Araúzo-Bravo. Triku: A feature selection method based on nearest neighbors for single-cell data. GigaScience, 11, 2022. doi: 10.1093/gigascience/giac017.

      [2] M. Ximerakis, S. L. Lipnick, B. T. Innes, S. K. Simmons, X. Adiconis, D. Dionne, B. A. Mayweather, L. Nguyen, Z. Niziolek, C. Ozek, V. L. Butty, R. Isserlin, S. M. Buchanan, S. S. Levine, A. Regev, G. D. Bader, J. Z. Levin, and L. L. Rubin. Single-cell transcriptomic profiling of the aging mouse brain. Nat Neurosci, 22(10), 2019. doi: 10.1038/s41593-019-0491-3.

      [3] M. Enge, H. E. Arda, M. Mignardi, J. Beausang, R. Bottino, S. K. Kim, and S. R. Quake. Single-cell analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns. Cell, 171(2), 2017. doi: 10.1016/j.cell.2017.09.004.

      [4] L. Solé-Boldo, G. Raddatz, S. Schütz, et al. Single-cell transcriptomes of the human skin reveal age-related loss of fibroblast priming. Commun Biol, 3(188), 2020. doi: 10.1038/s42003-020-0922-4.

      [5] Jaime L. Schneider, Jared H. Rowe, Carolina Garcia-de Alba, Carla F. Kim, Arlene H. Sharpe, and Marcia C. Haigis. The aging lung: Physiology, disease, and immunity. Cell, 184(8):1990–2019, 2021. doi: 10.1016/j.cell.2021.03.005.

      [6] Shuai Ma, Shuhui Sun, Jiaming Li, Yanling Fan, Jing Qu, Liang Sun, Si Wang, Yiyuan Zhang, Shanshan Yang, Zunpeng Liu, et al. Single-cell transcriptomic atlas of primate cardiopulmonary aging. Cell Research, 31(4):415–432, 2020. doi: 10.1038/s41422-020-00412-6.

      [7] I. Angelidis, L. M. Simon, I. E. Fernandez, et al. An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics. Nature Communications, 2019. doi: 10.1038/s41467-019-08831-9.

      [8] Jonathan M. Raser and Erin K. O’Shea. Noise in gene expression: origins, consequences, and control. Science, 309(5743):2010–2013, 2005. doi: 10.1126/science.1105891.

      [9] Michael B. Elowitz, Arnold J. Levine, Eric D. Siggia, and Peter S. Swain. Stochastic gene expression in a single cell. Science, 297:1183–1186, 2002. doi: 10.1126/science.1070919.

      [10] Peter S. Swain, Michael B. Elowitz, and Eric D. Siggia. Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc Natl Acad Sci U S A., 99:12795–12800, 2002. doi: 10.1073/pnas.162041399.

      [11] Alex Cagan, Adrian Baez-Ortega, Natalia Brzozowska, Federico Abascal, Tim H. H. Coorens, Mathijs A. Sanders, Andrew R. J. Lawson, Luke M. R. Harvey, Shriram Bhosle, David Jones, Raul E. Alcantara, Timothy M. Butler, Yvette Hooks, Kirsty Roberts, Elizabeth Anderson, Sharna Lunn, Edmund Flach, Simon Spiro, Inez Januszczak, Ethan Wrigglesworth, Hannah Jenkins, Tilly Dallas, Nic Masters, Matthew W. Perkins, Robert Deaville, Megan Druce, Ruzhica Bogeska, Michael D. Milsom, Björn Neumann, Frank Gorman, Fernando Constantino-Casas, Laura Peachey, Diana Bochynska, Ewan St. John Smith, Moritz Gerstung, Peter J. Campbell, Elizabeth P. Murchison, Michael R. Stratton, and Iñigo Martincorena. Somatic mutation rates scale with lifespan across mammals. Nature, 604:517–524, 2022. doi: 10.1038/s41586-022-04618-z.

      [12] Hamit Izgi, Dingding Han, Ulas Isildak, Shuyun Huang, Ece Kocabiyik, Philipp Khaitovich, Mehmet Somel, and Handan Melike Dönertas. Inter-tissue convergence of gene expression during ageing suggests age-related loss of tissue and cellular identity. eLife, 11, 2022. doi: 10.7554/eLife.68048.

      [13] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008. doi: 10.1088/1742-5468/2008/10/p10008.

      [14] V. A. Traag, L. Waltman, and N. J. van Eck. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9, 2019. doi: 10.1038/s41598-019-41695-z.

      [15] Peng Qiu. Embracing the dropouts in single-cell RNA-seq analysis. Nature Communications, 11(1), 2020. doi: 10.1038/s41467-020-14976-9.

      [16] Ethan A. Sarnoski, Ruijie Song, Ege Ertekin, Noelle Koonce, and Murat Acar. Fundamental characteristics of single-cell aging in diploid yeast. iScience, 7:96–109, 2018. doi: 10.1016/j.isci.2018.08.011.

      [17] Ping Liu, Ruijie Song, Gregory L. Elison, Weilin Peng, and Murat Acar. Noise reduction as an emergent property of single-cell aging. Nature Communications, 8(1), 2017. doi: 10.1038/s41467-017-00752-9.

      [18] Jan Vijg. From DNA damage to mutations: All roads lead to aging. Ageing Res Rev., 68(101316), 2021. doi: 10.1016/j.arr.2021.101316.

      [19] Yuancheng Lu, Benedikt Brommer, Xiao Tian, Anitha Krishnan, Margarita Meer, Chen Wang, Daniel L. Vera, Qiurui Zeng, Doudou Yu, Michael S. Bonkowski, Jae-Hyun Yang, Songlin Zhou, Emma M. Hoffmann, Margarete M. Karg, Michael B. Schultz, Alice E. Kane, Noah Davidsohn, Ekaterina Korobkina, Karolina Chwalek, Luis A. Rajman, George M. Church, Konrad Hochedlinger, Vadim N. Gladyshev, Steve Horvath, Morgan E. Levine, Meredith S. Gregory-Ksander, Bruce R. Ksander, Zhigang He, and David A. Sinclair. Reprogramming to recover youthful epigenetic information and restore vision. Nature, 588(7836):124–129, 2020. doi: 10.1038/s41586-020-2975-4.

      [20] Giorgio Oliviero, Sergey Kovalchuk, Adelina Rogowska-Wrzesinska, Veit Schwämmle, and Ole N. Jensen. Distinct and diverse chromatin proteomes of ageing mouse organs reveal protein signatures that correlate with physiological functions. eLife, 11(e73524), 2022. doi: 10.7554/eLife.73524.

      [21] Jingyi Li, Yuxuan Zheng, Pengze Yan, Moshi Song, Si Wang, Liang Sun, Zunpeng Liu, Shuai Ma, Juan Carlos Izpisua Belmonte, Piu Chan, Qi Zhou, Weiqi Zhang, Guang-Hui Liu, Fuchou Tang, and Jing Qu. A single-cell transcriptomic atlas of primate pancreatic islet aging. Natl Sci Rev., 8(2):nwaa127, 2020. doi: 10.1093/nsr/nwaa127.

      [22] Alexander R. Mendenhall, George M. Martin, Matt Kaeberlein, and Rozalyn M. Anderson. Cell-to-cell variation in gene expression and the aging process. Geroscience, 43(1):181–196, 2021. doi: 10.1007/s11357-021-00339-9.

      [23] Lucy Ham, Marcel Jackson, and Michael PH Stumpf. Pathway dynamics can delineate the sources of transcriptional noise in gene expression. eLife, 10, 2021. doi: 10.7554/elife.69324.

    1. Author Response

      Reviewer #1 (Public Review):

      It is now widely accepted that the age of the brain can differ from the person's chronological age and neuroimaging methods are ideally suited to analyze the brain age and associated biomarkers. Preclinical studies of rodent models with appropriate neuroimaging do attest that lifestyle-related prevention approaches may help to slow down brain aging and the potential of BrainAGE as a predictor of age-related health outcomes. However, there is a paucity of data on this in humans. It is in this context the present manuscript receives its due attention.

      Comments:

      1) Lifestyle intervention benefits need to be analyzed using robust biomarkers which should be profiled non-invasively in a clinical setting. There is increasing evidence of the role of telomere length in brain aging. Gampawar et al (2020) have proposed a hypothesis on the effect of telomeres on brain structure and function over the life span and named it as the "Telomere Brain Axis". In this context, if the authors could measure telomere length before and after lifestyle intervention, this will give a strong biomarker utility and value addition for the lifestyle modification benefits.

      2) Authors should also consider measuring BDNF levels before and after lifestyle intervention.

      Response to comments 1+2: we agree that associating both telomere length and BDNF level with brain age would be interesting and relevant. However, we did not measure these two variables. We would certainly consider adding these in future work. Regarding telomere length, we now include a short discussion of brain age in relation to other bodily ages, such as telomere length (Discussion section):

      “Studying changes in functional brain aging is part of a broader field that examines changes in various biological ages, such as telomere length1, DNA methylation2, and arterial stiffness3. Evaluating changes in these bodily systems over time allows us to capture health and lifestyle-related factors that affect overall aging and may guide the development of targeted interventions to reduce age-related decline. For example, in the CENTRAL cohort, we recently reported that reducing body weight and intrahepatic fat following a lifestyle intervention was related to methylation age attenuation4. In the current work, we used RSFC for brain age estimation, which resulted in a MAE of ~8 years, which was larger than the intervention period. Nevertheless, we found that brain age attenuation was associated with changes in multiple health factors. The precision of an age prediction model based on RSFC is typically lower than a model based on structural brain imaging5. However, a higher model precision may result in a lower sensitivity to detect clinical effects6,7. Better tools for data harmonization among datasets6 and larger training sample size5 may improve the accuracy of such models in the future. We also suggest that examining the dynamics of multiple bodily ages and their interactions would enhance our understanding of the complex aging process8,9.”

      And

      “These findings complement the growing interest in bodily aging indicated, for example, by DNA methylation4 as health biomarkers and interventions that may affect them.”

      Reviewer #2 (Public Review):

      In this study, Levakov et al. investigated brain age based on resting-state functional connectivity (RSFC) in a group of obese participants following an 18-month lifestyle intervention. The study benefits from various sophisticated measurements of overall health, including body MRI and blood biomarkers. Although the data is leveraged from a solid randomized control set-up, the lack of control groups in the current study means that the results cannot be attributed to the lifestyle intervention with certainty. However, the study does show a relationship between general weight loss and RSFC-based brain age estimations over the course of the intervention. While this may represent an important contribution to the literature, the RSFC-based brain age prediction shows low model performance, making it difficult to interpret the validity of the derived estimates and the scale of change. The study would benefit from more rigorous analyses and a more critical discussion of findings. If incorporated, the study contributes to the growing field of literature indicating that weight-reduction in obese subjects may attenuate the detrimental effect of obesity on the brain.

      The following points may be addressed to improve the study:

      Brain age / model performance:

      1) Figure 2: In the test set, the correlation between true and predicted age is 0.244. The fitted slope looks like it would be approximately 0.11 (55-50)/(80-35); change in y divided by change in x. This means that for a chronological age change of 12 months, the brain age changes by 0.11*12 = 1.3 months. I.e., due to the relatively poor model performance, an 80-year-old participant in the plot (fig 2) has a predicted age of ~55. Hence, although the age prediction step can generate a summary score for all the RSFC data, it can be difficult to interpret the meaning of these brain age estimates and the 'expected change' since the scale is in years.
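
      The arithmetic in this comment can be reproduced with a few lines. This is only a sketch based on the two approximate points the reviewer reads off Figure 2, (35, 50) and (80, 55); the function name is hypothetical and no study data are used.

```python
def fitted_slope(x_low, y_low, x_high, y_high):
    """Slope of a line through two points read off a scatter plot."""
    return (y_high - y_low) / (x_high - x_low)


# Approximate end points of the fitted line, as read by the reviewer:
# chronological age on the x-axis, predicted brain age on the y-axis.
slope = fitted_slope(35, 50, 80, 55)

# Expected change in predicted brain age (in months) for a given amount
# of chronological time, if the fitted line is taken at face value.
change_12_months = slope * 12
change_18_months = slope * 18

print(round(slope, 2), round(change_12_months, 1), round(change_18_months, 1))
# prints: 0.11 1.3 2.0
```

      Under this reading, the 18-month intervention period corresponds to an expected shift of only about 2 months in predicted brain age, which is the scale problem the comment describes.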

      2) In Figure 2 it could also help to add the x = y line to get a better overview of the prediction variance. The estimates are likely clustered around the mean/median age of the training dataset, and age is overestimated in younger subs and underestimated in older subs (usually referred to as "age bias"). It is important to inspect the data points here to understand what the estimates represent, i.e., is variation in RSFC potentially lost by wrapping the data in this summary measure, since the age prediction is not particularly accurate, and should age bias in the predictions be accounted for by adjusting the test data for the bias observed in the training data?

      Response to comment 1+2: we agree with the reviewer that due to the relatively moderate correlation between the predicted and observed age, a large change in the observed age corresponds to a small change in the predicted age. We now state this limitation in Results section 2.1:

      “Despite being significant and reproducible, we note that the correlations between the observed and predicted age were relatively moderate.”

      And discuss this point in the Discussion section:

      “In the current work, we used RSFC for brain age estimation, which resulted in a MAE of ~8 years, which was larger than the intervention period. Nevertheless, we found that brain age attenuation was associated with changes in multiple health factors. The precision of an age prediction model based on RSFC is typically lower than a model based on structural brain imaging5. However, a higher model precision may result in a lower sensitivity to detect clinical effects6,7. Better tools for data harmonization among datasets6 and larger training sample size5 may improve the accuracy of such models in the future.”

      Moreover, we now add the x=y line to Fig. 2, so that readers can better assess the prediction variance, as suggested by the reviewer:

      We prefer to avoid using different scales (year/month) in the x and y axes to avoid misleading the readers, but the lists of observed and predicted ages are available as SI files with a precision of two decimal places (~3 days).

      We note that despite the moderate prediction accuracy, we replicated these results in three separate cohorts.

      Regarding the effect of “age bias” (also known as “regression attenuation” or “regression dilution” 10), we are aware of this phenomenon and agree that it must be accounted for. In fact, the “age bias” is one of the reasons we chose to use the difference between the expected and observed ages as the primary outcome of the study, as this measure already takes this bias into account. To demonstrate this effect we now compute brain age attenuation in two ways: 1. As described and used in the current study (Methods 4.9); and 2. By regressing out the effect of age on the predicted brain age at both times separately, then subtracting the adjusted predicted age at T18 from the adjusted predicted age at T0. The second method is the standard method to account for age bias as described in a previous work 11. Below is a scatter plot of both measures across all participants:

      The x-axis represents the first method, used in the current study, and the y-axis represents the second method, described in Smith et al., (2019). Across all subjects, we found a nearly perfect 1:1 correspondence between the two methods (r=.998, p<0.001; MAE=0.45), as the two are mathematically identical. The small gap between the two is because the brain age attenuation model also takes into account the difference in the exact time that passed between the two scans for each participant (mean=21.36m, std = 1.68m).

      We now note this in Methods section 4.9:

      “We note that the result of computing the difference between the bias-corrected brain age gap at both times was nearly identical to the brain age attenuation measure (r=.99, p<0.001; MAE=0.45). The difference between the two is because the brain age attenuation model takes into account the difference in the exact time that passed between the two scans for each participant (mean=21.36m, std = 1.68m).”

      3) In Figure 3, some of the changes observed between time points are very large. For example, one subject with a chronological age of 62 shows a ten-year increase in brain age over 18 months. This change is twice as large as the full range of age variation in the brain age estimates (average brain age increases from 50 to 55 across the full chronological age span). This makes it difficult to interpret RSFC change in units of brain age. E.g., is it reasonable that a person's brain ages by ten years, either up or down, in 18 months? The colour scale goes from -12 years to 14 years, so some of the observed changes are 14 / 1.5 = 9 times larger than the actual time from baseline to follow-up.

      We agree that our model precision was relatively low, especially compared to the period of the intervention, as also stated by reviewer #1. We now discuss this issue in light of the studies pointed out by the reviewer (Discussion section):

      “In the current work, we used RSFC for brain age estimation, which resulted in a MAE of ~8 years, which was larger than the intervention period. Nevertheless, we found that brain age attenuation was associated with changes in multiple health factors. The precision of an age prediction model based on RSFC is typically lower than a model based on structural brain imaging5. However, a higher model precision may result in a lower sensitivity to detect clinical effects6,7. Better tools for data harmonization among datasets6 and larger training sample size5 may improve the accuracy of such models in the future.”

      Again, we note that despite the moderate prediction accuracy, we replicated these results in three separate cohorts and found that both the correlation and the MAE between the predicted and observed age were significant in all of them.

      RSFC for age prediction:

      1) Several studies show better age prediction accuracy with structural MRI features compared to RSFC. If the focus of the study is to use an accurate estimate of brain ageing rather than specifically looking at changes in RSFC, adding structural MRI data could be helpful.

      We focused on brain structural changes in a previous work, and the focus of the current work was assessing age-related functional connectivity alterations. We now added a few sentences in the Introduction section that would hopefully better motivate our choice:

      “We previously found that weight loss, glycemic control, lowering of blood pressure, and increment in polyphenols-rich food were associated with an attenuation in brain atrophy 12. Obesity is also manifested in age-related changes in the brain’s functional organization as assessed with resting-state functional connectivity (RSFC). These changes are dynamic13 and can be observed in short time scales14 and thus of relevance when studying lifestyle intervention.”

      2) If changes in RSFC are the main focus, using brain age adds a complicated layer that is not necessarily helpful. It could be easier to simply assess RSFC change from baseline to follow up, and correlate potential changes with changes in e.g., BMI.

      We are specifically interested in age-related changes as we described a-priori in the registration of the study: https://clinicaltrials.gov/ct2/show/NCT03020186

      Moreover, age-related changes in RSFC are complex, multivariate and dependent upon the choice of theoretical network measures. We think that a data-driven brain age prediction approach might better capture these multifaceted changes and their relation to aging. We now state this in the Introduction section:

      “Studies have linked obesity with decreased connectivity within the default mode network15,16 and increased connectivity with the lateral orbitofrontal cortex17, which are also seen in normal aging18,19. Longitudinal trials have reported changes in these connectivity patterns following weight reduction20,21, indicating that they can be altered. However, findings regarding functional changes are less consistent than those related to anatomical changes due to the multiple measures22 and scales23 used to quantify RSFC. Hence, focusing on a single measure, the functional brain age, may better capture these complex, multivariate changes and their relation to aging.”

      The lack of control groups

      1) If no control group data is available, it is important to clarify this in the manuscript, and evaluate which conclusions can and cannot be drawn based on the data and study design.

      We agree that this point should be made more clear, and we now state this in the limitation section of the Discussion:

      “We also note that the lack of a no-intervention control group limits our ability to directly relate our findings to the intervention. Hence, we can only relate brain age attenuation to the observed changes in health biomarkers.”

      Also, following reviewers’ #2 and #3 comments, we refer to the weight loss following 18 months of lifestyle intervention instead of to the intervention itself. This is now made clear in the title, abstract, and the main text.

      Reviewer #3 (Public Review):

      The authors report on an interesting study that addresses the effects of a physical and dietary intervention on accelerated/decelerated brain ageing in obese individuals. More specifically, the authors examined potential associations between reductions in Body-Mass-Index (BMI) and a decrease in relative brain-predicted age after an 18-months period in N = 102 individuals. Brain age models were based on resting-state functional connectivity data. In addition to change in BMI, the authors also tested for associations between change in relative brain age and change in waist circumference, six liver markers, three glycemic markers, four lipid markers, and four MRI fat deposition measures. Moreover, change in self-reported consumption of food, stratified by categories such as 'processed food' and 'sweets and beverages', was tested for an association with change in relative brain age. Their analysis revealed no evidence for a general reduction in relative brain age in the tested sample. However, changes in BMI, as well as changes in several liver, glycemic, lipid, and fat-deposition markers showed significant covariation with changes in relative brain age. Three markers remained significant after additionally controlling for BMI, indicating an incremental contribution of these markers to change in relative brain age. Further associations were found for variables of subjective food consumption. The authors conclude that lifestyle interventions may have beneficial effects on brain aging.

      Overall, the writing is concise and straightforward, and the language and style are appropriate. A strength of the study is the longitudinal design that allows for addressing individual accelerations or decelerations in brain aging. Research on biological aging parameters has often been limited to cross-sectional analyses so inferences about intra-individual variation have frequently been drawn from inter-individual variation. The presented study allows, in fact, investigating within-person differences. Moreover, I very much appreciate that the authors seek to publish their code and materials online, although the respective GitHub project page did not appear to be set to 'public' at the time (error 404). Another strength of the study is that brain age models have been trained and validated in external samples. One further strength of this study is that it is based on a registered trial, which allows for the evaluation of the aims and motivation of the investigators and provides further insights into the primary and secondary outcomes measures (see the clinical trial identification code).

      One weakness of the study is that no comparison between the active control group and the two experimental groups has been carried out, which would have enabled causal inferences on the potential effects of different types of interventions on changes in relative brain age. In this regard, it should also be noted that all groups underwent a lifestyle intervention. Hence, from an experimenter's perspective, it is problematic to conclude that lifestyle interventions may modulate brain age, given the lack of a control group without lifestyle intervention. This issue is fueled by the study title, which suggests a strong focus on the effects of lifestyle intervention. Technically, however, this study rather constitutes an investigation of the effects of successful weight loss/body fat reduction on brain age among participants who have taken part in a lifestyle intervention. In keeping with this, the provided information on the main effect of time on brain age is scarce, essentially limited to a sign test comparing the proportions of participants with an increase vs. decrease in relative brain age. Interestingly, this analysis did not suggest that the proportion of participants who benefit from the intervention (regarding brain age) significantly exceeds the number of participants who do not benefit. So strictly speaking, the data rather indicates that it's not the lifestyle intervention per se that contributes to changes in brain age, but successful weight loss/body fat reduction. In sum, I feel that the authors' claims on the effects of the intervention cannot be underscored very well given the lack of a control group without lifestyle intervention.

      We agree that this point, also raised by reviewer #2, should be made clear, and we now state this in the limitation section of the Discussion:

      “We also note that the lack of a no-intervention control group limits our ability to directly relate our findings to the intervention. Hence, we can only relate brain age attenuation to the observed changes in health biomarkers.”

      Also, following reviewers #2 and #3, we refer to the weight loss following 18 months of lifestyle intervention instead of to the intervention itself. This is now explicitly mentioned in the title, abstract, and within the text:

      Title: “The effect of weight loss following 18 months of lifestyle intervention on brain age assessed with resting-state functional connectivity”

      Abstract: “…, we tested the effect of weight loss following 18 months of lifestyle intervention on predicted brain age, based on MRI-assessed resting-state functional connectivity (RSFC).”

      Another major weakness is that no rationale is provided for why the authors use functional connectivity data instead of structural scans for their age estimation models. This gets even more evident in view of the relatively low prediction accuracies achieved in both the validation and test sets. My notion of the literature is that the vast majority of studies in this field implicate brain age models that were trained on structural MRI data, and these models have achieved way higher prediction accuracies. Along with the missing rationale, I feel that the low model performances require some more elaboration in the discussion section. To be clear, low prediction accuracies may be seen as a study result and, as such, they should not be considered as a quality criterion of the study. Nevertheless, the choice of functional MRI data and the relevance of the achieved model performances for subsequent association analysis needs to be addressed more thoroughly.

      We agree that age estimation from structural compared to functional imaging yields a higher prediction accuracy. In a previous publication using the same dataset12, we demonstrated that weight loss was associated with an attenuation in brain atrophy, as we describe in the introduction:

      “We previously found that weight loss, glycemic control and lowering of blood pressure, as well as increment in polyphenols rich food, were associated with an attenuation in brain atrophy 12.”

      Here we were specifically interested in age-related functional alterations that are associated with successful weight reduction. Compared to structural brain changes, the effect of aging on functional connectivity is more complex and multifaceted. Hence, we decided to use a data-driven, prediction-based approach for assessing age-related changes in functional connectivity by predicting participants’ functional brain age. We now describe this rationale in the Introduction section:

      “Studies have linked obesity with decreased connectivity within the default mode network15,16 and increased connectivity with the lateral orbitofrontal cortex17, which are also seen in normal aging18,19. Longitudinal trials have reported changes in these connectivity patterns following weight reduction20,21, indicating that they can be altered. However, findings regarding functional changes are less consistent than those related to anatomical changes due to the multiple measures22 and scales23 used to quantify RSFC. Hence, focusing on a single measure, the functional brain age, may better capture these complex changes and their relation to aging.”

      We address the point regarding the low model performance in response to reviewer #2, comment #2.

    1. Author Response

      Reviewer #1 (Public Review):

      This study investigates how pathogens might shape animal societies by driving the evolution of different social movement rules. The authors find that higher disease costs induce shifts away from positive social movement (preference to move towards others) to negative social movement (avoidance from others). This then has repercussions on social structure and pathogen spread.

      Overall, the study comprises a good mixture of intuitive and less intuitive results. One major weakness of the work, however, is that the model is constructed around one pathogen that repeatedly enters a population across hundreds of generations. While the authors provide some justification for this, it does not capture any biological realism in terms of the evolution of the pathogen itself, which would be expected. The lack of co-evolution in the model substantially limits the generality of the results. For example, a number of recent studies have reported that animals might be expected to become very social when pathogens are very infectious, because if the pathogen is unavoidable they may as well gain the benefits of being social. The authors make some arguments about being focused on introduction events, but this does not really align well with their study design that carries through many generations after the introduction. Given the rapid evolutionary dynamics, perhaps the study could have a more focused period immediately after the initial introduction of the pathogen to look at rapid evolutionary responses (albeit this may need some sensitivity analyses around the parameters such as the mutation rates).

      We appreciate the reviewer’s evaluation of our work, and acknowledge that we have not currently included evolutionary dynamics for the pathogen.

      One conceptual impediment to such inclusion is knowing how pathogen traits could be modelled in a mechanistic way. For example, it is widely held that there is a trade-off between infection cost and transmissibility, with a quadratic relationship between them, but this is a pattern and not a process per se. We are unsure which mechanisms could be modelled that impinge upon both infection cost and transmissibility.

      On the practical side, we feel that a mechanistic, individual-based model that includes both pathogen and host evolution would become very challenging to interpret. It might be more tractable to begin with a mechanistic, spatial model that examines pathogen trait evolution with an unchanging host (such as an adaptation of Lion and Boots, 2010). We would be happy to take this on in future work, with a view to combining models thereafter.

      We have taken the suggestion to focus on the period immediately after the introduction, and we now focus on the following 500 generations. While 500 generations is still a long time, we would note that our model dynamics typically stabilise within 200 generations. We show the following generations primarily to check that some stability in the dynamics has indeed been reached (but see our new scenario 2).

      We also appreciate the point regarding mutation rates. Our mutation rates are relatively high to account for the small size of our population. We have found that with smaller mutation rates (0.001 rather than 0.01), evolutionary shifts in our population do not occur within the first 500 generations. This is primarily because prior to pathogen introduction, the ‘agent avoiding’ strategy that becomes common later is actually quite rare. Whether a rapid transition takes place thus depends on whether there are any agent avoiding individuals in the population at the moment of pathogen introduction, or on whether such individuals emerge rapidly thereafter through mutations on the social weights. We expect that with larger population sizes, we would be able to recover our results with smaller mutation rates as well.

      A final, and much more minor comment is whether this is really a paper about movement. The model does not really look at evolutionary changes in how animals move, but rather at where they move. How important is the actual movement process under this model? For example, would the results change if the model was constructed without explicit consideration of space and resources, but instead simply modelled individuals' decisions to form and break ties? (Similar to the recent paper by Ashby & Farine https://onlinelibrary.wiley.com/doi/full/10.1111/evo.14491 ). It might help to provide more information about how putting social decisions into a spatially explicit framework is expected to extend studies that have not done so (e.g., because they are analytical).

      This paper is indeed about movement, as where to move is a key part of the movement ecology paradigm (Nathan et al. 2008). That said, we appreciate the advice to emphasise the importance of social decisions in a spatial context, and we have added this to the Introduction (L. 79 – 81) and Discussion (L. 559 – 562). In brief, we do expect different dynamics to result from the explicit spatial context, as compared to a model in which social associations are probabilistic and could occur with any individual in the population.

      In our models, individual social tendency (whether they prefer moving towards others) is separated from individual sociality (whether they actually associate with other individuals). This can be seen from our (new) Fig. 3D, in which individuals of each of the social strategies can sometimes have similar numbers of associations (although modulated by movement). This separation of the pattern from the underlying process is possible, we believe, due to the heterogeneity in the social landscape created by the explicit spatial context.

      Reviewer #2 (Public Review):

      This theoretical study looks at individuals' strategies to acquire information before and after the introduction of pathogens into the system. The manuscript is well-written and gives a good summary of the previous literature. I enjoyed reading it and the authors present several interesting findings about the development of social movement strategies. The authors successfully present a model to look at the costs and benefits of sociality.

      I have a couple of major comments about the work in its current form that I think are very important for the authors to address. That said, I think this is a promising start and that with some revisions, this could be a valuable contribution to the literature on behavioral ecology.

      We appreciate the reviewer’s kind words.

      Before starting, I would like to be precise that, given the scope of the models and the number of parameter choices that were necessary, I am going to avoid criticisms of the decisions made when designing the models. However, there are a few assumptions I rather find problematic and would like to give proper attention to.

      The first regards social vs. personal information. Most of the model argumentation is based on the reliance on social information (considering four, but to me overlapping, social strategies that are somehow static and heritable) but in fact, individuals may oscillate between relying on their personal information and/or on social information -- which may depend on the availability of resources, population density, stochastic factors, among others (Dall et al. 2005 Trends Ecol. Evol., Duboscq et al. 2016 Frontiers in Psychology). In my opinion, ignoring the influence of personal and social information decreases the significance of this work. I am aware that the authors consider the detection of food present in the model, but this is considered to a much smaller extent (as seen in their weight on individual decisions) than the social information cues.

      We appreciate the point that individuals can switch between relying on social and personal information. However, we would point out that in our model, the social strategies are not static. The social strategy is a convenient way of representing individuals’ position in behavioural trait-space (the ‘behavioural hypervolume’ of Bastille-Rousseau and Wittemeyer 2019). This essentially means that the importance assigned to each of the three cues available in our model varies among individuals. There are indeed individuals that are primarily guided by the density of food items, and this is the commonest ‘overall’ movement strategy before the pathogen is introduced. We represent this by showing how the importance of social information is low before pathogen introduction (Fig. 2B).
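
      The idea of a strategy as a position in trait-space can be sketched as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, the importance measure (share of absolute cue weight) is an assumed definition rather than the one used in the manuscript, and the labels 'agent tracking' and 'non-handler tracking' are placeholders for strategies that are not named in this response.

```python
from dataclasses import dataclass


@dataclass
class CueWeights:
    food: float          # weight on nearby food items (personal information)
    handlers: float      # weight on agents currently handling food
    non_handlers: float  # weight on agents not handling food


def social_importance(w):
    """Share of total absolute cue weight placed on the two social cues."""
    total = abs(w.food) + abs(w.handlers) + abs(w.non_handlers)
    if total == 0:
        return 0.0
    return (abs(w.handlers) + abs(w.non_handlers)) / total


def social_strategy(w):
    """Label a position in trait-space by the signs of the social weights."""
    if w.handlers >= 0 and w.non_handlers >= 0:
        return "agent tracking"        # placeholder label
    if w.handlers >= 0:
        return "handler tracking"
    if w.non_handlers >= 0:
        return "non-handler tracking"  # placeholder label
    return "agent avoiding"


# Two hypothetical individuals: one guided mainly by food, one avoiding all agents.
food_guided = CueWeights(food=0.9, handlers=0.05, non_handlers=0.05)
avoider = CueWeights(food=0.4, handlers=-0.3, non_handlers=-0.3)
```

      Under this sketch the strategy label is a summary of continuous, heritable weights rather than a fixed category, so an individual guided mainly by food still receives a social strategy label while placing little weight on social cues.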

      While we primarily focus on the importance of social information, this is because the population quite understandably evolves a persistent preference for moving towards food items (i.e., using personal information if available). We have made this clearer in the text on lines 367 – 371.

      Critically, it is also unclear how, if at all, the information and pathogen traits are related to each other. If a handler gets sick, how does this affect its foraging activity (does it stop foraging, slow its activities, or does it show signs of sickness)? Perhaps this model is attempting to explore the emergence of social movement strategies only, but how they disentangle an individual's sickness status and behavioral response is unclear.

      We appreciate that infection may lead to physiological effects (e.g. altered metabolic rates, reduction in cognitive capacity) that may then influence behaviour. Our model aims to be a relatively simple and general one, and does not consider the explicit mechanisms by which infection imposes a cost on fitness. Thus we do not include any behavioural modifications due to infection, as we feel that these would be much too complex to include in such a model. We would be happy to explore, in future work, phenomena such as the evolution of self-isolation and infection detection, which are common among animals such as social insects (Stroeymeyt et al. 2018, Pusceddu et al. 2021).

      However, we have considered an alternative implementation of our model’s scenario 1 which could be interpreted as the infection reducing foraging efficiency by a certain percentage (other interpretations of the redirection of energy away from reproduction are also possible). We show how this implementation leads to very similar outcomes as those seen in our

      Very little is presented about the virulence of the pathogens and how they could affect the emergence of social strategies. The authors keep their main argumentation based on the introduction of novel pathogens (without distinctions on their pathogenicity), but a behavioral response is rather influenced by how fast individuals are infected and which are their chances of recovering. Besides, they consider that only one or two social interactions would be enough for pathogen transmission to occur.

      We have indeed considered a fixed transmission probability of 0.05, a relatively modest attack rate. Setting transmission probability to two other values (0.025, 0.1), we find that our general results are recovered - there is an evolutionary transition away from sociality, with the proportion of agent avoidance evolved increasing with the transmission probability. While we do not show these results in the main text, we have included figures showing the proportions of each social movement strategy here for the reviewers’ reference.

      Figures showing the proportion of social movement strategies in two simulation runs of our default implementation of scenario 1 (dE = 0.25, R = 2, pathogen introduction begins from G = 500). Top: Probability of transmission = 0.025 (half of the default). Bottom: Probability of transmission = 0.10 (double the default). Overall, the proportion of agent avoidance evolved (purple) increases with the probability of transmission. Each figure shows a single replicate of each parameter combination, for only 1,000 generations.
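
      To put the three transmission probabilities in context, the short sketch below computes the chance of at least one transmission over repeated encounters with infected individuals. It assumes that encounters are independent and that each transmits with the same fixed probability; how encounters are counted in the simulation itself is not specified here, and the function name is hypothetical.

```python
def infection_probability(p_transmit, n_encounters):
    """Chance of at least one transmission over n independent encounters."""
    return 1.0 - (1.0 - p_transmit) ** n_encounters


# Compare the three per-encounter probabilities over a few encounter counts.
for p in (0.025, 0.05, 0.1):
    print(p, [round(infection_probability(p, n), 3) for n in (1, 2, 10, 50)])
```

      Under these assumptions, two encounters at the default probability of 0.05 give an infection chance of just under 10%, so one or two interactions are usually not enough for transmission, while the risk accumulates quickly for individuals that associate often.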

      Another important component is that individuals do not die, and it seems that they always have a chance (even if it is small) to reproduce. So, how the authors consider unsuccessful strategies in the model outputs or how these social strategies would be potentially "dismissed" by natural selection are not considered.

      We appreciate the point that our simulation does not include mortality effects, and that all individuals have some small chance of reproducing. There are a few practical and conceptual challenges when incorporating this level of realism in a general model. Including mortality effects could allow for the emergence of more complex density-dependent dynamics, as dead individuals would not be able to transmit the pathogen to other foragers (although for some pathogens, this could be a valid choice), nor would they be sources of social information. This would make the model much more challenging to interpret, and we have tried to keep this model as simple as possible.

      We have also sought to keep the model’s focus on the evolutionary dynamics, and to not focus on mortality. In order to balance this aim with the reviewer's suggestion, we have included a new implementation of the model’s scenario 1 which has a threshold on reproduction. That means that only individuals with a positive energy balance (intake > infection costs) are allowed to reproduce. We show a potentially counter-intuitive result, that the more social ‘handler tracking’ strategy persists at a higher frequency than in our default implementation, despite having a higher infection rate than the ‘agent avoiding’ strategy. We suggest that this is because the ‘agent avoiding’ individuals have very low or no intake. This is sufficient in our default implementation to have relatively higher fitness than the more frequently infected handler tracking individuals.

      Reviewer #3 (Public Review):

      Gupte and colleagues develop an individual-based model to examine how the introduction of a novel pathogen influences the evolution of social cue use in a population of agents for which social cues can both facilitate more efficient foraging, but also expose individuals to infection. In their simulations, individuals move across a landscape in search of food, and their movements are guided by a combination of cues related to food patches, individuals that are currently handling food items, and individuals that are not actively handling food. The latter two cues can provide indirect information about the likely presence of food due to the patchiness of food across the landscape.

      The authors find that prior to introducing the novel pathogen, selection favors strategies that home in on agents, regardless of whether those agents are currently handling food items. The overall contribution of these social cues to movement decisions, however, tends to be relatively small. After pathogen introduction, agents evolve to rely more heavily on social information and to either be more selective in their use of it (attending to other agents that are currently handling food and avoiding non-handlers) or avoiding other agents altogether. Gupte and colleagues further examine the ecological consequences of these shifts in social decision-making in terms of individuals' overall movement, food consumption, and infection risk. Relative to pre-introduction conditions, individuals move more, consume less food, and are less likely to be infected due to reduced contact with others. Epidemiological models on emergent social networks confirm that evolved behavioral changes generate networks that impede the spread of disease.

      The introduction of novel pathogens into wild populations is expected to be increasingly common due to climate change and increasing global connectedness. The approach taken here by the authors is a potentially worthwhile avenue to explore the potential eco-evolutionary consequences of such introductions. A major strength of this study is how it couples ecological and evolutionary timescales. Dominant behavioral strategies evolve over time in response to changing environmental conditions and impact social, foraging, and epidemiological dynamics within generations. I imagine there are many further questions that could be fruitfully explored using the authors' framework. There are, however, important caveats that impact the interpretation of the authors' findings.

      First, reproduction bears no cost in this model. Individuals produce offspring in proportion to their lifetime net energy intake, which is increased by consuming food and decreased by a set amount per turn once infected. However, prior to reproduction, net energy intake is normalized (0-1) according to the lowest individual value within the generation. This means that individuals need not maintain a positive energy balance nor even consume food at all to successfully reproduce, so long as they perform reasonably well relative to other members of the population. Since consuming food is not necessary to reproduce, declining per capita intake due to evolved social avoidance (Fig. 1d) likely decreases the importance of food to an individual's reproductive success relative to simply avoiding infection. This dynamic could explain the delayed emergence of the 'agent avoiding' strategy (Fig. 1a), as this strategy potentially is only viable once per capita intake reaches a sufficiently low level across the population (Fig. 1d). I am curious to know what the results would be if reproduction required some minimal positive net energy, such that individuals must risk food patches in order to reproduce. It would also be useful for the authors to provide information on how net energy intake changes across generations, as well as whether (and if so, how) attraction to the food itself may change over time.

      We thank the reviewer for their assessment of our work, and appreciate the point raised here (and in an earlier review) about individuals potentially reproducing without any intake. We have addressed this by running our default model [repeated introductions, R = 2, dE = 0.25], with a threshold on reproduction such that only individuals with a positive energy balance can reproduce. We mention these results in the text (L. 495 – 500), and show related figures in the SI Appendix. In brief, as the reviewer suggests, agent avoiding is less common for our default parameter combination, but becomes as common as the default combination when the infection cost is doubled (to dE = 0.5).
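
      The difference between the default reproduction rule and the threshold variant can be sketched as follows. This is a minimal illustration, not the simulation code: the function name, the min-max rescaling, and the example energy values are assumptions based on the description above.

```python
def reproduction_weights(net_energy, require_positive_balance=False):
    """Relative reproduction weights from lifetime net energy (intake minus infection cost)."""
    lowest, highest = min(net_energy), max(net_energy)
    span = highest - lowest
    # Default rule: rescale to 0-1 relative to the lowest value in the generation.
    if span == 0:
        weights = [1.0] * len(net_energy)
    else:
        weights = [(e - lowest) / span for e in net_energy]
    if require_positive_balance:
        # Threshold variant: individuals without a positive balance cannot reproduce.
        weights = [w if e > 0 else 0.0 for w, e in zip(weights, net_energy)]
    return weights


# Hypothetical generation: two foragers with positive balance, one uninfected
# individual with no intake, and one individual whose infection cost exceeds intake.
net_energy = [3.0, 1.5, 0.0, -2.0]
default_weights = reproduction_weights(net_energy)
threshold_weights = reproduction_weights(net_energy, require_positive_balance=True)
```

      In this example the individual with zero intake still receives a non-zero weight under the default rule, because it outperforms the most heavily infected individual, but it is excluded once a positive energy balance is required.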

      We appreciate the reviewer’s suggestion about decreasing per-capita intake being a precondition for the proliferation of the agent avoiding strategy. With our new results, we now show that there is no overall decrease in intake, but the agent avoiding strategy still becomes common after pathogen introduction. As the reviewer suggests, this is because these individuals have a net energy equivalent to that of handler tracking individuals, as they are less frequently infected.

      We suggest that the delayed emergence of the agent avoiding strategy is primarily due to mutation limitations – such individuals are uncommon or non-existent in the simulation before pathogen introduction, and random mutations are required for them to emerge. As we have noted in response to an earlier comment, this becomes clear when the mutation rate is reduced from 0.01 to 0.001 – agent avoidance usually does not evolve at all.

      A second important caveat is that the evolutionary responses observed in the model only appear when novel pathogen introductions are extremely frequent. The model assumes no pathogen co-evolution, but rather that the same (or a functionally identical) pathogen is re-introduced every generation (spillover rate = 1.0). When the authors considered whether evolutionary responses were robust to less frequent introductions, however, they found that even with a per-generation spillover rate of 0.5, there was no impact on social movement strategies. The authors do discuss this caveat, but it is worth highlighting here as it bears on how general the study's conclusions may be.

      We appreciate the reviewer’s point entirely. We would point out that current knowledge about pathogen introductions across species and populations in the wild is very poor. However, the ongoing highly pathogenic avian influenza outbreak (Wille and Barr 2022), the spread of multiple strains of SARS-CoV-2 to wild deer in several different human-to-wildlife transmission events, and recent work on the potential for coronavirus spillovers from bats to humans, all suggest that at least some generalist pathogens must circulate quite widely among wildlife, often crossing into novel host species or populations. We have added these considerations to the text on lines 218 – 231.

      We have also added, in order to confront this point more squarely, a new scenario of our model in which the pathogen is introduced just once, and then transmits vertically and horizontally among individuals (lines 519 – 557). This scenario more clearly suggests when evolutionary responses to pathogen introductions are likely to occur, and what their consequences might be for a pathogen becoming endemic in a population. This scenario also serves as a potential starting point for models of host-pathogen trait co-evolution, and we have added this consideration to the text on lines 613 – 623.

      References

      ● Albery, G. F. et al. 2021. Multiple spatial behaviours govern social network positions in a wild ungulate. - Ecology Letters 24: 676–686.

      ● Bastille-Rousseau, G. and Wittemyer, G. 2019. Leveraging multidimensional heterogeneity in resource selection to define movement tactics of animals. - Ecology Letters 22: 1417–1427.

      ● Gupte, P. R. et al. 2021. The joint evolution of animal movement and competition strategies. - bioRxiv in press.

      ● Lion, S. and Boots, M. 2010. Are parasites ‘prudent’ in space? - Ecology Letters 13: 1245–1255.

      ● Lloyd-Smith, J. O. et al. 2005. Superspreading and the effect of individual variation on disease emergence. - Nature 438: 355–359.

      ● Nathan, R. et al. 2008. A movement ecology paradigm for unifying organismal movement research. - PNAS 105: 19052–19059.

      ● Pusceddu, M. et al. 2021. Honey bees increase social distancing when facing the ectoparasite Varroa destructor. - Science Advances 7: eabj1398.

      ● Sánchez, C. A. et al. 2022. A strategy to assess spillover risk of bat SARS-related coronaviruses in Southeast Asia. - Nat Commun 13: 4380.

      ● Stroeymeyt, N. et al. 2018. Social network plasticity decreases disease transmission in a eusocial insect. - Science 362: 941–945.

      ● Wilber, M. Q. et al. 2022. A model for leveraging animal movement to understand spatio-temporal disease dynamics. - Ecology Letters in press.

      ● Wille, M. and Barr, I. G. 2022. Resurgence of avian influenza virus. - Science 376: 459–460.

    1. Author Response

      Reviewer #1 (Public Review):

      This work sheds light on the adverse effects of Bacillus thuringiensis, a strongly pathogenic bacterium used as a microbial pesticide to kill lepidopteran larvae that threaten crops, on gut homeostasis of non-susceptible organisms. By using Drosophila melanogaster as a non-susceptible model organism, this paper reveals the mechanisms by which the bacteria disrupt gut homeostasis. The authors combined different genetic tools and Western blot experiments to successfully demonstrate that bacterial protoxins are released and activated throughout the fly gut after ingestion and influence intestinal stem cell proliferation and intestinal cell differentiation. This phenomenon relies on the interaction of activated protoxins with specific components of adherens junctions within the intestinal epithelium. Due to conserved mechanisms governing intestinal cell differentiation, this work could be the starting point for further studies in mammals.

      The conclusions proposed by the authors are in general well supported by the data. However, some improvements in data representation, as well as additional key control experiments, would be needed to further reinforce some key points of the paper.

      We thank Reviewer 1 for her appreciation of the work and in-depth analysis of the data. We agree with all her comments and believe the suggestions significantly improved the manuscript.

      1) Figure 1 and others: Several graphs in the manuscript show the number of cells/20000µm2. What is the shape of the gut in the different conditions studied in this manuscript? The gut shape (shrunk gut versus normal gut, for example) could influence the number of cells seen in a small area. For example, the number of total cells quantified in a small area (here 20000µm2) of a shrunk gut can increase while their size decreases. As a result, the quantification of a specific cell type in a small region (here 20000µm2) can be biased and may not represent the real number of cells present in the whole posterior part of the R4 region. Would it make sense to calculate a ratio "number of X cells/number of DAPI positive cells per 20000µm2"?

      We addressed this concern in "Essential Revisions point 1". To summarize, we have now added whole posterior midgut images in the different conditions to highlight the intestinal morphology (Figure 1-figure supplement 1A). The whole gut morphology was not affected by the different challenges we performed. Indeed, we used low doses of spores and/or toxins in order to mimic the "natural" amounts of spores/toxins that a fly can eat in the environment and to avoid drastic disturbances of the gut lining.

      We have also added the cell type ratio in Figure 1-figure supplement 2.

      2) Figure 4: Is it possible that Arm staining is less intense between ISC and progenitors after ingestion of the bacteria due to the fact there is a high rate of stem cell proliferation? Could it be an indirect effect of stem cell proliferation rather than the binding of the toxins to Cadherins?

      We thank the reviewer for this pertinent comment. For this reason, we compared the intensity of Arm expression at the junction between neighboring progenitors with the Arm intensity around the rest of the cellular membranes and calculated the ratio between both values (see Figure 4-figure supplement 1F-G for an illustration of how we proceeded and the new section in the Material and Methods 736-742). Using this method, even if the whole Arm staining intensity is different (in all the midgut), the ratio reflects the internal cell-cell interaction changes between the two neighboring cells. Moreover, we have observed that Arm staining (using the usual monoclonal antibody N2 7A1 from the DSHB) was very variable from one midgut to another in the same feeding/intoxication condition. So, we do not want to draw conclusions about the whole Arm intensity due to this variability, whatever the intoxication conditions. Finally, the challenged guts always displayed a more disorganized epithelium due to cell proliferation and differentiation. Consequently, Arm staining in ECs and in progenitor cells is found in the same focal plane, whereas in unchallenged, well-organized guts, Arm staining in ECs is above the focal plane of Arm staining in progenitor cells. This likely leads to the impression that Arm staining is more intense in challenged midguts. This method description has now been added to the Material and Methods section (lines 736-742).
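
      The logic of this ratio can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the image-analysis pipeline used for Figure 4: the function name `junction_ratio`, the use of simple means, and the intensity values are all hypothetical.

```python
from statistics import mean


def junction_ratio(contact_intensities, membrane_intensities):
    """Mean intensity at the cell-cell contact divided by the mean
    intensity along the rest of the membrane of the same cell pair."""
    membrane_mean = mean(membrane_intensities)
    if membrane_mean <= 0:
        raise ValueError("membrane intensity must be positive")
    return mean(contact_intensities) / membrane_mean


# Hypothetical pixel intensities sampled from one pair of neighboring cells.
contact = [30.0, 34.0, 32.0]
membrane = [20.0, 22.0, 18.0]

# The same pair imaged in a midgut whose overall staining is 2.5x brighter.
gain = 2.5
dim_ratio = junction_ratio(contact, membrane)
bright_ratio = junction_ratio(
    [gain * v for v in contact], [gain * v for v in membrane]
)
```

      Because the numerator and the denominator are scaled by the same factor, the two ratios are equal, which is why a midgut-to-midgut difference in overall staining intensity should not by itself change the measurement.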

      Could the authors use the ReDDM system to distinguish between "old" and newly formed cells? This could be a good control to make sure that the signal is quantified in similar cells between the control and the different conditions.

      We analyzed the intensity of Arm expression between pairs of GFP cells. Most of these pairs arose from de novo divisions. Indeed, as shown in control conditions (water) with Dl-ReDDM (for example, see Figure 1-figure supplement 1D), pairs of GFP cells (ISC-ISC) are rare. Most pairs correspond to ISC-EB or ISC-EEP pairs with the progenitor marked by RFP, meaning that it has just arisen from the GFP+ mother ISC. We therefore assume that, in the esg>GFP genotype, pairs of GFP+ cells correspond to one ISC and one progenitor (see Figure 4-figure supplement 1A-A'). Consequently, when we analyzed the Arm intensity between pairs of GFP cells after intoxication, these cells were very likely "newborn" cells. Even if we suppose that some ISCs and progenitors remain stuck together for a long time (for instance, several days), Cry1A toxins could also disrupt their cell junction. In the context of Cry1A toxin activity, it seems important to analyze the whole impact on cell-cell junctions without discriminating between old and new cell-cell interactions.

      We tried to use anti-Arm and anti-Pros double staining to mark new EEPs. Unfortunately, the anti-Arm and anti-Prospero antibodies were both raised in mice. Co-staining with both antibodies gives poor labelling for Arm, for Prospero, or for both. Our first author spent a lot of energy trying to set up good conditions, but unfortunately this was unsuccessful.

      Here is an example of what we obtained (this was our best image) with esg>GFP flies fed with water (control) and labelled for Arm and Pros in red. White arrows point to two EEPs. Red arrows point to the Arm staining between two precursors (ISC/ISC or ISC/EB or EB/EB). It was extremely hard to identify junctions marked by Arm between EEPs and ISCs because the Pros staining was too strong.

      Another example is shown with flies fed with spores of SA11 (increasing the number of EEs). esg>GFP is in green, and Arm and Prospero are in red. The right panel corresponds to the single red channel (Arm/Prospero).

      Nevertheless, we have now performed a similar analysis in an esg>GFP, Shg::RFP background and analyzed Shg::RFP (Tomato::DE-Cadherin) labelling intensity. We found similar results, which are presented in the new Figure 4 (the data with Arm have been moved to Figure 4-figure supplement 1). This last analysis has been included in the text (lines 285-299).

      Figure 4E' and 4G': Arm staining seems more intense when looking at the whole membrane levels of cells compared to control. Is it possible that the measured ratio contact intensity/membrane intensity presented in Figure 4I could be impacted and not reflect the real contact intensity between ISC and progenitor cells?

      Please check our answer just above: "…//… we have observed that Arm staining (using the usual monoclonal antibody N2 7A1 from the DSHB) was very variable from one midgut to another in the same feeding/intoxication condition. So, we do not want to draw conclusions about the whole Arm intensity due to this variability, whatever the intoxication conditions".

      See also our intensity measurement method described above to avoid bias: "…//… we compared the intensity of Arm expression at the junction between neighboring progenitors with the Arm intensity around the rest of the cellular membranes and calculated the ratio between both values (see Figure 4-figure supplement 1F-G for an illustration of how we proceeded and the new section in the Material and Methods 736-742). Using this method, even if the whole Arm staining intensity is different (in all the midgut), the ratio reflects the internal cell-cell interaction changes between the two neighboring cells."

      What is the hypothesis of the authors about the decrease of Arm or DE-Cad seen after bacterial/crystal ingestion? Does the interaction between the toxins and DE-Cad induce a relocation of DE-Cad?

      It has been shown that E-Cadherin can be recycled when adherens junctions are destabilized, both in Drosophila and in mammals (Buchon et al., 2010; O'Keefe et al., 2007; Tiwari et al., 2018). To investigate this possibility, we tried to analyze DE-Cad cytoplasmic relocalization using anti-DE-Cad immunostaining (DCAD2 antibody from DSHB) as well as the Shg::RFP (Bloomington stock #58789) or Shg::GFP (Bloomington stock #60584) endogenous fusions. Unfortunately, we did not see obvious differences. We have nevertheless added the split channels of the Shg::RFP labelling in the different conditions in Figure 4A-D'. We remain interested in the behavior of DE-Cadherin (and its signaling; see Liang et al., 2017) upon binding of the Cry1A toxin, and N. Zucchini-Pascal (an author of this article) is currently investigating this question.

      The authors should add more details about the way to quantify in the Material and methods section. How many cells have been quantified per intestine? How did they choose the cells where they quantified the contact intensity?..etc

      These details were missing from the methods and we thank the reviewer for highlighting this issue. We added this information to the methods (lines 725-742). The number of cell pairs analyzed was present in the raw data related to Figure 4 but absent from the main figure and legend. This has now been rectified. We only measured the intensity in isolated pairs of cells.

      Figure 4B, D, F and H: How did the authors recognize the ISCs?

      We agree with the reviewer's comment. We cannot recognize ISCs per se. Green cells correspond either to ISCs or to EBs. We modified the text accordingly (lines 285-287).

      Could the authors do quantifications of DE-Cad signal?

      This has been done. It is now shown in Figure 4E and in Table 1. We also adapted the text (lines 289-299) to fine-tune our interpretation in light of this new analysis. Indeed, what we now define as "mild" adherens junction intensity corresponds to a ratio between 1.4 and 1.6 instead of the previous range (1.3 to 1.6), because we observed that most of the EEP progenitors arose from cells displaying a junction intensity with their mother cells below the 1.4 ratio (see Table 1).

      Like Arm staining, the staining seems stronger at the whole membrane level in F and H compared to the control.

      As we described above for Arm staining, the intensity of Tomato::DE-Cad labelling can differ from one posterior midgut to another. One simple explanation relates to changes in the structure of the midgut epithelium, which is well organized in unchallenged conditions, whereas in challenged midguts the epithelial cells are no longer well arranged due to rapid cell proliferation and differentiation. Consequently, DE-Cad labelling in ECs is at the same level as that in ISC/progenitor cells, giving the impression that the labelling is stronger.

      3) Figure 5: How is the stem cell proliferation upon overexpression of DE-Cad in control or upon bacteria/crystals ingestion? Do the authors think that the decrease of Pros+RFP+ new cells upon overexpression of DE-Cad could result from a decrease of stem cell proliferation?

      This is a great suggestion. We therefore chose to count the progenitor cells (GFP+ cells), which reflect ISC division during the last 3 days. This also has the advantage of using the same images (samples) as for all the analyses shown in Figure 5 and Figure 5-figure supplement 1. If we consider the number of GFP+ cells (esg-expressing cells corresponding to ISCs, EBs or EEPs) in challenged midguts, the overexpression of DE-Cad did not seem to alter ISC division. In addition, we still observed more GFP+ cells when the midguts were challenged with SA11 or crystals than with BtkCry, in agreement with the rate of ISC division observed in the WT genetic background shown in Figure 1B.

      We have now added the counting of GFP+ cells in Figure 5-figure supplement 1E. The text has been modified to integrate these results (lines 306-308).

      Did the authors quantify the % of new ECs in the context of overexpression of DE-Cad?

      The data have been added in Figure 5F. The text has been modified to integrate this result (lines 312-313).

      Figure 5F: As asked before, did the authors distinguish the signal between newly born cells and the signal between older cells?

      In the new Figure 5G, we used the esg-ReDDM system, which is very efficient: almost all ISCs and progenitors express GFP. The quantification was done between cell pairs that express both GFP and RFP, as specified in the text (lines 310-311). Nevertheless, we cannot distinguish between new and old cells here. Indeed, the esg-ReDDM system induces both GFP and RFP in all esg+ cells (the old ones and the new ones). Hence, if a division occurred just before the induction of the system and gave rise, for instance, to an ISC and an EB, both cells will express GFP and RFP. Should we then consider those pairs of cells as old cells or new cells? Notably, as we analyzed the intensity of junctions 3 days after intoxication and induction of the ReDDM system, we assume that the pairs of GFP+/RFP+ cells arose after the induction of the system. Indeed, to our knowledge, nobody has shown that, in the posterior midgut, a progenitor remains stuck to its mother ISC for as long as 3 days. Even if we assume that this event can occur, Cry1A toxins could also disrupt their cell junction.

      We now have removed the DAPI channel and added the RFP+ channel in Figure 5-figure supplement 1A-D' (previously the Figure S4A-D) to illustrate this explanation and to facilitate the interpretation by the reader.

      It would be interesting to compare the junction intensity between mother ISCs and their daughter progenitors before and after intoxication in the same intestine. However, we think that this event is quite rare under the experimental conditions we used (i.e. analyses 3 days after the induction of the ReDDM system and intoxication).

      The same experiments (stem cell proliferation + quantification of the % of new ECs) could be also done when authors overexpress of the Connectin, supplemental figure 5. This would be another control to conclude that the effects on cell differentiation are specific due to the interaction between DE-Cad and the toxins.

      We have added the analyses in Figure 5 - figure supplement 2J and K.

      The text has been completed accordingly (lines 317-320).

      In the "crystals" condition, the overexpression of Connectin seems to partially rescue the increased % of new Pros+RFP+ cells observed in Figure 3F (Figure S5I compared to Figure 3F).

      Yes, we agree with the reviewer's comment. In an esg-ReDDM background (Figure 3F), crystals induced a much greater increase in EE numbers than did SA11 spores. However, in a WT or esg>GFP background, crystals induced an increase in EE/EEP similar to that induced by SA11 spores. We do not yet have an explanation for this, other than the esg-ReDDM genetic background.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors start the study with an interesting clinical observation, found in a small subset of prostate cancers: FOXP2-CPED1 fusion. They describe how this fusion results in enhanced FOXP2 protein levels, and further describe how FOXP2 increases anchorage-independent growth in vitro, and results in pre-malignant lesions in vivo. Intrinsically, this is an interesting observation. However, the mechanistic insights are relatively limited as it stands, and the main issues are described below.

      Main issues:

      1) While the study starts off with the FOXP2 fusion, the vast majority of the paper is actually about enhanced FOXP2 expression in tumorigenesis. Wouldn't it be more logical to remove the FOXP2 fusion data? These data seem quite interesting and novel but they are underdeveloped within the current manuscript design, which is a shame for such an exciting novel finding. Along the same lines, for a study that centres on the prostate lineage, it's not clear why the oncogenic potential of FOXP2 in mouse 3T3 fibroblasts was tested.

      We thank the reviewer very much for the comment. We followed the suggestion and added a set of data regarding the newly identified FOXP2 fusion in Figure 1 to make our manuscript more informative. We tested the oncogenic potential of FOXP2 in NIH3T3 fibroblasts because NIH3T3 cells are a widely used model for demonstrating the presence of transforming oncogenes (refs. 2, 3). In our study, we observed that when NIH3T3 cells acquired the exogenous FOXP2 gene, the cells lost the characteristic contact inhibition response, continued to proliferate and eventually formed clonal colonies. Please refer to "Answer to Essential Revisions #1 from the Editors" for details.

      2) While the FOXP2 data are compelling and convincing, it is not clear yet whether this effect is specific, or if FOXP2 is e.g. universally relevant for cell viability. Targeting FOXP2 by siRNA/shRNA in a non-transformed cell line would address this issue.

      We appreciate these helpful comments. Please refer to the "Answer to Essential Revisions #1 from the Editors" for details.

      3) Unfortunately, not a single chemical inhibitor is truly 100% specific. Therefore, the Foretinib and MK2206 experiments should be confirmed using shRNAs/KOs targeting MEK and AKT. With the inclusion of such data, the authors would make a very compelling argument that indeed MEK/AKT signalling is driving the phenotype.

      We thank the reviewer for highlighting this point, and we agree that no chemical inhibitor is 100% specific. In this study, we used chemical inhibitors to provide further supportive data indicating that FOXP2 confers oncogenic effects by activating MET signaling. We characterized a FOXP2-binding fragment located in MET and HGF in LNCaP prostate cancer cells by utilizing the CUT&Tag method. We also found that MET restoration partially reversed oncogenic phenotypes in FOXP2-KD prostate cancer cells. All these data consistently support the conclusion that FOXP2 activates MET signaling in prostate cancer. Please refer to the "Answer to Essential Revisions #2 from the Editors" and to the "Answer to Essential Revisions #7 from the Editors" for details.

      4) With the FOXP2-CPED1 fusion being more stable as compared to wild-type transcripts, wouldn't one expect the fusion to have a more severe phenotype? This is a very exciting aspect of the start of the study, but it is not explored further in the manuscript. The authors would ideally elaborate on why the effects of the FOXP2-CPED1 fusion seem comparable to the FOXP2 wildtype, in their studies.

      We thank the reviewer very much for the comment. We quantified the number of colonies of FOXP2- and FOXP2-CPED1-overexpressing cells, and we found that wild-type FOXP2 and FOXP2-CPED1 had a comparable putative functional influence on the transformation of human prostate epithelial cells RWPE-1 and mouse primary fibroblasts NIH3T3 (P = 0.69, by Fisher’s exact test for RWPE-1; P = 0.23, by Fisher’s exact test for NIH3T3). We added the corresponding description to the Results section in Line 487 on Page 22 in the tracked changes version of the revised manuscript. Please refer to the "Answer to Essential Revisions #5 from the Editors" for details.
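
      For readers unfamiliar with the test, the sketch below shows how a two-sided Fisher's exact test on a 2 × 2 table can be computed with the Python standard library. The response does not state how the contingency tables were constructed, so the function name `fisher_exact_two_sided` and the counts in the example are invented placeholders, not the data behind the reported P values.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Placeholder counts only: rows are two constructs, columns are two outcomes.
p_value = fisher_exact_two_sided(12, 8, 10, 10)
```

      A large P value from such a test, as reported above, means only that the data do not show a difference between the two constructs; it does not demonstrate that their effects are identical.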

      5) The authors claim that FOXP2 functions as an oncogene, but the most-severe phenotype that is observed in vivo, is PIN lesions, not tumors. While this is an exciting observation, it is not the full story of an oncogene. Can the authors justifiably claim that FOXP2 is an oncogene, based on these results?

      We appreciate the comment, and we made the corresponding revision in the revised manuscript. Please refer to the "Answer to Essential Revisions #3 from the Editors" for details.

      6) The clinical and phenotypic observations are exciting and relevant. The mechanistic insights of the study are quite limited in the current stage. How does FOXP2 give its phenotype, and result in increased MET phosphorylation? The association is there, but it is unclear how this happens.

      We appreciate this valuable suggestion. In the current study, we used the CUT&Tag method to explore how FOXP2 activated MET signaling in LNCaP prostate cancer cells, and we identified potential FOXP2-binding fragments in MET and HGF. Therefore, we proposed that FOXP2 activates MET signaling in prostate cancer through its binding to MET and MET-associated genes. Please refer to the "Answer to Essential Revisions #2 from the Editors" for details.

      Reviewer #2 (Public Review):

      1) The manuscript entitled "FOXP2 confers oncogenic effects in prostate cancer through activating MET signalling" by Zhu et al describes the identification of a novel FOXP2-CPED1 gene fusion in 2 out of 100 primary prostate cancers. A byproduct of this gene fusion is the increased expression of FOXP2, which has been shown to be increased in prostate cancer relative to benign tissue. These data nominated FOXP2 as a potential oncogene. Accordingly, overexpression of FOXP2 in nontransformed mouse fibroblast NIH-3T3 and human prostate RWPE-1 cells induced transforming capabilities in both cell models. Mechanistically, convincing data were provided that indicate that FOXP2 promotes the expression and/or activity of the receptor tyrosine kinase MET, which has previously been shown to have oncogenic functions in prostate cancer. Notably, the authors create a new genetically engineered mouse model in which FOXP2 is overexpressed in the prostatic luminal epithelial cells. Overexpression of FOXP2 was sufficient to promote the development of prostatic intraepithelial neoplasia (PIN), a suspected precursor to prostate adenocarcinoma, and to activate MET signaling.

      Strengths:

      This study makes a convincing case for FOXP2 as 1) a promoter of prostate cancer initiation and 2) an upstream regulator of pro-cancer MET signaling. This was done using both overexpression and knockdown models in cell lines and corroborated in new genetically engineered mouse models (GEMMs) of FOXP2 or FOXP2-CPED1 overexpression in prostate luminal epithelial cells as well as publicly available clinical cohort data.

      Major strengths of the study are the demonstration that FOXP2 or FOXP2-CPED1 overexpression transforms RWPE-1 cells to now grow in soft agar (hallmark of malignant transformation) and the creation of new genetically engineered mouse models (GEMMs) of FOXP2 or FOXP2-CPED1 overexpression in prostate luminal epithelial cells. In both mouse models, FOXP2 overexpression increased the incidence of PIN lesions, which are thought to be a precursor to prostate cancer. While FOXP2 alone was not sufficient to cause prostate cancer in mice, it is acknowledged that single gene alterations causing prostate cancer in mice are rare. Future studies will undoubtedly want to cross these GEMMs with established, relatively benign models of prostate cancer such as Hi-Myc or Pb-Pten mice to see if FOXP2 accelerates cancer progression (beyond the scope of this study).

      We appreciate these positive comments from the reviewer. We agree with the suggestion from the reviewer that it is worth exploring whether FOXP2 is able to cooperate with a known disease driver to accelerate the progression of prostate cancer. Therefore, we are going to cross Pb-FOXP2 transgenic mice with Pb-Pten KO mice to assess if FOXP2 is able to accelerate malignant progression.

      2) Weaknesses: It is unclear why the authors decided to use mouse fibroblast NIH3T3 cells for their transformation studies. In this regard, it appears likely that FOXP2 could function as an oncogene across diverse cell types. Given the focus on prostate cancer, it would have been preferable to corroborate the RWPE-1 data with another prostate cell model and test FOXP2's transforming ability in RWPE-1 xenograft models. To that end, there is no direct evidence that FOXP2 can cause cancer in vivo. The GEMM data, while compelling, only shows that FOXP2 can promote PIN in mice and the lone xenograft model chosen was for fibroblast NIH-3T3 cells.

      To determine the oncogenic activity of FOXP2 and the FOXP2-CPED1 fusion, we initially used mouse primary fibroblast NIH3T3 for transformation experiments, because NIH3T3 cells are a widely used cell model to discover novel oncogenes (refs. 2, 3, 10, 11). Subsequently, we observed that overexpression of FOXP2 and its fusion variant drove RWPE-1 cells to lose the characteristic contact inhibition response, led to their anchorage-independent growth in vitro, and promoted PIN in the transgenic mice. During preparation of the revised manuscript, we tested the transformation ability of FOXP2 and FOXP2-CPED1 in RWPE-1 xenograft models. We subcutaneously injected 2 × 10^6 RWPE-1 cells into the flanks of NOD-SCID mice. The NOD-SCID mice were divided into five groups (n = 5 mice in each group): control, FOXP2-overexpressing (two stable cell lines) and FOXP2-CPED1-overexpressing (two cell lines) groups. The experiment lasted for 4 months. We observed that no RWPE-1 cell-injected mice developed tumor masses. We propose that FOXP2 and its fusion alone are not sufficient to generate the microenvironment suitable for RWPE-1-xenograft growth. Collectively, our data suggest that FOXP2 has oncogenic potential in prostate cancer, but is not sufficient to act alone as an oncogene.

      3) There is a limited mechanism of action. While the authors provide correlative data suggesting that FOXP2 could increase the expression of MET signaling components, it is not clear how FOXP2 controls MET levels. It would be of interest to search for and validate the importance of potential FOXP2 binding sites in or around MET and the genes of MET-associated proteins. At a minimum, it should be confirmed whether MET is a primary or secondary target of FOXP2. The authors should also report on what happened to the 4-gene MET signature in the FOXP2 knockdown cell models. It would be equally significant to test if overexpression of MET can rescue the anti-growth effects of FOXP2 knockdown in prostate cancer cells (positive or negative results would be informative).

      We appreciate all the valuable comments. As suggested, we performed the corresponding experiments. Please refer to the "Answer to Essential Revisions #2 from the Editors", to the "Answer to Essential Revisions #6 from the Editors", and to the "Answer to Essential Revisions #7 from the Editors" for details.

      Reviewer #3 (Public Review):

      1) In this manuscript, the authors present data supporting FOXP2 as an oncogene in PCa. They show that FOXP2 is overexpressed in PCa patient tissue and is necessary and sufficient for PCa transformation/tumorigenesis depending on the model system. Overexpression and knock-down of FOXP2 lead to an increase/decrease in MET/PI3K/AKT transcripts and signaling and sensitizes cells to PI3K/AKT inhibition.

      Key strengths of the paper include multiple endpoints and model systems, an over-expression and knock-down approach to address sufficiency and necessity, a new mouse knock-in model, analysis of primary PCa patient tumors, and benchmarking of findings against publicly available data. The central discovery that FOXP2 is an oncogene in PCa will be of interest to the field. However, there are several critical unanswered questions.

      1) No data are presented for how FOXP2 regulates MET signaling. ChIP would easily address if it is direct regulation of MET and analysis of FOXP2 ChIP-seq could provide insights.

      2) Beyond the 2 fusions in the 100 PCa patient cohort it is unclear how FOXP2 is overexpressed in PCa. In the discussion and in FS5 some data are presented indicating amplification and CNAs, however, these are not directly linked to FOXP2 expression.

      3) There are some hints that full-length FOXP2 and the FOXP2-CPED1 function differently. In SF2E the size/number of colonies between full-length FOXP2 and fusion are different. If the assay was run for the same length of time, then it indicates different biologies of the overexpressed FOXP2 and FOXP2-CPED1 fusion. Additionally, in F3E the sensitization is different depending on the transgene.

      We appreciate these valuable comments and constructive remarks. As suggested, we performed CUT&Tag experiments to detect the binding of FOXP2 to MET, and we examined the association of FOXP2 CNAs with its expression. Please refer to the "Answer to Essential Revisions #2 from the Editors" and the "Answer to Essential Revisions #4 from the Editors" for details. We also added detailed information to show the resemblance observed between FOXP2 fusion- and wild-type FOXP2-overexpressing cells. We added the corresponding description to the Results section in Line 487 on Page 22 in the tracked changes version of the revised manuscript. Please refer to the "Answer to Essential Revisions #5 from the Editors" for details.

      2) The relationship between FOXP2 and AR is not explored, which is important given 1) the critical role of the AR in PCa; and 2) the existing relationship between the AR and FOXP2 and other FOX gene members.

      We thank the reviewer very much for highlighting this point. We agree that it is important to examine the relationship between FOXP2 and AR. We therefore analyzed the expression dataset of 255 primary prostate tumors from TCGA and observed that the expression of FOXP2 was significantly correlated with the expression of AR (Spearman's ρ = 0.48, P < 0.001) (Figure 1. a). Next, we observed that both FOXP2- and FOXP2-CPED1-overexpressing 293T cells had a higher AR protein abundance than control cells (Figure 1. b). In addition, shRNA-mediated FOXP2 knockdown in LNCaP cells resulted in a decreased AR protein level compared to that in control cells (Figure 1. c). However, we analyzed our CUT&Tag data and observed no binding of FOXP2 to AR (Figure 1. d). Our data suggest that FOXP2 might be associated with AR expression.

      Figure 1. a. AR expression in a human prostate cancer dataset (TCGA, Prostate Adenocarcinoma, Provisional; n = 493) classified by FOXP2 expression level (bottom 25%, low expression, n = 120; top 25%, high expression, n = 120; negative expression, n = 15). P values were calculated by the Mann-Whitney U test. The correlation between FOXP2 and AR expression was evaluated by determining the Spearman's rank correlation coefficient. b. Immunoblot analysis of the expression levels of AR in 293T cells with overexpression of FOXP2 or FOXP2-CPED1. c. Immunoblot analysis of the expression levels of AR in LNCaP cells with stable expression of the scrambled vector or FOXP2 shRNA. d. CUT&Tag analysis of FOXP2 association with the promoter of AR. Representative track of FOXP2 at the AR gene locus is shown.

      References

      1. Mayr C, Bartel DP. Widespread shortening of 3'UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell. 2009 Aug 21;138(4):673-84.
      2. Gara SK, Jia L, Merino MJ, Agarwal SK, Zhang L, Cam M et al., Germline HABP2 Mutation Causing Familial Nonmedullary Thyroid Cancer. N Engl J Med. 2015 Jul 30;373(5):448-55.
      3. Kohno T, Ichikawa H, Totoki Y, Yasuda K, Hiramoto M, Nammo T et al., KIF5B-RET fusions in lung adenocarcinoma. Nat Med. 2012 Feb 12;18(3):375-7.
      4. Chen F, Byrd AL, Liu J, Flight RM, DuCote TJ, Naughton KJ et al., Polycomb deficiency drives a FOXP2-high aggressive state targetable by epigenetic inhibitors. Nat Commun. 2023 Jan 20;14(1):336.
      5. Kaya-Okur HS, Wu SJ, Codomo CA, Pledger ES, Bryson TD, Henikoff JG et al., CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nat Commun. 2019 Apr 29;10(1):1930.
      6. Spiteri E, Konopka G, Coppola G, Bomar J, Oldham M, Ou J et al., Identification of the transcriptional targets of FOXP2, a gene linked to speech and language, in developing human brain. Am J Hum Genet. 2007 Dec;81(6):1144-57.
      7. Lai CS, Fisher SE, Hurst JA, Vargha-Khadem F, Monaco AP. A forkhead-domain gene is mutated in a severe speech and language disorder. Nature. 2001 Oct 4;413(6855):519-23.
      8. Hannenhalli S, Kaestner KH. The evolution of Fox genes and their role in development and disease. Nat Rev Genet. 2009 Apr;10(4):233-40.
      9. Shu W, Yang H, Zhang L, Lu MM, Morrisey EE. Characterization of a new subfamily of winged-helix/forkhead (Fox) genes that are expressed in the lung and act as transcriptional repressors. J Biol Chem. 2001 Jul 20;276(29):27488-97.
      10. Wang C, Liu H, Qiu Q, Zhang Z, Gu Y, He Z. TCRP1 promotes NIH/3T3 cell transformation by over-activating PDK1 and AKT1. Oncogenesis. 2017 Apr 24;6(4):e323.
      11. Suh YA, Arnold RS, Lassegue B, Shi J, Xu X, Sorescu D et al., Cell transformation by the superoxide-generating oxidase Mox1. Nature. 1999 Sep 2;401(6748):79-82.
    1. Author Response

      Reviewer #1 (Public Review):

      The present study by Zander et al. aims at improving our understanding of CD4+ T cell heterogeneity in response to chronic viral infections. The authors utilize the murine LCMV c13 infection model and perform single cell RNA seq analysis on day 10 post infection to identify multiple, previously unappreciated, T cell subsets. The authors then go on and verify these analyses using multi-color flow cytometry before comparing the transcriptome of CD4 T cells from chronic infection to a previously generated data set of CD4 T cells obtained from acutely-resolved LCMV infection.

      The analyses are very well done and provide some interesting novel insights. In particular, the comparison of CD4 T cell subsets across acute and chronic infections is very exciting as they provide a very valuable platform that can answer a long-standing question: do CD4 T cells in chronic infection undergo exhaustion similar to CD8 T cells. While this has been proposed for an extended period, this new dataset by Zander et al. can provide some novel insights by comparing individual cell subsets cross-infection. The manuscript would, however, benefit from a more extensive analysis and focus on this interesting point.

      We thank the reviewer for their time and careful assessment of our manuscript. We were happy to hear that the reviewer found our work interesting.

      On that note, the authors should take advantage of more accurate and present gene datasets to compare the 'dysfunctional' state of CD4 T cells in chronic infection vs acute infection. Also, a different illustration to demonstrate the module score analyses would be more intuitive.

      We have now included T cell “exhaustion” genesets from recently published data (Zander et al. 2019 Immunity), and we have also displayed the relative expression of select signature genes from these genesets in an updated supplemental figure 3.

      Also, at multiple sections in the manuscript, the authors are missing the accurate citations as they are still mentioned as '(Ref)'.

      We apologize for this oversight and have corrected these citations.

      Nevertheless, this study does not require major revisions.

      Reviewer #2 (Public Review):

      In their study "Delineating the transcriptional landscape and clonal diversity of virus-specific CD4+ T cells during chronic viral infection" Zander and co-workers analyze the phenotypic and clonotypic distributions of T cells specific to a LCMV epitope following infection with a chronic LCMV strain in mice. The paper largely follows an earlier study from the same group (Khatun JEM 2021) that has used a similar experimental strategy to analyze T cells responding to an LCMV strain establishing acute infection, and it adds a scTCRseq component to another earlier study of chronic LCMV (Zander Immunity 2022). The main contributions of the paper are to demonstrate that interesting differences between gene expression profiles between chronic and acute LCMV exist, and to identify a new T cell subset (of unknown functional significance).

      While the paper is framed around differences between T cell responses to acute and chronic infections, all analysis is done on T cells at day 10 post primary infection. At such an early time point even the acute LCMV strain virus is likely not completely cleared, or at the very least viral antigens are still presented. The relevance of the presented phenotypic differences to other settings with long-term chronic infection is thus questionable. Additionally, there are a number of methodological concerns regarding the robustness of the statistical and bioinformatic analyses that put in doubt some of the conclusions. Most notably, the analysis of fate biases needs to be substantiated by tests against baseline expectations from random assortment to test for statistical significance.

      We thank the reviewer for their careful review of our manuscript as well as their helpful comments.

      Regarding the day 10 time point post-LCMV Armstrong infection, several groups have previously reported that LCMV viral load is undetectable by day 10 post-infection (see one published example below). We nonetheless completely agree with the reviewer that viral antigens are still likely being presented at this time point and that inflammation is ongoing. We believe (as discussed further below) that this is actually a strength of the study, as it allows for a fairer comparison of the transcriptional state of recently stimulated virus-specific CD4 T cells under different contexts (acute vs chronic LCMV infection). We chose day 10 post LCMV Cl13 and LCMV Armstrong infection as the time point for analysis because this is approximately the peak of the endogenous Gp66-77 CD4+ T cell response (see previously published data below) and is also when there is a more balanced distribution of Th1, Tfh, and T central memory precursor (Tcmp) or memory-like cells in these settings, thereby allowing for sufficient numbers of cells per cluster to conduct an in-depth analysis and high-resolution comparison of these subsets between the two infections. Further, as some degree of TCR stimulation is still likely being experienced at this time point during LCMV Armstrong infection, we believe that this is a more useful comparison than one at a memory time point (when CD4 T cells are in a quiescent state): it gives a better picture of the differentially expressed genes at the peak of the CD4 T cell response and provides insight into how chronic viral infection perturbs the transcriptional program of CD4 T cells.

    1. Author Response

      Reviewer #2 (Public Review):

      This study evaluates the causal relationship between childhood obesity on the one hand, and childhood emotional and behavioral problems on the other. It applies Mendelian Randomization (MR), a family of methods in statistical genetics that uses genetic markers to break the symmetry between correlated traits, allowing inference of causation rather than mere correlation. The authors argue convincingly that previous studies of these traits, both those using non-genetic observational epidemiology methods and those using standard MR methods, may be confounded by demographic effects and familial effects. One possible example of this kind of confounding is the idea that obesity in parents may contribute to emotional and behavioral problems in children; another is the idea that adults with emotional and behavioral issues may be more likely to have children with partners who are obese, and vice-versa. They then make use of a recently proposed "within-family" MR method, which should effectively control for these confounders, at the cost of higher uncertainty in the estimated effect size, and therefore lower power to detect small effects. They report that none of the previously reported associations of childhood BMI with anxiety, depression, or ADHD are replicated using the within-family MR method, and that in the case of depression the primary association appears to be with maternal BMI rather than the child's own BMI.

      This argument that these confounders may affect these phenotypes is fairly sound, and within-family MR should indeed do a good job of controlling for them. I do not see any major issues with the cohort itself or the choice of genetic instruments. I also do not see any major issues with the definitions or ascertainment of the phenotypes studied, though I am not an expert on any of these phenotypes in particular. I am especially satisfied with the series of analyses demonstrating that the results are robust to many variations of MR methodology. Overall, I think the positive result this study reports is very credible: that the known association between childhood BMI and depression is likely primarily due to an effect of maternal BMI rather than the child's own BMI (though given that paternal BMI has a similar effect size with only a slightly wider confidence interval, I would instead say that the effect is from parental BMI generally, not specifically maternal.)

      In the updated results based on the larger genetic data release, the estimates for the association of maternal BMI and paternal BMI with the child’s depressive symptoms are more clearly different than they were in the smaller dataset (for maternal BMI, beta= 0.11, CI:0.02,0.19, p=0.01; for paternal BMI, beta=0.02, CI:-0.09,0.12, p=0.71). Therefore, in this version, it makes sense to note an association with maternal BMI specifically.

      The main weakness of the study comes from its negative results, which the authors emphasize as their primary conclusion: that previously reported associations of childhood BMI with anxiety, depression, and ADHD are not replicated using within-family MR methods. These claims do not seem justified by the evidence presented in this study. In fact, in every panel of figures 2 and 3, the error bars for the within-family MR analysis encompass the estimates for both the regression analysis and the traditional MR analysis, suggesting that the within-family analysis provides no evidence one way or another about which of these analyses is more accurate. More generally, in order to convincingly claim that there is no causal relationship between two traits, an MR study must argue that the study would be powered to detect a relationship if one existed. Within-family MR methods are known to have less power to detect associations and less precision to estimate effect sizes than traditional MR methods or traditional observational epidemiology methods, so it is not sufficient to show that these other methods have power to detect the association. To make this kind of claim, it is necessary to include some kind of power analysis, such as a simulation study or analytic power calculations, and likely also a positive control to show that this method does have power to detect known effects in this cohort.

      We agree that it is imperative that negative (i.e. “non-significant”) results are correctly interpreted - it is just as important to discover what is unlikely to affect emotional and behavioural outcomes as what does affect them. Negative results (non-significant estimates) are neither a weakness nor strength of the study, but simply reflect the estimation error in our analysis of the data. The key question is whether our within-family MR estimates are sufficiently powered to detect effect sizes of interest or rule out clinically meaningful effect sizes – or are they simply too imprecise to draw any conclusions? As the reviewer suggests, one way to address this is via a post-hoc power calculation. We consider post-hoc power calculations redundant, since all the information about the power of our analysis is reflected in the standard errors and reported confidence intervals. Moreover, any post-hoc power calculation will be necessarily approximate compared to using the standard errors and confidence intervals which we report.

      Despite these methodological reservations, we have conducted simulations to estimate the power of our within-family models (the R code is included at the end of this document). These simulations indicate that we do have sufficient power to detect the size of effects seen for depressive symptoms and ADHD in models using the adult BMI PGS. They also indicate that we cannot rule out smaller effects for non-significant associations (e.g., for the impact of the child’s BMI on anxiety). Naturally, this is entirely consistent with the width of the confidence intervals reported in results tables and in Figures 1 and 2. However, although power calculations are important when planning a study, they make little contribution to interpretation once a study has been conducted and confidence intervals are available (e.g., https://psyarxiv.com/tcqrn/). For this reason, we comment on these simulations in this response to reviewers but do not include them in the manuscript or supplementary materials. At the same time, we have changed the language used in the manuscript to be clearer that the results were imprecise and that values contained within the confidence limits cannot be ruled out.

      For example, the discussion now includes the following:

      ‘However, within-family MR estimates using the childhood body size PGS are still consistent with small effects of the child’s BMI on all outcomes, with upper confidence limits around a 0.2 standard-deviation increase in the outcome per 5 kg/m² increase in BMI.’

      And the conclusion of the paper now reads:

      ‘Our results suggest that genetic variation associated with BMI in adulthood affects a child’s depressive and ADHD symptoms, but genetic variation associated with recalled childhood body size does not substantially affect these outcomes. There was little evidence that BMI affects anxiety. However, our estimates were imprecise, and these differences may be due to estimation error. There was little evidence that parental BMI affects a child’s ADHD or anxiety symptoms, but factors associated with maternal BMI may independently influence a child’s depressive symptoms. Genetic studies using unrelated individuals, or polygenic scores for adult BMI, may have overestimated the causal effects of a child’s own BMI.’
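
      The link between a reported confidence interval and statistical power, which this response relies on, can be made concrete with a short calculation. The following is a minimal sketch in Python (standard library only), not the R code referred to in this response. It assumes a normal-theory, symmetric 95% confidence interval and a two-sided z-test; the function names are hypothetical. The input is the maternal BMI estimate quoted above (beta = 0.11, 95% CI 0.02 to 0.19).

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def se_from_ci(lower, upper, level=0.95):
    """Standard error implied by a symmetric normal-theory confidence interval."""
    z = _STD_NORMAL.inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)


def min_detectable_effect(se, alpha=0.05, power=0.80):
    """Smallest true effect detected with the given power by a two-sided z-test
    (ignoring the negligible chance of rejecting in the wrong direction)."""
    return (_STD_NORMAL.inv_cdf(1 - alpha / 2) + _STD_NORMAL.inv_cdf(power)) * se


def power_for_effect(effect, se, alpha=0.05):
    """Power of a two-sided z-test when the true effect equals `effect`."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / se
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


# Maternal BMI -> child's depressive symptoms, as quoted in this response:
# beta = 0.11, 95% CI 0.02 to 0.19 (rounded values, so results are approximate).
se = se_from_ci(0.02, 0.19)
print(f"implied SE: {se:.3f}")
print(f"effect detectable with 80% power: {min_detectable_effect(se):.3f}")
print(f"power to detect an effect of 0.11: {power_for_effect(0.11, se):.2f}")
```

      Because the inputs are rounded published values, the output is only indicative. The sketch simply shows that the standard error, and hence the range of effects a model could plausibly detect, is recoverable from the interval width alone.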

      Regarding a positive control: for analyses of BMI in adults, suitable positive controls would include directly measured biomarkers such as fat mass or blood pressure or reported medical outcomes like type 2 diabetes. In adolescents and younger adults, age at menarche or other measures of puberty can be used, as these are reliably influenced by BMI. However, the age of the participants for whom within-family effects are being estimated (8 years), together with the lack of any biomarkers such as fat mass (due to the questionnaire-based survey design) mean no suitable measures are available.

      Reviewer #3 (Public Review):

      Higher BMI in childhood is correlated with behavioral problems (e.g. depression and ADHD) and some studies have shown that this relationship may be causal using Mendelian Randomization (MR). However, traditional MR is susceptible to bias due to population stratification, assortative mating, and indirect effects (dynastic effects). To address this issue, Hughes et al. use within-family MR, which should be immune to the above-listed problems. They were unable to find a causal relationship between children's BMI and depression, anxiety, or ADHD. They do, however, report a causal effect of mother's BMI on depression in their children. They conclude that the causal effect of children's BMI on behavioral phenotypes such as depression and anxiety, if present, is very small, and may have been overestimated in previous studies. The analyses have been carried out carefully in a large sample and the paper is presented clearly. Overall, their assertions are justified but given that the conclusions mostly rest on an absence of an effect, I would like to see more discussion on statistical power.

      1) The authors show that the estimates of within-family MR are imprecise. It would be helpful to know how much power they have for estimating effect sizes reported previously given their sample size.

      As discussed in response to a comment from reviewer 2, the power of our results is already indicated by our standard errors and confidence intervals. Nevertheless, we conducted simulations to estimate the size of effects which we had 80% power to detect. Results, presented below, are consistent with our main results. As discussed in response to a comment from reviewer 2, we consider post-hoc power calculations redundant when standard errors and confidence intervals are reported; for this reason, we include this information in the response to reviewers but not the manuscript itself.
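
      A simulation-based power estimate of the kind described here can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the R code used for this response: polygenic scores are standard normal with random mating, the instrument is the child's PGS minus the parental mean (the Mendelian segregation term), and significance is assessed on the reduced-form regression of the outcome on that instrument, a simplification of a full within-family two-stage model. All parameter values (`pgs_effect`, `family_effect`, sample sizes) and function names are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_trios(rng, n_families, beta, pgs_effect, family_effect):
    """Return (instrument, outcome) lists for one simulated cohort of trios."""
    instrument, outcome = [], []
    for _ in range(n_families):
        mother_pgs = rng.gauss(0, 1)
        father_pgs = rng.gauss(0, 1)
        # Mendelian segregation: the part of the child's PGS that is independent
        # of both parents (variance 0.5 under random mating).
        segregation = rng.gauss(0, 0.5 ** 0.5)
        child_pgs = 0.5 * (mother_pgs + father_pgs) + segregation
        # Familial confounding (e.g. indirect parental effects) on both traits.
        family = family_effect * (mother_pgs + father_pgs)
        bmi = pgs_effect * child_pgs + family + rng.gauss(0, 1)
        y = beta * bmi + family + rng.gauss(0, 1)
        # Within-family instrument: child's PGS net of the parental mean.
        instrument.append(child_pgs - 0.5 * (mother_pgs + father_pgs))
        outcome.append(y)
    return instrument, outcome


def slope_z(x, y):
    """z-statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    resid_var = (syy - slope * sxy) / (n - 2)
    return slope / (resid_var / sxx) ** 0.5


def estimate_power(beta, n_families=1000, n_sims=100, pgs_effect=0.3,
                   family_effect=0.2, alpha=0.05, seed=1):
    """Share of simulated cohorts in which the instrument-outcome slope is significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        g, y = simulate_trios(rng, n_families, beta, pgs_effect, family_effect)
        hits += abs(slope_z(g, y)) > z_crit
    return hits / n_sims


if __name__ == "__main__":
    for beta in (0.0, 0.2, 0.5):
        print(f"true effect {beta}: estimated power {estimate_power(beta):.2f}")
```

      With a true effect of zero the rejection rate should sit near the nominal 5%, because the segregation term is independent of the familial confounder. Increasing `beta`, `pgs_effect`, or `n_families` raises the estimated power.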

      2) They used the correlation between PGS and BMI to support the assertion that the former is a strong instrument. Were the reported correlations calculated across all individuals? Since we know that stratification, assortative mating, and indirect effects can inflate these correlations, perhaps a more unbiased estimate would be the proportion of children's BMI variance explained by their PGS conditioned on the parents' PGS. This should also be the estimate used in power calculations.

      The manuscript has been updated to quote Sanderson-Windmeijer conditional R² values: the proportion of BMI variance explained by the BMI PGS for each member of a trio, conditional on the PGS of the other members of the trio, and all genetic covariates included in within-family models. Similarly, we now show Sanderson-Windmeijer conditional F-statistics for a model including the child, mother, and father’s BMI instrumented by the child, mother, and father’s PGS.

      3) In testing the association of mothers' and fathers' BMI with children's symptoms, the authors used a multivariable linear regression conditioning on the child's own BMI. Was the other parent's BMI (either by itself or using the polygenic score) included as a covariate in the multivariable and MR models? This was not entirely clear from the text or from Fig. 2. I suspect that if there were assortative mating on BMI in the parents' generation, the effect of any one parent's BMI on the child's symptoms might be inflated unless the other parent's BMI was included as a covariate (assuming both mother's and father's BMI affect the child's symptoms).

      Non-genetic models include both the mother and father’s phenotypic BMI as well as the child’s, allowing estimation of conditional effects of all three. This controls for assortative mating as noted by the reviewer. This was not previously clear - all relevant text and figure captions have been updated to clarify this.

      4) They report no evidence of cross-trait assortative mating in the parents' generation. The power to detect cross-trait assortative mating in the parents' generation using PGS would depend on the actual strength of assortative mating and the respective proportions of trait variance explained by PGS. Could the authors provide an estimate of the power for this test in their sample?

      We have updated the discussion of assortative mating (in both the results and the discussion section) to note possible limitations of power and to clarify that this approach to examining assortment may not capture its full extent.

      The relevant part of the results section now reads:

      “In the parents’ generation, phenotypes were associated within parental pairs, consistent with assortative mating on these traits (Appendix 1 – Table 5). Adjusted for ancestry and other genetic covariates, maternal and paternal BMI were positively associated (beta: 0.23, 95%CI: 0.22,0.25, p<0.001), as were maternal and paternal depressive symptoms (beta: 0.18, 95%CI: 0.16,0.20, p<0.001), and maternal and paternal ADHD symptoms (beta: 0.11, 95%CI: 0.09,0.13, p<0.001). Consistent with cross-trait assortative mating, there was an association of mother’s BMI with father’s ADHD symptoms (beta: 0.03, 95%CI: 0.02,0.05, p<0.001) and mother’s ADHD symptoms with father’s depressive symptoms (beta: 0.05, 95%CI: 0.05,0.06, p<0.001). Phenotypic associations can reflect the influence of one partner on another as well as selection into partnerships, but regression models of paternal polygenic scores on maternal polygenic scores also pointed to a degree of assortative mating. Adjusted for ancestry and genotyping covariates, there were small associations between parents’ BMI polygenic scores (beta: 0.01, 95%CI: 0.00,0.02, p=0.02 for the adult BMI PGS, and beta: 0.01, 95%CI: 0.00,0.02, p=0.008 for the childhood body size PGS), and of the mother’s childhood body size PGS with the father’s ADHD PGS (beta: 0.01, 95%CI: 0.00,0.02, p=0.03). We did not detect associations with pairs of other polygenic scores, which may be due to insufficient statistical power.”

      And the relevant part of the discussion section now reads:

      “We found some genomic evidence of assortative mating for BMI, and cross-trait assortative mating between BMI and ADHD, but not between other traits. However, associations between polygenic scores, which only capture some of the genetic variation associated with these phenotypes, may not capture the full extent of genetic assortment on these traits.”

      5) Are the actual phenotypes (BMI, depression or ADHD) correlated between the parents? If so, would this not suffice as evidence of cross-trait assortative mating? It is known that the genetic correlation between parents as a result of assortative mating is a function of the correlation in their phenotypes and the heritabilities underlying the two traits (e.g., see Yengo and Visscher 2018). An alternative way to estimate the genetic correlation between parents without using PGS (which is noisy and therefore underpowered) would be to use the phenotypic correlation and heritability estimated using GREML or LDSC. Perhaps this is outside the scope of the paper but I would like to hear the author's thoughts on this.

      Associations between maternal and paternal phenotypes are consistent with a degree of assortative mating (shown below). These results have been added to Appendix 1 - Table 5, which also shows associations between maternal and paternal polygenic scores, and the methods and results have been updated accordingly (see quoted text in response to the comment above). For comparability, both sets of results are based on regression models adjusting for the mother’s and father’s ancestry PCs and genotyping covariates. We agree that analysis of assortative mating using GREML or LDSC is out of scope for this paper. As noted above, we have updated the discussion to acknowledge the limitations of the approach taken:

      ‘We found some genomic evidence of assortative mating for BMI, and cross-trait assortative mating between BMI and ADHD, but not between other traits. However, associations between polygenic scores, which only capture some of the genetic variation associated with these phenotypes, may not capture the full extent of genetic assortment on these traits.’

      6) It would be helpful to include power calculations for the MR-Egger intercept estimates.

      As with our response to the comments above, post-hoc power calculations are redundant, as all the information about the power of our analysis, including that of the MR-Egger models, is indicated by the standard errors and confidence intervals. MR-Egger is less precise than other estimators, as is made clear from the wide confidence intervals reported in the relevant tables (Appendix 1 - Tables 8 and 9). However, we have now updated the discussion to give more weight to this as a limitation. The discussion of pleiotropy in the final paragraph of the discussion now reads:

      ‘While robustness checks found little evidence of pleiotropy, these methods rely on assumptions. Moreover, MR-Egger is known to give imprecise estimates (Burgess and Thompson 2017), and confidence intervals from MR-Egger models were wide. Thus, pleiotropy cannot be ruled out.’

      Similarly, we have updated the relevant line of the results section, which now reads:

      ‘MR-Egger models found little evidence of horizontal pleiotropy, although MR-Egger estimates were imprecise (Appendix 1 - Tables 8 and 9).’

      7) Finally, what is the correlation between PGS and genetic PCs/geography in their sample? A correlation might provide evidence to support the point that classic MR effects are inflated due to stratification.

      Figures presenting the association of the child’s BMI polygenic scores and their PCs have been added to the supplementary information as Appendix 1 - Figure 2 and Appendix 1 - Figure 3. Consistent with an influence of residual stratification, a regression of the child’s BMI polygenic scores against their ancestry PCs (adjusting for genotyping centre and chip) found that 7 of the 20 PCs were associated at p<0.05 with the adult BMI PGS, and 8 of 20 with the childhood body size PGS (under the null hypothesis, we would expect one association in each case). When parental polygenic scores were added to the models, these associations attenuated towards the null.
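
      The null expectation quoted in this reply (one association among 20 PCs at p<0.05) follows from simple arithmetic, and the same reasoning shows how unlikely 7 or 8 associations would be by chance. The sketch below is a minimal Python illustration using only the standard library. It assumes the 20 tests are independent, which is an approximation, and the helper name `binom_tail` is hypothetical.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_pcs, alpha = 20, 0.05

# Expected number of PCs reaching p < 0.05 if none is truly associated.
expected_hits = n_pcs * alpha
print(f"expected associations under the null: {expected_hits:.1f}")

# Chance of seeing at least as many associations as reported, under the null.
print(f"P(at least 7 of 20): {binom_tail(7, n_pcs, alpha):.1e}")
print(f"P(at least 8 of 20): {binom_tail(8, n_pcs, alpha):.1e}")
```

      Under these assumptions, both observed counts fall far in the upper tail of the null distribution, in line with the interpretation of residual stratification given here.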

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript seeks to identify the mechanism underlying priority effects in a plant-microbe-pollinator model system and to explore its evolutionary and functional consequences. The manuscript first documents alternative community states in the wild: flowers tend to be strongly dominated by either bacteria or yeast but not both. Then lab experiments are used to show that bacteria lower the nectar pH, which inhibits yeast - thereby identifying a mechanism for the observed priority effect. The authors then perform an experimental evolution experiment which shows that yeast can evolve tolerance to a lower pH. Finally, the authors show that low-pH nectar reduces pollinator consumption, suggesting a functional impact on the plant-pollinator system. Together, these multiple lines of evidence build a strong case that pH has far-reaching effects on the microbial community and beyond.

      The paper is notable for the diverse approaches taken, including field observations, lab microbial competition and evolution experiments, genome resequencing of evolved strains, and field experiments with artificial flowers and nectar. This breadth can sometimes seem a bit overwhelming. The model system has been well developed by this group and is simple enough to dissect but also relevant and realistic. Whether the mechanism and interactions observed in this system can be extrapolated to other systems remains to be seen. The experimental design is generally sound. In terms of methods, the abundance of bacteria and yeast is measured using colony counts, and given that most microbes are uncultivable, it is important to show that these colony counts reflect true cell abundance in the nectar.

      We have revised the text to address the relationship between cell counts and colony counts with nectar microbes. Specifically, we point out that our previous work (Peay et al. 2012) established a close correlation between CFUs and cell densities (r2 = 0.76) for six species of nectar yeasts isolated from D. aurantiacus nectar at Jasper Ridge, including M. reukaufii.

      As for A. nectaris, we used a flow cytometric sorting technique to examine the relationship between cell density and CFU (figure supplement 1). This result should be viewed as preliminary given the low level of replication, but this relationship also appears to be linear, as shown below, indicating that colony counts likely reflect true cell abundance of this species in nectar.

      It remains uncertain how closely CFU reflects total cell abundance of the entire bacterial and fungal community in nectar. However, a close association is possible and may be even likely given the data above, showing a close correlation between CFU and total cell count for several yeast species and A. nectaris, which are indicated by our data to be dominant species in nectar.

      We have added the above points in the manuscript (lines 263-264, 938-932).

      The genome resequencing to identify pH-driven mutations is, in my mind, the least connected and developed part of the manuscript, and could be removed to sharpen and shorten the manuscript.

      We appreciate this perspective. However, given the disagreement between this perspective and reviewer 2’s, which asks for a more expanded section, we have decided to add a few additional lines (lines 628-637), briefly expanding on the genomic differences between strains evolved in bacteria-conditioned nectar and those evolved in low-pH nectar.

      Overall, I think the authors achieve their aims of identifying a mechanism (pH) for the priority effect of early-colonizing bacteria on later-arriving yeast. The evolution and pollinator experiments show that pH has the potential for broader effects too. It is surprising that the authors do not discuss the inverse priority effect of early-arriving yeast on later-arriving bacteria, beyond a supplemental figure. Understandably this part of the story may warrant a separate manuscript.

      We would like to point out that, in our original manuscript, we did discuss the inverse priority effects, referring to relevant findings that we previously reported (Tucker and Fukami 2014, Dhami et al. 2016 and 2018, Vannette and Fukami 2018). Specifically, we wrote that: “when yeast arrive first to nectar, they deplete nutrients such as amino acids and limit subsequent bacterial growth, thereby avoiding pH-driven suppression that would happen if bacteria were initially more abundant (Tucker and Fukami 2014; Vannette and Fukami 2018)” (lines 385-388). However, we now realize that this brief mention of the inverse priority effects was not sufficiently linked to our motivation for focusing mainly on the priority effects of bacteria on yeast in the present paper. Accordingly, we added the following sentences: “Since our previous papers sought to elucidate priority effects of early-arriving yeast, here we focus primarily on the other side of the priority effects, where initial dominance of bacteria inhibits yeast growth.” (lines 398-401).

      I anticipate this paper will have a significant impact because it is a nice model for how one might identify and validate a mechanism for community-level interactions. I suspect it will be cited as a rare example of the mechanistic basis of priority effects, even across many systems (not just pollinator-microbe systems). It illustrates nicely a more general ecological phenomenon and is presented in a way that is accessible to a broader audience.

      Thank you for this positive assessment.

      Reviewer #2 (Public Review):

      The manuscript "pH as an eco-evolutionary driver of priority effects" by Chappell et al illustrates how a single driver, microbial-induced pH change, can affect multiple levels of species interactions including microbial community structure, microbial evolutionary change, and hummingbird nectar consumption (potentially influencing both microbial dispersal and plant reproduction). It is an elegant study with different interacting parts: from laboratory to field experiments addressing mechanism, condition, evolution, and functional consequences. It will likely be of interest to a wide audience and has implications for microbial, plant, and animal ecology and evolution.

      This is a well-written manuscript, with generally clear and informative figures. It represents a large body and variety of work that is novel and relevant (all major strengths).

      We appreciate this positive assessment.

      Overall, the authors' claims and conclusions are justified by the data. There are a few things that could be addressed in more detail in the manuscript. The most important weakness in terms of lack of information/discussion is that it looks like there are just as many or more genomic differences between the bacterial-conditioned evolved strains and the low-pH evolved strains than there are between these and the normal nectar media evolved strains. I don't think this negates the main conclusion that pH is the primary driver of priority effects in this system, but it does open the question of what you are missing when you focus only on pH. I would like to see a discussion of the differences between bacteria-conditioned vs. low-pH evolved strains.

      We agree with the reviewer and have included an expanded discussion in the revised manuscript [lines 628-637]. Specifically, to show overall genomic variation between treatments, we calculated genome-wide Fst comparing the various nectar conditions. We found that Fst was 0.0013, 0.0014, and 0.0015 for the low-pH vs. normal, low pH vs. bacteria-conditioned, and bacteria-conditioned vs. normal comparisons, respectively. The similarity between all treatments suggests that the differences between bacteria-conditioned and low pH are comparable to each treatment compared to normal. This result highlights that, although our phenotypic data suggest alterations to pH as the most important factor for this priority effect, it still may be one of many affecting the coevolutionary dynamics of wild yeast in the microbial communities they are part of. In the full community context in which these microbes grow in the field, multi-species interactions, environmental microclimates, etc. likely also play a role in rapid adaptation of these microbes which was not investigated in the current study.

      Based on this overall picture, we have included additional discussion focusing on the effect of pH on evolution of stronger resistance to priority effects. We compared genomic differences between bacteria-conditioned and low-pH evolved strains, drawing the reader’s attention to specific differences in source data 14-15. Loci that varied between the low pH and bacteria-conditioned treatments occurred in genes associated with protein folding, amino acid biosynthesis, and metabolism.

      Reviewer #3 (Public Review):

      This work seeks to identify a common factor governing priority effects, including mechanism, condition, evolution, and functional consequences. It is suggested that environmental pH is the main factor that explains various aspects of priority effects across levels of biological organization. Building upon this well-studied nectar microbiome system, it is suggested that pH-mediated priority effects give rise to bacterial and yeast dominance as alternative community states. Furthermore, pH determines both the strengths and limits of priority effects through rapid evolution, with functional consequences for the host plant's reproduction. These data contribute to ongoing discussions of deterministic and stochastic drivers of community assembly processes.

      Strengths:

      Provides multiple lines of field and laboratory evidence to show that pH is the main factor shaping priority effects in the nectar microbiome. Field surveys characterize the distribution of microbial communities with flowers frequently dominated by either bacteria or yeast, suggesting that inhibitory priority effects explain these patterns. Microcosm experiments showed that A. nectaris (bacteria) showed negative inhibitory priority effects against M. reukaufii (yeast). Furthermore, high densities of bacteria were correlated with lower pH potentially due to bacteria-induced reduction in nectar pH. Experimental evolution showed that yeast evolved in low-pH and bacteria-conditioned treatments were less affected by priority effects as compared to ancestral yeast populations. This potentially explains the variation of bacteria-dominated flowers observed in the field, as yeast rapidly evolves resistance to bacterial priority effects. Genome sequencing further reveals that phenotypic changes in low-pH and bacteria-conditioned nectar treatments corresponded to genomic variation. Lastly, a field experiment showed that low nectar pH reduced flower visitation by hummingbirds. pH not only affected microbial priority effects but also has functional consequences for host plants.

      We appreciate this positive assessment.

      Weaknesses:

      The conclusions of this paper are generally well-supported by the data, but some aspects of the experiments and analysis need to be clarified and expanded.

      The authors imply that in their field surveys flowers were frequently dominated by bacteria or yeast, but rarely together. The authors argue that the distributional patterns of bacteria and yeast are therefore indicative of alternative states. In each of the 12 sites, 96 flowers were sampled for nectar microbes. However, it's unclear to what degree the spatial proximity of flowers within each of the sampled sites biased the observed distribution patterns. Furthermore, seasonal patterns may also influence microbial distribution patterns, especially in the case of co-dominated flowers. Temperature and moisture might influence the dominance patterns of bacteria and yeast.

      We agree that these factors could potentially explain the presented results. Accordingly, we conducted spatial and seasonal analyses of the data, which we detail below and include in two new paragraphs in the manuscript [lines 290-309].

      First, to determine whether spatial proximity influenced yeast and bacterial CFUs, we regressed the geographic distance between all possible pairs of plants to the difference in bacterial or fungal abundance between the paired plants. If plant location affected microbial abundance, one should see a positive relationship between distance and the difference in microbial abundance between a given pair of plants: a pair of plants that were more distantly located from each other should be, on average, more different in microbial abundance. Contrary to this expectation, we found no significant relationship between distance and the difference in bacterial colonization (A, p=0.07, R2=0.0003) and a small negative association between distance and the difference in fungal colonization (B, p<0.05, R2=0.004). Thus, there was no obvious overall spatial pattern in whether flowers were dominated by yeast or bacteria.
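
      The pairwise regression described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes planar plant coordinates, Euclidean distance, and an ordinary least-squares fit, and the function name and toy data are hypothetical. Because every plant contributes to many pairs, the pairs are not independent, so a significance test on such data would usually need a permutation approach (for example, a Mantel test) rather than the naive regression p-value.

```python
import math
from itertools import combinations


def pairwise_distance_regression(coords, abundance):
    """Regress |abundance difference| on geographic distance over all plant pairs.

    coords    -- list of (x, y) plant positions (assumed planar, same units)
    abundance -- list of microbial abundances (e.g. log CFU), one per plant
    Returns (slope, r_squared) of an ordinary least-squares fit.
    """
    if len(coords) != len(abundance) or len(coords) < 3:
        raise ValueError("need matching inputs for at least three plants")

    distances, differences = [], []
    for i, j in combinations(range(len(coords)), 2):
        distances.append(math.dist(coords[i], coords[j]))
        differences.append(abs(abundance[i] - abundance[j]))

    n = len(distances)
    mean_d = sum(distances) / n
    mean_a = sum(differences) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    syy = sum((a - mean_a) ** 2 for a in differences)
    sxy = sum((d - mean_d) * (a - mean_a) for d, a in zip(distances, differences))
    if sxx == 0:
        raise ValueError("all pairwise distances are identical")

    slope = sxy / sxx
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return slope, r_squared


# Hypothetical toy data: abundance increases linearly along a transect,
# so more distant plants differ more and the slope is positive.
toy_coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
toy_abundance = [0.0, 2.0, 4.0, 6.0]
slope, r_squared = pairwise_distance_regression(toy_coords, toy_abundance)
```

      A slope near zero with a very small R2, as reported above for bacterial colonization, is what one would expect if location did not structure abundance.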

      Next, to determine whether climatic factors or seasonality affected the colonization of bacteria and yeast per plant, we used a linear mixed model predicting the average bacteria and yeast density per plant from average annual temperature, temperature seasonality, and annual precipitation at each site, the date the site was sampled, and the site location and plant as nested random effects. We found that none of these variables were significantly associated with the density of bacteria and yeast in each plant.

      To look at seasonality, we also re-ordered Fig 2C, which shows the abundance of bacteria- and yeast-dominated flowers at each site, so that the sites are now listed in order of sampling dates. In this re-ordered figure, there is no obvious trend in the number of flowers dominated by yeast throughout the period sampled (6/23 to 7/9), giving additional indication that seasonality was unlikely to affect the results.

      Additionally, sampling date does not seem to strongly predict bacterial or fungal density within each flower when plotted.

      These additional analyses, now included (figure supplements 2-4) and described (lines 290-309) in the manuscript, indicate that the observed microbial distribution patterns are unlikely to have been strongly influenced by spatial proximity, temperature, moisture, or seasonality, reinforcing the possibility that the distribution patterns instead indicate bacterial and yeast dominance as alternative stable states.

      The authors exposed yeast to nectar treatments varying in pH levels. Using experimental evolution approaches, the authors determined that yeast grown in low pH nectar treatments were more resistant to priority effects by bacteria. The metric used to determine the bacteria's priority effect strength on yeast does not seem to take into account factors that limit growth, such as the environmental carrying capacity. In addition, yeast evolves in normal (pH =6) and low pH (3) nectar treatments, but it's unclear how resistance differs across a range of pH levels (ranging from low to high pH) and affects the cost of yeast resistance to bacteria priority effects. The cost of resistance may influence yeast life-history traits.

      The strength of bacterial priority effects on yeast was calculated using the metric we previously published in Vannette and Fukami (2014): PE = log(BY/(-Y)) - log(YB/(Y-)), where BY and YB represent the final yeast density when early arrival (day 0 of the experiment) was by bacteria or yeast, followed by late arrival by yeast or bacteria (day 2), respectively, and -Y and Y- represent the final density of yeast in monoculture when they were introduced late or early, respectively. This metric does not incorporate carrying capacity. However, it does compare how each microbial species grows alone, relative to growth before or after a competitor. In this way, our metric compares environmental differences between treatments while also taking into account growth differences between strains.
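
      As a worked illustration of the metric above, the following sketch computes PE from four final yeast densities. The function and argument names are hypothetical, the densities are invented for the example, and the natural logarithm is an assumption, since the base of the logarithm is not stated here.

```python
import math


def priority_effect_strength(by, late_mono, yb, early_mono):
    """PE = log(BY / -Y) - log(YB / Y-).

    by         -- final yeast density when bacteria arrived first (BY)
    late_mono  -- final yeast density in monoculture, yeast introduced late (-Y)
    yb         -- final yeast density when yeast arrived first (YB)
    early_mono -- final yeast density in monoculture, yeast introduced early (Y-)
    The natural log is assumed; another base only rescales the value.
    """
    if min(by, late_mono, yb, early_mono) <= 0:
        raise ValueError("all densities must be positive to take logarithms")
    return math.log(by / late_mono) - math.log(yb / early_mono)


# Hypothetical densities (cells per microliter), not data from the study.
pe = priority_effect_strength(by=1e2, late_mono=1e4, yb=5e3, early_mono=1e4)
```

      With these made-up numbers PE is negative: relative to its own monoculture, yeast did worse when bacteria arrived first than when yeast arrived first. Because each term is a ratio to the matching monoculture, a strain that simply grows faster or slower overall does not by itself shift PE.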

      Here we also present additional growth data to address the reviewer’s point about carrying capacity. Our experiments that compared ancestral and evolved yeast were conducted over the course of two days of growth. In preliminary monoculture growth experiments of each evolved strain, we found that yeast populations did reach carrying capacity over the course of the two-day experiment and population size declined or stayed constant after three and four days of growth.

      However, we found no significant difference in monoculture growth between the ancestral strains and any of the evolved strains, as shown in Figure supplement 12B. This lack of significant difference in monoculture suggests that differences in intrinsic growth rate do not fully explain the priority effects results we present. Instead, differences in growth were specific to yeast’s response to early arrival by bacteria.

      We also appreciate the reviewer’s comment about how yeast evolves resistance across a range of pH levels, as well as the effect of pH on yeast life-history traits. In fact, reviewer #2 pointed out an interesting trade-off in life history traits between growth and resistance to priority effects that we now include in the discussion (lines 535-551) as well as a figure in the manuscript (Figure 8).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors):

      The brain-machine interface used in this study differs from typical BMIs in that it's not intended to give subjects voluntary control over their environment. However, it is possible that rats may become aware of their ability to manipulate trial start times using their neural activity. Is there any evidence that the time required to initiate trials on high-coherence or low-coherence trials decreases with experience?

      This is a great question. First, we designed the experiment to avoid this possibility. Rats were experienced on the sequence of the automatic maze both pre and post implantation (totaling weeks of pre-training and habituation). As such, the majority of the trials ever experienced by the rat were not controlled by their neural activity. During BMI experimentation, only 10% of trials were triggered during high coherence states and 10% for low coherence states, leaving ~80% of trials not controlled by their neural activity. We also implemented a pseudo-randomized trial sequence. When considered together, we specifically designed this experiment to avoid the possibility that rats would actively use their neural activity to control the maze.

      Second, we had a similar question when collecting data for this manuscript and so we conducted a pilot experiment. We took 3 rats from experiment #1 (after its completion) and we required them to perform “forced-runs” over the course of 3-4 days, a task where rats navigate to a reward zone and are rewarded with a chocolate pellet. The trajectory on “forced-runs” is predetermined and rats were always rewarded for navigating along the predetermined route. Every trial was initiated by strong mPFC-hippocampal theta coherence. We were curious as to whether time-to-trial-onset would decrease if we repeatedly paired trial onset to strong mPFC-hippocampal theta coherence. 1 out of 3 rats (rat 21-35) showed a significant correlation between time-to-trial onset and trial number, indicating that our threshold for strong mPFC-hippocampal theta coherence was being met more quickly with experience (Figure R1A). When looking over sessions and rats, there was considerable variability in the magnitude of this correlation and sometimes even the direction (Figure R1B). As such, the degree to which rat 21-35 was aware of controlling the environment by reaching strong mPFC-hippocampal theta coherence is unclear, but this question requires future experimentation.

      Author response image 1.

      Strong mPFC-hippocampal theta coherence was used to control trial onset for the entirety of forced-navigation sessions. Time-to-trial onset is a measurement of how long it took for strong coherence to be met. A) Time-to-trial onset was averaged across sessions for each rat, then plotted as a function of trial number (within-session experience on the forced-runs task). Rat 21-35 showed a significant negative correlation between time-to-trial onset and trial number, indicating that time-to-coherence reduced with experience. The rest of the rats did not display this effect. B) Correlation between trial-onset and trial number (y-axis; see A) across sessions (x-axis). A majority of sessions showed a negative correlation between time-to-trial onset and trial number, like what was seen in (A), but the magnitude and sometimes direction of this effect varied considerably even within an animal.
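
      The relationship summarized in this caption can be illustrated with a short sketch that correlates time-to-trial onset with trial number. The type of correlation is not specified here, so the Pearson coefficient is an assumption, and the function name and values are hypothetical rather than taken from the recordings.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical session: seconds until the coherence threshold was met on
# each successive trial. A negative r means onset came sooner with experience.
trial_number = [1, 2, 3, 4, 5, 6]
time_to_onset = [12.0, 10.5, 9.0, 9.5, 7.0, 6.5]
r = pearson_r(trial_number, time_to_onset)
```

      Computing r once per session, as in panel B, shows how the sign and magnitude of the effect vary within and across animals.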

      Is there any evidence that rats display better performance on trials with random delays in which HPC-PFC coherence was naturally elevated?

      This question is now addressed in Extended Figure 5 and discussed in the section titled “strong prefrontal-hippocampal theta coherence leads to correct choices on a spatial working memory task”.

      The introduction frames this study as a test of the "communication through coherence" hypothesis. In its strongest form, this hypothesis states that oscillatory synchronization is a pre-requisite for inter-areal communication, i.e. if two areas are not synchronized, they cannot transfer information. Recent experimental evidence shows this relationship is more likely inverted: coherence is a consequence of inter-areal interactions, rather than a cause. See Schneider et al. (DOI: 10.1016/j.neuron.2021.09.037) and Vinck et al. (10.1016/j.neuron.2023.03.015) for a more in-depth explanation of this distinction. The authors should expand their treatment of this hypothesis in light of these findings.

      Our introduction and discussions have sections dedicated to these studies now.

      Figure 6 - It would be much more intuitive to use the labels "Rat 1", "Rat 2", and "Rat 3"; the "21-4X" identifiers are confusing.

      This was corrected in the paper.

      Figure 6C - The sub-plots within this figure are rather small and difficult to interpret. The figure would be easier to parse if the data were presented as a heatmap of the ratio of theta power during blue vs. red stim, with each pixel corresponding to one channel.

      This suggestion was implemented in the paper. See Fig 6C. Extended Fig. 8 now shows the power spectra as a function of recording shank and channel.

      Ext. Figure 2B - What happens during an acquisition failure? Instead of "Amount of LFP data," consider using "Buffer size".

      Corrected.

      Ext. Figure 2D-E - Instead of "Amount of data," consider using "Window size"

      Referred to as buffer size.

      Ext. Figure 2E - y-axis should extend down to 4 Hz. Are all of the last four values exactly at 8 Hz?

      Yes. Values plateau at 8Hz. These data represent an average over ~50 samples.

      Ext. Figure 2F - consider moving this before D/E, since those panels are summaries of panel F

      Corrected.

      Ext. Figure 4A - ANOVA tells you that accuracy is impacted by delay duration, but not what that impact is. A post-hoc test is required to show that long delays lead to lower accuracy than short ones. Alternatively, one could compute the correlation between delay duration and proportion correct for each mouse, and look for significant negative values.

      We included supplemental analyses in Extended Fig. 4

      Reviewer #2 (Recommendations For The Authors):

      The authors should replace terms that suggest a causal relationship between PFC-HPC synchrony and behavior, such as 'leads to', 'biases', and 'enhances' with more neutral terms.

      Causal implications were toned down and wherever “leads” or “led” remains, we specifically mean in the context of coherence being detected prior to a choice being made.

      The rationale for the analysis described in the paragraph starting on line 324, and how it fits with the preceding results, was not clear to me. The authors also write at the start of this paragraph "Given that mPFC-hippocampal theta coherence fluctuated in a periodical manner (Extended Fig. 5B)", but this figure only shows example data from 2 trials.

      The reviewer is correct. While we point towards 3 examples in the manuscript now, we focused this section on the autocorrelation analysis, which did not support our observation as we noticed a rather linear decay in correlation over time. As such, the periodicity observed was almost certainly a consequence of overlapping data in the epochs used to calculate coherence rather than intrinsic periodicity.
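
      To make the distinction concrete, the sketch below computes a normalized autocorrelation and applies it to a synthetic sinusoid. It is an illustration under stated assumptions, not the authors' analysis: the function name, the lag-0 normalization, and the test signal are all hypothetical. A genuinely periodic signal produces a second peak at the lag equal to its period, whereas a signal that merely shares overlapping data between neighbouring epochs decays with lag and shows no such peak.

```python
import math


def autocorrelation(signal, max_lag):
    """Autocorrelation at lags 0..max_lag, normalized so that lag 0 equals 1."""
    n = len(signal)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the signal length")
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    lag0 = sum(c * c for c in centered)
    if lag0 == 0:
        raise ValueError("autocorrelation is undefined for a constant signal")
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / lag0
        for lag in range(max_lag + 1)
    ]


# Synthetic periodic signal with a period of 20 samples (hypothetical).
periodic = [math.sin(2 * math.pi * t / 20) for t in range(200)]
acf = autocorrelation(periodic, max_lag=30)
# For this signal the autocorrelation dips negative near lag 10 and rises
# again near lag 20; a monotonic decay would show no such rebound.
```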

      Shortly after the start of the results section (line 112), the authors go into a very detailed description of how they validated their BMI without first describing what the BMI actually does. This made this and the subsequent paragraphs difficult to follow. I suggest the authors start with a general description of the BMI (and the general experiment) before going into the details.

      Corrected. See first paragraph of “Development of a closed-loop…”.

      In Figure 2C, as expected, around the onset of 'high' coherence trials, there is an increase in theta coherence but this appears to be very transient. However, it is unclear what the heatmap represents: is it a single trial, single session, an average across animals, or something else? In Figure 3F, however, the increase appears to be much more sustained.

      The unit of sample size (n) was rats for every panel in this figure. This was clarified at the end of Fig. 3.

      In Figure 2D, it was not clear to me what units of measurement are used when the averages and error bars are calculated. What is the 'n' here? Animals or sessions? This should be made clear in this figure as well as in other figures.

      The sample size is rats. This is now clarified at the end of Fig 2.

      Describing the study of Jones and Wilson (2005), the authors write: "While foundational, this study treated the dependent variable (choice accuracy) as independent to test the effect of choice outcome on task performance." (line 83) It was not clear to me what is meant by "dependent" and "independent" here. Explaining this more clearly might clarify how the authors' study goes beyond this and other previous studies.

      The reviewer is correct. A discussion on independent/dependent variables in the context of rationale for our experiment was removed.

      Reviewer #3 (Recommendations For The Authors):

      As explained in the public review, my comments mainly concern the interpretation of the experimental paradigm and its link with previous findings. I think modifying these in order to target the specific advance allowed by the paradigm would really improve the match between the experimental and analytical data that is very solid and the author's conclusions.

      Concerning the paradigm, I recommend that the authors focus more on their novel ability to clearly dissociate the functional role of theta coherence prior to the choice as opposed to induced by the choice. Currently, they explain by contrasting previous studies based on dependent variables whereas their approach uses an independent variable. I was a bit confused by this, particularly because the task variable is not really independent given that it's based on a brain-driven loop. Since theta coherence remains correlated with many other neurophysiological variables, the results cannot go beyond showing that leading up to the decision it correlates with good choice accuracy, without providing evidence that it is theta coherence itself that enhances this accuracy as they suggest in lines 93-94.

      The reviewer is correct. A discussion on independent/dependent variables in the context of rationale for our experiment was removed.

      Regarding previous results with muscimol inactivation, I recommend that the authors expand their discussion on this point. I think that their correlative data is not sufficient to conclude as they do that despite "these structures being deemed unnecessary" (based on causal muscimol experiments), they "can still contribute rather significantly" since their findings do not show a contribution, merely a correlation. This extra discussion could include possible explanations of the apparent, and thought-provoking discrepancies that they uncover such as: theta coherence may be a correlate of good accuracy without an underlying causal relation, theta coherence may always correlate with good accuracy but only be causally important in some tasks related to spatial working memory or, since muscimol experiments leave the brain time to adapt to the inactivation, redundancy between brain areas may mask their implication in the physiological context in certain tasks (see Goshen et al 2011).

      The second paragraph of the discussion is now dedicated to this.

      Possible further analysis :

      • In Extended 4A the authors show that performance drops with delay duration. It would be very interesting to see this graph with the high coherence / low coherence / yoked trials to see if the theta coherence is most important for longer trials for example.

      This is a great suggestion. Due to 10% of trials being triggered by high coherence states, our sample size precludes a robust analysis as suggested. Given that we found an enhancement effect on a task with minimal spatial working memory requirements (Fig. 4), it seems that coherence may be a general benefit or consequence of choice processes. Nonetheless, this remains an important question to address in a future study.

      • Figure 6: The authors explain in the text that although the effect of stimulation of VMT is variable, overall VMT activation increased PFC-HPC coherence. I think in the figure the results are only shown for one rat and session per panel. It would be interesting to add a figure including their whole data set to show the overall effect as well as the variability.

      The reviewer is correct and this comment prompted a significant addition of detail to the manuscript. We have added an extended figure (Ext. Fig. 9) showing our VMT stimulation recording sessions. We originally did not include these because we were performing a parameter search to understand whether VMT stimulation could increase mPFC-hippocampal theta coherence. The results section was expanded accordingly.

      Changes to writing / figures :

      • The paper by Eliav et al, 2018 is cited to illustrate the universality of coupling between hippocampal rhythms and spikes whereas the main finding of this paper is that spikes lock to non-rhythmic LFP in the bat hippocampus. It seems inappropriate to cite this paper in the sentence on line 65.

      We agree with the reviewer and this citation was removed.

      • Line 180 when explaining the protocol, it would help comprehension if the authors clearly stated that "trial initiation" means opening the door to allow the rat to make its choice. I was initially unfamiliar with the paradigm and didn't figure this out immediately.

      We added a description to the second paragraph of our first results section.

      • Lines 324 and following: the analysis shows that there is a slow decay over around 2s of the theta coherence but not that it is periodical (as in regularly occurring in time), this would require the auto-correlation to show another bump at the timescale corresponding to the period of the signal. I recommend the authors use a different terminology.

      This comment is now addressed above in our response to Reviewer #2.

      • Lines 344: I am not sure why the stable theta coherence levels during the fixed delay phase show that the link with task performance is "through mechanisms specific to choice". Could the authors elaborate on this?

      We elaborated on this point further at the end of “Trials initiated by strong prefrontal-hippocampal theta coherence are characterized by prominent prefrontal theta rhythms and heightened pre-choice prefrontal-hippocampal synchrony”

      • Line 85: "independent to test the effect of choice outcome on task performance." I think there is a typo here and "choice outcome" should be "theta coherence".

      The sentence was removed in the updated draft.

    1. Author Response

      Reviewer #2 (Public Review):

      Activation of TEAD-dependent transcription by YAP/TAZ has been implicated in the development and progression of a significant number of malignancies. For example, loss of function mutations in NF2 or LATS1/2 (known upstream regulators that promote YAP phosphorylation and its retention and degradation in the cytoplasm) promote YAP nuclear entry and association with TEAD to drive oncogenic gene transcription and occurs in >70% of mesothelioma patients. High levels of nuclear YAP have also been reported for a number of other cancer cell types. As such, the YAP-TEAD complex represents a promising target for drug discovery and therapeutic intervention. Based on the recently reported essential functional role for TEAD palmitoylation at a conserved cysteine site, several groups have successfully targeted this site using both reversible binding non-covalent TEAD inhibitors (i.e., flufenamic acid (FA), MGH-CP1, compound 2 and VT101~107), as well as covalent TEAD inhibitors (i.e., TED-347, DC-TEADin02, and K-975), which have been demonstrated to inhibit YAP-TEAD function and display antitumor activity in cells and in vivo.

      Here, Fan et al. disclose the development of covalent TEAD inhibitors and report on the therapeutic potential of this class of agents in the treatment of TEAD-YAP-driven cancers (e.g., malignant pleural mesothelioma (MPM)). Optimized derivatives of a previously reported flufenamic acid-based acrylamide electrophilic warhead-containing TEAD inhibitor (MYF-01-37, Kurppa et al. 2020 Cancer Cell), which display improved biochemical- and cell-based potency or mouse pharmacokinetic profiles (MYF-03-69 and MYF-03-176), are described and characterized.

      Strengths:

      All of the authors' claims and conclusions are very well supported and justified by the data that are provided. Clear improvements in biochemical- and cell-based potencies have been made within the compound series. Cell-based selective activities in the HIPPO pathway defective versus normal/control cell types are established. Transcriptional effects and the regulation of BMF proapoptotic mRNA levels are characterized. A 1.68 Å X-ray co-crystal structure of MYF-03-69 covalently bound to TEAD1 via Cys359 is provided. In vivo efficacy in a relevant xenograft is demonstrated, using a 30 mg/kg, BID PO dose.

      We thank the reviewer for appreciating and highlighting the strengths of our study.

      Weaknesses:

      Beyond the impact on BMF gene regulation, new biological insights reported here for this compound series are moderate. Progress and differentiation with respect to activity and/or ADME PK profiles relative to the very closely related and previously described (Keneda et al. 2020 Am J Cancer Res 10:4399. PMID 33415007) acrylamide-based covalent TEAD inhibitor K-975 (identical 11 nM cell-based potencies when compared head-to-head and identical reported in vivo efficacy doses of 30 mg/kg) is not entirely clear. Demonstration of on-target in vivo activity is lacking (e.g., impact on BMF gene expression at the evaluated exposure levels).

      We thank the reviewer for the question. We have compared the mouse liver microsome stability and hepatocyte stability of K-975 and MYF-03-176 and found that K-975 is metabolically less stable.

      Consistently, when NCI-H226 cell-derived xenograft mice were dosed with 30 mg/kg K-975 twice daily, the tumors kept growing and reached more than 1.5-fold their initial volume by day 14. At the same dosage, in contrast, MYF-03-176 produced significant tumor regression. K-975 did not reach such efficacy even with 100 or 300 mg/kg twice daily, in either the NCI-H226 or the MSTO-211H CDX mouse model, according to the paper (Keneda et al. 2020 Am J Cancer Res 10:4399).

      To demonstrate on-target in vivo activity, we tested expression of the TEAD downstream genes and BMF in tumor samples after 3-day BID treatment (PD study). We observed a reduction of CTGF, CYR61, and ANKRD1 and an increase of BMF, which indicates on-target activity in vivo.

    1. Author Response

      Reviewer #2 (Public Review):

      This paper by Angueyra, et al., adds to the field’s current understanding of photoreceptor specification and factors regulating opsin expression in vertebrates. Current models of specification of vertebrate photoreceptors are largely based on studies of mammals. However, a great number of animals including teleosts express a wider array of photoreceptor subtypes. Zebrafish for example have 4 distinct cone subtypes and rods. The approach is sound and the data are quite convincing. The only minor weaknesses are that the statistical analyses need to be revisited and the discussion should be a bit more focused.

      To identify differentially expressed transcription factors, the authors performed bulk RNA-seq of pooled, hand-sorted photoreceptors. The selection criterion was tightly controlled to limit unhealthy cells and cellular debris from other photoreceptor subtypes. The pooling of cells provided a considerable depth of sequencing, orders of magnitude better than scSeq. The authors identified known transcription factors and several that appear to be novel or whose role has not been determined. The data are made available on the PI's website, as is a program to access and compare the gene expression data.

      The authors then used CRISPR/Cas9 gene targeting of two known and several novel factors identified in their analysis for effects on cell fate decisions and opsin expression. Phenotyping performed on the injected larvae is possible, and the target genes were applied and sequenced to demonstrate the efficiency of the gene targeting. Targeting of 2 genes with known functions in photoreceptor specification in zebrafish, Tbx2b and Foxq2, resulted in the anticipated changes in cell fate, although the strength of the alterations in cell fate in the F0 larvae appears to be less than the published phenotypes for the inherited alleles. Interestingly, the authors also identified the expression of an RH2 opsin in the SWS2 cone, another cone type. The changes are subtle but important.

      The authors then targeted tbx2a, the function of which was not known. The result is quite interesting as it matches the increase of rods and decrease of UV cones observed in tbx2b mutants. However, the injected animals also showed RH2 opsin expression but are now in the LWS cone subtype. These data suggest that Tbx2 transcription factors repress misexpression of opsins in the wrong cell type.

      The authors also show that targeting additional differentially expressed factors does not affect photoreceptor fate or survival in the time frame investigated. These are important data to present. For these or any of the other targeted genes above, did the authors test for changes in photoreceptor number or survival?

      We have attempted to address this point, but the answer is not clear cut. We used activated caspase-3 immunolabeling as a marker of apoptosis (Lusk and Kwan 2022). At 5 dpf, the age we chose to make quantifications, we don’t see an increase in activated caspase-3 positive cells when we compare control and tbx2a F0 mutants (Reviewer Figure 1A-B). Labeled cells are very rare and located near the ciliary marginal zone irrespective of genotype. This suggests that there is no detectable active death at this late stage of development in tbx2 F0 mutants. Earlier in development, at 3 dpf, when photoreceptor subtypes first appear, there is also a normal wave of apoptosis in the retina (Blume et al. 2020; Biehlmaier, Neuhauss, and Kohler 2001), resulting in many cells positive for activated caspase-3; our preliminary quantifications don’t show a marked increase in the number of labeled cells in tbx2a F0 mutants, but we consider it likely that subtle effects might be obscured by the physiological wave of apoptosis (Reviewer Figure 1C-D).

      Reviewer Figure 1 - Assessment of apoptosis in tbx2a F0 mutants. (A-B) Confocal images of 5 dpf larval eyes of control (A and A’) and tbx2a F0 mutants (B and B’) counterstained with DAPI (grey) and immunolabeled against activated Caspase 3 (yellow) show sparse and dim labeling, restricted to cells located in the ciliary marginal zone, without clear differences between groups. (C-D) Confocal images of 3 dpf larval eyes of control (C and C’) and tbx2a F0 mutants (D and D’) immunolabeled against activated Caspase 3 show many positive cells, located in all retinal layers, as expected from physiological apoptosis at this stage of development and without clear differences between groups.

      Furthermore, the additional single-cell RNA-seq datasets we have reanalyzed suggest that tbx2a and tbx2b are expressed by other retinal neurons and progenitors and not just photoreceptors (Reviewer Figure 2), further confounding attempts at the quantification of apoptosis specifically in photoreceptor progenitors.

      Reviewer Figure 2 – Expression of tbx2 paralogues across retinal cell types. The transcription factors tbx2a and tbx2b are expressed by many retinal cells. Plots show average counts across clusters in RNA-seq data obtained by Hoang et al. (2020).

      At this stage, we consider that fully resolving this issue is important and will require considerably more work, which we will pursue in the future using full germline mutants and live-imaging experiments.

      Reviewer #3 (Public Review):

      Angueyra et al. tried to establish the method to identify key factors regulating fate decisions in the retinal visual photoreceptor cells by combining transcriptomic and fast genome editing approaches. First, they isolated and pooled five subtypes of photoreceptor cells from the transgenic lines in each of which a specific subtype of photoreceptor cells are labeled by fluorescence protein, and then subjected them to RNA-seq analyses. Second, by comparing the transcriptome data, they extracted the list of the transcription factor genes enriched in the pooled samples. Third, they applied CRISPR-based F0 knockout to functionally identify transcription factor genes involved in cell fate decisions of photoreceptor subtypes. To benchmark this approach, they initially targeted foxq2 and nr2e3 genes, which have been previously shown to regulate S-opsin expression and S-cone cell fate (foxq2) and to regulate rhodopsin expression and rod fate (nr2e3). They then targeted other transcription factor genes in the candidate list and found that tbx2a and tbx2b are independently required for UV-cone specification. They also found that tbx2a expressed in the L-cone subtype and tbx2b expressed in L-cones inhibit M-opsin gene expression in the respective cone subtypes. From these data, the authors concluded that the transcription factors Tbx2a and Tbx2b play a central role in controlling the identity of all photoreceptor subtypes within the retina.

      Overall, the contents of this manuscript are well organized and technically sound. The authors presented convincing data, and carefully analyzed and interpreted them. It includes an evaluation of the presented data on cell-type specific transcriptome by comparing it with previously published ones. I think the current transcriptomic data will be a valuable platform to identify the genes regulating cell-type specific functions, especially in combination with the fast CRISPR-based in vivo screening methods provided here. I hope that the following points would be helpful for the authors to improve the manuscript appropriately.

      1) The manuscript uses the word “FØ” quite often without any proper definition. I wonder how “Ø” should be pronounced - zero or phi? This word is not common and has not been used in previous publications. I feel the phrase “F0 knockout,” which was used in the paper cited by the authors (Kroll et al 2021), is more straightforward. If it is to be used in the manuscript, please define “FØ” and “CRISPR-FØ screening” appropriately, especially in the abstract.

      We have made changes to replace “FØ” with “F0.” In our other citation (Hoshijima et al., 2019), “F0 embryo” was used throughout the paper. Following our references and Dr Kojima’s suggestion, we adopted “F0 mutant larva” as the most straightforward and least confusing term. We have also made changes in the abstract to define our approach more clearly and made appropriate changes throughout the manuscript.

      2) Figure 1-supplement 1 shows that opn1mw4 has quite high (normalized) FPKM in one of the S-cone samples in contrast to the least (or no) expression in the M-cone samples, in which opn1mw4 is expected to be detected. The authors should address a possible origin of this inconsistent result for opn1mw4 expression as well as a technical limitation of using the Tg(opn1mw2:egfp) line for detection of opn1mw4 expression in the GFP-positive cells.

      In Figure 1 - Supplement 1, we had attempted to provide a summarized figure of all phototransduction genes, but the big differences in expression levels (in particular, the high expression of opsin genes) forced us to use gene-by-gene normalization for display. Without normalization, the expression of opn1mw4 is very low across all samples, and its detection in that sole S-cone sample can likely be attributed to some degree of inherent noise in our methods. We have revised Figure 1 - Supplement 1: we find that we can avoid gene-by-gene normalization and still provide a good summary of the expression of phototransduction genes if the heatmap is broken down by gene families, which have more similar expression levels. In addition, we have added caveats to the use of the Tg(opn1mw2:egfp) line as our sole M-cone marker in the results section describing our RNA-seq approach, including our inability to provide data on Opn1mw4-expressing M cones.

      3) The manuscript lacks a description of the sampling time point. It is well known that many genes are expressed with daily (or circadian) fluctuation (cf. Doherty & Kay, 2010 Annu. Rev. Genet.). For example, the cone-specific gene list in Fig.2C includes a circadian clock gene, per3, whose expression was reported to fluctuate in a circadian manner in many tissues of zebrafish including the retina (Kaneko et al. 2006 PNAS). It appears to be cone-specific at this time point of sample collection as shown in Fig.2, but might be expressed in a different pattern at other time points (eg, rod expression). The authors should add, at least, a clear description of the sampling time points so as to make their data more informative.

      We have included this information in the materials and methods. We collected all our samples during the most active peak of the zebrafish circadian rhythm between 11am and 2pm (3h to 6h after light onset) to avoid the influence of circadian fluctuations in our analysis.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors use a newly developed object-space memory task consisting of a "Stable" version and an "Overlapping" version, where two objects are presented in two locations per trial in a square open field. Each version consists of 5 training trials of 5-min presentations of an object-space configuration, with both object locations staying constant across training trials in the Stable condition, and only one object location staying fixed in the Overlapping condition. Memory is tested in a test trial 24 hours later where the opposite configuration is presented - overlapping configuration presented for the Stable condition and stable configuration presented for the Overlapping condition - with the thesis that memory in this test trial for the Overlapping condition will depend on the accumulated memory of spatial patterns over the training trials, whereas memory for the test trial in the Stable condition can be due to episodic memory of the last trial or accumulated memory. Memory is quantified using a Discrimination Index (DI), comparing the amount of time animals spend exploring the two object locations.

      Here, animals in other groups are also presented with an interference trial equivalent to the test trial, to test if the memory of the Overlapping condition can be disrupted. The behavioral data show that for RGS14 over-expressing animals, memory in the Overlapping condition is diminished compared to controls with no interference or controls where over-expression is inhibited, whereas memory in the Stable condition is enhanced. This is interpreted as interference in semantic-like memory formation, whereas one-shot episodic memory is improved. The authors speculate that increased cortical plasticity should lead to increased and larger delta waves according to the sleep homeostasis hypothesis, and observe that instead increased cortical plasticity leads to less non-REM sleep and smaller delta waves, with more prefrontal neurons with slower firing rates (presumably more plastic neurons). They further report increased hippocampal-cortical theta coherence during task and REM sleep, increased NonREM oscillatory coupling, and changes in hippocampal ripples in RGS14 over-expressing animals.

      While these results are interesting, there are several issues that need to be addressed, and the link between physiology and behavioral results is unclear.

      1) The behavioral results rely on the interpretation that the Overlapping condition corresponds to semantic-like memory and the Stable condition corresponds to episodic-like memory. While the dissociation in memory performance due to interference seen in these two conditions is intriguing, the Stable condition can correspond not just to the memory of the previous trial but also accumulated memory of a stable spatial pattern over the 5 testing trials, similar to accumulated memory of a changing spatial pattern in the Overlapping pattern.

      Yes! We completely agree on this. We do not claim that the Stable condition corresponds to episodic-like memory; instead, we refer to it as simple memory, since it can be solved either way (one-trial memory or cumulative memory). We have now expanded on this in the discussion to make it clearer.

      Here, it is puzzling that in the behavioral control with no interference (Figure 1D), memory in the Stable and Overlapping condition is unchanged in the test trial, with the DI statistically at 0 in the test trial. In the original description of the Object Space task by the authors in the referenced paper, the measure of memory was a Discrimination Index significantly higher than 0 in both the Stable and Overlapping conditions. This discrepancy needs to be reconciled. Is the DI for the interference trial shown in Fig. S1 significantly different than 0? No statistics or description is provided in the figure legend here.

      As mentioned above, we apologize that we oversimplified the description. The 24h interference trial corresponds to the original test trial. We added a clarifying figure for comparison in S1 (a bar graph in addition to the violin plot) together with statistics. Performance was above chance for all groups and conditions, replicating our previous results.

      2) The physiology experiments compare Home cage (HC) conditions to the Object Space task (OS) throughout the manuscript. While some differences are seen in the control and RGS14 over-expressing animals, there is no comparison of the Stable vs. Overlapping condition in the physiology experiments. This precludes making explicit links between physiological observations and behavioral effects.

      As also mentioned above, we have now added analysis exploring the detailed OS conditions. We would like to thank the reviewers for giving us the opportunity of doing so.

      3) The authors speculate that learning will result in larger and more delta waves as per the synaptic homeostasis hypothesis. It should be noted here that an alternative hypothesis is that there should also be a selective increase in synaptic plasticity for learning and consolidation. The authors do observe that control animals show more frequent and higher-amplitude delta waves, but rather than enhancing this process, RGS14 animals with increased plasticity show the opposite effect. How can this be reconciled and linked with the behavioral data in the Stable and Overlapping condition?

      In the context of the Object Space Task, we would expect all behavioural conditions (Stable and Overlapping) to induce synaptic changes, since learning also occurs in the Stable condition (see also performance on the 24h trial). Thus, we would expect homeostatic responses in particular, such as an increase in delta amplitude, for all experiences, regardless of whether subtle statistical rules are presented. In contrast, detailed processing that extracts underlying regularities is proposed by the Sleep for Active Systems Consolidation Hypothesis to occur during hippocampal-cortical interactions in the form of delta/ripple/spindle interactions (with different theories emphasising different types of interactions). As mentioned above, we have now added a more specific analysis in this regard, in which we show that the two OS conditions that involve moving objects (where statistical regularities can thus potentially be extracted) show a higher percentage of ripples occurring after large slow oscillations in comparison to home cage or the simple learning condition Stable. In contrast, RGS14 already has higher participation in both control conditions, emphasising that in these animals all experiences are treated by the brain as significant learning conditions, explaining the behavioural effect (increased interference due to better memory for the interference). Further, we expanded in the discussion how in RGS we sometimes see an enhancement of learning effects but sometimes see a more complex interaction of what we would expect from physiological learning.

      Similarly, there is an increase in slower-firing neurons in RGS14 over-expressing animals. Slower-firing neurons have been proposed to be more plastic in the hippocampus based on their participation in learned hippocampal sequences, but appropriate references or data are needed to support the assertion that slower-firing neurons in the prefrontal cortex are more plastic.

      As described above, we have expanded the discussion to include other citations that also consider the cortex. We can show that our changes would be expected if one makes the cortex as plastic as the hippocampus.

      4) It is noted that changing cortical plasticity influences hippocampal-cortical coupling and hippocampal ripples, suggesting a cortical influence on hippocampal physiological patterns. It has been previously shown that disrupting prefrontal cortical activity does alter hippocampal ripples and hippocampal theta sequences (Schmidt et al., 2019; Schmidt and Redish, 2021). The current results should be discussed in this context.

      We would like to thank the reviewer for these suggestions; they are now incorporated in the manuscript.

      Reviewer #2 (Public Review):

      In this paper, the authors provide evidence to support the longstanding proposition that a dual-learning system/systems-level consolidation (hippocampus attains memories at a fast pace which are eventually transmitted to the slow-learning neocortex) allows rapid acquisition of new memories while protecting pre-existing memories. The authors leverage many techniques (behavior, pharmacology, electrophysiology, modelling) and report a host of behavioral and electrophysiological changes on induction of increased medial prefrontal cortex (mPFC) plasticity which are interesting and will be of significant interest to the broad readership.

      The experimental design and analyses are convincing (barring some instances which are discussed below). The following recommendations will bolster the strength/quality of the manuscript:

      1) Certain concerns regarding the interpretation and analysis of the behavioral data remain. The authors need to clarify if increased mPFC plasticity leads to only an increase in one-shot memory or 'also' interference of previous information. It seems that the behavioral results could also be explained by the more parsimonious explanation that one-shot memory is improved. Do the current controls tease apart these two scenarios?

      We agree that we cannot disentangle whether one memory is just stronger than the other or whether it is an overwriting effect. We have now added this to the discussion. Of note, we do not think it would actually be possible to distinguish these two effects behaviourally in rodents, or at least we cannot think of a fitting study design that would enable the contrast.

      Additionally, the authors need to clarify why the 'no trial' and 'anisomycin' controls for the stable task perform at chance levels on exposure to a new object-place association on test day (Fig 1D).

      Violin plots are sometimes hard to read. Here are simple bar plots, in which one can see that the animals are not at chance at the 72h test in the control conditions.

      Finally, further description of how the discrimination index (exploration time of novel minus exploration time of familiar, divided by the sum of both) is computed is recommended, i.e., in the stable condition, which 'object' is chosen as 'novel' (as both are in the same locations) for computing the index (Fig 1)? Do negative DI values imply a neophobia to novel objects (and thus are a form of memory)? This is also crucial because the modelling results (Fig 1E) use both neophilia and neophobia, while negative discrimination indexes are considered similar to 0 for interpreting the behavioral results, as stated on page 3, lines 84-86.

      We have now added this to the methods (for Overlapping it is moved location – stable location; for Stable it is location-to-be-moved-at-test – stable location; for Random, which location is assigned as moved and which as stable is random; each is then divided by total time). We agree that neophilia/neophobia (especially changes in the distribution) can be an issue and have discussed it in detail in Schut et al NLM 2020, where we see differences in absolute beta values (thus controlling for philia/phobia differences). There we also discuss in more detail why it is difficult to control for this in the DI. In short, one could use absolute values, but then it is difficult to determine what a group chance level would look like. Fortunately, this is not an issue here, since we did not observe differences in neophilic or neophobic tendencies while running the experiments. Critically, the interference trial (which can also function as a simple test trial) confirms that as a group the animals show positive DI and neophilia.
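As a minimal sketch of the index described in the exchange above, assuming exploration times are given in seconds for the two object locations: the function and argument names are hypothetical, and raising an error when there was no exploration is an illustrative choice rather than the authors' stated procedure.

```python
def discrimination_index(time_moved: float, time_stable: float) -> float:
    """Discrimination Index: (moved - stable) / (moved + stable).

    time_moved  -- exploration time at the moved location (or, for the Stable
                   condition, the location to be moved at test)
    time_stable -- exploration time at the location that stays fixed
    Values range from -1 to 1, and 0 means no preference.
    """
    if time_moved < 0 or time_stable < 0:
        raise ValueError("exploration times must be non-negative")
    total = time_moved + time_stable
    if total == 0:
        # Undefined without any exploration; how such trials are handled
        # is an assumption here.
        raise ValueError("no exploration recorded")
    return (time_moved - time_stable) / total


# Hypothetical trial: 30 s at the moved location, 10 s at the stable one.
example_di = discrimination_index(30.0, 10.0)
```

A positive value indicates a preference for the moved location (neophilia), and a negative value indicates a preference for the stable one, which is the ambiguity the reviewer raises about interpreting negative DI.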

      2) The authors report lower firing rates in RGS14414 animals during the task in Fig 2F. It is indeed remarkable how large the reported differences are. The authors need to rule out any differences in the behavioral state of the animals in the two groups during the task, i.e., rest vs. active exploration/movement dynamics. Are only epochs during the task while the animals interact with the objects used for computing the firing rates (same epochs as Fig 1)? If not, doing so will provide a useful comparison with Fig 1. Additionally, although the authors make the case for slow firing rate neurons being important for plasticity (based on Grosmark and Buzsaki, 2016), it is crucial to note that the firing rate dynamic (slow vs. fast) in that study for the hippocampus is defined based on the whole recorded session (predominated by sleep), indeed the firing rates of the two groups (slow vs. fast/plastic vs. rigid) during the task/maze-running do not differ in that study. Therefore, the results here seem incongruent with the Grosmark and Buzsaki paper. Since this finding is central to the main claim of the authors, it either warrants further investigation or a re-interpretation of their results.

      As mentioned in the main points, we have now added the firing rate analysis (including new group splits) for wake in the sleep box, NREM, and REM separately. The same results are obtained each time. We do not yet have the tracking and video synchronization set-up; therefore, we cannot split the task by specific behaviours.

      However, we now also cite Buzsaki’s original log-normal brain review, where he first proposed the idea. There he also shows the same effects as we do, in that the general firing rate distribution is the same for task and different sleep stages, just shifted overall. The analysis from Grosmark included a more stringent subselection of neurons, in order to also argue that incorporation into run/replay sequences could not have been biased by firing rate per se (instead of plasticity). However, the original proposition from Buzsaki does fit our results. He further presents hippocampus vs cortex firing rates, which also confirm the idea (the hippocampus is more plastic and has slower firing rates). We included this figure above in the general comments. Further, we have now expanded the discussion on this point.

      3) A concern remains as to how many of the electrophysiological changes they observe (firing rate differences, LFP differences including coupling, sleep state differences, Figs. 2-4) support their main hypothesis or are a by-product of injection of RGS14414 (for instance, one might argue that an increased 'capability' to learn new information/more plasticity might lead to more NREM sleep for consolidation, etc.). The authors need to carefully interpret all their data in light of their main hypothesis, which will substantially improve the quality/strength of the manuscript.

      We now expanded the discussion, included more structure and also include that we cannot disentangle if the cellular changes or sleep oscillation changes or an interaction of both is the cause of the result. Furthermore, we added that we cannot distinguish if the interference memory is stronger or actually overwrites the original training memory.

      Reviewer #3 (Public Review):

      The authors set out to test the idea that memories involve a fast process (for the acquisition of new information) and a slow process (where these memories are progressively transferred/integrated into longer-term storage). The former process involves the hippocampus and the latter the cerebral cortex. This 'dual-learning' system theoretically allows for new learning without causing interference in the consolidation of older memories. They test this idea by artificially increasing plasticity in the pre-limbic cortex and measuring changes in different learning/memory tasks. They also examined electrophysiological changes in sleep, as sleep is linked to memory formation and synaptic plasticity.

      The strengths of the study include a) meticulous analyses of a variety of electrophysiological measurements b) a combination of neurobiological and computational tools c) a largely comprehensive analysis of sleep-based changes. Some weaknesses include questions about the technique for increasing cortical plasticity (is this physiological?) and the absence of some additional experiments that would strengthen the conclusions. However, overall, the findings appear to support the general idea under examination.

      This study is likely to be very impactful as it provides some really new information about these important neural processes, as well as data that challenges popular ideas about sleep and synaptic plasticity.

      We would like to thank the reviewer for these positive comments. Answers to the weaknesses are presented below in the recommendations for the authors.

    1. Author Response

      Reviewer 1 (Public Review):

      To me, the strengths of the paper are predominantly in the experimental work, there's a huge amount of data generated through mutagenesis, screening, and DMS. This is likely to constitute a valuable dataset for future work.

      We are grateful to the reviewer for their generous comment.

      Scientifically, I think what is perhaps missing, and I don't want this to be misconstrued as a request for additional work, is a deeper analysis of the structural and dynamic molecular basis for the observations. In some ways, the ML is used to replace this and I think it doesn't do as good a job. It is clear for example that there are common mechanisms underpinning the allostery between these proteins, but they are left hanging to some degree. It should be possible to work out what these are with further biophysical analysis…. Actually testing that hypothesis experimentally/computationally would be nice (rather than relying on inference from ML).

      We agree with the reviewer that this study should motivate a deeper biophysical analysis of molecular mechanisms. However, in our view, the ML portion of our work was not intended as a replacement for mechanistic analysis, nor could it serve as one. We treated ML as a hypothesis-generating tool. We hypothesized that distant homologs are likely to have similar allosteric mechanisms which may not be evident from visual analysis of DMS maps. We used ML to (a) extract underlying similarities between homologs (b) make cross predictions across homologs. In fact, the chief conclusion of our work is that while common patterns exist across homologs, the molecular details differ. ML provides tantalizing evidence to this effect. The conclusive evidence will require, as the reviewer rightly suggests, detailed experimental or molecular dynamics characterization. Along this line, we note that we have recently reported our atomistic MD analysis of allostery hotspots in TetR (JACS, 2022, 144, 10870). See ref. 41.

      Changes to manuscript: “Detailed biophysical or molecular dynamics characterization will be required to further validate our conclusions(38).”

      Reviewer 3 (Public Review):

      However - at least in the manuscript's present form - the paper suffers from key conceptual difficulties and a lack of rigor in data analysis that substantially limits one's confidence in the authors' interpretations.

      We hope the responses below address and allay the reviewer’s concerns.

      A key conceptual challenge shaping the interpretation of this work lies in the definition of allostery, and allosteric hotspot. The authors define allosteric mutations as those that abrogate the response of a given aTF to a small molecule effector (inducer). Thus, the results focus on mutations that are "allosterically dead". However, this assay would seem to miss other types of allosteric mutations: for example, mutations that enhance the allosteric response to ligand would not be captured, and neither would mutations that more subtly tune the dynamic range between uninduced ("off") and induced ("on") states (without wholesale breaking the observed allostery). Prior work has even indicated the presence of TetR mutations that reverse the activity of the effector, causing it to act as a co-repressor rather than an inducer (Scholz et al (2004) PMID: 15255892). Because the work focuses only on allosterically dead mutations, it is unclear how the outcome of the experiments would change if a broader (and in our view more complete) definition of allostery were considered.

      We agree with the reviewer that mutations that impact allostery manifest in many different ways. Furthermore, the effect size of these mutations runs the full gamut from subtle changes in dynamic range to drastic reversal of function. To unpack allostery further, allostery of aTF can be described, not just by the dynamic range, but by the actual basal and induced expression levels of the reporter, EC50 and Hill coefficient. Given the systemic nature of allostery, a substantial fraction of aTF mutations may have some subtle impact on one or more of these metrics. To take the reviewer’s argument one step further, one would have to accurately quantify the effect size of every single amino acid mutation on all the above properties to have a comprehensive sequence-function landscape of allostery. Needless to say, this is extremely hard! Resolution of small effect sizes is very difficult, even at high sequencing depth. To the best of our knowledge, a heroic effort approaching such comprehensive analysis has been accomplished so far only once (PMID: 3491352).

      Our focus, therefore, was to screen for the strongest phenotypic impact on allostery i.e., loss of function. Mutations leading to loss of function can be relatively easily identified by cell-sorting. Because our goal was to compare hotspots across homologs, we surmised that loss of function mutations, given their strong phenotypic impact, are likely to provide the clearest evidence of whether allosteric hotspots are conserved across remote homologs.

      The reviewer raised the point of activity-reversing mutations. Yes, there are activity-reversing mutations in TetR. However, they represent a very small fraction: in the paper cited by the reviewer, there are 15 activity-reversing mutations among 4000 screened. Furthermore, the paper shows that activity reversal in TetR requires two to four mutations, while our library consists exclusively of single amino acid substitutions. For these reasons, we did not screen for activity-reversing mutations. Nonetheless, we agree with the reviewer that screening for activity-reversing mutations across homologs would be very interesting.

      The separation in fluorescence between the uninduced and induced states (the assay dynamic range, or fold induction) varies substantially amongst the four aTF homologs. Most concerningly, the fluorescence distributions for the uninduced and induced populations of the RolR single mutant library overlap almost completely (Figure 1, supplement 1), making it unclear if the authors can truly detect meaningful variation in regulation for this homolog.

      Yes, the reviewer is correct that the fold induction varies among the four aTF homologs. However, we note that such differences are common among natural aTFs. Depending on the native downstream gene regulated by the aTF, some aTFs show higher ligand-induced activation and others lower. While this is not a hard and fast rule, aTFs that regulate efflux pumps tend to have higher fold induction than those that regulate metabolic enzymes. In summary, the variation in fold induction among the four aTFs is neither a flaw in experimental design nor a sign of experimental inconsistency; it is an inherent property of the protein-DNA interaction strength and the allosteric response of each aTF.

      Among the four aTFs, wildtype RolR has the weakest fold induction (15-fold), which makes sorting the RolR library particularly challenging. To minimize false positives as much as possible, we require that a dead mutant be present in (a) non-fluorescent cells after ligand induction, (b) non-fluorescent cells before ligand induction, and (c) at least two out of the three replicates for both sorts. Additionally, for RolR specifically, we adjusted the non-fluorescent gate to be far more stringent than for the other three aTFs (Fig. 1 – figure supplement 1). Furthermore, we assign residues as allosteric hotspots, not individual dead mutations. This buffers against false strong signals from stray individual dead mutations. Finally, the top interquartile range winnows them to residues showing a strong, consistent dead phenotype. As a result of these “safeguards” we have built in, the number of allosteric hotspots of RolR (57) is comparable to the other three aTFs (51, 53 and 48). This suggests that we are not overestimating the number of hotspots despite the weaker fold induction of RolR. We highlight in a new supplementary figure (Figure 1 – figure supplement 4) that changing the read count threshold from 5X to 10X produces near-identical patterns of mutations, suggesting that our results are also robust to changes in read depth stringency.
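
      The dead-variant rule described above can be sketched in a few lines. This is a minimal illustration, not our analysis pipeline: the function names, the variant labels, and the read counts are hypothetical, and it assumes one reading of the rule, namely that a replicate counts only when the variant passes the read threshold in the non-fluorescent gate both without and with inducer.

```python
def replicates_dead(reads_minus, reads_plus, min_reads=5):
    """Count replicates in which a variant reaches the read threshold in the
    non-fluorescent gate both without (-) and with (+) inducer."""
    return sum(
        1
        for minus, plus in zip(reads_minus, reads_plus)
        if minus >= min_reads and plus >= min_reads
    )


def is_dead(reads_minus, reads_plus, min_reads=5, min_replicates=2):
    """Binary call: dead if enough replicates pass in both conditions."""
    return replicates_dead(reads_minus, reads_plus, min_reads) >= min_replicates


# Hypothetical read counts per replicate: (reads without inducer, reads with inducer).
example_counts = {
    "L17A": ([12, 9, 0], [8, 7, 1]),
    "L17G": ([6, 0, 0], [5, 2, 0]),
    "K98D": ([30, 22, 18], [25, 19, 21]),
}

dead_calls = {
    variant: is_dead(minus, plus)
    for variant, (minus, plus) in example_counts.items()
}
```

      Raising `min_reads` from 5 to 10 in this sketch mirrors the threshold comparison in Figure 1 – figure supplement 4: a variant supported by only a handful of reads drops out, while a well-supported one is unaffected.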

      Changes to manuscript: In response to the reviewer's comment, we have added the following sentence.

      “We note that the lower fold induction (dynamic range) of RolR makes it particularly challenging to separate the dead variants from the rest.”

      The methods state that "variants with at least 5 reads in both the presence and absence of ligand in at least two replicates were identified as dead". However, the use of a single threshold (5 reads) to define allosterically dead mutations across all mutations in all four homologs overlooks several important factors:

      Depending on the starting number of reads for a given mutation in the population (which may differ in orders of magnitude), the observation of 5 reads in the gated nonfluorescent region might be highly significant, or not significant at all. Often this is handled by considering a relative enrichment (say in the induced vs uninduced population) rather than a flat threshold across all variants.

      We regret the lack of clarity in our presentation. We wish to better explain the rationale behind our approach. First, we understand the reviewer’s point on considering relative enrichment to define a threshold. This approach works well in DMS experiments involving genetic selections, which is commonly the case, because activity scales well with selection stringency. One can then pick enrichment/depletion relative to the middle of the read count distribution as a measure of gain or loss of function.

      Second, this strategy does not, in practice, work well for cell-sorting screens. While it may be tempting to think of cell sorting as being as activity-scaled as genetic selections, in reality, the fidelity of fluorescence-activated cell sorters is much lower. Making quantitative claims of activity based on cell sorting enrichment can be risky. It is wiser to treat cell sorting results as a yes/no binary, i.e., does the mutation disrupt allostery or not. More importantly, the yes/no binary classification suffices for our need to identify whether a certain mutation adversely impacts allosteric activity.

      Third, the above argument does not imply that all mutations have the same effect size on allostery. They don’t. We capture the effect size on individual residues, not individual mutations, by counting the number of dead mutations at a residue position. This is an important consideration because it safeguards us from minor inconsistencies that inevitably arise from cell sorting.

      Fourth, for a variant to be classified as allosterically dead, it must be present in both the uninduced and induced DNA-bound populations in at least two out of three replicates (four conditions total). This is a stringent criterion for selecting dead variants, resulting in highly consistent regions of importance in the protein even upon varying read count thresholds. To the extent possible, we have minimized the possibility of false positive bleed-through.

      Finally, two separate normalizations were performed on the total sequence reads so that a common read count threshold could be drawn (1) between experimental conditions and replicates and (2) across proteins. First, total sequencing reads were normalized to 200k total across all sample conditions (presorted, -inducer, and +inducer) and replicates for each homolog, allowing comparisons within a single protein. Next, reads were normalized again to account for differences in the theoretical size of each protein’s single-mutant library, allowing for comparisons across proteins with a common read count cutoff. For example, total sequencing reads of RolR (4,332 possible mutants) were increased by 1.18x relative to MphR (3,667 possible mutants) for a total of 236k reads.
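
      The arithmetic of the two normalization steps can be illustrated as follows. This is a sketch under stated assumptions: the function names and the three-variant sample are hypothetical, and MphR is taken as the reference library only because it is the reference in the example above.

```python
def scale_to_depth(raw_counts, target_total=200_000):
    """Step 1: rescale one sample so that its read counts sum to target_total."""
    total = sum(raw_counts.values())
    return {variant: count * target_total / total for variant, count in raw_counts.items()}


def library_size_factor(n_variants, reference_n_variants):
    """Step 2: scale factor for a library relative to a reference library size."""
    return n_variants / reference_n_variants


# Library sizes quoted in the text; MphR as reference is an assumption of this sketch.
rolr_factor = library_size_factor(4332, 3667)
rolr_target_depth = 200_000 * rolr_factor  # roughly 236k reads

# Hypothetical raw counts for a tiny three-variant sample.
raw_sample = {"A10C": 150, "A10D": 50, "A10E": 300}
scaled_sample = scale_to_depth(raw_sample)
```

      After both steps, a cutoff of 5 reads corresponds to the same per-variant sequencing effort in every protein, which is what allows a single threshold to be applied across the four libraries.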

      Changes to manuscript: We have provided substantial additional details in the Fluorescence-activated cell sorting and NGS preparation and analysis sections.

      We also added the following in the main text.

      “In other words, we use cell sorting as a binary classifier i.e., does the mutation disrupt allostery or not. We capture the effect size on individual residues, not individual mutations, by counting the number of dead mutations at a residue position. This is an important consideration because it safeguards us from minor inconsistencies that inevitably arise from cell sorting.”

      Depending on the noise in the data (as captured in the nucleotide-specific q-scores) and the number of nucleotides changed relative to the WT (anywhere between 1-3 for a given amino acid mutation) one might have more or less chance of observing five reads for a given mutation simply due to sequencing noise.

      All the reads considered in our analyses pass the Illumina quality threshold of Q-score ≥ 30, which as per Illumina represents “perfect reads with no errors or ambiguities”. This corresponds to a 1 in 1000 probability of an incorrect base call, or 99.9% base call accuracy.
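
      For reference, the conversion from Q-score to error probability follows the standard Phred definition, Q = -10 log10(P). The short sketch below applies it; the function names are ours and chosen only for illustration.

```python
import math


def phred_to_error_probability(q):
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def expected_miscalled_bases(q, n_bases):
    """Expected number of wrong calls among n_bases bases that all have quality q."""
    return n_bases * phred_to_error_probability(q)


p_q30 = phred_to_error_probability(30)  # 1 in 1000
accuracy_q30 = 1 - p_q30                # 99.9%
```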

      We use chip-based oligonucleotides to build our DMS library, which allows us to pre-specify the exact codon that encodes a point mutation. This means the nucleotide count and protein count are the same. The scenario referred to by the reviewer, i.e., “anywhere between 1-3 for a given amino acid mutation”, only applies to codon-randomized or error-prone PCR library generation. We regret if the chip-based library assembly part was unclear.

      Depending on the shape and separation of the induced (fluorescent) and uninduced (non-fluorescent) population distributions, one might have more or less chance of observing five reads by chance in the gated non-fluorescent region. The current single threshold does not account for variation in the dynamic range of the assay across homologs.

      We have addressed the concern raised by the reviewer on fluorescent population distributions in answers to questions 10 and 11.

      The reviewer makes an important point about the choice of sequencing threshold. We use the sequencing threshold simply to make a binary choice as to whether a certain variant exists in the sorted population or not. We do not use the sequencing reads to scale the activity of the variant. To address the reviewer's comment, we have included a new supplementary figure (Fig 1 – figure supplement 4) where we compare the data at two threshold levels, 5 and 10 reads. As is evident in the new figure, the fundamental pattern of allosteric hotspots and the overall data interpretation do not change.

      TetR: 5x – 53 hotspots, 10x – 51 hotspots

      TtgR: 5x – 51 hotspots, 10x – 51 hotspots

      MphR: 5x – 48 hotspots, 10x – 48 hotspots

      RolR: 5x – 57 hotspots, 10x – 60 hotspots

      In other words, changing the threshold to be more or less strict may have a modest impact on the overall number of hotspots in the dataset. Still, the regions of functional importance are consistent across different thresholds. We have expanded the discussion in the manuscript to address this point.

      Changes to manuscript: We have now included a new supplementary figure comparing hotspot data at two thresholds: Figure 1 – figure supplement 4.

      We also added the following in the main text.

      “To assess the robustness of our classification of hotspots, we determined the number of hotspots at two different sequencing thresholds – 5x and 10x. At 5x and 10x, the number of hotspots are – TetR: 53, 51; TtgR: 51, 51; MphR: 48, 48 and RolR: 57,60, respectively. Changing the threshold has a modest impact on the overall number of hotspots and the regions of functional importance are consistent at both thresholds”

      The authors provide a brief written description of the "weighted score" used to define allosteric hotspots (see y-axis for figure 1B), but without an equation, it is not clear what was calculated. Nonetheless, understanding this weighted score seems central to their definition of allosteric hotspots.

      We regret the lack of clarity in our presentation. The weighted score was used to quantify the “deadness” of every residue position in the protein. At each position in the protein, the number of mutations that inhibited activity was summed up, and the “deadness” of each mutation was weighted based on the number of replicates in which it appeared to inactivate the protein. Weighted score at each residue position is given by

      Where at position x in the protein, D1 is the number of mutations dead in one replicate only, D2 is the number of mutations dead in 2 replicates, D3 is the number of mutations dead in 3 replicates, and Total is the total number of variants present in the data set (based on sequencing data). Any dead mutation that is seen in only one replicate is discarded and does not contribute to the “deadness” of the residue. Mutations seen in two and three replicates contribute to the score. We have included a new supplementary figure (Fig. 1 – figure supplement 2) to give the reader a detailed heatmap of all mutations and their impact for each protein.
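
      The structure of this score can be sketched in code. The equation itself is not reproduced in this response, so the sketch below only follows the verbal description: mutations dead in one replicate contribute nothing, mutations dead in two or three replicates contribute with a replicate-dependent weight, and the sum is divided by Total. The weights `w2` and `w3` are placeholders assumed for illustration, not the values used in the manuscript, and the function name is hypothetical.

```python
def weighted_score(d1, d2, d3, total, w2=2.0, w3=3.0):
    """Per-residue deadness score following the verbal description.

    d1, d2, d3: number of mutations at this position dead in 1, 2, or 3 replicates.
    total: number of variants at this position present in the sequencing data.
    w2, w3: ASSUMED placeholder weights for 2- and 3-replicate dead mutations.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    # d1 is accepted for completeness but discarded, as stated in the text.
    return (w2 * d2 + w3 * d3) / total


# Hypothetical position: 4 mutations dead in one replicate, 3 in two, 2 in three,
# out of 19 variants observed at this position.
example_score = weighted_score(d1=4, d2=3, d3=2, total=19)
```

      Whatever the exact weights, the design point is the same: a residue scores highly only when many of its substitutions are reproducibly dead, so a single stray dead call cannot make a hotspot.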

      Changes to manuscript: The weighted scoring scheme is now described in greater detail under Materials and Methods in the “NGS preparation and analysis” section.

      The authors do not provide some of the standard "controls" often used to assess deep mutational scanning data. For example, one might expect that synonymous mutations are not categorized as allosterically dead using their methods (because they should still respond to ligand) and that most nonsense mutations are also not allosterically dead (because they should no longer repress GFP under either condition). In general, it is not clear how the authors validated the assay/confirmed that it is giving the expected results.

      As we state in response to question 12, we use chip-based oligonucleotides to build our DMS library, which allows us to pre-specify the exact codon that encodes a point mutation. We have no synonymous or nonsense mutations in our DMS library. Each protein mutation is encoded by a single unique codon. The only stop codon is at the 3’ end of the gene.

      The authors performed three replicates of the experiment, but reproducibility across replicates and noise in the assay is not presented/discussed.

      Changes to manuscript: A new supplementary table (Table 1) is now provided with the pairwise correlation coefficients between all replicates for each protein.

      In the analysis of long-range interactions, the authors assert that "hotspot interactions are more likely to be long-range than those of non-hotspots", but this was not accompanied by a statistical test (Figure 2 - figure supplement 1).

      In response to the reviewer's comment, we now include a paired t-test comparing non-hotspots and hotspots with long-range interactions in the main text.

      Changes to manuscript: In all four aTFs, hotspots constituted a higher fraction of LRIs than non-hotspots (Figure 2 – figure supplement 1; P = 0.07).
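
      For readers unfamiliar with the test, a paired t-test over four aTFs has three degrees of freedom and can be computed with the standard library alone. The sketch below is illustrative only: the fractions are placeholder values, not data from the study, the function names are hypothetical, and the p-value helper uses the closed-form Student t CDF that is valid only for three degrees of freedom. It does not attempt to reproduce the reported P value, and it assumes a two-sided test.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Return (t, degrees of freedom) for a paired t-test of x against y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    diffs = [a - b for a, b in zip(x, y)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error, len(diffs) - 1


def two_sided_p_df3(t):
    """Two-sided p-value from the closed-form Student t CDF for exactly 3 d.f."""
    u = abs(t) / math.sqrt(3)
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 2 * (1 - cdf)


# Placeholder LRI fractions for four aTFs (NOT values from the study).
hotspot_lri_fraction = [0.42, 0.38, 0.45, 0.40]
non_hotspot_lri_fraction = [0.30, 0.33, 0.31, 0.36]

t_value, dof = paired_t_statistic(hotspot_lri_fraction, non_hotspot_lri_fraction)
p_value = two_sided_p_df3(t_value)
```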

    1. Author Response

      Reviewer #1 (Public Review):

      In this study, the authors describe an elegant genetic screen for mutants that suppress defects of MCT1 deletions which are deficient in mitochondrial fatty acid synthesis. This screen identified many genes, including that for Sit4. In addition, genes for retrograde signaling factors (Rtg1, Rtg2 and Rtg3), proteins influencing proteasomal degradation (Rpn4, Ubc4) or ribosomal proteins (Rps17A, Rps29A) were found. From this mix of components, the authors selected Sit4 for further analysis. In the first part of the study, they analyzed the effect of Sit4 in the context of MCT1 mutant suppression. This more specific part is very detailed and thorough, the experiments are well controlled and convincing. The second, more general part of the study focused on the effect of Sit4 on the level of the mitochondrial membrane potential. This part is of high general interest, but less well developed. Nevertheless, this study is very interesting as it shows for the first time that phosphate export from mitochondria is of general relevance for the membrane potential even in wild type cells (as long as they live from fermentation), that the Sit4 phosphatase is critical for this process and that the modulation of Sit4 activity influences processes relying on the membrane potential, such as the import of proteins into mitochondria. However, some aspects should be further clarified.

      1) It is not clear whether Sit4 is only relevant under fermentative conditions. Does Sit4 also influence the membrane potential in respiring cells? Fig. S2D shows the membrane potential in glucose and raffinose. Both carbon sources lead to fermentative growth. The authors should also test whether Sit4 levels influence the membrane potential when cells are grown under respiratory conditions, such as in ethanol, lactate or glycerol. Even if deletions of Sit4 affect respiration, mutants with altered activity can be easily analyzed.

      sit4Δ cells fail to grow on nonfermentable media, as shown by us (Figure 2—figure supplement 1C) and others (Arndt et al., 1989; Dimmer et al., 2002; Jablonka et al., 2006). In our opinion, the exact reason is unclear, but there is an interesting observation that addition of aspartate can partially restore growth on ethanol (Jablonka et al., 2006). Although this sit4Δ defect has not been thoroughly investigated, an early study speculated that it could be related to the cAMP-PKA pathway and pointed out genetic interactions of SIT4 with multiple genes in that pathway (Sutton et al., 1991). In addition, sit4Δ cells have phenotypes similar to those of cAMP-PKA null mutants, such as glycogen accumulation, caffeine resistance, and failure to grow on nonfermentable media (Sutton et al., 1991). Based on our literature search, we have not found sit4Δ mutants that can grow on nonfermentable media.

      2) The authors should give a name to the pathway shown in Fig. 4D. This would make it easier to follow the text in the results and the discussion. This pathway was proposed and characterized in the 90s by George Clark-Walker and others, but never carefully studied on a mechanistic level. Even if the flux through this pathway cannot be measured in this study, the regulatory role of Sit4 for this process is the most important aspect of this manuscript.

      We now refer to this mechanism as the mitochondrial ATP hydrolysis pathway.

      3) To further support their hypothesis, the authors should show that deletion of Pic1 or Atp1 wipes out the effect of a Sit4 deletion. In these petite-negative mutants, the phosphate export cycle cannot be carried out and thus, Sit4, should have no effect.

      The mitochondrial phosphate transport activity is electroneutral, as it also pumps a proton together with inorganic phosphate. The F1 subunit of the ATP synthase (Atp1 and Atp2) is suggested in many reports to be responsible for the ATP hydrolysis. We performed tetrad dissection to generate atp1Δ or atp2Δ in the pho85Δ background. After streaking single colonies onto a fresh plate, we noticed that atp1Δ mct1Δ and atp2Δ mct1Δ cells are inviable, and knocking out PHO85 rescued this synthetic lethality. It is not surprising that atp1Δ mct1Δ or atp2Δ mct1Δ cells are inviable, since the F1 subunit is important to generate a minimum of MMP in mct1Δ cells when the ETC is absent (i.e., rho0 cells). However, knocking out PHO85 can generate MMP independently of the F1 subunit of ATP synthase, as suggested by the viable atp1Δ mct1Δ pho85Δ and atp2Δ mct1Δ pho85Δ cells. In theory, there are many ATPases in the mitochondrial matrix that could hydrolyze ATP for the ADP/ATP carrier to generate MMP. However, we do not currently know exactly which ATPase(s) is activated by phosphate starvation. These data are now included as Figure 5—figure supplement 1F-G.

      4) What is the relevance of Sit4 for the Hap complex which regulates OXPHOS gene expression in yeast? The supplemental table suggests that Hap4 is strongly influenced by Sit4. Is this downstream of the proposed role in phosphate metabolism or a parallel Sit4 activity? This is a crucial point that should be addressed experimentally.

      To investigate the role of the Hap complex in MMP generation in sit4Δ cells, we overexpressed and knocked out HAP4, the catalytic subunit of the Hap complex, separately in wild-type and sit4Δ cells. We confirmed the HAP4 overexpression by the enriched abundance of ETC complexes as shown by BN-PAGE (Figure 2—figure supplement 1E). However, we did not observe any rescue of ETC or ATP synthase in mct1Δ cells when HAP4 was overexpressed. The enriched level of ETC complexes upon HAP4 overexpression is not sufficient to rescue the MMP (Figure 2—figure supplement 1F).

      Next, we knocked out HAP4 in sit4Δ cells. Knocking out SIT4 could still increase MMP in hap4Δ cells with a much-reduced magnitude, which phenocopied ETC subunit and RPO41 deletion in sit4Δ cells (Figure 2—figure supplement 1G).

      In conclusion, the Hap complex is involved in the MMP increase when SIT4 is absent. However, overexpressing HAP4 is not sufficient to increase MMP. The Hap complex discussion is now included in the manuscript, and the data are presented as Figure 2—figure supplement 1E-G.

      5) The authors use the accumulation of Ilv2 precursors as proxy for mitochondrial protein import efficiency. Ilv2 was reported before as a protein which, if import into mitochondria is slow, is deviated into the nucleus in order to be degraded (Shakya,..., Hughes. 2021, Elife). Is it possible that the accumulation of the precursor is the result of a reduced degradation of pre-Ilv2 in the nucleus rather than an impaired mitochondrial import? Since a number of components of the ubiquitin-proteasome system were identified with Sit4 in the same screen, a role of Sit4 in proteasomal degradation seems possible. This should be tested.

      We thank the reviewer for pointing out this potential caveat with our Ilv2-FLAG reporter. With limited searching and testing, we could not find another reporter that behaves like Ilv2-FLAG. Ilv2-FLAG is a well-suited reporter for this study because, in wild-type cells, it is not 100% imported. Therefore, we could demonstrate that mitochondria with higher MMP import more efficiently. Unfortunately, all of the other mitochondrial proteins that we tested were imported efficiently in wild-type cells. To identify other suitable mitochondrial proteins that behave like Ilv2-FLAG, we would need to conduct a more comprehensive screen.

      To address the concern of the involvement of protein degradation in obscuring the interpretation of Ilv2-FLAG import, we performed two experiments. First, we measured the proteasomal activity in wild-type and our mutants using a commercial kit (Cayman). We did not observe a statistically significant difference in 20S proteasomal activity between wild-type and sit4Δ cells.

      In the second experiment, we reduced the MMP of sit4Δ cells using CCCP treatment and measured the Ilv2-FLAG import. We first treated sit4Δ cells with different doses of CCCP for six hours and measured their MMP. sit4Δ cells treated with 75 µM CCCP had MMP comparable to wild-type cells. When we treated sit4Δ cells with higher concentrations of CCCP, most of the cells did not survive after six hours. Next, we performed the Ilv2-FLAG import assay. We observed a similar level of unimported Ilv2-FLAG (marked with *) in sit4Δ cells treated with 75 µM CCCP. This result confirms that sit4Δ cells have a similar Ilv2-FLAG turnover mechanism and activity to wild-type cells, because when we lower the MMP in the sit4Δ background we observe a similar level of unimported Ilv2-FLAG. We thus feel confident in concluding that the Ilv2-FLAG import results are indeed an accurate proxy for MMP level. These data are now included as Figure 1—figure supplement 1H-J in the manuscript.

      Author response image 1.

      Reviewer #2 (Public Review):

      This study reports interesting findings on the influence of a conserved phosphatase on mitochondrial biogenesis and function. In the absence of it, many nucleus-encoded mitochondrial proteins, including those involved in ATP generation, are expressed much better than in normal cells. In addition to a better understanding of the mechanisms that regulate mitochondrial function, this work may help in developing therapeutic strategies for diseases caused by mitochondrial dysfunction. However, there are a number of issues that need clarification.

      1) The rationale of the screening assay to identify genes required for the gene expression modifications observed in mct1 mutant is not clear. Indeed, after crossing with the gene deletion library, the cells become heterozygous for the mct1 deletion and should no longer be deficient in mtFAS. Thank you for clarifying this and if needed adjust the figure S1D to indicate that the mated cells are heterozygous for the mct1 and xxx mutations.

      We updated the methods section and the graphic for the genetic screen to clarify these points within the SGA workflow overview. After we created the heterozygote by mating mct1Δ cells with the individual KO cells in the collection, these diploids underwent sporulation and selection for the desired double KO haploid. As a result, the luciferase assay was performed in haploid cells with MCT1 and one additional non-essential gene deleted.

      2) The tests shown in Fig. S1E should be repeated on individual subclones (at least 100) obtained after plating for single colonies a glucose culture of mct1 mutant, to determine the proportion of cells with functional (rho+) mtDNA in the mct1 glucose and raffinose cultures. With for instance a 50% proportion of rho- cells, this could substantially influence the results of the analyses made with these cells (including those aiming to evaluate the MMP).

      We agree that this would provide a more confident estimate for population-level characterization of these colonies. It is important to note that we randomly chose 10 individual subclones, and 100% of these colonies were verified to be rho+. This suggests that the population has functional mtDNA, and we thus felt confident in the identity of our populations.

      3) The mitochondria area in mct1 cells (Fig.S1G) does not seem to be consistent with the tests in Fig. 1C. that indicate a diminished mitochondrial content in mct1 cells vs wild-type yeast. A better estimate (by WB for instance) of the mitochondrial content in the analyzed strains would enable to better evaluate MMP changes monitored with Mitotracker since the amount of mitochondria in cells correlate with the intensity of the fluorescence signal.

      As this reviewer pointed out, we quantified mitochondrial area based on Tom70-GFP signal. This measurement is expressed as mitochondrial area over cell size. Cell size is an important parameter when measuring organelle size, as most organelles scale up and down with cell size. mct1Δ cells generally have a smaller cell size than WT cells. Therefore, the mitochondrial area of mct1Δ cells was not significantly different from that of WT cells when scaled to cell size. We believe this is the best method to compare mitochondrial area. As for quantifying MMP from these microscopy images, we measured the average MitoTracker Red fluorescence intensity of each mitochondrion defined by Tom70-GFP. This method inherently normalizes away the influence of mitochondrial area when quantifying MMP.
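
      The two quantities described above can be illustrated on a toy image. This is a simplified sketch, not our image-analysis code: the function names and pixel values are hypothetical, masks are plain nested lists of 0/1, and a single mitochondrial mask is used for the whole cell rather than one mask per mitochondrion.

```python
def area_fraction(mito_mask, cell_mask):
    """Mitochondrial pixels (Tom70-GFP mask) divided by cell pixels."""
    cell_pixels = sum(c for row in cell_mask for c in row)
    mito_pixels = sum(
        m * c
        for mito_row, cell_row in zip(mito_mask, cell_mask)
        for m, c in zip(mito_row, cell_row)
    )
    return mito_pixels / cell_pixels


def mean_intensity_in_mask(image, mask):
    """Average intensity over masked pixels only, so the value does not
    grow simply because the masked area is larger."""
    values = [
        pixel
        for image_row, mask_row in zip(image, mask)
        for pixel, m in zip(image_row, mask_row)
        if m
    ]
    return sum(values) / len(values)


# Hypothetical 3 x 4 image of one cell.
cell_mask = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 1, 1, 0]]
tom70_mask = [[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
mitotracker = [[5, 80, 100, 4], [6, 90, 7, 5], [0, 6, 5, 0]]

mito_area_over_cell = area_fraction(tom70_mask, cell_mask)
mmp_proxy = mean_intensity_in_mask(mitotracker, tom70_mask)
```

      Because the MitoTracker signal is averaged rather than summed within the Tom70-GFP mask, a cell with more mitochondrial pixels does not automatically report a higher value.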

      4) Page 12: "These data demonstrate that loss of SIT4 results in a mitochondrial phenotype suggestive of an enhanced energetic state: higher membrane potential, hyper-tubulated morphology and more effective protein import." Furthermore, the sit4 mutant shows higher levels of OXPHOS complexes compared to WT yeast.

      Despite these beneficial effects on mitochondria, the sit4 deletion strain fails to grow on respiratory substrates. It would be good to know whether the authors have some explanation for this apparent contradiction.

      We agree that this was initially puzzling. We provide a more complete explanation above (see comments to reviewer #1 - major concern #1). Briefly, the growth deficiency of sit4Δ cells in non-fermentable media was reported and studied by multiple groups (Arndt et al., 1989; Dimmer et al., 2002; Jablonka et al., 2006). These studies seem to indicate that sit4Δ cells contain more ETC complexes and have a higher OCR but cannot respire on a nonfermentable carbon source. However, we do not think there is yet a clear explanation for this phenotype. One interesting observation is that the addition of aspartate partly restores growth on ethanol (Jablonka et al., 2006). One early study speculated that this defect could be related to the cAMP-PKA pathway: Sutton et al. pointed out genetic interactions between SIT4 and multiple genes in cAMP-PKA (Sutton et al., 1991). In addition, sit4Δ cells have phenotypes similar to those of cAMP-PKA null mutants, such as glycogen accumulation, caffeine resistance, and failure to grow on non-fermentable media. However, to keep this manuscript succinct, we opted to stay focused on MMP.

      Reviewer #3 (Public Review):

      In this study, the authors investigate the genetic and environmental causes of elevated Mitochondrial Membrane Potential (MMP) in yeast, and also some physiological effects correlated with increased MMP.

      The study begins with a reanalysis of transcriptional data from a yeast mutant lacking the gene MCT1, whose deletion has been shown to cause defects in mitochondrial fatty acid synthesis. The authors note that in raffinose, mct1del cells, unlike WT cells, fail to induce expression of many genes that code for subunits of the Electron Transport Chain (ETC) and ATP synthase. The deletion of MCT1 also causes induction of genes involved in acetyl-CoA production after exposure to raffinose. The authors therefore conduct a screen to identify mutants that suppress the induction of one of these acetyl-CoA genes, Cit2. They then validate the hits from this screen to see which of their suppressor mutants also reduce expression of four other genes induced in a mct1del strain. This yielded 17 genes that abolished induction of all 5 genes tested in an mct1del background during growth on raffinose.

      The authors chose to focus on one of these hits, the gene coding for the phosphatase SIT4 (related to human PP6) which also caused an increase in expression of two respiratory chain genes. The authors then investigated MMP and mitochondrial morphology in strains containing SIT4 and MCT1 deletions and surprisingly saw that sit4del cells had highly elevated MMP, more reticular mitochondria, and were able to fully import the acetolactate synthase protein Ilv2p and form ETC and ATP synthase complexes, even in cells with an mct1del background, rescuing the low MMP, fragmented mitochondria, low import of Ilv2 and an inability to form ETC and ATP synthase complexes phenotypes of the mct1del strain. Surprisingly, the authors find that even though MMP is high and ETC subunits are present in the sit4del mct1del double deletion strain, that strain has low oxygen consumption and cannot grow under respiratory conditions, indicating that the elevated MMP cannot come from fully functional ETC subunits. The authors also observe that deleting key subunits of ETC complex III (QCR2) and IV (COX5) strongly reduced the MMP of the sit4del mutant, which would suggest that the majority of the increase in MMP of the sit4del mutant was dependent on a partially functional ETC. The authors note that there was still an increase in MMP in the qcr2del sit4del and cox4del sit4del strains relative to qcr2del and cox4del strains, indicating that some part of the increase in MMP was not dependent on the ETC.

      The authors dismiss the possibility that the increase in MMP could have been through the reversal of ATP synthase because they observe that inhibition of ATP synthase with oligomycin led to an increase of MMP in sit4del cells, indicating that ATP synthase is operating in a forward direction in sit4del cells.

      Noting that genes for phosphate starvation are induced in sit4del cells, the authors investigate the effects of phosphate starvation on MMP. They found that phosphate starvation caused an increase in MMP and increased Ilv2p import even in the absence of a mitochondrial genome. They find that inhibition of the ADP/ATP carrier (AAC) with bongkrekic acid (BKA) abolishes the increase of MMP in response to phosphate starvation. They speculate that phosphate starvation causes an increase in MMP through the import and conversion of ATP to ADP and subsequent pumping of ADP and inorganic phosphate out of the mitochondria.

      They further show that MMP is also increased when the cyclin dependent kinase PHO85 which plays a role in phosphate signaling is deleted and argue that this indicates that it is not a decrease in phosphate which causes the increase in MMP under phosphate starvation, but rather the perception of a decrease in phosphate as signalled through PHO85. Unlike in the case of SIT4 deletion, the increase in MMP caused by the deletion of pho85 is abolished when MCT1 is deleted.

      Finally they show an increase in MMP in immortalized human cell lines following phosphate starvation and treatment with the phosphate transporter inhibitor phosphonoformic acid (PFA). They also show an increase in MMP in primary hepatocytes and in midgut cells of flies treated with PFA.

      The link between phosphate starvation and elevated MMP is an important and novel finding and the evidence is clear and compelling. Based on their experiments in various mammalian contexts, this link appears likely to be generalizable, and they propose and begin to test an interesting hypothesis for how MMP might occur in response to phosphate starvation in the absence of the Electron Transport Chain.

      The link between phosphate starvation and deletion of the conserved phosphatase SIT4 is also interesting and important, and while the authors' experiments and analysis suggest some connection between the two observations, that connection is still unclear.

      Major points

      Mitotracker is a great fluorescent dye, but it measures membrane potential only indirectly. There is a danger when cells change growth rates or ion concentrations, or when the pH changes: all MMP-indicating dyes change in fluorescence, and their signal is confounded. A change in phosphate levels can possibly do both, altering pH and ion concentrations. Because all conclusions of the manuscript are based on a change in MMP, it would be a great precaution to use a dye-independent measure of membrane potential, and confirm at least some key results.

      Mitochondrial MMP does strongly influence amino acid metabolism, and indeed the SIT4 knockout has a quite striking amino acid profile, with histidine, lysine, arginine, and tyrosine being increased in concentration. http://ralser.charite.de/metabogenecards/Chr_04/YDL047W.html Could this amino acid profile support the conclusions of the authors? At least lysine and arginine are down in petites due to a lack of membrane potential and iron sulfur cluster export, and here they are up. Along these lines, according to the same data resource, the knock-outs CSR2, ASF1, SSN8, YLR0358 and MRPL25 share the same metabolic profile. Due to limited time I did not re-analyse the data provided by the authors, but it would be worth checking if any of these genes did come up in the screens of the authors.

      We tested the mutants within the same cluster as SIT4 shown in this paper from the deletion collection and measured their MMP. yrl358cΔ cells have a high MMP similar to that observed in sit4Δ cells. However, this gene has a yet undefined function. Beyond YRL358C, we did not observe similar MMP increases in other gene deletions from this panel, which does not support the notion that amino acids such as histidine, lysine, arginine, or tyrosine play a determining role in driving MMP.

      The media conditions and strain used in the suggested paper are very different from those used in our study. Instead of growing prototrophic cells in minimal media without any amino acids, we used auxotrophic yeast strains and grew them in media containing a complete set of amino acids. So far, none of the other defects or signaling associated with SIT4 deletion could influence MMP as much as the phosphate signaling. We interpret these data to support the hypothesis that the MMP observation in sit4Δ cells is connected with the phosphate signaling as illustrated by the second half of the story in our manuscript.

      Author response image 2.

      One important claim in the manuscript attempts to explain a mechanism for the MMP increase in response to phosphate starvation which is independent of the ETC and ATP synthase.

      It seems to me the only direct evidence to support this claim is that inhibition of the AAC with BKA stops the increase of mitotracker fluorescence in response to phosphate starvation in both WT and rho0 cells (Figs 4B and 4C). It would strengthen the paper if the authors could provide some orthogonal evidence.

      This comment is similar to the one raised by reviewer #1 - major concern #3. We refer the reviewer to our discussion and the new data above. Briefly, we do not think the F1 subunit is responsible for the ATP hydrolysis activity that generates MMP under phosphate-depleted conditions. We believe there are additional ATPase(s) in the mitochondrial matrix that can be coupled to the ADP/ATP carrier for MMP generation during phosphate starvation. However, we have not identified the relevant ATPase(s) at this point, and it is likely that multiple ATPases could contribute to this activity.

      Introduction/Discussion: The authors might want to make the reader of the article aware that the 'reversal' of the ATP synthase directionality, i.e. ATP hydrolysis by the ATP synthase as a mechanism to create a membrane potential (in petites), has always been a provocative idea, but one that thus far could never be fully substantiated. Indeed, some people who are very familiar with the topic are skeptical that this indeed happens. For instance, Vowinckel et al 2021 (PMID: 34799698) measured precise carbon balances for petite cells and found no evidence for a futile cycle: petites grow slower, but accumulate the same biomass from glucose as petites that re-evolve a fast growth rate. Perhaps the manuscript could be updated accordingly.

      We thank the reviewer for pointing out this additional relevant study. We have rephrased the referenced sentence in the introduction. The MMP generation in phosphate starvation is independent of the F1 portion of ATP synthase. Therefore, our data neither support nor refute either of these arguments.

      In the introduction and conclusion there is discussion of MMP set points. In particular the authors state:

      "Critically, we find that cells often prioritize this MMP setpoint over other bioenergetic priorities, even in challenging environments, suggesting an important evolutionary benefit."

      This does not seem to be consistent with the central finding of the manuscript that MMP changes under phosphate starvation. MMP doesn't seem so much to have a 'set point' but rather be an important physiological variable that reacts to stimuli such as phosphate starvation.

      The reviewer raises a rational alternative hypothesis to the one that we have proposed. In reality, both of these are complete speculations to explain the data and we can’t think of any way to test the evolutionary basis for the mechanisms that we describe. We recognize that untested/untestable speculative arguments have limitations and there are viable alternative hypotheses. We have softened our language to ensure that it is clear that this is only a speculation.

      The authors suggest that deletion of Pho85 causes an increase in MMP because of cellular signaling. However, they also state in the conclusion:

      "Unlike phosphate starvation, the pho85D mutant has elevated intracellular phosphate concentrations. This suggests that the phosphate effect on MMP is likely to be elicited by cellular signaling downstream of phosphate sensing rather than some direct effect of environmental depletion of phosphate on mitochondrial energetics."

      The authors should cite the study that shows deletion of PHO85 causes increased intracellular phosphate concentrations. It also seems possible that the 'cellular signaling' that causes the increase in MMP could be a result of this increase in intracellular phosphate concentrations, which could constitute a direct effect of an environmental overload of phosphate on mitochondrial energetics.

      We now cite the literature that shows higher intracellular phosphate in pho85Δ cells (Gupta et al., 2019; Liu et al., 2017). Depleting phosphate in the media drastically reduced intracellular phosphate concentration, which is the opposite of the situation in pho85Δ cells. Nevertheless, we observed higher MMP in both situations. We concluded from these two observations that the increase in MMP is a response to the signaling activated by phosphate depletion rather than to the intracellular phosphate abundance.

      Related to this point, in the conclusion, the authors state:

      "We now show that intracellular signaling can lead to an increased MMP even beyond the wild-type level in the absence of mitochondrial genome."

      In sum, the data shows that signaling is important here- but signaling alone is only the message - not the biophysical process that creates a membrane potential. The authors then could revise this slightly.

      We have rephrased this sentence as suggested, which now reads “We now show that intracellular signaling triggers a process that can lead to an increased MMP even beyond the wild-type level in the absence of mitochondrial genome”.

      The authors state in the conclusion that

      "We first made the observation that deletion of the SIT4 gene, which encodes the yeast homologue of the mammalian PP6 protein phosphatase, normalized many of the defects caused by loss of mtFAS, including gene expression programs, ETC complex assembly, mitochondrial morphology, and especially MMP (Fig. 1)"

      The data shown, though, indicate that, rather than normalizing the mtFAS defect in terms of MMP, deletion of SIT4 causes a huge increase (and departure away from normality) whether or not mct1 is present (Fig 1D).

      We changed the word “normalized” to “reversed”. In the discussion section, we also emphasized that many of these increases are independent of mitochondrial dysfunction induced by loss of mtFAS.

      The language "SIT4 is required for both the positive and negative transcriptional regulation elicited by mitochondrial dysfunction" feels strong. SIT4 seems to influence positive transcriptional regulation in response to mitochondrial dysfunction caused by MCT1 deletion (but may not be the only thing as there appears to be an increase in CIT2 expression in a sit4del background following a further deletion of MCT1). In terms of negative regulation, SIT4 deletion clearly affects the baseline, but MCT1 deletion still causes down regulation of both examples shown in Fig 1B, showing that negative transcriptional regulation can still occur in the absence of SIT4. The authors might consider showing fold change of expression as they do in later figures (Figs 4B and C) to help the reader evaluate the quantitative changes they demonstrate.

      We now display the fold change as suggested. This sentence now reads “These data suggest that SIT4 positively and negatively influences transcriptional regulation elicited by mitochondrial dysfunction”.

      The authors induce phosphate starvation by adding increasing amounts of potassium phosphate monobasic at a pH of 4.1 to phosphate dropout media supplemented with potassium. The authors did well to avoid confounding effects of removing potassium. The final pH of YNB is typically around 5.2. Is it possible that the authors are confounding a change in pH with phosphate starvation? One would expect the media in the phosphate starvation condition to have a higher pH than the phosphate replacement or control media. Is a change in pH possibly a confounding factor when interpreting phosphate starvation? Perhaps the authors could quantify the pH of the media they use for the experiment to understand how much of a factor that could be. One needs to be careful with Mitotracker and any other fluorescent dye when pH changes. Albeit having constraints of its own, MitoLoc as a protein rather than small molecule marker of MMP might be a good complement.

      We followed the protocol used by many other studies that depleted phosphate in the media. The reason we and others adjusted the media without inorganic phosphate to a pH of 4.1 is that this is the pH of phosphate monobasic. From there, we could add phosphate monobasic to create +Pi media without changing the media pH. Therefore, media containing different concentrations of phosphate all have the exact same pH. We now emphasize in the manuscript that all media containing different levels of inorganic phosphate have the same pH, to address this concern (see page 18).

      Even though all media have the same pH, we also provided complementary data using a parallel approach to measure the MMP by assessing mitochondrial protein import, as demonstrated previously with Ilv2-FLAG, which shares the same principle as MitoLoc.

      References

      Arndt, K. T., Styles, C. A., & Fink, G. R. (1989). A suppressor of a HIS4 transcriptional defect encodes a protein with homology to the catalytic subunit of protein phosphatases. Cell, 56(4), 527–537. https://doi.org/10.1016/0092-8674(89)90576-X

      Dimmer, K. S., Fritz, S., Fuchs, F., Messerschmitt, M., Weinbach, N., Neupert, W., & Westermann, B. (2002). Genetic basis of mitochondrial function and morphology in Saccharomyces cerevisiae. Molecular Biology of the Cell, 13(3), 847–853. https://doi.org/10.1091/mbc.01-12-0588

      Gupta, R., Walvekar, A. S., Liang, S., Rashida, Z., Shah, P., & Laxman, S. (2019). A tRNA modification balances carbon and nitrogen metabolism by regulating phosphate homeostasis. ELife, 8, e44795. https://doi.org/10.7554/eLife.44795

      Jablonka, W., Guzmán, S., Ramírez, J., & Montero-Lomelí, M. (2006). Deviation of carbohydrate metabolism by the SIT4 phosphatase in Saccharomyces cerevisiae. Biochimica et Biophysica Acta (BBA) - General Subjects, 1760(8), 1281–1291. https://doi.org/10.1016/j.bbagen.2006.02.014

      Liu, N.-N., Flanagan, P. R., Zeng, J., Jani, N. M., Cardenas, M. E., Moran, G. P., & Köhler, J. R. (2017). Phosphate is the third nutrient monitored by TOR in Candida albicans and provides a target for fungal-specific indirect TOR inhibition. Proceedings of the National Academy of Sciences, 114(24), 6346–6351. https://doi.org/10.1073/pnas.1617799114

      Sutton, A., Immanuel, D., & Arndt, K. T. (1991). The SIT4 protein phosphatase functions in late G1 for progression into S phase. Molecular and Cellular Biology, 11(4), 2133–2148.

    1. Author Response

      Reviewer #1 (Public Review):

      This study provides further detailed analysis of recently published Fly Atlas datasets supplemented with newly generated single cell RNA-seq data obtained from 6,000 testis cells. Using these data, the authors define 43 germline cell clusters and 22 somatic cell clusters. This work confirms and extends previous observations regarding changing gene expression programs through the course of germ cell and somatic cell differentiation.

      This study makes several interesting observations that will be of interest to the field. For example, the authors find that spermatocytes exhibit sex chromosome specific changes in gene expression. In addition, comparisons between the single nucleus and single cell data reveal differences in active transcription versus global mRNA levels. For example, previous results showed that (1) several mRNAs remain high in spermatids long after they are actively transcribed in spermatocytes and (2) defined a set of post-meiotic transcripts. The analysis presented here shows that these patterns of mRNA expression are shared by hundreds of genes in the developing germline. Moreover, variable patterns between the sn- and sc-RNAseq datasets reveals considerable complexity in the post-transcriptional regulation of gene expression.

      Overall, this paper represents a significant contribution to the field. These findings will be of broad interest to developmental biologists and will establish an important foundation for future studies. However, several points should be addressed.

      In figure 1, I am struck by the widespread expression of vasa outside of the germ cell lineage. Do the authors have a technical or biological explanation for this observation? This point should be addressed in the paper with new experiments or further explanation in the text.

      Thank you for pointing this out. We found that our single cell dataset shows a similar (low) level of vasa expression outside the germline, suggesting that this is not due to the use of single nucleus rather than single cell RNA-seq (cluster 1, red in the left-hand UMAP).

      Analyzing the single nucleus RNA-seq in more detail revealed that, compared to the germline, both the fraction of cells in a cluster expressing vasa and the level at which they express it are very low. This analysis is included in a new Figure 1 – figure supplement 1. It is likely that much of this is due to a technical artifact, such as ambient RNA. Finally, we note in the resubmission that vasa is in fact expressed in embryonic somatic cells, and thus some of the vasa expression we observe may be real (Renault. Biol Open 2012; https://doi.org/10.1242/bio.20121909).

      Plots in the original submission drew undue attention to the few somatic cells that exhibited vasa signal, due to the fact that expressing cell points were forced to the front of the plot. Given our new analysis reporting the low levels and fraction of cells exhibiting vasa expression (Figure 1 – figure supplement 1), we have modified the panels of Figure 1, changing point size to more faithfully reflect the small proportion of somatic cells with some vasa expression.
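
      The per-cluster summary described above (the fraction of cells in a cluster with detectable vasa signal, and the level at which those cells express it) can be sketched as follows. This is only an illustrative sketch in plain Python, not the pipeline used for Figure 1 – figure supplement 1: the function name, the detection threshold, the cluster labels, and the toy expression values are all assumptions made up for the example.

```python
from collections import defaultdict


def summarize_gene_by_cluster(values, clusters, threshold=0.0):
    """Summarize one gene across clusters.

    values   -- expression value of the gene in each cell (hypothetical units)
    clusters -- cluster label of each cell, in the same order as `values`
    Returns {cluster: (fraction_of_cells_expressing, mean_level_in_expressing_cells)}.
    """
    per_cluster = defaultdict(list)
    for value, cluster in zip(values, clusters):
        per_cluster[cluster].append(value)

    summary = {}
    for cluster, cluster_values in per_cluster.items():
        # A cell counts as "expressing" if its value exceeds the (assumed) threshold.
        expressing = [v for v in cluster_values if v > threshold]
        fraction = len(expressing) / len(cluster_values)
        mean_level = sum(expressing) / len(expressing) if expressing else 0.0
        summary[cluster] = (fraction, mean_level)
    return summary


# Toy, made-up values: three germline cells and five somatic (muscle) cells.
vasa_values = [5.0, 4.0, 6.0, 0.0, 0.0, 1.0, 0.0, 0.0]
cluster_labels = ["germline"] * 3 + ["muscle"] * 5
summary = summarize_gene_by_cluster(vasa_values, cluster_labels)
```

      In this toy example every germline cell expresses the gene at a high level, whereas only one of five muscle cells shows any signal, and at a low level. That is the kind of contrast such a summary is meant to expose, as opposed to a UMAP in which expressing points are drawn on top.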

      The proposed bifurcation of the cyst cells into head and tail populations is interesting and worth further exploration/validation. While the presented in situ hybridization for Nep4, geko, and shg hint at differences between these populations, double fluorescent in situs or the use of additional markers would help make this point clearer. Higher magnification images would also help in this regard.

      We thank the reviewer for their suggestions on clarifying the differences between HCC and TCC populations. As suggested, we have repeated the FISH experiments of Nep4 and geko with higher resolution, and included the additional marker Coracle that demarcates the junction between HCC and TCC (Figure 6O,Q,S,T). These panels replaced previous Nep4 and geko FISH images (see previous Figure 6Q,U,U’). FISH for Nep4 validated the split, and the enrichment of geko strongly suggests that this arm represents one cell type (HCCs). We have not yet identified a gene reciprocally enriched to the other arm. Therefore, in the revised submission, we call the assignment of TCC identity, and to a lesser extent, HCC identity ‘tentative’, but point out that genes predicted to be enriched to one or the other arm represent fertile candidates for the field to test.

      Reviewer #2 (Public Review):

      In this manuscript the authors explain in greater detail a recent testis snRNAseq dataset that many of these authors published earlier this year as part of the Fly Cell Atlas (FCA) Li et al. Science 2022. As part of the current effort additional collaborators were recruited and about 6,000 whole cell scRNAseq cells were added to the previous 42,000 nuclei dataset. The authors now describe 65 snRNAseq clusters, each representing potential cell types or cell states, including 43 germline clusters and 22 somatic clusters. The authors state that this analysis confirms and extends previous knowledge of the testis in several important areas.

      “However, in areas where testis biology is well studied, such as the development of germ cells from GSC to the onset of spermatocyte differentiation, the resolution seems less than current knowledge by considerable margins. No clusters correspond to GSCs, or specific mitotic spermatogonia, and even the major stages of meiotic prophase are not resolved. Instead, the transitions between one state and the next are broad and almost continuous, which could be an intrinsic characteristic of the testis compared to other tissues, of snRNAseq compared to scRNAseq, or of the particular experimental and software analysis choices that were used in this study.”

      Note that the referee raises the same issue later in their review also. To respond succinctly, we placed the relevant sentence from a later portion of this referee’s comment here:

      “Support for the view that the problems are mostly technical, rather than a reflection of testis biology, comes from studies of scRNAseq in the mouse, where it has been possible to resolve a stem cell cluster, and germ cell pathways that follow known germ cell differentiation trajectories with much more discrete steps than were reported here (for example, Cao et al. 2021 cited by the authors).”

      Respectfully, we have a different interpretation of other work as cited by this referee. Our data, as well as that from others, supports the notion that transitions are generally broad and continuous and are indeed a feature of testis biology. As we report here, data from both single cell and single nucleus RNAseq exhibit transitions from one cluster to the next. Thus, this feature cannot be due to the choice of method (single cell versus single nucleus).

      In fact, prior scRNA-seq results on systems containing a continuously renewing cell population, such as is the case in the testis, do indeed exhibit a contiguous trajectory rather than discrete, well-separated cell states in gene expression space (that is, in a UMAP presentation). For example, this is the case from single-cell or single-nucleus sequencing from spermatogenesis in mouse (Cao et al 2021), human (Sohni et al 2019), and zebrafish (Qian et al 2022).

      Along differentiation trajectories in these tissues, successive clusters are defined by their aggregate transcript repertoire. Indeed, differentially-expressed genes can be identified for clusters, with expression enriched in a given cluster. However, expression is rarely restricted to a cluster. For instance, Cao et al. subcluster spermatogonia into four subgroups, termed SPG1-4. They state clearly that these SPG1-4 “follow a continuous differentiation trajectory,” as can be inferred by marker expression across cells in this lineage. Similar to our findings, while the spermatogonia can fall into discrete clusters, gene expression patterns are contiguous. For example, the “undifferentiated” marker used in Cao et al, Crabp1, clearly shows expression in SPG1-3, annotated as spermatogonial stem cells, undifferentiated spermatogonia, and early differentiated spermatogonia, respectively. Likewise, markers for the “SPG3” state spermatogonia have detectable expression in SPG2 and SPG4, and likewise for markers of the “SPG4” state (with expression found also in SPG3).

      An analogous study of human spermatogenesis arrives at a similar conclusion. In that work, although clusters are named as “spermatogonial stem cell (SSC)”, the authors are careful to specifically point out that, “…while we refer to the SSC-1 and SSC-2 cell clusters as ‘SSCs,’ scRNA-seq is not a functional assay and thus we do not know the percentage of cells in these clusters with SSC activity. These subsets almost certainly contain other A-SPG cells [A type spermatogonia], including SPG progenitors that have committed to differentiate.” (Sohni et al 2019)

      Thus, the work in several disparate systems, all involving renewing lineages, finds that discrete clusters, such as a “stem cell cluster” are not identified. In the Drosophila testis, germline differentiation flows in a continuous-like manner similar to spermatogenesis in several other organisms studied by scRNA-seq, and our finding is not a function of the methodology, but rather a facet of the biology of the organ.

      Operating in parallel with continuous differentiation, we did find evidence of, and extensively discussed in concert with Figure 4, huge and dramatic shifts in transcriptional state in spermatocytes compared to spermatogonia, in early spermatids compared to spermatocytes, and in late spermatid elongation. Lastly, as we describe further below, new data in this resubmission identify four distinct genes with stage-selective expression as predicted by our analysis (new Figure 2 - figure supplement 1), illustrating the utility of our study for the field to find new markers and new genes to test for function.

      A goal of the study was to identify new rare cell types, and the hub, a small apical somatic cell region, was mentioned as a target region, since it regulates both stem cell populations, GSCs and CySCs, is capable of regeneration, and has other fascinating properties. However, the analysis of the hub cluster revealed more problems of specificity. 41 of 120 cells in the cluster were discordant with the remaining 79, which did express markers consistent with previous studies. Why these cells co-clustered was not explained and one can only presume that similar problems may be found in other clusters.

      Our writing seems not to have been clear enough on this point and we thank the reviewer. We have revised the section. In addition, we have added new data (Figure 7 - figure supplement 2). We had already stated that only 79 of these 120 nuclei were near to each other in 2D UMAP space, while other members of original cluster 90 were dispersed. Thus the 79 hub nuclei in fact clustered together on the UMAP. Other nuclei that mapped at dispersed positions were initially ‘called’ as part of this cluster in the original Fly Cell Atlas (FCA) paper (Li et al., 2022), making it obvious that a correction to that assignment was necessary, which we carried out. To our eye, no other called cluster was represented by such dispersed groupings. For the hub, we definitively established the 79 nuclei to represent hub cells by marker gene analysis, including the identification of a new marker, tup, that was included in the 79 annotated hub nuclei but excluded from the 41 other nuclei (Figure 7). In this resubmission, to independently verify the relationship of the 79 nuclei to each other, we subjected the 120 nuclei from the original cluster 90 defined by the FCA study to hierarchical clustering using only genes that are highly expressed and variable in these nuclei (Figure 7 - figure supplement 2). This computationally distinct approach strongly supported our identification of the 79 definitive hub nuclei.
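
      The general idea of the verification described above, restricting to highly expressed and variable genes and then hierarchically clustering the nuclei, can be sketched as follows. This is a hedged, self-contained illustration, not the analysis behind Figure 7 - figure supplement 2: the response does not state the gene-selection cutoffs, distance metric, or linkage method, so the mean-expression cutoff, Euclidean distance, and average linkage used here are assumptions, as are all function names and the toy expression matrix.

```python
import math
from statistics import mean, pvariance


def select_genes(matrix, min_mean, top_k):
    """Pick genes that are highly expressed (mean >= min_mean) and most variable.

    matrix -- rows are nuclei, columns are genes (hypothetical expression values)
    Returns the sorted column indices of the top_k most variable retained genes.
    """
    candidates = []
    for g in range(len(matrix[0])):
        column = [row[g] for row in matrix]
        if mean(column) >= min_mean:
            candidates.append((pvariance(column), g))
    candidates.sort(reverse=True)  # most variable first
    return sorted(g for _, g in candidates[:top_k])


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two groups whose
    members have the smallest average pairwise distance."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = mean(euclidean(points[a], points[b])
                         for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy, made-up matrix: 6 nuclei x 4 genes. Gene 2 is barely expressed and
# gene 3 is expressed but invariant, so only genes 0 and 1 should be kept.
nuclei = [
    [9.0, 8.0, 0.1, 1.0],
    [8.5, 9.0, 0.0, 1.0],
    [9.5, 8.5, 0.2, 1.0],
    [1.0, 0.5, 0.1, 1.0],
    [0.5, 1.0, 0.0, 1.0],
    [1.5, 0.0, 0.1, 1.0],
]
genes = select_genes(nuclei, min_mean=0.5, top_k=2)
points = [[row[g] for g in genes] for row in nuclei]
groups = average_linkage(points, n_clusters=2)
```

      On this toy matrix the first three nuclei separate cleanly from the last three. A real analysis would normally use an established library implementation and inspect the full dendrogram rather than cutting at a fixed number of groups.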

      Indeed, many other indications of specificity issues were described, including contamination of fat body with spermatocytes, the expression of germline genes such as Vasa in many somatic cell clusters like muscle, hemocytes, and male gonad epithelium, and the promiscuous expression of many genes, including 25% of somatic-specific transcription factors, in mid to late spermatocytes. The expression of only one such gene, Hml, was documented in tissue, and the authors for reasons not explained did not attempt to decisively address whether this phenomenon is biologically meaningful.

      We discussed the question of vasa expression in somatic clusters in some detail above, in response to referee #1, and included new analysis in the resubmission.

      With respect to the observation of ‘somatic gene’ expression in spermatocytes, we are also intrigued. We do not believe this is due to “contamination,” but rather a spermatocyte expression program that includes expression of somatic genes. First, these somatic markers were not observed in other germline clusters, which would be expected if this was due to general transcript contamination. Second, we observed expression of somatic markers in spermatocytes independently in the single-cell and single-nucleus data, making it unlikely to be an artifact of preparation of isolated nuclei. Finally, in the resubmission, in addition to Hml, we validated ‘somatic’ marker expression in spermatocytes by FISH of a somatic, tail cyst cell marker, Vsx1. Vsx1 is predicted to be expressed at low levels in spermatocytes in our dataset and is clearly visible in germline cells by FISH (Figure 3 – figure supplement 2G,H). We also refer the referee to Figure 6K, where the mRNA for the somatic cyst cell marker eya was observed by FISH at low levels in spermatocytes.

      A truly interesting question mentioned by the authors is why the testis consistently ranks near the top of all tissues in the complexity of its gene expression. In the Li et al. (2022) paper it was suggested that this is due to an inherently greater biological complexity of spermiogenesis than other tissues. It seems difficult to independently and rationally determine "biological complexity," but if a conserved characteristic of testis was to promiscuously express a wide range of (random?) genes, something not out of the question, this would be highly relevant and important.

      We agree that the massive transcriptional program found in spermatocytes is, indeed, truly interesting. There are many speculations as to why spermatocytes are so highly transcriptional, including the possibility of “transcriptional scanning” (e.g., Xia et al. 2020) regulating the evolution of new genes. Testing such models is beyond the scope of this paper. However, one must also keep in mind that spermatogenesis involves one of the most dramatic cellular transformations in biology, where cellular components spanning from nuclei to chromatin to Golgi, cell cycle, extensive membrane addition, changes in cell shape, and building of a complex swimming organelle all must occur and be temporally coordinated. Small wonder that many genes must be expressed to accomplish these tasks.

      Unfortunately, the most likely problems are simply technical. Drosophila cells are small and difficult to separate as intact cells. The use of nuclei was meant to overcome this inherent problem, but the effectiveness of this new approach is not yet well-documented. Support for the view that the problems are mostly technical, rather than a reflection of testis biology, comes from studies of scRNAseq in the mouse, where it has been possible to resolve a stem cell cluster, and germ cell pathways that follow known germ cell differentiation trajectories with much more discrete steps than were reported here (for example, Cao et al. 2021 cited by the authors).

      We respectfully disagree with the referee about this collection of statements. First, the use of snRNASeq has been extensively characterized and compared to scRNA-seq in brain tissue by McLaughlin et al., 2021 (cited in the original submission) and was shown to be effective (McLaughlin, et al. eLife 2021;10:e63856. DOI: https://doi.org/10.7554/eLife.63856). snRNA-seq has a distinct advantage when dealing with long, thin cells, such as neurons or cyst cells (as featured in this work), where cytoplasm can easily be sheared off during cell isolation. Second, in a previous portion of our response to this referee, we discussed how our interpretation of Cao et al., 2021 differs from that expressed by this referee. Lastly, as requested in ‘Essential revision’ 2, we adjusted clustering methods and selected four genes, two predicted to be markers for early stage germline cells, and two for mid-spermatocyte stage development. FISH analysis demonstrates that expression for each of these maps to the appropriate stages (new Figure 2 - figure supplement 1). This confirms that the datasets we present in this manuscript can be mined to identify unique, diagnostic markers for various stages.

      The conclusions that were made by the authors seem to either be facts that are already well known, such as the problem that transcriptional changes in spermatocytes will be obscured by the large stored mRNA pool, or promises of future utility. For example, "mining the snRNA-seq data for changes in gene expression as one cluster advances to the next should identify new sub-stage-specific markers." If worthwhile new markers could be identified from these data, surely this could have been accomplished and presented in a supplemental Table. As it currently stands, the manuscript presents the dataset including a fair description of its current limitations, but very little else of novel biological interest is to be found.

      “In sum, this project represents an extremely worthwhile undertaking that will eventually pay off. However, some currently unappreciated technical issues, in cell/nuclear isolation, and certainly in the bioinformatic programs and procedures used that mis-clustered many different cells, has created the current difficulties.

      Most scRNAseq software is written to meet the needs of mammalian researchers working with cultured cells, cellular giants compared to Drosophila and of generally similar size. Such software may not be ideal for much smaller cells, but which also include the much wider variation in cell size, properties and biological mechanisms that exist in the world of tissues.”

      We appreciate the referee’s acknowledgement that this ‘undertaking will eventually pay off’. It was not our intention to address ‘function’ for this study, but rather to make the system accessible to the broadest community possible. We are uncertain if there is any remaining reservation held by this referee. A brief summary of what we covered in the manuscript may help allay any residual concern. Obviously, study of the Drosophila testis and spermatogenesis benefits from the knowledge of a large number of established cell-type and stage-selective markers. Thus, we extensively used the community’s accepted markers to assign identity to clusters in both the sn- and sc-RNA-seq UMAPs. We believe that effort well establishes the validity and reliability of the dataset. Furthermore, we identified upwards of a dozen new markers out of the cluster analysis, and verified their expression by FISH or reporter line in various figures throughout (tup, amph, piwi, geko, Nep4, CG3902, Akr1B, loqs, Vsx1, Drep2, Pxt, CG43317, Vha16-5, l(2)41Ab). To our mind, these contributions, coupled with annotation of the datasets, suggest strongly that they will serve the community well. This is especially true as we provide users with objects that they can feed into commonly used software algorithms such as Seurat and Monocle to explore the datasets to their purposes. Rather than simply relying on default settings within some of the applications, we also adjusted parameters for various clusterings as called for; some of which were in response to astute comments from referees, and included in the resubmission. Of course, it is possible that rare issues may arise in the datasets as these are further studied, but that is the case with all scRNA-seq data, and is not specific to work on this model organism.

      Reviewer #3 (Public Review):

      In this study, the authors use recently published single nucleus RNA sequencing data and a newly generated single cell RNA sequencing dataset to determine the transcriptional profiles of the different cell types in the Drosophila testis. Their analysis of the data and experimental validation of key findings provide new insight into testis biology and create a resource for the community. The manuscript is clearly written, the data provide strong support for the conclusions, and the analysis is rigorous. Indeed, this manuscript serves as a case study demonstrating best practices in the analysis of this type of genomics data and the many types of predictions that can be made from a deep dive into the data. Researchers who are studying the testis will find many starting points for new projects suggested by this work, and the insightful comparison of methods, such as between slingshot and Monocle3 and single cell vs single nucleus sequencing will be of interest beyond the study of the Drosophila testis.

      We greatly appreciate the reviewer’s comments.

      Reviewer #4 (Public Review):

      This is an extraordinary study that will serve as a key resource for all researchers in the field of Drosophila testis development. The lineages that derive from the germline stem cells and somatic stem cells are described in a detail that has not been previously achieved. The RNAseq approaches have permitted the description of cell states that have not been inferred from morphological analyses, although it is the combination of RNAseq and morphological studies that makes this study exceptional. The field will now have a good understanding of interactions between specific cell states in the somatic lineage with specific states in the germ cell lineage. This resource will permit future studies on precise mechanisms of communication between these lineages during the differentiation process, and will serve as a model for studies of co-differentiation in other stem cell systems. The combination of snRNAseq and scRNAseq has conclusively shown differences in transcriptional activation and RNA storage at specific stages of germ cell differentiation and is a unique study that will inform other studies of cell differentiation.

      Could the authors please describe whether genes on the Y chromosome are expressed outside of the male germline. For example, what is represented by the spots of expression within the seminal vesicle observed in Figure 3D?

      Prior work demonstrated that proteins encoded by Y-linked genes are not expressed outside of the germline (Zhang et al. Genetics 2020. https://doi.org/10.1534/genetics.120.303324). In our snRNAseq dataset, we find that genes on the Y chromosome are not highly expressed outside of the male germline (on the order of ~100-fold lower in other tissues). In fact, we observe Y chromosome transcripts at this level in many nuclei across tissues collected for the Fly Cell Atlas project, including the ovary. Since we have not followed up on the Fly Cell Atlas observations directly using FISH to examine Y chromosome transcript expression outside the germline, we cannot rule out the possibility that such low-level expression is real. However, the detection across several tissues argues that this is likely a technical artifact. With regard to ‘spots of expression within the seminal vesicle’ (Figure 3D), a spot is colored red if the average expression level of genes on the Y chromosome is greater in that cell than in an average cell on our plot. These red spots are likely due to ambient RNA being carried over.
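
      As an illustration only (this is not the code used to produce Figure 3D), the coloring rule stated in the response can be sketched in a few lines of Python. The function name, the dictionary layout, the "grey" fallback color, and the example expression values are all assumptions made for this sketch; only the rule itself (red when a cell's mean Y-linked expression exceeds the mean over all plotted cells) comes from the text.

```python
def y_expression_colors(cells, y_genes):
    """Color each cell by its mean Y-linked expression relative to the plot-wide mean.

    cells   : dict mapping a cell id to a {gene: expression} dict (hypothetical layout)
    y_genes : list of Y-linked gene names to average over
    """
    # Mean expression of the Y-linked genes within each cell (missing genes count as 0).
    per_cell = {
        cell_id: sum(expr.get(gene, 0.0) for gene in y_genes) / len(y_genes)
        for cell_id, expr in cells.items()
    }
    # Mean of those per-cell scores across every cell on the plot.
    plot_mean = sum(per_cell.values()) / len(per_cell)
    # Red if the cell is above the plot-wide mean; "grey" is an assumed fallback.
    return {
        cell_id: ("red" if score > plot_mean else "grey")
        for cell_id, score in per_cell.items()
    }


# Made-up values: one germline-like cell and two somatic-like cells.
example = {
    "germline": {"kl-3": 10.0, "kl-5": 8.0},
    "seminal_vesicle": {"kl-3": 0.1},
    "cyst": {},
}
colors = y_expression_colors(example, ["kl-3", "kl-5"])
```

      Because the threshold is relative to the plot-wide mean, a cell with only a trace of Y-linked signal can still be colored red when most other cells have none, which is consistent with the ambient-RNA explanation given above.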

      I would appreciate some discussion of the "somatic factors" that are observed to be upregulated in spermatocytes (e.g. Mhc, Hml, grh, Syt1). Is there any indication of functional significance of any of these factors in spermatocytes?

      This is an excellent question. Although we validated expression for several (Hml, Vsx1 and eya), we did not test for their function here and this issue remains to be studied. This is now directly stated in the main text.

      In the discussion of cyst cell lineage differentiation following cluster 74 the authors state that neither the HCC nor TCC lineages were enriched for eya (Figure 6V). It seems in this panel that cluster 57 shows some enrichment for eya - is this regarded as too low expression to be considered enriched?

      We thank the reviewer for their insightful comment and we agree with their conclusions. We have modified the text to reflect the low, but present, expression of eya in the HCC and TCC lineages. The text now reads as follows at line (insert line # here): “Enrichment of eya was dramatically reduced in the clusters along either late cyst cell branch compared to those of earlier lineage nuclei (Figure 6J,U).”

    1. Author Response

      Reviewer #2 (Public Review):

      The work proposes a new computational rule for classifying synaptic plasticity outcome based on the geometry of synaptic enzyme dynamics. Specifically, the authors implement a multi-timescale model of hippocampal synaptic plasticity induction that takes into account the dynamics of the membrane potential, calcium concentration as well as CaMKII and calcineurin signalling pathways. They show that the proposed rule could be applied to reproduce the outcomes from nine published experimental studies involving different spike-timing and frequency-dependent plasticity induction protocols, animal ages, and experimental conditions. The model has been also used to generate predictions regarding the effect of spike-timing irregularity on plasticity outcomes. The proposed approach constitutes an interesting and original idea that contributes to the ongoing effort in discovering the rules of synaptic plasticity.

      The conclusions of this paper are mostly well supported by data, but some model assumptions and interpretation of modelling results need to be clarified and extended.

      1) The proposed model captures well the stochastic nature of the dendritic spine ion channels and receptors except for the calcium-sensitive potassium (SK) channel that has been modelled deterministically. Given that the same justification in terms of small number of channels present in the small dendritic spine compartment applies to the SK channels as well as to the voltage gated calcium channels and the AMPA and NMDA receptors, it is not clear why the authors have chosen a deterministic representation in the case of SK. The implications of this assumption need to be investigated and discussed.

      There are several stochastic models of AMPA and NMDA receptors based on single-channel recordings. Additionally, we had enough experimental data on single-channel recordings to build a custom Markov chain model of VGCCs. For the SK channel, we could not find enough experimental data (age dependence of activity, temperature sensitivity, etc.) to custom-build a stochastic model. We thus decided to implement a deterministic model. Yet, we understand the reviewer’s comment that in theory, a stochastic model of SK channels could impact our results. We thus now provide a simulation with a stochastic model of SK, comparing it to the deterministic model implemented in the study.

      We describe a minimal version of a stochastic model of SK compatible with the deterministic version. The deterministic model of the SK channel, fit at ~35 °C, is described in the Methods section.

      Because of the factor ρ_f^SK in the equation, which multiplies r(Ca) by ~2, the equation cannot be related to a 2-state Markov chain (MC). A 3-state MC would probably make this possible, but we used a different strategy. Noting that ρ_f^SK ∼ 2, we introduce a new equation

      As 0 < r(Ca) < 1, it is straightforward to introduce a 2-state MC for which the above equation describes the probability of the open state. We then simulate two such independent (for a given Ca concentration) channels and approximate m_SK as the sum (which belongs to [0, 2N_SK]) of the open states for the 2 channels.
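
      The construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code behind the comparison figure: the Hill form and parameter values of r(Ca), the time constant `tau`, the channel count `n_sk`, and the transition rates r/τ (opening) and (1 − r)/τ (closing) are all hypothetical choices made here so that the mean open fraction p obeys dp/dt = (r(Ca) − p)/τ. The sketch then runs two independent populations of 2-state channels and sums their open counts, giving a value in [0, 2·n_sk].

```python
import random


def hill(ca, half=0.333, exponent=6.0):
    """Hill-type activation r(Ca) in (0, 1); parameter values are placeholders."""
    x = ca ** exponent
    return x / (x + half ** exponent)


def step_population(n_open, n_total, r, tau, dt, rng):
    """Advance n_total independent 2-state channels by one time step dt.

    Assumed rates: closed -> open at r / tau, open -> closed at (1 - r) / tau,
    so the stationary open probability of each channel is exactly r.
    """
    p_open = r / tau * dt
    p_close = (1.0 - r) / tau * dt
    opened = sum(rng.random() < p_open for _ in range(n_total - n_open))
    closed = sum(rng.random() < p_close for _ in range(n_open))
    return n_open + opened - closed


def mean_open_count(ca, n_sk=15, tau=6.3, dt=0.05, t_end=2000.0, seed=0):
    """Time-averaged summed open count of two independent channel populations."""
    rng = random.Random(seed)
    r = hill(ca)
    pops = [0, 0]  # open channels in each of the two populations
    steps = int(t_end / dt)
    burn_in = steps // 2  # discard the transient from the all-closed start
    total = 0
    for i in range(steps):
        pops = [step_population(n, n_sk, r, tau, dt, rng) for n in pops]
        if i >= burn_in:
            total += sum(pops)
    return total / (steps - burn_in)


# At Ca equal to the half-activation value, r(Ca) = 0.5.
m = mean_open_count(0.333)
```

      At a fixed calcium concentration the summed open count fluctuates around 2·n_sk·r(Ca), which plays the role of the factor of ~2 multiplying r(Ca) in the deterministic model; the fluctuations around that mean are what the stochastic variant adds.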

      As the reviewer can see in the figure below, we do not find a major difference in the simulations of 3 protocols. Thus, we argue that adding a stochastic version of the SK channels in our current study would not fundamentally alter our main conclusions.

      Figure Legend: a comparison using Tigaret et al. 2016 1Pre2Post10 and 1Pre2Post50 protocols, and 900 at 50 Hz protocol from Dudek and Bear 1992 (100 repetitions) between the model with the deterministic SK channel (original model - blue), and the modified model including the stochastic SK channel (stochastic SK - red). Deterministic vs stochastic SK channel does not significantly modify the model’s behaviour.

      To explain our rationale for using a deterministic version of the SK channel, we provide this sentence in the Methods when describing the SK channel model: “Due to a lack of single-channel recordings of SK channels, and a lack of published stochastic models of SK channels, we modelled SK channels deterministically. In tests we found that this assumption had only a negligible impact on the outcomes of plasticity protocols (data not shown)” (page 40).

      2) Many of the model parameters have been set to values previously estimated from synaptic physiology and biochemistry experiments. However, a significant number of important parameter values have been tuned to reproduce the plasticity experiments targeted in this study. As such, it needs to be explained which of the plasticity outcomes have been reproduced because the parameters are chosen to do so. A clarification would have helped to substantiate the authors' conclusions.

      Most parameters were set with values previously defined by experimental work. We referred to these publications where necessary throughout the Methods and Tables in our original manuscript. For the few free parameters that were adjusted, we now provide additional information wherever necessary for the Tables concerned.

      ● In the legend of Table 4 (neuron electrical properties), we explain which parameters are different from values obtained from the literature to fit experimental data (Golding et al. 2001; Buchanan et al. 2007).

      ● Parameters for the sodium and potassium conductance (Table 5) are labelled as generic since they are intentionally set to produce the BaP dynamics we have shown in the paper.

      ● Table 6 has no free parameters.

      ● Table 7 caption now includes a description saying ’Note that the buffer concentration, calcium diffusion coefficient, calcium diffusion time constant and calcium permeability were considered free parameters to adjust the calcium dynamics’.

      ● In Table 8 we had originally pointed out how we adapted the GluN2B rates from a published GluN2A model (Popescu et al. 2004; and Iacobucci and Popescu 2018). We now describe this adaptation in the Table 8 legend. In this Table, we now also better explain how we adjusted the NMDAr model to reflect the ratio between GluN2B and GluN2A, fitted from Sinclair et al. 2016; and the NMDAr conductance depending on calcium fitted from Maki and Popescu 2014.

      ● In Table 9 caption we now explain how the GABAr number and conductance were modified to fit GABAr currents as in Figures 15 b and e. The relevant parameters are indicated in the table.

      ● In Table 10 caption we now state the number of VGCCs per subtype that we used as a free parameter to reproduce the calcium dynamics (Figure 12).

      3) Adding experimental testing of model predictions, for example, that firing variability can alter the rules of plasticity, in the sense that it is possible to add noise to cause LTP for protocols that did not otherwise induce plasticity would be needed to increase confidence in the presented modelling results.

      We agree that it would be interesting in the future to test the many model predictions suggested in this work with biological experiments. This would however require a lot of work and will be the subject of further studies.

      Reviewer #3 (Public Review):

      This manuscript presents and analyzes a novel calcium-dependent model of synaptic plasticity combining both presynaptic and postsynaptic mechanisms, with the goal of reproducing a very broad set of available experimental studies of the induction of long-term potentiation (LTP) vs. long-term depression (LTD) in a single excitatory mammalian synapse in the hippocampus. The stated objective is to develop a model that is more comprehensive than the often-used simplified phenomenological models, but at the same time to avoid biochemical modeling of the complex molecular pathways involved in LTP and LTD, retaining only its most critical elements. The key part of this approach is the proposed "geometric readout" principle, which allows to predict the induction of LTP vs. LTD by examining the concentration time course of the two enzymes known to be critical for this process, namely (1) the Ca2+/calmodulin-bound calcineurin phosphatase (CaN), and (2) the Ca2+/calmodulin-bound protein kinase (CaMKII). This "geometric readout" approach bypasses the modeling of downstream pathways, implicitly assuming that no further biochemical information is required to determine whether LTP or LTD (or no synaptic change) will arise from a given stimulation protocol. Therefore, it is assumed that the modeling of downstream biochemical targets of CaN and CaMKII can be avoided without sacrificing the predictive power of the model. Finally, the authors propose a simplified phenomenological Markov chain model to show that such "geometric readout" can be implemented mechanistically and dynamically, at least in principle.
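
      To make the "geometric readout" idea concrete, the following is a deliberately simplified sketch, not the authors' readout. It assumes, hypothetically, that the LTP and LTD regions are polygons in the (CaMKII, CaN) plane, that the outcome is decided by comparing the total time the trajectory dwells in each region, and that a minimum dwell time is required for any change. The region shapes, the `min_time` threshold, and all function names are illustrative assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon (list of vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def dwell_times(trajectory, dt, regions):
    """Total time the sampled (CaMKII, CaN) trajectory spends in each named region."""
    times = {name: 0.0 for name in regions}
    for camkii, can in trajectory:
        for name, polygon in regions.items():
            if point_in_polygon(camkii, can, polygon):
                times[name] += dt
    return times


def classify(times, min_time=1.0):
    """Assumed decision rule: the region with the longer dwell time wins."""
    ltp, ltd = times["LTP"], times["LTD"]
    if max(ltp, ltd) < min_time:
        return "no change"
    return "LTP" if ltp > ltd else "LTD"


# Hypothetical square regions and a made-up trajectory sampled every 0.1 s.
regions = {
    "LTD": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "LTP": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
trajectory = [(3.0, 1.0)] * 30 + [(1.0, 1.0)] * 5
outcome = classify(dwell_times(trajectory, 0.1, regions))
```

      The point of the sketch is the one the reviewer highlights: the decision uses only the joint time course of the two enzyme variables, with no explicit model of their downstream targets.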

      Importantly, the presented model has fully stochastic elements, including stochastic gating of all channels, stochastic neurotransmitter release and stochastic implementation of all biochemical reactions, which allows to address the important question of the effect of intrinsic and external noise on the induction of LTP and LTD, which is studied in detail in this manuscript.

      Mathematically, this modeling approach resembles a continuous stochastic version of the "liquid computing" / "reservoir computing" approach: in this case the "hidden layer", or the reservoir, consists of the CaMKII and CaN concentration variables. In this approach, the parameters determining the dynamics of these intermediate ("hidden") variables are kept fixed (here, they are constrained by known biophysical studies), while the "readout" parameters are being trained to predict a target set of experimental observations.

      Strengths:

      1) This modeling effort is very ambitious in trying to match an extremely broad array of experimental studies of LTP/LTD induction, including the effect of several different pre- and post-synaptic spike sequence protocols, the effect of stimulation frequency, the sensitivity to extracellular Ca2+ and Mg2+ concentrations and temperature, the dependence of LTP/LTD induction on developmental state and age, and its noise dependence. The model is shown to match this large set of data quite well, in most cases.

      2) The choice for stochastic implementation of all parts of the model allows to fully explore the effects of intrinsic and extrinsic noise on the induction of LTP/LTD. This is very important and commendable, since regular noise-less spike firing induction protocols are not very realistic, and not very relevant physiologically.

      3) The modeling of the main players in the biochemical pathways involved in LTP/LTD, namely CaMKII and CaN, aims at sufficient biological realism, and as noted above, is fully stochastic, while other elements in the process are modeled phenomenologically to simplify the model and reveal more clearly the main mechanism underlying the LTP/LTD decision switch.

      4) There are several experimentally verifiable predictions that are proposed based on an in-depth analysis of the model behavior.

      We thank the reviewer for pointing out these strengths.

      Weaknesses:

      1) The stated explicit goal of this work is the construction of a model with an intermediate level of detail, as compared to simplified "one-dimensional" calcium-based phenomenological models on the one hand, and comprehensive biochemical pathway models on the other hand. However, the presented model comes across as extremely detailed nonetheless. Moreover, some of these details appear to be avoidable and not critical to this work. For instance, the treatment of presynaptic neurotransmitter release is both overly detailed and not sufficiently realistic: namely, the extracellular Ca2+ concentration directly affects vesicle release probability but has no effect on the presynaptic calcium concentration. I believe that the number of parameters and the complexity in the presynaptic model could be reduced without affecting the key features and findings of this work.

      This point is largely answered in Essential Revisions point 4 where we argue the choices we made for the presynaptic model. We acknowledge, however, that in this current version, we did not incorporate all biophysical components, such as the modulation of presynaptic calcium concentration with external calcium variations and multivesicular release. The calcium-dependence of presynaptic release, as modeled currently, is however fitted in Figure 8e against data from Hardingham et al. 2006 and Tigaret et al. 2016. These current limitations could be addressed in a next version of our presynaptic model where we also plan to incorporate age and temperature influence.

      2) The main hypotheses and assumptions underlying this work need to be stated more explicitly, to clarify the main conclusions and goals of this modeling work. For instance, following much prior work, the presented model assumes that a compartment-based (not spatially-resolved) model of calcium-triggered processes is sufficient to reproduce all known properties of LTP and LTD induction and that neither spatially-resolved elements nor calcium-independent processes are required to predict the observed synaptic change. This could be stated more explicitly. It could also be clarified that the principal assumption underlying the proposed "geometric readout" mechanisms is that all information determining the induction of LTP vs. LTD is contained in the time-dependent spine-averaged Ca2+/calmodulin-bound CaN and CaMKII concentrations, and that no extra elements are required. Further, since both CaN and CaMKII concentrations are uniquely determined by the time course of postsynaptic Ca2+ concentration, the model implicitly assumes that the LTP/LTD induction depends solely on spine-averaged Ca2+ concentration time course, as in many prior simplified models. This should be stated explicitly to clarify the nature of the presented model.

      We thank the reviewer for the suggestions on how to clarify the main hypotheses and assumptions of our work. We slightly modified the sentences provided by the reviewer and added them to the main text (page 2, line 82 and page 19, line 593).

      3) In the Discussion, the authors appear to be very careful in framing their work as a conceptual new approach in modeling LTP/LTD, rather than a final definitive model: for instance, they explicitly discuss the possibility of extending the "geometric readout" approach to more than two time-dependent variables, and comment on the potential non-uniqueness of key model parameters. However, this makes it hard to judge whether the presented concrete predictions on LTP/LTD induction are simply intended as illustrations of the presented approach, or whether the authors strongly expect these predictions to hold. The level of confidence in the concrete model predictions should be clarified in the Discussion. If this confidence level is low, that would call into question the very goal of such a modeling approach.

      These are very good questions. Let us first comment on parameter uniqueness. We believe, as in E. Marder’s work on ion channel expression in neurons, that the synapse can adapt its internal parameters (protein numbers, transition rates, etc.) to produce a given functional behaviour. As a by-product, the parameters associated with a given behaviour are non-unique. Additionally, since our model reproduces 9 published experimental outcomes with a single set of parameters, it represents a functioning synapse whose adjusted parameters yield the expected behaviours. Thus, by extrapolation, our confidence in the further predictions is high. We modified sentences in the Discussion section to argue this point (page 21, line 707).

      Let us now comment on increasing the complexity. We strove to design a plasticity readout that is as simple as possible while still providing a functioning synapse. Given our success in reproducing 9 published experimental outcomes with a single set of parameters, adding more complexity would be akin to overfitting.

      4) The authors presented a simplified mechanistic dynamical Markov chain process to prove that the "geometric readout" step is implementable as a dynamical process, at least in principle. However, a more realistic biochemical implementation of the proposed "region indicator" variables may be complex and not guaranteed to be robust to noise. While the authors acknowledge and touch upon some of these issues in their discussion, it is important that the authors will prove in future work that the "geometric readout" is implementable as a biochemical reaction network. Barring such implementation, one must be extra careful when claiming advantages of this approach as compared to modeling work that attempts to reconstruct the entire biochemical pathways of LTP/LTD induction.

      We acknowledge this issue and agree this would be an interesting subject for future work.

    1. Author Response

      Reviewer #1 (Public Review):

      1) Comment: To determine the effect of diseased monocytes on retinal health, light-injured mouse retinas were injected with monocytes isolated from AMD patients (Figure 1 - figure supplement 1). This resulted in a reduction in photoreceptor number and ERG b-wave amplitude. However, the light-injured control eye was injected with PBS only, so no cells were present. The reasoning for using this control was not provided. The appropriate injection control would include monocytes isolated from non-AMD patients. This control should be performed side-by-side with cells from AMD patients.

      We thank the reviewer for this important comment. The purpose of the current study was to identify the macrophage subtype that may be associated with cell death in aAMD. We have previously reported that macrophages from AMD patients demonstrate a different phenotype compared with those from healthy patients in the rodent model for laser-induced CNV (Hagbi-Levi S et al, 2016). Per the reviewer's comment, we have performed additional experiments to assess the effect of monocytes from healthy controls in the photic retinal injury model. Results showed that monocytes from AMD and healthy patients exert different impacts on the retina in this rodent model for aAMD. Interestingly, we found that monocytes from healthy patients were more neurotoxic to photoreceptors compared with monocytes from AMD patients. These results are included in the revised ms. as Figure 1- figure supplement 1H. A possible explanation for these findings is discussed in lines 179-190 of the revised manuscript. This finding reinforces the idea that the use of monocytes from AMD patients in the experiments is required to obtain a comprehensive understanding of their involvement in the progression of the disease.

      2) Comment: The authors hypothesize, from the experiments presented in Figure 1 - figure supplement 1, that the injected monocytes generated macrophages in the retina, which were responsible for the observed neurotoxicity (Lines 143-145). However, no direct evidence was presented. This idea should be tested in vivo. This could be done by injecting tracer-labeled human AMD-derived monocytes into light-injured mouse retinas. If the authors' hypothesis is true, collected retinas should contain tracer-labeled cells that express macrophage markers. Tracer-labeled M2a macrophage cells should be present since subsequent experiments identify this subclass as being associated with retinal cell death.

      Thank you for this important comment. To address the reviewer's comment, retinal sections from mice exposed to photic retinal injury and injected with Dio-tracer-labelled monocytes were stained with two M2a macrophage markers, CD206 (mannose receptor) and VEGF (Kadomoto, S et al, 2022; Jayasingam SD et al, 2019). Interestingly, we found co-localization of Dio-tracer staining (representing the injected human macrophages) with CD206 and VEGF markers in monocytes localized in different retinal layers, but not in monocytes remaining in the vitreous cavity. These data indicate that M2a markers are expressed during the polarization of monocytes into the M2a phenotype, which is maintained only upon entry into the retinal tissue. These results were included in Figure 1- figure supplement 1K-S and discussed in the revised manuscript in lines 179-182.

      3) Comment: Photoreceptor number and b-wave amplitudes were measured in light-injured retinas injected with one of four macrophage cell types generated from human AMD-derived monocytes. The authors conclude that only injection of M2a cells reduced photoreceptor number and b-wave amplitudes (Figure 1C, E). This may be true, but it is difficult for the reader to make a conclusion (especially in Fig. 1E) due to the large error bars and five different traces overlapping each other. To make these results easier to interpret, graph control cells with only one experimental sample (cell type) at a time.

      Thank you for this comment. Per the reviewer's comment, the graphs were modified in the revised ms. (Figure 1, panels H-K).

      4) Comment: Most injected macrophages were located in the vitreous. In the case of M2a cells, the authors note that "several of the cells migrated across the retinal layers reaching the subretinal space" (Lines 167,168). One possible explanation for why M0, M1, and M2c macrophages did not induce retinal degeneration is that they did not migrate to the subretinal space and around the optic nerve head. Supplementary figures should be added to demonstrate that this is not the case.

      Thank you for this comment. To address the reviewer's comment, we compared the migration patterns of the different macrophage phenotypes following intravitreal injection in mice exposed to photic injury. Our results indicated that M0, M1 and M2c macrophages, similarly to M2a macrophages, migrated to the subretinal space and around the optic nerve. Thus, the neurotoxic effect of M2a is not explained by their capacity to infiltrate the retinal tissues. These results were included in Figure 1- figure supplement 2 E-H of the revised manuscript. These results are supported by our ex-vivo experiments, showing that co-culture of M2a macrophages with retinal explants was associated with increased photoreceptor cell death compared to M1 macrophages. The results are presented and discussed in the revised manuscript in lines 200-203.

      5) Comment: Figure 1 - figure supplement 2: Panel A, B cells were stained with CD206 to demonstrate the presence of M2a macrophages (panel B). The authors conclude that panel A contains M1 and panel B contains M2a cells. The lack of CD206 expression illustrates that panel A cells are not M2a macrophages but do not demonstrate they are M1 macrophages. A control using an M1 cell marker is necessary to show that panel A cells are M1 and M1 cells are not detected in M2a cultures.

      Thank you for this comment. We validated the phenotype of each macrophage subtype by qPCR (Figure 1 panel A). To further address the reviewer's comment, we performed additional immunocytochemistry for M1 macrophages using an anti-CD80 antibody, which is used as an M1 macrophage marker (Bertani FR et al, 2017). The staining confirmed the identity of the M1 macrophages. These new results were included in Figure 1 - figure supplement 2A and are discussed in lines 168-170.

      6) Comment: Ex vivo, apoptotic photoreceptor and RPE cells are observed when cultured with M2a macrophages (Figure 2). Do injected M2a cells also induce apoptosis of RPE cells in vivo? This is important to establish that retinal explants are a good model for in vivo experiments.

      Thank you for this comment. To address it, we assessed RPE apoptosis (using TUNEL, Caspase 3 staining and the RPE65 marker) after M2a cell delivery in the in-vivo photic injury model. We could not detect an apoptotic signal in the RPE layer 7 days after photic injury and therefore could not evaluate the effect of M2a macrophages on RPE cells in-vivo (see Author response image 1). One possible explanation is that RPE cells that have undergone apoptosis are rapidly removed from the damaged tissue and, unlike photoreceptors, are no longer detectable. Furthermore, a study of the impact of bright light on RPE cells in-vivo showed that although RPE cells underwent structural and chemical modifications after photic injury, no TUNEL signal was detected because RPE cells die by necrosis rather than apoptosis (Jaadane I et al, 2017). Other studies confirmed that blue light induces RPE necrosis (Song W et al, 2022; Mohamed A et al, 2022). Taken together, it seems that the ex-vivo retinal explant and the in-vivo photic injury model both simulate the mechanism of retinal cell death. However, the ex-vivo model allows the direct impact of M2a macrophages on the retina to be established in a non-inflammatory context.

      Author response image 1.

      7) Comment: Reactive oxygen species (ROS) production was measured to determine if M2a cell-mediated neurotoxicity was due to oxidative stress. It is concluded that a ROS increase is partly responsible (Line 218). The data do not support this conclusion. ROS was detected in cultured M2a macrophages. More importantly, however, there was no increase in oxidative damage in vivo. The in vivo and cell culture results contradict each other so no conclusion can be made. The lack of in vivo confirmation weakens the argument that ROS drives M2a neurotoxicity. Text suggesting a role for ROS in neurotoxicity should be appropriately edited (Lines including 218, 244, 401,406,481).

      Thank you for this comment. The manuscript was revised according to the reviewer suggestion (Lines 250-256).

      8) Comment: The authors ask if the photoreceptor cell death is cytokine-mediated. Multiple cytokines were enriched in M2a-conditioned media. Of particular interest were CCR1 ligands MPIF1 and MCP4. The implication is that these two ligands mediate the M2a macrophages to photoreceptor cell death through CCR1. However, there is no attempt to show that either MPIF1 or MCP4 are present in vivo, or are sufficient to induce the retinal response observed. This could be demonstrated by injection of MPIF1 or MCP4. Evidence that either ligand phenocopies M2a macrophage injection would be direct evidence that CCR1 ligands activate the retinal response. Furthermore, co-injection with BX174 should block the effect of these ligands if they work through CCR1.

      Thank you for this comment. The identification of CCR1 ligand expression in M2a-polarized macrophages directed our decision to study CCR1 in the context of atrophic AMD. We do not claim that these specific CCR1 ligands are sufficient to activate CCR1 and cause retinal injury; the mechanism is likely more complex. Nevertheless, to address the reviewer's comment, we performed the suggested experiments. Mice were exposed to photic injury and immediately injected in one eye with MPIF1, MCP4, or a combination of both, and in the other eye with PBS as vehicle. Intravitreal cytokine delivery was repeated two days later (in line with the half-life of these cytokines), and ERGs were recorded two days after the last injection. Injection of cytokines at a concentration of 300 ng per eye did not exacerbate photoreceptor death. The same experiment was then repeated with two higher concentrations of cytokine, 1.2 µg/eye and 2 µg/eye, but no differences were observed between the cytokine-treated eyes and the vehicle-treated eyes. Based on previous studies reporting the physiological concentrations of different cytokines in the eyes of healthy and unhealthy individuals, and on experiments in which different cytokines were injected into rodent eyes (Estevao C et al, 2021; Zeng Y et al, 2019; Roybal CN et al, 2018; Mugisho OO et al, 2018), the cytokine concentrations used in our experiment are in the range in which an effect on the retina is expected.

      It is likely that a synergistic effect of M2a-secreted proteins in a particular microenvironment is necessary to increase the level of retinal damage (Bartee E et al, 2013). It is also likely that in the photic retinal injury model there is upregulation of cytokines that may mask the additional delivery of exogenous cytokines. A comprehensive understanding of the complex interactions of these cytokines during retinal degeneration is beyond the scope of the current manuscript, which is not focused on identifying ligand-induced CCR1 activation and its consequences. Additionally, we suggest that, due to cytokine redundancy (Nicola NA, 1994), demonstrating that MPIF1 or MCP4 can increase photoreceptor death is not required to prove CCR1 receptor involvement.

    1. Author Response

      Reviewer #1 (Public Review):

      In this work George et al. describe RatInABox, a software system for generating surrogate locomotion trajectories and neural data to simulate the effects of a rodent moving about an arena. This work is aimed at researchers that study rodent navigation and its neural machinery.

      Strengths:

      • The software contains several helpful features. It has the ability to import existing movement traces and interpolate data with lower sampling rates. It allows varying the degree to which rodents stay near the walls of the arena. It appears to be able to simulate place cells, grid cells, and some other features.

      • The architecture seems fine and the code is in a language that will be accessible to many labs.

      • There is convincing validation of velocity statistics. There are examples shown of position data, which seem to generally match between data and simulation.

      Weaknesses:

      • There is little analysis of position statistics. I am not sure this is needed, but the software might end up more powerful and the paper higher impact if some position analysis was done. Based on the traces shown, it seems possible that some additional parameters might be needed to simulate position/occupancy traces whose statistics match the data.

      Thank you for this suggestion. We have added a new panel to figure 2 showing a histogram of the time the agent spends at positions of increasing distance from the nearest wall. As you can see, RatInABox is a good fit to the real locomotion data: positions very near the wall are under-explored (in the real data this is probably because whiskers and physical body size block positions very close to the wall) and positions just away from but close to the wall are slightly over explored (an effect known as thigmotaxis, already discussed in the manuscript).

      As you correctly suspected, fitting this warranted a new parameter controlling the strength of the wall repulsion, which we call “wall_repel_strength”. The motion model hasn’t changed mathematically: we took a parameter that was originally a fixed constant equal to 1 and unavailable to the user, and made it a variable that can be changed (see methods section 6.1.3 for maths). The curves fit best when wall_repel_strength ~= 2. The methods and parameters table have been updated accordingly. See Fig. 2e.

      • The overall impact of this work is somewhat limited. It is not completely clear how many labs might use this, or have a need for it. The introduction could have provided more specificity about examples of past work that would have been better done with this tool.

      At the point of publication we, like you, didn’t know to what extent there would be a market for this toolkit; however, we were pleased to find that there was. In its initial 11 months RatInABox has accumulated a growing, global user base, over 120 stars on GitHub and north of 17,000 downloads through PyPI. We have accumulated a list of testimonials[5] from users of the package vouching for its utility and ease of use, four of which are abridged below. These testimonials come from a diverse group of 9 researchers spanning 6 countries across 4 continents and varying career stages, from pre-doctoral researchers with little computational exposure to tenured PIs. Finally, the community not only uses RatInABox but is also building it: at the time of writing RatInABox has logged 20 GitHub “Issues” and 28 “pull requests” from external users (i.e. those who aren’t authors on this manuscript), ranging from small discussions and bug-fixes to significant new features, demos and wrappers.

      Abridged testimonials:

      ● “As a medical graduate from Pakistan with little computational background…I found RatInABox to be a great learning and teaching tool, particularly for those who are underprivileged and new to computational neuroscience.” - Muhammad Kaleem, King Edward Medical University, Pakistan

      ● “RatInABox has been critical to the progress of my postdoctoral work. I believe it has the strong potential to become a cornerstone tool for realistic behavioural and neuronal modelling” - Dr. Colleen Gillon, Imperial College London, UK

      ● “As a student studying mathematics at the University of Ghana, I would recommend RatInABox to anyone looking to learn or teach concepts in computational neuroscience.” - Kojo Nketia, University of Ghana, Ghana

      ● “RatInABox has established a new foundation and common space for advances in cognitive mapping research.” - Dr. Quinn Lee, McGill, Canada

      The introduction continues to include the following sentence highlighting examples of past work which relied on generating artificial movement and/or neural data and which, by implication, could have been done better (or at least accelerated and standardised) using our toolbox.

      “Indeed, many past[13, 14, 15] and recent[16, 17, 18, 19, 6, 20, 21] models have relied on artificially generated movement trajectories and neural data.”

      • Presentation: Some discussion of case studies in Introduction might address the above point on impact. It would be useful to have more discussion of how general the software is, and why the current feature set was chosen. For example, how well does RatInABox deal with environments of arbitrary shape? T-mazes? It might help illustrate the tool's generality to move some of the examples in supplementary figure to main text - or just summarize them in a main text figure/panel.

      Thank you for this question. Since the initial submission of this manuscript RatInABox has been upgraded and environments have become substantially more “general”. Environments can now be of arbitrary shape (including T-mazes), boundaries can be curved, they can contain holes and can also contain objects (0-dimensional points which act as visual cues). A few examples are showcased in the updated figure 1 panel e.

      To further illustrate the tool's generality beyond the structure of the environment, we continue to summarise the reinforcement learning example (Fig. 3e) and the neural decoding example in section 3.1. In addition, we have added three new panels to figure 3 highlighting new features which, we hope you will agree, make RatInABox significantly more powerful and general, and which satisfy your suggestion of clarifying utility and generality directly in the manuscript.

      On the topic of generality, we wrote the manuscript so as to demonstrate the rich variety of ways RatInABox can be used without providing an exhaustive list of potential applications. For example, RatInABox can be used to study neural decoding and it can be used to study reinforcement learning, but not because it was purpose-built with these use-cases in mind. Rather, it is because it contains a set of core tools designed to support spatial navigation and neural representations in general. For this reason we would rather keep the demonstrative examples as supplements and implement your suggestion by drawing further attention to the large array of tutorials and demos provided on the GitHub repository, modifying the final paragraph of section 3.1 to read:

      “Additional tutorials, not described here but available online, demonstrate how RatInABox can be used to model splitter cells, conjunctive grid cells, biologically plausible path integration, successor features, deep actor-critic RL, whisker cells and more. Despite including these examples we stress that they are not exhaustive. RatInABox provides the framework and primitive classes/functions from which highly advanced simulations such as these can be built.”

      Reviewer #3 (Public Review):

      George et al. present a convincing new Python toolbox that allows researchers to generate synthetic behavior and neural data specifically focusing on hippocampal functional cell types (place cells, grid cells, boundary vector cells, head direction cells). This is highly useful for theory-driven research where synthetic benchmarks should be used. Beyond just navigation, it can be highly useful for novel tool development that requires jointly modeling behavior and neural data. The code is well organized and written and it was easy for us to test.

      We have a few constructive points that they might want to consider.

      • Right now the code only supports X,Y movements, but Z is also critical and opens new questions in 3D coding of space (such as grid cells in bats, etc). Many animals effectively navigate in 2D, as a whole, but they certainly make a large number of 3D head movements, and modeling this will become increasingly important and the authors should consider how to support this.

      Agents now have a dedicated head direction variable (before head direction was just assumed to be the normalised velocity vector). By default this just smoothes and normalises the velocity but, in theory, could be accessed and used to model more complex head direction dynamics. This is described in the updated methods section.

      In general, we try to tread a careful line. For example, we embrace certain aspects of physical and biological realism (e.g. modelling environments as continuous, or fitting motion to real behaviour) and avoid others (such as the biophysics/biochemistry of individual neurons, or the mechanical complexities of joint/muscle modelling). It is hard to decide where to draw the line, but we have a few guiding principles:

      1. RatInABox is most well suited for normative modelling and neuroAI-style probing questions at the level of behaviour and representations. We consciously avoid unnecessary complexities that do not directly contribute to these domains.

      2. Compute: To best accelerate research we think the package should remain fast and lightweight. Certain features are ignored if computational cost outweighs their benefit.

      3. Users: If, and as, users require complexities e.g. 3D head movements, we will consider adding them to the code base.

      For now we believe proper 3D motion is out of scope for RatInABox. Calculating motion near walls is already surprisingly complex and doing this in 3D would be challenging. Furthermore, all cell classes would need to be rewritten too. This would be a large undertaking, probably requiring rewriting the package from scratch or making a new package (RatInABox3D, or BatInABox?) altogether, which we don’t intend to undertake right now. One option: if users really need 3D trajectory data, they could quite straightforwardly simulate a 2D Environment (X,Y) and a 1D Environment (Z) independently. With this method (X,Y) and (Z) motion would be entirely independent, which is of course unrealistic but, depending on the use case, may well be sufficient.

      Alternatively, since, as you said, many agents effectively navigate in 2D but show complex 3D head and other body movements, RatInABox could interface with and feed data downstream to other software (for example MuJoCo[11]) which specialises in joint/muscle modelling. This would be a very legitimate use-case for RatInABox.

      We’ve flagged all of these assumptions and limitations in a new body of text added to the discussion:

      “Our package is not the first to model neural data[37, 38, 39] or spatial behaviour[40, 41], yet it distinguishes itself by integrating these two aspects within a unified, lightweight framework. The modelling approach employed by RatInABox involves certain assumptions:

      1. It does not engage in the detailed exploration of biophysical[37, 39] or biochemical[38] aspects of neural modelling, nor does it delve into the mechanical intricacies of joint and muscle modelling[40, 41]. While these elements are crucial in specific scenarios, they demand substantial computational resources and become less pertinent in studies focused on higher-level questions about behaviour and neural representations.

      2. A focus of our package is modelling experimental paradigms commonly used to study spatially modulated neural activity and behaviour in rodents. Consequently, environments are currently restricted to being two-dimensional and planar, precluding the exploration of three-dimensional settings. However, in principle, these limitations can be relaxed in the future.

      3. RatInABox avoids the oversimplifications commonly found in discrete modelling, predominant in reinforcement learning[22, 23], which we believe impede its relevance to neuroscience.

      4. Currently, inputs from different sensory modalities, such as vision or olfaction, are not explicitly considered. Instead, sensory input is represented implicitly through efficient allocentric or egocentric representations. If necessary, one could use the RatInABox API in conjunction with a third-party computer graphics engine to circumvent this limitation.

      5. Finally, focus has been given to generating synthetic data from steady-state systems. Hence, by default, agents and neurons do not explicitly include learning, plasticity or adaptation. Nevertheless we have shown that a minimal set of features such as parameterised function-approximator neurons and policy control enable a variety of experience-driven changes in behaviour and cell responses[42, 43] to be modelled within the framework.”

      • What about other environments that are not "Boxes" as in the name - can the environment only be a Box, what about a circular environment? Or Bat flight? This also has implications for the velocity of the agent, etc. What are the parameters for the motion model to simulate a bat, which likely has a higher velocity than a rat?

      Thank you for this question. Since the initial submission of this manuscript RatInABox has been upgraded and environments have become substantially more “general”. Environments can now be of arbitrary shape (including circular), boundaries can be curved, they can contain holes and can also contain objects (0-dimensional points which act as visual cues). A few examples are showcased in the updated figure 1 panel e.

      Whilst we don’t know the exact parameters for bat flight users could fairly straightforwardly figure these out themselves and set them using the motion parameters as shown in the table below. We would guess that bats have a higher average speed (speed_mean) and a longer decoherence time due to increased inertia (speed_coherence_time), so the following code might roughly simulate a bat flying around in a 10 x 10 m environment. Author response image 1 shows all Agent parameters which can be set to vary the random motion model.
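
      As a rough illustration of the guess above, the parameters could be collected as follows. This is a sketch under stated assumptions: `speed_mean` and `speed_coherence_time` are the parameter names given in the text, `scale` and `aspect` are assumed to be the Environment keys that set the arena size, and every numerical value is a hypothetical placeholder rather than a measured bat statistic.

```python
# Illustrative only: the numbers below are guesses, not fitted to bat data.
bat_environment_params = {
    "scale": 10.0,  # assumed key: side length in metres, giving a 10 x 10 m arena
    "aspect": 1.0,  # assumed key: width-to-height ratio (1.0 = square)
}

bat_agent_params = {
    "speed_mean": 5.0,            # m/s, hypothetical; much faster than a rat
    "speed_coherence_time": 3.0,  # s, hypothetical; longer because of inertia
}

# With RatInABox installed, these dictionaries would be passed along the lines of:
#   Env = Environment(params=bat_environment_params)
#   Ag = Agent(Env, params=bat_agent_params)
#   Ag.update()  # advance the random motion model by one time step
```

      Sensible values would have to be chosen by the user, for example by fitting to recorded flight trajectories.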

      Author response image 1.

      • Semi-related, the name suggests limitations: why Rat? Why not Agent? (But it's a personal choice)

      We came up with the name “RatInABox” when we developed this software to study hippocampal representations of an artificial rat moving around a closed 2D world (a box). We also fitted the random motion model to open-field exploration data from rats. You’re right that it is not limited to rodents but for better or for worse it’s probably too late for a rebrand!

      • A future extension (or now) could be the ability to interface with common trajectory estimation tools; for example, taking in the (X, Y, (Z), time) outputs of animal pose estimation tools (like DeepLabCut or such) would also allow experimentalists to generate neural synthetic data from other sources of real-behavior.

      This is actually already possible via our “Agent.import_trajectory()” method. Users can pass an array of time stamps and an array of positions into the Agent class, which will be loaded and smoothly interpolated, as shown in Fig. 3a and as demonstrated in two new papers[9,10] which used RatInABox by loading in behavioural trajectories.

      • What if a place cell is not encoding place but is influenced by reward or encodes a more abstract concept? Should a PlaceCell class inherit from an AbstractPlaceCell class, which could be used for encoding more conceptual spaces? How could their tool support this?

      In fact PlaceCells already inherit from a more abstract class (Neurons) which contains basic infrastructure for initialisation, saving data, and plotting data etc. We prefer the solution that users can write their own cell classes which inherit from Neurons (or PlaceCells if they wish). Then, users need only write a new get_state() method which can be as simple or as complicated as they like. Here are two examples we’ve already made which can be found on the GitHub:

      Author response image 2.

      Phase precession: PhasePrecessingPlaceCells(PlaceCells)[12] inherit from PlaceCells and modulate their firing rate by multiplying it by a phase dependent factor causing them to “phase precess”.

      Splitter cells: Perhaps users wish to model PlaceCells that are modulated by the recent history of the Agent, for example which arm of a figure-8 maze it just came down. This is observed in hippocampal “splitter cells”. In this demo[1] SplitterCells(PlaceCells) inherit from PlaceCells and modulate their firing rate according to which arm was last travelled along.
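
      The inheritance pattern described above can be sketched in a self-contained way. The classes below are simplified stand-ins written for illustration, not the RatInABox classes themselves: the constructor arguments, the `get_state(pos, rewarded)` signature and the `RewardModulatedPlaceCells` class are all hypothetical. The only idea taken from the text is that a subclass overrides get_state() to change how firing rates are computed.

```python
import math


class Neurons:
    """Stand-in for a generic base class (not the RatInABox implementation)."""

    def __init__(self, n):
        self.n = n

    def get_state(self, pos):
        raise NotImplementedError("Subclasses define how firing rates are computed")


class PlaceCells(Neurons):
    """Toy Gaussian place cells with fixed centres and a common width (metres)."""

    def __init__(self, centres, width=0.2):
        super().__init__(len(centres))
        self.centres = centres
        self.width = width

    def get_state(self, pos):
        rates = []
        for cx, cy in self.centres:
            squared_distance = (pos[0] - cx) ** 2 + (pos[1] - cy) ** 2
            rates.append(math.exp(-squared_distance / (2 * self.width ** 2)))
        return rates


class RewardModulatedPlaceCells(PlaceCells):
    """Hypothetical subclass: scales the place-cell rates by a reward-dependent gain."""

    def __init__(self, centres, width=0.2, reward_gain=2.0):
        super().__init__(centres, width)
        self.reward_gain = reward_gain

    def get_state(self, pos, rewarded=False):
        base_rates = super().get_state(pos)
        gain = self.reward_gain if rewarded else 1.0
        return [gain * rate for rate in base_rates]


cells = RewardModulatedPlaceCells(centres=[(0.5, 0.5), (0.2, 0.8)])
baseline = cells.get_state((0.5, 0.5))
boosted = cells.get_state((0.5, 0.5), rewarded=True)
```

      Only get_state() differs between the parent and the child class, so whatever the base class provides is reused unchanged.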

      • This is a bit odd in the Discussion: "If there is a small contribution you would like to make, please open a pull request. If there is a larger contribution you are considering, please contact the corresponding author3" This should be left to the repo contribution guide, which ideally shows people how to contribute and your expectations (code formatting guide, how to use git, etc). Also this can be very off-putting to new contributors: what is small? What is big? We suggest using more inclusive language.

      We’ve removed this line and left it to the GitHub repository to describe how contributions can be made.

      • Could you expand on the run time for BoundaryVectorCells, namely, for how long of an exploration period? We found it was on the order of 1 min to simulate 30 min of exploration (which is of course fast, but mentioning relative times would be useful).

      Absolutely. How long it takes to simulate BoundaryVectorCells will depend on the discretisation timestep and how many neurons you simulate. Assuming you used the default values (dt = 0.1, n = 10) then the motion model should dominate compute time. This is evident from our analysis in Figure 3f which shows that the update time for n = 100 BVCs is on par with the update time for the random motion model, therefore for only n = 10 BVCs, the motion model should dominate compute time.

      So how long should this take? Fig. 3f shows the motion model takes ~10^-3 s per update. One hour of simulation is 3600/dt = 36,000 updates, which would therefore take about 36,000 × 10^-3 s = 36 seconds. So your estimate of 1 minute seems to be in the right ballpark and consistent with the data we show in the paper.
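
      The estimate above can be reproduced with a few lines of arithmetic. The function name is ours, and the figure of 10^-3 s per update is the approximate value quoted from Fig. 3f, so the result is an order-of-magnitude estimate rather than a benchmark.

```python
def wall_clock_seconds(simulated_seconds, dt, seconds_per_update):
    """Estimate compute time as (number of updates) x (time per update)."""
    n_updates = simulated_seconds / dt
    return n_updates * seconds_per_update


# One hour of simulated exploration with dt = 0.1 s and ~1e-3 s per update.
one_hour = wall_clock_seconds(3600, dt=0.1, seconds_per_update=1e-3)

# The 30 minutes of exploration mentioned by the reviewer, same settings.
half_hour = wall_clock_seconds(1800, dt=0.1, seconds_per_update=1e-3)

print(f"1 h simulated: about {one_hour:.0f} s of compute")
print(f"30 min simulated: about {half_hour:.0f} s of compute")
```

      This counts the motion model only; cell updates, plotting and data saving add to the total.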

      Interestingly this corroborates the results in a new inset panel where we calculated the total time for cell and motion model updates for a PlaceCell population of increasing size (from n = 10 to 1,000,000 cells). It shows that the motion model dominates compute time up to approximately n = 1000 PlaceCells (for BoundaryVectorCells it’s probably closer to n = 100) beyond which cell updates dominate and the time scales linearly.

      These are useful and non-trivial insights as they tell us that the RatInABox neuron models are quite efficient relative to the RatInABox random motion model (something we hope to optimise further down the line). We’ve added the following sentence to the results:

      “Our testing (Fig. 3f, inset) reveals that the combined time for updating the motion model and a population of PlaceCells scales sublinearly O(1) for small populations n < 1000 where updating the random motion model dominates compute time, and linearly for large populations n > 1000. PlaceCells, BoundaryVectorCells and the Agent motion model update times will be additionally affected by the number of walls/barriers in the Environment. 1D simulations are significantly quicker than 2D simulations due to the reduced computational load of the 1D geometry.”
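
      One simple way to picture this scaling is an additive cost model: a fixed cost for the motion model plus a cost proportional to the number of cells. The sketch below is our own illustration, not the benchmark code. The motion-model cost of 10^-3 s is the approximate value quoted from Fig. 3f, while the per-cell cost of 10^-6 s is an assumption chosen only so that the crossover falls near n = 1000.

```python
def update_time(n_cells, t_motion=1e-3, t_per_cell=1e-6):
    """Toy model of one update: fixed motion-model cost plus a per-cell cost."""
    return t_motion + n_cells * t_per_cell


# The two terms are equal where n_cells = t_motion / t_per_cell.
crossover = 1e-3 / 1e-6

for n in (10, 1_000, 1_000_000):
    print(f"n = {n:>9,}: {update_time(n):.6f} s per update")
```

      Well below the crossover the total is nearly flat in n, and well above it the total grows in proportion to n.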

      And this sentence to section 2:

      “RatInABox is fundamentally continuous in space and time. Position and velocity are never discretised but are instead stored as continuous values and used to determine cell activity online, as exploration occurs. This differs from other models which are either discrete (e.g. “gridworld” or Markov decision processes) or approximate continuous rate maps using a cached list of rates precalculated on a discretised grid of locations. Modelling time and space continuously more accurately reflects real-world physics, making simulations smooth and amenable to fast or dynamic neural processes which are not well accommodated by discretised motion simulators. Despite this, RatInABox is still fast; to simulate 100 PlaceCells for 10 minutes of random 2D motion (dt = 0.1 s) it takes about 2 seconds on a consumer grade CPU laptop (or 7 seconds for BoundaryVectorCells).”

      Whilst this would be very interesting it would likely represent quite a significant edit, requiring rewriting of almost all the geometry-handling code. We’re happy to consider changes like these according to (i) how simple they will be to implement, (ii) how disruptive they will be to the existing API, (iii) how many users would benefit from the change. If many users of the package request this we will consider ways to support it.

      • In general, the set of default parameters might want to be included in the main text (vs in the supplement).

      We also considered this but decided to leave them in the methods for now. The exact values of these parameters are subject to change in future versions of the software. Also, we’d prefer the main text to provide a low-detail, high-level description of the software and the methods to provide a place for keen readers to dive into the mathematical and coding specifics.

      • It still says you can only simulate 4 velocity or head directions, which might be limiting.

      Thanks for catching this. This constraint has been relaxed. Users can now simulate an arbitrary number of head direction cells with arbitrary tuning directions and tuning widths. The methods have been adjusted to reflect this (see section 6.3.4).

      • The code license should be mentioned in the Methods.

      We have added the following section to the methods:

      6.6 License: RatInABox is currently distributed under an MIT License, meaning users are permitted to use, copy, modify, merge, publish, distribute, sublicense and sell copies of the software.

    1. Author Response

      Reviewer #2 (Public Review):

      “The authors wish to relate beat-to-beat coordination of cardiac function (in this case as measured left ventricular pressure) to the activity of sympathetic neuron spiking within the stellate ganglion. A strength includes the challenging measurements from multiple stellate neuron activity over long durations in situ in the anesthetized pig.”

      We thank the reviewer for their feedback.

      “A major and overriding weakness is the founding assumption of the analysis that the underlying sympathetic neurons are all cardiac functioning in nature - an assumption that is overwhelmingly unlikely given the evidence in other species including humans that stellate postganglionic neurons are functionally mixed and have functional noncardiac targets. The use of broad and poorly explained/defined terms such as "event entropy" is difficult to follow and find meaning from. The manuscript is filled with difficult-to-follow text like "The neural specificity metric (Sudarshan et al., 2021). Fig. 5", is used to evaluate the degree to which neural activity is biased toward control target states taken here as LVP" and "The neural specificity is reduced from a multivariate signal to a univariate signal by computing the Shannon entropy at each timestamp of the mapped neural specificity metric". The figures are difficult to understand with axes that often bear no units or are quite compressed obscuring the intuitive meaning of the data trends. Fundamentally, cardiac pressure cycles with each heartbeat - roughly once per second - yet fluctuations in the depicted mean spike rate data with changes perhaps ten times in 25 minutes. Such plots are disorienting and difficult to associate with cardiac or neuron "functioning". Only 17 of the 38 references are not self-citations and thus the cited literature represents a narrow view of sympathetic regulation and sympathetic/stellate ganglion knowledge. Much of the foundations are self-professed in earlier publications by the present group and assumed to be accepted.”

      “Fundamentally, cardiac pressure cycles with each heartbeat - roughly once per second - yet fluctuations in the depicted mean spike rate data with changes perhaps ten times in 25 minutes. Such plots are disorienting and difficult to associate with cardiac or neuron "functioning”

      We would like to clarify this point with the understanding that the reviewer is referring to the time axis in Figure 3C in the manuscript.

      The coactivity matrix constructed in Figure 3C computes the cross-correlation of the sliding mean/std spike activities for different pairs of channels. The mean spiking activities across channels, as the reviewer correctly pointed out, do indeed have a weak autocorrelation at the period of the heart rate. This weak correlation at the heart rate period, possibly due to slow firing rates, was seen across all channels in both control and HF animals. However, the cause of a large proportion of channel pairs exhibiting high coactivity, termed cofluctuation (shown as red tracings in Fig 3D), is not known and cannot be directly associated with cardiac functioning.
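
      To make the idea of a coactivity matrix concrete, a minimal sketch follows. It is not the analysis pipeline used in the manuscript: it assumes the simplest reading of the description, namely smoothing each channel with a sliding mean and then taking the Pearson correlation between every pair of smoothed channels over the whole record. The function names, the window length and the toy data are all hypothetical.

```python
import math


def sliding_mean(values, window):
    """Mean of each length-`window` stretch of `values` (no padding)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def coactivity_matrix(channels, window):
    """Pairwise correlation between the sliding-mean activity of each channel."""
    smoothed = [sliding_mean(channel, window) for channel in channels]
    n = len(smoothed)
    return [[pearson(smoothed[i], smoothed[j]) for j in range(n)]
            for i in range(n)]


# Toy spike counts per time bin: channels 0 and 1 rise together, channel 2 falls.
spike_counts = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [1, 2, 3, 4, 5, 6, 7, 8],
    [7, 6, 5, 4, 3, 2, 1, 0],
]
C = coactivity_matrix(spike_counts, window=3)
```

      A time-resolved version would recompute this matrix within successive windows, which is presumably closer to what Figure 3C shows.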

      The cofluctuation was also found to be aperiodic in nature approximating a lognormal distribution (Fig R1) with the HF animals containing heavy tails outside their confidence intervals (Fig R1B). The event rate computed from the cofluctuation time series (shown as blue steps in Fig 3E) for an animal is a measure of spatial coherence among SG neural populations and was developed as a novel metric to be used in future studies.

      Figure R1: Cofluctuation histograms (calculated from mean or standard deviation of sliding spike rate, referred to as Cofluctuation_MEAN and Cofluctuation_STD, respectively) and log-normal fits for each animal group. μ_FIT and σ_FIT are the respective mean and standard deviation (STD) of the fitted distribution, used for the 68% confidence interval bounds. A-B: Control animals have narrower bounds and represent a better fit to the log-normal distribution. C-D: Heart failure (HF) animals display more heavily skewed distributions that indicate heavy tails.
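
      As a rough illustration of how such bounds can be obtained, the sketch below fits a log-normal by taking logarithms and reports the interval exp(μ − σ) to exp(μ + σ), which covers about 68% of a log-normal distribution. The function name and the toy sample are hypothetical, and the assumption that the 68% bounds were computed this way is ours; the fitting procedure actually used for Figure R1 is not specified here.

```python
import math
import statistics


def lognormal_fit_bounds(values):
    """Fit a log-normal via the logs of the data.

    Returns (mu, sigma, lower, upper), where mu and sigma are the mean and
    sample standard deviation of log(values), and lower/upper are
    exp(mu - sigma) and exp(mu + sigma), i.e. roughly a 68% interval.
    Assumption: this mirrors, but is not confirmed to be, the Figure R1 fit.
    """
    if any(v <= 0 for v in values):
        raise ValueError("log-normal fit requires strictly positive values")
    logs = [math.log(v) for v in values]
    mu = statistics.fmean(logs)
    sigma = statistics.stdev(logs)
    return mu, sigma, math.exp(mu - sigma), math.exp(mu + sigma)


# Toy data (hypothetical): values whose logs are exactly -1, 0 and 1.
sample = [math.exp(x) for x in (-1.0, 0.0, 1.0)]
mu, sigma, lower, upper = lognormal_fit_bounds(sample)
```

      Values falling far outside the upper bound would then be candidates for the heavy tails described for the HF group.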

      “Only 17 of the 38 references are not self-citations and thus the cited literature represents a narrow view of sympathetic regulation and sympathetic/stellate ganglion knowledge. Much of the foundations are self-professed in earlier publications by the present group and assumed to be accepted.”

      We thank the reviewer for pointing this out. We have added four additional citations that include methods such as neural population bias and spatiotemporal dynamics linkages to control targets in the neuroscience literature. We have added these citations to page 15 in the “Conclusion” section of the manuscript. In addition, carrying out these cardiac nervous system experiments is our group’s specialty, and we are not aware of another group collecting multi-electrode array data from the cardiac nervous system and studying the population dynamics of cardiac neurons. Hence we build on our previous findings. The most relevant literature (not necessarily related to the cardiac nervous system) can be found in the neuroscience references we cited, which contain applications of neural population recordings for different brain areas, mainly in the neuropsychiatry domain, to understand disease dynamics.

      “For the expert or even the uninformed reader, this report is broadly confused and confusing. The premises (beat to beat or whether LVP conveys cardiac function) are poorly supported. The conclusions are quite vague.”

      Thank you for your feedback. To make the manuscript easier to follow, we moved all mathematical details to the supplementary material, re-wrote the abstract and the conclusion from scratch, and split the methods figures that may have been confusing. We believe that our novel metrics, event rate and entropy, capture non-trivial linkages between heart failure status, cardiac neural activity (spike activity), and peripheral activity (LVP). We have supported our metrics with 17 animals with state-of-the-art surgical techniques and technology, and reported our results with detailed statistical analyses. Our manuscript essentially highlights that event rate and entropy metrics are significantly different between control animals and animals with heart failure. These metrics can be used to design future studies with these animal models to provide a more quantitative approach to heart disease, rather than binary (yes or no) descriptions.

      “Discussion: The abstract does not convey conclusions from the findings and contains broad statements such as "signatures based on linking neuronal population cofluctuation and examine differences in "neural specificity" of SG network" that have little substantive value or conclusion for the reader. Fundamentally what does the title "signatures based on linking neuronal population" cofluctuation mean to the reader? What changed in HF?”

      Thank you for this comment. We completely revised the abstract and conclusion as detailed in our response to Essential Revision #1. Event rate is a metric related to neural activity recordings and entropy is related to the association of neural activity to left ventricular blood pressure. Our findings suggest that both the neural population activity itself (event rate) and its ability to pay attention to cycles of left ventricular pressure (neural specificity) are significantly higher in animals with HF compared to controls.

    1. Author Response

      Reviewer #1 (Public Review):

      (1.1) The work by Porciello and colleagues provides scientific evidence that the acidic content of the stomach covaries with the experienced level of disgust and fear evoked by disgusting videos. The workings of the inside of the gut during cognitive or emotional processes have remained elusive due to the invasiveness of the methods to study it. The major strength of the paper is the use of the non-invasive smart pill technology, which senses changes in pH, pressure and temperature as it travels through the gut, allowing authors to investigate how different emotions induced with validated video clips modulate the state of the gut. The experimental paradigm used to evoke distinct emotions was also successful, as participants reported the expected emotions after each emotion block. While the reported evidence is correlational in nature, I believe these results open up new avenues for studying brain-body interactions during emotions in cognitive neuroscience, and future causal manipulations will shed more insight on this phenomenon. Indeed, this is the first study to provide evidence for a link between gastric acidity and emotional experience beyond single patient studies, and it has major implications for the advancement of our understanding of disorders with psycho-somatic influences, such as stress and its influence on gastritis.

      1.1 First of all, we want to thank Reviewer#1 for his cogent comments and for highlighting that our findings may inspire future research on brain-body interactions. We took all the remarks into the highest consideration and changed the manuscript accordingly.

      (1.2) As for the limitations, little insight is provided on the mechanisms, time scales, and inter-individual variability of the link between gastric pH and emotional induction. Since this is a novel phenomenon, it would be important to further validate and characterize this finding. On this line, one of the most well known influences of disgust on the gut is tachygastria, the acceleration of the gastric rhythm. It would be important to understand how acid secretion induced by disgusting films is related to tachygastria, but the authors only examine the influence of disgusting films on the normogastric frequency range.

      1.2 We are aware that at the moment our data are mainly descriptive and do not provide a clear picture of the causal mechanisms. However, to deal with this outstanding issue we added a new series of analyses.

      Most of the data on gastric activity come from analysis of the normogastric band. However, information about the EGG tachygastric rhythm in humans is potentially of great importance. To deal with the reviewer’s comment and considering the previously published literature, we re-examined the EGG data focusing on the tachygastric rhythm. The methodology remained consistent with the process described for normogastric peak extraction, but this time we extracted the peak in the tachygastric band, specifically 0.067 to 0.167 Hz (i.e., 4–10 cpm). The ANOVA performed over the tachygastric cycle revealed a significant main effect of the type of video clip (F(4, 112) = 2.907, p = 0.025, Eta2 (partial) = 0.09). However, the Bonferroni corrected post hoc tests did not show any significant difference between the different types of emotional video clips and the neutral condition. The sole significant comparison was observed between participants viewing happy and fearful video clips, indicating that participants’ tachygastric cycles were faster when exposed to happy rather than fearful video clips (p = 0.038). For a visual representation of the outcomes, please see Fig S6.
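
      The band limits and the peak extraction described above can be sketched as follows. This is a minimal illustration only: the helper names and the toy spectrum are hypothetical, and it assumes a power spectrum has already been computed; it is not the pipeline used in the study.

```python
# Tachygastric band quoted in the response: 0.067-0.167 Hz (about 4-10 cpm).
TACHYGASTRIC_BAND_HZ = (0.067, 0.167)


def hz_to_cpm(freq_hz):
    """Convert a frequency in Hz to cycles per minute."""
    return freq_hz * 60.0


def peak_in_band(freqs_hz, power, band_hz):
    """Return the frequency (Hz) with the largest power inside band_hz."""
    low, high = band_hz
    in_band = [(p, f) for f, p in zip(freqs_hz, power) if low <= f <= high]
    if not in_band:
        raise ValueError("no spectral bins fall inside the requested band")
    # max() compares power first, so the bin with the highest power wins.
    return max(in_band)[1]


# Hypothetical spectrum: the largest overall power lies outside the band,
# so restricting the search to the band changes which peak is selected.
freqs = [0.03, 0.05, 0.08, 0.10, 0.15, 0.20]
power = [9.0, 4.0, 2.0, 5.0, 1.0, 7.0]
peak_hz = peak_in_band(freqs, power, TACHYGASTRIC_BAND_HZ)
peak_cpm = hz_to_cpm(peak_hz)
```

      Restricting the search to the band is what distinguishes the tachygastric peak from the normogastric peak extracted from the same recording.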

      We revised the main text (Page 17, lines: 472-482) to include this analysis. The revised text now reads as follows:

      “Finally, we explored whether normogastric and/or tachygastric cycle changed in response to specific emotional experience. After checking that normogastric and tachygastric peak frequencies were normally distributed (all ps > 0.05), we ran two separate ANOVAs on the individual peak frequencies in the normogastric and tachygastric range. Each analysis had the type of video clip as within-subjects factor. The ANOVA performed on the normogastric rhythm was not significant (F(4, 44) = 1.037, p = 0.399) suggesting that the gastric rhythm did not change while participants observed the different emotional video clips. In contrast, the ANOVA performed on the tachygastric rhythm did show a significant main effect (F(4, 112) = 2.907, p = 0.025, Eta2 (partial) = 0.09). However, the only comparison that survived the Bonferroni correction was the one between happy and fearful video clips, namely participants’ tachygastric cycle was faster when they observed happy vs fearful video clips (p = 0.038) see Fig. S6 for a graphical representation of the results.”

      To deal with the Reviewer’s comment, we also correlated the average pH value with the corresponding frequency of the tachygastric cycle recorded during the disgusting, happy and fearful video clips, namely the emotions associated with changes in pH. The only significant correlation was the one found during the disgusting video clips (r= 0.435; p= 0.023, all the other rs ≤ 0.351, all the other ps ≥ 0.073). Contrary to what we expected, the correlation was positive, suggesting that when participants were exposed to disgusting video clips, the less acidic the pH, the higher the frequency of the tachygastric cycle. However, we know from our pill data that disgusting video clips are associated with more acidic values, and from the literature (not replicated by us) with a faster gastric rhythm. Since the EGG analysis did not provide strong support for a relationship between the gastric rhythm and the emotional experience, we believe that additional evidence is needed to clarify the relationship between pH and gastric rhythm.

      (1.3) Additionally, only one channel of the electrogastrogram (EGG) was used to measure the gastric rhythm, and no information is provided on the quality of the recordings. With only one channel of EGG, it is often impossible to identify the gastric rhythm as the position of the stomach varies from person to person, yielding inaccurate estimates of the frequency of the gastric rhythm.

      1.3 We agree with Reviewer 1 on this point. We acknowledge the potential limitation associated with one-channel EGG recording in our study. To deal with this remark, in a separate (ongoing) study (N = 25 participants) we recorded the electrogastrogram following the methodology outlined by Wolpert et al., 2020, published in Psychophysiology. Thus, in order to study the EGG in association with the emotional experience, we used a bipolar 4-channel montage while participants observed the same emotional video clips used in our current study (see picture below for the montage set-up).

      Author response image 1 shows the 4-channel EGG bipolar recording montage reproducing the one proposed by Wolpert et al., 2020.

      Author response image 1.

      Then, we extracted the gastric cycle in both the normogastric and the tachygastric bands.

      After checking that the data were normally distributed (Kolmogorov-Smirnov ds > 0.10; ps> .20), for the gastric cycle extracted in the normogastric band we ran a repeated measures ANOVA with the type of video clip as the only within-subjects factor, with 5 levels (i.e. the five types of video clip: Disgusting, Fearful, Happy, Neutral, and Sad). The ANOVA shows that the gastric cycle recorded during the different video clips did not differ (F (4,96) = 0.39; p= 0.81), see the plot in Author response image 2.

      Author response image 2.

      Gastric cycle (normogastric band) recorded via multiple-channels electrogastrogram (EGG) during the emotional experience. The plot shows the gastric cycle extracted in the normogastric band while participants were observing the five categories of the video clips (i.e. those inducing disgust, fear, happiness, sadness and, as control, a neutral state).

      We also extracted the gastric cycle in the tachygastric band. Because the distribution of the data was not normal in one condition (Kolmogorov-Smirnov ds > 0.27; p < 0.05), we ran a Friedman ANOVA to compare the gastric cycle during the different emotional experiences. The Friedman ANOVA was not statistically significant (χ2 (4) = 2.88; p = 0.58), suggesting that, like the gastric cycle extracted in the normogastric band, the one extracted in the tachygastric band was not clearly associated with the investigated emotional states, see Author response image 3.
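
      For readers less familiar with the Friedman test, the following minimal sketch shows how the statistic is built from within-participant ranks, and how a p value follows from the chi-square distribution with k − 1 degrees of freedom (4 for five video clip types). The function names and the toy data are hypothetical, the statistic omits the correction for ties, and the closed-form tail probability only holds for even degrees of freedom; a statistics package would normally be used instead.

```python
import math


def average_ranks(row):
    """Rank the values in one participant's row (1 = smallest), averaging ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_statistic(data):
    """Friedman chi-square for data[participant][condition], no tie correction."""
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Toy data (hypothetical): three participants, same ordering of three conditions.
toy = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5], [0.2, 0.4, 0.9]]
toy_stat = friedman_statistic(toy)

# Tail probability for the statistic quoted above, with df = 5 - 1 = 4.
p_reported = chi2_sf_even_df(2.88, 4)
```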

      Author response image 3.

      Gastric cycle (tachygastric band) recorded via multiple-channels electrogastrogram (EGG) during the emotional experience. The plot above shows the gastric cycle extracted in the tachygastric band while participants were observing the five categories of the video clips (i.e. those inducing disgust, fear, happiness, sadness and as control a neutral state).

      Results from this control study seem to suggest that the non-significant effect on the gastric cycle was probably not due to our use of a one-channel EGG montage, at least as far as the gastric cycle extracted from the normogastric band is concerned.

      Regarding the tachygastric frequency associated with the emotional experience, the results from the multi-channel EGG recording point in the same direction as the normogastric ones: the frequency of the gastric cycle recorded during the emotional video clips never differed from the control condition.

      The only significant difference that we found in our 1-channel EGG study was the one between the happy and the fearful video clips (see Fig. S6 in the supplementary materials and above). Specifically, we found that happy video clips were associated with a higher gastric frequency compared to the fearful ones. However, we did not replicate these findings in our multi-channel EGG study.

      Although suggestive, this evidence is not conclusive. Indeed, we are aware that a final word on the results of our multi-channel study can be said only when a larger sample is obtained.

      (1.4) Finally, I believe that the results do not show evidence in favor of the discrete nature of emotions theory as they claim in the discussion. Authors chose to use stimuli inducing discrete emotions, and only asked subjective reports of these same discrete emotions, so these results shed no light on whether emotions are represented discretely vs continuously in the brain.

      1.4 We revised the discussion in order to better describe our results and toned down the interpretation that the present findings directly support the discrete nature of emotions, as suggested by this Reviewer.

      Pages 21–22, lines 622–631, now read as follows:

      “Overall, and in line with theoretical and empirical evidence (Damasio, 1999; Harrison et al., 2010; James, 1994, Lettieri et al., 2019; Stephens et al., 2010), our findings may suggest that specific patterns of subjective, behavioural, and physiological measures are linked to unique emotional states...We acknowledge that our results, although novel, are restricted to a sample of male participants, and more importantly they need to be replicated. We also acknowledge that future studies should better investigate the mechanisms underlying the role of the pH in the emergence of specific emotion. For instance, pharmacologically manipulating stomach pH during emotional induction, not only for basic emotions but also for exploring complex emotions such as moral disgust (Rozin et al., 2009), would enable researchers to generalize these findings and examine the directionality of this relationship.”

      Reviewer #2 (Public Review):

      To measure the role of gastric state in emotion, the authors used an ingestible smart pill to measure pH, pressure, and temperature in the gastrointestinal tract (stomach, small bowel, large bowel) while participants watched videos that induced disgust, fear, happiness, sadness, or a control (neutral). The study has a number of strengths, including the novelty of the measurement (very few studies have ever measured these gut properties during emotion processing) and the apparent robustness of their main finding (that during disgusting video clips, participants who experienced more feelings of disgust (and to a lesser degree which might not survive more stringent multiple comparison correction, fear) had more acidic stomach measurements, while participants who experienced more happiness during the disgusting video clips had a less acidic (more basic) stomach pH). Although the study is correlational (which all discussion should carefully reflect) and is restricted to a moderately-sized, homogenous sample, the results support their general conclusion that stomach pH is related to emotion experience during disgust induction. There may be additional analyses to conduct in order for the authors to claim this effect is specific to the stomach. Nevertheless, this work is likely to have a large impact on the field, which currently tends to rely on noninvasive measures of gastric activity such as electrogastrography (which the authors also collect for comparison); the authors' minimally-invasive approach yields new and useful measurements of gastric state. These new measures could have relevance beyond emotion processing in understanding the role of gut pH (and perhaps temperature and pressure) in cognitive processes (e.g. interoception) as well as mental and physical health.

      We are very grateful to Reviewer#2 for skilfully managing the paper and highlighting its strengths, particularly the innovative measurement approach and the potential implications these findings might offer for future research into the impact of gastric signals on emotional experiences and potentially on many other higher-order cognitive functions. Additionally, we would like to thank her for the highly valuable feedback. We have incorporated all the comments into the revised manuscript, aiming to enhance its quality.

      Reviewer #3 (Public Review):

      This study used novel ingestible pills to measure pH and other gastric signals, and related these measures to self-report ratings of emotions induced by video clips. The main finding was that when participants viewed videos of disgust, there was an association between gastric pH and feelings of disgust and fear, and (in the opposite direction) happiness. These findings may be the first to relate objective measures of gastric physiology to emotional experience. The methods open up many new questions that can be addressed by future studies and are thus likely to have an impact on the field.

      We also thank Reviewer#3 very much for the accurate reading of our manuscript, for highlighting the strengths of our study, and for providing valuable feedback. Below is a point-by-point response to all the comments raised by this Reviewer. We have incorporated their comments, and we hope they are satisfied by the new version of the manuscript.

      (3.1) My main concern is with the reliability of the results. The study associates many measures (pH, temperature, pressure, EGG) in stomach, small bowel, and large bowel with multiple emotion ratings. This amounts to many statistical tests. Only one of these measures (pH in the stomach) shows a significant effect. Furthermore, the key findings, as displayed in Figure 4 do not look particularly convincing. Perhaps this is a display issue, but the relations between stomach pH and Vas ratings of disgust, fear, and happiness were not apparent from the scatter plot and may be influenced by outliers (e.g., happiness).

      3.1 We thank Reviewer#3 for raising this issue, which was also raised by Reviewer#1 and #2; see the replies above. As reported above, we worked on the data analysis in order to provide more evidence supporting our claim, i.e. that pH plays a role in the emotional experience of disgust, happiness and fear. We modified Figure 4 (now 5), as also requested by Reviewers 1 and 2, and we hope that it is now clearer. We included a new analysis in which we used all the datapoints recorded from the ingestible device and performed a mixed models analysis with pH as the dependent variable, the type of video clip and the number of datapoints (‘Time’) as fixed factors, and the by-subject intercepts as random effects. This analysis not only supported the results of the original one but provided evidence for a causal role of the emotional induction on the pH of the stomach. Results of this analysis are described in point 1.7 in the response to Reviewer#1; the results of the new analysis and the revised version of the main figure can be found as tracked changes in the main text of the manuscript (Pages 15–16, lines 408–439) and are copied below.

      “To explore how the emotional induction could modulate the pH of the stomach and how the length of the exposure to that specific emotional induction could also play a role in modulating pH variations, we ran an additional model, Model 2. This model included all the pH datapoints registered using the Smartpill as dependent variable, the type of video clip and the number of the datapoints (“Time”) as fixed effects, and the by-subject intercepts as random effects (see Supplementary information for a detailed description of the model). Model 2 had a marginal R2 = 0.014 and a conditional R2 = 0.79. Visual inspection of the plots did reveal some small deviations from homoscedasticity; visual inspection of the residuals did not show important deviations from normality. As for collinearity (tested by means of vif function of car package), all independent variables had a (GVIF^(1/(2*Df)))^2 < 10.

      Type III analysis of variance of Model 2 showed a statistically significant main effect of the Time (F = 20.237, p < 0.001, Eta2 < 0.01) suggesting that independently from the type of video clip observed, the stomach pH significantly decreased as a function of the time of exposure to the induction. A significant main effect of the type of video clip was also found (F = 22.242, p < 0.001, Eta2 = 0.01) suggesting that pH of the stomach changes when participants experienced different types of emotions. In particular, post hoc analysis revealed that pH was more acidic when participants observed disgusting compared to fearful (t= -11.417; p < 0.001), happy (t= -15.510; p < 0.001) and neutral (t= -3.598; p = 0.003) video clips.

      Also, pH was more acidic when participants observed fearful compared to happy (t= -4.064; p < 0.001), and less acidic compared to neutral (t= 7.835; p < 0.001) and sad scenarios (t= 9.743; p < 0.001). Finally, pH was less acidic when participants observed happy compared to neutral (t= 11.923; p < 0.001) and sad video clips (t= 13.806; p < 0.001), see Fig.6, left panel. Interestingly, also the double interaction Time X Type of video clip was significant (F = 3.250, p = 0.0113, Eta2 < 0.01) suggesting that the time of the exposure to the induction differentially influenced the pH of the stomach depending on the type of the observed video clip. Simple slope analysis showed that while pH did not change over time when observing disgusting (t= -1.2691; p = 0.2045) and happy (t= 0.4466; p = 0.6552) clips, it did significantly decrease over time when observing fearful (t= -4.4212; p < 0.001), sad (t= -2.0487; p = 0.0405) and neutral video clips (t= -2.7956; p = 0.0052), see Fig.6, right panel."
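
      To help interpret the marginal and conditional R2 quoted for Model 2, the sketch below shows the usual variance-partition definition for a random-intercept model: the marginal value is the share of variance attributed to the fixed effects alone, and the conditional value is the share attributed to fixed plus random effects. The function name is hypothetical, and the variance components are made-up numbers chosen only so that the ratios match the quoted 0.014 and 0.79; they are not the study's estimates, and the software used for Model 2 may compute these quantities slightly differently.

```python
def marginal_conditional_r2(var_fixed, var_random, var_residual):
    """Return (marginal R2, conditional R2) from variance components.

    var_fixed:    variance of the fixed-effect predictions
    var_random:   variance of the by-subject random intercepts
    var_residual: residual variance
    """
    total = var_fixed + var_random + var_residual
    if total <= 0:
        raise ValueError("total variance must be positive")
    marginal = var_fixed / total
    conditional = (var_fixed + var_random) / total
    return marginal, conditional


# Hypothetical components (NOT the study's estimates), scaled to sum to 1.
r2_marginal, r2_conditional = marginal_conditional_r2(0.014, 0.776, 0.210)
```

      Under this reading, a small marginal R2 alongside a large conditional R2 indicates that most of the explained variance in pH reflects stable differences between participants rather than the fixed effects of video clip type and Time.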

      We believe that the new evidence reported provides support of our claims and we hope that the reviewer agrees with us. However, as we also mentioned in the paper, we are aware that replications are needed and we are already working on this.

    1. Author Response

      Reviewer #1 (Public Review):

      This study provided evidence to interpret and understand the aging and developmental processes in children. The main strength of the study is that it measures a set of biological age measures and a set of developmental measures, thus providing multi-faceted evidence to explain the associations between aging and development in children. The main weakness of this study is that how to measure and test the aging hypotheses of "a buildup of biological capital model" and "wear and tear" is not well-explained. Why could the observed associations between biological age measures and developmental measures support the aforementioned aging theories?

      Thank you. On reflection we agree that how to test the aging hypotheses of "a buildup of biological capital model" and "wear and tear" is not well-explained in the manuscript. We have addressed this issue in the point-by-point responses below:

      1) Abstract - conclusion: The aging hypothesis of "a buildup of biological capital model" and "wear and tear" were mentioned in the conclusion without an explanation of these theories in the previous section. Readers who are not experts in the field may not understand the logic.

      We have replaced these phrases in the abstract with the following interpretation, which we hope will be more readily understood:

      “Patterns of associations suggested that accelerated immunometabolic age may be beneficial for some aspects of child development while accelerated DNA methylation age and telomere attrition may reflect early detrimental aspects of biological ageing, apparent even in children.”

      2) Result - Biological age marker performance: the correlation between transcriptome age and chronological age is very strong (r =0.94). I am afraid that very little age-independent information could be captured by the transcriptome age. Is it possible to down-regulate the age dependency of the transcriptome age in the training process?

      Thank you for this important comment: We agree the high accuracy of this clock may in fact reduce its relevance as a biological age marker and note that this is a concern generally in the field. We have explored the possibility of using a less accurate transcriptome age model as follows: Instead of elastic net modelling we tested using the lasso penalisation only, which will result in more parsimonious (sparse) models as less important features are dropped as the strength of the lambda parameter is increased. Plotting the correlation in the test set against number of features in models, as the lambda is sequentially increased, we can see (as shown in Author response image 1 by the blue line) that after the inclusion of around 200 features, the gain in accuracy becomes less steep.

      Author response image 1.

      We then tested the sensitivity of a model optimised for sparsity at the expense of some prediction accuracy, selected based on visual inspection (blue line, r in test set =0.87, number of features= 187) of the above plot, against developmental measures, compared to the most accurate model as presently included in the manuscript:

      Author response image 2.

      We find that, across all outcomes tested, the less accurate model, based on only the most important features, does not provide an improvement in sensitivity to developmental outcomes compared to the currently used model.

      We therefore prefer to keep the more accurate model in this study, especially as it is consistent with the methodology used in the Horvath and Immunometabolic age models and generally in the field. Otherwise, it is not obvious how the biological clock should be trained (especially for children, for whom mortality data are not available) without altering the whole approach of the study. We have acknowledged and discussed this issue on page 15.

      3) The study population comes from several cohorts, which might influence the results. How were the cohort effects controlled for in the analyses?

      The possible influence of cohort is a limitation of the study, which we have discussed on page 16. We did not include cohort as a predictor in any of the candidate biological clocks since this may reduce detection of some age-related features. Instead, we included a variable for cohort as a fixed effect in all analyses with risk factors and developmental outcomes and examined the performance of candidate biological clocks in predicting chronological age within each cohort. As a further check, we have added an additional sensitivity analysis (Figure 4-figure supplement 6), stratified by cohort, for the developmental outcomes that were significant in the main analysis. We find generally consistent effects across cohorts.

      4) Figure 3 only showed the number of p values. Can the author also provide the number of point estimates and 95% confidence intervals, perhaps in the supplemental table?

      This information was originally provided in supplemental table 5 (now Supplementary file 7), combined with the sensitivity analyses. To make this information easier to find, we have made this a stand-alone table (table 3). We now direct readers to this information within the caption of Figure 4 (previously figure 2).

      Reviewer #2 (Public Review):

      The study had an especially relevant aim for aging research and utilized various data types in an especially interesting human population. Multi-omics perspective adds great value to the work. The researchers aimed to evaluate how different indicators of biological age (BA) behave in children during their developmental stage. In the analysis, relationships between indicators of BA, health risk factors, and developmental factors were assessed in cross-sectional data comprising children aged 5-12 years. The manuscript is well-written and easy to follow. The methodology is good. The authors succeeded to reach the aim in most parts.

      In the study, previously known and unknown biological age indicators were used. Known indicators included telomere length and Horvath's epigenetic age. Unknown (novel) indicators, transcriptomic and immunometabolic clocks, were developed in the present study and they showed a strong correlation with calendar age in this population, also in the validation data set. Although the transcriptomic and immunometabolic clocks have the potential of being true indicators of biological age, they are still lacking scientific evidence of being such indicators in adults. That is, their associations with age-related diseases and mortality are yet to be shown. Thus, the major remark of the study relates to the phrasing: these novel transcriptomic and immunometabolic clocks should be presented as BA indicator candidates waiting for the needed evidence.

      Thank you for this important observation. However, we still find that “biological age indicator” is a useful umbrella term in this manuscript and there is not an obvious alternative. We therefore have added the following sentence on page 8, and highlighted the difference between the markers at key points in the abstract, introduction, results and discussion.

      “We note that since a common definition of markers of biological age is that they should be associated with age-related disease and mortality [69] these new clocks may only currently be considered “candidate” biological age markers. However, we have referred to both the established and candidate markers as biological age markers throughout to simplify presentation.”

    1. Author Response

      Reviewer #1 (Public Review)

      [...] One potential issue is that the high myelination signal is associated with the compartment in V2 (pale stripes) which was not functionally defined itself but by the absence of specific functional activations. No difference was reported between those stripes that were defined functionally. Other explanations for the differential pattern of a qMRI signals, e.g. ROI distribution for presumed pale stripes is not evenly distributed (more foveal), ROIs with low activations due to some other factor show higher myelin-related signals, cannot be excluded based on the analysis presented.

      Indeed, it would have been advantageous to directly functionally delineate pale stripes in V2. Since we were not able to achieve this by fMRI, we needed an indirect method to infer pale stripe contributions in the analysis. We also added a statement in the discussion section to emphasize this more (p. 9, lines 286–288).

      Furthermore, different myelination between thin and thick stripes was not tested, since we did not have a concrete hypothesis on this. Despite the conflicting findings of stronger myelination in dark or pale CO stripes in the literature, no histological study stated myelination differences between dark CO thin and thick stripes. Therefore, our primary interest and hypothesis lay in comparing the myelination of thin/thick and pale stripes using MRI.

      Thank you very much for this comment about potential other sources of differential qMRI parameter patterns. Indeed, based on the original analysis we could not exclude that the absence of functional activation around the foveal representation may have biased our analysis. We therefore added a supporting analysis, in which we excluded the region around the foveal representation from the analysis. The excluded cortical region was kept consistent between participants by excluding the same eccentricity range in all maps. We added more details in the results section of the revised manuscript (p. 8, lines 189–202). In Figure 5-Supplement 1 and Figure 5-Supplement 3, results from this supporting analysis are shown which reproduced the primary findings from the main analysis, particularly the relatively higher myelination of pale stripes.

      ROI definitions solely based on fMRI activation amplitude have additional limitations. However, we find it unlikely that a small fMRI effect size and low contrast-to-noise ratio (i.e. a stochastic cause of low statistical parameter values/“activation”) have impacted the results, since Figure 3 shows that we could achieve a high degree of reproducibility for each participant.

      We would note that the fact that we found consistent differences across MPM and MP2RAGE sessions makes some potential artifacts driving the differences unlikely. We also find it unlikely that systematic cerebral blood volume differences between stripes would have driven the results. A higher local blood volume would lead to increased BOLD responses but also to a higher R1 value due to the deoxy-hemoglobin induced relaxation, which is opposite to the observation of higher activity in the thick/thin stripes but lower R1 values.

      Further studies using other functional metrics (e.g. VASO, ASL etc.) may help us to even more clearly demonstrate specificity but were out of the scope of this already rather extensive study. Although we have added extensive further analyses in the revised manuscript such as controlling for foveal effects or registration performance, we did not see a possibility to fully exclude a systematic bias that might potentially be caused by unknown factors.

      Another theoretical and practical issue is the question of "ground truth" for the non-invasive qMRI measures, as the authors - as their starting point - roundly dismiss direct histological tissue studies as conflicting, rather than take a critical look at the merit of the conflicting study results and provide a best hypothesis. If so, they need to explain better how they calibrate their non-invasive MR measurements of myelin.

      We agree and have now further elaborated on the limits of specificity of the R1 and R2* signal as cortical myelin marker (p. 2, lines 68–88; p. 6, line 163; p. 8, line 216; p. 9, lines 257–260). However, we still think that it is important for the reader to appreciate the conflicting results in histological studies using staining methods for myelin, which adds to the study’s background.

      We did not intend to give the impression that MRI provides the missing ground-truth to adjudicate histological controversies, but that it provides an alternative and additional view on the open questions. We changed the introduction to better reflect the aspect that the study offers a unique view by providing myelination proxies and functional measures in the same individual, which allows for direct comparison and investigation of structure-function relationships (see p. 2, lines 68–70; p. 3, lines 93–95), which is not accessible to any other approach. Nevertheless, we would like to note that R1 has been well established as a myelin marker under particular conditions (Kirilina et al., 2020; Mancini et al., 2020; Lazari and Lipp, 2021). It has also been widely used for cortical myelin mapping across a variety of populations, systems and field strengths. We added this statement to the introduction (see p. 2, lines 82-85). We note that we excluded volunteers with pathologies or neurological disorders from the study and their mean age was about 28 years. Thus, we had conditions comparable to previous (validation) studies.

      Because of the contradictory findings of histological studies, we could not further finesse the hypothesis beyond our previous a priori hypothesis that we expected differences in the myelin sensitive MRI metrics between the thin/thick versus pale stripes. To improve the contextual understanding, we added a paragraph in the discussion section covering in more depth how the MRI results relate to known histological findings (see pp. 8–9, lines 216–240).

      While this paper makes an important contribution to the question of the association of specific myelination patterns defining the columnar architecture in V2, it is not entirely clear whether the authors can fully resolve it with the data presented.

      Indeed, we agree that non-invasive aggregate measures, such as the R1 metrics, offer limited specificity which precludes a fully conclusive inference about cortical myelination. We have further emphasized this on several occasions in the text (see p. 2, lines 68–88; p. 6, line 163; p. 8, line 216; p. 9, lines 257–260). Since the correspondence of cortical myelin levels and R1 (and other metrics) is an active area of research, we expect that the understanding, sensitivity and specificity of R1 to cortical myelination will further improve. We note that the use of qMRI is a substantial advance over weighted MRI typically used, which suffers from lack of specificity due to instrumental idiosyncrasies and varying measurement conditions.

      Reviewer #2 (Public Review)

      [...] Unfortunately, this particular study seems to fall into an unhappy middle ground in terms of the conclusions that can be drawn: the relaxometry measures lack the specificity to be considered "ground truth", while the authors claim that the literature lacks consensus regarding the structures that are being studied. The authors propose that their results resolve whether or not stripes differ in their patterns of myelination, but R1 lacks the specificity to do this. While myelin is a primary driver of relaxation times in cortex, relaxometry cannot be considered to be specific to myelin. It is possible that the small observed changes in R1 are driven by myelin, but they could also reflect other tissue constituents, particularly given the small observed effect sizes. If the literature was clear on the pattern of myelination across stripes, this study could confirm that R1 measurements are sensitive to and consistent with this pattern. But the authors present the work as resolving the question of how myelination differs between stripes, which over-reaches what is possible with this method. As it stands, the measured differences in R1 between functionally-defined cortical regions are interesting, but require further validation (e.g., using invasive myelin staining).

      We agree that we have inadvertently overstated the specificity of R1 on several occasions in the text. We therefore toned down the statements concerning the correspondence between R1 and myelin throughout the manuscript (e.g. see p. 2, lines 68–88; p. 6, line 163; p. 8, line 216; p. 9, lines 257–260).

      We also removed the phrase that gave the impression that MRI can conclusively resolve the conflicting results found in histological studies. In the Introduction, we changed the corresponding paragraph by emphasizing the alternative view, which can be obtained from MRI by the possibility to investigate structure-function relationships in the living human brain, which would not be possible by invasive myelin staining (see p. 2, lines 68–70; p. 3, lines 93–95).

      We acknowledge that – perhaps aside from electron microscopy – all common markers have shortcomings, which limit their specificity. For example, classic histology is not quantitative and has produced conflicting results. It also faces the fundamental issue that the composition of myelin varies significantly across the brain and within brain areas (e.g., its lipid composition (González de San Román et al., 2018)). Thus, we regard the different invasive/non-invasive measures as complementary. R1 adds to this arsenal of measures and can be acquired non-invasively. It has been shown to be a reliable myelin marker under certain circumstances. It follows the known myeloarchitecture patterns of the human brain, which was also checked for the data of the present study (see Figure 4 and Appendix 2). It is responsive to traumatic changes (Freund et al., 2019), development (Whitaker et al., 2016; Carey et al., 2018; Natu et al., 2019) and plasticity (Lazari et al., 2022). Since we studied healthy volunteers with no known pathologies that were sampled randomly from the population, we believe that the previous results generally apply and suggest sufficient specificity of the R1 marker. Of course, we cannot fully exclude bias due to unknown factors that have not been investigated/discovered by validation studies yet. However, in this case we expect that the systematic differences between stripe types would remain an important result most likely pointing to another interesting biological difference between stripes.

      While more research is needed to clarify the precise role of R1 for cortical myelin, we think that the meaningful determination of quantitative MR parameters within one cortical area is still interesting for the neuroscientific community.

      Moreover, the results make clear that R1 differences are not sufficiently strong to provide an independent measure of this structure (e.g., for segmentation of stripe). As such, one would still require fMRI to localise stripes, making it unclear what role R1 measures would play in future studies.

      Indeed, the small effect sizes observed in the present study still require a functional localization with fMRI. We expected small effect sizes using R1 and R2* due to the known small inter-areal or intra-cortical differences of MRI myelin markers. Therefore, this study aimed at a proof-of-concept investigating whether intra-areal R1 differences at the spatial scale of columnar structures can be detected using non-invasive MRI. Our study shows that these differences can be seen but currently not at the single voxel level. We anticipate that with further improvements in sequence development and scanner hardware, high-resolution R1 estimates with sufficient SNR can be acquired, making fMRI redundant (for this kind of investigation). Please see the reply to the next comment concerning the impact of using R1 in future studies.

      The Introduction concludes with the statement that "Whereas recent studies have explored cortical myelination ... using non-quantitative, weighted MR images... we showed for the first time myelination differences using MRI on a quantitative basis". As written, this sentence implies that others have demonstrated that simpler non-quantitative imaging can achieve the same aims as qMRI. Simply showing that a given method is able to achieve an aim would not be sufficient: the authors should demonstrate that this constitutes an important advance.

      Thank you for this comment. It goes to the heart of the concerns raised about specificity and sensitivity of MRI based myelin metrics. We elaborate here on the main advantage of using qMRI in our current study and why it is more specific than weighted MR imaging. However, we emphasize that a thorough comparison between qMRI and weighted MRI is highly complex and refer to our recent review paper on qMRI for further details (Weiskopf et al., 2021), which are beyond the scope of our paper. The signal in weighted MRI, even when optimally tuned to the tissue of interest, additionally depends on inhomogeneities in both the RF transmit and receive (bias) fields. Other methods like using a ratio image (T1w/T2w) can cancel out the receive field bias entirely (in the case of no subject movements between scans) but not the transmit field bias. This hampers the direct analysis and interpretation of signal differences between distant regions of the brain. For high resolution imaging applications, the usage of high magnetic fields such as 7 T is beneficial or even mandatory due to signal-to-noise (SNR) penalties. With increasing field strength, these inhomogeneities also apply to small regions such as V2. For these cases, qMRI is advantageous since it provides metrics which are free from these technical biases, significantly improving the specificity. As high-field MRI has the potential to non-invasively study the structure and function of the human brain at the spatial scale of cortical layers and cortical columns, we believe that the results of our current study, which successfully demonstrate the applicability of qMRI to robustly detect small differences at the level of columnar systems, are relevant for future studies in the field of neuroscience.

      We emphasized these considerations in the revised manuscript (see p. 9, lines 273–285).

      The study includes a very small number of participants (n=4). The advantage of non-invasive in-vivo measurements, despite the fact that they are indirect measures, should be that one can study a reasonable number of subjects. So this low n seems to undermine that point. I rarely suggest additional data collection, but I do feel that a few more subjects would shore up the study's impact.

      The present study was conducted in line with a deep phenotyping study approach. That is, we focused on acquiring highly reliable datasets on individuals. We did not intend to capture the population variance, which is often the goal of other group studies, since low level and basic features such as stripes in V2 are expected to be present in all healthy individuals. Thus we traded off and prioritized test-retest measurements for fMRI sessions and using an alternative MP2RAGE acquisition over a larger number of individuals. This resulted in 6–7 scanning sessions on different days for each individual, summing up to 26 long scanning sessions in total. We also note that the used sample size is not smaller than in other studies with a similar research question. For example, another fMRI study investigating V2 stripes in humans used the same sample size of n=4 (Dumoulin et al., 2017).

      The paper overstates what can be concluded in a number of places. For example, the paper suggests that R1 and R2 are highly-specific to myelin in a number of places. For example, on p7 the text reads: "We tested whether different stripe types are differentially myelinated by comparing R1 and R2..." Relaxation times lack the specificity to definitively attribute these changes purely to myelin. Similarly, on p11: "Our study showed that pale stripes which exhibit lower oxidative metabolic activity according to staining with CO are stronger myelinated than surrounding gray matter in V2." This implies that the study directly links CO staining to myelination. In addition to using non-specific estimates of myelination, the study does not actually measure CO.

      We agree that we did not clearly point out the limitations of R1 myelin mapping. Therefore, we toned down the statements about the connection between cortical myelin and R1. The mentioned statements in the reviewer’s comment were changed accordingly (see p. 6, line 163; p. 11, lines 353–354). We also included a small paragraph to clarify the used terminology (color-selective thin stripes, disparity-selective thick stripes) in the manuscript (see p. 4, lines 110–114) to avoid the inadvertent conflation of CO staining and actually measured brain activity.

      I'm confused by the analysis in Figure 5. I can appreciate why the authors are keen to present a "tripartite" analysis (thick, thin, and pale stripes). But I find the gray curves confusing. As I understand it, the gray curves as generated include both the stripe of interest (red or blue plots) and the pale stripes. Why not just generate a three-way classification? Generating these plots in effect has already required hard classification of thin and thick stripes, so it is odd to create the gray plots, which mix two types of stripes. Alternatively, could you explicitly model the partial volume for a given cortical location (e.g., under the assumption that partial volume of thick and thin stripes is indicated by the z-score) for the corresponding functional contrast? One could then estimate the relaxation times as a simple weighted sum of stripe-wise R1 or R2.

      Figure on weighted average of stripe-wise R1 and R2. (a) shows the weighted sum of R1 (de-meaned and de-curved) over all V2 voxels. z-scores from color-selective thin stripe experiments and disparity-selective thick stripes were used as weights in the left and middle group of bars, respectively. An intermediate threshold of zmax=1.96 was used, i.e., final weights were defined as weights=(z-1.96). Weights with z<0 were set to 0. For pale stripes (right group of bars), we used the maximum z-score value from thin and thick stripe measurements. We then set all weights with z≥1.96 to 0 and used the inverse as final weights. i.e., weights = -1 * (max(z)-1.96). (b) shows the same analysis for R2. Error bars indicate 1 standard error of the mean.
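      The weighting scheme in the caption above can be illustrated with a minimal sketch. This is not the code used for the figure: the function names and the toy voxel values are hypothetical, the weights are read as (z − 1.96) floored at zero for thin and thick stripes, and the weighted sum is normalized by the total weight to give a weighted average (an assumption based on the figure title).

```python
Z_THRESHOLD = 1.96  # intermediate threshold quoted in the caption


def stripe_weight(z, threshold=Z_THRESHOLD):
    """Thin- or thick-stripe weight: z - threshold, floored at 0 (assumed reading)."""
    return max(z - threshold, 0.0)


def pale_weight(z_thin, z_thick, threshold=Z_THRESHOLD):
    """Pale-stripe weight: -(max(z) - threshold), or 0 where max(z) >= threshold."""
    z_max = max(z_thin, z_thick)
    if z_max >= threshold:
        return 0.0
    return -(z_max - threshold)


def weighted_mean(values, weights):
    """Weighted average of per-voxel values (normalization is an assumption)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero; no voxel contributes")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical toy data: de-meaned R1 and z-scores for four V2 voxels.
r1 = [-0.02, -0.01, 0.03, 0.01]
z_thin = [3.96, 0.50, -1.00, 2.96]
z_thick = [0.00, 4.96, 0.96, 1.00]

thin_w = [stripe_weight(z) for z in z_thin]
thick_w = [stripe_weight(z) for z in z_thick]
pale_w = [pale_weight(a, b) for a, b in zip(z_thin, z_thick)]

print("thin :", weighted_mean(r1, thin_w))
print("thick:", weighted_mean(r1, thick_w))
print("pale :", weighted_mean(r1, pale_w))
```

      In this reading, a voxel contributes to the pale-stripe estimate only if neither functional contrast exceeds the threshold, and it contributes more the further its larger z-score falls below 1.96.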

      (1) Yes, indeed. We agree that modeling the partial volume of each compartment (thin, thick and pale stripes) in each V2 voxel would be the most elegant approach. However, we note that z-scores between thin and thick stripe experiments may not reflect the voxel-wise partial volume effect, since they are a purely statistical measure and not a partial volume model. Having said this, we think that this general approach can give some additional insights and we provide results for a similar analysis here. We calculated the weighted sum of R1 and R2 values over all V2 voxels for each stripe compartment (thin, thick and pale stripes) independently (see above figure). For R1, we see the same pattern of R1 between stripe types as in the manuscript (Figure 5). Additionally, we show the differences here for each subject, which further demonstrates the reproducibility across subjects in our study. For R2, no clear pattern across subjects emerged, confirming the results in our manuscript. Since this analysis did not add relevant new information to the manuscript, we refrained from adding this figure to the manuscript, in order not to overload it.

      (2) In our current study, we were not primarily interested in investigating differences between thin and thick stripes. While histological analysis found differences (though not consistent) between CO dark stripes (more myelinated; Tootell et al., 1983) and CO pale stripes (more myelinated; Krubitzer and Kaas, 1989), no study stated myelin differences between CO dark stripes. This does not fully exclude the possibility of myelination differences but suggests that if myelination differences between CO dark stripes existed, they would presumably be smaller than differences between CO dark and CO pale stripes. Thus, they would be even more difficult to demonstrate than the hypothesis of this manuscript.

      Therefore, we decided to directly test two compartments against each other instead of modeling all three compartments within a single model. In our analysis, we thereby loosely followed the analysis methods described in Li et al. (2019), who compared myelin differences between thin/thick and pale stripes in macaques. We note that this demonstrates further consistency, since it is not trivial that both thick and thin stripes show lower R1 values than the pale stripes. For example, there could have been no differences or opposite differences.

      (3) Just for clarification, the plots in Figure 5 show the comparison of R1 (or R2*) between two compartments in V2. The red (blue) curve includes the thin (thick) stripe of interest. The gray curve includes everything in V2 minus contributions from thick (thin) stripes of interest. If we take the thin stripe comparison as example (Figure 5a), then red contains the thin stripes of interest while gray contains everything minus the thick stripes. Therefore, assuming a tripartite stripe arrangement, the gray curve contains both thin and pale stripe contributions.

      References

      Carey D, Caprini F, Allen M, Lutti A, Weiskopf N, Rees G, Callaghan MF, Dick F. Quantitative MRI provides markers of intra-, inter-regional, and age-related differences in young adult cortical microstructure. Neuroimage 2018; 182:429–440.

      Dumoulin SO, Harvey BM, Fracasso A, Zuiderbaan W, Luijten PR, Wandell BA, Petridou N. In vivo evidence of functional and anatomical stripe-based subdivisions in human V2 and V3. Sci Rep 2017; 7:733.

      Freund P, Seif M, Weiskopf N, Friston K, Fehlings MG, Thompson AJ, Curt A. MRI in traumatic spinal cord injury: from clinical assessment to neuroimaging biomarkers. Lancet Neurol 2019; 18:1123–1135.

      González de San Román E, Bidmon H-J, Malisic M, Susnea I, Küppers A, Hübbers R, Wree A, Nischwitz V, Amunts K, Huesgen PF. Molecular composition of the human primary visual cortex profiled by multimodal mass spectrometry imaging. Brain Struct Func 2018; 223:2767–2783.

      Kirilina E, Helbling S, Morawski M, Pine K, Reimann K, Jankuhn S, Dinse J, Deistung A, Reichenbach JR, Trampel R, Geyer S, Müller L, Jakubowski N, Arendt T, Bazin P-L, Weiskopf N. Superficial white matter imaging: Contrast mechanisms and whole-brain in vivo mapping. Sci Adv 2020; 6:eaaz9281.

      Krubitzer LA, Kaas JH. Cortical integration of parallel pathways in the visual system of primates. Brain Res 1989; 478:161–165.

      Lazari A, Lipp I. Can MRI measure myelin? Systematic review, qualitative assessment, and meta-analysis of studies validating microstructural imaging with myelin histology. Neuroimage 2021; 230:117744.

      Lazari A, Salvan P, Cottaar M, Papp D, Rushworth MFS, Johansen-Berg H. Hebbian activity-dependent plasticity in white matter. Cell Rep 2022; 39:110951.

      Li X, Zhu Q, Janssens T, Arsenault JT, Vanduffel W. In Vivo Identification of Thick, Thin, and Pale Stripes of Macaque Area V2 Using Submillimeter Resolution (f)MRI at 3 T. Cereb Cortex 2019; 29:544–560.

      Mancini M, Karakuzu A, Cohen-Adad J, Cercignani M, Nichols TE, Stikov N. An interactive meta-analysis of MRI biomarkers of myelin. Elife 2020; 9:e61523.

      Natu VS, Gomez J, Barnett M, Jeska B, Kirilina E, Jaeger C, Zhen Z, Cox S, Weiner KS, Weiskopf N, Grill-Spector K. Apparent thinning of human visual cortex during childhood is associated with myelination. PNAS 2019; 116:20750–20759.

      Tootell RBH, Silverman MS, De Valois RL, Jacobs GH. Functional Organization of the Second Cortical Visual Area in Primates. Science 1983; 220:737–739.

      Weiskopf N, Edwards LJ, Helms G, Mohammadi S, Kirilina E. Quantitative magnetic resonance imaging of brain anatomy and in vivo histology. Nat Rev Phys 2021; 3:570–588.

      Whitaker KJ, Vértes PE, Romero-Garcia R, Váša F, Moutoussis M, Prabhu G, Weiskopf N, Callaghan MF, Wagstyl K, Rittman T, Tait R, Ooi C, Suckling J, Inkster B, Fonagy P, Dolan RJ, Jones PB, Goodyer IM, NSPN Consortium, Bullmore ET. Adolescence is associated with genomically patterned consolidation of the hubs of the human brain connectome. PNAS 2016; 113:9105–9110.

    1. Author Response

      Reviewer #1 (Public Review):

      Determination of the biomechanical forces and downstream pathways that direct heart valve morphogenesis is an important area of research. In the current study, potential functions of localized Yap signaling in cardiac valve morphogenesis were examined. Extensive immunostainings were performed for Yap expression, but Yap activation status as indicated by nuclear versus cytoplasmic localization, Yap dephosphorylation, or expression of downstream target genes was not examined.

      We thank the reviewer for appreciating the significance of this work and for the constructive suggestions. Following these suggestions, we have improved our analysis of YAP activation status and used nuclear versus cytoplasmic localization to quantify YAP activation. To address the reviewer’s concerns, we have also conducted additional qPCR analysis of YAP downstream target genes and YAP upstream genes in the Hippo pathway. Please find the detailed revisions in our responses to the Recommendations for authors.

      The goal of the work was to determine Yap activation status relative to different mechanical environments, but no biomechanical data on developing heart valves were provided in the study.

      We thank the reviewer for raising this concern. We have previously published the biomechanical data of developing chick embryonic heart valves in the following study:

      Buskohl PR, Gould RA, Butcher JT. Quantification of embryonic atrioventricular valve biomechanics during morphogenesis. Journal of Biomechanics. 2012;45(5):895-902.

      In that study, we used micropipette aspiration to measure the nonlinear biomechanics (strain energy) of chick embryonic heart valves at different developmental stages. Here in this study, we used the same method to measure the strain energy of YAP activated/inhibited cushion explants and compared it to the data from our previous study. Our findings were summarized in the Results: “YAP inhibition elevated valve stiffness”, and the detailed measurements, including images and data, are presented in Figure S4.

      There are several major weaknesses that diminish enthusiasm for the study.

      1) The Hippo/Yap pathway activation leads to dephosphorylation of Yap, nuclear localization, and induced expression of downstream target genes. However, there are no data included in the study on Yap nuclear/cytoplasmic ratios, phosphorylation status, or activation of other Hippo pathway mediators. Analysis of Yap expression alone is insufficient to determine activation status since it is widely expressed in multiple cells throughout the valves. The specificity for activated Yap signaling is not apparent from the immunostainings.

      We thank the reviewer for pointing out this weakness. We have now implemented nuclear versus cytoplasmic localization as recommended to quantify YAP activation. We have also conducted additional experiments to analyze, via qPCR, YAP downstream target genes and YAP upstream genes in the Hippo pathway. Please see the detailed revisions in our responses to the Recommendations for authors.

      2) The specific regionalized biomechanical forces acting on different regions of the valves were not measured directly or clearly compared with Yap activation status. In some cases, it seems that Yap is not present in the nuclei of endothelial cells surrounding the valve leaflets that are subject to different flow forces (Fig 1B) and the main expression is in valve interstitial subpopulations. Thus the data presented do not support differential Yap activation in endothelial cells subject to different fluid forces. There is extensive discussion of different forces acting on the valve leaflets, but the relationship to Yap signaling is not entirely clear.

      We thank the reviewer for these important questions. The region-specific biomechanics have been well mapped and studied, thanks to the help from Computational Fluid Dynamics supported by ultrasound velocity and pressure measurements. For example:

      Yalcin, H.C., Shekhar, A., McQuinn, T.C. and Butcher, J.T. (2011), Hemodynamic patterning of the avian atrioventricular valve. Dev. Dyn., 240: 23-35.

      Bharadwaj KN, Spitz C, Shekhar A, Yalcin HC, Butcher JT. Computational fluid dynamics of developing avian outflow tract heart valves. Ann Biomed Eng. 2012 Oct;40(10):2212-27. doi: 10.1007/s10439-012-0574-8.

      Ayoub S, Ferrari G, Gorman RC, Gorman JH, Schoen FJ, Sacks MS. Heart Valve Biomechanics and Underlying Mechanobiology. Compr Physiol. 2016 Sep 15;6(4):1743-1780.

      Salman HE, Alser M, Shekhar A, Gould RA, Benslimane FM, Butcher JT, et al. Effect of left atrial ligation-driven altered inflow hemodynamics on embryonic heart development: clues for prenatal progression of hypoplastic left heart syndrome. Biomechanics and Modeling in Mechanobiology. 2021;20(2):733-50.

      Ho S, Chan WX, Yap CH. Fluid mechanics of the left atrial ligation chick embryonic model of hypoplastic left heart syndrome. Biomechanics and Modeling in Mechanobiology. 2021;20(4):1337-51.

      Those studies have shown that USS develops on the inflow surface of valves while OSS develops on the outflow surface, and that CS develops in the tip region of valves while TS develops in the regions of elongation and compaction. In this study, we mimicked those forces in our in-vitro and ex-vivo models, which allowed us to study the direct effect of each specific force on YAP activity in different cell lineages. The results showed that OSS promoted YAP activation in VECs while USS inhibited it, and that CS promoted YAP activation in VICs while TS inhibited it. These results explain the spatiotemporal distribution of YAP activation in Figure 1. For example, nuclear YAP was mostly found in VECs on the fibrosa side, where OSS develops, whereas YAP was not found in the nuclei of VECs on the atrialis/ventricularis side, where USS develops. It is also worth noting that OSS forms more slowly on the outflow side, and thus the side-specific YAP activation in VECs was not yet in effect at the early stages, from E11.5 to E14.5.

      3) The requirement for Yap signaling in heart valve remodeling as described in the title was not demonstrated through manipulation of Yap activity.

      With respect, it is unclear what the reviewer is asking for, given that no experiments are suggested and no alternative interpretations of our results that argue against a YAP requirement are offered. It has been previously shown that YAP signaling is required for early EMT stages of valvulogenesis using conditional YAP deletion in mice:

      Zhang H, von Gise A, Liu Q, Hu T, Tian X, He L, et al. Yap1 Is Required for Endothelial to Mesenchymal Transition of the Atrioventricular Cushion. Journal of Biological Chemistry. 2014;289(27):18681-92.

      Signaling roles for early regulators at these later fetal stages are different from, and sometimes opposite to, their roles at early EndMT stages, thus contraindicating reliance on these early data to explain later events:

      Bassen D, Wang M, Pham D, Sun S, Rao R, Singh R, et al. Hydrostatic mechanical stress regulates growth and maturation of the atrioventricular valve. Development. 2021;148(13).

      However, embryos with YAP deletion failed to form endocardial cushions and could not survive long enough for the study of its roles in later cushion growth and remodeling into valve leaflets. In this work, we first showed the localization of YAP activity and its direct link with local shear or pressure domains. Then we explicitly applied controlled gain and loss of function of YAP via specific molecules. We also applied critical mechanical gain or loss of function studies to demonstrate YAP mechanoactivation necessity and sufficiency to achieve growth and remodeling.

      Reviewer #2 (Public Review)

      This study by Wang et al. examines changes in YAP expression in embryonic avian cultured explants in response to high and low shear stress, as well as tensile and compressive stress. The authors show that YAP expression is increased in response to low, oscillatory shear stress, as well as high compressive stress conditions. Inhibition of YAP signaling prevents compressive stress-induced increases in circularity, decreased pHH3 expression, and increases VE-cadherin expression. On the other hand, YAP gain of function prevents tensile stress-induced decreases in pHH3 expression and VE-cadherin expansion. It also decreases the strain energy density of embryonic avian cushion explants. Finally, using an avian model of left atrial ligation, the authors demonstrate that unloaded regions within the primitive valve structures are associated with increased YAP expression, compared to regions of restricted flow where YAP expression is low. Overall, this study sheds light on the biomechanical regulation of YAP expression in developing valves.

      We thank the reviewer for the accurate summary and their enthusiasm for this work.

      Strengths of the manuscript include:

      • Novel insights into the dynamic expression pattern of YAP in valve cell populations during post-EMT stages of embryonic valvulogenesis.

      • Identify the positive regulation of YAP expression in response to low, oscillatory shear stress, as well as high compressive stress conditions.

      • Identify a link between YAP signaling in regulating stress-induced cell proliferation and valve morphogenesis.

      • The inclusion of the left atrial ligation model is innovative, and the data showing distinguishable YAP expression levels between restricted and non-restricted flow regions are insightful.

      We thank the reviewer for appreciating the strengths of this work.

      This is a descriptive study that focuses on changes in YAP expression following exposure to diverse stress conditions in embryonic avian cushion explants. Overall, the study currently lacks mechanistic insights, and conclusions based on data are highly over-interpreted, particularly given that the majority of experimental protocols rely on one method of readout.

      We thank the reviewer for constructive suggestions.

      Reviewer #3 (Public Review)

      In this manuscript, Wang et al. assess the role of wall shear stress and hydrostatic pressure during valve morphogenesis at stages where the valve elongates and takes shape. The authors elegantly demonstrate that shear and pressure have different effects on cell proliferation by modulating YAP signaling. The authors use a combination of in vitro and in vivo approaches to show that YAP signaling is activated by hydrostatic pressure changes and inhibited by wall shear stress.

      We thank the reviewer for their enthusiasm for the impact of our work.

      There are a few elements that would require clarification:

      1) The impact of YAP on valve stiffness was unclear to me. How is YAP signaling affecting stiffness? Is it through cell proliferation changes? I was unclear about the model put forward:

      • Is it cell proliferation (does cell proliferation fluidize the tissue, while non-proliferating tissue is stiffer)?

      • Is it through differential gene expression?

      This needs clarification.

      We thank the reviewer for raising this important question. Cell proliferation can affect valve stiffness but is a minor factor compared with ECM deposition and cell contractility. Our micropipette aspiration data showed that the higher cell proliferation rate induced by YAP activation did lead to stiffer valves when compared to the controls. This may be because at the early stages, cells are more elastic than the viscous ECM. However, the stiffness of YAP-activated valves was only about half of that of YAP-inhibited valves, showing that transcriptional-level factors play a more important role. This also suggests that YAP-inhibited valves exhibited a more mature phenotype. An analogous role of YAP has also been found in cardiomyocytes. Many theories propose that in cardiomyocytes, when YAP is activated the proliferation programs are turned on, while when YAP is inhibited the proliferation programs are turned off and maturation programs are released. Similarly, here we hypothesize that YAP works like a mechanobiological switch, converting mechanical signaling into the decision between growth and maturation. We have revised the Discussion to include this hypothesis.

      2) The model proposes an early asymmetric growth of the cushion leading to different shear forces (oscillatory vs unidirectional shear stress). What triggers the initial asymmetry of the cushion shape? Is YAP involved?

      Although the initial geometry of the cushion model is symmetric, the force acting on it is asymmetric. The detailed numerical simulation of how the initial forces trigger the asymmetric morphogenesis can be found in our previous publication:

      Buskohl PR, Jenkins JT, Butcher JT. Computational simulation of hemodynamic-driven growth and remodeling of embryonic atrioventricular valves. Biomechanics and Modeling in Mechanobiology. 2012;11(8):1205-17.

      The color maps represent the dilatation rates when a) only pressure is applied, b) only shear stress is applied, and c) both pressure and shear stress are applied. It is this combined load that initiates an asymmetric morphological change, as shown in d). In addition, we believe YAP is involved during the initiation because CS and OSS directly promote its nuclear localization, whereas TS and LSS promote its cytoplasmic localization.

      3) The differential expression of YAP and its correlation to cell proliferation is a little hard to see in the data presented. Drawings highlighting the main areas would help the reader to visualise the results better.

      We thank the reviewer for this helpful suggestion. We have improved the visualization of Figure 3C and Figure 4C with insets of higher magnification.

      4) The origin of osmotic/hydrostatic pressure in vivo. While shear is clearly dependent upon blood flow, it is less clear that hydrostatic pressure is solely dependent upon blood flow. For example, it has been proposed that ECM accumulation such as hyaluronic acid could modify osmotic pressure (see for example Vignes et al., PMID: 35245444). Could the authors clarify the following questions:

      • How does blood flow affect osmotic pressure in vivo?

      • Is ECM a factor that could affect osmotic pressure in this system?

      We thank the reviewer for sharing this interesting study. Osmotic pressure plays a critical role in mechanotransduction and the development of many tissues, including cardiovascular tissues and cartilage. As proposed in the reference, osmotic pressure is an interstitial force generated by cardiac contractility. The hydrostatic pressure in our study is different: it is an external force applied by flowing blood. According to Bernoulli's law, when an incompressible fluid flows around a solid, the static pressure it applies on the solid is equal to its total pressure minus its dynamic pressure.
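
      Written out as a sketch under the usual simplifying assumptions (steady, inviscid, incompressible flow along a streamline, with gravity neglected), the relation reads as follows. The symbols are introduced here for illustration only and do not appear in the original response.

```latex
\[
  p_{\text{static}} \;=\; p_{\text{total}} - p_{\text{dynamic}},
  \qquad
  p_{\text{dynamic}} \;=\; \tfrac{1}{2}\,\rho\,v^{2}
\]
```

      Here $\rho$ denotes the blood density and $v$ the local flow speed. Under these assumptions, regions of slower flow carry a larger share of the total pressure as static pressure acting on the tissue surface.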

      Despite this difference, osmotic pressure can mimic the effect of hydrostatic pressure in vitro. The in-vitro osmotic pressure model has been widely used in cartilage research, for example:

      P. J. Basser, R. Schneiderman, R. A. Bank, E. Wachtel, and A. Maroudas, “Mechanical properties of the collagen network in human articular cartilage as measured by osmotic stress technique,” Arch. Biochem. Biophys., vol. 351, no. 2, pp. 207–19, 1998.

      D. A. Narmoneva, J. Y. Wang, and L. A. Setton, “Nonuniform swelling-induced residual strains in articular cartilage,” J. Biomech., vol. 32, no. 4, pp. 401–408, 1999.

      C. L. Jablonski, S. Ferguson, A. Pozzi, and A. L. Clark, “Integrin α1β1 participates in chondrocyte transduction of osmotic stress,” Biochem. Biophys. Res. Commun., vol. 445, no. 1, pp. 184–190, 2014.

      Z. I. Johnson, I. M. Shapiro, and M. V. Risbud, “Extracellular osmolarity regulates matrix homeostasis in the intervertebral disc and articular cartilage: Evolving role of TonEBP,” Matrix Biol., vol. 40, pp. 10–16, 2014.

      When maturing cushions shift from a GAG-dominated ECM to a collagen-dominated ECM, the water and ion retention capacity of the tissue would change greatly, thus reducing the osmotic pressure. This could in turn accelerate the maturation of cushions. By contrast, the ECM of growing cushions remains GAG-dominated, which would delay maturation and prolong growth.

      The revised second section of Results is as follows:

      Shear and hydrostatic stress regulate YAP activity

      In addition to being a co-effector of the Hippo pathway, YAP is also a key mediator of mechanotransduction. Indeed, the spatiotemporal activation of YAP correlated with the changes in the mechanical environment. During valve remodeling, unidirectional shear stress (USS) develops on the inflow surface of valves, where YAP is rarely expressed in the nuclei of VECs (Figure 2A). On the other side, OSS develops on the outflow surface, where VECs with nuclear YAP were localized. The YAP activation in VICs also correlated with hydrostatic pressure. The pressure generated compressive stress (CS) in the tips of valves, where VICs with nuclear YAP were localized (Figure 2B), whereas tensile stress (TS) was created in the elongated regions, where YAP was absent from VIC nuclei.

      To study the effect of shear stress on the YAP activity in VECs, we applied USS and OSS directly onto a monolayer of freshly isolated VECs. The VECs were obtained from AV cushions of chick embryonic hearts at HH25. The cushions were placed on collagen gels with the endocardium adherent to the collagen and incubated to enable the VECs to migrate onto the gel. We then removed the cushions and immediately applied the shear flow to the monolayer for 24 hours. The low stress OSS (2 dyn/cm²) promoted YAP nuclear translocation in VECs (Figure 2C, E), while high stress USS (20 dyn/cm²) restrained YAP in the cytoplasm.

      To study the effect of hydrostatic stress on the YAP activation in VICs, we used media with different osmolarities to mimic the CS and TS. CS was induced by a hypertonic condition while TS was created by a hypotonic condition, and the Unloaded (U) condition refers to the osmotically balanced media. Notably, in-vivo hydrostatic pressure is generated by flowing blood, while in-vivo osmotic pressure is generated by cardiac contractility and plays a critical role in the mechanotransduction during valve development (30). Despite the different in-vivo origins, osmotic pressure provides a reliable model to mimic hydrostatic pressure in vitro (31). We cultured HH34 AV cushion explants under different loading conditions for 24 hours and found that the trapezoidal cushions adopted a spherical shape (Figure 2D). TS loaded cushions significantly compacted, and the YAP activation in VICs of TS loaded cushions was significantly lower than that in CS loaded VICs (Figure 2F).

    1. Author Response

      Reviewer #1 (Public Review):

      Huang et al. sought to study the cellular origin of Tuft cells and the molecular mechanisms that govern their specification in severe lung injury. First the authors show ectopic emergence of Tuft cells in airways and distal parenchyma following different injuries. The authors also used lineage tracing models and uncovered that p63-expressing cells and to some extent Scgb1a1-lineaged labeled cells contribute to tuft cells after injury. Further, the authors modulated multiple pathways and claim that Notch inhibition blocks tuft cells whereas Wnt inhibition enhances Tuft cell development in basal cell cultures. Finally, the authors used Trpm5 and Pou2f3 knock-out models to claim that tuft cells are indispensable for alveolar regeneration.

      In summary, the findings described in this manuscript are somewhat preliminary. The claim that the cellular origin of Tuft cells in influenza infection was not determined is incorrect. Current data from pathway modulation is preliminary and this requires genetic modulation to support their claims.

      We thank the reviewer for the comments and we have performed extensive experiments to address the reviewer’s comments. In the revised manuscript we provide additional data including genetic modulation findings to support our model.

      Major comments:

      1) The abstract sounds incomplete and does not cover all key aspects of this manuscript. Currently, it is mainly focusing on the cellular origin of Tuft cells and the role of Wnt and notch signaling. However, it completely omits the findings from Trpm5 and Pou2f3 knock-out mice. In fact, the title of the manuscript highlights the indispensable nature of tuft cells in alveolar regeneration.

      We have modified the abstract and title accordingly.

      2) In lines 93-94, the authors state that "It is also unknown what cells generate these tuft cells.....". This statement is incorrect. Rane et al., 2019 used the same p63-creER mouse line and demonstrated that all tuft cells that ectopically emerge following H1N1 infection originate from p63+ lineage labeled basal cells. Therefore, this claim is not new.

      We thank the reviewer for the comment. Although Rane et al. reported that the p63-expressing lineage-negative epithelial stem/progenitor cells (LNEPs) could contribute to the ectopic tuft cells after PR8 virus infection, it is still not clear whether the p63+ cells give rise to tuft cells directly or through EBCs. Thus, unlike Rane et al. (2019), who performed TMX injection before viral infection, we performed TMX injection after PR8 infection to indicate that the ectopic tuft cells are derived from EBCs, as shown in revised Figure 2.

      3) Lines 152-153 state that "21.0% +/- 2.0 % tuft cells within EBCs are labeled with tdT when examined at 30 dpi...". It is not clear what the authors meant here ("within EBC's")? And also, the same sentence states that "......suggesting that club cell-derived EBCs generate a portion of tuft cells....". In this experiment, the authors used club cell lineage tracing mouse lines. So, how do the authors know that the club cell lineage-derived tuft cells came through intermediate EBC population? Current data do not show evidence for this claim. Is it possible that club cells can directly generate tuft cells?

      We apologize for the confusion and have revised the text accordingly. Here, “within EBCs” means within the “pods” area where p63+ basal cells are ectopically present. The sentence is revised as “21.0% +/- 2.0 % tuft cells that are ectopically present in the parenchyma are labeled by tdT. Notably, these lineage labeled tuft cells were co-localized with EBCs.” We don’t know whether the club cell lineage-derived tuft cells transit through intermediate EBCs, and that is why we used “suggest”. It is also possible that club cells can directly generate tuft cells. To avoid confusion, we have deleted the sentence.

      4) Based on the data from Fig-3A, the authors claim that treatment with C59 significantly enhances tuft cell development in ALI cultures. Porcupine is known to facilitate Wnt secretion. So, which cells are producing Wnt in these cultures? It is important to determine which cells are producing Wnt and also which Wnt. Further, based on DBZ treatments, it appears that active Notch signaling is necessary to induce Tuft cell fate in basal cells. Where are Notch ligands expressed in these tissues? Is Notch active only in a small subset of basal cells (and hence generate rare tuft cells)? This is one of the key findings in this manuscript. Therefore, it is important to determine the expression pattern of Wnt and Notch pathway components.

      We thank the reviewer for the interesting questions and agree on the importance of identifying the specific ligands and receptors for relevant Wnt and Notch signaling during tuft cell derivation. That being said, we think the topic is beyond the scope of this study, which is focused on the role of tuft cells in alveolar regeneration. The point is well taken and we will investigate the topic in our future study.

      5) How do the authors explain different phenotypes observed in Trpm5 knockout and Pou2f3 mutants? Is it possible that Trpm5 knockout mice have a subset of tuft cells and that they might be something to do with the phenotypic discrepancy between two mutant models?

      Again we thank the reviewer for the interesting question. As discussed in the discussion section, Trpm5 is also reported to be expressed in B lymphocytes (Sakaguchi et al., 2020). It is possible that loss of Trpm5 modulates the inflammatory responses following viral infection, which may contribute to improved alveolar regeneration. However, it is also possible that Trpm5-/- mice keep a subset of tuft cells that facilitate lung regeneration as suggested by the reviewer.

      6) One of the key findings in this manuscript is that Wnt and Notch signaling play a role in Tuft cell specification. All current experiments are based on pharmacological modulation. These need to be substantiated using genetic gain- and loss-of-function models.

      We have performed the genetic studies.

      Reviewer #2 (Public Review):

      In this manuscript, the authors describe the ectopic differentiation of tuft cells that were derived from lineage-tagged p63+ cells post influenza virus infection. These tuft cells do not appear to proliferate or give rise to other lineages. They then claim that Wnt inhibitors increase the number of tuft cells while inhibiting Notch signaling decreases the number of tuft cells within Krt5+ pods after infection in vitro and in vivo. The authors further show that genetic deletion of Trpm5 in p63+ cells post-infection results in an increase in AT2 and AT1 cells in p63 lineage-tagged cells compared to control. Lastly, they demonstrate that depletion of tuft cells caused by genetic deletion of Pou2f3 in p63+ cells has no effect on the expansion or resolution of Krt5+ pods after infection, implying that tuft cells play no functional role in this process.

      Overall, in vivo and in vitro phenotypes of tuft cells and alveolar cells are clear, but the lack of detailed cellular characterization and molecular mechanisms underlying the cellular events limits the value of this study.

      We thank the reviewer for the comments and acknowledging that our findings are clear. In the revised manuscript we provide more detailed characterization and genetic evidence to elucidate the role of tuft cells in lung regeneration.

      1) Origin of tuft cells: Although the authors showed the emergence of ectopic tuft cells derived from labelled p63+ cells after infection, it cannot be ruled out that pre-existing p63+Krt5- intrapulmonary progenitors, as previously reported, can also contribute to tuft cell expansion (Rane et al. 2019; by labelling p63+ cells prior to infection, they showed that the majority of ectopic tuft cells are derived from p63+ cells after viral infection). It would be more informative if the authors show the differentiation of tuft cells derived from p63+Krt5+ cells by tracing Krt5+ cells after infection, which will tell us whether ectopic tuft cells are differentiated from ectopic basal cells within Krt5+ pods induced by virus infection.

      We thank the reviewer for the helpful suggestion. We have performed the experiment accordingly.

      2) Mechanisms of tuft cell differentiation: The authors tried to determine which signaling pathways regulate the differentiation of tuft cells from p63+ cells following infection. Although Wnt/Notch inhibitors affected the number of tuft cells derived from p63+ labelled cells, it remains unclear whether these signals directly modulate differentiation fate. The authors claimed that Wnt inhibition promotes tuft cell differentiation from ectopic basal cells. However, in Fig 3B, Wnt inhibition appears to trigger the expansion of p63+Krt5+ pod cells, resulting in increased tuft cell differentiation rather than directly enhancing tuft cell differentiation. Further, in Fig 3D, Notch inhibition appears to reduce p63+Krt5+ pod cells, resulting in decreased tuft cell differentiation. Importantly, a previous study has reported that Notch signalling is critical for Krt5+ pod expansion following influenza infection (Vaughan et al. 2015; Xi et al. 2017). Notch inhibition reduced Krt5+ pod expansion and induced their differentiation into Sftpc+ AT2 cells. In order to address the direct effect of Wnt/Notch signaling in the differentiation process of tuft cells from EBCs, the authors should provide a more detailed characterization of cellular composition (Krt5+ basal cells, club cells, ciliated cells, AT2 and AT1 cells, etc.) and activity (proliferation) within the pods with/without inhibitors/activators.

      Again we thank the reviewer for the insightful suggestions. We agree that it will be interesting to further address the direct effect of Wnt/Notch signaling in the differentiation process of tuft cells from EBCs. In this revised manuscript we added new findings of EBC differentiation into tuft cells in mice with genetic deletion of Rbpjk.

      3) Impact of Trpm5 deletion in p63+ cells: It is interesting that Trpm5 deletion promotes the expansion of AT2 and AT1 cells derived from labelled p63+ cells following infection. It would be informative to check whether Trpm5 regulates Hif1a and/or Notch activity which has been reported to induce AT2 differentiation from ectopic basal cells (Xi et al. 2017). Although the authors stated that there was no discernible reduction in the size of Krt5+ pods in mutant mice, it would be interesting to investigate the relationship between AT2/AT1 cell retaining pods and the severity of injury (e.g. large Krt5+ pods retain more/less AT2/AT1 cells compared to small pods. What about other cell types, such as club and goblet cells, in Trpm5 mutant pods? Again, it cannot be ruled out that pre-existing p63+Krt5- intrapulmonary progenitor cells can directly convert into AT2/AT1 cells upon Trpm5 deletion rather than p63+Krt5+ cells induced by infection.

      We thank the reviewer for the comments and suggestions. Our new data using the KRT5-CreER mouse line confirmed that pod cells (Krt5+) do not contribute to AT2/AT1 cells, consistent with previous studies (Kanegai et al., 2016; Vaughan et al., 2015). Our data also show that p63-CreER lineage labeled AT2/AT1 cells are separated from the pod cell area, suggesting that pod cells and these AT2/AT1 cells are generated from different cells of origin. We also checked the Notch activity in pod cells in Trpm5-/- mice, and some pod cell-derived cells are Hes1 positive, whereas some are Hes1 negative (RLFigure 1). As indicated in the Discussion, we think that AT2/AT1 cells are possibly derived from pre-existing AT2 cells that transiently express p63 after PR8 infection. It will be interesting to test whether Trpm5 regulates Hif1a in this population (p63+,Krt5-), and this will be our next plan.

      RLFigure 1. Representative area staining in Trpm5-/- mice at 30 dpi. Area 1: Notch signaling is active (Hes1+, arrows) in pod cells following viral infection. Area 2: pod cells exhibit reduced Notch activities. Note few Hes1+ cells in pods (arrows). Scale bar: 50 µm.

      4) Ectopic tuft cells in COVID-19 lungs: The previous study by the authors' group revealed the presence of ectopic tuft cells in COVID-19 patient samples (Melms et al. 2021). There appears to be no additional information in this manuscript.

      In Melms et al., Nature, 2021 (Melms et al., 2021), we showed tuft cell expansion in COVID-19 lungs but not the potential origin of tuft cells. In this manuscript we show some cells co-expressing POU2F3 and KRT5, suggesting a pod-to-tuft cell differentiation.

      5) Quantification information and method: Overall, the quantification method should be clarified throughout the manuscript. Further, in the method section, the authors stated that the production of various airway epithelial cell types was counted and quantified on at least 5 "random" fields of view. However, virus infection causes spatially heterogeneous injury, resulting in a difficult to measure "blind test". The authors should address how they dealt with this issue.

      We have clarified the quantification method as suggested. For the in vitro cell culture assays on the signaling pathways, we took pictures from at least five random fields of view for quantification. For lung sections, we tile-scanned the lung sections including at least three lung lobes and performed quantification.

      Reviewer #3 (Public Review):

      In this manuscript Huang et al. study how the lung regenerates after severe injury due to viral infection. They focus on how tuft cells may affect regeneration of the lung by ectopic basal cells and come to the conclusion that they are not required. The manuscript is intriguing but also very puzzling. The authors claim they are specifically targeting ectopic basal progenitor cells and show that they can regenerate the alveolar epithelium in the lung following severe injury. However, it is not clear that the p63-CreERT2 line the authors are using only labels ectopic basal cells. The question is what is a basal cell? Is an ectopic basal progenitor cell only defined by Trp63 expression?

      The accompanying manuscript by Barr et al. uses a Krt5-CreERT2 line to target ectopic basal cells and using that tool the authors do not see a significant contribution of ectopic basal cells towards alveolar epithelial regeneration. As such the claim that ectopic basal cell progenitors drive alveolar epithelial regeneration is not well-founded.

      We appreciate the reviewer for the positive comments and agreeing that our findings are interesting.

      The title itself is also not very informative and is a bit misleading. That being said I think the manuscript is still very interesting and can likely easily be improved through a better validation of which cells the p63-CreERT2 tool is targeting.

      We have revised the title accordingly and performed extensive experiments to address the reviewer’s concerns.

      I, therefore, suggest the following experiments.

      1) Please analyze which cells p63-CreERT2 labels immediately after PR8 and tamoxifen treatment. Are all the tdTomato labeled cells also Krt5 and p63 positive or are some alveolar epithelial cells or other airway cell types also labeled?

      We thank the reviewer for the question. To answer it, we performed PR8 infection (250 pfu) on three Trp63-CreERT2;R26tdT mice and administered TMX at days 5 and 7 post viral infection. We did not perform TMX injection immediately because the mice were sick in the first few days post infection. The lung samples were collected at 14 dpi. We observed that tdT+ cells are present in the airways (rebuttal letter RLFigure 2A, B), and it appears that the lineage labeled cells (tdT+) include club cells (CC10+) that are underlain by tdT+Krt5+ basal cells (RLFigure 2C). We think that these labeled basal cells give rise to club cells. However, we also noticed that rare club cells and ciliated cells (FoxJ1+) are labeled by tdT in areas lacking surrounding tdT+ basal cells (RLFigure 2D). Moreover, a minor population of tdT+ SPC+ cells is present in the terminal airways that were disrupted by viral infection (RLFigure 2E and D). We did not see any pods formed in this experiment and we did not observe any tdT+ cells in the intact alveoli (uninjured area).

      RLFigure 2. Trp63-CreERT2 lineage labeled cells are present in the airways but not the alveoli when Tamoxifen was administered at days 5 and 7 after PR8 H1N1 viral infection. Trp63-CreERT2;R26-tdT mice were infected with PR8 at 250 pfu and Tmx was delivered at a dose of 0.25 mg/g bodyweight by oral gavage. Lung samples were collected and analyzed at 14 dpi. Antibodies used for staining are as indicated. Scale bar: 100 µm.

      2) Please also show if p63-CreERT2 labels any cells in the adult lung parenchyma in the absence of injury after tamoxifen treatment.

      Dr. Wellington Cardoso’s group demonstrated that Trp63-CreERT2 only labels very few cells in the airways but not the lung parenchyma in the absence of injury after tamoxifen treatment (Yang et al., 2018). Dr. Ying Yang has revisited the data and she did not observe any labeling in the lung parenchyma (n = 2).

      3) Please analyze if p63-CreERT2 labels any cells with tdTomato in the absence of injury or after PR8 infection but without tamoxifen treatment.

      We performed the experiment and didn't observe any labeled cells in the lung parenchyma without Tamoxifen treatment (n = 4).

      4) Please analyze when after PR8 infection do the first p63-CreERT2 labeled tdTomato positive alveolar epithelial cells appear.

      We administered tamoxifen at days 5 and 7 after PR8 infection and harvested lung tissues at day 14. As shown in Figure 1, we observed a few tdT+ SPC+ cells in the terminal airways that are disrupted by viral infection. Notably, we did not observe any lineage labeled cells in the intact alveoli (uninjured) in this experiment.

      5) A clonal analysis of p63-CreERT2 labeled cells using a confetti reporter might also help interpret the origin of p63-CreERT2 labeled cells.

      We thank the reviewer for the suggestion. Our new data demonstrate that a rare population of SPC+tdT+ cells is present in the disrupted terminal airways of Trp63-CreERT2;R26tdT mice. Our data in the original manuscript and the new data suggest that the initial SPC+;tdT+ cells are rare because we have to administer multiple doses of Tamoxifen to label them. Given the lower labeling efficiency of confetti mice compared with R26tdT mice, it is possible that we will not be able to label these SPC+ cells. Moreover, our original manuscript clearly shows individual clones of SPC+tdT+ cells in the regenerated lung, and they do not seem to be composed of multiple clones. Therefore we think that the use of confetti mice may not add new information.

      6) Lastly, could the authors compare the single-cell RNAseq transcription profile of p63-CreERT2 labeled cells immediately after PR8 and tamoxifen treatment and also at 60dpi? A pseudotime analysis and trajectory inference analysis could help elucidate the identity of p63-CreERT2 labeled cells that are actually not ectopic basal progenitor cells.

      We appreciate the reviewer’s suggestion and agree that single cell RNA sequencing with pseudotime analysis can provide further information regarding the origin of the lineage labeled alveolar cells of Trp63-CreERT2;R26tdT mice. That said, our new data clearly show that KRT5-CreER lineage labeled cells do not give rise to AT1/2 cells as previously described (Kanegai et al., 2016; Vaughan et al., 2015), suggesting that the ectopic basal progenitor cells do not generate alveolar cells. By contrast, Trp63-CreERT2 lineage labeled cells do give rise to AECs, suggesting that the p63+ cell population capable of generating AECs is different from Krt5+ ectopic basal progenitor cells. Our single cell core has an extremely long waiting list due to the pandemic and we hope that our new findings are enough to address the reviewer’s concern without the need for single cell analysis.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript applies the framework of information theory to study a subset of cellular receptors (called lectins) that bind to glycan molecules, with a specific focus on the kinds of glycans that are typical of fungal pathogens. The authors use the concentration of various types of ligands as the input to the signaling channel, and measure the "response" of individual cells using a GFP reporter whose expression is driven by a promoter that responds to NFκB. While this work is overall technically solid, I would suggest that readers keep several issues in mind while evaluating these results.

      1) One of the largest potential limitations of the study is the reliance of the authors on exogenous expression of the relevant receptors in U937 cells. Using a cell-line system like this has several advantages, most notably the fact that the authors can engineer different reporters and different combinations of receptors easily into the same cells. This would be much more difficult with, say, primary cells extracted from a mouse or a human. While the ability to introduce different proteins into the cells is a benefit, the problem is that it is not clear how physiologically relevant the results are. To their credit, the authors perform several controls that suggest that differences in transfection efficiency are not the source of the differences in channel capacity between, say, dectin-1 and dectin-2. As the authors themselves clearly demonstrate, however, the differences in the properties of these signaling systems are not based on receptor expression levels, but rather on some other property of the receptor. Now, it could be that the dectin-2 receptor is somehow just more "noisy" in terms of its activity compared to, say, dectin-1. This seems a somewhat less likely explanation, however, and so it is likely that downstream details of the signaling systems differ in some way between dectin-2 and the more "information efficient" receptors studied by the authors.

      The channel capacity of a cell signaling network depends critically on the distributions of the downstream signaling molecules in question: see the original paper by Cheong et al. (2011, Science 334 (6054), 354-8) and subsequent papers (notably Selimkhanov et al. (2014) Science 346 (6215), 1370-3 and Suderman et al. (2018) Interface Focus 8 (6), 20180039). The U937 cells considered here clearly don't serve the physiological function of detecting the glycans considered by the authors; despite the fact that this is an artificial cell line, the fact the authors have to exogenously express the relevant receptors indicates that these cells are not necessarily a good model for the types of cells in the body that actually have evolved to sense these glycan molecules.

      Signaling molecules readily exhibit cell-type-specific expression levels that influence cellular responses to external stimuli (Rowland et al. (2017) Nat Commun 8, 16009). So it is unclear that the distributions of downstream signaling molecules in U937 cells mirror those that would be observed in the immune cell types relevant to this response. As such, the physiological relevance of the differences between dectin-2 channel capacities and those exhibited by the other receptors is currently unclear.

      We appreciate Reviewer #1’s in-depth comments on the physiological relevance of the U937 cells. A major benefit of using information theory to investigate a biological communication channel is that it quantifies the information the channel transmits without requiring detailed measurements of the spatiotemporal dynamics of receptors and downstream signaling cascades. In addition, comparing the measured information quantities in turn allows reasonable predictions about the underlying signaling mechanisms. For example, we investigated how the transmission of glycan information from dectin-2 is synergistically modulated in the presence of dectin-1, DC-SIGN or mincle. Our approach allows us to investigate how individual lectins on immune cells contribute to glycan information transmission and how their signals are integrated in the presence of other types of lectins. Therefore, the findings describe how physiologically relevant lectins integrate the extracellular signal in a more defined way. Furthermore, we found that our model cell line has one order of magnitude higher expression of dectin-2 compared with primary human monocytes and exhibits a similar zymosan binding pattern (described in Recommendations for the authors and Figure R8).

      We fully agree that acquiring more information on the information transmission capability of primary immune cells would increase physiological relevance. In the revised manuscript, we addressed this concern by comparing the receptor expression levels of our model cell lines with those of primary monocytes, for which we find agreement in cellular heterogeneity. However, we would also like to point out that the very basic nature of our question, namely how information stored in glycans is processed by lectins, is not tightly bound to the differences between primary cells and cell lines.

      Line 382: Finally, it is important to take into consideration that our conclusions came from model cell lines, which were used as a surrogate for cell-type-specific lectin expression patterns of primary immune cells. Human monocytes and dectin-2 positive U937 cells have comparable receptor densities and respond similarly to stimulation with zymosan particles (SI Fig. 6A and B).

      2) Another issue that readers might want to keep in mind is that the details of the channel capacity calculation are a bit unclear as the manuscript is currently written. The authors indicate that their channel capacity calculations follow the approach of Cheong et al. (2011) Science 334 (6054), 354-8. However, the extent to which they follow that previous approach is not obvious. For instance, the calculations presented in the 2011 work use a combined bootstrapping/linear extrapolation approach to estimate the mutual information at infinite population size in order to deal with known inaccuracies in the calculation that arise from finite-size effects. The Cheong approach also deals with the question of how many bins to use in order to estimate the joint probability distribution across signal and response.

      They do this by comparing the mutual information they calculate for the real data with that calculated for random data to ensure that they are not calculating spuriously high mutual information based on having too many bins. While the Cheong et al. paper does a great job explaining why these steps need to be undertaken, a subsequent paper by Suderman et al. (2017, PNAS 114 (22), 5755-60) explains the approach in even greater detail in the supporting information. Those authors also implemented several improvements to the general approach, including a bootstrap method for more accurately estimating the error in the mutual information and channel capacity estimates.

      The problem here is that, while the authors claim to follow the approach of Cheong et al., it seems that they have re-implemented the calculation, and they do not provide sufficient detail to evaluate the extent to which they are performing the same exact calculation. Since estimates of mutual information are technically challenging, specific details of the steps in their approach would be helpful in order to understand how closely their results can be compared with the results of previous authors. For instance, Cheong et al. estimate the "channel capacity" by trying a set of likely unimodal and bimodal distributions for the input to the channel, and choosing the maximal value as the channel capacity. This is clearly a very approximate approach, since the channel capacity is defined as the supremum over an (uncountably infinite) set of input probability distributions. In any case, the authors of the current manuscript use a different approach to this maximization problem. Although it is a bit unclear how their approach works, it seems that they treat the probability of each input bin as an independent parameter (under the constraint that the probabilities sum to one) and then use an optimization algorithm implemented in Python to maximize the mutual information. In principle, this could be a better approach, since the set of input distributions considered is potentially much larger. The details of the optimization algorithm matter, however, and those are currently unclear as the paper is written.

      We thank Reviewer #1 for the recommendation on strengthening the validity of the calculation. In the revised manuscript, we explain the channel capacity calculation procedures in more detail, including the statistical approaches adopted from Cheong et al. (2011) and Suderman et al. (2018) (SI sections 1 and 2). Furthermore, we chose the number of bins based not only on a random dataset but also on the total number of samples, as shown below:

      Figure R1. A) Channel capacity values of a random dataset, extrapolated to infinite subsample size, for various total numbers of samples and numbers of output bins. The white line in the heatmap marks a channel capacity of 0.01 bits. B) Channel capacity values, extrapolated to infinite subsample size, for the U937 cells’ input (TNF-α doses) and output (GFP reporter) response.

      Figure R1 shows channel capacity values from a random dataset (A) and an experimental dataset (B, TNFAR + TNF-α). The values from random data show how the channel capacity estimate depends on the number of output bins and the total number of samples. Based on this heatmap, we set the allowed bias to 0.01 bits, as indicated by the contour line in Figure R1A. Since the smallest dataset that we used for channel capacity calculation in the absence of labelled input contains close to 90,000 samples, the expected bias in the channel capacity calculation is less than 0.01 bits for bin numbers from 10 to 1000, as shown in Figure R1A.
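
      The finite-sample bias discussed here can be illustrated with a minimal sketch. It assumes the common plug-in estimator on binned data and a linear fit of the estimate against inverse subsample size, in the spirit of the extrapolation described by Cheong et al. (2011). The function names, subsample fractions and repeat count are hypothetical choices for illustration and are not taken from the authors' code.

```python
import math
import random
from collections import Counter


def plugin_mi(pairs):
    """Plug-in mutual information in bits from (input_bin, output_bin) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def extrapolated_mi(pairs, fractions=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
                    repeats=20, seed=0):
    """Fit MI against 1/N over random subsamples; return the intercept at 1/N = 0."""
    rng = random.Random(seed)
    inv_n, mi = [], []
    for f in fractions:
        k = round(f * len(pairs))
        estimates = [plugin_mi(rng.sample(pairs, k)) for _ in range(repeats)]
        inv_n.append(1.0 / k)
        mi.append(sum(estimates) / repeats)
    # Ordinary least squares for mi = intercept + slope * (1/N)
    mean_x = sum(inv_n) / len(inv_n)
    mean_y = sum(mi) / len(mi)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(inv_n, mi))
             / sum((x - mean_x) ** 2 for x in inv_n))
    return mean_y - slope * mean_x
```

      For statistically independent input and output, the plug-in value is positive purely because of finite sampling, and the extrapolated intercept is expected to lie closer to zero. Repeating this over several bin numbers and sample sizes would give a bias map of the kind summarized in Figure R1A.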

      Furthermore, we demonstrated the mutual information maximization procedure using predefined unimodal and bimodal input distributions and compared it with the systematic method that we used in this work. We found no noticeable difference in the channel capacity values between the two approaches (SI Figure 3M).

      3) Another issue to be careful about when interpreting these findings is the fact that the authors use logarithmic bins when calculating the channel capacity estimates. This is equivalent to saying that the "output" of the cell signaling channel is not the amount of protein produced under the control of the NFκB promoter, but rather the log of the protein level. Essentially, the authors are considering a case where the relevant output of the system is not the amount of protein itself, but the fold change in the amount of protein. That might be a reasonable assumption, especially if the protein being produced is a transcription factor whose own promoters have evolved to detect fold changes. For many proteins, however, the cell is likely responsive to linear changes in protein concentration, not fold changes. And so choosing the log of the protein level as the output may not make sense in terms of understanding how much information is actually contained in this particular output variable. Regardless, choosing logarithmic bins is not purely a matter of convenience or arbitrary choice, but rather corresponds to a very strong statement about what the relevant output of the channel is.

      We understand Reviewer #1’s concern regarding the choice of logarithmic binning. We found that if the number of bins is higher than 200, the estimated channel capacities converge to the same value regardless of the binning method (linear, logarithmic or equal frequency). The only difference is how quickly the values approach the converged channel capacity as the number of bins increases (shown in Figure R2). In the revised manuscript, we used linear binning to better represent the relevant protein signaling, as the Reviewer suggested. Note that the channel capacity values calculated with linear binning do not differ noticeably from our previously calculated values.

      On the other hand, linear binning generates significant bias if we include a labelled input (i.e., a continuous input) in the channel capacity calculation, due to the increased number of bins in the input region.

      Figure R2. Dependence of the channel capacity value on the number of output bins and the binning method for the experimental dataset. The inset plots show the difference in channel capacity relative to the maximum channel capacity value over the entire range of bin numbers (i.e., from 10 to 1000) for the corresponding binning method.

      Following Reviewer #1’s comment, we have changed the binning method from logarithmic to linear binning for the whole experimental dataset, except in the presence of labelled input (i.e., dectin-2 antibody). When we consider the channel capacity between labelled input and the NF-κB reporter, equal frequency binning is used for every layer of the channel (i.e., labelled input-binding, binding-GFP, labelled input-GFP).
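
      The three binning schemes compared in this response differ only in how the bin edges are placed. The sketch below is an illustration under simple assumptions (strictly positive values for logarithmic bins, edges taken at sorted-sample positions for equal frequency bins); the helper names are hypothetical and do not come from the authors' implementation.

```python
import bisect


def linear_edges(values, n_bins):
    """Edges equally spaced in the measured quantity itself."""
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * i / n_bins for i in range(n_bins + 1)]


def log_edges(values, n_bins):
    """Edges equally spaced in log space, i.e. in fold change (needs values > 0)."""
    lo, hi = min(values), max(values)
    return [lo * (hi / lo) ** (i / n_bins) for i in range(n_bins + 1)]


def equal_frequency_edges(values, n_bins):
    """Edges chosen so each bin holds roughly the same number of samples."""
    s = sorted(values)
    return [s[min(len(s) - 1, (len(s) * i) // n_bins)] for i in range(n_bins + 1)]


def assign_bin(value, edges):
    """Index of the bin containing value; the maximum falls in the last bin."""
    idx = bisect.bisect_right(edges, value) - 1
    return min(max(idx, 0), len(edges) - 2)


def histogram(values, edges):
    """Sample counts per bin for the given edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        counts[assign_bin(v, edges)] += 1
    return counts
```

      On skewed data, linear edges concentrate most samples in the lowest bins, whereas equal frequency edges spread them evenly, which is one way to see why the schemes approach the converged value at different rates as the number of bins grows.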

      Reviewer #2 (Public Review):

      My expertise is more on the theoretical than the experimental aspects of this paper, so those will be the focus of these comments.

      Signal transduction is an important area of study for mathematical biologists and biophysicists. This setting is a natural one for information-theoretic methods, and such methods are attracting increasing research interest. Experimental results that attempt to directly quantify the Shannon capacity of signal transduction are particularly interesting. This paper represents an important contribution to this emerging field.

      My main comments are about the rigorousness and correctness of the theoretical results. More details about these results would improve the paper and help the reader understand the results.

      We understand Reviewer #2’s comment regarding the rigor and correctness of the theoretical results of this work. In the revised manuscript, we added the following content to help the reader better understand the channel capacity calculation procedures.

      • A general illustrative introduction to how we measured the input and output datasets and how we processed those data to prepare the joint probability distribution (SI sections 1.1 and 1.2).

      • An example of the mutual information maximization procedure using experimental and arbitrary datasets (SI section 1.3).

      The calculation of channel capacity, given in the methods, is quite a standard calculation and appears to be correct. However, I was confused by the use of the "weighting value" w_i, which is not specified in the manuscript. The input distribution appears to be a product of the weight w_i and the input probability value p_i, and these appear always to occur together as a product w_i p_i. (In joint probabilities w_i p(i,j), the input probability can be extracted using Bayes' rule, leaving w_i p_i p(j|i).) This leads me to wonder two things. First, what role does w_i play (is it even necessary)? Second, of particular interest here is the capacity-achieving input distribution p_i, but w_i obscures it; is the physical input distribution p_i equal to the capacity-achieving distribution? If not, what is the meaning of capacity?

      We thank Reviewer #2 for the comment regarding the arbitrariness of the weightings. We realize that the weighting values were not sufficiently explained in the original manuscript. P_X(i) is the marginal probability distribution of the input in the original dataset, and P'_X(i) is the marginal probability distribution of the modified input that maximizes the mutual information. In the usual case, P_X(i) is not equal to P'_X(i), and therefore one needs to find P'_X(i) from P_X(i). Because P'_X(i) is obtained by rescaling each entry of P_X(i), it can be expressed as w(i) P_X(i), where w(i) are the weightings, under the constraint Σ_i w(i) P_X(i) = 1. The changed input distribution, in turn, modifies the joint probability distribution as P'_XY(i, j) = w(i) P_XY(i, j). To help readers understand this work, we expanded the Appendix with illustrative descriptions.
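
      The role of the weightings can be made concrete with a small sketch. It reweights a joint probability table by w(i) and evaluates the mutual information, and it uses the standard Blahut-Arimoto iteration to find a capacity-achieving input distribution. Blahut-Arimoto is swapped in here purely for illustration; it is an assumption and not necessarily the optimizer used in the manuscript, and all function names are hypothetical.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits for a joint probability table joint[i][j]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)


def reweight(joint, w):
    """Apply P'_XY(i, j) = w(i) * P_XY(i, j), checking sum_i w(i) P_X(i) = 1."""
    px = [sum(row) for row in joint]
    total = sum(wi * pi for wi, pi in zip(w, px))
    if not math.isclose(total, 1.0, rel_tol=1e-9):
        raise ValueError("weights must satisfy sum_i w(i) P_X(i) = 1")
    return [[wi * p for p in row] for wi, row in zip(w, joint)]


def blahut_arimoto(cond, iterations=500):
    """Capacity-achieving input distribution for a channel cond[i][j] = P(j | i)."""
    n_in, n_out = len(cond), len(cond[0])
    p = [1.0 / n_in] * n_in
    for _ in range(iterations):
        q = [sum(p[i] * cond[i][j] for i in range(n_in)) for j in range(n_out)]
        # Kullback-Leibler divergence (in nats) of each row from the output marginal
        d = [sum(c * math.log(c / q[j]) for j, c in enumerate(cond[i]) if c > 0)
             for i in range(n_in)]
        unnorm = [p[i] * math.exp(d[i]) for i in range(n_in)]
        z = sum(unnorm)
        p = [u / z for u in unnorm]
    return p
```

      Given the measured marginal P_X(i) and an optimized distribution p_opt(i), the weightings in this sketch are simply w(i) = p_opt(i) / P_X(i), so the constraint above holds automatically.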

      A more minor but important point: the inputs and outputs of the communication channel are never explicitly defined, which makes the meaning of the results unclear. When evaluating the capacity of an information channel, the inputs X and outputs Y should be carefully defined, so that the mutual information I(X;Y) is meaningful; the mutual information is then maximized to obtain capacity. Although it can be inferred that the input X is the ligand concentration, and the output Y is the expression of GFP, it would be helpful if this were stated explicitly.

      We agree with the Reviewer’s suggestion to describe the input and output more explicitly in the manuscript. Therefore, we have modified Figure 1A and B and the main text to describe the source of the input and output more clearly, as follows:

      Line 92: Accounting for the stochastic behavior of cellular signaling, information theory provides robust and quantitative tools to analyze complex communication channels. A fundamental metric of information theory is entropy, which determines the amount of disorder or uncertainty of variables. In this respect, cellular signaling pathways with highly variable initiating input signals (e.g. stimulants) and a correspondingly highly variable output response (i.e. cellular signaling) can be characterized as having high entropy. Importantly, input and output can have mutual dependence, and therefore knowing the input distribution partly provides information about the output distribution. If noise is present in the communication channel, input and output have reduced mutual dependence. This mutual dependence between input and output is called mutual information. Mutual information is, therefore, a function of the input distribution, and the upper bound of mutual information is called channel capacity (SI section 1) (Cover and Thomas, 2012). In this report, a communication channel describes the signal transduction pathway of a C-type lectin receptor, which ultimately leads to NF-κB translocation and finally GFP expression in the reporter model (Fig. 1A). To quantify the signaling information of the communication channels, we used channel capacity. Importantly, the channel capacity does not merely describe the resulting maximum intensity of the reporter cells. The channel capacity takes cellular variation and activation across a whole range of incoming stimulus of single cell resolved data into account and quantifies all of that data in a single number.

    1. Author Response

      Reviewer #3 (Public Review):

      The authors examine the role of secreted BAFF in senescence phenotypes in THP1 AML cells and primary human fibroblasts. In the former, BAFF is found to potentiate the inflammatory phenotype (SASP) and in the latter to potentiate cell cycle arrest. This is an important study because the SASP is still largely considered in generic and monolithic terms, and it is necessary to deconvolute the SASP and examine its many components individually and in different contexts.

      Although the results show differences for BAFF in the two cell models, there are many places where key results are missing and the results over-interpreted and/or missing controls.

      1) Figure 1. Test whether the upregulation of BAFF is specific to senescence, or also in reversible quiescence arrest.

      We appreciate the Reviewer’s requests. We performed experiments in fibroblasts and THP-1 cells to assess BAFF levels in quiescence. As shown below in the figure for Reviewers, we induced quiescence in fibroblasts by serum starvation (0.1%) for 96 h and confirmed the quiescent state by measuring two markers of quiescence (reduction of CCND1 mRNA and reduction of phospho-S6 when compared to cycling cells, following markers established previously; PMID 25483060) (panel A). In this case, the level of BAFF mRNA was increased upon quiescence (panel B).

      In THP-1 cells, we tried to induce quiescence by serum starvation and glutamine depletion for 96 h. Unfortunately, however, inducing quiescence in THP-1 cells was rather challenging, likely because they are cancer cells. Thus, we observed a reduction of cell proliferation in both conditions, but we observed a reduction in phospho-S6 only in the samples without glutamine (panel C). We failed to see increased BAFF mRNA levels in quiescent THP-1 cells after either serum starvation or glutamine depletion (panel D).

      In summary, further studies will be necessary to fully understand if the increased expression of BAFF seen in senescent cells is also observed in other conditions of growth suppression (such as quiescence or differentiation), as well as whether this effect is specific to different cell types.

      2) Figure 1, Supplement 1G. Show negative control IgG for immunofluorescence.

      We thank the Reviewer for this suggestion. Along with other changes during the revision, we decided to remove the immunofluorescence data in order to include more informative data.

      3) All results with siRNA should be validated with at least 2 individual siRNAs to eliminate the possibility of off-target effects.

      We agree with the Reviewer on the importance of testing individual siRNAs. For BAFF, we originally tested two independent siRNAs (BAFF#1 and BAFF#2) individually, but we also pooled them for additional analysis (referred to simply as “BAFFsi” throughout the manuscript). In the revised version of our manuscript, we included the key experiments performed with these two individual BAFF siRNAs. Upon BAFF silencing in THP-1 cells, we observed a reduction of SASP factors and SA-β-Gal activity levels with each individual siRNA (Figure 4-Figure Supplement 1D-F) and with the pooled siRNAs (Figure 4C). For WI-38 cells, we observed a reduction of p53 levels with individual and pooled siRNAs (Figure 7-Figure Supplement 1A), as well as a reduction in IL6 levels and SA-β-Gal activity (Figure 6-Figure Supplement 1D,E). After IRF1 silencing, we observed a reduction in BAFF pre-mRNA with two different pairs of CTRLsi and IRF1si pools (Figure 2I and supplementary Figure 2E). For the data on BAFF receptors, we used SMARTpools from Dharmacon, which are combinations of 4 siRNAs designed by the company to minimize off-target effects. These additions and clarifications are indicated in the revised manuscript.

      4) To confirm a role for IRF1 in the activation of BAFF, the authors should confirm the binding of IRF1 to the BAFF promoter by ChIP or ChIP-seq.

      We thank the Reviewer for this suggestion. We performed ChIP-qPCR analysis in THP-1 cells that were either proliferating or rendered senescent after exposure to IR (Figure 2H, Materials and methods section), and we confirmed the binding of IRF1 to the proximal promoter region of BAFF. As anticipated, this interaction was stronger after inducing senescence.

      5) Key antibodies should be validated by siRNA knockdown of their targets, for example, TACI, BCMA, and BAFF-R in Figure 5. Note that there is an apparent discrepancy between BCMA data in Figure 5B vs 5C.

      We fully agree with the Reviewer on this point and we thank him/her for helping us to improve this part of our manuscript. To address the discrepancy regarding BCMA western blot analysis and flow cytometry data, we silenced BCMA in THP-1 cells and tested two different antibodies advertised to recognize BCMA. This experiment allowed us to identify the correct band for BCMA by western blot analysis. We then confirmed that BCMA is upregulated in senescence, as observed by both western blot and flow cytometry analyses. We have modified the manuscript to reflect these changes. Please find these data in Figure 5A,B and Figure 5-Figure Supplement 1A of the revised manuscript.

      6) Figure 5E. Negative/specificity controls for this assay should be shown.

      We thank the reviewer for this comment and regret that we were unable to provide a negative control. The kit only provides a competitive wild-type oligomer used to test the specificity of the binding. For each sample (CTRLsi, BAFFsi, CTRLsi IR, BAFFsi IR) and each antibody tested (p65, p50, p52, RelB and c-Rel), we evaluated the reductions in signal upon addition of excess competitive oligomer per well (20 pmol/well) compared to wells with an inactive oligomer. However, the negative control was performed only as a single replicate, due to the limited quantity of nuclear extracts and the high number of samples and antibodies analyzed. We therefore considered this control as being ‘qualitative’ rather than fully ‘quantitative’.

      7) Hybridization arrays such as Figure 5H, Figure 6 - Supplement 1I, and Figure 6H should be shown as quantitated, normalized data with statistics from replicates.

      We appreciate this request. We have included the quantification and statistics for the phosphoarrays used for THP-1 and WI-38 cells, which had been performed in triplicate (Figure 7A, Figure 5-Figure Supplement 1D). The original arrays are shown in the respective Source Data Files. In the interest of space, we removed the cytokine array performed on IMR-90 cells and left instead the quantitative ELISA for IL6 (Figure 6-Figure Supplement 1F). The data obtained from the cytokine array analysis in Figure 4F and Figure 4-Supplemental Figure 1C are supported by quantitative multiplex ELISA measurements (Figure 4E and Figure 4C).

      8) Figure 6B - Supplement 1. Controls to confirm fractionation (i.e., non-contamination by cytosolic and nuclear proteins) should be shown.

      We thank the Reviewer for this suggestion. We tested the efficiency of fractionation and we did in fact observe some degree of contamination from cytosolic proteins using the earlier version of the kit (Pierce, cat. 89881). We therefore purchased an improved version of the kit (Pierce, cat. A44390) and repeated the surface fractionation assay, which this time showed improved fractionation (Figure 7-Figure Supplement 1B). Interestingly, with the improved fractionation strategy, we observed that BAFF receptors in fibroblasts were almost exclusively localized inside the cell and not on the surface, as we found in THP-1 cells. Further validation of BAFF receptor antibodies has been provided in Figure 5-Figure Supplement 1A. As described in the text, the intracellular localization of BAFF receptors was previously reported in other cell types and conditions (PMID 31137630, PMID 19258594, PMID 30333819, PMID 10903733), and thus it is possible that BAFF may act through non-canonical mechanisms in WI-38 cells. Nonetheless, we did detect a small amount of BAFFR on the cell surface, and furthermore, BAFFR silencing reduced the level of p53 in fibroblasts. Therefore, we propose that BAFFR may be the primary receptor involved in p53 regulation in fibroblasts (Figure 7-Figure Supplement 1B,C). Our data on BAFF receptors deserve deeper characterization in a future study of the functions of BAFF receptors in senescence.

      9) Figure 6A. Knockdown of BAFF should be shown by western blot.

      Yes, definitely. We appreciate this comment and have included BAFF knockdown data in fibroblasts by western blot analysis (Figure 7B).

      10) Figure 6G. Although BAFF knockdown decreases the expression of p53, p21 increases. How do the authors explain this?

      We thank the Reviewer for the interesting question. We too were surprised to observe that the p53-dependent transcripts regulated by BAFF did not include CDKN1A (p21) mRNA, as confirmed by western blot analysis. The accumulation of p21 in senescence can also be regulated by p53-independent pathways and in p53-/- cells, for example by p90RSK, SP1, and ZNF84 (PMID 24136223, PMID 25051367, PMID 33925586). Eventually, we removed the data related to p21 and γ-H2AX in favor of other data and to streamline the content of this manuscript for the reader.

    1. Author Response

      Reviewer #1 (Public Review):

      1-1. I do have some concerns that the differences in network clustering reported in Fig 6 may be due to noise and I think the comparisons against the HCP parcellation could be more robust. Specifically, with regard to the network clustering in Fig 6. The authors use a clustering algorithm (which is not explained) to cluster the parcels into different functional networks. They achieve this by estimating the mean time series for each parcel in each individual, which they then correlate between the n regions, to generate an nxn connectivity matrix. This they then binarise, before averaging across individuals within an age group. It strikes me that binarising before averaging will artificially reduce connections for which only a subset of individuals are set to zero. Therefore averaging should really occur before binarising. Then I think the stability of these clusters should be explored by creating random repeat and generation groups (as done for the original parcels) or just by bootstrapping the process. I would be interested to see whether after all this the observation that the posterior frontoparietal expands to include the parahippocampal gyrus from 3-6 months and then disappears at 9 months - remains.

      We thank the reviewer for this insightful comment on our clustering process. For the step of “binarizing before averaging”, we followed the method proposed by Yeo et al. (1). In this method, all correlation matrices are binarized according to individual-specific thresholds. Specifically, each individual-specific threshold is determined according to a percentile, so that only the top 10% of connections are kept and set to 1, while all other connections are set to 0. Yeo et al. (1) explained their motivation for doing so as “the binarization of the correlation matrix leads to significantly better clustering results, although the algorithm appears robust to the particular choice of the threshold”. A possible reason is that the binarization of connectivity in each individual offers a certain level of normalization, so that each subject contributes the same number of connections. If averaging occurs before binarizing, the actual connectivity contributed by different subjects would differ, which leads to bias. We also tested the stability of ‘binarizing first’ and ‘averaging first’, and the result is shown in Fig. R1 below. This figure suggests a similar conclusion to that of (1), namely that binarizing before averaging leads to better clustering stability. We added the motivation for binarizing before averaging in the revised manuscript between line 577 and line 581.

      Fig. R1. The comparison of clustering stability of different methods. The red line refers to the clustering stability when binarizing the correlation matrices first and then averaging the matrices across individuals, while the blue line refers to the clustering stability when averaging the correlation matrices across individuals first and then binarizing the average matrix.
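
      The “binarize first, then average” step can be sketched as follows. The sketch assumes one symmetric correlation matrix per subject, a single subject-wide threshold that keeps the strongest 10% of off-diagonal connections, and no tie handling; the function names are hypothetical and the details of the actual pipeline may differ.

```python
def binarize_top_fraction(corr, keep=0.10):
    """Set the strongest `keep` fraction of off-diagonal connections to 1, others to 0."""
    n = len(corr)
    upper = sorted((corr[i][j] for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    k = max(1, round(keep * len(upper)))
    threshold = upper[k - 1]  # subject-specific threshold; ties may keep extra edges
    return [[int(i != j and corr[i][j] >= threshold) for j in range(n)]
            for i in range(n)]


def average_matrices(matrices):
    """Element-wise mean of equally sized square matrices."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
            for i in range(n)]


def binarize_then_average(subject_corrs, keep=0.10):
    """Each subject contributes the same number of connections to the group matrix."""
    return average_matrices([binarize_top_fraction(c, keep) for c in subject_corrs])
```

      Because the threshold is set per subject, a subject with globally weak correlations still contributes as many connections as a subject with strong ones, which is the normalization effect described in the response.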

      For the final clustering results, we repeated our clustering method on 100 bootstrap samples, and the final result is a majority vote for each parcel. The comparison of these two results is shown in Fig. R2. Overall, we do observe good repeatability between these two results. However, we also observed that some parcels show different patterns between the two results, especially those parcels that are spatially located around the boundaries of networks or the medial wall. The pattern of the observation that “the posterior frontoparietal expands to include the parahippocampal gyrus from 3-6 months and then disappears at 9 months – remains” was not repeated in the bootstrapped results. These results might suggest that the clustering method is quite robust, the discovered patterns are relatively stable, and the differences between our original results and the bootstrapping results might be caused by noise or inter-subject variability.

      Fig. R2. Top panel: the network clustering results using all data in the original manuscript. Bottom panel: the network clustering results using majority voting through 100 times of bootstrapping. Black circles and red arrows point to the parahippocampal gyrus, which was included in the posterior frontoparietal network, and is not well repeated in the bootstrapped results. (M: months)
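
      A bootstrap with per-parcel majority voting can be sketched as below. The sketch assumes that `cluster_fn` is a user-supplied function returning one network label per parcel and that labels are already matched across bootstrap runs; in practice cluster labels are arbitrary and would need to be aligned first. Both the function name and its signature are hypothetical.

```python
import random
from collections import Counter


def bootstrap_majority_vote(subjects, cluster_fn, n_boot=100, seed=0):
    """Resample subjects with replacement, cluster each resample, vote per parcel."""
    rng = random.Random(seed)
    votes = None
    for _ in range(n_boot):
        resample = [rng.choice(subjects) for _ in subjects]
        labels = cluster_fn(resample)  # one label per parcel, assumed aligned
        if votes is None:
            votes = [Counter() for _ in labels]
        for counter, label in zip(votes, labels):
            counter[label] += 1
    # Most frequent label for each parcel across all bootstrap runs
    return [counter.most_common(1)[0][0] for counter in votes]
```

      Parcels whose label flips across resamples, such as those near network boundaries, are the ones where the voted result can differ from a single run on all data.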

      1-2. Then with regard to the comparison against the HCP parcellation, this is only qualitative. The authors should see whether the comparison is quantitatively better relative to the null clusterings that they produce.

      Thank you for this great suggestion! As suggested, we added this quantitative comparison using the Hausdorff distance. As in the comparison of parcel variance and homogeneity, 1,000 null parcellations were created by randomly rotating our parcellation by small angles on the spherical surface 1,000 times. We compared our parcellation with the null parcellations by evaluating their Hausdorff distances to specific areas of the HCP parcellation on the spherical space, including Brodmann's area 2, 3b, 4+3a, 44+45, V1, and MT+MST. The results are shown in Figure 4. Our parcellation generally shows significantly lower Hausdorff distances to the HCP parcellation, suggesting that it generates parcel borders that are closer to those of the HCP parcellation than the null parcellations do.

      However, we noticed that a few null parcellations show smaller Hausdorff distances than our parcellation. A possible reason is that our surface registration to the HCP template is based purely on cortical folding, without using functional gradient density maps, which are not available for the HCP template. As a result, high-quality functional alignment between our infant data and the HCP space is not ensured, which inevitably increases the Hausdorff distance between our parcellation and the HCP parcellation.
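
The comparison against null parcellations can be sketched in a few lines. This is a minimal illustration under stated assumptions: border vertices are given as unit vectors on the sphere, distance is measured as great-circle angle, and the function names are hypothetical. The random rotation used to create the null parcellations is not shown, and the response does not specify these implementation details.

```python
import math


def geodesic(u, v):
    """Great-circle angle (radians) between two unit vectors on the sphere."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def hausdorff(points_a, points_b, dist=geodesic):
    """Symmetric Hausdorff distance between two sets of border points."""

    def directed(source, target):
        # Largest distance from any source point to its nearest target point.
        return max(min(dist(p, q) for q in target) for p in source)

    return max(directed(points_a, points_b), directed(points_b, points_a))


def fraction_of_nulls_at_least_as_close(observed, null_distances):
    """Share of null parcellations whose distance is <= the observed distance."""
    return sum(d <= observed for d in null_distances) / len(null_distances)


# Toy border sets on the unit sphere.
border_ours = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
border_reference = [(1.0, 0.0, 0.0)]
observed_distance = hausdorff(border_ours, border_reference)

# Hypothetical null distances, e.g. from rotated copies of the parcellation.
null_fraction = fraction_of_nulls_at_least_as_close(0.2, [0.1, 0.3, 0.5, 0.4])
```

A small `null_fraction` corresponds to the situation described above, in which only a few null parcellations lie closer to the reference borders than the actual parcellation.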

      1-3. … not all individuals appear (from Fig 8) to be acquired exactly at the desired timepoints, so maybe the authors might comment on why they decided not to apply any kernel weighted or smoothing to their averaging? Pg. 8 'and parcel numbers show slight changes that follow a multi-peak fluctuation, with inflection ages of 9 and 18 months' explain - the parcels per age group vary - with age with peaks at 9 and 18 - could this be due to differences in the subject numbers, or the subjects that were scanned at that point?

      We do agree with the reviewer that subjects are not scanned at similar time points. This is designed in the data acquisition protocol to seamlessly cover the early postnatal stage so that we will have a quasi-continuous observation of the dynamic early brain development.

      We didn’t apply a kernel-weighted average or smoothing when generating the parcellation, as we would like each scan to contribute equally, so that each parcellation map is representative of the whole cohort in the covered age range instead of only part of it. Meanwhile, our final ‘age-common parcellation’ could be representative of all subjects from birth to 2 years of age. However, we do agree that for a parcellation map designed for use at a specific age, e.g., 1-year-olds, a kernel-weighted average or even a more restricted age range could be a more appropriate solution.

      For the parcel number that likely shows fluctuations with subject numbers, we added an experiment, where we randomly selected 100 scans by considering the minimum scan number in each age group using bootstrapping and repeated this process 100 times. The average parcel number of each age is reported in the following Table R1. We didn’t observe strong changes in parcel numbers when reducing scan numbers, which further demonstrates that our parcel numbers do not show a strong relation to subject numbers. However, the parcel number does not increase greatly from 18M to 24M in the bootstrapping results, so we modified the statement in the manuscript about the parcel number to ‘… all parcel numbers fall between 461 to 493 per hemisphere, where the parcel number attains a maximum at around 9 months and then reduces slightly and remains relatively stable afterward. …’, which can be found between line 121 and line 122.

      1-4. I also have some residual concerns over the number of parcels reported, specifically as to whether all of this represents fine-grained functional organisation, or whether some of it represents noise. The number of parcels reported is very high. While Glasser et al 2016 reports 360 as a lower bound, it seems unlikely that the number of parcels estimated by that method would greatly exceed 400. This would align with the previous work of Van Essen et al (which the authors cite as 53) which suggests a high bound of 400 regions. While accepting Eickhoff's argument that a more modular view of parcellation might be appropriate, these are infants with underdeveloped brain function.

      We thank the reviewer for this insightful comment. We agree that some of the parcels might reflect noise, as noise exists in each step, such as data acquisition, image processing, surface reconstruction, and registration, especially considering that functional MRI is noisier than structural MRI. Though our experiments show that our parcellation is fine-grained and suitable for studying infant brain functional development, it is hard to validate directly and quantitatively, as there is no ground truth available.

      Despite this, we are still motivated to create fine-grained parcellations: with the growth of larger and higher-resolution imaging datasets and advanced computational methods, parcellations with more fine-grained regions are desired for downstream analyses, especially considering the hierarchical nature of brain organization (2). The main reason that our method generates much finer parcellation maps is that both our registration and parcellation processes are based on the functional gradient density, which characterizes a fine-grained feature map based on fMRI. This leads to both better inter-subject alignment of functional boundaries and finer region partitions. This strategy differs from that of Glasser et al. (3), which jointly considers multimodal information for defining parcel boundaries, so parcels revealed purely by functional MRI might be ignored in the HCP parcellation. We hope our parcellation framework can be a useful reference for this research direction. We added this discussion in the revised manuscript between line 268 and line 271.

      For the parcel number, even without performing surface registration based on fine-grained functional features, recent adult fMRI-based parcellations have greatly increased parcel numbers, such as up to 1,000 parcels in Schaefer et al. (4), 518 parcels in Peng et al. (5), and 1,600 parcels in Zhao et al. (6). For infants, we do agree that functional connectivity might not be as strong as in adults. However, there are opinions (7-9) that the basic units of functional organization are likely to be present in infant brains, and that brain functional development gradually shapes the brain networks. Therefore, the functional parcel units in infants could possibly be on a scale comparable to that of adults. Even so, we do agree that more research needs to be performed on larger datasets for better evaluations. We added this discussion in the revised manuscript between line 275 and line 280.

      1-5. Further comparisons across different subjects based on small parcels increases the chances of downstream analyses incorporating image registration noise, since as Glasser et al 2016 noted, there are many examples of topographic variation, which diffeomorphic registration cannot match. Therefore averaging across individuals would likely lose this granularity. I'm not sure how to test this beyond showing that the networks work well for downstream analyses but I think these issues should be discussed.

      We agree with the reviewer that averaging across individuals inevitably brings some registration errors to the parcellation, especially for regions with high topographic variation across subjects, which would lead to loss of granularity in these regions. We believe this is an important issue that exists in most methods on group-level parcellations, and the eventual solution might be individualized parcellation, which will be our future work. We added this discussion in the revised manuscript between line 288 and line 292.

      We also agree with the reviewer that downstream analyses are important evaluations for parcellations. We provided a beta version of our parcellation with 602 parcels (10) to our colleagues, and they tested our parcellation in the task of infant individual recognition across ages using functional connectivity, to explore infant functional connectome fingerprinting (10). We compared the performance of different parcellations with 602 ROIs (our beta version), 360 ROIs (HCP MMP parcellation (3)), and 68 ROIs (FreeSurfer parcellation (11)). The results (Fig. R3) show that our parcellation with a higher parcellation number yields better accuracy compared to other parcellations. We added a description of this downstream application in the discussion between line 284 and line 287.

      Fig. R3. The comparison of different parcellations for infant individual recognition across age based on functional connectivity (figure source: Hu et al. (10)). The parcellation with 602 ROIs is the beta version of our parcellation, 360 ROIs stands for HCP MMP parcellation (3) and 68 ROIs stands for the FreeSurfer parcellation (11). This downstream task shows that a higher parcellation number does lead to better accuracy in the application.

      1-6. Finally, I feel the methods lack clarity in some areas and that many key references are missing. In general I don't think that key methods should be described only through references to other papers. And there are many references, particular to FSL papers, that are missing.

      We thank the reviewer for this great suggestion. We added the relevant references for FLIRT, FSL, MCFLIRT, and TOPUP. For the alignment to the HCP 32k_LR space, we first aligned all subjects to the fsaverage space using spherical demons, then used part of the HCP pipeline (12) to map the surface from the fsaverage space to the HCP 164k_LR space, and finally downsampled to the 32k_LR space. We now cite the HCP pipeline by Glasser et al. (12) for this step and detail the registration process in the revised manuscript between line 434 and line 440, as below:

      “… The population-mean surface maps were mapped to the HCP 164k ‘fs_LR’ space using the deformation field that deforms the ‘fsaverage’ space to the ‘fs_LR’ space released by Van Essen et al. (13), which was obtained by landmark-based registration. By concatenating the three deformation fields of steps 1, 3, and 4, we directly warped all cortical surfaces from individual scan spaces to the HCP 164k_LR space and then resampled them to 32k_LR using the HCP pipeline (12), thus establishing vertex-to-vertex correspondences across individuals and ages …”

      Reviewer #2 (Public Review):

      2-1. Diminishing enthusiasm is the lack of focus in the result section, the frequent use of jargon, and figures that are often difficult to interpret. If those issues are addressed, the proposed atlas could have a high impact in the field especially as it is aligned with the template of the Human Connectome Project.

      We’d like to thank Reviewer #2 for the appreciation of our atlas. According to the reviewer’s suggestion, we went through the manuscript again by focusing on correcting the use of jargon, clarity in the result section, as well as figures and figure captions. We hope our corrections can help explain our work to a broader community. Our revisions are accordingly detailed in the following. Meanwhile, our parcellation maps have been aligned with the templates in HCP and FreeSurfer and made available via NITRC at: https://www.nitrc.org/projects/infantsurfatlas/.

      References

      1. B. Thomas Yeo, F. M. Krienen, J. Sepulcre, M. R. Sabuncu, D. Lashkari, M. Hollinshead, J. L. Roffman, J. W. Smoller, L. Zöllei, J. R. Polimeni, The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of neurophysiology 106, 1125-1165 (2011).

      2. S. B. Eickhoff, R. T. Constable, B. T. Yeo, Topographic organization of the cerebral cortex and brain cartography. NeuroImage 170, 332-347 (2018).

      3. M. F. Glasser, T. S. Coalson, E. C. Robinson, C. D. Hacker, J. Harwell, E. Yacoub, K. Ugurbil, J. Andersson, C. F. Beckmann, M. Jenkinson, S. M. Smith, D. C. Van Essen, A multi-modal parcellation of human cerebral cortex. Nature 536, 171-178 (2016).

      4. A. Schaefer, R. Kong, E. M. Gordon, T. O. Laumann, X.-N. Zuo, A. J. Holmes, S. B. Eickhoff, B. T. Yeo, Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity MRI. Cerebral Cortex 28, 3095-3114 (2018).

      5. L. Peng, Z. Luo, L.-L. Zeng, C. Hou, H. Shen, Z. Zhou, D. Hu, Parcellating the human brain using resting-state dynamic functional connectivity. Cerebral Cortex, (2022).

      6. J. Zhao, C. Tang, J. Nie, Functional parcellation of individual cerebral cortex based on functional mri. Neuroinformatics 18, 295-306 (2020).

      7. W. Gao, S. Alcauter, J. K. Smith, J. H. Gilmore, W. Lin, Development of human brain cortical network architecture during infancy. Brain Structure and Function 220, 1173-1186 (2015).

      8. W. Gao, H. Zhu, K. S. Giovanello, J. K. Smith, D. Shen, J. H. Gilmore, W. Lin, Evidence on the emergence of the brain's default network from 2-week-old to 2-year-old healthy pediatric subjects. Proceedings of the National Academy of Sciences 106, 6790-6795 (2009).

      9. K. Keunen, S. J. Counsell, M. J. Benders, The emergence of functional architecture during early brain development. NeuroImage 160, 2-14 (2017).

      10. D. Hu, F. Wang, H. Zhang, Z. Wu, Z. Zhou, G. Li, L. Wang, W. Lin, G. Li, U. U. B. C. P. Consortium, Existence of Functional Connectome Fingerprint during Infancy and Its Stability over Months. Journal of Neuroscience 42, 377-389 (2022).

      11. R. S. Desikan, F. Ségonne, B. Fischl, B. T. Quinn, B. C. Dickerson, D. Blacker, R. L. Buckner, A. M. Dale, R. P. Maguire, B. T. Hyman, An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage 31, 968-980 (2006).

      12. M. F. Glasser, S. N. Sotiropoulos, J. A. Wilson, T. S. Coalson, B. Fischl, J. L. Andersson, J. Xu, S. Jbabdi, M. Webster, J. R. Polimeni, The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage 80, 105-124 (2013).

    1. Author Response

      Reviewer #1 (Public Review):

      The authors present a strong set of experiments to uncover what type of role non-mutant stromal cells might be playing in the development of VM and AST, two vascular lesions that share some similarities.

      Questions about experimental design.

      1) For quantification of gene expression in VM and AST specimens in Figure 2, the methods say qPCR data were normalized to housekeeping genes, but it would be helpful to normalize to endothelial content. It might be that increased TGFa is due to increased endothelium.

      We thank the Reviewer for this excellent suggestion. As suggested, we have now added new data in which TGFA mRNA is normalized to the mRNA of the endothelial marker PECAM-1/CD31. A trend towards increased expression of TGFA mRNA was detected in VM/AST specimens in comparison to the control group. We also show in the manuscript that, besides CD31-positive vascular structures, TGFA is expressed in intervascular areas, i.e. between the vessels, in the patients’ lesions (Fig. 2) and in lesion-derived CD31-negative intervascular stromal cells. Altogether, these data i) demonstrate that TGFA is also expressed in cell types other than endothelial cells and ii) indicate that the increased expression of TGFA in lesion samples is not only due to increased vasculature/endothelium in the patient samples.

      The new RT-qPCR data has now been added to the manuscript as a new Fig. 2 - figure supplement 1.

      2) The mutant allelic frequency for the HUVEC-PIK3CA WT versus HUVEC-PIK3CA H1047R should be provided. This is critically needed for the interpretation of the results.

      Thank you for this valuable comment. To confirm that PIK3CA H1047R is still present in transduced HUVECs at the end-point of the mouse xenograft experiment, we performed a new ddPCR analysis detecting the fractional abundance of PIK3CA p.H1047R in the matrigel plug samples. In this new data, the mean fractional abundance of PIK3CA p.H1047R in fibroblast-containing PIK3CA H1047R EC plugs was 27.1 % (range 26.5-27.8 %; n=2 mice in duplicates). This corresponds to ~54 % of PIK3CA p.H1047R mutation-positive cells in the plug, assuming a single copy of the mutation in each cell. In the control group, i.e. plugs with fibroblasts and PIK3CAwt ECs, no positivity was detected, as all the cells express the wild-type form of the PIK3CA gene. Please see Author Response Image 1 for representative 2D amplification plots of the mutation analysis. Fractional abundances of PIK3CA mutations in the patient tissue samples and patient-derived CD31+ cells can also be seen in Table 1 and were in the range of 5-12 % (whole tissue) and 44-51 % (EC fraction).
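
The conversion from fractional abundance to the share of mutation-positive cells can be written out as a short worked example. It is a sketch only: the function name and the two-allele default are assumptions that encode the response's stated premise of a single mutant copy per cell at a locus with two copies in total.

```python
def mutant_cell_fraction(fractional_abundance, mutant_copies_per_cell=1, total_copies_per_cell=2):
    """Estimate the fraction of cells carrying the mutation.

    fractional_abundance: mutant copies / all copies of the locus in the sample.
    Assumes every cell has `total_copies_per_cell` copies of the locus and every
    mutant cell carries `mutant_copies_per_cell` mutant copies (heterozygous by default).
    """
    return fractional_abundance * total_copies_per_cell / mutant_copies_per_cell


# Mean and range reported above for the fibroblast-containing PIK3CA H1047R EC plugs.
mean_fraction = mutant_cell_fraction(0.271)
low_fraction = mutant_cell_fraction(0.265)
high_fraction = mutant_cell_fraction(0.278)
```

Under these assumptions a fractional abundance of 27.1 % gives 2 × 27.1 % = 54.2 % mutation-positive cells, in line with the ~54 % quoted above. The estimate would change if the locus copy number differed from two.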

      3) From Figure 5, it appears that the human primary fibroblasts are not required for the mutant ECs to form perfused vessels (panel H).

      We thank the Reviewer for the comment and agree that, based on our H&E staining and erythrocyte analysis, perfused vessels are evident not only in PIK3CA mutant plugs containing ECs with fibroblasts but also in plugs containing ECs alone. This was expected, as the PIK3CA mutation in ECs alone has been shown to be a driver of venous malformation. However, prior to our study, the role of fibroblasts in PIK3CA-driven lesions had not been studied. To better understand the role of fibroblasts in lesion formation, we have now added new data to the manuscript containing example images of the PIK3CA H1047R plugs with or without fibroblasts, together with a new quantification of their erythrocyte content. Please see Author Response Image 2. Our data demonstrate that lesions with fibroblasts have significantly i) more CD31-positive vascular structures (Fig. 5E-G), ii) larger lumens (Fig. 5D-F) and iii) more erythrocyte-containing regions, indicative of perfused vessels (new Fig. 5H), than plugs containing ECs alone. This implies that fibroblasts further promote PIK3CA-driven EC lesion formation.

      Author Response Image 2. Vascular structures formed with PIK3CA H1047R ECs alone and PIK3CA H1047R ECs + FBs in mouse xenograft plugs. In the figure panel, H&E staining on each individual plug in these groups is presented. Equal size close-up images were taken from the middle of each plug covering > 50% of plug area (scale bar 250µm). More erythrocytes (red) are seen in the plugs with fibroblasts in comparison to ECs alone. Scanned images of the H&E stained whole tissue sections can be seen in the Fig. 5 – source file.

      A new quantitative analysis of the erythrocyte-positive area in relation to the whole plug area was additionally performed using the SproutAngio quantification tool. The analysis was done in a blinded manner and showed a significantly increased erythrocyte amount in the plugs containing PIK3CA H1047R ECs and fibroblasts (in comparison to ECs alone). A description of the analysis has now been added to the manuscript (p. 42, rows 839-843). Figures 5G and 5H in the manuscript were updated to show statistics and automated intensity-based quantification of the erythrocyte-positive area per plug instead of erythrocyte scoring (scale 0-3).

      Is it possible that TGFa from the ECs is sufficient to drive vascular malformation?

      Mutations in genes such as PIK3CA, TEK and KRAS have been shown to drive the formation of vascular anomalies. Thus, it is unlikely that a single growth factor, such as increased expression of TGFA, would drive this process alone. That being said, our data show that TGFA is able to regulate the proliferation of PIK3CA-mutated ECs via a secondary mechanism (Fig. 4F), and we show that inhibition of the EGFR pathway is able to reduce PIK3CA-driven lesion growth in mice (Fig. 7). As our bulk RNA-sequencing data from patient-derived cells also showed expression of other growth factors in lesion ECs (Table 3), it is likely that multiple angiogenic growth factors are involved in lesion formation, similarly to tumors, and that their expression is driven primarily by the mutated cells and secondarily by cell-cell crosstalk with other lesion cell types. Thus, targeting multiple signalling pathways could be a beneficial treatment strategy in the future.

      Reviewer #2 (Public Review):

      In this manuscript, Ilmonen H. et al explored potential crosstalk between endothelial cells and fibroblasts in a context of sporadic vascular malformation (venous malformation and angiomatoses of soft tissue). With a high level of evidence, they found that mutated endothelial cells secrete TGFA that will activate surrounding fibroblasts, leading in turn to VEGFA secretion that will stimulate endothelial cell sprouting and vascular malformation development. Experiments are well-designed and support their hypothesis. Some controls are missing, particularly in Fig. 2. Indeed, it is mandatory to provide data from healthy skin biopsies (that are available in many laboratories): TGFa, CD31, P-EGFR staining.

      We thank the Reviewer for the comments. Although VM commonly presents in skin, in this work we focused solely on intramuscular and subcutaneous AST and VM patient samples and excluded samples containing skin from this study. We performed TGFA immunostaining on healthy skeletal muscle, which can be seen in Figure 2 – figure supplement 2B. CD31 staining of vessels in healthy skeletal muscle near the resection margin can be seen in Figure 1B. Please also see below the tissue locations of all VM and AST samples in this study:

      • Intramuscular, 42.1 % of lesions (n=16)

      • Intramuscular and subcutaneous, 21.1 % of lesions (n=8)

      • Intramuscular, subcutaneous and synovial membrane, 5.3 % of lesions (n=2)

      • Intramuscular and synovial membrane, 2.6 % of lesions (n=1)

      • Subcutaneous and synovial membrane, 2.6 % of lesions (n=1)

      • Subcutaneous only, 26.3 % of lesions (n=10)

      • Skin, none of the lesions
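
As a quick consistency check, the percentages in the list above can be recomputed from the sample counts. The sketch below is illustrative only; the total of 38 lesions is not stated in the response and is derived here by summing the listed counts, and the variable names are hypothetical.

```python
# Counts per tissue location, copied from the list above.
lesion_counts = {
    "Intramuscular": 16,
    "Intramuscular and subcutaneous": 8,
    "Intramuscular, subcutaneous and synovial membrane": 2,
    "Intramuscular and synovial membrane": 1,
    "Subcutaneous and synovial membrane": 1,
    "Subcutaneous only": 10,
}

# Derived total (assumption: the listed categories cover all lesions).
total_lesions = sum(lesion_counts.values())

# Percentage of lesions per location, rounded to one decimal place.
lesion_percentages = {
    location: round(100 * count / total_lesions, 1)
    for location, count in lesion_counts.items()
}
```

With a derived total of 38 lesions, the recomputed values match the percentages given in the list.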

    1. Author Response

      Reviewer #1 (Public Review):

      Redox signaling is a dynamic and concerted orchestra of inter-connected cellular pathways. There is always a debate whether ROS (reactive oxygen species) could be a friend or foe. Continued research is needed to dissect out how ROS generation and progression could diverge in physiological versus pathophysiological states. Similarly, there are several paradoxical studies (both animal and human) wherein exercise health benefits were reported to be accompanied by increases in ROS generation. It is in this context, that the present manuscript deserves attention.

      Utilizing the in-vitro studies as well as mice model work, this manuscript illustrates the different regulatory mechanisms of exercise and antioxidant intervention on redox balance and blood glucose level in diabetes. The manuscript does have some limitations and might need additional experiments and explanation.

      The authors should consider addressing the following comments with additional experiments.

      1) Although hepatic AMPK activation appears to be a central signaling element for the benefits of moderate exercise and glucose control, additional signals (on hepatic tissue) related to hepatic gluconeogenesis such as Forkhead box O1 (FoxO1), phosphoenolpyruvate carboxykinase (PEPCK), and GLUT2 needs to be profiled to present a holistic approach. Authors should consider this and revise the manuscript.

      We appreciate the constructive suggestion. Besides glycolysis, gluconeogenesis and glucose uptake are critical in maintaining liver and blood glucose homeostasis.

      FoxO1 has been tightly linked with hepatic gluconeogenesis through promoting the transcription of the gluconeogenesis-related genes PEPCK and G6Pase (1, 2). Herein, we found that the expression of FoxO1 increased in the diabetic group but was reduced in the CE, IE and EE groups (Fig. X1A, Fig.5E-F in manuscript). Meanwhile, the mRNA levels of Pepck and G6PC (one of the three G6Pase catalytic-subunit-encoding genes) also decreased in the CE, IE, and EE groups (Fig. X1B-1C, Fig.5H-I in manuscript). These results indicate that all three modes of exercise inhibited gluconeogenesis through down-regulating FoxO1.

      For glucose uptake, we measured the protein expression of GLUT2 in the liver tissue. GLUT2 helps in the uptake of glucose by hepatocytes for glycolysis and glycogenesis. We found that GLUT2, a glucose sensor in the liver, was up-regulated in diabetic rats but down-regulated by the CE and IE interventions. However, GLUT2 didn’t decrease in the EE group, which is consistent with the lack of improvement in blood glucose with the EE intervention (Figure X1A, Fig.5E and 5G in manuscript).

      Taken together, moderate exercise could benefit glucose control by increasing glycolysis and decreasing gluconeogenesis. We added this part on Page 9, lines 251-263, and in Figure 5E-5I of this version.

      Figure X1. A. Representative protein level and quantitative analysis of FOXO1 (82 kDa), GLUT2 (60-70 kDa) and Actin (45 kDa) in the rats in the Ctl, T2D, T2D + CE, T2D + IE and T2D + EE groups. C-D. Expression of hepatic Pepck and G6PC mRNA in the Ctl, T2D, T2D + CE, T2D + IE and T2D + EE groups were evaluated by real-time PCR analysis. Values represent mean ratios of Pepck and G6PC transcripts normalized to GAPDH transcript levels.

      2) Very recently sestrin2 signaling is assumed significant attention in relation to exercise and antioxidant responses. Therefore, authors should profile the sestrin2 levels as it is linked to several targets such as mTOR, AMPK and Sirt1. Additionally, the levels of Nrf2 should be reported as this is the central regulator of the threshold mechanisms of oxidative stress and ROS generation.

      We appreciate the reviewer’s expert comments. Nrf2 is an important mediator of antioxidant signaling, playing a fundamental role in maintaining the redox homeostasis of the cell. Under unstressed conditions, Nrf2 activity is suppressed by its innate repressor Kelch-like ECH-associated protein 1 (Keap1) (3). As the ROS level increases during the development of diabetes, Nrf2 is activated to induce the transcription of several antioxidant enzymes (4, 5).

      The Nrf2 expression level has been reported to increase in HFD mice or diabetic patients (6, 7). In vitro studies have found that Nrf2 activation is achieved with acute exposure to high glucose, whereas longer incubation times or oscillating glucose concentrations failed to activate Nrf2 (8, 9). These findings suggest that the increase of ROS in diabetes can cause compensatory upregulation of Nrf2. In our study, we found that Nrf2 increased in diabetic rats, which can further initiate the expression of antioxidant enzymes. As shown in Fig.X2A (Fig.2H-2K in manuscript), Grx and Trx, which are involved in thioredoxin metabolism, were up-regulated along with Nrf2. After CE intervention, the level of Nrf2 increased further (Fig.2E-2F), suggesting that CE intervention could activate the antioxidant system to achieve a high-level redox balance. We have added these new results to Figure 2.

      On the other hand, the expression levels of Sestrin2 and Nrf2 decreased after antioxidant supplementation. Our results suggest that the antioxidant treatment improved diabetes by lowering the ROS level to achieve a low-level redox balance, whereas moderate exercise enhanced ROS tolerance to achieve a high-level balance (Fig.X2D-F, Fig.3E-3G in manuscript).

      We added the new data in “Page 5 line 147-153 and Page 7 line 183-186” and Figure 2-3 in current version.

      Figure X2. A-C. Representative protein level and quantitative analysis of Nrf2 (97 kDa), Sestrin2 (57 kDa) and Actin (45 kDa) in the rats in the Ctl, T2D and T2D + CE groups. D-F. Representative protein level and quantitative analysis of Nrf2 (97 kDa), Sestrin2 (57 kDa) and HSP90 (90 kDa) in the rats in the Ctl, T2D and T2D + APO groups.

      3) Authors should discuss the exercise-associated hormesis curve. They should discuss whether moderate exercise could decrease the sensitivity to oxidative stress by altering the bell-shaped dose-response curve.

      We thank the reviewer for these valuable comments. Radak et al. proposed a bell-shaped dose-response curve between normal physiological function and the level of ROS in healthy individuals, and suggested that moderate exercise can extend or stretch the levels of ROS while increasing physiological function (10). Our results support this hypothesis and further suggest that moderate exercise produces ROS while increasing antioxidant enzyme activity, maintaining a high-level redox balance along the bell-shaped curve, whereas excessive exercise generates a higher level of ROS, leading to reduced physiological function. In this study, we found that the state of diabetic individuals is better described by an S-shaped curve, due to the high level of oxidative stress and decreased reduction level in diabetic individuals (Fig.8B). With the increase of ROS, the physiological function of diabetic individuals gradually decreases and enters a state of redox imbalance. Moderate exercise shifts the S-shaped curve into a bell-shaped dose-response curve, thus reducing the sensitivity to oxidative stress in diabetic individuals and restoring redox homeostasis. However, with excessive exercise, ROS production increases beyond the threshold range of redox balance, resulting in decreased physiological function (Fig.8B, see the decreasing portion of the bell curve to the right of the apex).

      In contrast, the antioxidant intervention increased physiological activity by reducing ROS levels in diabetic individuals, restoring a bell-shaped dose-response curve at a low level of ROS (Fig.8B). Redox balance could thus be achieved either at a low level of ROS mediated by antioxidant intervention or at a high level of ROS mediated by moderate exercise, both of which were regulated by AMPK activation. In other words, both high and low levels of redox balance can lead to high physiological function as long as they remain within the redox balance threshold range. The activation of AMPK is therefore an important sign that exercise or antioxidant intervention has achieved a dynamic redox balance, which helps restore physiological function. Accordingly, we speculate that antioxidant intervention on top of moderate exercise might offset the effect of exercise, but that antioxidants could be beneficial during excessive exercise. A human study also supports the view that supplementation with antioxidants may preclude the health-promoting effects of exercise (11). Personalized intervention with respect to redox balance will therefore be crucial for the effective treatment of diabetes patients.

      We added this part into “Discussion” in this version (Page 13-14 line 389-418).

      4) It would not be ideal to single-out AMPK as a sole biomarker in this manuscript. Instead, authors should consider AMPK activation and associated signaling in relation to redox balance. This should also be presented in Fig 7.

      We thank the reviewer for these critical comments. Accordingly, we have discussed AMPK signaling in the Discussion (Page 13, line 373-384) and added the AMPK signaling to Fig.8A.

      References:

      1. R. A. Haeusler, K. H. Kaestner, D. Accili, FoxOs function synergistically to promote glucose production. J Biol Chem 285, 35245-35248 (2010).
      2. J. Nakae, T. Kitamura, D. L. Silver, D. Accili, The forkhead transcription factor Foxo1 (Fkhr) confers insulin sensitivity onto glucose-6-phosphatase expression. J Clin Invest 108, 1359-1367 (2001).
      3. M. McMahon, K. Itoh, M. Yamamoto, J. D. Hayes, Keap1-dependent proteasomal degradation of transcription factor Nrf2 contributes to the negative regulation of antioxidant response element-driven gene expression. J Biol Chem 278, 21592-21600 (2003).
      4. R. S. Arnold et al., Hydrogen peroxide mediates the cell growth and transformation caused by the mitogenic oxidase Nox1. Proc Natl Acad Sci U S A 98, 5550-5555 (2001).
      5. J. M. Lee, M. J. Calkins, K. Chan, Y. W. Kan, J. A. Johnson, Identification of the NF-E2-related factor-2-dependent genes conferring protection against oxidative stress in primary cortical astrocytes using oligonucleotide microarray analysis. J Biol Chem 278, 12029-12038 (2003).
      6. T. Jiang et al., The protective role of Nrf2 in streptozotocin-induced diabetic nephropathy. Diabetes 59, 850-860 (2010).
      7. X. H. Wang et al., High Fat Diet-Induced Hepatic 18-Carbon Fatty Acids Accumulation Up-Regulates CYP2A5/CYP2A6 via NF-E2-Related Factor 2. Front Pharmacol 8, 233 (2017).
      8. T. S. Liu et al., Oscillating high glucose enhances oxidative stress and apoptosis in human coronary artery endothelial cells. J Endocrinol Invest 37, 645-651 (2014).
      9. Z. Ungvari et al., Adaptive induction of NF-E2-related factor-2-driven antioxidant genes in endothelial cells in response to hyperglycemia. Am J Physiol Heart Circ Physiol 300, H1133-1140 (2011).
      10. Z. Radak et al., Exercise, oxidants, and antioxidants change the shape of the bell-shaped hormesis curve. Redox Biol 12, 285-290 (2017).
      11. M. Ristow et al., Antioxidants prevent health-promoting effects of physical exercise in humans. Proc Natl Acad Sci U S A 106, 8665-8670 (2009).
    1. Author Response

      Reviewer #1 (Public Review):

      In one of the most creative eDNA studies I have had the pleasure to review, the authors have taken advantage of an existing program several decades old to address whether insect declines are indeed occurring - an active area of discussion and debate within ecology. Here, they extracted arthropod environmental DNA (eDNA) from pulverized leaf samples collected from different tree species across different habitats. Their aim was to assess the arthropod community composition within the canopies of these trees during the time of collection to assess whether arthropod richness, diversity, and biomass were declining. By utilizing these leaf samples, the greatest shortcoming of assessing arthropod declines - the lack of historical data to compare to - was overcome, and strong timeseries evidence can now be used to inform the discussion. Through their use of eDNA metabarcoding, they were able to determine that richness was not declining, but there was evidence of beta diversity loss due to biotic homogenization occurring across different habitats. Furthermore, their application of qPCR to assess changes in eDNA copy number temporally and associate those changes with changes to arthropod biomass provided support to the argument that arthropod biomass is indeed declining. Taken together, these data add substantial weight to the current discussion regarding how arthropods are being affected in the Anthropocene.

      Thank you very much for the positive assessment of our work.

      I find the conclusions of the paper to be sound and mostly defensible, though there are some issues to take note of that may undermine these findings.

      Firstly, I saw no explanation of the requisite controls for such an experiment. An experiment of this scale should have detailed explanations of the field/equipment controls, extraction controls, and PCR controls to ensure there are no contamination issues that would otherwise undermine the entirety of the study. The presence of controls is mentioned just once in the manuscript, so I surmise they must exist. The results should be treated with caution until such evidence is clearly outlined. Furthermore, the plate layout which includes these controls would help assess the extent of tag-jumping, should the plate plan proposed in Taberlet et al., 2018 be adopted.

      Second, without the presence of adequate controls, filtering schemes would be unable to determine whether there were contaminants and also be unable to remove them. This would also prevent samples from being filtered out should there be excessive levels of contamination present. Without such information, it makes it difficult to fully trust the data as presented.

      Finally, there is insufficient detail regarding the decontamination procedures of equipment used to prepare the samples (e.g., the cryomill). Without clear explanations of the steps the authors took to ensure samples were handled and prepared correctly, there is yet more concern that there may be unseen problems with the dataset.

      We are well aware of the potential issues and consequences of contamination in our work. However, we are also confident that our field and laboratory procedures adequately rule out these issues. We agree with the reviewer that we should expand more on our reasoning. Hence, we have now significantly expanded the Methods section outlining controls and sample purity, particularly under “Tree samples of the German Environmental Specimen Bank – Standardized time series samples stored at ultra-low temperatures” (lines 303-304), “Test for DNA carryover in the cryomill” (lines 448-464) and “Statistical analysis” (lines 570-575).

      We ran negative control extractions as well as negative control PCRs with all samples. These controls were sequenced along with all samples and used to explore the effect of experimental contamination. With the exception of a few reads of abundant taxa, these controls were mostly clean. We report this in more detail now in the Methods under “Sequence analysis” (lines 570-575). This suggests that our data are free of experimental contamination or tag jumping issues.

      We have also expanded on the avoidance of contamination in our field sampling protocols. The ESB has been set up for monitoring even the tiniest trace amounts of chemicals. Carryover between samples would render the samples useless. Hence, highly clean and standardized protocols are implemented. All samples are only collected with sterilized equipment under sterile conditions. Each piece of equipment is thoroughly decontaminated before sampling.

      The cryomill is another potential source of cross-contamination. The mill is disassembled after each sample and thoroughly cleaned. Milled samples have already been tested for chemical carryover, and none was found. We have now added an additional analysis to rule out DNA carryover. We received the milling schedule of samples for the past years. If samples were contaminated by carryover between milling runs, two consecutive samples should show signatures of this carryover. We tested this for single-taxon carryover as well as community-wide beta diversity, but did not find any signal of contamination. This gives us confidence that our samples are very pure. The results of this test are now reported in the manuscript (Suppl. Fig 12 & Suppl. Table 3).
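      The logic of such a carryover test can be sketched as a permutation test: if DNA carried over between milling runs, samples milled back to back should be more similar than samples paired at random. The following is only an illustrative sketch, not the analysis reported in the manuscript; the function names, the choice of Jaccard dissimilarity, and the toy data are all assumptions.

```python
import random


def jaccard_dissimilarity(a, b):
    """Return 1 - |A & B| / |A | B| for two sets of detected taxa."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_consecutive_dissimilarity(order, communities):
    """Mean dissimilarity between samples that are adjacent in `order`."""
    values = [
        jaccard_dissimilarity(communities[x], communities[y])
        for x, y in zip(order, order[1:])
    ]
    return sum(values) / len(values)


def carryover_permutation_test(order, communities, n_perm=2000, seed=0):
    """One-sided test: are consecutively milled samples unusually similar?

    Returns the observed mean dissimilarity and the fraction of shuffled
    milling orders that are at least as similar as the observed order.
    """
    rng = random.Random(seed)
    observed = mean_consecutive_dissimilarity(order, communities)
    shuffled = list(order)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_consecutive_dissimilarity(shuffled, communities) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: sample id -> set of taxa detected in that sample.
toy_communities = {
    "s1": {"a", "b"},
    "s2": {"b", "c"},
    "s3": {"c", "d"},
    "s4": {"d", "e"},
}
toy_order = ["s1", "s2", "s3", "s4"]  # hypothetical milling schedule
```

      A small p-value would point to consecutive samples being more alike than expected by chance; a large one is consistent with no detectable carryover.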

      Reviewer #2 (Public Review):

      Krehenwinkel et al. investigated the long-term temporal dynamics of arthropod communities using environmental DNA (eDNA) remaining in archived leaf samples. The authors first developed a method to recover arthropod eDNA from archived leaf samples and carefully tested whether the developed method could reasonably reveal the dynamics of arthropod communities where the leaf samples originated. Then, using the eDNA method, the authors analyzed 30-year-long well-archived tree leaf samples in Germany and reconstructed the long-term temporal dynamics of arthropod communities associated with the tree species. The reconstructed time series includes several thousand arthropod species belonging to 23 orders, and the authors found interesting patterns in the time series. Contrary to some previous studies, the authors did not find widespread temporal α-diversity (OTU richness and haplotype diversity) declines. Instead, β-diversity among study sites gradually decreased, suggesting that the arthropod communities are more spatially homogenized in recent years. Overall, the authors suggested that the temporal dynamics of arthropod communities may be complex and involve changes in α- and β-diversity and demonstrated the usefulness of their unique eDNA-based approach.

      Strengths:

      The authors' idea of using eDNA remaining in archived leaf samples is unique and potentially applicable to other systems. For example, different types of specimens archived in museums may be utilized for reconstructing long-term community dynamics of other organisms, which would be beneficial for understanding and predicting ecosystem dynamics.

      A great strength of this work is that the authors very carefully tested their method. For example, the authors tested the effects of powdered leaves input weights, sampling methods, storing methods, PCR primers, and days from last precipitation to sampling on the eDNA metabarcoding results. The results showed that the tested variables did not significantly impact the eDNA metabarcoding results, which convinced me that the proposed method reasonably recovers arthropod eDNA from the archived leaf samples. Furthermore, the authors developed a method that can separately quantify 18S DNA copy numbers of arthropods and plants, which enables the estimations of relative arthropod eDNA copy numbers. While most eDNA studies provide relative abundance only, the DNA copy numbers measured in this study provide valuable information on arthropod community dynamics.

      Overall, the authors' idea is excellent, and I believe that the developed eDNA methodology reasonably reconstructed the long-term temporal dynamics of the target organisms, which are major strengths of this study.

      Thank you very much for the positive assessment of our work.

      Weaknesses:

      Although this work has major strengths in the eDNA experimental part, there are concerns in DNA sequence processing and statistical analyses.

      Statistical methods to analyze the temporal trend are too simplistic. The methods used in the study did not consider possible autocorrelation and other structures that the eDNA time series might have. It is well known that the applications of simple linear models to time series with autocorrelation structure incorrectly detect a "significant" temporal trend. For example, a linear model can often detect a significant trend even in a random walk time series.

      We have now reanalyzed our data controlling for autocorrelation and for non-linear changes of abundance and recover no change to our results. We have added this information to the manuscript under “Statistical analysis” (lines 629-644).
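      The reviewer's point about autocorrelation can be illustrated with a small simulation. The sketch below is not the reanalysis described above; it only shows, under assumed settings (30 time points, Gaussian steps, a two-sided 5% t threshold of about 2.048 for 28 degrees of freedom), how often an ordinary least-squares trend test flags a "significant" slope in trend-free white noise versus a trend-free random walk. All function names are hypothetical.

```python
import random


def ols_trend_t(y):
    """t-statistic of the slope from an OLS fit of y on time index 0..n-1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    rss = sum((v - intercept - slope * t) ** 2 for t, v in enumerate(y))
    slope_se = (rss / (n - 2) / sxx) ** 0.5
    return slope / slope_se


def white_noise(rng, n):
    """Independent Gaussian values: no trend, no autocorrelation."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def random_walk(rng, n):
    """Cumulative sum of Gaussian steps: no drift, strong autocorrelation."""
    level, series = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        series.append(level)
    return series


def false_positive_rate(make_series, n_series=2000, n_points=30,
                        t_crit=2.048, seed=1):
    """Share of simulated trend-free series with |t| above t_crit."""
    rng = random.Random(seed)
    hits = sum(
        abs(ols_trend_t(make_series(rng, n_points))) > t_crit
        for _ in range(n_series)
    )
    return hits / n_series
```

      Comparing `false_positive_rate(white_noise)` with `false_positive_rate(random_walk)` shows why a trend model that accounts for autocorrelation is the safer choice for time series of this kind.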

      Also, there are some issues regarding the DNA sequence analysis and the subsequent use of the results. For example, read abundance was used in the statistical model, but the read abundance cannot be a proxy for species abundance/biomass. Because the total 18S DNA copy numbers of arthropods were quantified in the study, multiplying the sequence-based relative abundance by the total 18S DNA copy numbers may produce a better proxy of the abundance of arthropods, and the use of such a better proxy would be more appropriate here. In addition, a coverage-based rarefaction enables a more rigorous comparison of diversity (OTU diversity or haplotype diversity) than the read-based rarefaction does.

      We did not use read abundance as a proxy for abundance, but used our qPCR approach to measure relative copy number of arthropods. While there are biases to this (see our explanations above), the assay proved very reliable and robust. We thus believe it should indeed provide a rough estimate of biomass. As biomass is very commonly discussed in insect decline (in fact the first study on insect decline entirely relies on biomass; Hallmann et al. 2017), we feel it is important to include a proxy for this as well. However, we also discuss the alternative option that a turnover of diversity is affecting the measured biomass. A pattern of abundance loss for common species has been described in other works on insect decline.

      We liked the reviewer’s suggestion to use copy number information to perform abundance-informed rarefaction. We have done this now and added an additional analysis rarefying by copy number/biomass. A parallel analysis using this newly rarefied table was done for the total diversity as well as single species abundance change. Details can be found in the Methods and Results section of the manuscript. However, the result essentially remains the same. Even abundance-informed rarefaction does not lead to a pattern of loss of species richness over time (see “Statistical analysis”).
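      The scaling the reviewer proposed (relative read abundance multiplied by the total copy number of the sample) can be written down in a few lines. This is a minimal sketch of that arithmetic only, with hypothetical names and numbers; it is not the rarefaction procedure used in the manuscript.

```python
def copy_number_scaled_abundance(read_counts, total_copy_number):
    """Scale per-OTU read proportions by the sample's total copy number.

    read_counts: dict mapping OTU id -> read count in one sample.
    total_copy_number: qPCR-derived arthropod copy number for that sample
    (relative or absolute; the unit carries through to the result).
    """
    total_reads = sum(read_counts.values())
    if total_reads == 0:
        raise ValueError("sample has no reads")
    return {
        otu: total_copy_number * reads / total_reads
        for otu, reads in read_counts.items()
    }


# Hypothetical sample: 75% of reads from otu1, 25% from otu2.
example = copy_number_scaled_abundance({"otu1": 75, "otu2": 25}, 1000.0)
```

      Two samples with identical read proportions but different total copy numbers then receive different per-OTU abundance estimates, which is the property the reviewer asked for.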

      The overall results are supporting a scenario of no overall loss of species richness over time, but a loss of abundance for common species. And we indeed see the pattern of declining abundance for once-common species in our data, for example the loss of the Green Silver-Line moth, once a very common species in beech canopy (Suppl. Fig. 10). We have added details on this to the Discussion (lines 254-260).

      These points may significantly impact the conclusions of this work.

      Reviewer #3 (Public Review):

      The aim of Weber and colleagues' study was to generate arthropod environmental DNA extracted from a unique 30-year time series of deep-frozen leaf material sampled at 24 German sites that represent four different land use types. Using this dataset, they explore how the arthropod community has changed through time in these sites, using both conventional metabarcoding to reconstruct the OTUs present, and a new qPCR assay developed to estimate the overall arthropod diversity on the collected material. Overall their results show that while no clear changes in alpha diversity are found, the β-diversity dropped significantly over time in many sites, most notably in the beech forests. Overall I believe their data supports these findings, and thus their conclusion that diversity is becoming homogenized through time is valid.

      Thank you for the positive assessment.

      While overall I do not doubt the general findings, I have a number of comments. Firstly while I agree this is a very nice study on a unique dataset - other temporal datasets of insects that were used for eDNA studies do exist, and perhaps it would be relevant to put the findings into context (or even the study design) of other work that has been done on such datasets. One example that jumps to my mind is Thomsen et al. 2015 https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2656.12452 but I am sure there are others.

      We have expanded the introduction and discussion on this citing this among other studies now (lines 71-72, 276-278).

      From a technical point of view, the conclusions of course rely on several assumptions, including (1) that the biomass assay is effective and (2) that the reconstructed levels of OTU diversity are accurate.

      With regards to biomass, although it is stated in the manuscript that "Relative eDNA copy number should be a predictor for relative biomass", this is in fact only true if one assumes a number of things, e.g. that there is a similar copy number of 18S rDNA per species, similar numbers of mtDNA per cell, a similar number of cells per individual species, etc. In this regard, on the positive side, it is gratifying to see that the authors perform a validation assay on 7 mock controls, and these seem to indicate the assay works well. Given how critical this is, I recommend discussing the details of this a bit more, and why the authors are convinced the assay is effective, in the main text so that the reader is able to fully decide if they are in agreement. However, on the negative side, I am concerned that the strategy taken to perform the qPCR may not have been ideal. Specifically, the assay is based on nested PCR, where the authors first perform a 15-cycle amplification, this product is purified, then put into a subsequent qPCR. Given that PCR is notorious for introducing amplification biases in general (especially when performed on low levels of DNA), and the fact that nested PCRs are notoriously contamination prone, this approach seems to be asking for trouble. This raises the question: why not just do the qPCR directly on the extracts (one can still dilute the plant DNA 100x prior to qPCR if needed)? Further, given the qPCRs were run in triplicate I think the full data (Ct values) for this should be released (as opposed to just stating in the paper that the average values were used). In this way, the readers will be able to judge how replicable the assay was, something I think is critical given how noisy the patterns in Fig S10 seem to be.

      We agree with this point, and this is why we do not want to overstate the decline in copy number. This is an additional source of data next to genetic and species diversity. We have added to our discussion of turnover as another potential driver of copy number change (lines 257-260). We have also added text addressing the robustness of the mock community assay (lines 138-141).

      However, we are confident of the reliability and robustness of our qPCR assay for the detection of relative arthropod copy number. We performed several validations and optimizations before using the assay. We have added additional details to the manuscript on this (see “Detection of relative arthropod DNA copy number using quantitative PCR”, lines 548-556). We got the idea for the nested qPCR from a study (Tran et al.) showing its high accuracy and reproducibility. We show that our assay has a very high replicability using triplicates of each qPCR, which we will now include in the supplementary data on Dryad. The SD of Ct values is very low (~ 0.1 on average). NTC were run with all qPCRs to rule out contamination as an issue in the experiments. We also find a very high efficiency of the assay. At dilutions far outside the observed copy number in our actual leaf data, we still find the assay to be accurate. We found very comparable abundance changes across our highly taxonomically diverse mock communities. This also suggests that abundance changes are a more likely explanation than simple turnover for the observed drop in copy number. A biomass loss for common species is well in line with recent reports on insect decline. We can also rely on several other mock community studies (Krehenwinkel et al. 2017 & 2019) where we used read abundance of 18S and found it to be a relatively good predictor of relative biomass.
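      Two of the quantities mentioned above, the spread of triplicate Ct values and the amplification efficiency, follow from standard qPCR arithmetic. The sketch below assumes the usual standard-curve relation, efficiency = 10^(-1/slope) - 1, where the slope comes from regressing mean Ct on log10 of template quantity; the function names and example numbers are hypothetical and do not reproduce the authors' data.

```python
import math
import statistics


def ct_replicate_summary(ct_values):
    """Mean and sample standard deviation of replicate Ct values."""
    return statistics.mean(ct_values), statistics.stdev(ct_values)


def amplification_efficiency(log10_quantity, mean_ct):
    """Efficiency from a dilution series via the standard-curve slope.

    An efficiency of 1.0 means perfect doubling per cycle, which
    corresponds to a slope of about -3.32 Ct per 10-fold change.
    """
    n = len(log10_quantity)
    x_mean = sum(log10_quantity) / n
    y_mean = sum(mean_ct) / n
    sxx = sum((x - x_mean) ** 2 for x in log10_quantity)
    sxy = sum(
        (x - x_mean) * (y - y_mean)
        for x, y in zip(log10_quantity, mean_ct)
    )
    slope = sxy / sxx
    return 10 ** (-1.0 / slope) - 1.0


# Hypothetical 10-fold dilution series with perfect doubling per cycle.
dilutions = [0.0, -1.0, -2.0, -3.0]
ideal_ct = [20.0 + math.log2(10) * k for k in range(4)]
```

      Releasing the per-replicate Ct values, as the reviewer requested, lets any reader recompute both quantities in this way.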

      The pattern in Fig. S10 is not really noisy. It just reflects typical population fluctuations for arthropods. Most arthropod taxa undergo very pronounced temporal abundance fluctuations between years.

      Next, with regards to the observation that the results reveal an overall decrease in arthropod biomass over time: The authors suggest one alternate to their theory, that the dropping DNA copy number may reflect taxonomic turnover of species with different eDNA shedding rates. Could there be another potential explanation - simply be that leaves are getting denser/larger? Can this be ruled out in some way, e.g. via data on leaf mass through time for these trees? (From this dataset or indeed any other place).

      This is a very good point. However, we can rule out this hypothesis, as the ESB performs intensive biometric data analysis. The average leaf weight and water content have not significantly changed in our sites. We have addressed this in the Methods section (see ”Tree samples of the German Environmental Specimen Bank – Standardized time series samples stored at ultra-low temperatures”, lines 308-311).

      With regards to estimates of OTU/zOTU diversity. The authors state in the manuscript that zOTUs represent individual haplotypes, thus genetic variation within species. This is only true if they do not represent PCR and/or sequencing errors. Perhaps therefore they would be able to elaborate (for the non-computational/eDNA specialist reader) on why their sequence processing methods rule out this possibility? One very good bit of evidence would be that identical haplotypes for the individual species are found in the replicate PCRs. Or even between different extractions at single locations/timepoints.

      We have repeated the analysis of genetic variation with much more stringent filtering criteria (see “Statistical analysis”, lines 611-615). Among other filtering steps, this also includes the use of only those zOTUs that occur in both technical replicates, as suggested by the reviewer. Another reason we believe we are dealing with true haplotypic variation here is that haplotypes show geographic variation. E.g., some haplotypes are more abundant in some sites than in others. NUMTs, in contrast, would consistently show a simple correlation in their abundance with the most abundant true haplotype.
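      The replicate-based filter described above can be sketched as a simple intersection of the zOTUs found in two technical replicates. This is only an illustration of that one criterion under assumed inputs (per-replicate read-count dictionaries and a hypothetical `min_reads` threshold); the other filtering steps in the manuscript are not represented.

```python
def zotus_in_both_replicates(replicate_a, replicate_b, min_reads=1):
    """Keep zOTUs detected in both technical replicates of one sample.

    replicate_a, replicate_b: dicts mapping zOTU id -> read count.
    min_reads: minimum reads required in each replicate (an assumption).
    Returns a dict of retained zOTU ids -> summed read count.
    """
    kept = {}
    for zotu, reads_a in replicate_a.items():
        reads_b = replicate_b.get(zotu, 0)
        if reads_a >= min_reads and reads_b >= min_reads:
            kept[zotu] = reads_a + reads_b
    return kept


# Hypothetical replicates: z3 and z4 each appear in only one replicate.
rep1 = {"z1": 120, "z2": 8, "z3": 2}
rep2 = {"z1": 95, "z2": 11, "z4": 1}
retained = zotus_in_both_replicates(rep1, rep2)
```

      Sequences that arise from random PCR or sequencing errors are unlikely to recur identically in an independent replicate, which is why this criterion removes many spurious zOTUs.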

      With regards to the bigger picture, one thing I found very interesting from a technical point of view is that the authors explored how modifying the mass of plant material used in the extraction affects the overall results, and basically find that using more than 200mg provides no real advantage. In this regard, I draw the authors' and readers' attention to an excellent paper by Mata et al. (https://onlinelibrary.wiley.com/doi/full/10.1111/mec.14779), where these authors compare the effect of increasing the amount of bat faeces used in a bat diet metabarcoding study on the OTUs generated. Essentially Mata and colleagues report that as the amount of faeces increases, the rare taxa (e.g. those found at a low level in a single faeces) get lost: they are simply diluted out by the common taxa (e.g. those in all faeces). In contrast, increasing biological replicates (in their case more individual faecal samples) increased diversity. I think these results are relevant in the context of the experiment described in this new manuscript, as they seem to show similar results: there is no benefit of considerably increasing the amount of leaf tissue used. And if so, this seems to point to a general principle of relevance to the design of metabarcoding studies, thus of likely wide interest.

      Thank you for this interesting study, which we were not aware of before. The cryomilling is an extremely efficient approach to equally disperse even traces of chemicals in a sample. This has been established for trace chemicals early during the operation of the ESB, but also seems to hold true for eDNA in the samples. We have recently done more replication experiments from different ESB samples (different terrestrial and marine samples for different taxonomic groups) and find that replication of extraction does not provide much more benefit than replication of PCR. Even after 2 replicates, diversity approaches saturation. This can be seen in the plot below, which shows recovered eDNA diversity for different ESB samples and different taxonomic groups from 1-4 replicates. A single extract of a small volume contains DNA from nearly all taxa in the community. Rare taxa can be enriched with more PCR replicates.

    1. Author response

      Reviewer #1 (Public Review):

      This careful study reports the importance of Rab12 for Parkinson's disease associated LRRK2 kinase activity in cells. The authors carried out a targeted siRNA screen of Rab substrates and found lower pRab10 levels in cells depleted of Rab12. It has previously been reported that LLOMe treatment of cells breaks lysosomes and with time, leads to major activation of LRRK2 kinase. Here they show that LLOMe-induced kinase activation requires Rab12 and does not require Rab12 phosphorylation to show the effect.

      We thank the reviewer for their comments regarding the carefulness and importance of our work and for their specific feedback which has substantially improved our revised manuscript.

      1) Throughout the text, the authors claim that "Rab12 is required for LRRK2 dependent phosphorylation" (Page 4 line 78; Page 9 line 153; Page 22 line 421). This is not correct according to Figure 1 Figure Supp 1B - there is still pRab10. It is correct only in relation to the LLOMe activation. Please correct this error.

      We appreciate the reviewer’s comment around the requirement of Rab12 for LRRK2-dependent phosphorylation of Rab10 and question regarding whether this is relevant under baseline conditions or only in relation to LLOMe activation. Using our MSD-based assay to quantify pT73 Rab10 levels under basal conditions, we observed a reduction in Rab10 phosphorylation when we knock down Rab12 similar to that observed with LRRK2 knockdown (Figure 1A). Further, we see a reduction in Rab10 phosphorylation in RAB12 KO cells comparable to that observed in LRRK2 KO cells using our MSD-based assay (Figure 2A and B). Based on these data, we believe Rab12 is a key regulator of LRRK2 activation under basal conditions without additional lysosomal damage. However, as the reviewer noted, we do observe some residual Rab10 phosphorylation upon Rab12 knockdown when assessed by western blot analysis (Figure 1D and Figure 1- figure supplement 1). A similar signal is observed upon LRRK2 knockdown, which may suggest that some small amount of Rab10 phosphorylation may be mediated by another kinase in this cell model. Nevertheless, we appreciate this reviewer’s point and have therefore modified the text to remove any reference to Rab12 being required for LRRK2-dependent Rab phosphorylation and now instead refer to Rab12 as a regulator of LRRK2 activity.

      As noted by the reviewer, our data does suggest that Rab12 is required for the increase in Rab10 phosphorylation observed following LLOMe treatment to elicit lysosomal damage, and we now refer to this appropriately throughout the text.

      2) The authors conclude that Rab12 recruitment precedes that of LRRK2 but the rate of recruitment (slopes of curves in 3F and G) is actually faster for LRRK2 than for Rab12 with no proof that Rab12 is faster-please modify the text-it looks more like coordinated recruitment.

      The reviewer raises an excellent point regarding our ability to delineate whether Rab12 recruitment precedes that of LRRK2 on lysosomes following LLOMe treatment. As noted by the reviewer, we do see both the recruitment of Rab12 and LRRK2 to lysosomes increase on a similar timescale, so we cannot truly resolve whether Rab12 recruitment precedes LRRK2 recruitment in our studies. Based on this, we have modified the text to emphasize that this data supports coordinated recruitment, as suggested, and we have further removed any mention of Rab12 preceding LRRK2. The specific change is as follows “Rab12 colocalization with LRRK2 increased over time following LLOMe treatment, supporting potential coordinated recruitment of these proteins to lysosomes upon damage (Figure 3I). Together, these data demonstrate that Rab12 and LRRK2 both associate with lysosomes following membrane rupture.” and can be found on lines 460-463 of the updated manuscript.

      3) The title is misleading because the authors do not show that Rab12 promotes LRRK2 membrane association. This would require Rab12 to be sufficient to localize LRRK2 to a mislocalized Rab12. The authors DO show that Rab12 is needed for the massive LLOME activation at lysosomes. Please re-word the title.

      To address the reviewer’s concern regarding the title of our manuscript, we have modified the title from “Rab12 regulates LRRK2 activity by promoting its localization to lysosomes” to “Rab12 regulates LRRK2 activity by facilitating its localization to lysosomes” to soften the language around the sufficiency of Rab12 in regulating the localization of LRRK2 to lysosomes. We show that Rab12 deletion significantly reduces LRRK2 activity (as assessed by Rab10 phosphorylation on lysosomes) and significantly reduces the localization of LRRK2 to lysosomes upon lysosomal damage. The updated title better reflects the regulatory role of Rab12 in modulating LRRK2 activity, and we thank the reviewer for their suggestion to modify this accordingly.

      Reviewer #2 (Public Review):

      This study shows that rab12 has a role in the phosphorylation of rab10 by LRRK2. Many publications have previously focused on the phosphorylation targets of LRRK2 and the significance of many remains unclear, but the study of LRRK2 activation has mostly focused on the role of disease-associated mutations (in LRRK2 and VPS35) and rab29. The work is performed entirely in an alveolar lung cell line, limiting relevance for the nervous system. Nonetheless, the authors take advantage of this simplified system to explore the mechanism by which rab12 activates LRRK2. In general, the work is performed very carefully with appropriate controls, excluding trivial explanations for the results, but there are several serious problems with the experiments and in particular the interpretation.

      We appreciate the reviewer’s comments regarding the rigor of our work and the potential impact of our studies to address a key unanswered question in the field regarding the mechanisms by which LRRK2 activation is mediated. Our studies focused on the A549 cell model given its high endogenous expression of LRRK2 and Rab10, and this cell line provided a simple system to investigate the mechanism and impact of Rab12-dependent regulation of LRRK2 activity. We agree with the reviewer that future studies are warranted to understand whether similar Rab12-dependent regulation of LRRK2 occurs in relevant CNS cell types.

      First, the authors note that rab29 appears to have a smaller or no effect when knocked down in these cells. However, the quantitation (Fig1-S1A) shows a much less significant knockdown of rab29 than rab12, so it would be important to repeat this with better knockdown or preferably a KO (by CRISPR) before making this conclusion. And the relationship to rab29 is important, so if a better KD or KO shows an effect, it would be important to assess by knocking down rab12 in the rab29 KO background.

      The reviewer raises a good point regarding the importance of confirming that loss of Rab29 has no effect on Rab10 phosphorylation. To address potential concerns about insufficient Rab29 knockdown, we measured the levels of pT73 Rab10 in RAB29 KO A549 cells by MSD-based analysis. RAB29 deletion had no effect on Rab10 phosphorylation, confirming findings from our RAB siRNA screen and the observations of Dario Alessi’s group reported previously (Kalogeropulou et al Biochem J 2020; PMID: 33135724). We have included this new data into our updated manuscript in Figure 1- figure supplement 1 and comment on it on page 6 in the updated Results section.

      Secondly, the knockdown of rab12 generally has a strong effect on the phosphorylation of the LRRK2 substrate rab10 but I could not find an experiment that shows whether rab12 has any effect on the residual phosphorylation of rab10 in the LRRK2 KO. There is not much phosphorylation left in the absence of LRRK2 but maybe this depends on rab12 just as much as in cells with LRRK2 and rab12 is operating independently of LRRK2, either through a different kinase or simply by making rab10 more available for phosphorylation. The epistasis experiment is crucial to address this possibility. To establish the connection to LRRK2, it would also help to compare the effect of rab12 KD on the phosphorylation of selected rabs that do or do not depend on LRRK2.

      The reviewer raises an interesting question regarding whether Rab12 can further reduce Rab10 phosphorylation independently of LRRK2. Using our quantitative MSD-based assay, we observe that pRab10 levels are at the lower limits of detection of the assay in LRRK2 KO A549 cells. Unfortunately, this means that we are unable to detect whether there might be any additional minor reduction in Rab10 phosphorylation with Rab12 knockdown in LRRK2 KO cells. We cannot rule out that Rab12 may play a LRRK2-independent role in regulating Rab10 phosphorylation in other cell lines, and future studies are warranted to explore whether Rab12 knockdown can further reduce Rab10 phosphorylation in other systems, including in CNS cells.

      Regarding exploring the effects of RAB12 knockdown on the phosphorylation of other Rabs, we also assessed the impact of RAB12 KO on phosphorylation of another LRRK2-Rab substrate, Rab8a. We observed a strong reduction in pT72 Rab8a levels in RAB12 KO cells compared to wildtype cells, suggesting the impact of RAB12 deletion extends beyond Rab10 (see representative western blot in Author response image 1). Due to potential concerns with the selectivity of the pT72 Rab8a antibody (potentially detecting the phosphorylation of other LRRK2-Rabs), we cannot definitively demonstrate that Rab12 mediates the phosphorylation of other Rabs. This question should be revisited when additional phospho-Rab antibodies become available that enable us to selectively detect LRRK2-dependent phosphorylation of additional Rab substrates under endogenous expression conditions.

      Author response image 1.

      A strength of the work is the demonstration of p-rab10 recruitment to lysosomes by biochemistry and imaging. The demonstration that LRRK2 is required for this by biochemistry (Fig 4A) is very important but it would also be good to determine whether the requirement for LRRK2 extends to imaging. In support of a causal relationship, the authors also state that lysosomal accumulation of rab12 precedes LRRK2 but the data do not show this. Imaging with and without LRRK2 would provide more compelling evidence for a causative role.

      We thank the reviewer for their suggestion to assess Rab12 recruitment to damaged lysosomes with and without LRRK2 using imaging-based analyses to add confidence to our findings from biochemical approaches. To address this comment, we have imaged the recruitment of mCherry-tagged Rab12 to lysosomes (as assessed using an antibody against endogenous LAMP1) and observed a significant increase in Rab12 levels on lysosomes following LLOMe treatment. This occurs to a similar extent in LRRK2 KO A549 cells, suggesting that Rab12 is an upstream regulator of LRRK2 activity. This new data has been incorporated into the revised manuscript (Figure 3E) and is presented on page 20 of the updated manuscript.

      Our conclusions on this are further strengthened by new data assessing Rab12 recruitment to lysosomes using orthogonal analysis of isolated lysosomes biochemically. Using the Lyso-IP method, we observed a strong increase in the levels of Rab12 on lysosomes following LLOMe treatment that was maintained in LRRK2 KO cells. These data have been added to the updated manuscript (new data added to Figure 3- figure supplement 1).

      Together, these data support our hypothesis that Rab12 recruitment to damaged lysosomes is upstream, and independent, of LRRK2.

      The authors also touch base with PD mutations, showing that loss of rab12 reduces the phosphorylation of rab10. However, it is interesting that loss of rab12 has the same effect with R1441G LRRK2 and D620N VPS35 as it does in controls. This suggests that the effect of rab12 does not depend on the extent of LRRK2 activation. It is also surprising that R1441G LRRK2 does not increase p-rab10 phosphorylation (Fig 2G) as suggested in the literature and stated in the text.

      We agree with the reviewer that it is quite interesting that RAB12 knockdown significantly attenuates Rab10 phosphorylation in the context of PD-linked variants in addition to that observed in wildtype cells basally and after LLOMe treatment. As noted by the reviewer, we did not observe increased levels of phospho-Rab10 in LRRK2 R1441G KI A549 cells at the whole cell level (Figure 2G). However, we observed a significant increase in Rab10 phosphorylation on isolated lysosomes from LRRK2 R1441G KI cells compared to WT cells (Figure 4B). This may suggest that the LRRK2 R1441G variant leads to a more modest increase in LRRK2 activity in this cell model. Previous studies in MEFs from LRRK2 R1441G KI mice or neutrophils from human subjects that carry the LRRK2 R1441G variant showed a 3-4 fold increase in Rab10 phosphorylation (Fan et al Acta Neuropathol 2021 PMID: 34125248 and Karaye et al Mol Cell Proteomics 2020 PMID: 32601174), supporting that this variant does lead to increased Rab10 phosphorylation and that the extent of LRRK2 activation may vary across different cell types.

      Most important, the final figure suggests that PD-associated mutations in LRRK2 and VPS35 occlude the effect of lysosomal disruption on lysosomal recruitment of LRRK2 (Fig 4D) but do not impair the phosphorylation of rab10 also triggered by lysosomal disruption (4A-C). Phosphorylation of this target thus appears to be regulated independently of LRRK2 recruitment to the lysosome, suggesting another level of control (perhaps of kinase activity rather than localization) that has not been considered.

      The reviewer suggests an interesting hypothesis around the existence of additional levels of control, beyond the lysosomal levels of LRRK2, that lead to increased Rab10 phosphorylation on lysosomes. Given the variability we have observed in measuring endogenous LRRK2 levels on lysosomes, we performed two additional replicates to assess lysosomal LRRK2 levels in LRRK2 R1441G KI and VPS35 D620N KI cells at baseline and after treatment with LLOMe. We observed a significant increase in LRRK2 levels on lysosomes in cells expressing either PD-linked variant and a trend toward a further increase in the levels of LRRK2 on lysosomes after LLOMe treatment in these cells (Figure 4D in the updated manuscript). We have updated the text on page 24 to reflect this change, suggesting that the PD-linked variants do not fully occlude the effect of lysosomal disruption on the lysosomal recruitment of LRRK2.

      LLOMe treatment leads to a stronger increase in Rab10 phosphorylation on lysosomes from LRRK2 R1441G and VPS35 D620N cells compared to the modest increase in LRRK2 levels observed. This could suggest that, as the reviewer noted, additional mechanisms beyond increased lysosomal localization of LRRK2 may be driving the robust increase in Rab10 phosphorylation observed. We have modified the results section on lines 548-551 to highlight this possibility: “Rab10 phosphorylation showed a more significant increase in response to LLOMe treatment than LRRK2 on lysosomes from LRRK2 R1441G and VPS35 D620N KI cells, suggesting that there may be more regulation beyond the enhanced proximity between LRRK2 and Rab that contribute to LRRK2 activation in response to lysosomal damage.”

      Reviewer #3 (Public Review):

      Increased LRRK2 kinase activity is known to confer Parkinson's disease risk. While much is known about disease-causing LRRK2 mutations that increase LRRK2 kinase activity, the normal cellular mechanisms of LRRK2 activation are less well understood. Rab GTPases are known to play a role in LRRK2 activation and to be substrates for the kinase activity of LRRK2. However, much of the data on Rabs in LRRK2 activation comes from over-expression studies and the contributions of endogenously expressed Rabs to LRRK2 activation are less clear. To address this problem, Bondar and colleagues tested the impact of systematically depleting candidate Rab GTPases on LRRK2 activity as measured by its ability to phosphorylate Rab10 in the human A549 type 2 pneumocyte cell line. This resulted in the identification of a major role for Rab12 in controlling LRRK2 activity towards Rab10 in this model system. Follow-up studies show that this role for Rab12 is of particular importance for the phosphorylation of Rab10 by LRRK2 at damaged lysosomes. Increases in LRRK2 activity in cells harboring disease-causing mutants of LRRK2 and VPS35 also depend (at least partially) on Rab12. Confidence in the role of Rab12 in supporting LRRK2 activity is strengthened by parallel experiments showing that either siRNA-mediated depletion of Rab12 or CRISPR-mediated Rab12 KO both have similar effects on LRRK2 activity. Collectively, these results demonstrate a novel role for Rab12 in supporting LRRK2 activation in A549 cells. It is likely that this effect is generalizable to other cell types. However, this remains to be established. It is also likely that lysosomes are the subcellular site where Rab12-dependent activation of LRRK2 occurs. Independent validation of these conclusions with additional experiments would strengthen this conclusion and help to address some concerns that much of the data supporting a lysosome localization for Rab12-dependent activation of LRRK2 comes from a single method (LysoIP). Furthermore, there is a discrepancy between panel 4A versus 4D in the effect of LLOMe-induced lysosome damage on LRRK2 recruitment to lysosomes that will need to be addressed to strengthen confidence in conclusions about lysosomes as sites of LRRK2 activation by Rab12.

      We thank the reviewer for their comments regarding our work that identifies Rab12 as a novel regulator of LRRK2 activation and the appreciation of the parallel approaches we employed to add confidence in this effect.

      As suggested by the reviewer, we have updated our manuscript to now include independent validation of our conclusions using imaging-based analyses to complement our data from biochemical analyses using the Lyso-IP method. Specifically, we have included new imaging data that confirms that Rab12 levels are increased on lysosomes following membrane permeabilization with LLOMe treatment and demonstrates that this occurs independent of LRRK2, providing additional support that Rab12 is an upstream regulator of LRRK2 activity (Figure 3E in the updated manuscript).

      Regarding the reviewer’s comment on a discrepancy between our findings in Figure 4A and Figure 4D, we have performed additional independent replicates in Figure 4D to assess the impact of lysosomal damage on the lysosomal levels of LRRK2 at baseline or upon the expression of genetic variants. We observed a significant increase in LRRK2 levels on lysosomes following LLOMe treatment in our set of experiments included in Figure 4A and a non-significant trend toward an increase in LRRK2 levels on isolated lysosomes in Figure 4D. As described in more detail below (in response to the second point raised by this reviewer), we think this variability arises because of a combination of low levels of LRRK2 on lysosomes with endogenous expression and variability across experiments in the efficiency of lysosomal isolation. Our observations of increased recruitment of LRRK2 to lysosomes upon damage are further supported by parallel imaging-based studies (Figure 3F-I) and are consistent with previous studies using overexpression systems.

      We thank the reviewer for all of the suggestions which have added further confidence to our conclusions and substantially improved the manuscript.

    1. Author Response

      Reviewer #1 (Public Review):

      This paper is a follow-up to the authors' previous paper (2018), in which they carefully described the organisation of the junctions between cells of the adult Drosophila midgut epithelium and their control from the basal side by integrin signalling. Here, the authors used state-of-the-art imaging and genetics to unravel step-by-step the events leading from an initially unpolarised cell to an epithelial cell that integrates into the existing epithelium. Many of the images are accompanied by cartoons, which help the reader to better understand the images and follow the conclusions. It would nevertheless have been helpful, in particular with respect to the mutant phenotypes described later, if they had named each of the steps/stages. In addition, mentioning the timescale would give an idea of the time frame over which this process unfolds.

      We have used terms such as “unpolarised cells, polarised Actin/Cno” to label different stages in Figure 6, since this sequence of steps is inferred from results obtained from fixed samples with still images. We have illustrated the septate junction mutant phenotype in Figure 8I.

      We have also performed a new experiment to estimate the time taken for an activated EB to form a PAC and to become a mature enterocyte, overexpressing Sox21a with esg[ts]>GFP to induce enteroblast differentiation. Counting the number of GFP+ve cells without a PAC, with a PAC and with a full apical domain at different time points suggests that activated EBs take about a day to form a PAC and another day to form a fully-integrated enterocyte. We have summarised the results in Figure 5-figure supplement 1C.

      We have also included this result in the main text as: “To estimate the time taken for enteroblasts to progress to pre-enterocytes with a PAC, and for pre-enterocytes to become enterocytes, we induced enterocyte differentiation by over-expressing UAS-Sox21a under the control of esg[ts]-Gal4 and counted the number of GFP+ve cells without a PAC or apical domain, with a PAC and with a full apical domain at different time points after induction (Chen et al., 2016; Meng and Biteau, 2015; Zhai et al., 2017). 17 hours after shifting the flies to 25ºC to inactivate Gal80ts, almost no GFP+ve cells had progressed to pre-EC with a PAC (0.1%) or EC (1%), and these few cells probably started to differentiate before Sox21a induction. 24 hours later, 10% of the GFP+ve cells had developed into pre-ECs with a PAC and 20% had become ECs (Figure 5-figure supplement 1B-C). After an additional 24 hours, the number of cells with a PAC fell to 1%, whereas 50% were ECs. Assuming that it takes 12-17 hours to induce high levels of Sox21a expression, these results suggest that most activated EBs take about 24 hours to develop into a pre-EC with a PAC and a further 24 hours to differentiate into a mature EC, although some cells differentiate faster. This time frame is in agreement with a previous study using similar approaches to accelerate differentiation (Rojas Villa et al., 2019) and a recent live imaging study tracing the enteroblast to enterocyte transition (Tang et al., 2021). These results also indicate that down-regulation of Sox21a is not essential for enteroblast to pre-enterocyte differentiation, since enteroblasts overexpressing Sox21a still form a PAC (Figure 5-figure supplement 1B).”
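
      The timing argument in the quoted passage can be checked with a little arithmetic. The sketch below is only an illustration: the sampling times of 41 and 65 hours are derived here by adding the two 24-hour intervals to the initial 17 hours, and the function name is hypothetical. It subtracts the assumed 12-17 hour induction lag from each sampling time to give the window of effective differentiation time.

```python
# Illustrative arithmetic only; names are hypothetical and the 41 h / 65 h
# sampling times are derived from "17 hours", "24 hours later" and
# "an additional 24 hours" in the quoted text.
INDUCTION_LAG_HOURS = (12, 17)  # assumed time to reach high Sox21a levels

# (hours after temperature shift, % cells with a PAC, % cells that are ECs)
OBSERVATIONS = [(17, 0.1, 1.0), (41, 10.0, 20.0), (65, 1.0, 50.0)]


def effective_window(hours_after_shift, lag=INDUCTION_LAG_HOURS):
    """Return (min, max) hours of differentiation once the lag is subtracted."""
    shortest_lag, longest_lag = lag
    earliest = max(0, hours_after_shift - longest_lag)
    latest = max(0, hours_after_shift - shortest_lag)
    return earliest, latest


for hours, pac_percent, ec_percent in OBSERVATIONS:
    low, high = effective_window(hours)
    print(f"{hours} h after shift: {low}-{high} h effective, "
          f"PAC {pac_percent}%, EC {ec_percent}%")
```

      Under these assumptions, the sample in which pre-ECs with a PAC peak corresponds to roughly 24-29 hours of effective differentiation, and the sample dominated by ECs to roughly 48-53 hours, which is consistent with the estimate of about one day per step.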

      The authors convincingly show that septate junctions are instrumental for proper polarisation and integration of the enteroblast. However, while they nicely showed that Canoe is required neither in the enteroblast nor in the enterocytes for this process, it remains unclear whether septate junction proteins are required in the enteroblast, in enterocytes, or in both, and at which particular step the process fails in the mutant.

      Early stage enteroblasts neither express nor require septate junction proteins, whereas late stage enteroblasts and pre-enterocytes do (Chen et al., 2020; Hung et al., 2020; Izumi et al., 2019; Xu et al., 2019). Since cells mutant for septate junction proteins do not develop into mature enterocytes with an apical domain facing the gut lumen, we cannot answer the reviewer’s question of whether septate junction proteins are required in enterocytes.

      As we discussed in the paper, we think that “differentiating enteroblasts only require a basal cue to establish their initial apical-basal polarity, whereas the formation of the pre-assembled apical compartment also requires a junctional cue. The septate junctions are not necessary for apical domain formation per se, however, as mesh mutant enteroblasts form a fully-developed apical domain with a brush border inside the cell. This suggests that septate junctions define the site of apical domain formation by delimiting the region where apical membrane proteins are secreted to assemble the brush border, but do not control the process of apical domain formation directly.”

      Reviewer #2 (Public Review):

      The authors recently showed that the polarization of the cells of the adult Drosophila midgut does not require any of the canonical epithelial polarity factors, and instead depends on basal cues from adhesion to the ECM, as well as septate junction proteins (Chen et al, 2018). Here they extend this research to examine in greater detail precisely how midgut epithelial cells integrate in the pre-existing epithelium and become polarized. Surprisingly, they show that enteroblasts form an apical membrane initiation site prior to polarizing. Furthermore, they show that this develops into a pre-apical compartment containing fully-formed brush border. This is a very interesting finding - it explains how integrating enteroblasts can integrate into a pre-existing epithelium without disrupting barrier function. The conclusions of this paper are mostly well supported by data, but some aspects could do with being clarified and extended as outlined below.

      Model presented in Figure 6

      While the separation of membranes indicated in Figure 6 steps 3-5 can be seen in the image shown in Figure 3B, this is one of the only images which supports the idea that there is a separation of membranes between the enteroblast and overlying enterocytes during PAC formation. Is the model in Figure 6 supported by EM data - can you see a region where there is brush border and separation of cells? Supplementing Figure 3 with corresponding EM images would greatly aid the reader in interpreting the data and strengthen the model.

      We think that AJ clearing and membrane separation is a brief process that is quickly followed by the separation of the apical and junctional proteins and apical secretion at the AMIS to form the PAC. We have not captured this stage in our EM images, but have many other examples that show this step (e.g. Figure 4C and Figure 8F). Another example is shown below.

      A key step in the model is that the clearance of E-Cadherin from the apical membrane leads to a loss of adhesion between the enteroblast and the overlying enterocytes. This would need to be supported by functional data such as overexpression of E-Cad or E-CadDN in enteroblasts or by generating shg mutant clones. If the model is correct, perturbing E-Cad levels in enteroblasts should lead to defects in PAC formation, such as loss of de-adhesion/early de-adhesion/excessive de-adhesion.

      We think it is the local clearance of ECad from the apical membrane, not the downregulation of the total level of ECad, that is important for the local membrane separation and future PAC formation. The experiment of overexpressing ECad or ECad-DN proposed by the reviewer might be crucial to demonstrate the importance of the total amount of ECad, but might not be very helpful in determining the importance of membrane separation in PAC formation. Moreover, AJ formation in the fly midgut epithelium does not depend on ECad, suggesting that ECad and NCad act redundantly, which further complicates this approach (Choi et al., 2011; Liang et al., 2017).

      Role for the septate junction proteins

      Septate junction proteins were previously shown by these authors to be required for enteroblast polarization and integration into the midgut epithelium (Chen et al, 2018). Here they extend this by examining enteroblasts mutant for septate junction proteins, and conclude that septate junction proteins are required for normal PAC formation. However, it is not clear what aspect of the polarization of the enteroblasts is disrupted, because a number of mesh mutant cells (albeit a lower proportion than in wildtype) do form PACs. The main phenotype seems to be that cells fail to polarize (as previously reported) or have internalised PACs. It is hard to know what to conclude from this data about the role of the septate junction components in PAC formation.

      The major phenotype of the septate junction mutants is the loss of polarity, i.e. an inability to form an apical domain and integrate into the epithelial layer as shown in Figure 8. Neither mesh nor Tsp2a mutants can form a PAC, even though mesh mutant cells have a higher propensity to form an internal PAC-like structure (Figure 8B,C,E,G,H, Figure 8-figure supplement 1L). Thus, we think that septate junctions are required for AMIS and PAC formation. What complicates the interpretation is that some (6-20%) septate junction mutant cells do form an AMIS-like structure (Figure 8D-F, Figure 8-figure supplement 1F&K). The simplest explanation for this result is that this is due to perdurance of the wild-type proteins after clone induction, with the weaker phenotype of ssk mutants being due to longer perdurance of this protein. However, we cannot rule out the alternative explanation that AMIS and PAC formation is facilitated by the septate junction proteins, but that they can still form very inefficiently in their absence.

      We realise that this section was quite confusing in the original version of the manuscript and have now re-written it to make this interpretation clearer.

      Coracle is used as a readout for the localization of septate junction components, yet the staining for Cora in Figure S3B looks quite different to Mesh in S3D. If Cora is to be used as a readout for the localization of septate junction components, then staining for Cora/Mesh and/or Cora/Ssk or Tsp2a should be shown.

      When discussing the requirement for septate junctions for enteroblast integration - Coracle and Mesh are used interchangeably - but as mentioned before, it is not clear if they colocalize, or if their localization is interdependent (as demonstrated for Mesh, Tsp2a and Ssk in Figure 7). What is the phenotype of enteroblasts mutant for cora?

      Following from the previous point - while it is clear that Coracle is apical early during AMIS formation, it is not clear if Mesh, Tsp2a and Ssk also are, yet these are the mutants that are examined for a role in AMIS/PAC formation. It would be good to know whether the loss of cora would lead to defects in AMIS formation.

      The reason we used mainly Coracle as a marker for the septate junctions is that Mesh and Tsp2A localise to the basal labyrinth as well as to the septate junctions which could confuse the reader. We have now added new panels to Figure 3-figure supplement 3E&F showing the colocalization of Cora with Mesh/Tsp2a at the septate junctions and during the crucial stages of PAC formation.

      Additional Results:

      "Coracle is a peripheral septate junction protein whose localisation depends on the structural septate junction components such as Mesh/Ssk/Tsp2a (Chen et al., 2018; Izumi et al., 2016, 2012). Cora antibody staining provides a clearer marker for the septate junctions than Mesh or Tsp2a antibody staining, because the latter also label the basal labyrinth (Figure 3-figure supplement 1E&F). To determine whether Cora is required for PAC formation or epithelial polarity in the adult midgut, we generated a null mutant allele with a premature stop codon in the FERM domain using CRISPR. Cells mutant for this allele, corajc, or a second cora null allele, cora5, can form a PAC, septate junctions and a full apical domain, indicating that Cora is also not required for enteroblast integration or enterocyte polarity (Figure 7F&G, Figure 7-figure supplement 1E-H)."

      Additional Materials and Methods:

      We used the CRISPR/Cas9 method (Bassett and Liu, 2014) to generate null alleles of canoe and coracle. sgRNA was in vitro transcribed from a DNA template created by PCR from two partially complementary primers:

      forward primer:

      For coracle:

      5′-GAAATTAATACGACTCACTATAGAAGCTGGCCATGTACGGCGGTTTTAGAGCTAGAAATAGC-3′;

      The sgRNA was injected into…Act5c-Cas9 embryos to generate coracle null alleles (Port et al., 2014). Putative…coracle mutants in the progeny of the injected embryos were recovered, balanced, and sequenced. …The coraclejc allele contains a 2bp deletion around the CRISPR site, resulting in a frameshift that leads to a stop codon at amino acid 225 in the middle of the FERM domain, which is shared by all isoforms. No Coracle protein was detectable by antibody (DSHB C615.16) staining in either midgut or follicle cell clones. The coraclejc allele was recombined with FRT G13 to make the FRTG13 coraclejc flies.
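
      As an aid to reading the coracle forward primer above, the sketch below splits it into its parts. It assumes, without confirmation from the text, that the primer follows the commonly used layout of a T7 promoter, a variable target region and the first bases of the sgRNA scaffold that overlap the second primer. The function and constant names are hypothetical.

```python
# Sketch under stated assumptions: the promoter and scaffold sequences below
# are the widely used T7 promoter and sgRNA scaffold start; the source text
# does not itself annotate the primer.
T7_PROMOTER = "TAATACGACTCACTATAG"
SCAFFOLD_START = "GTTTTAGAGCTAGAAATAGC"

CORACLE_FORWARD_PRIMER = (
    "GAAATTAATACGACTCACTATAGAAGCTGGCCATGTACGGCGGTTTTAGAGCTAGAAATAGC"
)


def split_sgrna_primer(primer):
    """Return (leader, promoter, variable_region, scaffold) for a primer."""
    promoter_start = primer.find(T7_PROMOTER)
    if promoter_start == -1 or not primer.endswith(SCAFFOLD_START):
        raise ValueError("primer does not match the assumed layout")
    promoter_end = promoter_start + len(T7_PROMOTER)
    scaffold_begin = len(primer) - len(SCAFFOLD_START)
    return (
        primer[:promoter_start],              # bases 5' of the promoter
        primer[promoter_start:promoter_end],  # T7 promoter
        primer[promoter_end:scaffold_begin],  # gene-specific region
        primer[scaffold_begin:],              # overlap with the scaffold
    )


leader, promoter, variable_region, scaffold = split_sgrna_primer(
    CORACLE_FORWARD_PRIMER
)
print(leader, promoter, variable_region, scaffold)
```

      If the assumed layout holds, the only gene-specific part of the primer is the short stretch between the promoter and the scaffold overlap, which is the region one would compare against the coracle sequence around the reported 2bp deletion.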

      It is unclear what is happening in Figure 8A,C,E, S7D. Is that a detachment phenotype or an integration phenotype? Are the majority of cells unpolarised due to loss of integrin attachment rather than failure to form an AMIS/PAC?

      Cells mutant for septate junction proteins do not detach from the basement membrane and still localise Talin basally, as illustrated by the new panel we have added (Figure 8-figure supplement 1N), showing Talin localisation in a Tsp2a mutant cell.

      However, because the mutant cells cannot integrate and remain stuck beneath the septate junctions between the enterocytes, they sometimes become displaced from a portion of the basement membrane by younger EBs that derive from the same mutant ISC, leading to a pile up of cells in the basal region of the epithelium (e.g. Figure 8A, E and H).

      We have added the following sentences to the Results, explaining these points:

      "Because the mutant cells remain trapped beneath enterocyte-enterocyte septate junctions, they accumulate in the basal region of the epithelium, with new EBs derived from the same mutant ISC forming beneath them and reducing their contact with the basement membrane (Figure 8A)."

      "The majority of cells mutant for septate junction components fail to polarise or form an AMIS, although they form normal lateral and basal domains, as the basal integrin signalling component, Talin, localises normally (Figure 8-figure supplement 1N)."

      It is unclear whether enteroblasts really pass through an 'unpolarized stage'. In Figure 6, when they are described as 'unpolarised', they clearly have distinct basal and AJ domains. In septate junction mutants, when cells are classified as unpolarized, do they still have distinct regions of integrin/E-Cad expression?

      This is a semantic question. We agree that they have distinct lateral and basal domains, but they do not have an apical domain. In this respect, these "unpolarised" cells are similar to a mesenchymal fibroblast migrating on a substrate, which has a distinct basal side contacting the substrate that is different from the non-contacting regions of the cell surface. They also match the description of the migratory, "mesenchymal" enteroblasts (Antonello et al., 2015). To make this clearer, we have added the following notes to the legend for Figure 6: "“Unpolarised” in the second panel of this figure indicates that the enteroblast has not formed a distinct apical domain. At this stage, no marker is clearly apically localised. “unpolarised” or “polarised” in the third and fourth panels describe the localisation of marker proteins, such as Actin and Cno."

    1. Author Response

      eLife assessment

      This important paper exploits new cryo-EM tomography tools to examine the state of chromatin in situ. The experimental work is meticulously performed and convincing, with a vast amount of data collected. The main findings are interpreted by the authors to suggest that the majority of yeast nucleosomes lack a stable octameric conformation. Despite the possibly controversial nature of this report, it is our hope that such work will spark thought-provoking debate, and further the development of exciting new tools that can interrogate native chromatin shape and associated function in vivo.

      We thank the Editors and Reviewers for their thoughtful and helpful comments. We also appreciate the extraordinary amount of effort needed to assess both the lengthy manuscript and the previous reviews. Below, we provide our provisional responses in bold blue font. The majority of the comments are straightforward to address. We have taken a more conservative approach with the subset of comments that would require us to speculate because we either lack key information or we lack technical expertise. Instead of adding the speculative replies to the main text, we think it will be better to leave them in the rebuttal for posterity. Readers will therefore have access to our speculation and know that we did not feel confident enough to include these thoughts in the Version of Record.

      Reviewer #1 (Public Review):

      This manuscript by Tan et al uses cryo-electron tomography to investigate the structure of yeast nucleosomes both ex vivo (nuclear lysates) and in situ (lamellae and cryosections). The sheer number of experiments and results is astounding and comparable with an entire PhD thesis. However, as is always the case, it is hard to prove that something is not there. In this case, canonical nucleosomes. In their path to find the nucleosomes, the authors also stumble over new insights into nucleosome arrangement that indicate that the positions of the histones are more flexible than previously believed.

      We want to point out that canonical nucleosomes are there in wild-type cells in situ, albeit rarer than what’s expected based on our HeLa cell analysis. The negative result (absence of any canonical nucleosome classes in situ) was found in the histone-GFP mutants.

      Major strengths and weaknesses:

      Personally, I am not ready to agree with their conclusion that heterogeneous non-canonical nucleosomes predominate in yeast cells, but this reviewer is not an expert in the field of nucleosomes and can't judge how well these results fit into previous results in the field. As a technological expert though, I think the authors have done everything possible to test that hypothesis with today's available methods. One can debate whether it is necessary to have 35 supplementary figures, but after working through them all, I see that the nature of the argument needs all that support, precisely because it is so hard to show what is not there. The massive amount of work that has gone into this manuscript and the state-of-the-art nature of the technology should be warmly commended. I also think the authors have done a really great job with including all their results to the benefit of the scientific community. Yet, I am left with some questions and comments:

      Could the nucleosomes change into other shapes that were predetermined in situ? Could the authors expand on if there was a structure or two that was more common than the others of the classes they found? Or would this not have been found because of the template matching and later reference particle used?

      Our best guess (speculation) is that one of the class averages that is smaller than the canonical nucleosome contains one or more non-canonical nucleosome classes. We do not feel confident enough to single out any of these classes precisely because we do not yet know if they arise from one non-canonical nucleosome structure or from multiple – and therefore mis-classified – non-canonical nucleosome structures (potentially with other non-nucleosome complexes mixed in). We feel it is better to leave this discussion out of the manuscript, or risk sending the community on wild goose chases.

      Our template-matching workflow uses a low-enough cross-correlation threshold that any nucleosome-sized particle (plus or minus a few nanometers) would be picked, which is why the number of hits is so large. So unless the non-canonical nucleosomes quadrupled in size or lost most of their histones, they should be grouped with one or more of the other 99 class averages (WT cells) or any of the 100 class averages (cells with GFP-tagged histones). As to whether the later reference particle could have prevented us from detecting one of the non-canonical nucleosome structures, we are unable to tell because we’d really have to know what an in situ non-canonical nucleosome looks like first.
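
      The effect of the cross-correlation threshold can be illustrated with a toy example. The sketch below is not our workflow: it is a one-dimensional stand-in, with hypothetical function names and made-up numbers, that scores each window of a signal against a template by normalized cross-correlation and keeps the positions that reach a threshold.

```python
from math import sqrt


def normalized_cross_correlation(window, template):
    """Pearson-style correlation between two equal-length sequences."""
    n = len(template)
    mean_w = sum(window) / n
    mean_t = sum(template) / n
    numerator = sum((w - mean_w) * (t - mean_t)
                    for w, t in zip(window, template))
    denominator = sqrt(sum((w - mean_w) ** 2 for w in window)
                       * sum((t - mean_t) ** 2 for t in template))
    # Flat windows have no variance; score them as non-matching.
    return numerator / denominator if denominator else 0.0


def pick_hits(signal, template, threshold):
    """Return start positions whose correlation score reaches the threshold."""
    n = len(template)
    return [start for start in range(len(signal) - n + 1)
            if normalized_cross_correlation(signal[start:start + n],
                                            template) >= threshold]


# Toy data: an exact copy of the template at position 2 and a
# broader, template-like feature at position 10.
template = [0, 1, 2, 1, 0]
signal = [0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0]

strict_hits = pick_hits(signal, template, threshold=0.95)
permissive_hits = pick_hits(signal, template, threshold=0.80)
print(strict_hits)
print(permissive_hits)
```

      In this toy case the strict threshold keeps the exact copy but rejects the broader feature, whereas the permissive threshold keeps both along with any other windows that score above it. The same trade-off is why a permissive threshold yields many hits that must then be sorted by classification.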

      Could it simply be that the yeast nucleoplasm is differently structured than that of HeLa cells and it was harder to find nucleosomes by template matching in these cells? The authors argue against crowding in the discussion, but maybe it is just a nucleoplasm texture that side-tracks the programs?

      Presumably, the nucleoplasmic “side-tracking” texture would come from some molecules in the yeast nucleus. These molecules would be too small to visualize as discrete particles in the tomographic slices, but they would contribute textures that can be “seen” by the programs – in particular RELION, which does the discrimination between structural states. We do not know the inner-workings of RELION well enough to say what kinds of density textures would side-track its classification routines.

      The title of the paper is not well reflected in the main figures. The title of Figure 2 says "Canonical nucleosomes are rare in wild-type cells", but that is not shown/quantified in that figure. Rare in comparison to what? I suggest adding a comparative view from the HeLa cells, like the text does in lines 195-199. A measure of nucleosomes detected per volume nucleoplasm would also facilitate a comparison.

      Figure 2’s title is indeed unclear and does not align with the paper’s title and key conclusion. The rarity here is relative to the expected number of nucleosomes (canonical plus non-canonical). We have changed the title to “Canonical nucleosomes are a minority of the expected total in wild-type cells”. We would prefer to leave the reference to HeLa cells to the main text instead of as a figure panel because the comparison is not straightforward for a graphical presentation. Instead, we will report the total number of nucleosomes estimated for this particular tomogram (~7,600) versus the number of canonical nucleosomes classified (297; 594 if we assume we missed half of them).

      If the cell contains mostly non-canonical nucleosomes, are they really non-canonical? Maybe a change of language is required once this is somewhat sure (say, after line 303).

      This is an interesting semantic and philosophical point. From the yeast cell’s “perspective”, the canonical nucleosome structure would be the form that is in the majority. That being said, we do not know if there is one structure that is the majority. From the chromatin field’s point of view, the canonical nucleosome is the form that is most commonly seen in all the historical – and most contemporary – literature, namely something that resembles the crystal structure of Luger et al, 1997. Given these two lines of thinking, we will add the following clarification after line 303:

      “At present, we do not know what the non-canonical nucleosome structures are, meaning that we cannot even determine if one non-canonical structure is the majority. Until we know what the family of non-canonical nucleosome structures are, we will use the term non-canonical to describe the nucleosomes that do not have the canonical (crystal) structure”.

      The authors could explain more why they sometimes use the conventional 2D followed by 3D classification approach and sometimes "direct 3-D classification". Why, for example, do they do 2D followed by 3D in Figure S5A? This Figure could be considered a regular figure since it shows the main message of the paper.

      Because the classification of subtomograms in situ is still a work in progress, we felt it would be better to show one instance of 2-D classification for lysates and one for lamellae. While it is true that we could have presented direct 3-D classification for the entire paper, we anticipate that readers will be interested to see what the in situ 2-D class averages look like.

      The main message is that there are canonical nucleosomes in situ (at least in wild-type cells), but they are a minority. Therefore, the conventional classification for Figure S5A should not be a main figure because it does not show any canonical nucleosome class averages in situ.

      Figure 1: Why is there a gap in the middle of the nucleosome in panel B? The authors write that this is a higher resolution structure (18Å), but in the even higher resolution crystallography structure (3Å resolution), there is no gap in the middle.

      There is a lower concentration of amino acids at the middle in the disc view; unfortunately, the space-filling model in Figure 1A hides this feature. The gap exists in experimental cryo-EM density maps. See below for an example. The size of the gap depends on the contour level and probably the contrast mechanism, as the gap is less visible in the VPP subtomogram averages. To clarify this confusing phenomenon, we will add the following lines to the figure legend:

      “The gap in the disc view of the nuclear-lysate-based average is due to the lower concentration of amino acids there, which is not visible in panel A due to space-filling rendering. This gap’s size may depend on the contrast mechanism because it is not visible in the VPP averages.”

      Reviewer #2 (Public Review):

      Nucleosome structures inside cells remain unclear. Tan et al. tackled this problem using cryo-ET and 3-D classification analysis of yeast cells. The authors found that the fraction of canonical nucleosomes in the cell could be less than 10% of total nucleosomes. The finding is consistent with the unstable property of yeast nucleosomes and the high proportion of the actively transcribed yeast genome. The authors made an important point in understanding chromatin structure in situ. Overall, the paper is well-written and informative to the chromatin/chromosome field.

      We thank Reviewer 2 for their positive assessment.

      Reviewer #3 (Public Review):

      Several labs in the 1970s published fundamental work revealing that almost all eukaryotes organize their DNA into repeating units called nucleosomes, which form the chromatin fiber. Decades of elegant biochemical and structural work indicated a primarily octameric organization of the nucleosome with 2 copies of each histone H2A, H2B, H3 and H4, wrapping 147 bp of DNA in a left-handed toroid, to which linker histone would bind.

      This was true for most species studied (except, yeast lack linker histone) and was recapitulated in stunning detail by in vitro reconstitutions by salt dialysis or chaperone-mediated assembly of nucleosomes. Thus, these landmark studies set the stage for an exploding number of papers on the topic of chromatin in the past 45 years.

      An emerging counterpoint to the prevailing idea of static particles is that nucleosomes are much more dynamic and can undergo spontaneous transformation. Such dynamics could arise from intrinsic instability due to DNA structural deformation, specific histone variants or their mutations, post-translational histone modifications which weaken the main contacts, protein partners, and predominantly, from active processes like ATP-dependent chromatin remodeling, transcription, repair and replication.

      This paper is important because it tests this idea whole-scale, applying novel cryo-EM tomography tools to examine the state of chromatin in yeast lysates or cryo-sections. The experimental work is meticulously performed, with a vast amount of data collected. The main findings are interpreted by the authors to suggest that the majority of yeast nucleosomes lack a stable octameric conformation. The findings are not surprising in that alternative conformations of nucleosomes might exist in vivo, but rather in the sheer scale of such particles reported, relative to the traditional form expected from decades of biochemical, biophysical and structural data. Thus, it is likely that this work will be perceived as controversial. Nonetheless, we believe these kinds of tools represent an important advance for in situ analysis of chromatin. We also think the field should have the opportunity to carefully evaluate the data and assess whether the claims are supported, or consider what additional experiments could be done to further test the conceptual claims made. It is our hope that such work will spark thought-provoking debate in a collegial fashion, and lead to the development of exciting new tools which can interrogate native chromatin shape in vivo. Most importantly, it will be critical to assess biological implications associated with more dynamic, or more static, forms of nucleosomes, the associated chromatin fiber, and its three-dimensional organization, for nuclear or mitotic function.

      Thank you for putting our work in the context of the field’s trajectory. We hope our EMPIAR entry, which includes all the raw data used in this paper, will be useful for the community. As more labs (hopefully) upload their raw data and as image-processing continues to advance, the field will be able to revisit the question of non-canonical nucleosomes in budding yeast and other organisms.

    2. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important paper exploits new cryo-EM tomography tools to examine the state of chromatin in situ. The experimental work is meticulously performed and convincing, with a vast amount of data collected. The main findings are interpreted by the authors to suggest that the majority of yeast nucleosomes lack a stable octameric conformation. Despite the possibly controversial nature of this report, it is our hope that such work will spark thought-provoking debate, and further the development of exciting new tools that can interrogate native chromatin shape and associated function in vivo.

      We thank the Editors and Reviewers for their thoughtful and helpful comments. We also appreciate the extraordinary amount of effort needed to assess both the lengthy manuscript and the previous reviews. Below, we provide our point-by-point response in bold blue font. Nearly all comments have been addressed in the revised manuscript. For a subset of comments that would require us to speculate, we have taken a conservative approach because we either lack key information or technical expertise: Instead of adding the speculative replies to the main text, we think it is better to leave them in the rebuttal for posterity. Readers will thereby have access to our speculation and know that we did not feel confident enough to include these thoughts in the Version of Record.

      Reviewer #1 (Public Review):

      This manuscript by Tan et al. uses cryo-electron tomography to investigate the structure of yeast nucleosomes both ex vivo (nuclear lysates) and in situ (lamellae and cryosections). The sheer number of experiments and results is astounding and comparable with an entire PhD thesis. However, as is always the case, it is hard to prove that something is not there, in this case canonical nucleosomes. On their path to finding the nucleosomes, the authors also stumble upon new insights into nucleosome arrangement, which indicate that the positions of the histones are more flexible than previously believed.

      Please note that canonical nucleosomes are there in wild-type cells in situ, albeit rarer than what’s expected based on our HeLa cell analysis and especially the total number of yeast nucleosomes (canonical plus non-canonical). The negative result (absence of any canonical nucleosome classes in situ) was found in the histone-GFP mutants.

      Major strengths and weaknesses:

      Personally, I am not ready to agree with their conclusion that heterogeneous non-canonical nucleosomes predominate in yeast cells, but this reviewer is not an expert in the field of nucleosomes and can't judge how well these results fit into previous results in the field. As a technological expert though, I think the authors have done everything possible to test that hypothesis with today's available methods. One can debate whether it is necessary to have 35 supplementary figures, but after working through them all, I see that the nature of the argument needs all that support, precisely because it is so hard to show what is not there. The massive amount of work that has gone into this manuscript and the state-of-the-art nature of the technology should be warmly commended. I also think the authors have done a really great job with including all their results to the benefit of the scientific community. Yet, I am left with some questions and comments:

      Could the nucleosomes change into other shapes that were predetermined in situ? Could the authors expand on if there was a structure or two that was more common than the others of the classes they found? Or would this not have been found because of the template matching and later reference particle used?

      Our best guess (speculation) is that one of the class averages that is smaller than the canonical nucleosome contains one or more non-canonical nucleosome classes. However, we do not feel confident enough to single out any of these classes precisely because we do not yet know if they arise from one non-canonical nucleosome structure or from multiple – and therefore mis-classified – non-canonical nucleosome structures (potentially with other non-nucleosome complexes mixed in). We feel it is better to leave this discussion out of the manuscript, or risk sending the community on wild goose chases.

      Our template-matching workflow uses a low-enough cross-correlation threshold that any nucleosome-sized particle (plus or minus a few nanometers) would be picked, which is why the number of hits is so large. So unless the non-canonical nucleosomes quadrupled in size or lost most of their histones, they should be grouped with one or more of the other 99 class averages (WT cells) or any of the 100 class averages (cells with GFP-tagged histones). As to whether the later reference particle could have prevented us from detecting one of the non-canonical nucleosome structures, we are unable to tell because we’d really have to know what an in situ non-canonical nucleosome looks like first.

      Could it simply be that the yeast nucleoplasm is differently structured than that of HeLa cells and it was harder to find nucleosomes by template matching in these cells? The authors argue against crowding in the discussion, but maybe it is just a nucleoplasm texture that side-tracks the programs?

      Presumably, the nucleoplasmic “side-tracking” texture would come from some molecules in the yeast nucleus. These molecules would be too small to visualize as discrete particles in the tomographic slices, but they would contribute textures that can be “seen” by the programs – in particular RELION, which does the discrimination between structural states. We are not sure what types of density textures would side-track RELION’s classification routines.

      The title of the paper is not well reflected in the main figures. The title of Figure 2 says "Canonical nucleosomes are rare in wild-type cells", but that is not shown/quantified in that figure. Rare in comparison to what? I suggest adding a comparative view from the HeLa cells, like the text does in lines 195-199. A measure of nucleosomes detected per volume nucleoplasm would also facilitate a comparison.

      Figure 2’s title is indeed unclear and does not align with the paper’s title and key conclusion. The rarity here is relative to the expected number of nucleosomes (canonical plus non-canonical). We have changed the title to:

      “Canonical nucleosomes are a minority of the expected total in wild-type cells”.

      We would prefer to leave the reference to HeLa cells to the main text instead of as a figure panel because the comparison is not straightforward for a graphical presentation. Instead, we now report the total number of nucleosomes estimated for this particular yeast tomogram (~7,600) versus the number of canonical nucleosomes classified (297; 594 if we assume we missed half of them). This information is in the revised figure legend:

      “In this tomogram, we estimate there are ~7,600 nucleosomes (see Methods on how the calculation is done), of which 297 are canonical structures. Accounting for the missing disc views, we estimate there are ~594 canonical nucleosomes in this cryolamella (< 8% the expected number of nucleosomes).”
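
      The percentage quoted in this legend can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the response (~7,600 estimated nucleosomes, 297 classified canonical nucleosomes); the variable names are illustrative, and the factor of two reflects the stated assumption that about half of the canonical nucleosomes (the disc views) were missed.

```python
# Numbers quoted in the revised Figure 2 legend.
total_nucleosomes = 7600      # approximate estimate for this tomogram
canonical_classified = 297    # canonical nucleosomes found by 3-D classification

# Assumption stated in the response: roughly half of the canonical
# nucleosomes (the disc views) were missed, so the count is doubled.
canonical_corrected = 2 * canonical_classified

fraction_classified = canonical_classified / total_nucleosomes
fraction_corrected = canonical_corrected / total_nucleosomes

print(f"classified only: {fraction_classified:.1%}")
print(f"corrected for missed disc views: {fraction_corrected:.1%}")
```

      Under these assumptions the corrected fraction stays below the 8% bound given in the legend.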

      If the cell contains mostly non-canonical nucleosomes, are they really non-canonical? Maybe a change of language is required once this is somewhat sure (say, after line 303).

      This is an interesting semantic and philosophical point. From the yeast cell’s “perspective”, the canonical nucleosome structure would be the form that is in the majority. That being said, we do not know if there is one structure that is the majority. From the chromatin field’s point of view, the canonical nucleosome is the form that is most commonly seen in all the historical – and most contemporary – literature, namely something that resembles the crystal structure of Luger et al, 1997. Given these two lines of thinking, we added the following clarification as lines 312 – 316:

      “At present, we do not know what the non-canonical nucleosome structures are, meaning that we cannot even determine if one non-canonical structure is the majority. Until we know the non-canonical nucleosomes’ structures, we will use the term non-canonical to describe all the nucleosomes that do not have the canonical (crystal) structure.”

      The authors could explain more why they sometimes use the conventional 2D followed by 3D classification approach and sometimes "direct 3-D classification". Why, for example, do they do 2D followed by 3D in Figure S5A? This Figure could be considered a regular figure since it shows the main message of the paper.

      Since the classification of subtomograms in situ is still a work in progress, we felt it would be better to show one instance of 2-D classification for lysates and one for lamellae. While it is true that we could have presented direct 3-D classification for the entire paper, we anticipate that readers will be interested to see what the in situ 2-D class averages look like.

      The main message is that there are canonical nucleosomes in situ (at least in wild-type cells), but they are a minority. Therefore, the conventional classification for Figure S5A should not be a main figure because it does not show any canonical nucleosome class averages in situ.

      Figure 1: Why is there a gap in the middle of the nucleosome in panel B? The authors write that this is a higher resolution structure (18Å), but in the even higher resolution crystallography structure (3Å resolution), there is no gap in the middle.

      There is a lower concentration of amino acids at the middle in the disc view; unfortunately, the space-filling model in Figure 1A hides this feature. The gap exists in experimental cryo-EM density maps. See Author response image 1 for an example (pubmed.ncbi.nlm.nih.gov/29626188). The size of the gap depends on the contour level and probably the contrast mechanism, as the gap is less visible in the VPP subtomogram averages. To clarify this confusing phenomenon, we added the following lines to the figure legend:

      “The gap in the disc view of the nuclear-lysate-based average is due to the lower concentration of amino acids there, which is not visible in panel A due to space-filling rendering. This gap’s visibility may also depend on the contrast mechanism because it is not visible in the VPP averages.”

      Author response image 1.

      Reviewer #2 (Public Review):

      Nucleosome structures inside cells remain unclear. Tan et al. tackled this problem using cryo-ET and 3-D classification analysis of yeast cells. The authors found that the fraction of canonical nucleosomes in the cell could be less than 10% of total nucleosomes. The finding is consistent with the unstable property of yeast nucleosomes and the high proportion of the actively transcribed yeast genome. The authors made an important point in understanding chromatin structure in situ. Overall, the paper is well-written and informative to the chromatin/chromosome field.

      We thank Reviewer 2 for their positive assessment.

      Reviewer #3 (Public Review):

      Several labs in the 1970s published fundamental work revealing that almost all eukaryotes organize their DNA into repeating units called nucleosomes, which form the chromatin fiber. Decades of elegant biochemical and structural work indicated a primarily octameric organization of the nucleosome with 2 copies of each histone H2A, H2B, H3 and H4, wrapping 147 bp of DNA in a left-handed toroid, to which linker histone would bind.

      This was true for most species studied (except, yeast lack linker histone) and was recapitulated in stunning detail by in vitro reconstitutions by salt dialysis or chaperone-mediated assembly of nucleosomes. Thus, these landmark studies set the stage for an exploding number of papers on the topic of chromatin in the past 45 years.

      An emerging counterpoint to the prevailing idea of static particles is that nucleosomes are much more dynamic and can undergo spontaneous transformation. Such dynamics could arise from intrinsic instability due to DNA structural deformation, specific histone variants or their mutations, post-translational histone modifications which weaken the main contacts, protein partners, and predominantly, from active processes like ATP-dependent chromatin remodeling, transcription, repair and replication.

      This paper is important because it tests this idea whole-scale, applying novel cryo-EM tomography tools to examine the state of chromatin in yeast lysates or cryo-sections. The experimental work is meticulously performed, with a vast amount of data collected. The main findings are interpreted by the authors to suggest that the majority of yeast nucleosomes lack a stable octameric conformation. The findings are not surprising in that alternative conformations of nucleosomes might exist in vivo, but rather in the sheer scale of such particles reported, relative to the traditional form expected from decades of biochemical, biophysical and structural data. Thus, it is likely that this work will be perceived as controversial. Nonetheless, we believe these kinds of tools represent an important advance for in situ analysis of chromatin. We also think the field should have the opportunity to carefully evaluate the data and assess whether the claims are supported, or consider what additional experiments could be done to further test the conceptual claims made. It is our hope that such work will spark thought-provoking debate in a collegial fashion, and lead to the development of exciting new tools which can interrogate native chromatin shape in vivo. Most importantly, it will be critical to assess biological implications associated with more dynamic, or more static, forms of nucleosomes, the associated chromatin fiber, and its three-dimensional organization, for nuclear or mitotic function.

      Thank you for putting our work in the context of the field’s trajectory. We hope our EMPIAR entry, which includes all the raw data used in this paper, will be useful for the community. As more labs (hopefully) upload their raw data and as image-processing continues to advance, the field will be able to revisit the question of non-canonical nucleosomes in budding yeast and other organisms. 

      Reviewer #1 (Recommendations For The Authors):

      The manuscript sometimes reads like a part of a series rather than a stand-alone paper. Be sure to spell out what needs to be known from previous work to read this article. The introduction is very EM-technique focused but could do with more nucleosome information.

      We have added a new paragraph that discusses the sources of structural variability to better prepare readers, as lines 50 – 59:

      “In the context of chromatin, nucleosomes are not discrete particles because sequential nucleosomes are connected by short stretches of linker DNA. Variation in linker DNA structure is a source of chromatin conformational heterogeneity (Collepardo-Guevara and Schlick, 2014). Recent cryo-EM studies show that nucleosomes can deviate from the canonical form in vitro, primarily in the structure of DNA near the entry/exit site (Bilokapic et al., 2018; Fukushima et al., 2022; Sato et al., 2021; Zhou et al., 2021). In addition to DNA structural variability, nucleosomes in vitro have small changes in histone conformations (Bilokapic et al., 2018). Larger-scale variations of DNA and histone structure are not compatible with high-resolution analysis and may have been missed in single-particle cryo-EM studies.”

      Line 165-6 "did not reveal a nucleosome class average in..". Add "canonical", since it otherwise suggests there were no nucleosomes.

      Thank you for catching this error. Corrected.

      Lines 177-182: Why are the disc views missed by the classification analysis? They should be there in the sample, as you say.

      We suspect that RELION 3 is misclassifying the disc-view canonical nucleosomes into the other classes. The RELION developers suspect that view-dependent misclassification arises from RELION 3’s 3-D CTF model. RELION 4 is reported to be less biased by the particles’ views. We have started testing RELION 4 but do not have anything concrete to report yet.

      Line 222: a GFP tag.

      Fixed.

      Line 382: "Note that the percentage .." I can't follow this sentence. Why would you need to know how many chromosome's worth of nucleosomes you are looking at to say the percentage of non-canonical nucleosomes?

      Thank you for noticing this confusing wording. The sentence has been both simplified and clarified as follows in lines 396 – 398:

      “Note that the percentage of canonical nucleosomes in lysates cannot be accurately estimated because we cannot determine how many nucleosomes in total are in each field of view.”

      Line 397: "We're not implying that..." Please add a sentence clearly stating what you DO mean with mobility for H2A/H2B.

      We have added the following clarifying sentence in lines 412 – 413:

      “We mean that H2A-H2B is attached to the rest of the nucleosome and can have small differences in orientation.”

      Line 428: repeated message from line 424. "in this figure, the blurring implies.."

      Redundant phrase removed.

      Line 439: "on a HeLa cell" - a single cell in the whole study?

      Yes, that study was done on a single cell.

      A general comment is that the authors could help the reader more by developing the figures and making them more pedagogical; a list of suggestions can be found below.

      Thank you for the suggestions. We have applied all of them to the specific figure callouts and to the other figures that could use similar clarification.

      Figure 2: Help the reader by avoiding abbreviations in the figure legend. VPP tomographic slice - spell out "Volta Phase Plate". Same with the term "remapped" (panel B) what does that mean?

      We spelled out Volta phase plate in full and explained “remapped” with the following additional figure legend text:

      “the class averages were oriented and positioned in the locations of their contributing subtomograms”.

      Supplementary figures:

      Figure S3: It is unclear what you mean with "two types of BY4741 nucleosomes". You then say that the canonical nucleosomes are shaded blue. So what color is then the non-canonical? All the greys? Some of them look just like random stuff, not nucleosomes.

      “Two types” is a typo and has been removed and “nucleosomes” has been replaced with “candidate nucleosome template-matching hits” to accurately reflect the particles used in classification.

      Figure S6: Top left says "3 tomograms (defocus)". I wonder if you meant to add the defocus range here. I have understood it like this is the same data as shown in Figure S5, which makes me wonder if this top cartoon should not be on top of that figure too (or exclusively there).

      To make Figures S6 (and S5) clearer, we have copied the top cartoon from Figure S6 to S5.

      Note that we corrected a typo for these figures (and the Table S7): the number of template-matched candidate nucleosomes should be 93,204, not 62,428.

      The description in the parentheses (defocus) is shorthand for defocus phase contrast and was not intended to also display a defocus range. All of the revised figure legends now report the meaning of both this shorthand and of the Volta phase plate (VPP).

      To help readers see the relationship between these two figures, we added the following clarifying text to the Figure S5 and S6 legends, respectively:

      “This workflow uses the same template-matched candidate nucleosomes as in Figure S6; see below.”

      “This workflow uses the same template-matched candidate nucleosomes as in Figure S5.”

      Figure S7: In the first panel, it is unclear why the featureless cylinder is shown as it is not used as a reference here. Rather, it could be put throughout where it was used and then put the simulated EM-map alone here. If left in, it should be stated in the legend that it was not used here.

      It would indeed be much clearer to show the featureless cylinder in all the other figures and leave the simulated nucleosome in this control figure. All figures are now updated. The figure legend was also updated as follows:

      “(A) A simulated EM map from a crystal structure of the nucleosome was used as the template-matching and 3-D classification reference.”

      Figure S18: Why are there classes where the GFP density is missing? Mention something about this in the figure legend.

      We have appended the following speculations to explain the “missing” GFP densities:

      “Some of the class averages are “missing” one or both expected GFP densities. The possible explanations include mobility of a subpopulation of GFPs or H2A-GFPs, incorrectly folded GFPs, or substitution of H2A for the variant histone H2A.Z.”

      Reviewer #2 (Recommendations For The Authors):

      My specific (rather minor) comments are the following:

      1) Abstract:

      yeast -> budding yeast.

      All three instances in the abstract have been replaced with “budding yeast”.

      It would be better to clarify what ex vivo means here.

      We have appended “(in nuclear lysates)” to explain the meaning of ex vivo.

      2) Some subtitles are unclear.

      e.g., "in wild-type lysates" -> "wild-type yeast lysates"

      Thank you for this suggestion. All unclear instances of subtitles and sample descriptions throughout the text have been corrected.

      3) Page 6, Line 113. "...which detects more canonical nucleosomes." A similar thing was already mentioned in the same paragraph and seems redundant.

      Thank you for noticing this redundant statement, which is now deleted.

      4) Page 25, Line 525. "However, crowding is an unlikely explanation..." Please note that many macromolecules (proteins, RNAs, polysaccharides, etc.) were lost during the nuclei isolation process.

      This is a good point. We have rewritten this paragraph to separate the discussion on technical versus biological effects of crowding, in lines 538 – 546:

      “Another hypothesis for the low numbers of detected canonical nucleosomes is that the nucleoplasm is too crowded, making the image processing infeasible. However, crowding is an unlikely technical limitation because we were able to detect canonical nucleosome class averages in our most-crowded nuclear lysates, which are so crowded that most nucleosomes are butted against others (Figures S15 and S16). Crowding may instead have biological contributions to the different subtomogram-analysis outcomes in cell nuclei and nuclear lysates. For example, the crowding from other nuclear constituents (proteins, RNAs, polysaccharides, etc.) may contribute to in situ nucleosome structure, but is lost during nucleus isolation.”

      5) Page 7, Line 126. "The subtomogram average..." Is there any explanation for this?

      Presumably, the longer linker DNA length corresponds to the ordered portion of the ~22 bp linker between consecutive nucleosomes, given the ~168 bp nucleosome repeat length. We have appended the following explanation as the concluding sentence, lines 137 – 140:

      “Because the nucleosome-repeat length of budding yeast chromatin is ~168 bp (Brogaard et al., 2012), this extra length of DNA may come from an ordered portion of the ~22 bp linker between adjacent nucleosomes.”

      6) "Histone GFP-tagging strategy" subsection:

      Since this subsection is a bit off the mainstream of the paper, it can be shortened and merged into the next one.

      We have merged the “Histone GFP-tagging strategy” and “GFP is detectable on nucleosome subtomogram averages ex vivo” subsections and shortened the text as much as possible. The new subsection is entitled “Histone GFP-tagging and visualization ex vivo”

      7) Page 16, Line 329. "Because all attempts to make H3- or H4-GFP "sole source" strains failed..." Is there a possible explanation here? Cytotoxic effect because of steric hindrance of nucleosomes?

      Yes, it is possible that the GFP tag interferes with the nucleosome's interactions with its numerous partners. It is also possible that the histone-GFP fusions do not import and/or assemble efficiently enough to support a bare-minimum number of functional nucleosomes. Given that the phenotypic consequences of fusion tags are an underexplored topic and that we don’t have any data on the (dead) transformants, we would prefer to leave out speculation about the cause of death in the attempted creation of “sole source” strains.

    1. Author Response

      Reviewer #1 (Public Review):

      This study explores the mechanisms responsible for reduced steroidogenesis of adrenocortical cells in a mouse model of systemic inflammation induced by LPS administration. Working from RNA and protein profiling data sets in adrenocortical tissue from LPS-treated mice they report that LPS perturbs the TCA cycle at the level of succinate dehydrogenase B (SDHB) impairing oxidative phosphorylation. Additional studies indicate these events are coupled to increased IL-1β levels which inhibit SDHB expression through DNA methyltransferase-dependent DNA methylation of the SDHB promoter.

      In general, these are interesting studies with some novel implications. I do, however, have concerns with some of the author's rather broad conclusions given the limitations of their experimental approach. The paper could be improved by addressing the following points:

      1) The limitations of using LPS as the model for systemic inflammation need to be explicitly described.

      We thank the Reviewer for this suggestion. Indeed, the LPS model has several limitations as a preclinical model of sepsis, which are outlined in the revised Discussion. Despite these limitations, we chose this model over other models of sepsis, such as the cecal slurry model, because its high reproducibility enabled the mechanistic studies presented here.

      2) The initial in vivo findings, which support the proposed metabolic perturbation, are based on descriptive profiling data obtained at one time point following a single dose of LPS. The author's conclusion that the ultimate transcriptional pathway identified hinges critically on knowledge of the time course of this effect following LPS, which is not adequately addressed in the paper. How was this time and dose of LPS established and are there data from different dose and time points?

      We thank the Reviewer for raising this question, which we indeed addressed at the beginning of our studies in order to determine a suitable time point and dose of LPS treatment. We chose 6 h as a suitable starting time point for transcriptional analyses because LPS triggers transcriptional changes in the adrenal gland and other tissues within a few hours (1-3). Confirming our expectations, we found 2,609 differentially expressed genes (Figure 1a) in the adrenal cortex of LPS-treated mice, many of which were involved in cellular metabolism (Figure 1d,e, 2a-e, Table 1, Table 2). Acute transcriptional changes, which are more likely to reflect direct effects of inflammatory signals than changes occurring at later time points (for instance, after days), would allow us to mechanistically investigate the effects of inflammation in the adrenal gland, which was the purpose of our studies. Hence, we were guided by the transcriptional changes observed at 6 h of LPS treatment and established the hypothesis that disruption of the TCA cycle in adrenocortical cells is key to the impact of inflammation on adrenal function. Along this line, we analyzed the metabolomic profile of the adrenal gland at 6 and 24 h of LPS treatment. At 6 h, succinate levels as well as the succinate / fumarate ratio remained unchanged (Author response image 1A), while at 24 h post-injection these were increased by LPS (Author response image 1B, Figure 2l,o,q). The delay between downregulation of Sdhb mRNA expression (at 6 h) and the increase in succinate levels (observed at 24 h) can be explained by the time required for reduction of SDHB protein levels, which depends on the protein turnover, suggested to be approximately 12 h in HeLa cells (4). Based on these findings, all further metabolomic analyses were performed at 24 h of LPS treatment.

      Author response image 1. LPS increases the succinate/fumarate ratio at 24 h but not at 6 h. Mice were i.p. injected with 1 mg/kg LPS, and succinate and fumarate levels in the adrenal gland were determined by LC-MS/MS 6 h (A) and 24 h (B) post-injection. n=8-10; data are presented as mean ± s.e.m. Statistical analysis was done with two-tailed Mann-Whitney test. *p < 0.05.
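
      To make the comparison in this legend concrete, the following is a minimal Python sketch of how a per-animal succinate/fumarate ratio and the Mann-Whitney U statistic could be computed. It is an illustration only, not the analysis pipeline used for the figure: the function names are hypothetical and every numeric value is a placeholder, not data from the study.

```python
# Illustrative sketch only: function names are hypothetical and all numbers
# below are placeholders, NOT measurements from the study.

def ratios(succinate, fumarate):
    """Per-sample succinate/fumarate ratio."""
    return [s / f for s, f in zip(succinate, fumarate)]

def midranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples."""
    r = midranks(list(x) + list(y))
    rank_sum_x = sum(r[:len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y

# Placeholder inputs for two groups of three animals each.
control = ratios([1.0, 1.2, 0.9], [2.0, 2.0, 2.0])
treated = ratios([2.0, 2.4, 2.2], [2.0, 2.0, 2.0])
u_control, u_treated = mann_whitney_u(control, treated)
```

      The two-tailed p-value would then be derived from the smaller of the two U values; in practice a statistics package (for example `scipy.stats.mannwhitneyu` with `alternative="two-sided"`) would normally be used for that step.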

      Having established the most suitable time points of LPS treatment for observing the induced transcriptional and metabolic changes, we set out to define the LPS dose to be used in subsequent experiments. The data shown in Author response image 1 were acquired after treatment with 1 mg/kg LPS, a dose previously reported to cause transcriptional re-profiling of the adrenal gland (1, 2). Like 1 mg/kg LPS, 5 mg/kg LPS reduced Sdhb, Idh1 and Idh2 expression at 4 h (Author response image 2A) and increased succinate and isocitrate levels at 24 h (Author response image 2B) in the adrenal gland. Given that the effects of 1 and 5 mg/kg LPS were similar, we continued our studies with the lower dose for animal welfare reasons.

      Author response image 2. Five mg/kg LPS downregulates Sdhb, Idh1 and Idh2 expression and increases succinate and isocitrate levels in the adrenal gland of mice. Sdhb, Idh1 and Idh2 expression (A) and succinate and isocitrate levels (B) were assessed in the adrenal gland of mice treated with 5 mg/kg LPS for 4 h (A) and 24 h (B). n=5; data are presented as mean ± s.d. Statistical analysis was done with two-tailed Mann-Whitney test. *p < 0.05, **p < 0.01.

      3) Related to the point above, the authors' data supporting a break in the TCA cycle would be strengthened by direct biochemical assessment (metabolic flux analysis) of the step in the TCA cycle process impacted.

      We entirely agree with the Reviewer and considered performing TCA cycle metabolic flux analyses in adrenocortical cells. Unfortunately, the low yield of adrenocortical cells per mouse (approx. 3,000- 6,000) does not allow the performance of metabolic flux experiments, which require higher cell numbers per sample, several time points per condition and an adequate number of replicates per experiment. Moreover, NCI-H295R cells being adrenocortical carcinoma cells are expected to have substantially altered metabolic fluxes compared to normal cells. Since we wouldn’t have the capacity to confirm findings from metabolic flux experiments in NCI-H295R cells in primary adrenocortical cells, as we did for the rest of the experiments, we decided to not perform metabolic flux experiments in NCI-H295R cells. However, performing metabolic flux analyses in adrenocortical cells under inflammatory or other stress conditions remains an important future task that we will pursue upon establishment of a more suitable cell culture system.

      4) The proposed connection of DNMT and IL1 signaling to systemic inflammation and reduced steroidogenesis could be more firmly established by additional studies in adrenal cortical cells lacking these genes.

      We thank the Reviewer for this excellent suggestion. In the revised manuscript we strengthened the evidence for an IL-1β –DNMT1 link and show that DNMT1 deficiency blocks the effects of IL-1β on SDHB promoter methylation (Figure 6k), the succinate / fumarate ratio (Figure 6m), the oxygen consumption rate (Figure 6n) and steroidogenesis (Figure 6o-q) in adrenocortical cells. In order to validate the role of IL-1β in vivo, mice were simultaneously treated with LPS and Raleukin, an IL-1R antagonist. Treatment with Raleukin increased the SDH activity (Figure 6r), reduced succinate levels and the succinate / fumarate ratio (Figure 6s,t) and increased corticosterone production in LPS-treated mice (Figure 6u).

      Reviewer #2 (Public Review):

      The present manuscript provides a mechanistic explanation for an event in adrenal endocrinology: the resistance which develops during excessive inflammation relative to acute inflammation. The authors identify disturbances in adrenal mitochondria function that differentiate excessive inflammation. During severe inflammation the TCA in the adrenal is disrupted at the level of succinate production, producing an accumulation of succinate in the adrenal cortex. The authors also provide a mechanistic explanation for the accumulation of succinate: they demonstrate that IL1b decreases expression of SDH, the enzyme that degrades succinate, through a methylation event in the SDH promoter. This work presents a solid explanation for an important phenomenon. Below are a few questions that should be resolved experimentally.

      1) The authors should confirm through direct biochemical assays of enzymatic activity that steroidogenesis enzyme activity is not impaired. Many of these enzymes are located in the mitochondria and their activity may be diminished due to the disturbed, high succinate environment of the cortical cell as opposed to the low ATP production.

      We thank the Reviewer for this question. The activity of the first and rate-limiting steroidogenic enzyme, cytochrome P450-side-chain-cleavage (SCC, CYP11A1) which generates pregnenolone from cholesterol, was recently shown to require intact SDH function (5). In agreement with this report we show that production of progesterone, the direct derivative of pregnenolone, is impaired upon SDH inhibition (Figure 5b,e,h). In addition, we assessed the activity of CYP11B1 (steroid 11β-hydroxylase), the enzyme catalyzing the conversion of 11-deoxycorticosterone to corticosterone, i.e. the last step of glucocorticoid synthesis, by determining the corticosterone and 11-deoxycorticosterone levels by LC-MS/MS and calculating the ratio of corticosterone to 11-deoxycorticosterone in ACTH-stimulated adrenocortical cells and explants. The corticosterone / 11-deoxycorticosterone ratio was not affected by Sdhb silencing in adrenocortical cells (Figure 5- Supplement 2g) nor did it change upon LPS treatment in adrenal explants (Figure 5- Supplement 2h), suggesting that CYP11B1 activity may not be altered upon SDH blockage. Hence, we propose that upon inflammation impairment of SDH function may disrupt at least the first steps of steroidogenesis (producing pregnenolone/progesterone), thereby diminishing production of all downstream adrenocortical steroids. This is now discussed in the revised manuscript.

      2) What is the effect of high ROS production? Is steroidogenesis resolved if ROS is pharmacologically decreased even if the reduction of ATP is not resolved?

      We thank the Reviewer for this suggestion, which helped us to broaden our findings. Indeed, ROS scavenging by the vitamin E analog Trolox (Figure 5n) partially reversed the inhibitory effect of DMM on steroidogenesis (Figure 5o,p), suggesting that impairment of SDH function impacts steroidogenesis also via enhanced ROS production (Figure 4g).

      3) Does increased intracellular succinate (through cell permeable succinate treatment) inhibit steroidogenesis even if there is not a blockage of OXPHOS?

      We suggest that SDH inhibition and succinate accumulation lead to reduced steroidogenesis due to impaired oxidative phosphorylation (Figure 4c,e, 5i), reduced ATP synthesis (Figure 4d, 5j-m) and increased ROS production (Figure 4g, 5o,p). Since SDH is part of the electron transport chain (complex II), it cannot be decoupled from oxidative phosphorylation, which limits the experimental means of addressing this question.

      4) It should be demonstrated that the genetic loss of IL1 signaling in adrenal cortical cells results in a loss of the effect of LPS on reduced steroidogenesis and increased succinate accumulation.

      We thank the Reviewer for this suggestion. Developing a mouse line with genetic loss of Il-1r in adrenocortical cells was not feasible within the short revision period. Instead, mice under LPS treatment were treated with the IL-1R antagonist, Raleukin, to study the in vivo effects of IL-1β in the adrenal gland. IL-1R antagonism increased SDH activity in the adrenal cortex (Figure 6r), decreased succinate levels and the succinate/fumarate ratio in the adrenal gland (Figure 6s,t) and enhanced corticosterone production (Figure 6u) in LPS-treated mice, supporting our hypothesis that IL-1β mediates the effects of systemic inflammation in the adrenal cortex.

      5) It should be demonstrated that the genetic loss of IL1 signaling in adrenal cortical cells results in a loss of the effect of LPS on SDH activity, ATP production and SDH promoter methylation.

      As outlined above, Raleukin treatment increased SDH activity in the adrenal cortex (Figure 6r) and decreased succinate levels and the succinate/fumarate ratio in the adrenal gland (Figure 6s,t) of mice treated with LPS. Furthermore, IL-1β reduced the ATP/ADP ratio (Figure 6e) and enhanced SDHB promoter methylation in NCI-H295R cells (Figure 6k).

      6) It should be shown that the silencing of DNMT eliminates or diminishes the effect of LPS on reduced steroidogenesis and increased succinate accumulation.

      We thank the Reviewer for this suggestion, which prompted us to strengthen the evidence for the implication of DNMT1 in the effects of LPS on adrenocortical cell metabolism and function. As mentioned above, developing a new mouse line, in this case bearing genetic loss of DNMT1 in adrenocortical cells, was not feasible within the short revision period. Therefore, we assessed the role of DNMT1 by silencing it via siRNA transfections in primary adrenocortical cells and NCI-H295R cells. We show that DNMT1 silencing inhibits the effect of IL-1β on SDHB promoter methylation (Figure 6k), restores Sdhb expression (Figure 6l) and reduces the succinate/fumarate ratio in IL-1β treated adrenocortical cells (Figure 6m). Accordingly, DNMT1 silencing restores ACTH-induced production of corticosterone, 11-deoxycorticosterone and progesterone in IL-1β treated adrenocortical cells (Figure 6o-q). We chose to stimulate adrenocortical cells with IL-1β instead of LPS because in vitro the effects of IL-1β were more robust than those of LPS (possibly due to a reduction of TLR4 expression or function in cultured adrenocortical cells), and in order to show the link between IL-1β and DNMT1.

      7) Does silencing of DNMT reduce OXPHOS in adrenal cortical cells?

      We measured the oxygen consumption rate in NCI-H295R cells, which were transfected with siRNA against DNMT1 and treated or not with IL-1β. IL-1β reduced the OCR in cells transfected with control siRNA, while DNMT1 silencing blunted the effect of IL-1β (Figure 6n).

      8) The effects of LPS on reduced adrenal steroidogenesis are not elaborated at the physiological level. The manuscript should demonstrate the ramifications of the adrenal function decreasing after LPS. Does CORT release become less pronounced after subsequent challenges? Does baseline CORT decrease at some point? No physiological consequences are shown. Similarly, these physiological consequences of decreased adrenal function should be dependent on decreased SDH activity and OXPHOS in adrenal cells and this should be demonstrated experimentally.

      We thank the Reviewer for raising this excellent question. Inflammation is a potent inducer of the Hypothalamus-Pituitary-Adrenal gland (HPA) axis, causing increased glucocorticoid production, a stress response leading to vital immune and metabolic adaptations. Accordingly, LPS treatment rapidly increases glucocorticoid production in mice (1, 6, 7). Reduced adrenal gland responsiveness to ACTH is associated with decreased survival of septic mice (8). These preclinical findings are in accordance with observations in septic patients, in whom impairment of adrenal function correlates with a high risk of death (9). Along this line, the ACTH test was suggested to have prognostic value for identifying septic patients with high mortality risk (9, 10).

      In order to confirm impairment of the adrenal gland function in septic mice, animals were subjected to sepsis via administration of a high LPS dose (10 mg / kg) and treated with ACTH 24 h later. Indeed, the ACTH-induced increase in corticosterone levels was diminished in LPS-treated mice (Author response image 3). This finding was further confirmed in adrenal explants, in which LPS pre-treatment also blunted ACTH-stimulated corticosterone production (Figure 5s).

      Author response image 3. High LPS dose blunts the ACTH response in mice. C57BL/6J mice were i.p. injected with 10 mg/kg LPS or PBS and 24 h later they were i.p. injected with 1 mg/kg ACTH. One hour after ACTH administration blood was retroorbitally collected and corticosterone plasma levels were determined by LC-MS/MS. n=4-5; data are presented as mean ± s.d. Statistical analysis was done with two-tailed Mann-Whitney test. *p < 0.05.

      Given that the purpose of our studies was to dissect the mechanisms underlying adrenal gland dysfunction in inflammation rather than to analyze its physiological consequences, we chose not to follow these lines of investigation and instead concentrated on the role of cell metabolism in adrenocortical cells in the context of inflammation.

      References

      1. W. Kanczkowski, A. Chatzigeorgiou, M. Samus, N. Tran, K. Zacharowski, T. Chavakis, S. R. Bornstein, Characterization of the LPS-induced inflammation of the adrenal gland in mice. Mol Cell Endocrinol 371, 228-235 (2013).
      2. L. S. Chen, S. P. Singh, M. Schuster, T. Grinenko, S. R. Bornstein, W. Kanczkowski, RNA-seq analysis of LPS-induced transcriptional changes and its possible implications for the adrenal gland dysregulation during sepsis. J Steroid Biochem Mol Biol 191, 105360 (2019).
      3. V. I. Alexaki, G. Fodelianaki, A. Neuwirth, C. Mund, A. Kourgiantaki, E. Ieronimaki, K. Lyroni, M. Troullinaki, C. Fujii, W. Kanczkowski, A. Ziogas, M. Peitzsch, S. Grossklaus, B. Sonnichsen, A. Gravanis, S. R. Bornstein, I. Charalampopoulos, C. Tsatsanis, T. Chavakis, DHEA inhibits acute microglia-mediated inflammation through activation of the TrkA-Akt1/2-CREB-Jmjd3 pathway. Mol Psychiatry 23, 1410-1420 (2018).
      4. C. Yang, J. C. Matro, K. M. Huntoon, D. Y. Ye, T. T. Huynh, S. M. Fliedner, J. Breza, Z. Zhuang, K. Pacak, Missense mutations in the human SDHB gene increase protein degradation without altering intrinsic enzymatic function. FASEB J 26, 4506-4516 (2012).
      5. H. S. Bose, B. Marshall, D. K. Debnath, E. W. Perry, R. M. Whittal, Electron Transport Chain Complex II Regulates Steroid Metabolism. iScience 23, 101295 (2020).
      6. W. Kanczkowski, V. I. Alexaki, N. Tran, S. Grossklaus, K. Zacharowski, A. Martinez, P. Popovics, N. L. Block, T. Chavakis, A. V. Schally, S. R. Bornstein, Hypothalamo-pituitary and immune-dependent adrenal regulation during systemic inflammation. Proc Natl Acad Sci U S A 110, 14801-14806 (2013).
      7. W. Kanczkowski, A. Chatzigeorgiou, S. Grossklaus, D. Sprott, S. R. Bornstein, T. Chavakis, Role of the endothelial-derived endogenous anti-inflammatory factor Del-1 in inflammation-mediated adrenal gland dysfunction. Endocrinology 154, 1181-1189 (2013).
      8. C. Jennewein, N. Tran, W. Kanczkowski, L. Heerdegen, A. Kantharajah, S. Drose, S. Bornstein, B. Scheller, K. Zacharowski, Mortality of Septic Mice Strongly Correlates With Adrenal Gland Inflammation. Crit Care Med 44, e190-199 (2016).
      9. D. Annane, V. Sebille, G. Troche, J. C. Raphael, P. Gajdos, E. Bellissant, A 3-level prognostic classification in septic shock based on cortisol levels and cortisol response to corticotropin. JAMA 283, 1038-1045 (2000).
      10. E. Boonen, S. R. Bornstein, G. Van den Berghe, New insights into the controversy of adrenal function during critical illness. Lancet Diabetes Endocrinol 3, 805-815 (2015).
      11. C. C. Huang, Y. Kang, The transient cortical zone in the adrenal gland: the mystery of the adrenal X-zone. J Endocrinol 241, R51-R63 (2019).
    1. Author Response

      Reviewer #1 (Public Review):

      […] Overall, the results from these analyses are convincing and valuable, but still do not seem to be a big leap from their Unger 2021 paper […]. The methodology that they established should be described more clearly so that it can be shared with the research community. For example, they say cells how many donors were recruited for this experiment? are there differences in efficiency in B cell differentiation by individual?

      Also, it would be important to assay for antibodies in the culture media. How would you suggest to improve the culture system to be used to model diseases?

      We appreciate the reviewer's queries and the points raised. In response to the first set of comments, the reviewer has correctly observed that the methodology of the assay itself as employed in this paper is not new or superior to our previously published data in (Unger et al., Cells 2021), where we described a minimalistic in vitro system for efficient differentiation of human naive B cells into antibody-secreting cells (ASCs). However, the current study aims to elucidate a comprehensive evaluation of the phenotype of the cells in the in vitro system and their relationships in potential differentiation pathways. In addition, we aimed to elucidate how the detailed gene expression profiles of the differentiating cells in vitro compare to in vivo observed counterparts. In this way, we were able to uncover an antibody secreting cell phenotype in vivo that was not observed before and could only be uncovered due to our full transcriptome knowledge of these cells. In addition, we present novel findings that demonstrate that this culture system not only enables efficient ASCs generation but also recapitulates the entire in vivo B cell differentiation pathway, as evidenced by the presence of germinal-center (GC)-like and pre-memory B cells in the culture. These results have not been previously reported in the literature for human B cells in culture and represent a significant contribution to the field of human B cell biology.

      With regard to the reviewer's inquiry about the cell culture protocol, its reproducibility, donor variability, and additional experimental applications, we refer to three additional recent publications from our group that have adopted the same in vitro B cell differentiation system and have provided extensive analysis of the immunoglobulin production, intracellular signaling pathways, as well as comparison with other culture systems in the field (Marsman et al., Cells 2020; Marsman et al., Eur. J. Immunol. 2022; Marsman et al., Front. Immunol. 2022). In addition, we now realize that the section that describes the culture system (MATERIAL AND METHODS - “In vitro naive B cell differentiation cultures”) was too concise, and we thank the reviewer for mentioning it. We have now expanded it and corrected an inconsistency at lines 125-127: “After six days, activated B cells were collected and co-cultured with 1 × 10^4 9:1 wild type (WT) to CD40L-expressing 3T3 cells that were irradiated and seeded one day in advance (as described above), together with IL-4 (100 ng/ml) and IL-21 (50 ng/ml; Invitrogen) for five days.”

      As for the application of our in vitro system in disease modeling, as requested by the reviewer, this would require modifying the culture conditions to mimic the disease-specific biology background (if known). For instance, by inhibiting or enhancing specific transcriptional pathways that are known to be associated with the disease in question. However, it would also require the presence of antigen-specific B cells in the pool of naive B cells included in the culture, which can be difficult to achieve due to their low frequency. Alternatively, the system could be used to study antigen-specific recall responses using antigen-specific memory B cells as starting material. Our group has evaluated this approach in a recent publication (Marsman et al., Front. Immunol 2022).

      [..] B cell differentiation may also influence to cell cycle regulation. Rather than normalize its effect, can authors analyze effect of cell cycle in B cell differentiation? [...]

      We very much agree with the reviewer that the cell cycle plays a significant role in B cell differentiation output trajectories (Zhou et al, Front Immunol. 2018; Duffy et al., Science 2012). While preparing the manuscript, we in fact performed a parallel analysis in which we compared clustering and marker gene selection with and without cell cycle regression. The dataset without cell cycle regression yielded additional clusters compared to the cell-cycle-regressed dataset (figure below). However, when we overlaid the clusters obtained with the cell-cycle-regressed dataset, the extra clusters turned out to be the same cell populations, now split into cycling and not-cycling cells: cluster 2 is divided into the cycling cluster “c” and the not-cycling cluster “d”, while clusters 4 and 5 are divided into the cycling cluster “e” and the not-cycling cluster “f”. A comprehensive examination of the expression of the top 50 genes associated with antibody-secreting cells in the cycling and not-cycling cells of clusters 4 and 5 reveals that these genes are expressed at a higher level in cluster 5 than in cluster 4, irrespective of cycling status. This suggests that the cells within cluster 5 are more advanced in their differentiation, regardless of their cell cycle state. This finding led us to present the cell-cycle-regressed data in the manuscript. Should the reviewer so desire, we are very willing to add supplementary figures to the manuscript that show the un-regressed representation.

      Figure legend: A-C) UMAP projection of single-cell transcriptomes of in vitro differentiated human naive B cells without cell cycle regression. Each point represents one cell, and colors indicate graph-based cluster assignments identified without cell cycle regression (A), with cell cycle regression (B) or with cell cycle regression and additional subdivision into cycling and not-cycling cells (C). D) Dotplot showing the top 50 differentially expressed genes in cycling and not-cycling cells from clusters 4 and 5. Point size indicates the percentage of cells in the cluster expressing the gene; color indicates average expression.
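
      The overlay of the two clusterings described above amounts to a cross-tabulation of per-cell labels. The following minimal Python sketch illustrates the idea under stated assumptions: the function names are hypothetical, and the six per-cell assignments are toy placeholders that only mirror the cluster names used in this response (2, 4, 5 and c to f), not the actual dataset.

```python
from collections import Counter

# Illustrative sketch only: function names are hypothetical and the per-cell
# labels below are toy placeholders, NOT assignments from the real dataset.

def overlay(labels_a, labels_b):
    """Count cells for every (label in clustering A, label in clustering B) pair."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same cells")
    return Counter(zip(labels_a, labels_b))

def split_map(table):
    """For each cluster of A, the set of clusters of B that its cells fall into."""
    mapping = {}
    for (a, b) in table:
        mapping.setdefault(a, set()).add(b)
    return mapping

# One entry per cell, in the same cell order for both clusterings.
regressed = ["2", "2", "4", "4", "5", "5"]    # with cell cycle regression
unregressed = ["c", "d", "e", "f", "e", "f"]  # without cell cycle regression

table = overlay(regressed, unregressed)
mapping = split_map(table)
```

      In this toy example `mapping` shows cluster 2 falling into “c” and “d”, and clusters 4 and 5 each falling into “e” and “f”, which is the pattern described in the text.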

    1. Author Response

      Reviewer #3 (Public Review):

      The manuscript by the Qiu and Lu labs investigates the mechanism of desensitization of the acid-activated Cl- channel, PAC. These trimeric channels reside in the plasma membrane of cells as well as in organelles and play important roles in human physiology. PAC channels, like many other ion channels, undergo a process known as desensitization, where the channel adopts a non-conductive conformation in the presence of a prolonged physiological stimulus. For PAC the molecular mechanisms regulating this process are not well understood. Here the authors use a combination of electrophysiological recordings and MD simulations to identify several acidic residues and a conserved histidine side chain as important players in PAC desensitization. The results are overall interesting and clearly indicate a role for these residues in this process. However, there are several weaknesses in the experimental design, inconsistencies between the mutagenesis data and the MD results, as well as in the interpretation of the data. For these reasons I do not think the authors have made a convincing mechanistic case.

      We thank the reviewer for the constructive comments and address the concerns point-by-point below.

      Major weaknesses:

      The underlying assumption in the interpretation of all the data is that the mutations stabilize or destabilize the desensitized conformation of the channel. However, none of the functional measurements provide direct evidence supporting this key assumption. Without direct evidence supporting the notion that the mutations specifically impact the rate of recovery from desensitization, I do not think the authors have made a convincing mechanistic case.

      We agree with the reviewer that our functional data measure the degree and rate of the PAC channel entering desensitization from the activated state upon prolonged acid treatment. This is a common experimental procedure for research on desensitization/inactivation of ion channels. Following the reviewer’s suggestion, we also sought to capture the kinetics from the desensitized state to the activated state by switching from more acidic pH to less acidic pH (for example 4.0 to 5.0) or neutral pH. However, we found that such experiments are not feasible, partly because the kinetics of PAC desensitization are much slower than those of other channels, such as ASIC channels (see a recent study we cited: https://elifesciences.org/articles/51111). For the mutants with strong desensitization (E94R and D91R), it’s unclear whether the currents we recorded at pH 5.0 right after pH 4.0 represent the activated state or the desensitized state at pH 5.0. In other words, we don’t know whether the PAC channel transitions from the desensitized state at a lower pH back to the activated state or rather directly to the desensitized state at a higher pH. For the mutants with reduced desensitization, the current amplitudes at pH 4.0 were often similar to those at pH 5.0, which makes the recovery/transition variable. We also tried switching from acidic pH to neutral pH. We found that the PAC channels (both WT and mutants) go back to the closed state from the desensitized state within seconds, as limited by our perfusion speed. These data suggest that the desensitized state of PAC is no longer maintained after switching the buffer from low pH to neutral pH. In summary, it is technically infeasible, in our opinion, to measure the rate of recovery from desensitization to activation for the PAC channel. However, our data do support the conclusion that the rates of entering desensitization from the activated state, a standard measurement of desensitization, change for the various channel mutants we studied.

      Overall, the agreement between the MD simulations, functional data, and interpretation are often weak and some issues should be acknowledged and addressed.

      For example:

      1) The experimental data suggests that H98, E107, and D109 play analogous roles in PAC desensitization. However, the MD simulations suggest that the H98-D109 interaction energy is ~4 times larger than that of H98-E107. This should lead to a much greater effect of the D109 mutation. How is this rationalized?

      The purpose of quantifying the interaction of H/R98 with E107 and D109 is to better dissect the mechanism by which H/R98 interacts with the acidic pocket residues. The result suggests that R98 has a reduced association with E107/D109 when compared to H98. It also suggests that D109 makes a more direct interaction with H/R98 than E107 does. We acknowledge that this was not clear in our initial manuscript and we have updated the text to better describe this result. However, this doesn’t imply that the desensitization phenotype of E107R should be less pronounced than that of D109R. Both E107R and D109R are expected to disrupt the integrity of the acidic pocket, thus resulting in diminished channel desensitization. It is worth pointing out that E107 plays a more complex role, as it was identified in our previous papers as one of the major proton sensors. The E107R mutant could allow the PAC channel to become more sensitive to acid-induced activation (Figure 4d-e in Ruan et al, Nature, 2020), further complicating its effect on desensitization. Taken together, we don’t think the strength of the E107/D109 and H/R98 interaction can be expected to correlate quantitatively with the desensitization phenotype of E107R and D109R.

      2) The experimental data show that E94 plays a key role in desensitization and the authors argue that this is due to the interactions of this residue with the β10-11 linker. However, the MD simulations show that these interactions happen for a small fraction, ~10%, of the time and with interaction energies comparable to those of the H98-E107-D109 cluster. It is not clear how these sparse and transient interactions can play such a critical role in desensitization. Also, if the interaction energies are of the same sign, how come one set of mutants favors desensitization and one does not?

      The 10% value is the fraction of time during which at least one hydrogen bond forms between E94/R94 and the β10–β11 loop. It is NOT the fraction of time during which they interact, as there can be other types of non-bonded interactions, such as van der Waals and Coulombic interactions. In fact, our non-bonded energy calculation clearly suggests that R94 interacts with the β10–β11 loop much more favorably than E94 does (Figure 4C). The impact of E94R on the β10–β11 loop is also reflected in the root-mean-square-fluctuation analysis, where the β10–β11 loop shows reduced flexibility when R94 is present (Figure 4B).

      Our central hypothesis is that PAC becomes more prone to desensitization when the desensitized conformation is stabilized. Two critical interactions are characteristic of the desensitized structure of PAC: the association of E94 with the β10–β11 loop, and that of H98 with E107/D109. Therefore, we expect mutations that alter these interactions to affect PAC channel desensitization. In the MD simulations, we observed that the root-mean-square fluctuation of the β10–β11 loop is reduced for E94R when compared to WT (Figure 4B), suggesting that the β10–β11 loop is stabilized when E94 is replaced by an arginine. The non-bonded interaction energy between residue 94 and the β10–β11 loop is also more negative for E94R when compared to WT (Figure 4C), another indicator of conformational stabilization. As a result, the E94R mutant favors desensitization. This is in sharp contrast with the H98R data, in which R98 interacts less favorably with E107/D109 (Figure 2F, G, H, I) when compared to WT. Although the interaction energies are of the same sign, it is the difference between WT and the mutants that ultimately determines whether a given mutation favors desensitization or not.

      The authors' MD analysis critically depends on assumptions on the protonation states of multiple residues, that are often located in close proximity to each other. In the methods, the authors state they use PropKa to estimate the pKa of residues and assigned the protonation states based on this. I have several questions about this procedure:

      • What pH was considered in the simulations? I imagine pH 4.0 to match that of the electrophysiological experiments.

      The exact pH environment cannot be explicitly modeled in standard MD, as the protonation state of an ionizable group is not allowed to change during the simulation. Therefore, we prepared the MD system by first predicting the pKa of the titratable residues of PAC in the desensitized state, and then assigning the protonation state of these residues based on the pKa values. We acknowledge that this part was not described clearly in our original manuscript. We have revised the Methods to better describe how the protonation states are assigned.

      • Was the propKa analysis run considering how choices in the protonation state of neighboring residues affect the pKa of the other residues? This is critical because the interaction energies will greatly depend on the protonation state chosen.

      The pKa analysis was done based on the WT structure and the residue protonation status was assigned based on the predicted value. It is possible that mutations on certain residues could change the pKa of neighboring residues. To evaluate this impact, we carried out pKa prediction for all the mutant structures that we used as input for simulation. This is summarized in the table below:

      As shown in the table, although mutations will affect the pKa of neighboring residues, the impact is generally within 0.3 units. As our simulation is carried out based on a pH of 4.0, this variability will not affect how we assign the residue protonation status.
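
The assignment rule implied in this response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes the usual convention that a titratable group is treated as protonated when the simulation pH is below its predicted pKa, and the residue labels and pKa values are hypothetical placeholders (the table of predicted values is not reproduced here). The helper `assignment_is_robust` checks whether a pKa shift of up to 0.3 units could flip the assignment at pH 4.0.

```python
def is_protonated(pka: float, ph: float) -> bool:
    """Treat a titratable group as predominantly protonated when pH < pKa."""
    return ph < pka


def assignment_is_robust(pka: float, ph: float, shift: float) -> bool:
    """True if moving the pKa by up to +/- shift never flips the assignment."""
    return is_protonated(pka - shift, ph) == is_protonated(pka + shift, ph)


SIM_PH = 4.0      # pH stated in the response
MAX_SHIFT = 0.3   # largest mutation-induced pKa change stated in the response

# Hypothetical pKa values, for illustration only.
example_pkas = {"residue_A": 2.9, "residue_B": 6.4, "residue_C": 4.1}

for name, pka in example_pkas.items():
    print(name, is_protonated(pka, SIM_PH), assignment_is_robust(pka, SIM_PH, MAX_SHIFT))
```

Under this rule, only residues whose predicted pKa lies within 0.3 units of pH 4.0 (such as the hypothetical `residue_C`) would be sensitive to the reported variability.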

      • Was the pKa for the mutant constructs re-evaluated? For example, does having a Gln or Arg in place of a His affect the pKa of nearby acidic residues?

      We didn’t re-evaluate the pKa for each mutant in our initial manuscript. We have now conducted such an analysis, as indicated in the above table. The result suggests that arginine substitutions of H98/E94/D91 could have an impact on the pKa values of nearby residues. However, the difference is relatively small and does not alter the predominant protonation state of these residues at pH 4.0.

      • H98R and Q have the same functional effect. The MD partially rationalizes the effect of H98R, however, it is not clear how Q would have the same effect as R on the interaction energies.

      Our analyses of H98R and H98Q serve two different purposes. H98 is expected to be protonated at pH 4.0. The fact that the H98Q mutant reduced PAC desensitization suggests that a positive charge at this location is critical for PAC desensitization; we attribute the reduced desensitization to the loss of the favorable interaction between H98 and E107/D109. This is different from the H98R mutant, as arginine bears the same amount of charge as a protonated histidine. Our data suggest that the exact biochemical properties of H98, including its charge and side-chain flexibility, are crucial for PAC desensitization.

      • Are 600 ns sufficient to evaluate sampling of the different conformations?

      Our MD analysis is not intended to sample large conformational transitions between different functional states. Instead, it focuses on local dynamics, which allowed us to correlate the observations with the electrophysiology data. During the revision, we extended our simulations to 1 μs for each mutant. It is worth pointing out that, because the PAC protein is a trimer and we performed all the calculations across the three subunits, the effective sampling time is 3 μs in total. The new results agree with our initial analysis, suggesting that the sampling time is sufficient to evaluate the metrics reported in the study. We also acknowledged this limitation of our study in the discussion.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors push a fresh perspective with a sufficiently sophisticated and novel methodology. I have some remaining reservations that concern the actual make-up of the data basis and consistency of results between the two (N=16) samples, the statistical analysis, as well as the “travelling” part.

      I previously commented on the fact that findings from both datasets were difficult to discern and more effort should be made to highlight these. Also, a major conclusion “the directionality effect [effect of attention on forward waves] only occurs for visual stimulation” only rested on a qualitative comparison between studies. The authors have improved on this here, e.g., by toning down this conclusion. One thing that is still missing is a graphical representation of the data from Foster et al. (the second dataset analysed here) that would support the statistical results and allow the reader a visual comparison between the sets of findings.

      We are glad that the reviewer recognizes the improvement in the presentation of the conclusions. Following the suggestions, we have modified figure 2, not only by including a third dataset (see point below), but also in a way that allows a direct comparison between the three datasets. Specifically, the results from the three datasets are now shown in three columns next to each other. The first row shows the FW and BW waves in contra- and ipsilateral lines of electrodes for each dataset: our dataset and the one from Feldmann-Wüstefeld and colleagues (the first and the second column in the figure, both with visual stimulation) show a clear interaction between direction and laterality, as confirmed by the statistical analysis. The dataset from Foster and colleagues (the third column, no visual stimulation) shows a laterality effect only in the backward waves but not in the forward ones, in line with the hypothesis that FW waves are modulated only in the presence of visual stimulation. The second row shows a schematic representation of the task, and the third row illustrates the lines of electrodes used in each dataset. We hope the reviewer will be satisfied with the current data presentation.

      Also, for any naive reader, the concept of travelling waves may be hard to grasp in the way data are currently presented - only based on the results of the 2D-FFT. Can forward and backward-travelling waves be illustrated in a representative example to make this more intuitive?

      We thank the reviewer for the suggestion. We included in figure 1 an additional panel E that represents a schematic example of forward and backward waves in the temporal domain (i.e., in the EEG data). We hope this example will provide a better understanding of the data and the traveling wave concept.

      Finally, the way Bayes Factors from the Bayesian ANOVA are presented, especially with those close to the ‘meaningful boundaries’ ⅓ and 3, as defined in the ‘Statistical analysis’ section, requires some unification/revision. For example, here: “We found a positive correlation between contra- and ipsi- lateral backward waves, and occipital (all Pearson’s r~=0.4, all BFs 10 ~=3) and -to a smaller extent- frontal areas (all Pearson’s r~=0.3, all BFs 10 ~=2).”, where the second part should strictly be labelled as inconclusive evidence. In the same vein, there is occasional mention of “negative effects”, where it should say that evidence favours the absence of an effect.

      We agree with the reviewer and apologize for the inaccuracies in reporting the statistical analysis. We corrected as suggested (see below), replacing ‘negative effects’ with ‘evidence favors the absence of an effect’.

      From the updated manuscript:

      "We found moderate evidence of a positive correlation between contra- and ipsi- lateral backward waves, and occipital (all Pearson’s r~=0.4, all BFs10~=3) but inconclusive evidence in the frontal areas (all Pearson’s r~=0.3, all BFs10~=2)."

      From the revised ‘Results’ section, now it reads:

      […] whereas all other factors and their interactions revealed evidence in favor of the absence of an effect (BFs10<0.3).

      […] but not in the forward waves (BF10=0.231, error<0.01%, supporting evidence in favor of the absence of an effect).

      Reviewer #2 (Public Review):

      The present manuscript takes a new perspective and investigates the functional relevance of traveling alpha waves’ direction for visual spatial attention. While the modulation of alpha oscillatory power - and especially the lateralization of alpha power - has been associated with spatial attention in the literature, the present investigation offers a new perspective that helps understand and differentiate the functional roles of alpha oscillations in the ipsi- versus contralateral hemisphere for spatial attention.

      The present study uses a straightforward approach and provides an analysis of two EEG datasets, which are convergingly in line with the authors’ claim that two patterns of travelling alpha waves need to be differentiated in visual spatial attention. First, backward waves in the ipsilateral hemisphere, and second, forward waves in the contralateral hemisphere, which are only observed during visual stimulation. Importantly, the authors test the relation of these patterns of traveling waves to the overall power of alpha oscillations and to the hemispheric lateralization of alpha power. Furthermore, to test the functional significance, the authors demonstrate that the pattern of forward and backward waves around stimulus onset differentiates between hits and misses in task performance.

      Although the results are in line with the conclusions drawn, some questions remain. The authors investigate the relationship between traveling alpha waves and the hemispheric lateralization of alpha power, which is a well-established neural signature of spatial attention. Surprisingly, the lateralization of alpha power shown in Figure 3B appears relatively weak in the present dataset (by visual inspection), which raises the question of whether the investigation of a relation between lateralized alpha power and alpha traveling waves is warranted in the first place.

      We agree with the reviewer that the effect seems reduced compared to other studies, although the topography of alpha-band lateralization in our data is in line with the literature. In order to quantify the effect, we performed an analysis similar to (Thut et al., 2006), defining a laterality index as:

      We computed such index for occipital electrodes and their average (in red in figure R1). The results reveal that for most electrodes, including their average, the laterality index is significantly larger than 0, confirming the presence of alpha-band lateralization. However, we also note that the amplitude of the effect (~0.04) is reduced compared to the study by Thut and colleagues, which was between 0.05 and 0.10.

      Figure R1 – Laterality index for occipital electrodes, quantifying alpha-band lateralization during attention allocation. All electrodes go in the expected direction, revealing an increase of alpha-band power in the ipsilateral occipital hemisphere.
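
The formula for the laterality index is not reproduced in this response. As a hedged illustration only, the sketch below assumes a normalized-difference form, (ipsilateral − contralateral) / (ipsilateral + contralateral), which is positive when alpha power is higher in the ipsilateral hemisphere, matching the direction described in the caption. The function name `laterality_index` and the power values are hypothetical, and the exact definition used by the authors may differ.

```python
def laterality_index(ipsi_power: float, contra_power: float) -> float:
    """Assumed normalized difference: positive when ipsilateral power is larger."""
    total = ipsi_power + contra_power
    if total <= 0:
        raise ValueError("alpha power values must sum to a positive number")
    return (ipsi_power - contra_power) / total


# Hypothetical alpha power values (arbitrary units) for one electrode pair.
print(laterality_index(1.04, 0.96))
```

With this form, an index of about 0.04 corresponds to ipsilateral alpha power roughly 8% higher than contralateral power.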

      Furthermore, the authors employ between-subject correlations (with N = 16) to test the relationship between alpha traveling waves and (lateralized) alpha power. However, as inter-individual differences in patterns of travelling waves are not the main focus here, within-subject analyses of the same relations would be able to test the authors’ hypotheses much more directly.

      As suggested, we included the recommended within-subject analysis in the revised manuscript by computing a trial-by-trial correlation between alpha power and traveling waves for each participant. First, we obtained a correlation coefficient and a p-value for each subject. Then, we tested whether the correlation coefficients had an overall positive or negative distribution (i.e., according to our previous results, we expected a positive correlation between backward waves and alpha power). Additionally, we combined the p-values to test for overall significance (using the Fisher method, see Methods section below). Our results corroborate the between-subject correlation, supporting the conclusion that alpha-band power correlates mostly with backward waves (especially contralateral to the attended location). The other correlations (i.e., forward waves and alpha power) were statistically inconclusive. We included these new results in the revised manuscript, as shown in the following.

      From the Results section:

      “To further investigate the relation between alpha-band travelling waves and alpha power, we performed the same analysis focusing on the correlation within each participant. In particular, we correlated trial-by-trial forward and backward waves with alpha-band power for each subject, obtaining correlation coefficients ‘r’ and their respective p-values. As in the previous analysis, we correlated forward and backward waves with frontal and occipital electrodes in both contra- and ipsilateral hemispheres. We applied the Fisher method (Fisher, 1992, see Methods for details) to combine all subjects' p-values in every condition. Overall, we found a significant effect of all combined p-values (p<0.0001), except in the lateralization condition (contra- minus ipsilateral hemisphere), similar to our previous analysis. Additionally, we tested for a consistent positive or negative distribution of the correlation coefficients. As shown in figure 3C, the results support a significant correlation between backward waves and alpha power in the hemisphere contralateral to the attended location (BF10=10.7 and BF10=7.4 for occipital and frontal regions, respectively; all other BF10 were between 1 and 2, providing inconclusive evidence). Interestingly, this analysis also revealed a small but consistent effect in the correlation between lateralization effects, as we found a consistently positive correlation in the contra- minus ipsilateral difference between forward waves and alpha power (BF10~5 for both frontal and occipital electrodes). However, it’s important to note that the combined p-values obtained using the Fisher method did not reach the significance threshold in the lateralization condition, reducing the relevance of this specific result.”

      From the Methods section:

      “Additionally, we computed trial-by-trial correlations between waves and alpha power for all participants. First, we tested the correlation coefficient against zero in all conditions. Then, we obtained a combined p-value per condition using the log/lin regress Fisher method (Fisher, 1992), as shown in (Zoefel et al., 2019). Specifically, we computed the T value of a chi-square distribution with 2*N degrees of freedom from the p-values p_i of the N participants as:
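
For readers unfamiliar with it, Fisher's method in its standard form combines N independent p-values through the statistic T = −2 Σ ln(p_i), which follows a chi-square distribution with 2N degrees of freedom under the null hypothesis. The sketch below is a generic, standard-library illustration of that textbook form and is not taken from the manuscript; the function names and the example p-values are hypothetical. It uses the closed-form chi-square survival function that exists for even degrees of freedom.

```python
import math


def chi2_sf_even_dof(x: float, n_pairs: int) -> float:
    """Survival function of a chi-square variable with 2 * n_pairs degrees of freedom.

    For even degrees of freedom: exp(-x/2) * sum_{k=0}^{n_pairs-1} (x/2)^k / k!
    """
    half = x / 2.0
    term = 1.0   # k = 0 term
    total = 1.0
    for k in range(1, n_pairs):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Return (T, combined p) for a list of independent p-values in (0, 1]."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    t_stat = -2.0 * sum(math.log(p) for p in p_values)
    return t_stat, chi2_sf_even_dof(t_stat, len(p_values))


# Hypothetical per-participant p-values, for illustration only.
t_stat, p_combined = fisher_combined_p([0.04, 0.20, 0.01, 0.35])
print(t_stat, p_combined)
```

With a single p-value the combined p-value equals the input, which is a convenient sanity check for an implementation.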

      It needs to be appreciated that the authors analyze two datasets in the present study. However, the question remains whether the absence of the forward waves effect in paradigms without visual stimulation is a general one and would replicate in other datasets. Moreover, the manuscript would benefit from a discussion of the potential implications of traveling waves for functional connectivity between posterior and anterior regions.

      We have now included a third dataset in the paper. In this dataset, from (Feldmann-Wüstefeld & Vogel, 2019), participants performed a visual working memory task by attending either the left or the right side of the screen where a stimulus was displayed. We analyzed the amount of waves during stimulus presentation, and we found the same results as in our own dataset: very strong evidence in favor of an interaction between LATERALITY (contra- and ipsilateral) and DIRECTION (FW and BW). We now included the results in figure 2 (see point above) and in the results section of the manuscript. Unfortunately, we couldn't find any other publicly available EEG dataset in which participants attend to either side of the screen without ongoing visual stimulation.

      In addition, we re-analyzed our main findings (i.e. the interaction between LATERALITY and DIRECTION) in all three datasets using a classic ANOVA to report the effect size as 𝜂2 (see point above). Unlike the Bayesian ANOVA (which -in JASP- is based on linear mixed models), the classic one does not model the slope of the random effects. With the classic ANOVA, the LATERALITY x DIRECTION interaction in the Foster dataset proved highly significant, with a large effect size (F(1,16)=9.81, p=0.003, 𝜂2=0.13). Presumably, modeling the slope of the random effects in the Bayesian ANOVA lowered its statistical sensitivity. For the sake of completeness, we reported both results in the manuscript.

      Concerning the potential implications of traveling waves for functional connectivity, we consider the interpretation based on the Predictive Coding scheme in the penultimate paragraph of the discussion (reported below for the reviewer’s convenience). In this framework, top-down connections have inhibitory functions, suppressing the predicted activity in lower regions. This interpretation aligns with our findings, relating the inhibitory role of backward travelling waves to visual attention. In the same paragraph, we also refer to the work of Spratling, which extensively investigates the relationship between selective attention and Predictive Coding.

      From the Results section:

      "To confirm our previous results, we replicated the same traveling waves analysis on two publicly available EEG datasets in which participants performed similar attentional tasks (experiment 1 of Foster et al., 2017 and experiment 1 of Feldmann-Wüstefeld and Vogel, 2019). In the first experiment from the Feldmann-Wüstefeld and Vogel dataset, participants were instructed to perform a visual working memory task in which, while keeping a central fixation, they had to memorize a set of items while ignoring a group of distracting stimuli. We focused our analysis on those trials in which the visual items to remember were placed either to the right or the left side of the screen, while the distractors were either in the upper or lower part of the screen (we pooled together the trials with either 2 or 4 distractors, as this factor was irrelevant for the purposes of our analysis). The stimuli were shown for 200ms, and we computed the amount of forward and backward waves in the 500ms following stimulus onset. As shown in figure 2 (central column), the analysis confirmed our previous results, demonstrating a strong interaction between the factors DIRECTION and LATERALITY (BF10=667, error~2%; independently, the factors DIRECTION and LATERALITY had BF10=0.2 and BF10=0.4, respectively). These results confirmed that, in the presence of visual stimulation, spatial attention modulates both forward and backward waves. Next, we analyzed another publicly available dataset from Foster et al., 2017. [...]"

      "Remarkably, as shown in figure 2 (right panel), our analysis demonstrated an effect of the lateralization (LATERALITY: BF10=3.571, error~1%), revealing more waves contralateral to the attended location, but inconclusive results regarding the interaction between DIRECTION and LATERALITY (BF10=2.056, error~1%). However, using a classical ANOVA (i.e., without modeling the slope of the random terms), the interaction between DIRECTION and LATERALITY proved significant (F(1,16)=9.81, p=0.003, 𝜂2=0.13)."

      From the Methods section:

      "We included two additional datasets in this study. In both studies, participants performed a visual attention task while keeping their fixation in the center of the screen. In the Feldmann-Wüstefeld and Vogel, 2019 study, participants were asked to memorize the colors of two stimuli while ignoring a set of distractor stimuli. We analyzed only those trials in which the visual stimuli were presented to the left or right side of the screen, while the distractors were placed above or below the fixation cross. After 500ms of the fixation cross, two colored 'target' stimuli were presented for 200ms. Participants were asked to memorize these stimuli, and a new 'probe' stimulus was shown after an additional second. Participants reported whether the probe matched the target stimuli or not. We analyzed the traveling waves in the 500ms following the target stimulus onset. In the second dataset, from Foster et al. 2017, participants performed a spatial attention task. First, the fixation cross cued participants to covertly attend one of eight possible spatial positions uniformly distributed around the center of the screen. After one second, a digit was displayed either in the cued location or in any other one. The remaining locations were filled with letters. Participants were instructed to report the only displayed digit. We analyzed the waves in the second before stimulus onset when participants attended to the locations cued to the left or right side of the screen (we discarded trials in which participants attended locations above or below the fixation cross). For additional details about both experimental procedures, we refer the reader to Foster et al., 2017 and Feldmann-Wüstefeld and Vogel, 2019."

      From the discussion:

      "Our previous work proposed an alternative cause for the generation of cortical waves (Alamia and VanRullen, 2019). We demonstrated that a simple multi-level hierarchical model based on Predictive Coding (PC) principles and implementing biologically plausible constraints (temporal delays between brain areas and neural time constants) gives rise to oscillatory traveling waves propagating both forward and backward. This model is also consistent with the 2-dipoles hypothesis (Zhigalov and Jensen, 2022), considering the interaction between the parietal and occipital areas (i.e., a model of 2 hierarchical levels). However, dipoles in parietal regions are unlikely to explain the observed pattern of top-down waves, suggesting that more frontal areas may be involved in generating the feedback. This hypothesis is in line with the PC framework, in which top-down connections have an inhibitory function, suppressing the activity predicted by higher-level regions (Huang and Rao, 2011). Interestingly, Spratling proposed a simple reformulation of the terms in the PC equations that could describe it as a model of biased competition in visual attention, thus corroborating the interpretation of our finding within the PC framework (Spratling, 2008, 2012)."

    1. Author Response

      Reviewer #1 (Public Review):

      Point 1) There is ample evidence that the cortical activity in the waking brain, even in head restrained mice, is not uniform but represents a spectrum of states ranging from complete desynchronization to strong synchronization, reminiscent of the up and down states observed during sleep (Luczak et al., 2013; McGinley et al., 2015; Petersen et al., 2003). Moreover, awake synchronization can be local, affecting selective cortical areas but not others (Vyazovskiy et al., 2011). State fluctuations can be estimated using multiple criteria (e.g., pupil diameter). The authors consider reduced glutamatergic drive or long-range inhibition as potential sources of the voltage decrease but do not attempt to address this cortical state continuum, which is also likely to play a role. For example: does the voltage inactivation following ripples reflect a local downstate? The authors could start by detecting peaks and troughs in the voltage signal and investigate how ripple power is modulated around those events.

      Our study is correlational, and hence we cannot speak to any causal role that the awake hippocampal ripples may play in the post-ripple hyperpolarization observed in aRSC. It is indeed possible that the post-awake-ripple neocortical hyperpolarization is independent of ripples and reflects other mechanisms to which our experiments have been blind. One such mechanism is neocortical synchronization in the awake state. As reviewer 1 pointed out, it is possible that a proportion of hippocampal ripples occur before neocortical awake down-states. To test this hypothesis, we triggered the ripple power signal by the troughs (as proxies of awake down-states) and peaks (as proxies of awake up-states) of the voltage signals, captured from different neocortical regions, during periods of high ripple activity when the probability of neocortical synchronization is highest (McGinley et al., 2015; Nitzan et al., 2020). According to this analysis (see the figure below), the ripple power was, on average, higher before troughs of the aRSC voltage signal than before those of other regions. On the other hand, the ripple power, on average, was not higher after the peaks of the aRSC voltage signal than after those of other regions. This observation supports the hypothesis that a local awake down-state could occur in aRSC after the occurrence of a portion of hippocampal ripples. However, a recent work whose preprint version was cited in our submission (Chambers et al., 2022, 2021) reported that, out of 33 aRSC neurons whose membrane potentials were recorded, only 1 showed up-/down-state transitions (bimodal membrane potential distribution). Still, a portion (10 out of 30) of the remaining neurons showed an abrupt post-ripple hyperpolarization. In addition, they reported a modest post-ripple modulation of aRSC neurons’ membrane potential (~20% of the up-/down-state transition range). Hence, these results suggest that the post-ripple aRSC hyperpolarization is not necessarily the result of down-states in aRSC. A paragraph discussing this point was added to the discussion (lines 262-279).

      Mean ripple power triggered by troughs and peaks of voltage signal captured from aRSC, V1, and FLS1. Zero time represents the timestamp of neocortical troughs/peaks. The shading represents SEM (n = 6 animals).
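
The trough- and peak-triggered averaging described in this response can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `local_troughs` and `triggered_average`, the three-point definition of a trough, and the toy arrays are all assumptions, and the restriction to periods of high ripple activity is omitted.

```python
def local_troughs(signal):
    """Indices of samples strictly lower than both neighbours (assumed trough definition)."""
    return [
        i for i in range(1, len(signal) - 1)
        if signal[i] < signal[i - 1] and signal[i] < signal[i + 1]
    ]


def triggered_average(signal, event_indices, half_window):
    """Average `signal` over windows of +/- half_window samples centred on each event."""
    width = 2 * half_window + 1
    sums = [0.0] * width
    count = 0
    for idx in event_indices:
        start, stop = idx - half_window, idx + half_window + 1
        if start < 0 or stop > len(signal):
            continue  # skip events whose window runs past the recording edges
        for offset, value in enumerate(signal[start:stop]):
            sums[offset] += value
        count += 1
    if count == 0:
        raise ValueError("no event has a complete window")
    return [s / count for s in sums]


# Toy data: a cortical voltage trace and a simultaneously sampled ripple-power trace.
voltage = [0, -1, 0, 1, 0, -2, 0, 1, 0]
ripple_power = [3, 1, 0, 0, 4, 2, 0, 0, 0]

troughs = local_troughs(voltage)
print(troughs, triggered_average(ripple_power, troughs, half_window=1))
```

In this toy example the ripple power is highest in the sample preceding each voltage trough, which is the kind of pattern the trough-triggered average is meant to reveal.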

      Point 2) Ripples are known to be heterogeneous in multiple parameters (e.g., power, duration, isolated events/ ripple bursts, etc.), and this heterogeneity was shown to have functional significance on multiple occasions (e.g. Fernandez-Ruiz et al., 2019 for long-duration ripples; Nitzan et al., 2022 for ripple magnitude; Ramirez-Villegas et al., 2015 for different ripple sharp-wave alignments). Is it possible that the small effect size shown here (e.g. 0.3 SD in Fig. 2a) arises because ripples with different properties and downstream effects are averaged together? The authors should attempt to investigate whether ripples of different properties differ in their effects on the cortical signals.

      The seemingly small effect size (e.g. 0.3 SD in Fig. 2a) arises because the individual peri-ripple voltage/glutamate traces were z-scored against a peri-non-ripple distribution and then averaged. Alternatively, the peri-ripple traces could have been averaged first, and the averaged trace could have been z-scored against a sampling distribution constructed from the abovementioned peri-non-ripple distribution, where the sample size would have been the number of ripples detected for a specific animal. In the latter case, the standard deviation of the sampling distribution would have been used as the divisor in the z-scoring process, as opposed to the former case, where the standard deviation of the original peri-non-ripple distribution would have been used. Since the standard deviation of the sampling distribution is smaller than the standard deviation of the original distribution by a factor of √(sample size), the final z-scored values in the latter case would be higher than those in the former case by a factor of √(sample size). For instance, if the sample size in Fig. 2A (number of ripples) was 100, the mean z-scored value would be 0.3*10 = 3. In any case, it is of interest to investigate the relationship between the ripple and neocortical activity features.
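
The √(sample size) relationship between the two z-scoring orders can be checked numerically. The sketch below is a toy illustration for a single time point, with hypothetical values for the peri-non-ripple mean `mu` and standard deviation `sigma`; the function names are assumptions and are not taken from the manuscript.

```python
import math


def mean_of_zscores(values, mu, sigma):
    """Z-score each peri-ripple value against the baseline, then average."""
    return sum((v - mu) / sigma for v in values) / len(values)


def zscore_of_mean(values, mu, sigma):
    """Average first, then z-score against the sampling distribution of the mean."""
    n = len(values)
    standard_error = sigma / math.sqrt(n)  # SD of the mean of n baseline samples
    return (sum(values) / n - mu) / standard_error


# Toy case: 100 ripples, each 0.3 baseline SDs above the baseline mean.
values = [0.3] * 100
print(mean_of_zscores(values, mu=0.0, sigma=1.0))
print(zscore_of_mean(values, mu=0.0, sigma=1.0))
```

For any input, the second quantity equals the first multiplied by √n, so the choice of order changes the scale of the reported effect but not its sign.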

      To investigate the relationship between the hippocampal ripple power and the peri-ripple neocortical voltage activity, we focused on the agranular retrosplenial cortex (aRSC) as it showed the highest level of modulation around ripples. To get an idea of what features of the aRSC voltage activity might be correlated with the ripple power, the ripples were divided into 8 subgroups using 8-quantiles of their power distribution, and the corresponding aRSC voltage traces were averaged for each subgroup (similar to the work of Nitzan et al. (Nitzan et al., 2022)). The results of this analysis are summarized in the figure below.

      Left: peri-ripple aRSC voltage trace was triggered on ripples in the odd-numbered ripple power subgroups for each animal and then averaged across 6 animals. The standard errors of the mean were not shown for the sake of simplicity. Right: the same as the left panel but for only lowest and highest power subgroups. The shading represents the standard error of the mean.

      These results suggested that there might be a positive correlation between the ripple power and the pre-ripple and post-ripple aRSC voltage amplitude. To test this possibility, Pearson’s correlation between the ripple power and pre-/post-ripple aRSC amplitude was calculated for each animal separately. The ripple power for each detected ripple was defined as the average of the ripple-band-filtered, squared, and smoothed hippocampal LFP trace from -50 ms to +50 ms relative to the ripple's largest trough timestamp (ripple center). The pre- and post-ripple aRSC amplitudes for each ripple were calculated as the average of the aRSC voltage trace over the intervals [-200 ms, 0] and [0, 200 ms], respectively. The results are as follows.

      Top: scatter plots of the ripple power and pre-ripple aRSC voltage amplitude for individual animals. The black line in each graph is the linear regression line. Each blue circle represents one ripple. The Pearson’s correlation value (ρ) and the corresponding p-value are shown on top of each graph. Bottom: the same as the top graphs but for post-ripple aRSC amplitude.

      According to this analysis, 4 out of 6 animals showed a weak positive correlation (ρ = 0.0806 ± 0.0115; mean ± std), 1 animal showed a negative correlation (ρ = -0.20183), and 1 animal did not show a statistically significant correlation (p-value > 0.05) between ripple power and pre-ripple aRSC voltage amplitude. Moreover, 2 out of 6 animals showed a negative correlation (ρ = -0.1 and -0.14), and 4 animals did not show a statistically significant correlation (p-value > 0.05) between ripple power and post-ripple aRSC voltage amplitude.
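      The per-ripple quantities entering this correlation can be sketched as follows. This is a minimal illustration, assuming the ripple-band power envelope (filtered, squared, smoothed) and the aRSC trace are already available as lists of samples; `window_mean`, `pearson_r`, the sampling rate, and the toy values are hypothetical. P-values are omitted here; in practice a routine such as `scipy.stats.pearsonr` returns both the coefficient and its p-value.

```python
import math


def window_mean(trace, center_idx, start_ms, stop_ms, fs_hz):
    """Mean of `trace` over [start_ms, stop_ms] relative to sample `center_idx`."""
    lo = center_idx + round(start_ms * fs_hz / 1000)
    hi = center_idx + round(stop_ms * fs_hz / 1000)
    if lo < 0 or hi >= len(trace):
        raise ValueError("window extends beyond the trace")
    segment = trace[lo:hi + 1]
    return sum(segment) / len(segment)


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy example: a ramp sampled at an assumed 1 kHz, one ripple center at sample 500.
fs_hz = 1000
envelope = [float(i) for i in range(1000)]
ripple_power = window_mean(envelope, 500, -50, 50, fs_hz)   # power window
pre_amplitude = window_mean(envelope, 500, -200, 0, fs_hz)  # pre-ripple window
post_amplitude = window_mean(envelope, 500, 0, 200, fs_hz)  # post-ripple window
print(ripple_power, pre_amplitude, post_amplitude)

# Correlation across ripples (made-up values, one entry per ripple).
print(pearson_r([1.0, 2.0, 3.0, 4.0], [0.2, 0.1, 0.4, 0.3]))
```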

      To check that the correlation results were not influenced by the extreme values of the ripple power and aRSC voltage, we repeated the same correlation analysis after removing the ripples associated with the top and bottom 5% of the ripple power and aRSC voltage values. According to this analysis, 1 out of 6 animals showed a negative correlation (ρ = -0.13), and 5 animals did not show a statistically significant correlation (p-value > 0.05) between ripple power and pre-ripple aRSC voltage amplitude. Moreover, 2 out of 6 animals showed a negative correlation (the same animals that showed a negative correlation before removing the extreme values; ρ = -0.12 and -0.14), 1 animal showed a positive correlation (ρ = 0.1), and 3 animals did not show a statistically significant correlation (p-value > 0.05) between ripple power and post-ripple aRSC voltage amplitude.
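      One way to implement this trimming step is sketched below. It assumes a ripple is dropped when either its power or its aRSC amplitude falls outside the 5th to 95th percentile range of the respective variable; that reading, the function names, and the toy data are assumptions for illustration.

```python
import statistics


def central_mask(values, lower_pct=5, upper_pct=95):
    """True where a value lies within the [lower_pct, upper_pct] percentile range."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [lo <= v <= hi for v in values]


def trim_pairs(power, voltage):
    """Keep only ripples whose power AND voltage are both within the central range."""
    keep_power = central_mask(power)
    keep_voltage = central_mask(voltage)
    kept = [
        (p, v)
        for p, v, ok_p, ok_v in zip(power, voltage, keep_power, keep_voltage)
        if ok_p and ok_v
    ]
    return [p for p, _ in kept], [v for _, v in kept]


# Toy data: 101 ripples with perfectly anti-correlated power and voltage.
power = list(range(101))
voltage = list(range(100, -1, -1))
trimmed_power, trimmed_voltage = trim_pairs(power, voltage)
print(len(power), "->", len(trimmed_power))
```

      The trimmed lists can then be passed to the same correlation routine used for the untrimmed data.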

      Based on these results, we cannot conclude that there is a meaningful correlation between the ripple power and the amplitude of aRSC voltage activity before and after the ripples. It is worth noting that Nitzan et al. (see Fig S6 in (Nitzan et al., 2022)) did not report a statistically significant correlation between ripple power octile number (obtained by discretizing a continuous-valued random variable into 8 subgroups) and the pre-ripple firing rate of the mouse visual cortex. However, they reported a statistically significant negative correlation (ρ = -0.13) between the ripple power octile number and the post-ripple firing rate of the mouse visual cortex. It appears that their reported negative correlation was influenced by the disproportionately larger firing rates associated with the first ripple power octile compared to the other octiles. Therefore, repeating their analysis after removing the first octile would probably lead to a weak correlation value close to 0.

      Next, we investigated the relationship between ripple duration and aRSC voltage activity. To get an idea of which features of the aRSC voltage activity might be correlated with the ripple duration, the ripples were divided into 8 subgroups using the 8-quantiles of their duration distribution, and the corresponding aRSC voltage traces were averaged for each subgroup. The results of this analysis are summarized in the figure below.

      Left: peri-ripple aRSC voltage trace was triggered on ripples in the odd-numbered ripple duration subgroups for each animal and then averaged across 6 animals. The standard errors of the mean are not shown for the sake of simplicity. Right: the same as the left panel but for only the lowest and highest duration subgroups. The shading represents the standard error of the mean.

      These results do not reveal a qualitative difference in the pattern of aRSC peri-ripple voltage modulation across ripple duration subgroups. Nevertheless, the same correlation analysis performed for the ripple power was also conducted for the ripple duration. Only 1 animal out of 6 showed a statistically significant correlation (ρ = 0.08) between pre-ripple aRSC voltage amplitude and ripple duration.

      Moreover, only 1 animal out of 6 showed a statistically significant correlation (ρ = -0.08) between post-ripple aRSC voltage amplitude and ripple duration. In conclusion, there does not seem to be a meaningful linear relationship between peri-ripple aRSC voltage amplitude and ripple duration.

      Next, we investigated whether the peri-ripple aRSC voltage modulation differs depending on whether a single or a bundled ripple occurs in the dorsal hippocampus. The bundled ripples were detected following the method described in our previous work (Karimi Abadchi et al., 2020). We found that 9.4 ± 3.5 (mean ± std across 6 animals) percent of the ripples occurred in bundles. Then, the aRSC voltage trace was triggered by the centers of the single as well as centers of the first/second ripples in the bundled ripples, averaged for each animal, and averaged across 6 animals. The results of this analysis are represented in the following figure.

      Left: animal-wise average of the mean peri-ripple aRSC voltage trace triggered by the centers of the single ripples and the centers of the first ripple in the bundled ripples. Right: same as the left panel but triggered by the centers of the second ripple in the bundled ripples.

      These results suggest that the amplitude of aRSC voltage activity is larger before bundled than before single ripples, and that the timing of aRSC voltage activity is shifted to later times for bundled versus single ripples. The larger pre-ripple depolarization might signal the occurrence of a bundled ripple (similar to the larger pre-bundled- than pre-single-ripple deactivation observed during sleep (Karimi Abadchi et al., 2020)).

      Point 3) The differences between the voltage and glutamate signals are puzzling, especially in light of the fact that in the sleep state they went hand in hand (Karimi Abadchi et al., 2020, Fig. 2). It is also somewhat puzzling that the aRSC is the first area to show voltage inactivation but the last area to display an increase in glutamate signal, despite its anatomical proximity to hippocampal output (two synapses away). The SVD analysis hints that the glutamate signal is potentially multiplexed (although this analysis also requires more attention, see below), but does not provide a physiologically meaningful explanation. The authors speculate that feed-forward inhibition via the gRSC could be involved, but I note that the aRSC is among the two major targets of the gRSC pyramidal cells (the other being homotypical projections) (Van Groen and Wyss, 2003), i.e., glutamatergic signals are also at play. To meaningfully interpret the results in this paper, it would be instrumental to solve this discrepancy, e.g., by adding experiments monitoring the activity of inhibitory cells.

      Observing that glutamate and voltage signals do not go hand-in-hand in the awake state, as they do in sleep, was surprising for us as well, and it was the main reason that the SVD analysis was performed. It was especially surprising because a portion of aRSC excitatory neurons showed elevated calcium activity despite the reduction of voltage and the delayed elevation of glutamate signals in aRSC at the population level. At the time of initial submission, pre-ripple reduction and post-ripple elevation of calcium activity in a portion of three subclasses of the superficial aRSC inhibitory neurons had been reported (Chambers et al., 2022, 2021), and this was the basis of our speculation on the potential involvement of feed-forward inhibition in the post-ripple voltage reduction. We speculated that the source of this potential feed-forward inhibition could be gRSC excitatory neurons, as reviewer 1 pointed out, or other neocortical or subcortical regions projecting to aRSC. It is also possible that feedback inhibition is involved, where the principal aRSC neurons that are excited by gRSC (as reviewer 1 pointed out) or by any other region, including aRSC itself, excite aRSC inhibitory neurons.

      Point 4) I am puzzled by the ensemble-wise correlation analysis of the voltage imaging data: the authors point to a period of enhanced positive correlation between cortex and hippocampus 0-100 ms after the ripple center but here the correlation is across ripple events, not in time. This analysis hints that there is a positive relationship between CA1 MUA (an indicator for ripple power) and the respective cortical voltage (again an incentive to separate ripples by power), i.e. the stronger the ripple the less negative the cortical voltage is, but this conclusion is contradictory to the statements made by the authors about inhibition.

      A closer look at Figure 2B iv reveals that the elevation of the cross-correlation function between peri-ripple aRSC voltage and hippocampal MUA starts with a short delay (~20 ms) and peaks around 75 ms after the ripple centers. This means that the maximum correlation between the two signals occurs at the point (75 ms, 75 ms) on the MUA time-voltage time plane, whose origin (i.e. the point (0, 0)) corresponds to the ripple centers in the hippocampal MUA and the corresponding imaging frame in the voltage signal. Reviewer 1’s interpretation would be correct if the maximum correlation occurred at the point (0, 0) rather than at (75 ms, 75 ms), because it is the MUA value at the ripple centers (t = 0), not at t = 75 ms, that indicates the ripple power. Figure 2B iii shows that the amplitude of hippocampal MUA is more than 2 dB lower at t = 75 ms than at t = 0, which reflects the fact that ripples are often short-duration events. If, instead, the maximum correlation had occurred at the point (0, 100 ms), where the ripples had maximum power and the aRSC voltage was at its trough (Figure 2B iii), it could have been concluded that “the stronger the ripple the less negative the cortical voltage”.

      Point 5) Following my previous point, it is difficult to interpret the ensemble-wise correlation analysis in the absence of rigorous significance testing. The increased correlation between the HPC and RSC following ripples is equal in magnitude to the correlation between pre-ripple HPC MUA and post-ripple cortical activity. How should those results be interpreted? The authors could, for example, use cluster-based analysis (Pernet et al., 2015) with temporal shuffling to obtain significant regions in those plots. In addition, the authors should mark the diagonal of those plots, or even better compute the asymmetry in correlation (see Steinmetz et al., 2019 Extended Fig. 8 as an example), to make it easier for the reader to discern lead/lag relationships.

      The purpose of calculating the ensemble-wise correlation coefficient was to provide further information about the relationship between the two random processes, peri-ripple HPC MUA and peri-ripple neocortical activity. In general, the correlation between two random processes cannot be inferred from the temporal relationship between their mean functions. In other words, there are infinitely many options for the shape of the correlation function between two random processes with given mean functions. Moreover, the point was to compare the correlation of peri-ripple neocortical activity and HPC MUA across neocortical regions. The fact that the mean peri-ripple activity differs between, for example, RSC and FLS1 does not necessarily mean that their correlation functions with peri-ripple HPC MUA are also different.
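      To make the notion of an ensemble-wise correlation function concrete, the sketch below computes, for every pair of peri-ripple time points (t1, t2), the Pearson correlation across ripple events between the MUA at t1 and the neocortical signal at t2. It is a minimal illustration on toy data; the function names and the data layout (one list of samples per ripple) are assumptions, and columns with zero variance across ripples are not handled.

```python
import math


def standardize(column):
    """Z-score one time point across ripples (population standard deviation)."""
    n = len(column)
    mu = sum(column) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in column) / n)
    return [(v - mu) / sd for v in column]


def ensemble_correlation(mua, ctx):
    """Correlation across ripples for every pair of time points.

    mua, ctx: lists of per-ripple traces, each of shape (n_ripples, n_times).
    Returns C where C[t1][t2] = corr over ripples of mua[:, t1] and ctx[:, t2].
    """
    n_ripples = len(mua)
    mua_z = [standardize(col) for col in zip(*mua)]
    ctx_z = [standardize(col) for col in zip(*ctx)]
    return [
        [sum(a * b for a, b in zip(m, c)) / n_ripples for c in ctx_z]
        for m in mua_z
    ]


# Toy data: 3 ripples, 2 time points per signal.
mua = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
ctx = [[2.0, 0.0], [4.0, 1.0], [6.0, 0.0]]
for row in ensemble_correlation(mua, ctx):
    print([round(c, 3) for c in row])
```

      Because the correlation is taken across events at fixed time points, the mean traces are removed by construction, which is why the result carries information that the mean functions alone do not.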

      As requested, we performed cluster-based significance testing via temporal shuffling for each individual VSFP (n = 6), iGluSnFR Ras (n = 4), and iGluSnFR EMX (n = 4) animal. The following figures summarize the number of animals showing significant regions in their correlation functions between peri-ripple HPC MUA and different neocortical regions. The diagonal of the correlation functions is marked; however, the temporal lead/lag should not be inferred from these results, mainly because the temporal resolution of the two signals, one electrophysiological and one optical, is not the same.

      Point 6) For the single cell 2-photon responses presented in Fig. 3, how should the reader interpret a modulation that is at most 1/20 of a standard deviation? Was there any attempt to test for the significance of modulation (e.g., by comparing to shuffle)? If yes, what is the proportion of non-modulated units? In addition, it is not clear from the averages whether those cells represent bona fide distinct groups or whether, for instance, some cells can be upmodulated by some ripples but downmodulated by others. Again, separation of ripples based on objective criteria would be useful to answer this question.

      As explained in response to point 2, the seemingly small modulation size (e.g. 0.05 SD in Fig. 3b) arises because the individual peri-ripple calcium traces were z-scored against a peri-non-ripple distribution and then averaged. Alternatively, the peri-ripple traces could have been averaged first, and the averaged trace could then have been z-scored against a sampling distribution constructed from the abovementioned peri-non-ripple distribution, where the sample size would be the number of ripples detected for a specific animal. In this latter case, the divisor in the z-scoring would be the standard deviation of the sampling distribution, whereas in the former case it is the standard deviation of the original peri-non-ripple distribution. Since the standard deviation of the sampling distribution is smaller than that of the original distribution by a factor of √(sample size), the final z-scored values in the latter case would be higher than those in the former by a factor of √(sample size).

      As suggested by the reviewer, and to make our results more comparable with those of electrophysiological studies, we deconvolved the calcium traces and tested the significance of the modulation of each neuron by comparing its mean peri-ripple deconvolved trace with a neuron-specific shuffled distribution (see the methods section for details). We found that 8.46 ± 3% (mean ± std across 11 mice) of neurons were significantly modulated over the interval [0, 200 ms], of which 81.08 ± 8.91% (mean ± std across 11 mice) were up-modulated. If the criterion for being distinct is being significantly up- or down-modulated, these two groups could be considered distinct. The following figures show the mean peri-ripple calcium and deconvolved traces, averaged across up- or down-modulated neurons for each mouse and then averaged across 11 mice.
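      A shuffle test of this general kind could look like the sketch below. The exact shuffling procedure is described in the methods section and is not reproduced here; this illustration assumes, purely for the sake of example, that the null distribution is built by drawing random event times along the same neuron's trace. All names, the number of shuffles, and the toy data are hypothetical.

```python
import random


def mean_event_response(trace, event_idx, n_post):
    """Mean of `trace` over `n_post` samples after each event, averaged over events."""
    per_event = [sum(trace[i:i + n_post]) / n_post for i in event_idx]
    return sum(per_event) / len(per_event)


def shuffle_test(trace, event_idx, n_post, n_shuffles=1000, seed=0):
    """Compare the observed post-event response with responses at random times.

    Returns (observed, two-sided empirical p-value, "up" or "down").
    """
    rng = random.Random(seed)
    observed = mean_event_response(trace, event_idx, n_post)
    max_start = len(trace) - n_post
    null = []
    for _ in range(n_shuffles):
        fake_events = [rng.randrange(max_start + 1) for _ in event_idx]
        null.append(mean_event_response(trace, fake_events, n_post))
    null_mean = sum(null) / len(null)
    n_extreme = sum(abs(v - null_mean) >= abs(observed - null_mean) for v in null)
    p_value = (n_extreme + 1) / (n_shuffles + 1)  # +1 avoids reporting p = 0
    direction = "up" if observed > null_mean else "down"
    return observed, p_value, direction


# Toy neuron: silent except for 5 samples of activity after each "ripple".
events = [100, 300, 500, 700, 900]
trace = [0.0] * 1000
for e in events:
    for k in range(5):
        trace[e + k] = 1.0

observed, p_value, direction = shuffle_test(trace, events, n_post=5)
print(observed, p_value, direction)
```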

      Point 7) Fig. 3: The decomposition-based analysis of glutamate imaging using SVD needs to be improved. First, it is not clear how much of the variance is captured by each component, and it seems like no attempt has been made to determine the number of significant components or to use a cross-validated approach. Second, the authors imply that reconstructing the glutamate imaging data using the 2nd-100th components 'matches' the voltage signal but this statement holds true only in the case of the aRSC and not for other regions, without providing an explanation, raising questions as to whether this similarity is genuine or merely incidental.

      The first 100 components explained about 99.9% of the variance in the concatenated stack of peri-ripple neocortical glutamate activity for each animal, which is practically equivalent to the entire variance in the data. Our goal was not to obtain a low-rank approximation of the data, for which the number of significant components would have to be determined. Instead, we decomposed the data into the activity along the first principal component, for which there was no noticeable topography among neocortical regions, and the activity along the rest of the components, for which there was a noticeable topography among neocortical regions. The first component explained 83.11 ± 6.75% (mean ± std across 4 iGluSnFR Ras mice) and 83.3 ± 5.07% (mean ± std across 4 iGluSnFR EMX mice) of the variance in the concatenated stack of peri-ripple neocortical glutamate activity.
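      The split into "first component" and "remaining components" can be illustrated with a small NumPy sketch. It assumes the imaging stack has been reshaped to a samples-by-pixels matrix and that no mean-centering is applied before the SVD; both points, as well as the function name and the toy matrix, are assumptions rather than details taken from the manuscript.

```python
import numpy as np


def split_first_component(data):
    """Split a (n_samples, n_pixels) matrix into its rank-1 part and the remainder.

    Returns (variance fraction per component, first-component part, remainder).
    """
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    var_frac = s**2 / np.sum(s**2)
    first = s[0] * np.outer(u[:, 0], vt[0])
    rest = data - first  # activity along components 2 and above
    return var_frac, first, rest


# Toy matrix with singular values 3 and 1.
data = np.array([[3.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 0.0]])
var_frac, first, rest = split_first_component(data)
print(var_frac)
print(rest)
```

      Here the first component carries 90% of the variance, and subtracting it leaves only the weaker second pattern, which is the kind of residual the response refers to when the first principal component is removed.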

      As we discussed in the discussion section of the manuscript, SVD is agnostic about brain mechanisms and only cares about capturing maximum variance. Specifically, it is not designed to capture the maximum similarity between glutamate and voltage activity in the brain. Therefore, the only thing we can say with certainty is the following: when the activity along the axis with maximum co-variability (1st principal component) across the neocortical regions’ glutamate activity is removed, only aRSC, and no other region, shows a post-ripple down-modulation, whose timing matches that of the aRSC post-ripple voltage down-modulation. Moreover, the timing of the activity along the 1st principal component matches better with that of the calcium activity among the up-modulated portion of aRSC neurons. Even though the genuineness of these results is not guaranteed, the similarity between the timing of the SVD output in aRSC glutamatergic activity and that in two independently collected signals in aRSC, i.e. voltage and calcium, could support the idea that peri-ripple aRSC glutamatergic activity is likely a mixture of up- and down-modulated components.

      Point 8) The estimation of deep pyramidal cells' glutamate activity by subtracting the Ras group (Fig. 4d) is not very convincing. First, the efficiency of transgene expression can vary substantially across different mouse lines. Second, it is not clear to what extent the wide field signal reflects deep cells' somatic vs. dendritic activity due to non-linear scattering (Ma et al., 2016), and it is questionable whether a simple linear subtraction is appropriate. The quality of the manuscript would improve substantially if the authors probe this question directly, either by using deep layer specific line/ 2-P imaging of deep cells or employing available public datasets.

      Simulation studies have suggested that the signal captured by wide-field imaging of voltage-sensitive dye can be modeled as a weighted sum of voltage activity across neocortical layers (Chemla and Chavane, 2010; Newton et al., 2021). Hence, modeling the glutamate signal as a weighted sum of the glutamate activity across neocortical layers is a good starting point. Future studies will be needed to improve on this starting point by imaging glutamate activity in a cohort of mice with iGluSnFR expression in only deep-layer neurons. Moreover, Ma et al. (2016) stated that “This means that signal detected at the cortical surface (in the form of a two-dimensional image) represents a superficially weighted sum of signals from shallow and deeper layers of the cortex”.

      Reviewer #2 (Public Review):

      Point 1) The authors throughout the manuscript compare the correlation between hippocampal MUA and the imaged cortical ensemble activity (Example: Lines 120-122). There is a potential time lag in signal detection with regard to the two detection methods. While the time lag using electrophysiological recording is at the scale of milliseconds, the glutamate-sensitive imaging might take several 100s of ms to be detected. It is not clear in the manuscript how the authors considered this problem during the analysis.

      The ensemble-wise correlation analysis characterizes the relationship between two random processes, peri-ripple HPC MUA and peri-ripple neocortical activity (please see the response to reviewer 1’s major point 5). It is a valid point that the temporal resolution of the two signals is not the same, which could introduce an error in the exact timing of the relationship between the two processes; however, we did not draw any conclusion based on the exact timing of the elevated correlation between the two processes. Moreover, we smoothed (equivalent to low-pass filtering) and down-sampled the MUA signal (please see the methods section) to bring the temporal scales of the two processes closer to each other. We also want to clarify that the temporal resolution of voltage and glutamate imaging is in the range of 10s of ms (Xie et al., 2016).
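      As an illustration of the smoothing and down-sampling step, the sketch below applies a boxcar moving average followed by decimation. The actual kernel, window length, and rates are given in the methods section; the boxcar kernel, the function names, and the numbers used here are assumptions chosen only to show the idea.

```python
def moving_average(signal, width):
    """Boxcar smoothing with a centered window of `width` samples.

    Near the edges the window is truncated to the available samples.
    """
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def downsample(signal, factor):
    """Keep every `factor`-th sample (apply only after low-pass smoothing)."""
    return signal[::factor]


def smooth_and_downsample(signal, width, factor):
    """Smooth first, then decimate, to bring a fast signal to a slower frame rate."""
    return downsample(moving_average(signal, width), factor)


# Toy MUA trace with a single brief burst.
mua = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0]
print(moving_average(mua, 3))
print(smooth_and_downsample(mua, 3, 2))
```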

      Point 2) In the results section "The peri-ripple glutamatergic activity is layer dependent", are the Ras and EMX expressed in two different experimental animal groups? If yes, and there was a time lag between the two groups, is it valid to estimate the deeper layer activity using a scaled version of the Ras from the EMX signal?

      This comment is addressed in response to reviewer 1’s major point 8.

      Point 3) The authors did not discuss the results adequately in the discussion section. Since there is no behavioral paradigm and no behavioral read-out to induce or correlate it with possible planning and future decision-making process, the significance of the paper will be enhanced by discussing the possible underlying circuitry mechanism that might cause the reported observations. With no planning periods in the task (instead just sitting on a platform), it is actually quite unclear what the purpose of wake ripples should be. For example, the authors discuss the superficial and deep layer responses and their relation to the memory index theory. However, the RSC possesses different groups of excitable neurons in different layers. Specifically, three excitable neurons are found within the different layers of the RSC; the intrinsically bursting neurons (IB), regular spiking (RS), and low-rheobase (LR) neurons. These neurons are distributed heterogeneously within the RSC cortical layer. Although the RS are abundant in the deeper layers of the RSC, they occupy 40% of the total amount of excitable neurons found in layers II/III. On the other hand, the LR is the dominant excitable neuron in the superficial layers. It will add to the significance of the work if the authors discussed the results in the context of the cellular structure of the RSC and how would that impact the observed inhibition in the peri-ripple time window. It would be helpful for the readers and the reviewers to add a schematic diagram to the discussion section.

      The goal of our study was to characterize the patterns of neocortical activity around hippocampal ripples in the awake state, not to shed light on the function (purpose) of awake ripples. However, we speculated about what our results could mean in the discussion section. To address the reviewer’s comment on the differences across RSC layers, the following paragraph was added to the discussion section (lines 342-353).

      “Our results suggest that dendrites of deep pyramidal neurons, arborized in the superficial layers of the neocortex, receive glutamatergic modulation earlier than those of the superficial ones. However, the results do not provide a mechanistic explanation of the phenomenon. It is possible that the observed layer-dependency of the glutamatergic modulation would partially result from the heterogeneity of the excitatory as well as inhibitory neurons across aRSC layers. But, the question is how this heterogeneity may lead to the above-mentioned layer-dependency to which our data does not provide an answer. It could be speculated that the difference in the dendritic morphology and firing type of different types of RSC excitatory neurons (Yousuf et al., 2020) or the difference in connectivity of different RSC layers with other brain regions would play a role (Sugar et al., 2011; van Groen and Wyss, 1992; Whitesell et al., 2021). This is a complicated problem and could only be resolved by conducting experiments specifically designed to address this problem.”

      Point 4. A general issue (in addition to the missing behaviour), is the mix of the methods. On one side this makes the article very interesting since it highlights that with different methods you actually observe different things. But on the other side, it makes it very difficult to follow the results. It would be a major improvement of the article if the authors could include (as mentioned above) a schematic of the results and their theory, especially highlighting how the different methods would capture different parts of the mechanism. Finally, the authors should not use calcium signals as a direct measure of neuronal firing. Calcium influx is only seen in bursts of firing, not with individual spikes. It is a plasticity signal and therefore should be treated and discussed as such. Just recently it was shown by Adamantidis lab that the calcium signal changes between wake and sleep and this change does not parallel changes in neuronal firing/spikes.

      We agree with the reviewer that the calcium signal is biased toward bursts of spikes (Huang et al., 2021). To address this concern, the term “spiking activity” was replaced with “calcium activity” throughout the manuscript. Moreover, the calcium signal was deconvolved to get a better estimate of the spiking activity (please refer to our response to reviewer 1’s point 6).

      Point 5. In the discussion section, the authors focus their discussion on the connectivity between the CA1 area and the RSC. Although it is an important point, since the authors are examining the peri-ripple cortical dynamics, it is critical to discuss other possible connectivity effects. Furthermore, the hippocampal input preferentially targets the granular RSC, how would that impact the results and the interpretation of the authors? Additionally, a previous study reported the suppression of the thalamic activity during hippocampal ripples (Yang et al., 2019). Importantly, the thalamic inputs to the RSC target the superficial layers. It will add to the value of the paper if the authors expanded the discussion section and elaborated further on the possible interpretation of the results.

      At the time of our initial submission, pre-ripple reduction and post-ripple elevation of calcium activity in a portion of three subclasses of the superficial aRSC inhibitory neurons had been reported (Chambers et al., 2022, 2021), and this was the basis of our speculation on the potential involvement of feed-forward inhibition in the post-ripple voltage reduction. We speculated that the source of this potential feed-forward inhibition could be gRSC excitatory neurons or other neocortical or subcortical regions projecting to aRSC (please see the discussion section). However, a thalamic source is less likely because multiple studies have observed the suppression of the majority of thalamic neurons during awake ripples (Logothetis et al., 2012; Nitzan et al., 2022; Yang et al., 2019). Moreover, peri-awake-ripple suppression of thalamic axons projecting to the first layer of aRSC has been reported (Chambers et al., 2022). On the other hand, it is also possible that feedback inhibition is involved, where the excitatory aRSC neurons that are excited by gRSC (as reviewer 1 pointed out) or by any other region, including aRSC itself, excite aRSC inhibitory neurons, which in turn inhibit pyramidal cells. To address this comment, the following paragraph was added to the discussion section (lines 323-328).

      “Thalamus is another source of axonal projections to aRSC (Van Groen and Wyss, 1992). However, it is less likely that thalamic projections contribute to the peri-awake-ripple aRSC activity modulation because multiple studies have observed the suppression of the majority of thalamic neurons during awake ripples (Logothetis et al., 2012; Nitzan et al., 2022; Yang et al., 2019). Moreover, peri-awake-ripple suppression of thalamic axons projecting to the first layer of aRSC is reported (Chambers et al., 2022).”

    1. Author Response

      Reviewer #1 (Public Review):

      HCN channels are atypically opened by the downward movement of gating charges during hyperpolarisation and have such weak coupling between the VSD and pore domain, and in the absence of an open state structure, extracting mechanistic information has been difficult. This manuscript is a continuation of a previous study on HCN channel gating that revealed how hyperpolarisation causes a downward movement of the VSD's S4, with breakage into two helices. The authors explore gating motions and the coupling between VSD and the pore domain using atomistic simulations. This includes microsecond MD with and without very strong -1V applied potentials to try to drive VSD-TMD changes to open the channel. In the end, however, the authors used a biased simulation approach (adiabatic bias) to enforce conformational change from resting to an open homology model of HCN based on hERG/rEAG. This microsecond simulation followed three interaction distances that were suggested to change between resting and open states based on free MD. This simulation caused pore opening and allowed a description of changes that may occur during gating, including a competition of S5-S6 and S6-S6 contacts and lipid binding locations, which may suggest lipid-dependent function and explain an unexpected closed structure at 0mV in micelles. While I feel the manuscript is written for the HCN expert audience, the mechanistic information in terms of hyperpolarisation-induced voltage gating makes it of much interest. The manuscript is presented at a high level, though there are a couple of points to address, including reproducibility of simulations and potential for more relation to experimental findings.

      We appreciate the comments. Please find detailed answers below.

      The authors carried out 1μs-MD simulations of the resting, activated, and a Y289D mutant at 0 mV, and then tried to drive the conformational change with a very large -1V voltage (double that studied previously). In 1 us MD, is the membrane stable with such a big voltage, as it would likely not be experimentally? Even with a volt applied, there was incomplete activation of the voltage sensors, despite timescales approaching that of activation.

      The reviewer is correct in cautioning against membrane rupturing effects in simulations with a voltage of this magnitude. We have indeed checked that the membrane and the protein remain intact under these conditions and can confirm that no poration occurs. As membrane poration is stochastic, it could in principle occur over microsecond timescales under 1 V, but the probability remains low, and we did not encounter it here. Note that whereas potentials of this magnitude could not be applied in experiments, they are relatively routinely used in MD simulations to speed up processes that are driven by changes in transmembrane potential.

      Interestingly, other work from our lab (Rems et al. Biophysical Journal 119 (1) 190-205 (2020)) has shown that HCN1 voltage sensor domains are less prone to poration than the voltage sensor domains of other channels, for reasons that remain to be determined.

      Author Response Figure 1. Final snapshots from the simulations of the resting (blue), intermediate (yellow) and activated (red) states. The solvent (water + ions), shown in cyan, reveals no membrane poration at the end of the 1 μs simulations.

      For the pulling/ driving simulations (adiabatic bias MD) to change suspected interaction distances (V390-I302, N300-W281, and D290-K412), it seems to be just 1 simulation, without reproducibility. One has to wonder, if the simulation was redone from a very different initial conformation, would the results be the same (in addition to the distances themselves that were enforced by the ABMD). Moreover, the authors had to model the open state, such that the results depend on a homology model based on other CNBD channels, hERG / rEAG. Although the model stayed open for a microsecond, what other measures of accuracy of the homology model are there, such as preserved distances according to mutants/double mutants?

      The ABMD simulations were repeated, please refer to the response to essential revisions point 1 for details.

      For reasons mentioned by the reviewer as well as a reconsideration of our strategy to model channel opening, we have decided to omit homology models from the revised version of the paper.

      The authors find that activation involves hydrophobic forces that strengthen the intra-subunit S4/S5/S6 interface, as well as lipid headgroups that make contact with hydrophilic residues at this interface, with lipid tails also contributing to hydrophobic contacts. The authors see bending and rotation of the lower S4 and a displacement of S1 away from S4 that exposes the VSD-pore interface to lipids, with increased lipid contacts at S4 and S5 during activation. This indicates lipid tails may play a role in coupling in HCN1 and may explain the closed state micelle structure at 0mV. Two sites of lipid contact are identified, one engaging VSD residues and the other polar or charged residues on S5 and S6. No experiments are presented or proposed to test the predicted lipid sites. e.g. Mutation of key residues, such as the arginine and histidine seen binding lipid headgroups could be tested as proof of their involvement, or perhaps experiments with varied phosphate moieties? In the absence of new experiments, is there existing data that could help validate the findings?

      We thank this reviewer for this comment. As noted in the response to essential revisions point 3, such experiments are challenging, and have not been reported so far in HCN channels. We do agree that aspects of the mechanism we propose remain hypothetical awaiting further work, but are happy to report that the importance of lipid interactions with the crucial salt bridge pair mentioned in the response to essential revisions point 3 has been completely independently validated, thus strengthening our mechanistic hypothesis substantially.

      During free MD simulation, the authors see tilting of S5 caused by activation of the Y289D mutation that brings D290 and K412 positions into proximity. How do we know that the adjacent mutant of Y289 to aspartate has not caused this, or was this interaction also seen in wild-type simulation? Fig.3c might suggest the wt activated simulation may see such an interaction, but it is unclear given the large C_alpha distances, as opposed to H-bonding distances.

      Indeed, Figure 3 appears to indicate that this interaction between D290 and K412 is present in the activated state when the mutation is reverted to the WT sequence. We have recalculated the interaction propensity using all atoms of the residues and present an updated Figure 3c in response.

      The authors predict that a D290-K412 salt bridge may be important for gating and sought to experimentally validate the interaction in the activated-open state using cysteine cross-bridging. As this is the only experimental backing in the paper, it is important to be able to judge its ability to report on the D290-K412 salt bridge. A comparison experiment demonstrating other crosslinks that do not favour the open state would have been helpful in this regard e.g. if crossbridging at similar locations (but not predicted to change interaction during gating) had little effect on I/Imax, then the result may be bolstered. Are there existing mutagenesis experiments that may suggest the importance of these residues (as well as for other key interaction distances identified)?

      Negative results in cross bridging and cysteine accessibility studies in general are difficult to interpret as the lack of a cadmium-specific effect may be due to inaccessibility of the site to cadmium, pairwise distance too far to bridge by cadmium, or bridging of the specified site without a functional effect. However, as reviewer 2 pointed out below, the Yellen group has performed extensive cross bridging experiments in the S4-S5 to C-linker region in spHCN and in most of these positions, the pairs favoring the open state are closer together in our models than pairs favoring the closed state or those without functional effect. We have added Videos 1-6 to highlight this comparison on our open state models and describe it in our updated discussion section.

      Rotation of the V390 side chain from a position facing the pore lumen to a position facing I302 on S5 is coupled to an increase of the pore radius at V390, an increased hydration of the pore intracellular gate, and K+ ion movement. Perhaps 5 or 6 ions cross in that single simulation. As K channel ion permeation can depend critically on starting ion configs (as well as the model/force field), reproducibility of this finding is important but does not appear to have been tested. How can we be sure that periods of permeation or no permeation in individual simulations are reliable?

      As mentioned in our response to essential revisions point 1, we have modified the collective variable set used in ABMD, and repeated the simulations in 4 replicates. Whereas the number of permeation events is low in each simulation (Figure 4 S1), the consistency across repeats indicates that these open pore models indeed represent conductive states. Given how short the simulations are, however, it appears unreasonable to infer conductance values from these observations.

      Reviewer #3 (Public Review):

      In this work, Elbahnsi and colleagues use enhanced sampling MD simulation, to recapitulate step by step, the electromechanical coupling between VSD and the pore in HCN1 channels. Building on the available cryoEM structures of HCN1 with the VSD in resting and active state, the authors characterize by MD a subset of interactions that seemingly stabilize the open channel. This subset is, in turn, used in enhanced-sampling simulations to guide channel opening. The main findings are that S4 movement induces a rearrangement of the hydrophobic interaction at the level of S1- S4- and S5 interfaces. Occupancy of lipids seems therefore state-dependent and highlights their regulatory role in HCN gating.

      The approach is rather innovative, and it apparently allows the reconstruction of the whole mechanism of gating, pushing the predictive power of MD simulation well beyond its actual temporal limitations. At the same time, the initial choice of interactions is crucial for this approach, because the result cannot differ from the inputs. And reading the paper it does not emerge clearly how the correctness of the reconstructed gating pathway can be verified, if not by functional validation.

      We thank the reviewer for this thoughtful review. It has pushed us to reconsider our approach to enhance the sampling of channel activation and gating. Please refer to the detailed response below as well as the response in particular to essential revisions point 1.

      Here are my comments on the main interactions that were used to feed the final MD simulation:

      1) W281-N300: this interaction, previously identified and studied in SpH channels (Ramentol et al, 2020; Wu et al, 2021), has been elegantly confirmed in this paper. Its inclusion in the initial subset seems appropriate. In the other two cases, the choice of interactions requires further explanations and experimental validation.

      2) D290 and K412: the validation of this interaction shown in Figure 3 and suppl Figure 1 is missing a control, i.e., the effect of the addition of Cd++ on the wt channel. Please add.

      We have performed the control suggested. Please also refer to the answer to essential revisions point 2.

      3) Modelling the open state of HCN1 pore (page 18), is done on the structure of the distantly related hERG rather than on the available open pore structure of HCN4. This choice is justified as follows by the authors:

      a) "Available structures in the CNBD channel family for which representative structures have been solved in closed and open states".

      b) "The structural mechanism of pore gating (i.e. the ⍺ to 𝜋 helix occurring at the glycine657 hinge in hERG) observed in rEAG/hERG may be a conserved gating transition in the CNBD family of channels"

      I encourage the authors to consider the following:

      a) The structure of hERG channel is not available in the closed/open configuration, indeed the comparison must be done with the closed configuration of the related channel rEAG. On the contrary, HCN4 is available in the closed/open configurations. Moreover, one of the open pore structures shows S4-S5-S6 in a very similar conformation to the lock open mutant (F186C/S264C) of HCN1 (Saponaro et al, 2021). With an available HCN4 open structure, forcing HCN1 to the open pore structure of hERG channel (which opens in depolarization and is not regulated by cAMP) seems not necessary.

      In response to this point, we reconsidered our approach and chose to instead use a biasing distance that is consistently increased in CNBD channels of resolved structures, that between neighboring and cross-subunits V390. We have detailed our rationale in the response to essential revisions point 1.

      To my knowledge, hERG is the only channel of the CNBD family for which the transition ⍺ to 𝜋 helix reported by the Authors, occurs in S6. It is not reported for other CNBD family members, in particular for the CNG channels mentioned by the Authors (Zheng et al., 2020; Xue et al., 2021, 2022). TAX-4 (Zheng et al) does not show it. Its pore opens by a right-handed twist of S6 at glycine 399, a conserved glycine in all CNG. Human CNGA1 too, opens the pore by a rotational movement of S6 hinged at the equivalent glycine (glycine 385) (Xue et al, 2021). And the same occurs in the non-symmetrical channel CNGA1/B1 (Xue et al, 2022). So, it seems that CNG channels do not show the ⍺ to 𝜋 helix transition in the open pore. Moreover, hERG excluded, all other members of the CNBD family, CNG, EAG, and HCN4 included, do not bend at the hinge glycine 657 of hERG, but at another glycine (gly 648 in hERG numbering) located upstream. Further, their opening is due to a rotation of S6 associated with an outward movement, rather than to the lifting of the lower part of S6, as in hERG.

      After considering this reviewer’s comment, we were surprised to see that HCN1 is apparently prone to secondary structure deformation in S6, even when biasing the aforementioned distances, and thus enforcing no rotation at all in S6. We are intrigued by this observation and eagerly await experimental validation or disproval. In the meantime, we have made clear in the text that this hypothesis remains based exclusively on modeling work.

      4) V390-I302: this interaction is predicted to stabilize the open pore configuration and was included in the subset. The contact between V390 on S6 and I302 on S5 is observed in the homology model discussed above when the S6 is twisted at the glycine hinge, rotating the preceding residue (V390) out of its pore-lining position and is. Again, I can only disagree with this hypothesis because it has been experimentally demonstrated (Cheng et al, J Pharmacol Exp Ther. 2007 Sep;322(3):931-9) that the side chain of Valine390 is inside the cavity of the open pore of HCN1 channels as it controls the affinity for the pore blocker ZD7288.

      In accordance with other comments above, we have eliminated the bias applied to the V390-I302 distance. However, the new ABMD simulations with bias applied to encourage dilation at position 390 still involve rotation of V390 away from the central pore axis, albeit with bending of S6 at the upper glycine mentioned by this reviewer. The degree of rotation is lower than in our previous simulations so that V390 still lines the inner vestibule in the open state, consistent with the observation that this position influences the apparent affinity of open pore blockers.

      In conclusion, modelling the open state pore of HCN1 on hERG rather than on that of HCN4 seems not justified based on accumulated evidence in the published literature. Therefore, the choice of the authors to use it as the open pore model of HCN1 channels needs to be experimentally validated. One possibility is to mutate the glycine hinge, gly391 in HCN1, into an Alanine in order to remove the flexible hinge. If this mutation alters pore gating, it will support the choice of the Authors.

      Once more, we thank the reviewer for the comments, which have led us to reconsider a large part of our modeling work.

    1. Author Response

      Reviewer #2 (Public Review):

      There is emerging evidence that connexin43 hemichannels localized to mitochondria can influence their function. Here the authors demonstrated using an osteocyte cell model that connexin43 is localized to mitochondria and that this is enhanced in response to oxidative stress. Several lines of evidence were presented showing that mitochondrial connexin43 forms functional hemichannels and that connexin43 is required for optimal mitochondrial respiration and ATP generation. These aspects were major strengths of the study.

      The authors also show that connexin43 is recruited to mitochondria in response to oxidant stress, as a cell protective mechanism. This was primarily done using hydrogen peroxide to generate oxidant stress; primary osteocytes from Csf-1+/- mice, which are prone to Nox4 induced oxidant stress, also show enhanced mitochondrial connexin43 when compared with wild type osteocytes.

      Several approaches were used to demonstrate that connexin43 interacts with the ATP synthase subunit, ATP5J2, suggesting a direct role for connexin43 in the control of ATP synthesis by mediating mitochondrial ion homeostasis. Several experiments were done using a series of pHluorin fusion protein constructs as a proton sensor, these experiments hint at a potential role for connexin43 in regulating H+ permeability to support ATP production. However, the effects of inhibiting connexin43 on pH were modest, suggesting that additional roles for mitochondrial connexin43 in ATP generation should be considered.

      Thank you for your positive and thoughtful comments. We agree that additional roles for mitochondrial Cx43 may be possible. As an example, we consider that there may be a change in the stability of ATP synthase that occurs after mtCx43 deficiency. This and other possible roles of mtCx43 ought to be investigated in the future.

      Reviewer #3 (Public Review):

      This manuscript should be of broad interest to readers not only in the field of gap junction (GJ) mediated cell-to-cell communication but also to scientists and clinicians working on the function of mitochondria and metabolism. Their data elucidates a new function of Cx43 in regulating the energy (ATP) generation of mitochondria, e.g., under oxidative stress.

      The canonical function of gap junctions is in direct cell-to-cell communication by forming plasma membrane traversing channels that electrically and chemically connect the cytoplasms of adjacent cells. These channels are assembled from connexin proteins, connexin 43 (Cx43). However, more recently new, non-canonical cellular locations and functions of Cx43 have been discovered, e.g. mitochondrial Cx43 (mtCx43). However, very little is known about where Cx43 transported into mitochondria is derived from, how Cx43 is transported into mitochondria, where it is located in mitochondria, in which form Cx43 is present in mitochondria, (polypeptides, hemi-channels (HCs), complete GJ channels), and what the function of mtCx43 is. The authors addressed the latter question. The authors provide convincing evidence that mtCx43 modulates mitochondrial homeostasis and function in bone osteocytes under oxidative stress. Together, their study suggests that mtCx43 hemi-channels regulate mitochondrial ATP generation by mediating K+, H+, and ATP transfer across the mitochondrial inner membrane by directly interacting with mitochondrial ATP synthase (ATP5J2), leading to an enhanced protection of osteocytes against oxidative insult. These findings provide important information of a role of Cx43 functioning directly in mitochondria and not at the canonical location in the plasma membrane. While most of the functional assays presented in Figures 2-8 appear solid, the mitochondrial localization of Cx43, its translocation into mitochondria under oxidative stress, and its configuration as hemi-channels (Figure 1) is less convincing. I have five general comments that should be addressed:

      1) This study was performed in MLO-Y4 osteocyte cells. Is the H2O2 induced increase of mitochondrial Cx43 MLO-Y4 cell type or osteocyte specific, or is Cx43 playing a more general role in mitochondrial function, e.g. under oxidative stress? Osteoblasts such as MC3T3-E1 and MG63, and many other cell types endogenously express Cx43, and oxidative stress is a general physiological stressor, not only for osteocytes and bone cells. Attending to this question would address the generality of the findings for mitochondrial function.

      We thank the reviewer for bringing up these valid points; seeing the phenotype displayed in secondary cell types, such as osteoblasts, would be of great relevance and interest. To address this, we conducted new experiments on MC3T3-E1 cells (Figure 1-figure supplement 2). After 2 hrs of H2O2 treatment, Cx43 accumulated on the mitochondria, marked by MitoTracker. Statistical analysis also showed a significant increase in the colocalization between Cx43 and MitoTracker (Figure 1-figure supplement 2B). The colocalization coefficient is higher in the Ctrl group in MC3T3-E1 cells when compared with the MLO-Y4 Ctrl group, indicating a different response level in other cell lines. Osteoblasts seemed to be more sensitive to redox interference. Overall, these results support the point that under oxidative stress, mtCx43 may display a similar phenotype across multiple cell lines, although the degree of sensitivity may differ.

      2) The images of MLO-Y4 cells (Figure 1A) and the primary osteocytes isolated from Csf-1+/- and control mice (Figure 8) do not show visible gap junctions. I guess this is due to the fact that slides were stained with the Cx43(E2) antibody. I feel, staining of these cells in addition with the Cx43(CT) antibody would be helpful to get a better understanding on the distribution of Cx43 in gap junctions and undocked/un-oligomerized Cx43 in these cells.

      Thank you for the suggestion. To get a better understanding of the distribution of Cx43, either in GJ or HC form, we performed additional experiments in MLO-Y4 cells using the Cx43(CT) antibody and data are shown below. With Cx43(CT) staining, we observed more signals in the cells and on the plasma membrane. After H2O2 treatment, we observed increased and stronger signals localized on the mitochondria compared with the untreated control group. Stronger signals observed in the plasma membrane indicate the gap junction stained by Cx43(CT) antibody.

      3) The images of cells presented in Figure 1A are quite fuzzy. No mitochondria are visible, and the Cx43 staining is hazy and does not localize to any subcellular structures. Also, it is not clear if the higher resolution image presented in Figure 1C actually represents a mitochondrion. A good DIC image, or co-staining with another mitochondrial marker such as MitoTracker (as shown in Figure 4-S1) would make the localization and translocation of Cx43 into mitochondria upon oxidative stress more convincing. This is especially important as the translocation, although statistically significant, increases only by about 10% or less (Figure 1B). Such a small difference (also represented in the Western analyses presented in Figure 1D) could easily be artefactual, depending on how the correlation coefficient was generated. Of note in this respect is that control cells in Figure 1A appear larger (compare the size of the nuclei) and are spread out more than the H2O2 treated cells. Better, more clear images would make the mitochondrial localization/translocation more convincing.

      The reviewer made great points. To improve the image clarity, we redid the staining/imaging and determined the colocalization of SDHA and MitoTracker Deep Red. The result (shown below) suggested that under normal conditions without H2O2 treatment, SDHA and MitoTracker merged perfectly, while after H2O2 treatment for 2 hrs, mitochondria became fragmented and the SDHA signal exhibited a more dotted pattern compared to the MitoTracker. Overall, we feel that MitoTracker represents the distribution of mitochondria better. SDHA is a subunit of mitochondrial complex II, and the images we presented in Figure 1C were captured from isolated mitochondria under a confocal microscope with SDHA and Cx43(CT) co-staining. Considering the specificity of SDHA (see images below), we believe the Cx43 signal we captured demonstrates the mitochondrial localization/translocation. After using MitoTracker as a mitochondrial marker and higher-magnification images, the correlation coefficient increased from 0.35 to 0.47, a 32% increment with statistical significance. As to the nuclei size, some cells indeed have smaller sizes, which may be affected by varied local cell density. The new images represented in Figure 1A are much more consistent in the nuclei size.
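
      Because the reviewer asks how the correlation coefficient was generated, a minimal sketch of one common choice may help. It assumes the coefficient is a Pearson correlation between pixel intensities of the two channels, which the response does not specify; the function name and the intensity values are hypothetical and are not taken from the study.

```python
import math


def pearson_colocalization(channel_a, channel_b):
    """Pearson correlation between two equal-length sequences of pixel intensities.

    Assumption: both channels come from the same field of view and have been
    flattened in the same pixel order.
    """
    if len(channel_a) != len(channel_b) or len(channel_a) < 2:
        raise ValueError("channels must have the same length (at least 2 pixels)")
    mean_a = sum(channel_a) / len(channel_a)
    mean_b = sum(channel_b) / len(channel_b)
    dev_a = [a - mean_a for a in channel_a]
    dev_b = [b - mean_b for b in channel_b]
    covariance = sum(x * y for x, y in zip(dev_a, dev_b))
    spread = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    if spread == 0.0:
        raise ValueError("a channel with constant intensity has no defined correlation")
    return covariance / spread


# Hypothetical flattened intensities for a Cx43 channel and a mitochondrial marker.
cx43 = [10.0, 12.0, 30.0, 45.0, 11.0, 50.0]
mito = [8.0, 15.0, 28.0, 40.0, 9.0, 55.0]
print(round(pearson_colocalization(cx43, mito), 3))
```

      A coefficient of this kind is sensitive to background and to which pixels are included, so the masking and thresholding choices matter as much as the formula itself.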

      4) How pure are the mitochondria that were probed for Cx43 by Western shown in Figure 1D? The preparation method described is relatively simple, collecting the 10,000xg supernatant (here 9,000xg supernatant) as mitochondrial fraction. Is it possible that the Cx43 signal, at least in part, is derived from other, contaminating membranes, such as PM, Golgi, or ER? Testing the mitochondrial preparation by Western with marker proteins specific for these compartments would strengthen the authors' results.

      The reviewer made a great suggestion. To address this, we did a western blot to test the mitochondrial purity. Indeed, this method using centrifugation is simple, and as expected there was some contamination of ER (marked by PDI) and Golgi (marked by STX6). However, to further confirm the purity of the mitochondrial fraction, fluorescent dyes for mitochondria (MitoTracker Deep Red), ER (ER-Tracker Blue-White), and nuclei (Hoechst) were used. The organelle-specific dyes indicated most parts of the fraction were mitochondria. There was some contamination with ER fragments and minimal nuclear contamination. Combining our western blot and immunofluorescence data, it can be concluded that our Cx43 signal is primarily derived from mitochondria.

      5) The authors rely on previous studies to postulate that Cx43 in mitochondria forms hemichannels in their system, is localized in the inner membrane, and is oriented with the Cx43 C-termini facing the inter-membrane space (as schemed in Figure 8C). The authors use lucifer yellow (LY) dye transfer and carbenoxolone, but both are not hemi-channel specific probes. They are transferred by, and block GJ channels as well. Experiments, using hemi-channel specific probes would be more convincing. This is important, as the information cited is based on only two references (Boengler et al., 2009; Miro-Casas et al., 2009), and it still is highly unclear how a membrane protein that is co-translationally inserted into the ER membrane, then traffics through the Golgi to be inserted into the plasma membrane is actually imported into mitochondria and in which state (monomeric, hexameric). Why the Cx43(CT) specific antibody traverses the outer mitochondrial membrane and reaches the Cx43CT while the Cx43(E2) specific antibody does not is neither described nor clear. Were these mitochondria permeabilized with Triton X-100 as described in M&M?

      We edited the Methods section. We did not use Triton X-100 to permeabilize mitochondria. PMP appeared to preserve mitochondrial inner membrane integrity, allowing us to assess the localization of Cx43(CT) antibody on mitochondria. We showed these new immunofluorescence images in Figure 5-figure supplement 2. PMP, used as a plasma membrane permeabilizer, has a 6x affinity for the MOM compared with the MIM. Meanwhile, no Cx43(E2) Ab signal was detected in mitochondria, suggesting the extracellular loop of Cx43 faces the matrix and cannot be accessed by Cx43(E2) antibody.

      The translocation of Cx43 to mitochondria was reported to involve the chaperone Hsp90-dependent TOM complex pathway (Rodriguez-Sinovas et al., 2006). After the translocation, whether mtCx43 forms gap junctions in mitochondria is unclear. Lucifer yellow is widely used in hemichannel-mediated dye uptake or gap junction-mediated dye transfer. In our case, considering the channel orientation, mtCx43 should form hemichannels, and Cx43(CT) Ab could be used as a specific Cx43 HC blocker, as in the study reported in cardiomyocytes (Lillo et al., 2019).

    1. Author response:

      Reviewer #1 (Public Review):

      This paper proposes a novel framework for explaining patterns of generalization of force field learning to novel limb configurations. The paper considers three potential coordinate systems: cartesian, joint-based, and object-based. The authors propose a model in which the forces predicted under these different coordinate frames are combined according to the expected variability of produced forces. The authors show, across a range of changes in arm configurations, that the generalization of a specific force field is quite well accounted for by the model.

      The paper is well-written and the experimental data are very clear. The patterns of generalization exhibited by participants - the key aspect of the behavior that the model seeks to explain - are clear and consistent across participants. The paper clearly illustrates the importance of considering multiple coordinate frames for generalization, building on previous work by Berniker and colleagues (JNeurophys, 2014). The specific model proposed in this paper is parsimonious, but there remain a number of questions about its conceptual premises and the extent to which its predictions improve upon alternative models.

      A major concern is with the model's premise. It is loosely inspired by cue integration theory but is really proposed in a fairly ad hoc manner, and not really concretely founded on firm underlying principles. It's by no means clear that the logic from cue integration can be extrapolated to the case of combining different possible patterns of generalization. I think there may in fact be a fundamental problem in treating this control problem as a cue-integration problem. In classic cue integration theory, the various cues are assumed to be independent observations of a single underlying variable. In this generalization setting, however, the different generalization patterns are NOT independent; if one is true, then the others must inevitably not be. For this reason, I don't believe that the proposed model can really be thought of as a normative or rational model (hence why I describe it as 'ad hoc'). That's not to say it may not ultimately be correct, but I think the conceptual justification for the model needs to be laid out much more clearly, rather than simply by alluding to cue-integration theory and using terms like 'reliability' throughout.

      We thank the reviewer for bringing up this point. We see and treat this problem of finding the combination weights not as a cue integration problem but as an inverse optimal control problem. In this case, there can be several solutions to the same problem, i.e., what forces are expected in untrained areas, which can co-exist and give the motor system the option to switch or combine them. This is similar to other inverse optimal control problems, e.g. combining feedforward optimal control models to explain simple reaching. However, compared to these problems, which fit the weights between different models, we proposed an explanation for the underlying principle that sets these weights for the dynamics representation problem. We found that basing the combination on each motor plan's reliability can best explain the results. In this case, we refer to ‘reliability’ as execution reliability and not sensory reliability, which is common in cue integration theory. We have added further details explaining this in the manuscript.

      “We hypothesize that this inconsistency in results can be explained using a framework inspired by an inverse optimal control framework. In this framework the motor system can switch between or combine different solutions. That is, the motor system assigns different weights to each solution and calculates a weighted sum of these solutions. Usually, to support such a framework, previous studies found the weights by fitting the weighted sum solution to behavioral data (Berret, Chiovetto et al. 2011). While we treat the problem in the same manner, we propose the Reliable Dynamics Representation (Re-Dyn) mechanism that determines the weights instead of fitting them. According to our framework, the weights are calculated by considering the reliability of each representation during dynamic generalization. That is, the motor system prefers certain representations if the execution of forces based on this representation is more robust to distortion arising from neural noise. In this process, the motor system estimates the difference between the desired generalized forces and generated generalized forces while taking into consideration noise added to the state variables that equivalently define the forces.”
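
      The weighted-sum idea in the passage above can be sketched in a few lines. This is only an illustration under stated assumptions: the rule that weights are inversely proportional to the expected force error of each representation is a hypothetical stand-in for the Re-Dyn computation, which is not specified here, and the function names and numbers are invented for the example.

```python
def reliability_weights(expected_errors):
    """Normalized weights, inversely proportional to each representation's expected force error.

    Assumption: a smaller expected error under noise means a more reliable
    representation; the inverse-error rule is illustrative, not the authors' formula.
    """
    if any(error <= 0 for error in expected_errors.values()):
        raise ValueError("expected errors must be positive")
    inverse = {name: 1.0 / error for name, error in expected_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def combine_forces(predicted_forces, weights):
    """Weighted sum of the forces predicted by each coordinate-frame representation."""
    return sum(weights[name] * predicted_forces[name] for name in predicted_forces)


# Hypothetical expected errors and predicted forces for one movement direction.
errors = {"cartesian": 2.0, "joint": 1.0, "object": 1.0}
forces = {"cartesian": 1.0, "joint": 2.0, "object": 3.0}
weights = reliability_weights(errors)
print(weights)
print(combine_forces(forces, weights))
```

      The point of the sketch is the contrast drawn in the response: here the weights follow from a property of each representation, whereas a fitted model would choose them to match behavioral data.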

      A more rational model might be based on Bayesian decision theory. Under such a model, the motor system would select motor commands that minimize some expected loss, averaging over the various possible underlying 'true' coordinate systems in which to generalize. It's not entirely clear without developing the theory a bit exactly how the proposed noise-based theory might deviate from such a Bayesian model. But the paper should more clearly explain the principles/assumptions of the proposed noise-based model and should emphasize how the model parallels (or deviates from) Bayesian-decision-theory-type models.

      As we understand the reviewer's suggestion, the idea is to estimate the weight of each coordinate system based on minimizing a loss function that considers the cost of each weight multiplied by a posterior probability that represents the uncertainty in this weight value. While this is an interesting idea, we believe that in the current problem, there are no ‘true’ weight values. That is, the motor system can use any combination of weights which will be true due to the ambiguous nature of the environment. Since the force field was presented in one area of the entire workspace, there is no observation that will allow us to update prior beliefs regarding the force nature of the environment. In such a case, the prior beliefs might play a role in the loss function, but in our opinion, there is no clear rationale for choosing unequal priors except guessing or fitting prior probabilities, which will resemble any other previous models that used fitting rather than predictions.

      Another significant weakness is that it's not clear how closely the weighting of the different coordinate frames needs to match the model predictions in order to recover the observed generalization patterns. Given that the weighting for a given movement direction is over-parametrized (i.e. there are 3 variable weights (allowing for decay) predicting a single observed force level), it seems that a broad range of models could generate a reasonable prediction. It would be helpful to compare the predictions using the weighting suggested by the model with the predictions using alternative weightings, e.g. a uniform weighting, or the weighting for a different posture. In fact, Fig. 7 shows that uniform weighting accounts for the data just as well as the noise-based model in which the weighting varies substantially across directions. A more comprehensive analysis comparing the proposed noise-based weightings to alternative weightings would be helpful to more convincingly argue for the specificity of the noise-based predictions being necessary. The analysis in the appendix was not that clearly described, but seemed to compare various potential fitted mixtures of coordinate frames, but did not compare these to the noise-based model predictions.

      We agree with the reviewer that fitted global weights, that is, an optimal weighted average of the three coordinate systems should outperform most of the models that are based on prediction instead of fitting the data. As we showed in Figure 7 of the submitted version of the manuscript, we used the optimal fitted model to show that our noise-based model is indeed not optimal but can predict the behavioral results and not fall too short of a fitted model. When trying to fit a model across all the reported experiments, we indeed found a set of values that gives equal weights for the joints and object coordinate systems (0.27 for both), and a lower value for the Cartesian coordinate system (0.12). Considering these values, we indeed see how the reviewer can suggest a model that is based on equal weights across all coordinate systems. While this model will not perform as well as the fitted model, it can still generate satisfactory results.

      To better understand if a model based on global weights can explain the combination between coordinate systems, we performed an additional experiment. In this experiment, a model that is based on global fitted weights can only predict one out of two possible generalization patterns while models that are based on individual direction-predicted weights can predict a variety of generalization patterns. We show that global weights, although fitted to the data, cannot explain participants' behavior. We report these new results in Appendix 2.

      “To better understand if a model based on global weights can explain the combination between coordinate systems, we perform an additional experiment. We used the idea of experiment 3 in which participants generalize learned dynamics using a tool. That is, the arm posture does not change between the training and test areas. In such a case, the Cartesian and joint coordinate systems do not predict a shift in generalized force pattern while the object coordinate system predicts a shift that depends on the orientation of the tool. In this additional experiment, we set a test workspace in which the orientation of the tool is 90° (Appendix 2- figure 1A). In this case, for the test workspace, the force compensation pattern of the object based coordinate system is in anti-phase with the Cartesian/joint generalization pattern. Any globally fitted weights (including equal weights) can produce either a non-shifted or 90° shifted force compensation pattern (Appendix 2- figure 1B). Participants in this experiment (n=7) showed similar MPE reduction as in all previous experiments when adapting to the trigonometric scaled force field (Appendix 2- figure 1C). When examining the generalized force compensation patterns, we observed a shift of the pattern in the test workspace of 14.6° (Appendix 2- figure 1D). This cannot be explained by the individual coordinate system force compensation patterns or any combination of them (which will always predict either a 0° or 90° shift, Appendix 2- figure 1E). However, calculating the prediction of the Re-Dyn model we found a predicted force compensation pattern with a shift of 6.4° (Appendix 2- figure 1F). The intermediate shift in the force compensation pattern suggests that any global based weights cannot explain the results.”
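
      The statement that any globally weighted combination can only produce a 0° or 90° shift can be checked numerically. The sketch below assumes, purely for illustration, that each coordinate system predicts a sinusoidal force compensation pattern with two cycles per 360° of movement direction (so that a 90° shift is exactly anti-phase) and that movement directions are spaced 22.5° apart. The names `pattern`, `estimated_shift` and `combined_shift` are hypothetical and are not taken from the manuscript.

```python
import numpy as np

# Assumed set of movement directions (22.5 degree spacing); illustrative only.
THETA = np.arange(0.0, 360.0, 22.5)

def pattern(theta_deg, shift_deg=0.0):
    """Hypothetical force compensation pattern: two cycles per 360 degrees."""
    return np.cos(np.deg2rad(2.0 * (theta_deg - shift_deg)))

def estimated_shift(profile, theta_deg=THETA):
    """Shift in degrees, in [0, 180), of the two-cycle component of a profile."""
    coeff = np.sum(profile * np.exp(-2j * np.deg2rad(theta_deg)))
    shift = -np.rad2deg(np.angle(coeff)) / 2.0
    return float(np.round(shift, 6)) % 180.0

def combined_shift(w_cartesian, w_joint, w_object, tool_angle_deg=90.0):
    """Shift of a globally weighted mixture when the arm posture is unchanged.

    Cartesian and joint frames predict no shift; the object frame predicts a
    shift equal to the tool orientation (assumed here, as in the text).
    """
    combined = ((w_cartesian + w_joint) * pattern(THETA)
                + w_object * pattern(THETA, tool_angle_deg))
    return estimated_shift(combined)
```

      Under these assumptions, `combined_shift(0.12, 0.27, 0.27)` gives 0° and `combined_shift(0.2, 0.2, 0.6)` gives 90°: the anti-phase object pattern can only cancel or overturn the non-shifted pattern, so no fixed set of weights yields an intermediate shift such as the 14.6° reported above. When the object weight exactly equals the sum of the other two, the mixture is flat and the shift is undefined.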

      With regard to the suggestion that weighting is changed according to arm posture, two of our results lower the possibility that posture governs the weights:

      (1) In experiment 3, we tested generalization while keeping the same arm posture between the training and test workspaces, and we observed different force compensation profiles across the movement directions. If arm posture in the test workspaces affected the weights, we would expect identical weights for both test workspaces. However, any set of weights that can explain the results observed for workspace 1 will fail to explain the results observed in workspace 2. To better understand this point, we calculated the global weights for each test workspace for this experiment and we observed an increase in the weight for the object coordinate system (0.41 vs. 0.5) and a reduction in the weights for the Cartesian and joint coordinate systems (0.29 vs. 0.24). This suggests that the arm posture cannot explain the generalization pattern in this case.

      (2) In experiments 2 and 3, we used the same arm posture in the training workspace and either changed the arm posture (experiment 2) or did not change the arm posture (experiment 3) in the test workspaces. While the arm posture for the training workspace was the same, the force generalization patterns were different between the two experiments, suggesting that the arm posture during the training phase (adaptation) does not set the generalization weights.

      Overall, this shows that it is not specifically the arm posture in either the test or the training workspaces that sets the weights. Of course, all coordinate models, including our noise model, will consider posture in the determination of the weights.

      Reviewer #2 (Public Review):

      Leib & Franklin assessed how the adaptation of intersegmental dynamics of the arm generalizes to changes in different factors: areas of extrinsic space, limb configurations, and 'object-based' coordinates. Participants reached in many different directions around 360°, adapting to velocity-dependent curl fields that varied depending on the reach angle. This learning was measured via the pattern of forces expressed upon the channel wall of "error clamps" that were randomly sampled from each of these different directions. The authors employed a clever method to predict how this pattern of forces should change if the set of targets was moved around the workspace. Some sets of locations resulted in a large change in joint angles or object-based coordinates, but Cartesian coordinates were always the same. Across three separate experiments, the observed shifts in the generalized force pattern never corresponded to a change that was made relative to any one reference frame. Instead, the authors found that the observed pattern of forces could be explained by a weighted combination of the change in Cartesian, joint, and object-based coordinates across test and training contexts.

      In general, I believe the authors make a good argument for this specific mixed weighting of different contexts. I have a few questions that I hope are easily addressed.

      Movements show different biases relative to the reach direction. Although very similar across people, this function of biases shifts when the arm is moved around the workspace (Ghilardi, Gordon, and Ghez, 1995). These biases are thought to arise from several factors that would change across the different test and training workspaces employed here (Vindras & Viviani, 2005). My concern is that the baseline biases in these different contexts are different and that the observed change in the force pattern across contexts is not a function of generalization, but rather a change in underlying biases. Baseline force channel measurements were taken in the different workspace locations and conditions, so these could be used to show whether such biases are meaningfully affecting the results.

      We agree with the reviewer and we followed their suggested analysis. In the following figure (Author response image 1) we plotted the baseline force compensation profiles in each workspace for each of the four experiments. As can be seen in this figure, the baseline force compensation is very close to zero and differs significantly from the force compensation profiles after adaptation to the scaled force field.

      Author response image 1.

      Baseline force compensation levels for experiments 1-4. For each experiment, we plotted the force compensation for the training, test 1, and test 2 workspaces.

      Experiment 3, Test 1 has data that seems the worst fit with the overall story. I thought this might be an issue, but this is also the test set for a potentially awkwardly long arm. My understanding of the object-based coordinate system is that it's primarily a function of the wrist angle, or perceived angle, so I am a little confused why the length of this stick is also different across the conditions instead of just a different angle. Could the length be why this data looks a little odd?

      Usually, force generalization is tested by physically moving the hand in unexplored areas. In experiment 3 we tested generalization using a tool which, as far as we know, was not tested in the past in a similar way to the present experiment. Indeed, the results look odd compared to the results of the other experiments, which were based on the ‘classic’ generalization idea. While we have some ideas regarding possible reasons for the observed behavior, it is out of the scope of the current work and still needs further examination.

      Based on the reviewer’s comment, we improved the explanation in the introduction regarding the idea behind the object based coordinate system:

      “we could represent the forces as belonging to the hand or a hand-held object using the orientation vector connecting the shoulder and the object or hand in space (Berniker, Franklin et al. 2014).”

      The reviewer is right in their observation that the predictions of the object-based reference frame will look the same if we change the length of the tool. The object-based generalized forces, specifically the shift in the force pattern, depend only on the object's orientation but not its length (equation 4).

      The manuscript is written and organized in a way that focuses heavily on the noise element of the model. Other than it being reasonable to add noise to a model, it's not clear to me that the noise is adding anything specific. It seems like the model makes predictions based on how many specific components have been rotated in the different test conditions. I fear I'm just being dense, but it would be helpful to clarify whether the noise itself (and inverse variance estimation) are critical to why the model weights each reference frame how it does or whether this is just a method for scaling the weight by how much the joints or whatever have changed. It seems clear that this noise model is better than weighting by energy and smoothness.

      We have now included further details of the noise model and added to Figure 1 to highlight how noise can affect the predicted weights. In short, we agree with the reviewer that there are multiple ways to add noise to the generalized force patterns. We chose a simple option in which we simulate possible distortions to the state variables that set the direction of movement. Once we have calculated the variance of each force profile due to this distortion, one possible way to combine the profiles is to use an inverse variance estimator. Note that it has been shown that an inverse variance estimator is an ideal way to combine signals (e.g., Shahar, D.J. (2017) https://doi.org/10.4236/ojs.2017.72017). However, as we suggest, we do not claim or try to provide evidence for this specific way of calculating the weights. Instead, we suggest that giving greater weight to the less variable force representation can predict both the current experimental results as well as past results.
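
      As a minimal sketch of the general idea of inverse variance weighting (not the manuscript's implementation), the example below assigns each coordinate system a weight proportional to the reciprocal of its simulated force variance. It assumes that the weights are normalised to sum to one and ignores the decay factor; the function names and the numerical values are hypothetical.

```python
import numpy as np

def inverse_variance_weights(variances):
    """Weights proportional to 1 / variance, normalised to sum to one (assumed)."""
    variances = np.asarray(variances, dtype=float)
    inverse = 1.0 / variances
    return inverse / inverse.sum()

def combine_forces(forces, variances):
    """Weighted average of per-coordinate-system force predictions."""
    weights = inverse_variance_weights(variances)
    return float(np.dot(weights, np.asarray(forces, dtype=float)))

# Hypothetical variances for the Cartesian, joint and object representations.
example_weights = inverse_variance_weights([1.0, 2.0, 4.0])
```

      With these illustrative variances the weights are 4/7, 2/7 and 1/7, so the least variable representation dominates the combined prediction, which is the property the response relies on.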

      Are there any force profiles for individual directions that are predicted to change shape substantially across some of these assorted changes in training and test locations (rather than merely being scaled)? If so, this might provide another test of the hypotheses.

      In experiments 1-3, in which there is a large shift of the force compensation curve, we found directions in which the generalized force was flipped in direction. That is, clockwise force profiles in the training workspace could change into counter-clockwise profiles in the test workspace. For example, in experiment 2, for movement at 157.5° we can see that the force profile was clockwise for the training workspace (with a force compensation value of 0.43), whereas movement in the same direction was counterclockwise for test workspace 1 (force compensation equal to -0.48). Importantly, we found that the noise based model could predict this change.

      Author response image 2.

      Results of experiment 2. Force compensation profiles for the training workspace (grey solid line) and test workspace 1 (dark blue solid line). Examining the force nature for the 157.5° direction, we found a change in the applied force by the participants (change from clockwise to counterclockwise forces). This was supported by a change in force compensation value (0.43 vs. -0.48). The noise based model can predict this change as shown by the predicted force compensation profile (green dashed line).

      I don't believe the decay factor that was used to scale the test functions was specified in the text, although I may have just missed this. It would be a good idea to state what this factor is where relevant in the text.

      We added an equation describing the decay factor (new equation 7 in the Methods section) according to this suggestion and Reviewer 1's comment on the same issue.

      Reviewer #3 (Public Review):

      The author proposed the minimum variance principle in the memory representation in addition to two alternative theories of the minimum energy and the maximum smoothness. The strength of this paper is the matching between the prediction data computed from the explicit equation and the behavioral data taken in different conditions. The idea of the weighting of multiple coordinate systems is novel and is also able to reconcile a debate in previous literature.

      The weakness is that although each model is based on an optimization principle, the derivation process is not written in the Methods section. The authors did not write about how they can derive these weighting factors from these computational principles. Thus, it is not clear whether these weighting factors are relevant to these theories or just hacking methods. Suppose the author argues that this is the result of the minimum variance principle. In that case, the authors should show a process of how to derive these weighting factors as a result of the optimization process to minimize these cost functions.

      The reviewer brings up a very important point regarding the model. As shown below, it is not trivial to derive these weights using an analytical optimization process. We demonstrate one issue with this optimization process.

      The force representation can be written as (similar to equation 6):

      We formulated the problem as minimizing the variance of the force according to the weights w:

      In this case, the variance of the force is the variance-covariance matrix which can be minimized by minimizing the matrix trace:

      We will start by calculating the variance of the force representation in joints coordinate system:

      Here, the force variance is the result of a complex function which includes the joint angles as random variables. Expanding the last expression, although very complex, is still possible. In the resulting expression, some of the terms require calculating the variance of nested trigonometric functions of the random joint angle, for example:

      In the vast majority of these cases, analytical solutions do not exist. Similar issues can also arise when calculating the variance of complex products of trigonometric functions, such as in the case of multiplication of Jacobians (and inverse Jacobians).

      To overcome this problem, we turned to numerical solutions which simulate the variance due to the different state variables.
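
      The numerical approach can be sketched as a Monte Carlo estimate: draw many noisy samples of a state variable, pass them through the nonlinear function, and take the sample variance. The example below is illustrative only. The nested function `sin(cos(q))`, the Gaussian noise on the joint angle `q`, and all names are assumptions rather than the expressions used in the manuscript. The simple case `cos(q)` has a closed-form variance under Gaussian noise and serves as a check on the simulation.

```python
import numpy as np

def simulated_variance(func, q_mean, q_sd, n_samples=200_000, seed=0):
    """Monte Carlo variance of func(q) for q ~ Normal(q_mean, q_sd)."""
    rng = np.random.default_rng(seed)
    q = rng.normal(q_mean, q_sd, size=n_samples)
    return float(np.var(func(q)))

def analytic_var_cos(q_mean, q_sd):
    """Closed-form variance of cos(q) for Gaussian q, used as a sanity check."""
    s2 = q_sd ** 2
    second_moment = 0.5 + 0.5 * np.cos(2.0 * q_mean) * np.exp(-2.0 * s2)
    squared_mean = np.cos(q_mean) ** 2 * np.exp(-s2)
    return float(second_moment - squared_mean)

def nested(q):
    """Hypothetical nested trigonometric term without a simple closed form."""
    return np.sin(np.cos(q))

# Variance of the nested term for a joint angle of 0.7 rad with 0.1 rad noise.
var_nested = simulated_variance(nested, q_mean=0.7, q_sd=0.1)
```

      The same routine can be applied to any function of the state variables, which is why simulation remains practical when the analytical expansion is not.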

      In addition, I am concerned that the proposed model can cancel the property of the coordinate system by the predicted variance, and it can work for any coordinate system, even one that is not used in the human brain. When the applied force is given in Cartesian coordinates, the directionality in the generalization ability of the memory of the force field is characterized by the kinematic relationship (Jacobian) between the Cartesian coordinate and the coordinate of interest (Cartesian, joint, and object) as shown in Equation 3. At the same time, when a displacement (epsilon) is considered in a space and a corresponding displacement is linked with kinematic equations (e.g., joint displacement and hand displacement in 2 joint arms in this paper), the generated variances in different coordinate systems are linked to each other by the kinematic equation (Jacobian). Thus, how a small noise in a certain coordinate system generates the hand force noise (sigma_x, sigma_j, sigma_o) is also characterized by the kinematics (Jacobian). Thus, when the predicted forcefield (F_c, F_j, F_o) was divided by the variance (F_c/sigma_c^2, F_j/sigma_j^2, F_o/sigma_o^2), the directionality of the generalization force which is characterized by the Jacobian is canceled by the directionality of the sigmas which is characterized by the Jacobian. Thus, as can be read from Fig*D and E top, the weight in E-top of each coordinate system is always the inverse of the shift of force from the test force by which the directionality of the generalization is always canceled.

      Once this directionality is canceled, no matter how to compute the weighted sum, it can replicate the memorized force. Thus, this model always works to replicate the test force no matter which coordinate system is assumed. Thus, I am suspicious of the falsifiability of this computational model. This model is always true no matter which coordinate system is assumed. Even though they use, for instance, the robot coordinate system, which is directly linked to the participant's hand with the kinematic equation (Jacobian), they can replicate this result. But in this case, the model would be nonsense. The falsifiability of this model was not explicitly written.

      As explained above, calculating the variability of the generalized forces given the random nature of the state variable is a complex function that is not summarized using a Jacobian. Importantly, the model is unable to reproduce or replicate the test force arbitrarily. In fact, we have already shown this (see Appendix 1- figure 1): when we attempt to explain the data with only a single coordinate system (or a combination of two coordinate systems), we are completely unable to replicate the test data despite using this model. For example, in experiment 4, when we don’t use the joint based coordinate system, the model predicts zero shift of the force compensation pattern while the behavioral data show a shift due to the contribution of the joint coordinate system. Any arbitrary model (similar to the random model we tested, please see the response to Reviewer 1) would be completely unable to recreate the test data. Our model instead makes very specific predictions about the weighting between the three coordinate systems and therefore completely specified force predictions for every possible test posture. We added this point to the Discussion:

      “The results we present here support the idea that the motor system can use multiple representations during adaptation to novel dynamics. Specifically, we suggested that we combine three types of coordinate systems, where each is independent of the other (see Appendix 1- figure 1 for comparison with other combinations). Other combinations that include a single or two coordinate system can explain some of the results but not all of them, suggesting that force representation relies on all three with specific weights that change between generalization scenarios.”

    1. Author Response

      Reviewer #1:

      This is a very timely paper that addresses an important and difficult-to-address question in the decision-making field - the degree to which information leakage can be strategically adapted to optimise decisions in a task-dependent fashion. The authors apply a sophisticated suite of analyses that are appropriate and yield a range of very interesting observations. The paper centres on analyses of one possible model that hinges on certain assumptions about the nature of the decision process for this task which raises questions about whether leak adjustments are the only possible explanation for the current data. I think the conclusions would be greatly strengthened if they were supported by the application and/or simulation of alternative model structures.

      We thank the reviewer for this positive appraisal of our study. We now entirely agree with their central comment about whether leak adjustments are the only (or even the best) explanation for the current data. We hope that the additional modelling sections that we have discussed in response to main comment 1 above have strengthened the paper. We have responded point-by-point to their public review, as this contained their main recommendations for revision.

      The behavioural trends when comparing blocks with frequent versus rare response periods seem difficult to tally with a change in the leak. […] Are there other models that could reproduce such effects? For example, could a model in which the drift rate varies between Rare and Frequent trials do a similar or better job of explaining the data?

      We can see why the reviewer has advocated for a possible change of drift rate (or ‘gain’ applied to sensory evidence) between conditions to explain our behavioural findings. We found, however, that changes in drift rate could elicit qualitatively similar changes in integration kernels to changes in decision threshold:

      Author response image 1.

      Changes in gain applied to incoming sensory evidence (A parameter in model) have similar effects on recovered integration kernels from Ornstein-Uhlenbeck simulation as changes in decision threshold.

      The likely reason for this is that the overall probability of emitting a response at any point in the continuous decision process is determined by the ratio of accumulated evidence to decision threshold. A similar logic applies to effects on reaction times and detection probability (main figure 2): increasing sensory gain/decreasing decision threshold will lead to faster reaction times and increased detection probability during response periods.

      Both parameters may even have a similar effect on ‘false alarms’, because (as the reviewer notes below) false alarms in our paradigm are primarily being driven by the occurrence of stimulus changes as well as internal noise. In fact, the false alarm findings mean it is difficult to fully reconcile all of our behavioural findings in terms of changes in a single set of model parameters in the O-U process. It is possible that other changes not considered within our model (such as expectations of hazard rates of inter-response intervals leading to dynamic thresholds etc.) may have had a strong impact upon the resulting false alarm rates. A full exploration of different variations of the O-U model (with varying urgency signals, hazard rates, etc.) is beyond the scope of this paper.

      For this reason, we have decided in our new modelling section to focus primarily on a single, well-established model (the O-U process) and explore how changes in leak and threshold affect task performance and the resulting integration kernels. We note that this is in line with the suggestion of reviewer #2, who focussed on similar behavioural findings to reviewer #1 but suggested that we look at decision threshold rather than drift rate as our primary focus.
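
      The interchangeability of sensory gain and decision threshold can be illustrated with a discrete-time leaky accumulator. This is a simplified sketch, not the model fitted in the paper: it assumes no internal noise (all variability comes from the stimulus stream), a symmetric threshold, and Gaussian evidence samples; `first_crossing` and its parameters are hypothetical names.

```python
import numpy as np

def first_crossing(evidence, gain, threshold, leak):
    """Index of the first sample at which |x| reaches threshold, else None.

    Discrete leaky accumulation: x <- (1 - leak) * x + gain * sample.
    """
    x = 0.0
    for t, sample in enumerate(evidence):
        x = (1.0 - leak) * x + gain * sample
        if abs(x) >= threshold:
            return t
    return None

# Hypothetical noisy evidence stream, fixed seed for reproducibility.
rng = np.random.default_rng(1)
evidence = rng.normal(0.0, 1.0, size=5000)

t_base = first_crossing(evidence, gain=1.0, threshold=2.0, leak=0.1)
t_high_gain = first_crossing(evidence, gain=2.0, threshold=2.0, leak=0.1)
t_low_threshold = first_crossing(evidence, gain=1.0, threshold=1.0, leak=0.1)
```

      Because the accumulated value scales linearly with the gain in this noise-free sketch, doubling the gain and halving the threshold produce the same response time, and both respond no later than the baseline setting. With internal noise added after the gain stage the equivalence would hold only approximately, which is consistent with the qualitative (rather than exact) similarity described above.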

      This ties in to a related query about the nature of the task employed by the authors. Due to the very significant volatility of the stimulus, it seems likely that the participants are not solely making judgments about the presence/absence of coherent motion but also making judgments about its duration (because strong coherent motion frequently occurs in the inter-target intervals). If that is so, then could the Rare condition equate to less evidence because there is an increased probability that an extended period of coherent motion could be an outlier generated from the noise distribution? Note that a drift rate reduction would also be expected to result in fewer hits and slower reaction times, as observed.

      As mentioned above, the rare and frequent targets are indeed matched in terms of the ease with which they can be distinguished from the intervening noise intervals. To confirm this, we directly calculated the variance (across frames) of the motion coherence presented during baseline periods and response periods (until response) in all four conditions:

      Author response image 2.

      The average empirical standard deviation of the stimulus stream presented during each baseline period (‘baseline’) and response period (‘trial’), separated by each of the four conditions (F = frequent response periods, R = rare, L = long response periods, S = short). Data were averaged across all response/baseline periods within the stimuli presented to each participant (each dot = 1 participant). Note that the standard deviation shown here is the standard deviation of motion coherence across frames of sensory evidence. This is smaller than the standard deviation of the generative distribution of ‘step’-changes in the motion coherence (std = 0.5 for baseline and 0.3 for response periods), because motion coherence remains constant for a period after each ‘step’ occurs.

      Some adjustment of the language used when discussing FAs seems merited. If I have understood correctly, the sensory samples encountered by the participants during the inter-response intervals can at times favour a particular alternative just as strongly (or more strongly) than that encountered during the response interval itself. In that sense, the responses are not necessarily real false alarms because the physical evidence itself does not distinguish the target from the non-target. I don't think this invalidates the authors' approach but I think it should be acknowledged and considered in light of the comment above regarding the nature of the decision process employed on this task.

      This is a good point. We hope that the reviewer will allow us to keep the term ‘false alarms’ in the paper, as it does conveniently distinguish responses during baseline periods from those during response periods, but we have sought to clarify the point that the reviewer makes when we first introduce the term.

      “Indeed, participants would occasionally make ‘false alarms’ during baseline periods in which the structure of the preceding noise stream mistakenly convinced them they were in a response period (see Figure 4, below). Indeed, this means that a ‘false alarm’ in our paradigm has a slightly different meaning than in most psychophysics experiments; rather than it referring to participants responding when a stimulus was not present, we use the term to refer to participants responding when there was no shift in the mean signal from baseline.”

      And:

      “The fact that evidence integration kernels naturally arise from false alarms, in the same manner as from correct responses, demonstrates that false alarms were not due to motor noise or other spurious causes. Instead, false alarms were driven by participants treating noise fluctuations during baseline periods as sensory evidence to be integrated across time, and the physical evidence preceding ‘false alarms’ need not even distinguish targets from non-targets.”

      The authors report that preparatory motor activity over central electrodes reached a larger decision threshold for RARE vs. FREQUENT response periods. It is not clear what identifies this signal as reflecting motor preparation. Did the authors consider using other effector-selective EEG signatures of motor preparation such as beta-band activity which has been used elsewhere to make inferences about decision bounds? Assuming that this central ERP signal does reflect the decision bounds, the observation that it has a larger amplitude at the response on Rare trials appears to directly contradict the kernel analyses which suggest no difference in the cumulative evidence required to trigger commitment.

      Thanks for this comment. First, we should note that this finding emerged from an agnostic time-domain analysis of the data time-locked to button presses, in which we simply observed that the negative-going potential was greater (more negative) in RARE vs. FREQUENT trials. So it is simply because it precedes each button press that we relate it to motor preparation; nonetheless, we note that Kelly and O’Connell (2013) found similar negative-going potentials at central sensors without applying a CSD transform (as in this study). Like them, we would relate this potential to either the well-established Bereitschaftspotential or the contingent negative variation (CNV).

      We agree that many other studies have focussed on beta-band activity as another measure of motor preparation, and to make inferences about decision bounds. To investigate this, we used a Morlet wavelet transform to examine the time-varying power estimate at a central frequency of 20 Hz (wavelet factor 7). We repeated the convolutional GLM analysis on this time-varying power estimate.
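
      For readers unfamiliar with this step, a time-varying power estimate of this kind can be sketched with a complex Morlet wavelet as follows. This is an illustrative NumPy version, not the analysis code used for the paper. It assumes that "wavelet factor 7" refers to the number of cycles in the wavelet, truncates the wavelet at four temporal standard deviations, and normalises it so that a unit-amplitude sinusoid at the central frequency yields a power of about 0.25; `morlet_power` is a hypothetical name.

```python
import numpy as np

def morlet_power(signal, sfreq, freq=20.0, n_cycles=7.0):
    """Time-varying power at `freq` (Hz) via convolution with a Morlet wavelet."""
    sigma_t = n_cycles / (2.0 * np.pi * freq)        # temporal SD in seconds
    half = int(np.ceil(4.0 * sigma_t * sfreq))       # truncate at 4 SDs
    t = np.arange(-half, half + 1) / sfreq
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t ** 2 / (2.0 * sigma_t ** 2))
    wavelet /= np.sum(np.abs(wavelet))               # unit-gain normalisation
    analytic = np.convolve(signal, wavelet, mode="same")
    return np.abs(analytic) ** 2

# Synthetic check: a 20 Hz oscillation carries power at 20 Hz, a 5 Hz one does not.
sfreq = 250.0
times = np.arange(0.0, 2.0, 1.0 / sfreq)
power_beta = morlet_power(np.cos(2.0 * np.pi * 20.0 * times), sfreq)
power_slow = morlet_power(np.cos(2.0 * np.pi * 5.0 * times), sfreq)
```

      Samples near the start and end of the signal are affected by edge effects and are best excluded; a decrease in this power estimate before a response corresponds to the beta desynchronisation discussed in the response.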

      We first examined average beta desynchronisation at a central cluster of electrodes (CPz, CP1, CP2, C1, Cz, C2) in the run-up to correct button presses during response periods. We found that a reliable beta desynchronisation occurred, and, just as in the time-domain signal, this reached a greater threshold in the RARE trials than in the FREQUENT trials:

      Author response image 3.

      Beta desynchronisation prior to a correct response is greater over central electrodes in the RARE condition than in the FREQUENT condition.

      We agree with the reviewer that this is likely indicative of a change in decision threshold between rare and frequent trials. We also note that our new computational modelling of the O-U process suggests that this in fact reconciles well with the behavioural findings (changes in integration kernels). We now mention this at the relevant point in the results section:

      “As large changes in mean evidence are less frequent in the RARE condition, the increased neural response to |Δevidence| may reflect the increased statistical surprise associated with the same magnitude of change in evidence in this condition. In addition, when making a correct response, preparatory motor activity over central electrodes reached a larger decision threshold for RARE vs. FREQUENT response periods (Figure 7b; p=0.041, cluster-based permutation test). We found similar effects in beta-band desynchronisation prior to a correct response, averaged over the same electrodes; beta desynchronisation was greater in RARE than FREQUENT response periods. As discussed in the computational modelling section above, this is consistent with the changes in integration kernels between these conditions as it may reflect a change in decision threshold (figure 2d, 3c/d). It is also consistent with the lower detection rates and slower reaction times when response periods are RARE (figure 2 b/c).”

      We did also investigate the lateralised response (left minus right beta-desynchronisation, contrasted on left minus right responses). We found, however, that we were simply unable to detect a reliable lateralised signal in either condition using these lateralised responses. We suspect that this is because we have far fewer response periods than conventional trial-based EEG experiments of decision making, and so we did not have sufficient SNR to reliably detect this signal. This is consistent with standard findings in the literature, which report that the magnitude of the lateralised signal is far smaller than the magnitude of the overall beta desynchronisation (e.g. Doyle et al., 2005).

      P11, the "absolute sensory evidence" regressor elicited a triphasic potential over centroparietal electrodes. The first two phases of this component look to have an occipital focus. The third phase has a more centroparietal focus but appears markedly more posterior than the change in evidence component. This raises the question of whether it is safe to assume that they reflect the same process.

      We agree. We have now referred to this as a ‘triphasic component over occipito-parietal cortex’ rather than centroparietal electrodes.

      Reviewer #2:

      Overall, the authors use a clever experimental design and approach to tackle an important set of questions in the field of decision-making. The manuscript is easy to follow with clear writing. The analyses are well thought-out and generally appropriate for the questions at hand. From these analyses, the authors have a number of intriguing results. So, there is considerable potential and merit in this work. That said, I have a number of important questions and concerns that largely revolve around putting all the pieces together. I describe these below.

      Thanks to the reviewer for their positive appraisal of the manuscript; we are obviously pleased that they found our work to have considerable potential and merit. We seek to address the main comments from their public review and recommendations below.

      1) It is unclear to what extent the decision threshold is changing between subjects and conditions, how that might affect the empirical integration kernel, and how well these two factors can together explain the overall changes in behavior.

      I would expect that less decay in RARE would have led to more false alarms, higher detection rates, and faster RTs unless the decision threshold also increased (or there was some other additional change to the decision process). The CPP for motor preparatory activity reported in Fig. 5 is also potentially consistent with a change in the decision threshold between RARE and FREQUENT. If the decision threshold is changing, how would that affect the empirical integration kernel? These are important questions on their own and also for interpreting the EEG changes.

      This important comment, alongside the comments of reviewer 1 above, made us carefully consider the effects of changes in decision threshold on the evidence integration kernel via simulation. As discussed above (in response to ‘essential revisions for the authors’), we now include an entirely new section on how changes in decision threshold and leak may affect the evidence integration kernel, and be used to optimise performance across the different sensory environments. In particular, we agree with the reviewer that the motor preparatory activity that differs between RARE and FREQUENT is consistent with a change in decision threshold, and our simulations suggest that our behavioural findings on evidence integration are also consistent with this change. These are detailed on pp.1-4 of the rebuttal, above.
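
      To make the roles of leak and threshold concrete, the following is a minimal sketch of a leaky accumulator with a fixed bound. It is not the model fitted in the manuscript: the function name, parameter values and constant toy input are assumptions chosen for illustration, and the noise term of a full O-U process is omitted so that the behaviour is deterministic.

```python
def first_crossing(evidence, leak, threshold, dt=0.01):
    """Euler-integrate dx/dt = -leak * x + evidence(t).

    Returns the index of the first sample at which |x| reaches
    `threshold`, or None if the bound is never reached.
    (Hypothetical helper; noise is deliberately left out.)
    """
    x = 0.0
    for i, e in enumerate(evidence):
        x += dt * (-leak * x + e)
        if abs(x) >= threshold:
            return i
    return None


# Toy input: constant evidence of 1.0 for 500 samples (assumed values).
constant_evidence = [1.0] * 500

low_bound = first_crossing(constant_evidence, leak=0.0, threshold=0.5)
high_bound = first_crossing(constant_evidence, leak=0.0, threshold=1.0)
leaky = first_crossing(constant_evidence, leak=1.0, threshold=0.5)
very_leaky = first_crossing(constant_evidence, leak=4.0, threshold=0.5)
```

      In this toy setting, raising the threshold delays the crossing, adding leak delays it as well, and a sufficiently strong leak caps the accumulator below the bound so that no response is triggered at all.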

      2) The authors find an interesting difference in the CPP for the FREQUENT vs RARE conditions where they also show differences in the decay time constant from the empirical integration kernel. As mentioned above, I'm wondering what else may be different between these conditions. Do the authors have any leverage in addressing whether the decision threshold differs? What about other factors that could be important for explaining the CPP difference between conditions? Big picture, the change in CPP becomes increasingly interesting the more tightly it can be tied to a particular change in the decision process.

      We fully agree with the spirit of this comment, and we have tried to consider much more carefully what the influences of decision threshold and leak would be on our behavioural analyses. As discussed in the response to reviewer 1, we think that the negative-going potential at the time of responses (which is greater in RARE vs. FREQUENT, main figure 7b) and the equivalent changes in beta desynchronisation (see Reviewer Response Figure 5 above) are both reflective of a change in decision threshold between RARE and FREQUENT conditions. We have tried to make this link explicit in the revised results section:

      “As large changes in mean evidence are less frequent in the RARE condition, the increased neural response to |Δevidence| may reflect the increased statistical surprise associated with the same magnitude of change in evidence in this condition. In addition, when making a correct response, preparatory motor activity over central electrodes reached a larger decision threshold for RARE vs. FREQUENT response periods (Figure 7b; p=0.041, cluster-based permutation test). We found similar effects in beta-band desynchronisation prior, averaged over the same electrodes; beta desynchronisation was greater in RARE than FREQUENT response periods. As discussed in the computational modelling section above, this is consistent with the changes in integration kernels between these conditions as it may reflect a change in decision threshold (figure 2d, 3c/d). It is also consistent with the lower detection rates and slower reaction times when response periods are RARE (figure 2 b/c).”

      I'll note that I'm also somewhat skeptical of the statements by the authors that large shifts in evidence are less frequent in the RARE compared to FREQUENT conditions (despite the names) - a central part of their interpretation of the associated CPP change. The FREQUENT condition obviously has more frequent deviations from the baseline, but this is countered to some extent by the experimental design that has reduced the standard deviation of the coherence for these response periods. I think a calculation of overall across-time standard deviation of motion coherence between the RARE and FREQUENT conditions is needed to support these statements, and I couldn't find that calculation reported. The authors could easily do this, so I encourage them to check and report it.

      See Author response image 2.
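
      The calculation the reviewer asks for amounts to taking the standard deviation of the motion coherence trace across all time points, separately for each condition. A minimal sketch follows, assuming each condition's coherence is available as a flat sequence of samples; the helper name and the toy traces are hypothetical and are not data from the study.

```python
from statistics import pstdev


def across_time_sd(coherence_trace):
    """Population standard deviation of a coherence trace over all time points."""
    return pstdev(coherence_trace)


# Toy traces with made-up values, used only to show the computation.
trace_a = [0.0, 0.0, 1.0, 1.0]
trace_b = [0.0, 0.0, 0.0, 0.0]

sd_by_trace = {
    "trace_a": across_time_sd(trace_a),
    "trace_b": across_time_sd(trace_b),
}
```

      With real data, one trace per condition (concatenated over blocks) would replace the toy lists, and the two resulting values could be compared directly.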

      3) The wide range of decay time constants between subjects and the correlation of this with another component of the CPP is also interesting. However, in trying to interpret this change in CPP, I'm wondering what else might be changing in the inter-subject behavior. For instance, it looks like there could be up to 4 fold changes in false alarm rates. Are there other changes as well? Do these correlate with the CPP? Similar to my point above, the changes in CPP across subjects become increasingly interesting the more tightly it can be tied to a particular difference in subject behavior. So, I would encourage the authors to examine this in more depth.

      Thanks for the interesting suggestion. We explored whether there might be any interindividual correlation in this measure with the false alarm rate across participants, but found that there was no such correlation. (See Author response image 4; plotting conventions are as in main figure 9).

      Author response image 4.

      No evidence of between-subject correlations in CPP responses and false alarm rates, in any of the four conditions.

      We hope instead that the extended discussion of how the integration kernel should be interpreted (in light of computational modelling) provides at least some increased interpretability of the between-subject effects that we report in figure 9.

      Reviewer #3 (Public Review):

      The main strength is in the task design which is novel and provides an interesting approach to studying continuous evidence accumulation. Because of the continuous nature of the task, the authors design new ways to look at behavioral and neural traces of evidence. The reverse-correlation method looking at the average of past coherence signals enables us to characterize the changes in signal leading to a decision bound and its neural correlate. By varying the frequency and length of the so-called response period, that the participants have to identify, the method potentially offers rich opportunities to the wider community to look at various aspects of decision-making under sensory uncertainty.

      We are pleased that the reviewer agrees with our general approach as a novel way of characterising various aspects of decision-making under uncertainty.

      The main weaknesses that I see lie within the description and rigor of the method. The authors refer multiple times to the time constant of the exponential fit to the signal before the decision but do not provide a rigorous method for its calculation and neither a description of the goodness of the fit. The variable names seem to change throughout the text which makes the argumentation confusing to the reader. The figure captions are incomplete and lack clarity.

      We apologise that some of our original submission was difficult to follow in places, and we are very grateful to the reviewer for their thorough suggestions for how this could be improved. We address these in turn below, and we hope that this answers their questions, and has also led to a significant improvement in the description and rigour of the methodology.

    1. Author Response

      Reviewer #2 (Public Review):

      I am not a specialist in cryo-EM, so cannot comment on the technicalities of the structure reconstruction or methods used. I thus focus on the conclusions and observations that the authors provide in the manuscript and their relevance to functional photosynthesis.

      The authors attempt to resolve the structure of PSII from Dunaliella and noticed that three types of PSII could be identified: two conformational states, and a stacked configuration. There is no doubt that these structures add to our current knowledge of PSII and that they exist in abundance upon solubilisation of the sample. My main issue however is the relevance to in vivo conditions, and the efforts to exclude the possibility that pigment loss and conformational states and stacking are a reflection of ex-vivo manipulations.

      Our compact model contains 202 Chl molecules while the stretched conformation contains 206 Chls. All of the differences in Chl binding are attributed to CP29. We have compiled a table enumerating the different CP29 structures currently available from plants and green algae at similar resolution to our work (Supplementary table 2). In the larger plant complexes (C2S2M2) CP29 contains 14 chls, while CP29 in smaller C2S2 complexes contains 10-13 chls, so it appears that some chl loss from CP29 is associated with the release of LHCIIM. In the green algal structures, CP29 contains fewer chls in general and shows a similar trend. The currently published structure most relevant to our work contains 8 chls (6KAC), a somewhat lower amount than both the compact and stretched models (9 and 11 chls, respectively). The stretched orientation, which is the closest match to the known PSII core arrangement, therefore contains more chls than comparable models. While the in-vivo configuration is not known in the sense that it could contain more chls, the current structure is apparently the closest representation of it.

      The presence of CP29 with lower chl content in the Chlamydomonas C2S2 (6KAC, which is in a stretched orientation) supports the conclusion that pigment loss from CP29 alone is not sufficient to trigger the stretched-to-compact transition, although it is associated with it. In general, the precise orientation of CP29 is variable and seems to depend on the binding of additional LHCII; it is possible that some chl loss accompanies these changes in vivo.

      I see a number of questions pertaining to this work. Starting from the two conformations of PSII, compact and stretched, the authors say that both are highly active based on oxygen measurements at a saturating light intensity. In the meantime, they report large variations in the chl content and positions of the chlorophyll molecules in these structures (also compared to other known PSIIs). This gives the impression that one can lose two chlorophylls, and freely modify the distance between others without losing efficiency, certainly a risky conclusion. Are the samples highly active also in light-limiting conditions? It is thought that even tiny movements and alterations in chl-chl distances alter their coupling and spectral properties, how come the variations in this report are so huge? In other words, the assay tests the charge separation activity of the PSII RC in the preps, but not the light-harvesting efficiency.

      The chl content differences reported in this work amount to 2%. In our opinion this represents quite a low variation in pigment content, of the kind that exists in virtually any experiment involving large complexes. We agree that measurements of activity in limiting light conditions are interesting; however, this goes beyond the scope of the current work. Light harvesting efficiency in PSII is known to vary substantially as a result of additional mechanisms (NPQ in some of its forms) that are not associated with chl loss or gain. While the formation of quenching centers is attributed to small structural changes within specific pigment protein complexes, what we are showing in this work are structural changes between pigment protein complexes. These can affect transfer rates between the different complexes but are distinct from the structural changes thought to accompany the formation of quenching centers within specific pigment protein complexes.

      How does one ascertain that the lost chlorophyll molecules in CP29 are not a preparation error? Does slightly increasing the detergent concentration impact the proportion of stretched:compact forms?

      The effect of detergent concentration on the proportion of the different forms was not tested directly. However, we do not detect many differences in lipid or bound detergent content between the two conformations, suggesting that for these “ligands” the differences are not substantial. We can only distinguish these two forms at the very last stages of data processing; given the present cost and time availability of cryoEM, mapping the effect of detergent concentration on the different orientations is outside our reach.

      On a similar note, how do the authors exclude that a certain interaction with this type of grid impacts the distribution of these complexes? Is it identical to a biologically separate preparation of algae? In case of discoveries of this type, it is of high importance to exclude as many possibilities of non-native conditions or influences on the structure.

      It is hard to completely exclude grid and sample preparation issues. However, we employed relatively standard grids and vitrification conditions. The observed complexes are embedded in vitrified ice and do not interact with the grid directly. The differences we observed are mainly in the orientations of the PSII cores; all the interactions between PSII subunits within each core are preserved and agree with previously published structures. Since the interactions within the core and between cores involve the same physical principles, we think it is fairly conservative to conclude that the observed core orientations are not an artefact of sample preparation.

      I would further like to encourage the authors to elaborate on the CP29 phosphorylation. What is the proportion of PSIIcomp that are phosphorylated? I assume it is not 100%, as in this case, the authors would propose that this is the effect that modulates between compact and stretched architectures.

      It is difficult to estimate the proportion of observed phosphorylation/sulfinylation. For the modifications to be detected in the maps, most of the residues (above 50%) are probably modified. We attempted to estimate this by refining the atom occupancies of the Pi molecule on Ser84 and the oxygens attached to Cys218; both values suggest that about 70% of the complexes are modified. With regard to the possibility that these modifications promote the formation of the compact state, we think that this is certainly possible, since these modifications were detected in this state and are in close proximity to each other. However, this can also result from the resolution differences of the maps, and the structural implications of both modifications are hard to predict. At this point we prefer to note their existence without further interpretation.

      In line 290, the authors highlight the structural heterogeneity within the two groups' PSII conformations. I would like to see what the distribution looks like for all the structures together: are the two (stretched and compact) specifically forming two heterogeneous distributions? Or is it possible that the distribution between the two is quasi-continuous? In other words, if the structures are not perfectly defined, how do the authors decide that two, and not more or fewer, subtypes exist?

      We went back and refined the initial particle group (containing both compact and stretched orientations) using multibody refinement with masks defining the two PSII monomers. This analysis showed the expected two peaks only in the first principal component, which accounted for ~38% of the variance in the dataset.

      Multibody refinement carried out on the combined particle dataset shows one very large PC accounting for about 38% of the variance and the presence of two distinct peaks in the particle distribution of the first PC.

      From this analysis it’s clear that there are two distinct classes in this particle set (as expected). As none of the other PCs shows any sign of multiple peaks, this analysis suggests that two distinct models are the best representation of this eukaryotic PSII. Whether these are quasi-continuous or distinct is more complex. There is continuity in this representation (particle distributions along the PC); a different picture may appear if characters such as the CP29 state are considered, but the size of CP29 and the remaining heterogeneity do not provide enough signal to carry out this classification at the moment.

      Considering the stacked PSII, I also have a few concerns. Contrary to previous studies the authors do not assign a functional role to the stacking beyond the structural aspect. This could be better backed by a discussion about the closest chlorophyll a molecules across the stacked PSII, which given the rather large distance shown in fig. 4L seems to be too large for any EET across the stromal gap.

      The closest chl-chl distance that we can measure in the stacked PSII dimer is ~54 Å, with most distances in the ~70 Å range, making EET between stacked complexes very slow. We have added a statement clarifying this to our manuscript. In our opinion a structural role for the stacked PSII dimer is more likely.

      There is a report that suggests the presence of some density between the stacked PSII - could the authors comment on the differences between it and their work? Are the angles and positions conserved between these types of stacks? https://doi.org/10.1038/s41598-017-10700-8

      We referred to Albanese et al. in our manuscript. We isolated the C2S2 complex from a green alga, whereas the analysis in Albanese et al. was done on C2S2M1 complexes from pea, and this can account for some of the differences. At any rate, our conclusion that we don’t find any evidence for protein linkers in the stacked complex is stated clearly. The angles described in Albanese et al. are consistent with our analysis.

      Line 387, the authors state that due to the transient nature of the interactions across the stromal gap, the stacks could be "under-detected" in cryo-ET data. This statement is in my opinion misformulated. For one, the transient interaction argument would apply the same (if not more due to changing conditions induced by the purification process) to the single particle analysis performed in this paper. Second, tomographic volumes detect hundreds of PSII in a suspended state. Any transient interaction that adds up to 25% of particle population in a steady state cell should be clearly visible, while the in situ data suggests not more than random cross-stromal-gap orientations. Of course, this can be a specificity of Chlamydomonas or a particular growth condition. The statement used by the authors could be indeed converted into: the PSII stacks are over-detected in vitro, and it is certainly a simpler explanation for their presence. It is also important to mention that PSII stacking alone is not the only reason for grana architecture - stacking with the antenna of larger complexes, absent in the authors' preparation could also contribute to grana maintenance; and auxiliary proteins such as CURT help with this issue as well. Here a recent demonstration of the importance of minor antenna should probably be also cited: https://doi.org/10.1101/2021.12.31.474624

      We used the term “flexible” rather than “transient” to describe the interactions within the stacked PSII dimer. Our data (and tomographic data) do not contain any temporal component. When we used the term under-detected, we referred to the fact that PSII is mainly detected by the luminal extrinsic subunits. The flexibility detected in our analysis may affect the concurrent visibility of these features in the PSII complexes making up an individual PSII stack. Specifically, Wietrzynski et al. mainly analyze C2S2M2L2 complexes while our analysis only contained C2S2 complexes. It is likely that the different amount of bound LHCII affects PSII stacking as well. For example, Wietrzynski et al. show some overlap between LHCII complexes and little overlap between cores in the larger complexes they analyzed. We observe mainly core-to-core overlap with little LHCII overlap in the smaller C2S2, although we did not observe any states where LHCs were not included in what appears to be the binding interface. We agree with the reviewer on the relevance of Lhcb and CURT contributions to stacking but prefer to focus on what was directly demonstrated in our data. We clearly note that we are discussing in-vitro results.

      Taking these last thoughts, I would like to finish by mentioning one more thing - almost philosophical. The authors are certainly at the forefront of the booming cryoEM revolution in biology which is profoundly changing the way we understand the living. There is absolutely zero doubt that this powerful technique is of the highest interest. But a growing number of structures of photosynthetic complexes remain puzzling, in particular with regard to their abundance in vivo (such as the PSII stacks) and functional relevance. How do we ascertain that these interactions are not due to in vitro preparation (isolation from cells, solubilisation)? Which ways can we use to try to exclude this (simple) hypothesis? I suggest that at least a small extent of biological replicas - experiments performed on separate batches, in different technical conditions, with slightly altered solubilization conditions, and so on - could shed light on the nature of these structures and their occurrence in vivo. Technical reps of the freezing+analysis pipeline could also be tried to see the variability. This would strongly reinforce this manuscript and its conclusions, and while not completely unequivocal (the stacked PSII, for example, could form upon each purification), a quantification of the effects would be of high interest.

      We certainly share the reviewer's hope of being able to conduct cause-and-effect cryoEM experiments covering a complete set of experimental parameters. This is still beyond reach in terms of time and cost. Within each cryoEM experiment, however, all the analysis is consistent and, more importantly, transparent with regard to image analysis, which is the most important factor in our opinion. Preparation artefacts are always a possibility but, in our opinion, cryoEM is not affected by them differentially compared to other techniques. As we mentioned above, the particles are observed suspended in vitreous ice; this is no different from, and arguably even better than, the numerous low temperature spectroscopic observations on samples suspended in a glass state, or crystals obtained in the presence of high concentrations of various agents. One thing that validates structural studies is the chemical detail (bond lengths, angles, etc.) underlying every model, which is consistent with known values to close tolerances.

      Reviewer #3 (Public Review):

      In this manuscript, Caspy et al. present a detailed structural analysis of eukaryotic photosystem II (PSII) isolated from the green alga Dunaliella salina. By combining single-particle cryo-EM with multibody refinement, the authors not only reveal a high-resolution (2.4Å) structure of the eukaryotic PSII, but also demonstrate alternate conformations and intrinsic flexibility of the overall complex. Stretched and compact conformations of the PSII dimer were readily identified within the single-particle dataset. From this structural analysis, the authors propose that excitation energy transfer properties may be modulated by changes in transfer distance between key chlorophyll molecules observed in different conformational states of the PSII dimer. Due to the high resolution of the maps obtained, the authors identify post-translational modifications and a sodium binding site based on the observed cryo-EM maps. Additionally, the authors analyze PSII complexes in stacked and unstacked configurations, and find that compact and stretched states also exist within the stacked PSII complexes. From their cryo-EM maps, the authors demonstrate that there is no direct protein-protein interaction between stacked PSII complexes, and rather propose a model wherein long-range electrostatic interactions mediated by divalent cations such as magnesium, can facilitate PSII stacking.

      The conclusions and models presented in the manuscript are mostly well justified by the data. The cryo-EM maps are high quality and the models appear generally well refined. However, some aspects of data processing and analysis, as well as the resultant conclusions need to be clarified.

      1) In general, it is not clear from the cryo-EM processing workflow (suppl. Fig 1) or the methods section when exactly symmetry was applied during 3D classification and refinement. In the case of C2S2 unstacked particles, when was symmetry first applied in the overall processing workflow? To identify the compact and stretched configurations of C2S2, did the 3D classification without alignment (and/or the refinement preceding this classification) have C2 symmetry applied? If so, have you considered the possibility that some particles may actually be asymmetric in some regions?

      We modified figure S1 to clearly indicate the use of symmetry and particle expansion. In general, we refined most of the particle sets without symmetry (C1). At the final processing stage of the unstacked PSII sets, after we separated both conformations, we used C2 symmetry to expand the data, this was followed by multibody refinement. No symmetry or symmetry expansion was used for the stacked PSII particle sets.

      2) Following multibody refinement in Relion individual maps and half-maps for each body will be generated. There is no mention in the methods of how these individual maps for each C2S2 "monomer" were combined to produce an overall map of the dimer following multibody refinement. There are several methods currently used to combine such maps, including taking the maximum or average of the two maps or using a model-based approach in phenix. The authors should be explicit about the method they used, any potential artifacts that may develop from this map combination process, and/or the interface between masks used in multibody refinement.

      We used phenix.combine_focused_maps to combine the maps. This is now indicated in the Methods section.

      3) In addition to the point raised above, following multibody refinement there will be an individual FSC curve and resolution for each body. However, in supplemental figure 2 and supplemental table 1, only a single FSC curve and resolution are reported. Are these FSC curves/resolutions only reported for the better of the two bodies? If not, how was a single resolution calculated for the overall map of combined bodies?

      Both FSC curves were calculated and were highly similar, as expected following C2 expansion. This can also be evaluated from the local resolution maps, which are highly similar between the two bodies. The reported resolutions are all taken from the displayed FSC curves generated through RELION PostProcess.

      4) One of the major conclusions from the 3D classification and multibody refinement is that conformational changes and inherent flexibility of the PSII dimers have the potential to change distances between cofactors in the complex, ultimately leading to altered excitation energy transfer. However, it is unclear whether or not the authors believe one conformation over another may more readily support the evolution of oxygen. It would be nice if the authors could elaborate slightly upon this topic in the discussion.

      As discussed above, the structural changes associated with the formation of quenching centers are not expected to be detected in the current work. The changes we observe can, however, affect the transfer to such centers and by doing so can play an important part in PSII biology. We do not detect any changes around the OEC and we don’t find any reason to think the two conformations are different with respect to their ETC.

      5) Along the lines of point 4 above, on line 95 the authors claim that "the high specific activity of 816 umol O2/ (mg Chl * hr) suggest that" both the C2S2 compact and stretched conformation are highly active. However, it is not clear to me why this measure of specific activity would suggest that both PSII conformations should have "high" activity. Maybe a reference here would help guide readers to previous measures of specific activity?

      Looking at specific activities from previously published structural studies on eukaryotic PSII, we find that Sheng et al., 2019 reported a specific activity of 272 µmol O2/(mg Chl * hr). This difference can stem partially from the presence of larger complexes in their preparation, and their value is comparable to the activity that we measured in our As fraction (276 µmol O2/(mg Chl * hr), Figure 1-figure supplement 9). Reported specific activity values from plants (Pisum sativum) are also similar: Su et al. reported a maximal value of 288 µmol O2/(mg Chl * hr), again for larger complexes, which can explain some of the difference. However, the specific activity measured for the C2S2 PSII isolated in the current study is 2.8 X higher than this value, more than the differences in chl content, which range between 1.5 X and 2 X in favor of the larger complexes. If either one of the conformations is not as active, it would only mean that the other conformation displays even higher specific activity, which seems less likely. In addition, we find no difference around the oxygen evolution center or in the peripheral luminal subunits in either shape or map strength, so both orientations show highly similar structures around the regions which determine the oxygen evolution activity.
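
      The fold differences quoted above can be checked with a few lines of arithmetic. The sketch below uses only the specific activities stated in this response, in µmol O2/(mg Chl * hr); the variable names are illustrative.

```python
# Specific activities in µmol O2 / (mg Chl * hr), as quoted in this response.
c2s2_activity = 816.0  # C2S2 PSII isolated in the current study

reference_activities = {
    "Sheng et al., 2019": 272.0,
    "As fraction (this study)": 276.0,
    "Su et al.": 288.0,
}

# Ratio of the C2S2 activity to each reference value.
fold_over_reference = {
    name: c2s2_activity / value
    for name, value in reference_activities.items()
}

for name, fold in fold_over_reference.items():
    print(f"{name}: {fold:.2f} X")
```

      The ratio relative to the Su et al. value is about 2.8, and the ratios relative to the other two values are close to 3, all above the 1.5 X to 2 X range given for the differences in chl content.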

      6) It is claimed that "more than 2100 water molecules were detected in the C2S2 compressed model", and the water distribution is shown in Figure 3. Obtaining resolutions capable of visualizing waters with cryo-EM is still a significant challenge. Upon visual inspection of the map supplied, it appears that several of the waters that were built into the atomic model simply do not have supporting peaks in the coulomb potential map above the level of noise. While some of the modeled waters are certainly supported by the map, in my opinion, there are many waters that simply are not, or at best are questionable. What method or tool was originally used to build waters into the model, and how were these waters subsequently validated during structure refinement?

      We followed standard methods for water placement and refinement in the preparation of the model, in addition to manually curating the water structure. However, in light of the reviewer's comment we undertook additional rounds of refinement and inspection of the water molecules in the model. We removed a few hundred water molecules so that the total number of water molecules is now around 1700. All the water molecules in the present model should be well supported at map values higher than 2.5 sigma, and in our opinion the current water model should be regarded as conservative and as underestimating the number of bound water molecules. This also led to some improvements in additional validation statistics of the model, which are listed in Table 1. The new model has been deposited in the PDB and the new PDB validation report is included in our resubmission.

      7) The authors claim to identify several unique map densities during model building. One of these is a sodium ion close to the OEC, which is coordinated by D1-His337, several backbone carbonyls, and a water molecule. When looking closely at the cryo-EM map supplied, it appears that the coulomb potential map is quite weak for this sodium, and is only visible at quite low contour levels. In fact, the features for the coordinating water, and chloride ions located ~7-9A away are much stronger than the sodium. Do the authors have any explanation for why the cryo-EM map is significantly weaker for the sodium compared to the coordinating water or chloride ions in the same general vicinity? Similar to what they did for the other post-translational modifications, the authors should consider showing the actual cryo-EM map for the bound sodium in supplemental Figure 10 a,b.

      Our main support for the placement of a Na+ ion in this location stems from the analysis of Wang et al. Our maps show the presence of a density which is discernible at 4 σ, with an elongated shape suggesting the presence of multiple atoms/waters. Although in principle positive ions should have very strong densities in cryo-EM maps due to their interactions with electrons, other factors such as occupancy, coordination and B-factor also play a role, making the distinction between water and sodium complicated and case specific. The sodium peak is not observed in unsharpened maps (as is the case for most of the water molecules which occupy conserved positions).

        We collected a few examples from comparable cases (cryo-EM maps of similar resolution ranges) where the presence of sodium ions is highly probable based on additional evidence. These map densities highlight the factors we discussed above. In cases ‘a’ (dual oxidase 1 prepared in high sodium conditions) and ‘b’ (human voltage-gated sodium channel), Na+ is observed in a highly coordinated state and, especially in ‘a’, shows the expected increased density values compared to water molecules. However, cases ‘d’ (human Na+/K+ P-type ATPase) and ‘e’ (voltage-gated sodium channel) appear very similar to the proposed Na+ assignment in PSII. We conclude that map density alone is not enough to distinguish between Na+ and water molecules, and we rely on the additional experiments described by Wang et al., which show increased PSII activity at elevated Na+ levels in basic conditions.

      8) The cryo-EM maps showing CP29-Ser84 phosphorylation and CP47-Cys218 sulfinylation are quite convincing. However, it is interesting that these modifications are only observed in the compact conformation, and not in the stretched conformation. Can the authors elaborate on whether or not they believe the compact and stretched conformations could be a result of these posttranslational modifications, or vice versa?

      This is an interesting suggestion. In our opinion it is less likely that the modifications themselves trigger the transition between compact and stretched states. It is not clear how these modifications would stabilize the compact vs the stretched states. It is equally likely that these modifications are somehow triggered by the structural change. We cannot be certain that these modifications are not present in the stretched orientation as well but remain unobserved due to resolution differences. The correlation between the states and post-translational modifications should be verified before a discussion of their possible roles in the transitions.

      9) Do the authors believe that PSII dimers in the solution can readily interconvert between compact and stretched conformations? Or is the relative ratio of these conformations fixed at the time of membrane solubilization with decyl-maltoside?

      We think that it is more probable that the transition between these states occurs in the membrane phase. The main reason for this is that pigment loss and structural transitions in CP29 are more likely to occur in the membrane than in aqueous/micelle environments.

      10) The model proposed for divalent cation-mediated stacking of PSII dimers is compelling, and seems to be in agreement with previous investigations that observed a lack of stacked dimers in cryo-EM preparations lacking calcium/magnesium. However, my understanding from reading the methods section is that the observed lack of density between the stacked PSII dimers was inferred from maps obtained after multibody refinement. Based on the way the masks to define bodies were created for multibody refinement (Fig. 4A), the region between stacked dimers would be highly prone to map artifacts following multibody refinement. Have the authors looked closely at the interfacial region between stacked dimers following conventional 3D classification/refinement to ensure that there are indeed no features observed in the interfacial region even at low contour levels?

      We’ve made several attempts to resolve differences in the space between the stacked PSII dimers. These include focused classification with masks containing selected volumes from this region and masks that include only one of the stacked PSII dimers to avoid signal subtraction in this region. None of these revealed any discernible features in this region. In addition, any stable binding of a bridging protein across the stacked dimer would probably be at least partially visible as additional density over the unstacked PSII. We searched for such features and found none.

    1. Author Response:

      Reviewer #4 (Public Review):

      In this work, Tee et al. study the implications of Heparan Sulfate (HS) binding mutations observed on the Enterovirus A71 (EV-A71) capsid. HS-binding mutations are observed for several virus infections and are often presumed to be a cell culture adaptation. However, in the case of EV-A71, the presence of HS-binding mutations in clinical samples and the contradictory findings in animal studies have made the clinical relevance of HS-binding a subject of debate. Therefore, to better understand the role of HS-binding in EV-A71, the authors use a mouse-adapted EV-A71 variant (MP4) and compare it to a cell-adapted strong HS-binder (MP4-97R/167G). Using these two variants, the authors show that the strong HS-binder does not require acidification for uncoating and genome release. Furthermore, it is demonstrated that the capsid stability of the HS-binding variant is compromised, resulting in pH-independent uncoating. Overall, this study provides new insights demonstrating that seemingly beneficial mutations increasing viral replication may be counterbalanced by other unintended consequences.

      Strengths:

      The thoroughness of the experiments performed to demonstrate that the HS-binding phenotype results in pH-independent entry and capsid destabilisation is worth highlighting. In this regard, the authors have explored viral entry using a range of approaches involving lysosomotropic drugs, viral binding assays, and neutral red-labelled viruses coupled with diverse techniques such as FISH, RNAscope, and transient expression of constitutively active molecules to inhibit parts of the viral cycle. In my opinion, this is necessary to rule out the other downstream effects of the lysomotropic drugs and to confirm the role of the HS-binding mutation in the entry phase. The use of in silico analysis coupled with negative staining electron microscopy and environmental challenge assays is notable. Finally, the demonstration of some of the work using a human-relevant strain is commendable.

      We appreciate the reviewer's recognition of the significance of our study and the valuable advice.

      Weaknesses:

      A major weakness in this study is the focus on using a mouse-adapted EV-A71 strain (MP4). In the introduction, it is argued that HS-binding mutations are controversial due to their occurrence in cell culture. However, due to host limitations, mice are not the natural hosts for EV-A71 and thus, the same argument can be made for a mouse-adapted strain. It is not clear how different this strain is from circulating EV-A71 strains and the relevance of these findings to the human situation is questionable. This is particularly made evident in the discussion where it is highlighted that HS-binding variants (VP1-145G/Q mutants) have been associated with severe neurological cases while the same variants show attenuated phenotypes in mice and monkeys. This contrast between clinical data and animal studies should be highlighted in the introduction, rather than later in the discussion, as currently the in vivo animal studies are presented as the optimal situation and may lead to misconstrued conclusions from the results.

      As requested by the reviewer, we included new experiments performed with a clinical strain isolated from an immunosuppressed patient (Cordey et al., 2012). We compared the HCQ sensitivity of this human strain with and without the VP1 L97R and E167G mutations and confirmed a differential sensitivity to HCQ similar to that observed with the MP4 variant. This result is presented as a new supplementary figure (Figure 6-figure supplement 1) and is described in the Results section of the revised manuscript (Page 7, line 251).

      Page 7, lines 251: To determine if our observations are applicable to human strains, we examined the sensitivity of a closely related clinical strain. This strain was isolated from the respiratory tract of an immunosuppressed patient with a disseminated EV-A71 infection27. Additionally, we tested a strong HS-binding derivative that harbors the same VP1-L97R and E167G mutations as our MP4 double mutant. Notably, this human clinical strain shares 98.3% amino acid similarity with the MP4 variant used in this study and exhibits similar HS-binding phenotypes28. As shown in Figure 6-figure supplement 1, the original human strain was inhibited by HCQ, whereas the double mutant exhibited insensitivity to the drug.

      We also added the comment about discrepancy between clinical data and animal studies in the introduction as requested (page 2, lines 69-76): However, epidemiological surveillance of human EV-A71 infections19-21 and experimental evidence from 2D human fetal intestinal models22, human airway organoids23 and air-liquid interface cultures24 suggest that HS binding may enhance viral replication and virulence in humans. In addition, recent research has shown that EV-A71 can be released and transmitted via cellular extrusions25 or exosomes26, potentially preventing viral trapping of HS-binding strains in the circulation. Further studies are required to evaluate the true impact of HS-binding mutations on the spread and virulence of EV-A71 in both animal models and humans.

      An important consideration is that the results are based primarily on image analysis. The inclusion of RT-qPCR and/or plaque assays as supplementary data will help strengthen the findings.

      We have performed RT-qPCR to confirm the immunostaining data and included them in the supplementary data (Figure 1-figure supplement 1E). Reference to these data is made in the result section [Page 4, lines 114-116: These results were confirmed by viral load quantification with real-time RT-PCR (Figure 1-figure supplement 1E).]

      Moreover, there are suggestions of an intermediate binder having a different phenotype. As this intermediate binder is the clinical phenotype, data on the entry of this intermediate binder will be valuable.

      While we agree with the reviewer that the single mutant is an intermediate binder and exhibits a clinical phenotype, we made the decision to work with variants that display clear phenotypes, selecting MP4 and the double mutant, as the latter is fully attenuated in both immunocompetent and immunosuppressed mice (Weng et al., 2023). Additionally, we performed an experiment using HCQ, where we observed an intermediate effect with the single mutant. This further confirmed our decision to proceed with MP4 and the double mutant for all experiments. The data supporting this are shown in Author response image 1, which we are sharing exclusively with the reviewer.

      Author response image 1.

      Differential sensitivity of MP4, MP4-97R and MP4-97R167G to Lysosomotropic drugs

      Another weakness in the study is the lack of contextualization of the results to current EV-A71 literature. For instance, SCARB2 is referred to as the internalization receptor but a recent study has shown that SCARB2 is not required for internalization (https://doi.org/10.1128%2Fjvi.02042-21). The findings from this study are consistent with the localization of SCARB2 in the lysosomal membranes. Furthermore, the same study has highlighted host sulfation as a key factor in EV-A71 entry. Post-translational sulfation introduces negatively charged residues on host proteins including HS and SCARB2. This increases the binding of HS-binding strains to these proteins. In this regard, the reduced infectivity upon soluble SCARB2 treatment may simply be due to enhanced binding rather than capsid opening as suggested in the results. Therefore, additional experiments (e.g. nSEM following soluble SCARB2 treatment) must be performed to support the conclusion of capsid opening, due to inherent instability, upon SCARB2 binding.

      We apologize for not citing this relevant literature, which excludes a role for SCARB2 in viral attachment. We have now included these references in the revised version of the manuscript (Page 2, lines 54-56: “Since SCARB2 is mostly localized on endosomal and lysosomal membrane and sparsely on plasma membrane3,5, it seems to play only a minor role in EV-A71 cell attachment6,7.”).

      We thank the reviewer for mentioning the possibility that the sulfation of SCARB2 may enhance its binding to the mutated virus compared to the wild-type virus, potentially explaining the selective competitive inhibition of this variant by soluble SCARB2 produced in mammalian cells. To investigate this hypothesis, we performed nsEM imaging of the double mutant incubated with soluble SCARB2 and we observed an increase in the proportion of empty capsids in the presence of soluble SCARB2 (4% versus 0.7%), supporting our original findings that the inactivation is indeed associated with capsid opening. The results are included in the revised manuscript in Figure 5-figure supplement 4 and described on Page 7, lines 243-245: “However, the double mutant exhibited a ~5-fold increase in empty capsid percentage after treatment with sSCARB2 (Figure 5-figure supplement 4), consistent with the functional data above.”
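      The fold increase quoted above can be checked directly from the two reported percentages. The following is a minimal sketch of that arithmetic only; the function and variable names are ours and are not part of the analysis described in the manuscript.

```python
def fold_change(treated: float, control: float) -> float:
    """Ratio of treated to control; both values must use the same units."""
    if control <= 0:
        raise ValueError("control value must be positive")
    return treated / control


# Empty-capsid percentages quoted in the response above.
empty_with_sscarb2 = 4.0     # % empty capsids with soluble SCARB2
empty_without_sscarb2 = 0.7  # % empty capsids without soluble SCARB2

fc = fold_change(empty_with_sscarb2, empty_without_sscarb2)
print(f"{fc:.1f}-fold increase in empty capsids")
```

      With the rounded percentages given here the ratio falls between 5 and 6, in line with the approximate figure stated in the revised text.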

      In addition to the above, other existing literature on EV-A71 pathogenesis using organoids contradicts some of the explanations of differential phenotype in clinical observations versus mice models. In the introduction, it is suggested that reduced neurovirulence of HS-binding strains is due to binding to the vascular endothelia. However, the correlation of clinical severity to viremia (https://doi.org/10.1186/1471-2334-14-417) and the association of HS-binding mutants to clinical disease counteract this suggestion. Similarly, viral infection in human organoids with EV-A71 results in as low as 0.4% of the cells being infected (https://doi.org/10.1038/s41564-023-01339-5). In this case, if viral binding to (ubiquitously expressed) HS results in viral trapping then the HS-binding mutants should show lowered infectivity in organoid models rather than the observed higher infectivity (https://doi.org/10.3389/fmicb.2023.1045587, https://doi.org/10.1038/s41426-018-0077-2). Finally, EV-A71 release has also been shown to occur in exosomes (https://doi.org/10.1093%2Finfdis%2Fjiaa174) which effectively provides a protective lipid membrane. These recent findings must be incorporated into the article and will help better contextualize their findings.

      We appreciate the reviewer's thoughtful comments. We do not believe that the correlation between clinical severity and viremia contradicts the viral trapping hypothesis. For strains that do not bind to HS, the absence of viral trapping could indeed lead to higher viral concentrations in the bloodstream, potentially increasing neurovirulence. However, we agree with the reviewer that other observations in humans, along with experimental data from more relevant models such as organoids, challenge the trapping hypothesis. We are grateful for the suggested citations and have incorporated these references in the introduction, where we discuss this point in more detail:

      Page 2, lines 69-76: “However, epidemiological surveillance of human EV-A71 infections19-21 and experimental evidence from 2D human fetal intestinal models22, human airway organoids23 and air-liquid interface cultures24 suggest that HS binding may enhance viral replication and virulence in humans. In addition, recent research has shown that EV-A71 can be released and transmitted via cellular extrusions25 or exosomes26, potentially preventing viral trapping of HS-binding strains in the circulation. Further studies are required to evaluate the true impact of HS-binding mutations on the spread and virulence of EV-A71 in both animal models and humans.”

      Overall, the authors present new findings with convincing methodology. The manuscript can be improved in the contextualization of the findings and highlighting the weakness in translating these findings to resolve the debate surrounding the relevance of HS-binding phenotype. The inclusion of additional experiments and data recommended to the authors will also help strengthen the manuscript.

    1. Author Response:

      Reviewer #1 (Public Review):

      This manuscript investigates the gene regulatory mechanisms that are involved in the development and evolution of motor neurons, utilizing cross-species comparison of RNA-sequencing and ATAC-sequencing data from little skate, chick and mouse. The authors suggest that both conserved and divergent mechanisms contribute to motor neuron specification in each species. They also claim that more complex regulatory mechanisms have evolved in tetrapods to accommodate sophisticated motor behaviors. While this is strongly suggested by the authors' ATAC-seq data, some additional validation would be required to thoroughly support this claim.

      Strengths of the manuscript:

      1) The manuscript provides a valuable resource to the field by generating an assembly of the little skate genome, containing precise gene annotations that can now be utilized to perform gene expression and epigenetic analyses. The authors take advantage of this novel resource to identify novel gene expression programs and regulatory modules in little skate motor neurons.

      2) Cross-species RNA-seq and ATAC-seq data comparisons are combined in a powerful approach to identify novel mechanisms that control motor neuron development and evolution.

      Weaknesses:

      1) It is surprising that the analysis of RNA-seq datasets between mouse, chick, and little skate only identified 5 genes that are common between the 3 species, especially given the authors' previous work identifying highly conserved molecular programs between little skate and mouse motor neurons, including core transcription factors (Isl1, Hb9, Lhx3), Hox genes and cholinergic transmission genes. This raises some questions about the robustness of the sequencing data and whether the genes identified represent the full transcriptome of these motor neurons.

      To address reviewer #1’s questions, we have generated RNA sequencing data from mouse forelimb MNs and re-analyzed the RNA-seq data using only the homologous MN populations (Figure 3) among the different species. As a result, many genes (1038 genes) are found to be commonly expressed in MNs across species, including many known MN marker genes. In the Results section, we have added the following:

      “The evolution of genetic programs in MNs was investigated unbiasedly by comparing highly expressed genes in pec-MNs (percentile expression > 70) of little skate with the ones from MNs of mouse and chick, two well-studied tetrapod species. In order to compare gene expression with homologous cell types from each species, we performed RNA sequencing on forelimb MNs of mouse embryos at embryonic day 13.5 (e13.5) and wing level MNs of chick embryos at Hamburger-Hamilton (HH) stage 26–27…”

      We have also compared our re-analysis with previous results in Figure 2–figure supplement 1, shown above. Most of the fin MN genes (21/24) are highly expressed in pecMNs (percentile > 70), consistent with the previous in situ experiments. In the Results we have added the following:

      “Although the total number of DEGs are different from the previous data (592 vs. 135 genes in pec-MN DEGs), which might be caused by different statistical analysis with different reference genome, previous RNA-seq data based on de novo assembly and annotation using zebrafish was mostly recapitulated in our DEG analysis based on our new skate genome (21 out of 24 previous fin MN marker genes have the expression level ranked above 70th percentile in Pec-MNs; Figure 2‒figure supplement 1).”
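      The selection described in the passages above (keep genes above the 70th expression percentile in each species, then compare the resulting sets) can be illustrated as follows. This is only a sketch under stated assumptions: the percentile-rank definition, the function names, and the toy expression values are hypothetical and do not reproduce the actual pipeline or data.

```python
from bisect import bisect_left


def highly_expressed(expr: dict[str, float], cutoff: float = 70.0) -> set[str]:
    """Return genes whose expression percentile rank exceeds `cutoff`.

    Assumption: percentile rank is the share of genes in the same sample
    with strictly lower expression.
    """
    values = sorted(expr.values())
    n = len(values)
    selected = set()
    for gene, value in expr.items():
        rank = 100.0 * bisect_left(values, value) / n
        if rank > cutoff:
            selected.add(gene)
    return selected


def shared_genes(*gene_sets: set[str]) -> set[str]:
    """Genes present in every per-species set (assumes ortholog-matched names)."""
    return set.intersection(*gene_sets)


# Hypothetical toy data: ten placeholder genes per species.
skate = {f"g{i}": float(i + 1) for i in range(10)}
mouse = {f"g{i}": float(10 - i) for i in range(10)}
mouse["g9"] = 20.0  # make one gene highly expressed in both species

common = shared_genes(highly_expressed(skate), highly_expressed(mouse))
print(sorted(common))
```

      In practice the gene identifiers would first need to be mapped to orthologs across species before any intersection is meaningful.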

      2) The authors suggest based on analysis of binding motifs in their ATAC-seq data that the greater number of putative binding sites in the mouse MNs allows for a higher complexity of regulation and specialization of putative motor pools. This could certainly be true in theory but needs to be further validated. The authors show FoxP1 as an example, which seems to be more heavily regulated in the mouse, but there is no evidence that FoxP1 expression profile is different between mouse and skate. It is suggested in Fig.5 that FoxP1 might be differentially regulated by SnaiI in mouse and skate but the expression of SnaiI in MNs in either species is not shown.

      We have added further discussion and data about differential expression of Foxp1 in mouse and little skate in Figure 5–figure supplement 16 and have discussed as follows:

      “Foxp1, the major limb/fin MN determinant appears to be differentially regulated in tetrapod and little skate. Although Foxp1 is expressed in and required for the specification of all limb MNs in tetrapods, Foxp1 is downregulated in Pea3 positive MN pools during maturation in mice (Catela et al., 2016; Dasen et al., 2008). In addition, preganglionic motor column neurons (PGC MNs) in the thoracic spinal cord of mouse and chick express half the level of Foxp1 expression than limb MNs. Although PGC neurons have not yet been identified in little skate, we tested the expression level of Foxp1 using a previously characterized tetrapod PGC marker, pSmad. We observed that Foxp1 is not expressed in MNs that express pSmad (Figure 5‒figure supplement 3). Since there is currently no known marker for PGC MNs in little skate, our conclusion should be taken with caution.”

      As for Snai1, in the revision we performed a motif enrichment analysis with an unbiased gene list in which Snai1 did not appear. However, when we performed an RNA in situ hybridization experiment for Snai1 (Figure 5–figure supplement 3), we found that Snai1 is expressed in MNs of both mouse and little skate, but not in chick, as has been shown previously (Cheung et al., 2005). In order to examine the function of Snai1 in the regulation of Foxp1 expression, we ectopically expressed Snai1 in chick spinal cord by in ovo electroporation. However, we did not detect any changes in Foxp1. Instead, we observed an increase in the number of neurons and abnormal MN exits from the spinal cord, which is reminiscent of a previous observation (Zander et al., 2014). Although we did not detect any changes in Foxp1 expression, we cannot rule out the possibility that Snai1 regulates Foxp1 in mouse and little skate, which may require a gene knockout experiment to test. Because binding sites of Snai1 were not enriched in the new gene sets that we analyzed in the revision, we have not further discussed Snai1 in the text.

      3) In their discussion section the authors state that they found both conserved and divergent molecular markers across multiple species but they do not validate the expression of novel markers in either category beyond RNA-seq, for example by in situ or antibody staining.

      We have added RNA in situ hybridization results in Figure 3C and Figure 3–figure supplement 1 and 2. Most of the genes were expressed in tissues in accordance with the sequencing results (6 out of 9 common MN genes; 4 out of 6 mouse-specific genes; 5 out of 7 skate-specific genes). Specifically, Uchl1, Slc5a7, Alcam, and Serinc1 are expressed in MNs of all three species; Coch, Ppp1rc, Ctxn1, and Clmp are expressed in MNs of mouse but not in MNs of other species; Eya1, Etv5, Dnmbp, and Spint1 are expressed in MNs of skate but not in MNs of other species. In the Results section, we have summarized the results as follows:

      “These results were validated by performing RNA in situ hybridization in tissue sections on a subset of species-specific genes …”

    1. Author Response

      Reviewer #2 (Public Review):

      Regulation of NAD and its intermediary metabolites is of critical importance in axon degeneration and neurodegenerative disease. Mounting evidence supports a scenario in which low NAD, and high NMN triggers axon degeneration by competitive allosteric inhibition/activation of SARM1. Strategies to increase NAD levels and/or lower NMN levels provide neuroprotection in a variety of contexts. NAD metabolism is a partially conserved process, however, there are key differences in pathway routes and dynamics between model organisms used for NAD research (yeast, worm, fly, zebrafish, mouse/mammalian systems). Drosophila is a key model organism for axon degenerative research based on its ease of use and range of available genetic tools, in addition, the effector of axon degeneration - SARM1 - was first identified in the fly. As Drosophila has some key differences in the NAD synthesis pathways to mammalian systems it is important to test and develop tools to enable exploration of these pathways on the fly. Llobet Rosell and colleagues have developed clear and demonstrable tools in Drosophila for exploring NAD-related axon degenerative pathways by modulating the use of NMN via the addition of NMN consuming and NMN generating enzymes. They utilize Drosophila genetics to adequately support the claims made in the manuscript. Importantly, the authors well-demonstrate that consuming NMN through an alternate route to NaMN provides neuroprotection and that the neuroprotective components of low NMN are upstream of SARM1. These should be useful tools for neuroscientists in the future to use Drosophila for neurodegenerative research.

      Strengths:

      • Clear demonstration that low NMN provides neuroprotection using novel, stable, enzymatic depletion of NMN (to NaMN).

      • Development of a novel Drosophila tool (NMN-D transgenics) to explore NMN metabolism in vivo, including a stabilized version to permit chronic NMN depletion.

      • Metabolomic profiles across the pathway to show all pathway changes (not just isolated NMN or NAD assays).

      • Neurodegenerative assays that include both histological outcomes (axon degeneration) but also circuitry/functional outcomes. Data from both series of experiments all support each other.

      • Assessment of other known potent axon degenerative genes via genetics in combination with the tools developed.

      • Staging of the molecular processes by strategic ablation of the inhibitory ARM domain on SARM1 (dSarm deltaARM). These experiments suggest that low NAD AND high NMN (i.e. ratio between the two) is the critical factor that drives axon degeneration. Once NAD is low, axon degeneration cannot be recovered by further lowering of NMN. The dSarm delta-ARM and dnmnat sgRNAs experiments support a hypothesis in which (high) NMN triggers, but doesn't execute, axon degeneration.

      We appreciate the reviewer's recognition of the quality of our research.

      Weaknesses:

      • The authors use murine NAMPT (mNAMPT) to increase NMN. The degeneration assays support the hypotheses made, yet mNAMPT doesn't actually increase NMN. Thus it is unclear in this setting whether mNAMPT promotes axon degeneration by an NMN-related mechanism or through another route. It is also unclear as to why the murine form was chosen versus a human or other orthologues, or changing the metabolism of the intrinsic pathway (NR and NRK).

      Why mNAMPT:

      We decided to use mouse NAMPT (mNAMPT) because it was readily available from Giuseppe Orsomando (Amici et al., 2017), and because we did not have access to human NAMPT (hNAMPT).

      We agree with the observation that under physiological conditions, the expression of mNAMPT does not change NMN. However, we argue that after injury, once dNmnat is degraded, the additional NMN synthesis provided by mNAMPT expression (in addition to dNrk) leads to a faster NMN accumulation. This is supported by the observation that NMNAT2 is more labile than NAMPT in mammals (Gilley and Coleman, 2010; Stefano et al., 2015).

      • The authors use metabolic profiling to look at the individual metabolites during axon degenerative events and treatments; however, it is unclear if any of these proteins or genes change as a consequence. This is likely not important for understanding the findings; however, it might be helpful in explaining the mNAMPT data.

      We agree with the idea to test whether there is a change induced at the mRNA or protein level when the metabolic flux is altered. To do this, we first measured the relative expression levels of axon death and NAD+ synthesis genes (Figure 2 – figure supplement 1B). Then, we measured potential changes upon mNAMPT expression (Figure 4 – figure supplement 1). Importantly, while the Gal4-driven expression resulted in an increase of relative mNAMPT transcript abundance from 30 to 12,000, the change observed in the other genes was not notable. Of note, compared to Actin–Gal4, dnrk is 2-fold lower in both the UAS-mNAMPT and Actin > mNAMPT backgrounds (control vs. experiment, respectively). Thus, overall, there appears to be no change in mRNAs of either axon death or NAD+ synthesis genes.

      In the results, we changed the text accordingly:

      "We then tested the effect of mNAMPT on the NAD+ metabolic flux in vivo. Surprisingly, NAM, NMN, and NAD+ levels remained unchanged under physiological conditions (Figure 4C). However, we noticed 3-fold higher NR and a moderate but significant elevation of ADPR and cADPR levels upon mNAMPT overexpression (Figure 4C). We also asked whether mNAMPT impacts on NAD+ homeostasis thereby altering the expression of axon death or NAD+ synthesis genes. Besides the expected significant increase in the Gal4-mediated expression of mNAMPT, we did not observe any notable changes at the mRNA level (Figure 4 – figure supplement 1)."

      • The authors repeatedly introduce a novel PncC antibody. However, no details on this, its generation, or its testing are found within the manuscript as presented. The antibody detects several bands. The authors speculate that this could be a degradation product but nothing substantial is shown.

      In Materials and methods, we added a new section:

      "PncC antibody generation Rabbit anti-PncC antibodies were generated by Lubioscience under a proprietary protocol. The immunogen used was purified from Escherichia coli, strain K12, corresponding to the full protein sequence of NMN-D. The amino acid sequence is the following: MTDSELMQLSEQVGQALKARGATVTTAESCTGGWVAKVITDIAGSSAWFERGFVTYSNEAKAQMIGVREETLAQHGAVSEPVVVEMAIGALKAARADYAVSISGIAGPDGGSEEKPVGVWFAFATARGEGITRRECFSGDRDAVRRQAT AYALQTLWQQFLQNT"

      We also updated the results referencing it.

      "We found that both wild-type and enzymatically dead NMN-D enzymes are equally expressed in S2 cells, as detected by newly generated PncC antibodies (Materials & Methods, Figure 1–figure supplement 2). Notably, we observed two immunoreactivities per lane, with the lower band being a potential degradation product."

      In addition, we now provide evidence for why we believe that the upper band is NMN-D, while the lower one is a degradation product. In the figure attached below, the samples in the first five lanes were denatured at 70 °C, while the samples in the last two lanes were denatured at 95 °C (each for 10 min). The resulting Western blot shows that at 70 °C there is more nonspecific background but no lower degradation product, whereas at 95 °C the background is drastically reduced but a lower degradation product appears. NMN-D is indicated by an asterisk. We feel that it is important to show these data here in the rebuttal, but that including them in the manuscript would confuse readers.

      • Olfactory receptor neuron degeneration assays are shown in Fig1 but no data is presented with it to support the images.

      We agree that a quantification would support our observation. However, it is difficult to precisely quantify individual axons in the ORN injury assay, for two main reasons:

      1. Severed axons are often bundled, thus the exact number cannot be scored.

      2. Due to the removal of the cell body, the axonal GFP intensity decreases over time, due to the absence of mCD8::GFP synthesis. It adds another level of difficulty. Nevertheless, we added numbers to each example in Figure 1E and D, where we quantified the % of brains where severed preserved axons were observed, similar to Figure 2 in (MacDonald et al., 2006).

      In the results section, we changed the text as indicated below:

      "We extended the ORN injury assay and found preservation at 10, 30, and 50 dpa (Figure 1E). While quantifying the precise number of axons is technically not feasible, severed preserved axons were observed in all 10, 30, and 50 dpa brains, albeit fewer at later time points (MacDonald et al., 2006). Thus, high levels of NMN-D confer robust protection of severed axons for multiple neuron types for the entire lifespan of Drosophila."

      In the Figure 1 legend, we changed the text accordingly:

      "D Low NMN results in severed axons of olfactory receptor neurons that remain morphologically preserved at 7 dpa. Examples of control and 7 dpa (arrows, site of unilateral ablation). Lower right, % of brains with severed preserved axon fibers. E Low NMN results in severed axons that remain morphologically preserved for 50 days. Representative pictures of 10, 30, and 50 dpa, from a total of 10 brains imaged for each condition (arrows, site of unilateral ablation). Lower right, % of brains with severed preserved axon fibers."

    1. Author Response

      eLife assessment:

      This paper conducts human and rodent experiments of non-invasive diffusion MRI estimates of axon diameter with the aim to establish whether these estimates provide biologically specific markers of axonal degeneration in MS. It will be of interest to researchers developing quantitative MRI methods and scientists studying neurodegeneration. The experiments provide evidence for the sensitivity of these markers, but do not directly validate axon diameter and do not reflect common pathological mechanisms across rodents and humans.

      We thank the Editor for the appreciation of our work. Thanks to the addition of an extensive electron microscopy paradigm, we now include a direct validation of axonal damage and expand on the common pathological mechanisms across the two species. The new results are detailed in the manuscript and summarized in Fig. 3.

      Reviewer #1 (Public Review):

      1.1 My primary concern relates to how meaningful the human-rodent comparisons are, and whether these comparisons really advance our understanding of AxCaliber estimates in MS. I applaud the aim to conduct "matched" experiments in both rodent models and human disease. It is a strength that the experiments are aligned with respect to the MRI measurements (although there are some caveats to this mentioned below). But beyond that, the overlap is not what one might hope for: the pathology would seem to be very distinct in humans and rodents, and the histological validation is not specific to what the MRI measurements claim to estimate. To summarize the main findings: (i) in a rat model of general axonal degeneration, axon calibre estimates correlate with neurofilaments; (ii) in MS in humans, axon calibre estimates correlate with demyelinating lesions. This gives a picture of AxCaliber estimates correlating with neuropathology, but is this something that has not already been established in the literature? If the aim is to validate AxCaliber, then there is a logic in using a rodent model that isolates alterations to axonal radius, but what then does this add to the existing literature in that space? If the aim is to study MS (for which AxCaliber results have been previously reported in Huang et al), then why not use a rodent model of MS?

      We thank the reviewer for their very insightful comments. Indeed, multiple sclerosis (MS) is a chronic neuroinflammatory and neurodegenerative disease of unknown etiology. An enormous effort has been made to obtain animal models that simulate the pathogenesis of this disease. However, while several models exist recapitulating distinct aspects of the disease (mostly related to demyelination), MS fundamentally remains a disease that only affects humans. This does not mean that EAE or lysolecithin models are uninformative; they provide information on specific aspects of the disease and are therefore valuable. In fact, we believe that trying to replicate the pathological mechanisms of this disease in an animal model goes beyond the scope of the present work. In this work, our intention is to validate a biomarker of axonal damage preclinically, and for this, we use a model of axonal degeneration. We do not claim that this model should be valid to capture the complex clinical and pathological manifestation of MS, but we do think that it is a necessary step to ensure MRI sensitivity to axonal pathology. Why necessary? Because the (very limited) available MRI literature that provides some form of validation i) focuses only on healthy tissue, and ii) has an n of 1. Our preclinical paradigm gives conclusive evidence that the MRI axonal diameter proxy detects axonal damage as an increase in the mean diameter. This is now detailed in the discussion.

      After this necessary preclinical validation, we then apply the same framework to a human disease like MS that, among other manifestations, is believed to also cause axonal pathology. The improvements with respect to the one published work about axonal diameter in MS are: i) the whole brain analysis, which allowed us to characterize the extent of these early alterations outside the demyelinated lesions; and ii) the larger sample size, which allowed us to uncover an association with disease duration, strengthening our hypothesis about increased axonal diameter being a marker of early disease (new Fig. 5).

      Regarding the nonspecificity of histological validation, we thank the reviewer for this insightful comment, which triggered an additional analysis that we believe has added further value to the paper. Using electron microscopy, we found that in our model of neurodegeneration, axonal damage is indeed reflected as an increase in axon diameter (new Fig. 3). These recent findings strongly support the validation of our noninvasive diffusion MRI estimates of axon diameter alterations as an early-stage hallmark of normal-appearing tissue in MS.

      Coming back to the comparison between pathology in humans and in rodents, the EM data also support our choice of preclinical model, showing axonal swelling, the same phenomenon reported and characterized in recent postmortem histological data in the normal-appearing white matter of MS patients (Luchicchi et al., Ann Neurol 2021) and in lesions (Fisher et al. Ann Neurol 2007).

      All in all, we are confident that the new data support the validity of this translational approach and shed new light on the degenerative aspect of MS.

      Changes in the manuscript

      • Discussion, pag.12: It is important to stress that the aim of this work is not to propose a new animal model of MS, a disease that only affects humans, but rather to validate axonal damage detection (independently from the pathology that has induced it) through noninvasive MRI and apply the framework to characterize axonal pathology in MS.

      1.2 I appreciate that both rodent and patient studies are time intensive, major endeavors. Nevertheless, the number of subjects is very low in both rodent (n=9) and human (MS=10, control=6) studies. At the very least, this should be more openly acknowledged. But I'm concerned that this is a major weakness of the paper. Related to this, I find it hard to tell how carefully multiple comparison correction was performed throughout. It seems reasonably clear for the TBSS analyses, but then other analyses were performed in ROIs. Are these multiple comparisons corrected as well? Similarly, in Methods, I am confused by the statement that: "post hoc t tests corrected for multiple comparisons whenever a significant effect was detected". What does this mean?

      We thank the reviewer for this comment. We agree that a small sample size was a weakness of the previous version of the paper, and therefore, in the new version, we have substantially increased the n for both animal and human experiments (from n=9 to 19 in animals, from 16 to 21 in humans). We removed the ROI analysis in the new version, and thus the confusing statement, and clarified the strategy for multiple comparisons.

      Changes in the manuscript

      • Data analysis, pag. 18: Lesion masks were excluded from the statistical analysis, and multiple comparisons across clusters were controlled for by using threshold-free cluster enhancement.

      1.3 While I do not think the text is in any sense deliberately misleading, I think the authors would do well to either tone down their claims or consider more carefully the implications of the text in many places. Some that stuck out for me are:

      Throughout, language in the paper (e.g., "Paired t tests were used to assess differences in the axonal diameter") presumes that the AxCaliber estimates specifically reflect axon diameter. I think the jury is out over whether this is true, particularly for measurements conducted with limited hardware specs. At the very least, I would encourage the author to refer to these measurements throughout as "estimates" of axon diameter.

      Thank you for this clarification. We have indeed changed the notation, and now consistently refer to the estimates of axon diameter through MRI as the “MRI axonal diameter proxy”.

      1.4 The authors suggest that their results provide "new tools for patient stratification" based on differences in lesion type, but it isn't clear what new information these markers would confer given that the lesions are differentiated based on T1w hypo/hyperintensities. In other words, these lesions are by definition already differentiable from a much simpler MRI marker.

      Thank you for this insightful comment. The reviewer is right, and following the general reviewers’ assessment we have decided to not include the lesion analysis in the new version of the manuscript.

      1.5 The authors note in the Discussion that: "sensitive to early stages of axonal degeneration, even before alterations in the myelin sheet are detected". Whether intentional or not, the implication in the context of this study is that this would hold for MS (that these markers would detect axonal degeneration preceding demyelination). While there is some discussion of alterations to axonal diameter in MS, the authors do not discuss whether these are the same mechanisms thought to occur in the IBO intervention used here.

      Thank you for this comment. Indeed, the scope of the paper is not to assess whether axonal swelling precedes or not myelin alterations, so we agree with the reviewer that this sentence might be misleading and have removed it in the text. While we do not claim that ibotenic acid injections are able to replicate the complex clinical and pathological manifestation of MS (and now we made it clear in the revised manuscript, see comment 1), the electron microscopy paradigm indicates the presence of axonal swelling in the damaged fimbria, which is indeed the same pathological manifestation found in MS post-mortem data (see e.g. Fisher et al. Ann Neurol 2007).

      1.6 In the Discussion, the authors note the lack of evidence for a relationship with disability or disease duration, but nevertheless, go on to interpret the "trends" they do observe. I would advise strongly against this: the authors acknowledge that their numbers are low, so I would avoid the temptation to speculate here.

      The reviewer is 100% correct. We should have refrained from speculating. In the new version of the paper, however, thanks to the larger human cohort, we were able to find significant associations with disease duration in voxelwise analysis of the white matter skeleton in standard space and in the whole white matter in single subject space (new Figure 5).

      1.7 In the Discussion, the authors state that "the use of neurofilaments has also been well validated in MS". Well validated for what? MS is a complex disease with a broad range of pathology, so this statement could be read to mean "neurofilaments are known to be altered in MS". However, in the context of this paragraph, the implication would seem to be that neurofilaments are a well-established proxy for axonal diameter. Is that the implication, and if so what general evidence is there for this?

      We thank the reviewer for this insightful comment. Indeed, altered neurofilaments are not conclusive evidence of increased axonal diameter. In this context, the addition of electron microscopy data in the new manuscript version supports the claim.

      Reviewer #2 (Public Review):

      Diffusion MRI is sensitive to the brain microstructure, and it has been used to assess the integrity of white matter for nearly 3 decades. Its main limitation is the limited specificity, which makes it difficult to link changes in diffusion parameters to a given pathological substrate. Recently, methods based on diffusion MRI that enable the noninvasive estimation of axonal diameter have become available. This paper aims at validating one such method using an experimental model of neurodegeneration. The authors found a significant correlation between axonal diameter estimated by MRI and a histological marker of neurodegeneration. Although this is of great interest, as it demonstrates that this method is sensitive to neurodegeneration, a direct validation would require a measurement of axonal diameter using electron or confocal microscopy, rather than a correlation with a measure of axonal degeneration not directly related to axonal diameter. So, although these data are compelling, they do not prove that the increase in axonal diameter suggested by diffusion MRI corresponds to actual axonal swelling. The authors also apply the same method to compare the white matter of patients with multiple sclerosis (MS) and healthy controls, showing widespread increases in axonal diameter in the patients. These data are compelling, but again, not conclusive. Other factors such as gliosis could bias the MRI measurement and lead to an apparent increase in axonal diameter.

      We would like to thank the reviewer for the positive assessment of our work and for the valuable suggestion. We are confident that the new version of the manuscript, by including an extensive validation based on electron microscopy, has addressed the reviewer's criticisms.

      Reviewer #3 (Public Review):

      3.1 In this paper, Toschi et al. performed dMRI to in vivo estimate axon diameter in the brain and demonstrated that multi-compartmental modeling (AxCaliber) is sensitive to microstructural axonal damage in rats and axon caliber increase in demyelinating lesions in MS patients, suggesting that axon diameter mapping provides a potential biomarker to bridge the gap between medical imaging contrasts and biological microstructure. In particular, authors injected ibotenic acid (IBO) and saline in the left and right rat hippocampus, respectively, and compared in vivo estimated axon diameter and ex vivo neurofilament staining in left and right fimbria. The axon size estimation was larger in the fimbria of IBO injection side, where the neurofilament intensity is higher. Correlation of axon size estimation and neurofilament intensity was observed in both injection sides. Further, higher axon diameter estimation was observed in normal appearing white matter (NAWM) of MS patients, compared with the healthy subjects. The axon size estimation increased in hypointense lesions of T1 weighted contrast, but not in isointense lesions. Through the comparison of dMRI-estimated axon size and histology-based fluorescence intensity, authors indirectly validated the sensitivity of axon diameter mapping to the tissue microstructure in the rat brain, and further explored the axon size change in the brain of MS patients. However, the dMRI protocol and biophysical modeling in this study were not fully optimized to maximize the sensitivity to axon size estimation, and the dMRI-estimated axon size (4.4-5.4 micron) was much larger than values reported in previous histological studies (0.5-3 micron) [Barazany et al., Brain 2009]. Finally, although the modified AxCaliber model incorporated two fiber bundles in different directions, the fiber dispersion in each bundle was not considered (cf. fiber dispersion ~20-30 degree in corpus callosum), potentially leading to overestimated axon diameter.

      We thank the reviewer for their appreciation of our work, which we believe is substantially improved in this revised version through the inclusion of an electron microscopy paradigm. Below, the point-by-point response to the specific points raised.

      3.2 The conclusions in this study are supported by experimental results. However, the dMRI protocol and biophysical model could be further optimized and validated: 1. To in vivo estimate the axon diameter ~1 micron using dMRI, strong diffusion weighting (b-value) should be applied to maximize the signal decay due to intra-axonal restricted diffusion and minimize the signal contribution of extra-cellular hindered diffusion. However, authors only applied maximal b-value = 4000 s/mm2, much smaller than values ~15,000-20,000 s/mm2 in previous studies [Assaf et al., MRM 2008; Huang et al., BSAF 2020, 225:1277]. The use of low diffusion weighting in this study leads to a lower bound ~4-6 micron for accurate diameter estimation, the so-called resolution limit in [Nilsson et al., NMR Biomed 2017, 30:e3711]. In other words, the estimated axon diameter is potentially overestimated and related with the imaging protocol and image quality, confounding the biological interpretation.

      We thank the reviewer for this insightful comment. Indeed, while the resolution limit is a concern, the chosen b-value has been a compromise between sensitivity to small structure and SNR, as indicated by recent animal (Crater et al., 2022) and human (Jensen et al., 2016; McKinnon et al., 2017; Moss et al., 2019) work, pointing at 3000-4000 s/mm2 as the b-value for which the intra-axonal water signal is dominant. In addition, a paper from the laboratory that first developed the AxCaliber method recently came out (Gast et al., 2023, DOI: 10.1007/s12021-023-09630-w) demonstrating that an MRI protocol with a maximum b-value between 3000 and 4000 s/mm2 (and even lower) is sufficient to capture, in vivo and in humans, various well-known aspects of axonal morphometry (e.g., the corpus callosum axon diameter variation) as well as other aspects that are less explored (e.g., axon diameter-based separation of the superior longitudinal fasciculus into segments). The same paper contains resources and further bibliography supporting the fact that experimental evidence suggests that the contribution of intra-axonal water to restricted diffusion signals dominates other factors (see Online Resource 1, section A of the same paper). To challenge this recent evidence from a neurobiology perspective, we include in the supplementary material a subset of experiments in animals with lower maximum b-value (2500 s/mm2, Fig. S1), where we are able to detect the same effect of increased MRI axonal diameter proxy in the injected hemisphere compared to control.

      We would like to add that while extremely valuable and informative, simulation studies such as the excellent study by Veraart et al., 2020, are inevitably valid under certain assumptions. Among them, some critical ones are i) the need to neglect nonaxonal cells such as glia, ii) assuming that the bulk diffusivity of water in cerebral tissue would be the same as that of free water, and iii) impermeable barriers. All these assumptions are expected to play a role in the estimated resolution limit, a role difficult to quantify but likely substantial.

      For this reason, we believe that our approach, which is 100% focused on neurobiology and measurements performed in real tissue, can offer a different perspective and fuel the ongoing debate on axonal diameter measurement feasibility. We acknowledge the value of the reviewer comment and discuss the issue of b-value in the discussion (see also comment 1.8).

      Changes in the manuscript

      • Discussion, pag. 12: Despite some inevitable minor differences due to different brain sizes and magnet features, the human protocol was built to match the main characteristics of the preclinical diffusion sequence, such as the b-value and diffusion time range. The chosen b-value has been a compromise between sensitivity to small structures and the signal-to-noise ratio (SNR), as indicated by recent animal (Crater et al., 2022) and human (Gast et al., 2023; Jensen et al., 2016; McKinnon et al., 2017; Moss et al., 2019) work, pointing at 4000 s/mm2 as the b-value for which the intra-axonal water signal is dominant. However, following recent work supporting sensitivity of diffusion-weighted MRI to axonal diameter even at lower b-values (Gast et al., 2023), we tested a protocol with a lower b-value in a subset of animals, with the aim of facilitating future clinical AxCaliber studies. We found no qualitative differences in the outcome (MRI axonal diameter proxy was increased following fimbria damage). Further work and perhaps more realistic simulations, considering real cell composition and morphology, are needed to clarify this issue.

      3.3 In this study, the positive correlation of dMRI-estimated axon size and neurofilament fluorescence intensity is indeed an encouraging result, and yet this validation is indirect since it relies on the positive correlation between neurofilament intensity and axon diameter in histology.

      The reviewer correctly points out a severe limitation of the previous manuscript version, which is now addressed by including an extensive electron microscopy evaluation, recapitulated in new Fig. 3.

      3.4 Authors did not consider the fiber dispersion in the proposed dMRI model. This can lead to overestimated axon diameter, even in the highly aligned WM, such as corpus callosum with ~20-30 degree dispersion in histology [Ronen et al., BSAF 2014, 219:1773; Leergaard et al., PLoS One 2010, 5(1), e8595] and MRI [Dhital et al., NeuroImage 2019, 189, 543; Novikov et al., NeuroImage 2018, 174:518].

      The reviewer is correctly pointing out an important characteristic of white matter microstructure, namely fiber dispersion. However, we would like to point out that the use of a second fiber population is expected to mitigate this effect by absorbing some axonal directional dispersion in areas of a single fiber. To support this, we quantified dispersion as the angle between the two main fiber orientations captured by the AxCaliber fit, as shown in Author response image 1 for two representative subjects (one control, upper line, and one MS, lower line; the “dispersion” maps are masked by a white matter probability mask, and superimposed on a T2w image). Indeed, the angle between the two main fibers in the corpus callosum is around 20 degrees or lower, compatible with the bibliography cited by the reviewer, and higher in other white matter areas known to be characterized by fiber crossing and dispersion.

      Author response image 1.

      Angle in radians between the two main fiber orientations captured by the AxCaliber fit, shown for two representative subjects (one control, upper line, and one MS, lower line). The dispersion maps are masked by a white matter probability mask (P>=0.95), and superimposed on a T2-weighted image.
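
      The dispersion measure described in this response, the angle between the two fiber orientations returned by the fit, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `inter_fiber_angle_deg` and the representation of each orientation as a 3-component vector are assumptions. Because fiber orientations are axial (a vector and its negation describe the same fiber), the sketch takes the absolute value of the dot product, so the result always lies between 0 and 90 degrees. It returns degrees for easy comparison with the ~20 degree figure quoted above; `math.radians` converts back if a map in radians is wanted.

```python
import math


def inter_fiber_angle_deg(v1, v2):
    """Acute angle (degrees) between two fiber orientations in one voxel.

    v1, v2: 3-component orientation vectors (hypothetical inputs; they need
    not be unit length). Returns NaN if either vector has zero length.
    """
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0.0 or n2 == 0.0:
        return float("nan")
    # Orientations are axial: v and -v are the same fiber, hence abs().
    # min() guards against rounding pushing the cosine slightly above 1.
    cos_angle = min(1.0, abs(dot) / (n1 * n2))
    return math.degrees(math.acos(cos_angle))


if __name__ == "__main__":
    primary = (1.0, 0.0, 0.0)
    secondary = (math.cos(math.radians(20)), math.sin(math.radians(20)), 0.0)
    print(round(inter_fiber_angle_deg(primary, secondary), 3))
```

      Applied voxel by voxel inside a white matter mask, a function of this kind would yield a map comparable in spirit to the one in Author response image 1.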

    1. Author Response

      Reviewer #1 (Public Review):

      This paper shows that a principled, interpretable model of auditory stimulus classification can not only capture behavioural data on which the model was trained but somewhat accurately predict behaviour for manipulated stimuli. This is a real achievement and gives an opportunity to use the model to probe potential underlying mechanisms. There are two main weaknesses. Firstly, the task is very simple: distinguishing between just two classes of stimuli. Both model and animals may be using shortcuts to solve the task, for example (this is suggested somewhat by Figure 8 which shows the guinea pig and model can both handle time-reversed stimuli).

      The task structure is indeed simple. In the context of categorization tasks that are typically used in animal experiments, however, we would argue that we are at the higher end of stimulus complexity. Auditory categories used in most animal experiments typically employ a category boundary along a single stimulus parameter (for example, tone frequency or modulation frequency of AM noise). Only a few recent studies (for example, Yin et al., 2020; Town et al., 2018) have explored animal behavior with “non-compact” stimulus categories. Thus, we consider our task a significant step towards more naturalistic tasks.

      We were also faced with the practical factor of the trainability of guinea pigs (GPs). Prior to this study, guinea pigs have been trained using classical conditioning and aversive reinforcement on detecting tone frequency (e.g., Heffner et al., 1971; Edeline et al., 1993). More recently, competitive training paradigms have been developed for appetitive conditioning, using a single “footstep” sound as a target stimulus and manipulated sounds as non-target stimuli (Ojima and Horikawa, 2016). But as GPs had never been trained on more complex tasks before our study, we started with a conservative one vs. one categorization task. We mention this in the Discussion section of the revised manuscript (page 27, line 665).

      To determine whether these results hold for more complex tasks as well, after receiving the reviews of the original manuscript, we trained two GPs (that were originally trained and tested on the wheeks vs. whines task) further on a wheeks vs. many (whines, purrs, chuts) task. As earlier, we tested these GPs with new exemplars and verified that they generalized. In the figure below, the average performance of the two GPs on the regular (training) stimuli and novel (generalization) stimuli are shown in gray bars, and individual animal performances are shown as colored discs. The GPs achieved high performance for the novel stimuli, demonstrating generalization. We also implemented a 4-way WTA stage for a wheek vs. many model and verified that the model generalized to new stimuli as well.

      For frequency-shifted calls, these two GPs performed better for wheeks vs. many compared to the average for wheeks vs. whines shown in the main manuscript. The 4-way WTA model closely tracked GP behavioral trends.

      The psychometric curves for wheeks vs. many categorization in noise (different SNRs) did not differ substantially from the wheeks vs. whines task.

      We focused our one vs. many training on the two conditions that showed the greatest modulation in the one vs. one tasks. However, these preliminary results suggest that the one vs. one results presented in the manuscript are likely to extend to more complex classification tasks as well. We chose not to include these new data in the revised manuscript because we performed these experiments on only 2 animals, which were previously trained on a wheeks vs. whines task. In future studies, we plan to directly train animals on one vs. many tasks.

      Secondly, the predictions of the model do not appear to be quite as strong as the abstract and text suggest.

      We now replace subjective descriptors with actual effect size numbers to avoid overstating results. We also include additional modeling (classification based on the long-term spectrum) and discuss alternative possibilities to provide readers with points of comparison. Thus, readers can form their own opinions of the strengths of the observed effects.

      The model uses "maximally informative features" found by randomly initialising 1500 possible features and selecting the 20 most informative (in an information-theoretic sense). This is a really interesting approach to take compared to directly optimising some function to maximise performance at a task, or training a deep neural network. It is suggestive of a plausible biological approach and may serve to avoid overfitting the data. In a machine learning sense, it may be acting as a sort of regulariser to avoid overfitting and improve generalisation. The 'features' used are basically spectro-temporal patterns that are matched by sliding a crosscorrelator over the signal and thresholding, which is straightforward and interpretable.

      This intuition is indeed accurate – the greedy search algorithm (described in the original vision paper by Ullman et al., 2002) builds the final MIF set sequentially, at each step adding the feature that contributes the most hits and the fewest false alarms relative to the existing members of the set. The latter criterion (fewest false alarms) essentially guards against over-fitting for hits alone. A second factor is the intermediate size and complexity of MIFs. When MIFs are too large, there is certainly overfitting to the training exemplars, and the model does not generalize well (Liu et al., 2019).
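
      The greedy selection described above can be illustrated with a short sketch. It is a simplified stand-in, not the published algorithm: the scoring rule used here (newly added hits minus newly added false alarms) is an assumption chosen to mirror the wording of this response, whereas the original method ranks features by an information-theoretic criterion. The function name `greedy_select` and the dictionary-of-sets input format are likewise hypothetical.

```python
def greedy_select(hits, false_alarms, n_select):
    """Greedily pick features that add hits while adding few false alarms.

    hits[f]         : set of within-category exemplar ids detected by feature f
    false_alarms[f] : set of out-of-category exemplar ids detected by feature f
    n_select        : maximum number of features to keep

    The marginal score (new hits minus new false alarms) is an illustrative
    assumption, not the criterion of the original method.
    """
    selected = []
    covered_hits, covered_fas = set(), set()
    candidates = set(hits)

    def gain(f):
        # Only detections not already provided by the selected set count.
        new_hits = len(hits[f] - covered_hits)
        new_fas = len(false_alarms[f] - covered_fas)
        return new_hits - new_fas

    while candidates and len(selected) < n_select:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=gain)
        if gain(best) <= 0:
            break  # no remaining feature improves the set
        selected.append(best)
        covered_hits |= hits[best]
        covered_fas |= false_alarms[best]
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    hits = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2, 3, 4, 5}}
    fas = {"a": set(), "b": set(), "c": {10, 11, 12, 13}}
    print(greedy_select(hits, fas, n_select=3))
```

      In the toy example, feature "c" detects the most exemplars but is never selected because its false alarms outweigh the hits it would add, which is the guard against over-fitting to hits alone that the response describes.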

      It is surprising and impressive that the model is able to classify the manipulated stimuli at all. However, I would slightly take issue with the statement that they match behaviour "to a remarkable degree". R^2 values between model and behaviour are 0.444, 0.674, 0.028, 0.011, 0.723, 0.468. For example, in figure 5 the lower R^2 value comes out because the model is not able to use as short segments as the guinea pigs (which the authors comment on in the results and discussion). In figure 6A (speeding up and slowing down the stimuli), the model does worse than the guinea pigs for faster stimuli and better for slower stimuli, which doesn't qualitatively match (not commented on by the authors). The authors state that the poor match is "likely because of random fluctuations in behavior (e.g., motivation) across conditions that are unrelated to stimulus parameters" but it's not clear why that would be the case for this experiment and not for others, and there is no evidence shown for it.

      Thank you for this feedback. There are two levels at which we addressed these comments in the revised manuscript.

      First, regarding the language – we have now replaced subjective descriptors with the statement that the model captures ~50% of the overall variance in behavioral data. The ~50% number is the average overall R2 between the model and data (0.6 and 0.37 for the chuts vs. purrs and wheeks vs. whines tasks respectively). We leave it to readers to interpret this number.

      Second, our original manuscript lacked clarity on exactly what aspects of the categorization behavior we were attempting to model. As recent studies have suggested, categorization behavior can be decomposed into two steps – the acquisition of the knowledge of auditory categories, and the expression of this knowledge in an operant task (Kuchibhotla et al., 2019; Moore and Kuchibhotla, 2022). Our model solely addresses how knowledge regarding categories is acquired (through the detection of maximally informative features). Other than setting a 10% error in our winner-take-all stage, we did not attempt to systematically model any other cognitive-behavioral effects such as the effect of motivation and arousal. Thus, in the revised manuscript, we have included a paragraph at the top of the Results section that defines our intent more clearly (page 5, line 117). We conclude the initial description of the behavior by stating that these factors are not intended to be captured by the model (page 6, line 171). We also edited a paragraph in the Discussion section for clarity on this point (page 26, line 629).

      In figure 11, the authors compare the results of training their model with all classes, versus training only with the classes used in the task, and show that with the latter performance is worse and matches the experiment less well. This is a very interesting point, but it could just be the case that there is insufficient training data.

      This could indeed be the case, and we acknowledge this as a potential explanation in the revised manuscript (page 22, line 537; page 27, line 653). Our original thinking was that if GPs were also learning discriminative features only using our training exemplars, they would face a similar training data constraint as well. But despite this constraint, the model’s performance is above d’=1 for natural calls – both training and novel calls; it is only the similarity with behavior on the manipulated stimuli that is lower than the one vs. many model. This phenomenon warrants further investigation.

      Reviewer #2 (Public Review):

      Kar et al. aim to further elucidate the main features representing call type categorization in guinea pigs. This paper presents a behavioral paradigm in which 8 guinea pigs (GPs) were trained in a call categorization task between pairs of call types (chuts vs purrs; wheek vs whines). The GPs successfully learned the task and are able to generalize to new exemplars. GPs were tested across pitch-shifted stimuli and stimuli with various temporal manipulations. Complementing this data is multivariate classifier data from a model trained to perform the same task. The classifier model is trained on auditory nerve outputs (not behavioral data) and reaches an accuracy metric comparable to that of the GPs. The authors argue that the model performance is similar to that of the GPs in the manipulated stimuli, therefore, suggesting that the 'mid-level features' that the model uses may be similar to those exploited by the GPs. The behavioral data is impressive: to my knowledge, there is scant previous behavioral data from GPs performing an auditory task beyond audiograms measured using aversive conditioning by Heffner et al. in 1970. [One exception that is notably omitted from the manuscript is Ojima and Horikawa 2016 (Frontiers)]. Given the popularity of GPs as a model of auditory neurophysiology these data open new avenues for investigation. This paper would be useful for neuroscientists using classifier models to simulate behavioral choice data in similar Go/No-Go experiments, especially in guinea pigs. The significance of the findings rests on the similarity (or not) of the model and GP performance as a validation of the 'intermediary features' approach for categorization.

      At the moment the study is underpowered for the statistical analysis the authors attempt to employ, which frequently relies on non-significant p values for its conclusions; using a more sophisticated approach (a mixed effects model utilizing single trial responses) would provide a more rigorous test of the manipulations on behavior and allow a more complete assessment of the authors' conclusions.

      We thank the reviewer for their feedback and the suggestion for a more robust statistical approach. We have now replaced the repeated measures ANOVA based statistics for the behavior and model where more than 2 test conditions were presented (SNR, segment length, tempo shift, and frequency shift) with generalized linear models with a logit link function (logistic activation function). In these models, we predict the trial-by-trial behavioral or model outcome from predictors including stimulus type (Go or Nogo), parameter value (e.g., SNR value), parameter sign (e.g., positive or negative freq. shift), and animal ID as a random effect. To evaluate whether parameter value and sign had a significant contribution to the model, we compare this ‘full’ model against a null model that only has stimulus type as a predictor and animal ID as a random effect. These analyses are described in detail in the Materials and Methods section of the revised manuscript (page 36, line 930).

      These analyses reveal significant effects of segment length changes, and weak effects of tempo changes on behavior (as expected by the reviewer). Both the behavior and model showed similar statistical significance (except tempo shift for wheeks vs. whines) for whether performance was significantly affected by a given parameter.

      The behavioral data presented here are descriptive. The central conceptual conclusions of the manuscript are derived from the comparison between the model and behavioral data. For these comparisons, the p-value of statistical tests is not used. We realized that a description of how we compared model and behavioral data was not clear in the original manuscript. To compare behavioral data with the model, we fit a line to the d’ values obtained from the model plotted against the d’ values obtained from behavior, and computed the R2 value. We used the mean absolute error (MAE) to quantify the absolute deviation between model and behavior d’ values. Thus, high R2 values would signify a close correspondence between the model and behavior regardless of statistical significance of individual data points. We now clarify this in page 12, line 289. We derive R2 values for individual stimulus manipulations, as well as an overall R2 by pooling across all manipulations (presented in Fig. 11). This is now clarified in page 21, line 494.
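
      The comparison described above (a line fit of model d’ against behavioral d’, summarised by R2, plus the mean absolute error) can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: the function name `fit_r2_and_mae` and the example d’ values are hypothetical, and the response does not specify the fitting routine the authors used.

```python
from typing import Sequence, Tuple


def fit_r2_and_mae(behavior: Sequence[float], model: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line of model d' on behavior d'; returns (R^2, MAE)."""
    n = len(behavior)
    mean_x = sum(behavior) / n
    mean_y = sum(model) / n
    sxx = sum((x - mean_x) ** 2 for x in behavior)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(behavior, model))
    syy = sum((y - mean_y) ** 2 for y in model)

    # Ordinary least-squares line: model = slope * behavior + intercept.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 is the fraction of variance in model d' explained by the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(behavior, model))
    r2 = 1.0 - ss_res / syy

    # MAE compares the raw d' values directly, without the fitted line.
    mae = sum(abs(y - x) for x, y in zip(behavior, model)) / n
    return r2, mae


# Made-up d' values: the model tracks behavior perfectly but sits 0.5 higher.
behavior_dprime = [0.5, 1.0, 1.5, 2.0]
model_dprime = [1.0, 1.5, 2.0, 2.5]
r2, mae = fit_r2_and_mae(behavior_dprime, model_dprime)
```

      The toy values show why the two measures are complementary: a constant offset between model and behavior leaves R2 at 1 while the MAE is 0.5, so R2 captures how well the model tracks changes across conditions and the MAE captures the absolute deviation.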

      Reviewer #3 (Public Review):

      The authors designed a behavioral experiment based on a Go/ No-Go paradigm, to train guinea pigs on call categorization. They used two different pairs of call categories: chuts vs. purrs and wheeks vs. whines. During the training of the animals, it turned out that they change their behavioral strategies. Initially, they do not associate the auditory stimuli with rewards, and hence they overweight the No-Go behavior (low hit and false alarm rate). Subsequently, they learned the association between auditory stimuli and reward, leading to overweighting the Go behavior (high hit and false alarm rates). Finally, they learn to discriminate between the two call categories and show the corresponding behaviors, i.e. suppress the Go behavior for No-go stimuli (improved discrimination performance due to stable hit rates but lower false alarm rates).
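
      The three learning phases summarised above map directly onto the sensitivity index d’ used throughout the responses. The sketch below uses the standard signal-detection definition, d’ = z(hit rate) − z(false alarm rate); the function name `d_prime` and the clipping of extreme rates with `eps` are assumptions made here, since the response does not state which correction (if any) was applied.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, fa_rate: float, eps: float = 0.01) -> float:
    """Sensitivity index d' = z(hit rate) - z(false alarm rate)."""

    def clip(p: float) -> float:
        # Assumed correction: keep rates away from 0 and 1 so that the
        # inverse normal CDF stays finite.
        return min(max(p, eps), 1.0 - eps)

    z = NormalDist().inv_cdf
    return z(clip(hit_rate)) - z(clip(fa_rate))


# Made-up rates for the three phases described in the review.
early = d_prime(0.2, 0.2)    # few Go responses: low hits and low false alarms
middle = d_prime(0.9, 0.9)   # Go bias: high hits and high false alarms
late = d_prime(0.9, 0.1)     # discrimination: high hits, low false alarms
```

      Both biased phases give a d’ of zero despite very different response rates, which is why discrimination performance only improves once false alarm rates fall while hit rates stay high.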

      In order to derive a mechanistic explanation of the observed behaviors, the authors implemented a computational feature-based model, with which they mirrored all animal experiments, and subsequently compared the resulting performances.

      Strengths:

      In order to construct their model, the authors identified several different sets of so-called MIFs (most informative features) for each call category, that were best suited to accomplish the categorization task. Overall, model performance was in general agreement with behavioral performance for both the chuts vs. purrs and wheeks vs. whines tasks, in a wide range of different scenarios.

      Different instances of their model, i.e. models using different sets of MIFs, performed equally well. In addition, the authors could show that guinea pigs and models can generalize to categorize new call exemplars very rapidly.

      The authors also tested the categorization performance of guinea pigs and models in a more realistic scenario, i.e. communication in noisy environments. They find that both guinea pigs and the model exhibit similar categorization-in-noise thresholds.

      Additionally, the authors also investigated the effect of temporal stretching/compression of calls on categorization performance. Remarkably, this had virtually no negative effect on either models or animals, and both performed equally well, even for time reversal. Finally, the authors tested the effect of pitch change on categorization performance, and found very similar effects in guinea pigs and models: discrimination performance crucially depends on pitch change, i.e. systematically decreases with the percentage of change.

      Weaknesses:

      While their computational model can explain certain aspects of call categorization after training, it cannot explain the time course of different behavioral strategies shown by the guinea pigs during learning/training.

      Thank you for bringing this up – in hindsight the original manuscript lacked clarity on exactly what aspects of the behavior we were trying to model. As recent studies have suggested, categorization behavior can be decomposed into two steps – the acquisition of the knowledge of auditory categories, and the expression of this knowledge in an operant task (Kuchibhotla et al., 2019; Moore and Kuchibhotla, 2022). Our model solely addresses how knowledge regarding categories is acquired (through the detection of maximally informative features). Other than setting a 10% error in our winner-take-all stage, we did not attempt to systematically model any other cognitive-behavioral effects such as the effect of motivation and arousal, or behavioral strategies. Thus, in the revised manuscript, we have included a paragraph at the top of the Results section that defines our intent more clearly (page 5, line 117). We conclude the initial description of the behavior by stating that these factors are not intended to be captured by the model (page 6, line 171). We also edited a paragraph in the Discussion section for clarity on this point (page 26, line 629).

      Furthermore, the model cannot account for the fact that short-duration segments of calls (50ms) already carry sufficient information for call categorization in the guinea pig experiment. Model performance, however, only plateaued after a 200 ms duration, which might be due to the fact that the MIFs were on average about 110 ms long.

      The segment-length data indeed demonstrates a deviation between the data and the model. As we had acknowledged in the original manuscript, this observation suggests further constraints (perhaps on feature length and/or bandwidth) that need to be imposed on the model to better match GP behavior. We originally did not perform this analysis because we wanted to demonstrate that a model with minimal assumptions and parameter tuning could capture aspects of GP behavior.

      We have now repeated the modeling by constraining the features to a duration of 75 ms (the lowest duration for which GPs show above-threshold performance). We found that the constrained MIF model better matched GP behavior on the segment-length task (R2 of 0.62 and 0.58 for the chuts vs. purrs and wheeks vs. whines tasks; with the model crossing d’=1 for 75 ms segments for most tested cases). The constrained MIF model maintained similarity to behavior for the other manipulations as well, and yielded higher overall R2 values (0.66 for chuts vs. purrs, 0.51 for wheeks vs. whines), thereby explaining an additional 10% of variance in GP behavior.

      In the revised manuscript, we included these results (page 28, line 699), and present results from the new analyses as Figure 11 – Figure Supplement 2.

      In the temporal stretching/compressing experiment, it remains unclear, if the corresponding MIF kernels used by the models were just stretched/compressed in a temporal direction to compensate for the changed auditory input. If so, the modelling results are trivial. Furthermore, in this case, the model provides no mechanistic explanation of the underlying neural processes. Similarly, in the pitch change experiment, if MIF kernels have been stretched/compressed in the pitch direction, the same drawback applies.

      We did not alter the MIFs in any way for the tests – the MIFs were purely derived by training the model on natural calls. In learning to generalize over the variability in natural calls, the model also achieved the ability to generalize over some manipulated stimuli. The fact that the model tracks GP behavior is a key observation supporting our argument that GPs also learn MIF-like features to accomplish call categorization.

      We had mentioned at a few places that the model was only trained on natural calls. To add clarity, we have now included sentences in the time-compression and frequency-shifting results affirming that we did not manipulate the MIFs to match test stimuli. We also include a couple of sentences in the Discussion section’s first paragraph stating the above argument (page 26, line 615).

    1. Author Response

      Reviewer #1 (Public Review):

      Causality is important and desired but usually difficult to establish. In this work, Park et al. conducted a comprehensive phenome-wide, two-sample Mendelian randomization analysis to infer the causal effects of plasma triglyceride (TG) levels on 2,600 disease traits. They identified causal associations between plasma TG levels and 19 disease traits, related to both atherosclerotic cardiovascular diseases (ASCVD) and non-ASCVD diseases. They used biobank-scale data in both discovery analysis and replication analysis.

      The conclusions of this work are mostly supported by the data and analysis, but some aspects need to be clarified and extended.

      (1) The datasets used in this study may not be very consistent. For example, UKB participants are aged 40-69 years old at recruitment. In addition, UKB is United Kingdom-based and FinnGen is Finland-based. So the definition of outcomes may not be identical. The authors should discuss the differences between the datasets and their potential effects.

      The reviewer is correct about the differences between UKB and FinnGen and that the definition of clinical outcomes between the two datasets may not be identical due to differences in healthcare systems and population demographics. We now mention this in the discussion section as a potential limitation.

      Manuscript changes:

      Line 520-539: “Third, UKB and FinnGen have innate differences in participant demographics and medical coding systems, due in part to the former being based in the United Kingdom and the latter in Finland. As such, potential misclassification of participants in case-control assignment is a liability to this study. We exercised caution in mapping UKB traits to FinnGen traits, but we were unable to reliably map all “categorical” traits from UKB to corresponding traits in FinnGen, testing for replication only 221 of the 598 associations that were nominally significant in the primary analysis. We note however that, despite geographical differences, both datasets largely involve White European participants of older age, with the mean age in UKB and FinnGen being 56.5 and 59.8, respectively.”

      (2) The discovery analysis and replication analysis are not completely independent because data from UKB have been used in both analyses. Although in discovery, the data were used for association with outcomes; while in replication, the data were used for association with exposure. The authors may want to explain if this may cause problems.

      The reviewer is correct that UKB data were used in both the discovery and replication analyses with the caveat that the discovery analysis used UKB for outcomes while using GLGC for exposures, whereas the replication analysis used UKB for exposures while using FinnGen for outcomes. We believed this would be a creative use of three different datasets and a strength of the study; however, we agree that examining the implications of this study design is needed to acknowledge potential biases. We now expand on this in the discussion section as a potential limitation.

      Manuscript changes:

      Lines 539-545: “Fourth, discovery and replication analyses were not completely independent, since UKB data were used in both analyses. This could potentially exacerbate demographic and measurement biases inherent to UKB; however, we show that taking a traditional replication approach using GLGC instead of UKB for selecting exposure instruments in replication returns comparable Tier 1 results (Supplementary File 5), while losing statistical power to highlight many of the Tier 2 and 3 results.”

      (3) As stated in the manuscript, there are three assumptions for MR analysis. The validity of the results depends on the validity of the assumptions. The last two assumptions are usually difficult to validate. To the authors' credit, they conducted sensitivity analyses addressing horizontal pleiotropy, which is related to assumption 3. It would be helpful if the authors can discuss those assumptions explicitly.

      We now explicitly state the assumptions of Mendelian randomization in the introduction section and discuss the validity of these assumptions in the discussion section.

      Manuscript changes:

      Lines 501-514: “The study has several limitations. First, MR is a powerful but potentially fallible method that relies on several key assumptions, namely that genetic instruments are (i) associated with the exposure (the relevance assumption); (ii) have no common cause with the outcome (the independence assumption); and (iii) have effects on the outcome solely through the exposure (the exclusion restriction assumption) (Hartwig et al., 2016). In MR, (i) is relatively straightforward to test, while (ii) and (iii) are difficult to establish unequivocally. As a prominent example, horizontal or type I pleiotropy has been shown to be common in genetic variation, which can bias MR estimates (Verbanck et al., 2018) (Jordan et al., 2019). This occurs when a genetic instrument is associated with multiple traits other than the outcome of interest. To detect and correct for this as best as possible, we used various MR tests as sensitivity analyses that each aim to adjust for or account for the presence of horizontal pleiotropy, including MR-PRESSO, as well as MR-Egger and weighted median methods. There is no universally accepted method that is perfectly robust to horizontal pleiotropy, but we take the best current approach by using multiple methods and examining the consistency of results.”

      Reviewer #2 (Public Review):

      This work conducted a Mendelian randomization analysis between TG and a large number of disease traits in biobanks. They leverage the publicly available summary statistics from the European samples from the UK Biobank and FinnGen. A solid but routine standard summary-statistics based MR study is conducted. Several significant causal associations from TG to phenotypes are called by setting a p-value cutoff with some Bonferroni correction. Sensitivity analyses are conducted and generate largely consistent results. The research problem is important and relevant for public health as well as drug development. Overall this is a solid execution of current methods on appropriate data sources and yields a convincing result. The interpretation of the results in the discussion is also well-balanced.

      While the paper does have strengths in principle, a few technical weaknesses are observed.

      They used UK Biobank as the discovery and FinnGen as the replication. But the two cohorts are rather used symmetrically. Especially for the Tier 3 (NB), it seems to be an attempt of reusing the replication cohort as the discovery. I wonder if that would create additional multiple testing burden as a greater number of hypotheses are considered.

      We thank the reviewer for this thought-provoking comment. As the reviewer is aware, MR studies have generally not accounted for multiple testing in the past since they have usually focused on single exposures and/or single diseases. Ours is one of the few MR studies to take a phenome-wide, high-throughput approach, so determining the optimal threshold for balancing true-positive vs. false-positive discovery is an important aspect of the study warranting discussion.

      We agree that Tier 3 results carry the least stringent level of statistical evidence (i.e., nominally significant in discovery using UK Biobank and Bonferroni-significant in replication using FinnGen), and that these results should be interpreted with caution. As a phenome-wide study, a significant aim of this work was to generate hypotheses, and so, we decided to present our results using the three tiers of statistical evidence to highlight as many promising associations as possible for further investigation. Nevertheless, we now express extra caution in the results and discussion sections regarding Tier 2 and 3 results, and we also note as a limitation that these results especially require external replication.

      Manuscript changes:

      Lines 438-444: “Regarding non-ASCVDs, we present suggestive genetic evidence of potentially causal associations between plasma TG levels and uterine leiomyomas (uterine fibroids), diverticular disease of intestine, paroxysmal tachycardia, hemorrhage from respiratory passages (hemoptysis), and calculus of kidney and ureter (kidney stones). Due to the weaker statistical evidence supporting these associations, special caution is encouraged when interpreting these results to infer causality, and further replication and validation studies are essential for all Tier 2 and Tier 3 results.”

      The replication p-value cutoff is a bit statistically lenient. In a typical discovery-replication setting the two stages are conducted sequentially and replication should go through the Bonferroni adjustment on the number of significant signals from discovery that is tested in the replication. For example, in this case, in tier 2, the cutoff should be 0.05/39. This may make the association of leiomyoma of the uterus slightly non-significant though. Similar cutoff should be applied to tier 3 as well.

      We thank the Reviewer for highlighting this important point. We agree that in a standard two-stage discovery and replication study design, the Bonferroni adjustment should be based on the number of significant signals from discovery that is tested in the replication. We had initially considered this approach but chose the current tiered approach based on a number of factors:

      First, we had initially considered performing a standard meta-analysis between UK Biobank and FinnGen datasets and using the Bonferroni adjustment of the total number of tests. However, it was not possible to reliably map the phenotypes between UK Biobank and FinnGen on a large-scale due to different classification schemes.

      Second, we had noticed that if we only focus on the sequential two-stage design, then we would be ignoring strong causal relationships observed in FinnGen that passed Bonferroni adjustment but may only be nominally associated in UK Biobank. Although not as strong as Tier 1 findings, we believe that these findings warranted some consideration. This is particularly relevant since differences in the strength of the causal relationship could be attributed to the different populations studied, sample size, different health systems used to measure disease outcomes, differences in statistical power in the MR tests between the two stages (e.g., number of IVs), amongst others.

      Third, we wanted to point out that the total adjustment for number of phenotypes tested using Bonferroni is a very conservative adjustment because the multiple EHR phenotypes have varying degrees of redundancy and correlation. We believe the appropriate Bonferroni-adjusted P-value cutoff is somewhere in between the Bonferroni adjustment of total number of phenotypes, and the nominal P-value (no adjustment for number of phenotypes).

      Although somewhat unconventional, we came up with this tiered P-value approach to overcome the points mentioned above. We have now included text to further explain our approach and to mention that tier 2 and tier 3 results require further replication and validation.

      Manuscript changes:

      Lines 266-283: “This presentation is somewhat unconventional and partly arises from the study’s use of three different datasets for instrument selection. In a traditional two-stage discovery and replication design, Bonferroni adjustment is based on the number of significant signals from discovery that is tested in replication. Here, we used three tiers of statistical evidence to present results because a standard meta-analysis between UKB and FinnGen was not possible, given it was not possible to reliably map all phenotypes between the two datasets. Additionally, Bonferroni-significant results in the replication analysis would have been ignored in FinnGen in a sequential two-stage design if they were also only nominally associated in UKB. The three tiers are defined below:”

      Lines 441-444: “Due to the weaker statistical evidence supporting these associations, special caution is encouraged when interpreting these results to infer causality, and further replication and validation studies are essential for all Tier 2 and Tier 3 results.”

      Lines 498-500: “However, we reiterate that this Tier 3 association was only nominally significant in discovery, while Bonferroni-significant in replication, and future studies are needed to validate the statistical evidence.”

      Lines 565-567: “However, caution is still warranted in inferring causality, as MR depends on specific assumptions and the validity of those assumptions must be carefully assessed. Thus, diverse study designs remain necessary to triangulate evidence on the causal effects of plasma TG levels.”
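
      The tiered presentation discussed in this exchange can be sketched as a small classification rule. This is a hedged illustration, not the manuscript's definition: only Tier 3 (nominally significant in discovery, Bonferroni-significant in replication) and the replication threshold of 0.05/221 are stated in the responses. The Tier 1 and Tier 2 rules, the discovery threshold of 0.05/2,600, and the function name `assign_tier` are assumptions made here.

```python
from typing import Optional

NOMINAL_ALPHA = 0.05
N_DISCOVERY_TRAITS = 2600    # traits in the primary analysis; using this as the
                             # Bonferroni denominator is an assumption
N_REPLICATION_TRAITS = 221   # traits tested in FinnGen, per the response


def assign_tier(p_discovery: float, p_replication: float) -> Optional[int]:
    """Return 1, 2, or 3 for the tier of evidence, or None if no tier applies."""
    discovery_cutoff = NOMINAL_ALPHA / N_DISCOVERY_TRAITS
    replication_cutoff = NOMINAL_ALPHA / N_REPLICATION_TRAITS

    disc_strict = p_discovery < discovery_cutoff
    rep_strict = p_replication < replication_cutoff
    disc_nominal = p_discovery < NOMINAL_ALPHA
    rep_nominal = p_replication < NOMINAL_ALPHA

    if disc_strict and rep_strict:
        return 1  # assumed: Bonferroni-significant in both stages
    if disc_strict and rep_nominal:
        return 2  # assumed: Bonferroni in discovery, nominal in replication
    if disc_nominal and rep_strict:
        return 3  # stated: nominal in discovery, Bonferroni in replication
    return None
```

      Under the stricter sequential design the reviewer proposes, the replication cutoff for Tier 2 would instead be 0.05/39, so a replication p-value between 0.05/39 and 0.05 would no longer qualify.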

      The causal effect of TG on leiomyoma of the uterus is weak, as indicated by both the sub-significant replication result and the non-significant MR-PRESSO result. Similarly, I would recommend more caution about the weak statistical rigor when interpreting Tier 2 and Tier 3 results.

      We agree with the Reviewer. We have now emphasized more caution in interpreting Tier 2 and Tier 3 results. We have also explicitly restated the weaker statistical evidence underlying these results and noted the need for future validation. Please see our detailed response to the Comment above.

      Manuscript changes:

      Lines 498-500: “However, we reiterate that this Tier 3 association was only nominally significant in discovery, while Bonferroni-significant in replication, and future studies are needed to validate the statistical evidence.”

      Another methodological choice that might need justification is the use of UKB TG GWAS loci (1,248 SNPs) as the instrument for FinnGen. This may create some subtle interference with the use of UKB as outcomes in the discovery analysis. It may be minor, but some justification, or at least some discussion of potential limitations, should be included. What about the alternative of using GLGC as instruments in replication?

      We agree with the reviewer that the use of UKB TG GWAS loci (1,248 SNPs) as instruments for FinnGen outcomes needs additional justification. We now detail this decision in the text as copied below.

      Additionally, we now present new data comparing MR results on FinnGen outcomes when selecting TG instruments from UKB GWAS versus GLGC GWAS. Statistical significance after Bonferroni correction was set to 0.05/221, where 221 was the number of disease traits nominally significant in UKB that were tested in FinnGen. We note that the results were fairly consistent. All Tier 1 results remained Bonferroni significant, whether using TG SNPs from UKB or GLGC. Though statistical significance decreased for the remaining diseases of interest, the direction of causality remained consistent, and three disease traits remained significant (hypertension, aortic aneurysm, and alcoholic liver disease). These results support that instrumenting TG using 1,248 SNPs from UKB might carry more power than the 141 SNPs from GLGC, allowing for the detection of associations in our initial replication analysis using UKB for exposures and FinnGen for outcomes. We now include this analysis in the text and include the figure below, as well as its underlying data, as supplementals (Supplementary File 5).

      Manuscript changes:

      Lines 229-236: “We selected UKB TG GWAS loci as the instruments for replication on FinnGen outcomes, rather than GLGC TG GWAS loci, to diversify the source of TG instruments and mitigate potential biases associated with one TG GWAS. Moreover, UKB GWAS included a larger study population than GLGC GWAS, providing a greater number of genetic instruments that can together explain more of the variance in plasma TG levels, and thus, greater statistical power and precision. Nevertheless, we also performed the replication analyses using TG instruments from GLGC and included these results as supplemental data (Supplementary File 5).”

      For disease outcomes (line 188), UKB European sample size is ~400,000 rather than ~500,000. Can the author clarify the sample size they used?

      We thank the reviewer for catching this detail. We have now clarified the sample size of UKB European participants in the Methods section, and we also included the exact sample size of each disease trait GWAS (cases and controls) in Supplementary Figure 1.

      Manuscript changes:

      Lines 194-201: “Pan-UKB had performed 16,131 GWASs on 7,221 phenotypes in ~420,531 UKB participants of European ancestry using genetic and phenotypic data (PanUKBTeam, 2020). A total of 7,221 phenotypes had been categorized as “biomarker”, “continuous”, “categorical”, “ICD-10 code”, “phecode”, or “prescription” (PanUKBTeam, 2020). We filtered for outcomes to retain categorical, ICD-10, and phecode types; non-null heritability in European ancestry as estimated by Pan-UKB; and relevance to disease, excluding medications. This yielded 2,600 traits for primary analysis. The exact sample size of each GWAS for each of these traits is provided in Supplementary File 1.”

      It would be reassuring to the reader if the TG measurements were measured in a treatment-naïve manner. GLGC accounted for treatment (at least LDL, check paper for TGs; if they didn’t, there must be reason). Maybe not UKB.

      We now provide information about whether the lipid measurements were measured in a treatment-naïve manner in the Methods for GLGC and UKB. We also address this point in the discussion section as a potential limitation.

      Manuscript changes:

      Lines 179-180: “We note that the GLGC GWAS had excluded individuals known to be on lipid-lowering medications.”

      Lines 187-188: “We note that the Pan-UKB GWAS study did not exclude participants based on their use of lipid-lowering medications.”

      Lines 545-546: “Fifth, the GLGC GWAS used to select instruments for plasma TG levels in discovery had accounted for lipid-lowering treatment, while the UKB GWAS used in replication had not.”

      "Phenome-wide MR is a high-throughput extension of MR that, under specific assumptions, estimates the causal effects of an exposure on multiple outcomes simultaneously." - I guess it is more informative to mention the specific assumptions, at least briefly, in the introduction so it is easier for the reader to interpret the results.

      We agree with the reviewer that it would be informative to explicitly state the assumptions of Mendelian randomization. We now explicitly state these assumptions in the introduction.

      Manuscript changes:

      Lines 123-129: “Phenome-wide MR is a high-throughput extension of MR that estimates the causal effects of an exposure on multiple outcomes simultaneously. As in conventional MR, this method uses genetic variants as instrumental variables (IV) to proxy modifiable exposures (Davey Smith & Ebrahim, 2003), and importantly, it relies on three critical assumptions: (1) The genetic variant is directly associated with the exposure; (2) The genetic variant is unrelated to confounders between the exposure and outcome; and (3) The genetic variant has no effect on the outcome other than through the exposure (Davey Smith & Ebrahim, 2003).”

      Reviewer #3 (Public Review):

      Park and Bafna et al. applied a genetics-based epidemiological approach, the Mendelian randomization analysis (MR), to evaluate the potential causal roles of triglycerides across 2,600 disease traits (i.e., the phenome). In a typical two-sample MR framework, they utilized existing genome-wide association study (GWAS) summary statistics from two separate studies. They are Global Lipids Genetics Consortium (GLGC) and UK Biobank in the discovery analysis, and UK Biobank and FinnGen in the replication analysis. This replication design is a great strength of the study, enhancing the robustness and reproducibility of the results. For the candidate pairs of causal associations, the authors further perform multiple sensitivity analyses to evaluate the robustness of the results to possible violations of assumptions in MR. To disentangle the independent effects of triglycerides from other lipid fractions (i.e., LDL-cholesterol and HDL-cholesterol), the authors performed multivariable MR analysis. In the end, possible causal associations were revealed in three tiers, based on statistical significance in the two-stage analysis. The results support the causal effects of triglycerides in increasing the risk of atherosclerotic cardiovascular disease. They also reveal novel conditions, which are either new treatable conditions (e.g., leiomyoma, hypertension, calculus of kidney and ureter) for repurposing of triglycerides-lowering drug, or possible side effects (e.g., alcoholic liver disease) the triglyceride-lowering treatment should pay special attention to.

      The analysis approaches in the paper are standard and solid. The discovery-replication study design is a great strength. Correction for multiple testing was implemented in a conservative way. The sensitivity analyses and MVMR strengthen the robustness of the results. The manuscript is very clearly written and pleasant to read. The limitations were well-presented. The conclusions and interpretations are mostly supported by the data, with one major concern as explained below. But overall, in addition to the specific findings, this study could be an exemplar study for the use of phenome-wide MR in identifying treatable conditions and side effects for most existing drugs.

      1) My major concern is about reverse causation. For example, having atherosclerotic cardiovascular disease increases circulating triglycerides. Reverse causation can induce false positives in MR analysis. With the existing data in this study, the authors can perform a reverse MR to evaluate the effect of the 19 disease traits on triglycerides. Ruling out the presence of reverse causation is important to make sure that the current findings are not false positives.

      We agree with the reviewer that performing reverse MR would be important to rule out reverse causation. We now present new results using reverse MR, selecting instruments for disease from UKB and instruments for TG from GLGC (i.e., reversing the discovery analysis). We provide an interpretation of these new results in the discussion section and present the underlying data, including the number of genetic variants used, in Supplementary File 6. Please note we could only perform reverse MR on 9 of the 19 diseases of interest, due to insufficient genetic data in GLGC to extract the specific exposure instruments. As expected, we observed significant associations (orange) between “disorders of lipoprotein metabolism” and “hyperlipidemia” with plasma TG levels; however, all other estimates were non-significant, suggesting unidirectional associations for the remaining seven disease traits. We now include the figure below and its underlying data as supplements (Supplementary File 6).

      Manuscript changes:

      Lines 258-261: “Finally, we performed bidirectional or reverse MR on significant results to examine the potential presence of reverse causation. We selected instruments for each disease as described above from Pan-UKB and instruments for plasma TG levels from GLGC, essentially reversing the discovery stage design using a fixed-effect IVW method.”

      Lines 368-373: “Finally, we performed reverse MR to estimate the effects of significant disease traits on plasma TG levels, selecting instruments from UKB and GLGC, respectively. Genetic data were sufficiently available to perform this analysis for 9 of the 19 diseases of interest. These results are presented in Supplementary File 6. Expectedly, “disorders of lipoprotein metabolism” and “hyperlipidemia” had positive effects on plasma TG levels; however, no other examined disease trait showed results suggesting reverse causation.”
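
      The fixed-effect IVW method named in the quoted Methods text can be sketched as follows. This is a minimal illustration of the standard inverse-variance-weighted estimator computed from per-variant summary statistics, not the authors' actual pipeline; the function name and argument names are hypothetical, and in the reverse design the "exposure" betas would come from the disease GWAS and the "outcome" betas from the TG GWAS.

```python
import math


def ivw_fixed_effect(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted (IVW) MR estimate.

    Each list holds one value per genetic instrument:
      beta_exposure -- variant effect on the exposure
      beta_outcome  -- variant effect on the outcome
      se_outcome    -- standard error of beta_outcome
    Returns (estimate, standard_error).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    if not beta_exposure:
        raise ValueError("at least one instrument is required")

    # Weighted regression of outcome effects on exposure effects through the
    # origin, with weights 1 / se_outcome**2.
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    estimate = numerator / denominator
    # Fixed-effect standard error: no allowance for heterogeneity between variants.
    standard_error = math.sqrt(1.0 / denominator)
    return estimate, standard_error
```

      Because the fixed-effect standard error ignores between-variant heterogeneity, it is usually read alongside sensitivity analyses such as those the reviewers mention.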

    1. Author Response

      Reviewer #1 (Public Review):

      This paper evaluates the effect of knocking out CST7 (Cystatin F) on the APPNL-G-F Alzheimer's disease mouse model. They found sexually dimorphic outcomes, with differential transcriptional responses, increased phagocytosis (but interestingly a higher plaque burden) in females and suppressed inflammatory microglial activation in males (but interestingly no change in plaque burden). This study offers new insight into the functional role of CST7 that is upregulated in a subset of disease-associated microglia in AD models and human brain. Despite the discovery of disease-associated microglia several years ago, there has been little effort in understanding the function of the different genes that make up this profile, making this paper especially timely. Overall, the experiments are well-controlled and the data support the main conclusions; the manuscript could be strengthened by addressing the comments below and the clarifying questions that could impact the interpretation of their data/findings.

      1) In the first section discussing CST7 expression levels in AD models, it would be good to involve a discussion of levels of CST7 change in human AD samples. There are sufficient available datasets to look at this, and it would help us understand how comparable the animal models are to human patients. For example, while in mice CST7 is highly enriched in microglia/macrophages, in human datasets it seems like it is not quite so specific to microglia - it is equally expressed in endothelial cells. This might have a significant impact on the interpretation of the data, and it would be good to introduce and assess the findings in mice through the human subjects lens. There is a discussion of the human data in the discussion section, but it would be more appropriately assessed in the same way as the mouse data and comparatively presented in the results section. The authors could also include the data from Gerrits et al. 2021 in their first figure.

      We agree with the reviewer on the importance of considering the work in the context of human disease. While CST7 is not as strongly upregulated in human AD brain as it is in mouse, expression is observed predominantly in myeloid cells in the brain, with very minimal expression detected in endothelial cells (see screenshots in Author response image 1 from the Brain Myeloid Landscape platform, http://research-pub.gene.com/BrainMyeloidLandscape/BrainMyeloidLandscape2/), and is enriched in AD clusters vs homeostatic clusters in scRNASeq studies (Gerrits et al., 2021). We attempted immunostaining for human Cystatin F (CF, encoded by CST7) in AD brains to assess expression and co-localisation with microglial markers but failed to validate any of the antibodies tested. Additionally, King et al., 2023 (PMID: 36547260) recently showed an increase in CST7 expression in bulk hippocampal RNASeq in AD vs mid-life controls, suggesting an ageing/AD mechanism. CST7 has also been shown to be expressed following overexpression of TREM2 in human microglia in vitro, and siRNA-mediated knockdown of its expression leads to an increase in phagocytosis (Popescu et al., 2023 - PMID: 36480007), mirroring our data and suggesting a conserved role in human cells. Overall, we believe that, even in the context of mouse models, understanding the function of genes upregulated in disease is of importance to the field and that this study paves the way for further work investigating human CST7 in disease. We have added this (with citations to the datasets mentioned) to the discussion (highlighted).

      Author response image 1

      2) The differential RNAseq data is perhaps one of the most striking results of this paper; however it is difficult to see exactly how similar the male v female APPNL-G-F profiles are, in addition to the genes shared or not between the KO condition. Venn diagrams, in addition to statistical tests, would enhance this part of the paper and add more clarity.

      We have added Venn diagrams of DEGs between male and female AppNL-G-F microglia vs WT control to show how similar the male and female AppNL-G-F profiles are. Additionally, to exemplify the Cst7KO-Sex interaction, we have added a Venn diagram showing DEGs between male and female AppNL-G-F microglia vs. AppNL-G-FCst7-/- microglia (Fig. 2 – Fig. supplement 3). We confirm that all differential gene expression changes reported (including those represented in the Venn diagrams) were derived using appropriate Padj statistical approaches (see Methods).

      3) A major argument in the paper is a continuation of Sala-Frigerio 2019 which says that the female phenotype is an acceleration of the male phenotype. Does this mean that if males were assessed at later timepoints, they would be more similar to the females? Or are there intrinsic differences that never resolve? It would be helpful to see a later timepoint for males to get at the difference between these two options

      This is an interesting question. While we acknowledge that a later timepoint could add insight, we believe that addressing the question empirically would require multiple closely spaced timepoints, because choosing a single optimal later timepoint is difficult (and likely not possible at all) for the reasons below. We also believe that published data, combined with our observations, show that a cell-intrinsic effect most likely explains our sex-specific differences.

      First, we emphasize that the previously published acceleration of the microglial phenotype in female AppNL-G-F mice is fairly subtle and relative rather than absolute: e.g., the DAM/ARM microglia state represents ~50% of all microglia in males and ~55% of all microglia in females at 12 months old. Both sexes therefore have similarly abundant microglia in the state that most highly expresses Cst7. Indeed, after the age at which DAM/ARM state microglia appear in appreciable numbers (~6 months), females and males both have an abundance of them. It is important to note that a 12-month male is far more “progressed” than a 6-month female; hence the stepped age effect is temporally short.

      Second, Cst7 deletion in the AppNL-G-F mice condition caused qualitative differences affecting distinct genes and/or overlapping genes moving in different directions between female and male mice - if a stepped age effect explained sex differences from Cst7 deletion, given that it could only be stepped by a very short timeframe (several weeks maximum) from reasoning above, we would expect to see similar qualitative changes but of different magnitude in female and male mice arising from Cst7 deletion; this is not the pattern we see.

      Third, beyond 12 months old, regression from ARM/DAM actually occurs, again making it unlikely males would “catch up” with females to show the same profile from Cst7 deletion but just at an older age – practically, this also complicates choosing a single later timepoint (and age-related systemic morbidity emerges as a potential confounder as well).

      In summary, while the acceleration of the DAM signature in female microglia offers an intriguing possible explanation for our observation of sexual dimorphism in response to deletion of one of the key genes in this signature, we believe it more likely that intrinsic effects are responsible for the sex-related impact of Cst7 deletion. Taking the alternative perspective, even if a stepped age effect in the underlying progression of the model could explain our findings, exposing this pattern would need multiple timepoints with short gaps between them (e.g. monthly at 12, 13, 14, 15 months old) to provide the necessary temporal resolution; we do not have the resources to conduct such a resource-intensive and lengthy study. We hope this reasoning appears logical. Conscious of the importance of conveying this in our manuscript, we have revised the Discussion to capture, as concisely as possible, some of the key points outlined above.

      4) If the central argument is that CST7 in females decreases phagocytosis and in males increases microglia activation, are there changes in amyloid plaque burden or structure in the APPNL-G-F /CST 7 KO mice compared to APPNL-G-F/CST7 WT that reflect these changes? Please address. If not, how does this affect the functional interpretation of differential expression observed in phagocytic/reactive microglia genes? Pieces of this are discussed but it could be clearer.

      We emphasise the data already presented in Fig 6 and Fig. 6 – Fig. Supplement 2 showing altered Aβ burden (6E10 staining) and plaque count (MeX04) but no change in plaque area. Regarding the functional interpretation of Cst7-dependent gene changes in microglia beyond the endolysosomal function we present in figures 3-5, we have included additional data using simple immunohistochemistry, as suggested by the reviewer, to assess synapse abundance. We show loss of Sy38 coverage around plaques (Fig. 6I) and a moderate but significant decrease in coverage between AppNL-G-F/Cst7-/- vs AppNL-G-F brains only in females (Fig. 6J). This reflects the effect observed with plaque coverage whereby we observe increased burden in AppNL-G-F/Cst7-/- vs AppNL-G-F females but not males (Fig. 6B-F) suggesting the increased plaque burden in Cst7-/- female mice may lead to increased synapse loss. We would also emphasise that altered expression of phagolysosomal genes could affect disease in ways beyond interactions with amyloid and synapses.

      5) It is confusing that increased phagocytosis in the APPNL-G-F/CST7 KO females leads to greater plaque burden, considering proteolysis is not affected. What might explain this observation? Additionally, it is interesting that suppression of microglial activation doesn't lead to an increase in plaques in the male APPNL-G-F/CST7 KO mice. How does the profile of phagocytic microglia in the male APPNL-G-F/CST7 KO mice differ from the APPNL-G-F males?

      We emphasize our comments on this topic in the discussion, where we speculate that the greater plaque burden in females is linked to increased uptake of Aβ (which we observe in Fig. 4B&C) and deposition into plaques, as suggested by Huang et al., 2021 (PMID: 33859405), d’Errico et al., 2022 (PMID: 34811521) and Shabestari et al., 2022 (PMID: 35705056). Regarding the lack of effect in males despite the suppression of inflammatory genes, we agree this is a curious observation, although it may point to as yet ill-defined mechanisms for how inflammatory pathways influence plaque pathology. Unfortunately, we were not able to specifically compare the profile of phagocytic microglia in AppNL-G-F vs AppNL-G-FCst7-/- as we did not perform single-cell RNASeq. However, our bulk RNASeq profiling suggests modest downregulation of phagocytic/endolysosomal genes (e.g. Lilrb4a, Fig. 2I) and reduced expression of LAMP2 in microglia by immunostaining. We have added further comment on this in the discussion.

      6) Seems that the authors have potentially discovered an unusual mechanism for how CST7 could regulate cell autonomous function without impacting its canonical protease target. The authors deal with this extensively in the discussion but an ELISA or ICC to localize CST7 to microglia in vitro or in vivo would help address this point.

      We have added FISH data localising Cst7 expression to IBA1+ cells specifically around plaques in App brains (Fig. 1B-E). We agree that assessing the subcellular localisation and any non-microglial expression of Cystatin-F (the protein coded by Cst7) would offer valuable insight into the protease target and may reveal details on the precise mechanism by which CF deletion leads to the phenotype we observe in this study. However, despite attempting numerous commercially available and gifted antibodies to detect CF, we were unable to validate (using Cst7-/- as controls) any methods other than FISH.

      7) The authors focus on plaques in their final figure, however dysregulated microglial phagocytosis could impact many other aspects of brain health. Simple immunohistochemistry for synapses and myelin/oligodendrocytes (especially given the results of the in vitro phagocytosis assay) could provide more insight here.

      We fully agree with the reviewer. As also outlined in our responses elsewhere, phagocytic changes could have multiple consequences, and we have included additional data using immunohistochemistry as advised for synapses in WT, AppNL-G-F, and AppNL-G-F/Cst7-/- brains. We show loss of Sy38 coverage around plaques (Fig. 6I) and a moderate but significant decrease in coverage between AppNL-G-F/Cst7-/- vs AppNL-G-F brains only in females (Fig. 6J). This reflects the effect observed with plaque coverage whereby we observe increased burden in AppNL-G-F/Cst7-/- vs AppNL-G-F females but not males (Fig. 6B-F) suggesting the increased plaque burden in Cst7-/- female mice may lead to increased synapse loss.

      We also performed immunohistochemistry for myelin markers MAG and MBP but found no plaque-associated pathology. Finally, we searched for dystrophic neurites using LAMP1 but found that the antibody stained microglial lysosomes rather than dystrophic neurites in this model (see Author response image 2), an observation that has been made by others (Sharoar et al., 2021 - PMID: 34215298).

      Overall, our data suggest Cst7 may play a protective role in females, limiting phagocytosis, reducing plaque burden and blunting synapse loss.

      Author response image 2.

      Reviewer #3 (Public Review):

      In this manuscript, Daniels et al explored the role of Cystatin F in an Aβ-driven mouse model of Alzheimer's disease. By crossing a constitutive knockout mouse lacking the gene that encodes Cystatin F, Cst7, to the AppNL-G-F mouse line, the authors describe impairments in microglial gene expression and phagocytic function that emerge more prominently in females versus males lacking Cst7. A strength of the study is its focus: given mounting evidence that microglia are a hub of neurological dysfunction with particular potential to trigger or exacerbate neurodegenerative disorders, it is essential to determine the changes in microglia that occur pathologically to promote disease progression. Similarly, the wide-spread identification of the gene in question, Cst7, as upregulated in AD models makes this gene a good target for mechanistic studies.

      The paper in its current form also has several weaknesses which limit the insights derived, weaknesses that are largely related to the experimental tools and approaches chosen by the authors to test their hypotheses. For example, the paper begins with a figure replotting data from previous studies showing that Cst7 is upregulated in mouse models of Alzheimer's disease. Though relevant to the current study, there are no new insights provided here. Next, the authors perform bulk RNA-sequencing on microglia isolated from male and female mice in the Cst7-/-; AppNL-G-F mouse line. In the methods, it is unclear whether the authors took precautions to preserve the endogenous transcriptional state of these cells given evidence that microglia can acquire a DAM-like signature simply due to the process of dissociation (Marsh et al, Nature Neuroscience, 2022). If the authors did not control for this, their results may not support the conclusions they draw from the data. Relatedly, it appears the authors pooled all microglia together here, instead of just isolating DAMs specifically or analyzing microglia at single-cell resolution, which could reveal the heterogeneous nature of the role of Cst7 in microglia. In addition to losing information about heterogeneity, another concern is that they could be diluting out the major effects of the model on microglial function by including all microglia. Overall, the biggest issue I have with the RNA-sequencing data is the lack of validation of the gene expression changes identified using a different method that does not require dissociation, like immunohistochemistry or fluorescence in situ hybridization. Especially given the limited number of genes they found to be mis-regulated (see Fig. 2 E and G), I worry that these changes might simply be noise, especially since the authors provide no further evidence of their mis-regulation. Without further validation, the data presented are not sufficient to support the authors' claims.

      We believe we have addressed this comment in the “Essential Revisions (for the authors)” section above. Please see again below:

      We took standard precautions to minimise the risk of aberrant ex vivo cell activation, including maintaining cells on ice during non-enzyme steps of the procedure and carrying out preps in small batches to minimise time taken from removal of brain to purification of microglial RNA. Importantly, we also validated key expression data by in situ methods such as RNA FISH for Cst7 and Lilrb4a (Fig. 1B-E, Fig 2. - Fig. supplement 3), thus excluding dissociation-induced effects for these genes. Additionally, when performing qPCR on microglia from non-disease mice to test the disease-specific role of Cst7-dependent gene regulation, we did not observe the same gene changes (Fig 2. - Fig. supplement 4) which, if such changes were dependent on tissue dissociation, we would expect to observe in WT or disease animals. We utilised the resources provided by Marsh et al. 2022 to search for overlap between enzyme-induced genes and our DEG lists from our key comparisons. We found the enzyme-induced gene set had very minimal overlap with any of our comparisons, with overlap of only 4 genes between enzyme-induced genes and Cst7-dependent genes in males and no overlap between enzyme-induced genes and Cst7-dependent genes in females. We would further point out that the disease-induced microglial RNAseq profile in the AppNL-G-F Cst7+/+ (i.e. disease WT) condition mirrors those observed previously by multiple methods including in situ profiling (Zeng et al 2023 - PMID: 36732642) and RiboTag approaches (Kang et al 2018 - PMID: 30082275). We believe these combined approaches provide convincing validation of the RNAseq data.
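
      The overlap check described above amounts to intersecting a reference gene set with each DEG list. The sketch below illustrates the idea only; the function name is hypothetical, the gene symbols in the usage note are placeholders rather than real results, and it assumes both lists use the same gene-symbol convention.

```python
def gene_set_overlap(reference_genes, deg_list):
    """Return the sorted gene symbols shared by a reference set and a DEG list.

    Symbols are compared exactly after trimming surrounding whitespace, so
    both inputs are assumed to use the same naming convention.
    """
    reference = {gene.strip() for gene in reference_genes}
    degs = {gene.strip() for gene in deg_list}
    return sorted(reference & degs)
```

      For example, `gene_set_overlap(["GeneA", "GeneB"], ["GeneB", "GeneC"])` returns `["GeneB"]`, and an empty list indicates no shared genes.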

      In assessing the changes in microglial function and Aβ pathology that occur in males and females of the Cst7-/-; AppNL-G-F line, the authors identify some differences between how females and males are affected by the loss of Cst7. While the statistical analyses the authors perform as given in the figure legends appear to be correct, the plots do not show significant changes between males and females for a given parameter. Take for example Figure 3H. Loss of Cst7 decreases IBA+Lamp+ microglia in males but increases this parameter in females. However, it does not appear that there is a significant difference in IBA+Lamp+ microglia in male versus female mice lacking Cst7. If there is no absolute difference between males and females, can the differential effects of Cst7 knockout on the sexes really be so relevant to the sexual dimorphism observed in the disease? I question this connection, but perhaps a greater discussion of what the result might mean by the authors would be helpful for placing this into context.

      We understand the reviewer’s perspective and we agree that the interpretations could be presented and explained better in the text - we have updated the discussion as suggested to address this.

      We designed our study initially to search for sex-specific effects of Cst7. Therefore, whilst our ANOVA does include main effects analysis for disease or sex, we carried out post-hoc analysis primarily to investigate effects of Cst7 deletion within sex. In the case of Fig. 3H pointed out by the reviewer, we observe a main effect for disease in the ANOVA and for disease-sex interaction but not for sex. Post-hoc analysis revealed the sex-specific effects of Cst7 we describe in the manuscript. This approach to analysis was also taken by Hoghooghi et al. (2020 - PMID: 33027652), who show that the related pathway gene Cstc is detrimental in EAE in females but not males (included in the discussion in this manuscript). The observation in Fig. 3H that there appears to be a Cst7 effect in males and females but not a sex effect in Cst7-/- is accurate but a relative anomaly in this study. Generally, we find that, alongside Cst7 deletion affecting females differently to males, we also see a sex effect in Cst7-/- animals but not in Cst7+/+ animals, i.e. absolute levels in the disease condition as well as relative changes from control to disease condition are different between males and females. This is exemplified in Fig. 4B&C, where we observe increased microglial Aβ in female Cst7-/- animals vs male Cst7-/- animals, and in Fig. 6D, where we observe increased Aβ plaque burden in female Cst7-/- animals vs male Cst7-/- animals. This is most strikingly demonstrated in the case of our RNASeq data, where we observe a difference in sex-dependent genes in AppNL-G-F vs AppNL-G-F/Cst7-/- (Fig. 2 – Fig. supplement 3B), implying that removal of the Cst7 gene led to an ‘unlocking’ of sexual dimorphism in our cohort, which we comment on in the discussion.

      Finally, the use of in vitro assays of microglial function can be helpful as secondary analyses when coupled with in vivo or ex vivo approaches, but are not on their own sufficient to support the authors' conclusions. Quantitative engulfment assays (see Schafer et al, Neuron, 2012) on brain tissue showing that male and female microglia lacking Cst7 engulf different amounts of material (e.g. plaques, synapses, myelin) in the intact brain would be more convincing.

      We agree that in vitro assays for microglial function are not always sufficient as standalone methods to support conclusions on functions in disease. The reviewer may have missed our in vivo MeX04 uptake assays (Fig 4A-D), which use flow cytometry measurements on isolated microglia and therefore reflect microglial uptake in vivo following pre-mortem MeX04 injection. This experiment showed increased microglial Aβ in female Cst7-/- animals vs male Cst7-/- animals (Fig. 4B&C). Our in vitro assays complement and extend insight in ways not possible in vivo; for example, they offer key insight into uptake/degradation kinetics that would be extremely challenging to obtain in vivo.

      In general, a major limitation to the insights that can be derived in the study is the decision of the authors to perform all experiments at a single late-stage time point of 12 months of age. As this is quite far into disease progression for many AD models, phenotypic changes identified by the authors could arise due to the downstream effects of plaque deposition and therefore may not implicate Cst7 as a mechanism driving neurodegeneration rather than one of many inflammatory changes that accompany AD mouse models nearing the one-year time point. A related problem is that the study uses a constitutive KO mouse that has lacked Cst7 expression throughout life, not just during disease processes that increase with aging. In summary, the topic of the article is important and timely, but the connection between the data and the authors' conclusions is not as strong as it could be.

      As described above, Cst7 expression is absent at steady-state and low until 6-12 months. Therefore, we predict that deletion would have little effect until 12+ months whereby cells expressing Cst7 have had the temporal window to affect disease pathology, as we find in the current study. This was a key part of the reasoning in our choice of the 12-month age for analyses. The negligible expression of Cst7 at baseline/early stages of disease suggests constitutive KO of the gene will not impact the phenotype until disease onset. This is substantiated by the lack of any genotype-related differences in the WT vs Cst7-/- comparisons in the non-disease condition.

    1. Author Response

      Reviewer #2 (Public Review):

      In this manuscript, the authors performed single-cell RNA sequencing (scRNA-seq) analysis on bone marrow CD34+ cells from young and old healthy donors to understand the age-dependent cellular and molecular alterations during human hematopoiesis. Using a logistic regression classifier trained on young healthy donors, they identified cell-type composition changes in old donors, including an expansion of hematopoietic stem cells (HSCs) and a reduction of committed lymphoid and myeloid lineages. They also identified cell-type-specific molecular alterations between young and old donors and age-associated changes in differentiation trajectories and gene regulatory networks (GRNs). Furthermore, by comparing the single-cell atlas of normal hematopoiesis with that of myelodysplastic syndrome (MDS), they characterized cellular and molecular perturbations affecting normal hematopoiesis in MDS.

      The present manuscript provides a valuable single-cell transcriptomic resource to understand normal hematopoiesis in humans and the age-dependent cellular and molecular alterations. However, their main claims are not well supported by the data presented. All results were based on computational predictions, not experimentally validated.

      Major points:

      1) The authors constructed a regularized logistic regression trained on young donors with manually annotated cell types and predicted cell type labels of cells from old and MDS samples. As the manual annotation of cell types was implicitly assumed as ground truth in this manuscript, I'm wondering whether the predicted cell types in old and MDS samples are consistent with the manual annotation. They should apply the same strategy used in young samples for manual annotation to old and MDS samples, and evaluate how accurate their classifier is.

      We performed manual annotation for each MDS sample independently, and for the integrated dataset of the 3 healthy elderly donors. To do so, we performed unsupervised clustering with Seurat and annotated the clusters using the same set of canonical marker genes that we used for the young data. We then analyzed the correspondences between the annotated clusters and the predictions by GLMnet. Results are shown in Figure 1a. We observe that the biggest disagreements between methods occur between adjacent identities, such as HSC and LMPP, GMP and GMP with a more prominent granulocyte profile, or MEP, early and late erythroid. When we explore these disagreements along the erythroid branch, we see that they occur particularly close to the border between subpopulations (Figure 1b). This is consistent with the continuous nature of differentiation and the difficulty of establishing boundaries between cell compartments. However, we observe that mislabeling between different hematopoietic lineages is rare.

      In addition, unsupervised clustering was not always able to directly separate the data in the expected subpopulations. We can see different clusters containing the same cell types (e.g. LMPP1, LMPP2), as well as individual clusters containing cells with different identities (e.g. pDC and monocyte progenitors). This is usually due to sources of variability different to cell identity present in the data Additional, supervised finetuning by local sub clustering and merging would be needed to correct for this. On the contrary, we believe that our GLMnet-based method focusses on gene expression related to identity, resulting in a classification that is better suited for our purpose.

      Figure 1. Comparison between GLMnet predictions and manually annotated clusters. A) Heatmaps showing percentages of cells in manually annotated clusters (columns) that have been assigned to each of the cell identities predicted by our GLMnet classification method (rows). The analysis was performed independently for the elderly integrated dataset and for every MDS sample. B) UMAP plots showing disagreements in classification between adjacent cell compartments in the erythroid branch. Cells from one erythroid cluster per patient are colored by the identity assigned by the GLMnet classifier. Cells in gray are neither in the highlighted cluster nor labeled as MEP, erythroid early or erythroid late by our classifier.
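
      The comparison summarized in the heatmaps (the percentage of cells in each manually annotated cluster that received each predicted identity) can be sketched as follows. This is an illustrative sketch only, not the code used for the analysis: the function name `confusion_percentages` and the toy labels are hypothetical, and only the Python standard library is assumed.

```python
from collections import Counter, defaultdict

def confusion_percentages(cluster_labels, predicted_labels):
    """Return {cluster: {predicted identity: percent of that cluster's cells}}."""
    if len(cluster_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    # Count, per manually annotated cluster, how often each prediction occurs.
    counts = defaultdict(Counter)
    for cluster, predicted in zip(cluster_labels, predicted_labels):
        counts[cluster][predicted] += 1
    # Normalize each cluster (heatmap column) so its percentages sum to 100.
    table = {}
    for cluster, row in counts.items():
        total = sum(row.values())
        table[cluster] = {pred: 100.0 * n / total for pred, n in row.items()}
    return table

# Toy labels, invented for illustration only (not data from the study).
manual = ["HSC", "HSC", "HSC", "HSC", "MEP", "MEP"]
predicted = ["HSC", "HSC", "HSC", "LMPP", "MEP", "Erythroid early"]
table = confusion_percentages(manual, predicted)
```

      Normalizing within each manual cluster, rather than over all cells, keeps small clusters readable next to large ones, which matches the column-wise percentages described in the caption.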

      2) The cell-type composition changes in Figures 1 and 4 were descriptively presented without providing the statistical significance of the changes. In addition, the age-dependent cell-type composition changes should be validated by flow cytometry.

      We thank the reviewer for the comment. Significance of the changes is included in Supplementary File 3. In addition, we included the percentage of several cell types we validated by flow cytometry, namely HSCs, GMPs and MEPs, in young and elderly healthy individuals in the manuscript, as Figure 1-figure supplement 3. Similarly to what we detected in our bioinformatic analyses, flow cytometry data demonstrated a significant increase in the percentage of HSCs, as well as an increasing trend in MEPs and a slight decrease in the percentage of GMPs in elderly individuals, corroborating our previous results.

      3) In Figure 2, the authors used two different pseudo-time inference methods, STREAM, and Palantir. It is not clear why they used two different methods for trajectory inference. Do they provide the same differentiation trajectories? How robust are the results of trajectory inference algorithms? It seems to be inconsistent that the pseudotime inferred by STREAM was not used for downstream analysis and the new pseudotime was recalculated by using Palantir.

      We thank the reviewer for the comment. We used two different methods for similar analyses because each provides specific outputs that, together, allow a more robust and comprehensive analysis. STREAM unravels the differentiation trajectories in a single-cell dataset with an unsupervised approach. In addition, the visualization provided by STREAM (Figure 2C and 2D) makes the results simple for the reader to interpret. Palantir, on the other hand, provides a more robust analysis of how gene expression dynamics interact and change along differentiation trajectories. For this reason, we decided to use this second method to investigate how specific genes were altered in the monocytic compartment.

      As this is a resource article, showcasing different methods can be valuable, as it provides examples of how each tool can be used to obtain specific results, which can help readers decide which tool is best for their specific case.

      To confirm that the pseudotime results are similar, we performed a correlation analysis of the pseudotime values obtained from each method. We observed a correlation coefficient of 0.78 (p-value < 2.2e-16), confirming the similarity between the two tools.

      Figure 2. Correlation analysis of pseudotime values obtained with STREAM and PALANTIR.
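
      A correlation check of this kind can be sketched as below. The response does not state which correlation coefficient was used, so Pearson's r is an assumption here, and the function name `pearson_r` and the toy pseudotime values are hypothetical and for illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy per-cell pseudotime values, invented for illustration only.
# The two tools may use different scales; correlation is scale-invariant.
stream_pt = [0.0, 0.1, 0.4, 0.6, 1.0]
palantir_pt = [0.0, 0.2, 0.8, 1.2, 2.0]
r = pearson_r(stream_pt, palantir_pt)
```

      Because pseudotime is an ordering, a rank-based coefficient would be an equally reasonable choice; the sketch only illustrates the general idea of comparing per-cell values from the two methods.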

      4) In Figure 2D, some HSCs seem to be committed to the erythroid lineage. The authors should carefully examine whether these HSCs are genuinely HSCs, not early erythroid progenitors.

      We thank the reviewer for the comment. We have performed an in-depth analysis of the classification of HSCs (see Figure 3). Our analyses reveal that none of the cells classified as HSCs express early erythroid progenitor markers. We have also used STREAM to show the expression of these markers along the obtained trajectory and observed that erythroid markers are expressed in the erythroid trajectory but not in the HSC compartment (Figure 4).

      Figure 3 Expression of marker genes in the HSC compartment. Dot plot depicting the normalized scaled expression of canonical marker genes by HSC of the 5 young and 3 elderly healthy donors. Marker genes are colored by the cell population they characterize. Dot color represents expression levels, and dot size represents the percentage of cells that express a gene.

      Figure 4. Expression of erythroid markers in STREAM trajectories. Expression of GATA1 and HBB (erythroid markers) in the predicted differentiation trajectories.

      5) It is not clear how the authors draw a conclusion from Figure 3D that the number of common targets between transcription factors is reduced. Some quantifications should be provided.

      We thank the reviewer for the comment. We have updated the manuscript to better reflect our findings and to emphasize that the predicted regulatory network of HSCs in elderly donors is displayed as an independent network, compared to that of the young donors (Page 6, line 36).

      “Overall, we observed that the predicted regulatory network of elderly HSCs (Figure 3d) appeared as an independent network compared to the young GRN. This finding could result in the loss of co-regulatory mechanisms in the elderly donors.”

      6) The constructed GRNs and related descriptions were based solely on the SCENIC analysis. By providing the results of an orthogonal prediction method for GRNs, the authors should evaluate how robust and consistent their predictions are.

      We thank the reviewer for the comment regarding the method to build gene regulatory networks. As a resource article, our manuscript describes a complete workflow covering different aspects of single-cell analysis. These steps range from automated classification to trajectory inference and GRN prediction. All the selected algorithms have already been benchmarked and compared against other tools that perform similar analyses. SCENIC has already been benchmarked against other algorithms (11) and by others (12).

      We agree with the reviewer that these new predictions could strengthen our findings; however, we believe that these orthogonal predictions would be a better fit if our article were intended for the Research Article category instead of Tools and Resources.

      7) The observed age-dependent cellular and molecular alterations in human hematopoiesis are interesting, but I'm wondering whether the observed alterations are driven by inflammatory microenvironment or intrinsic properties of a subpopulation of HSCs affected by clonal hematopoiesis (CH). To address this, the authors can perform genotyping of transcriptomes (GoT) on old healthy donors with CH. By comparing the transcriptomes of cells with and without CH mutations, we can evaluate the effects of CH on age-associated molecular alterations.

      We thank the reviewer for the comment. Unfortunately, performing GoT (genotyping of transcriptomes) on the healthy donors requires modifying the standard 10x Genomics workflow to amplify the targeted locus and transcript of interest. This would require collecting new samples, optimizing the method and performing a new analysis from scratch (from sequencing through to analysis). We believe this is beyond the scope of the manuscript. In addition, we do not have enough material to create new single-cell libraries; this would require adding new donors and, as a result, a completely new analysis to perform the integration.

      Reviewer #3 (Public Review):

      The authors have performed a transcriptional analysis of young/aged hematopoietic stem/progenitor cells which were obtained from normal individuals and those with MDS.

      The authors generated an important and valuable dataset that will be of considerable benefit to the field. However, the data appear to be over-interpreted at times (for example, GSEA analysis does not have "functionality", as the authors claim). On the other hand, a comparison between normal-aged HSC and HSC from MDS patients appears to be under-explored in trying to understand how this disease (which is more common in the elderly) disrupts HSC function.

      A more extensive cross-referencing of other normal HSPC/MDS HSPC datasets from aged humans would have been helpful to highlight the usefulness of the analytical tools that the authors have generated.

      Major points

      1) The authors detail methodology for identification of cell types from single-cell data - GLMnet. This portion of the text needs to be clarified as it is not immediately clear what it is or how it's being used. It also needs to be explained by what metric the classifier "performed better among progenitor cell types" and why this apparent advantage was sufficient to use it for the subsequent analysis. This is critical since interpretation of the data that follows depends on the validation of GLMnet as a reliable tool.

      We thank the reviewer for the comment. We have updated the corresponding section to better describe how GLMnet is used. We also explain that our decision to use GLMnet as our cell type annotation method, instead of other available tools such as Seurat, is based on the results of the benchmark described in Figure 1-figure supplement 1. We also describe the main differences between our method and Seurat (see Answer to Reviewer 1, Question #4).

      2) The finding of an increased number of erythroid progenitors and decreased number of myeloid cells in aged HSPC is surprising since aging is known to be associated with anemia and myeloid bias. Given that the initial validation of GLMnet is insufficiently described, this result raises concerns about the method. Along the same lines, the authors report that their tool detects a reduced frequency of monocyte progenitors. How does this finding correlate with the published data on aging humans? Is monocytopenia a feature of normal aging?

      We thank the reviewer for this comment, as changes in the output of HSCs as a consequence of aging are of high interest. According to the literature, there is clear evidence of the loss of lymphoid progeny with age (13,14), which is in agreement with our results. However, in the case of the myeloid compartment, the effects of aging are not as clear. Studies in mice have indeed observed that the loss of lymphoid cells is accompanied by increased myeloid output, starting at the level of GMPs (Rossi et al. 2005; Florian et al. 2012; Min et al. 2006). However, studies in humans have not found changes in the numbers of these myeloid progenitors (Kuranda et al. 2011; Pang et al. 2011). In addition, in the mentioned studies, myeloid production was measured exclusively by its white blood cell fraction. More recent studies have focused on the other myeloid compartments: megakaryocyte and erythroid cells. Results point towards an increase in platelet-biased HSCs with age (Sanjuan-Pla et al. 2013; Grover et al. 2016) and a possible expansion of megakaryocytic and erythroid progenitor populations (Yamamoto et al. 2018; Poscablo et al. 2021; Rundberg Nilsson et al. 2016), which may represent a compensatory mechanism for the ineffective differentiation towards this lineage in elderly individuals. This is in line with the accumulation of MEPs we see in our data. Finally, and in accordance with the reduced frequency of monocyte progenitors observed, it has been shown that with increasing age, there is a gradual decline in the monocyte count (15).

      Regarding the concerns about our classification method raised by the reviewer, we have performed additional validations that we describe in answers to reviewer 1 comment #4 and reviewer 2 comment #1. To further confirm that the changes in cellular proportions we found are real, we applied two additional classification methods: Seurat transfer and Celltypist (16) to the elderly donors dataset. We obtained a similar expansion in MEPs, together with reduction of monocytic progenitors with the three methods (Figure 5).

      Figure 5. Classification of HSPCs from elderly donors. Barplot showing proportions of every cell subpopulation per elderly donor, resulting from three classification methods: GLMnet-based classifier, Seurat transfer and Celltypist. For the three methods, cells with prediction scores < 0.5 were labeled as “not assigned”.
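
      The "not assigned" rule in the caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in the analysis: it assumes each cell has one prediction score per candidate identity and that the top-scoring identity is kept only when its score reaches the threshold. The function name `assign_labels` and the toy scores are hypothetical.

```python
def assign_labels(scores_per_cell, threshold=0.5):
    """Pick the top-scoring identity per cell, or 'not assigned' below threshold."""
    labels = []
    for scores in scores_per_cell:
        # scores maps candidate identity -> prediction score for one cell.
        best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
        # Scores strictly below the threshold are left unassigned.
        labels.append(best_label if best_score >= threshold else "not assigned")
    return labels

# Toy prediction scores, invented for illustration only.
cells = [
    {"HSC": 0.90, "LMPP": 0.10},
    {"HSC": 0.40, "MEP": 0.45},
    {"MEP": 0.50, "GMP": 0.30},
]
labels = assign_labels(cells)
```

      Applying the same cutoff to all three classifiers keeps the "not assigned" fraction comparable across methods, although the scores produced by different tools are not necessarily calibrated in the same way.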

      3) The use of terminology requires more clarity in order to better understand what kind of comparison has been performed, i.e. whether global transcriptional profiles are being compared, or those of specific subset populations. Also, the young/aged comparisons are often unclear, i.e. it's not evident whether the authors are referring to genes upregulated in aged HSC and downregulated in young HSC or vice versa. A more consistent data description would make the paper much easier to read.

      We thank the reviewer for this comment. We have updated the manuscript to provide more clarity in the description of the different comparisons made in our analyses. Most changes are located in the Transcriptional profiling of human young and elderly hematopoietic progenitor systems sub-section within the Results.

      4) The link between aging and MDS is not explored but could be an informative use of the data that the authors have generated. For example, anemia is a feature of both aging and MDS whereas neutropenia and thrombocytopenia only occur in MDS. Are there any specific pathways governing myeloid/platelet development that are only affected in MDS?

      Thank you for raising this comment. We believe that discriminating events that take place during healthy aging from those associated with MDS will be helpful to understand this particular disease, as it is so closely related to age. This is why, when analyzing MDS, we have considered young and elderly donors as two separate sets of healthy controls, the elderly donors being the most suitable one for comparisons with MDS samples.

      With regard to the comment on myeloid and platelet development, the GSEA analysis gives potentially useful information. MYC targets and oxidative phosphorylation are significantly enriched in the MEP compartment from MDS patients when compared to elderly donors, indicating that these progenitors may recover a more active profile with the disease. Hypoxia-related genes, on the other hand, are more active in HSCs and MEPs from healthy elderly donors than in MDS. Hypoxia is known to be implicated in megakaryocyte and erythroid differentiation (17).

      5) MDS is a very heterogeneous disorder and while the authors did specify that they were using samples from MDS with multilineage dysplasia, more clinical details (blood counts, cytogenetics, mutational status) are needed to be able to interpret the data.

      We thank the reviewer for the comment. All the clinical details for each MDS patient are included in Supplementary File 5.

    1. Author Response

      Reviewer #3 (Public Review):

      Dysbiosis has a substantial impact on host physiology. Using the nematode C. elegans and E.coli as a model of host-microbe interactions, Yang et al. defined a mechanism by which the host deals with gut dysbiosis to maintain fitness. They found that accumulation of E. coli in the intestine secreted indole, a tryptophan metabolite, and activated the transcription factor DAF-16. DAF-16 induced the expression of lys-7 and lys-8, which in turn limited E. coli proliferation in the gut of worms and maintained the longevity of worms. Finally, these authors demonstrated that indole-activated DAF-16 via TRPA-1 in neurons of worms.

      This study revealed a new mechanism of host-microbe interaction. The concept of their work is of broad interest and the results they present are convincing. However, there are some issues that need to be addressed to support the conclusions.

      Major issues

      1) The authors isolated the crude extract from a high-performance liquid chromatograph (HPLC). A candidate compound was detected by activity-guided isolation and further identified as indole with mass spectrometry and NMR data. The HPLC fractionations and activity-guided isolation experiments should be described in more detail with a schematic figure to reveal how these experiments were performed and how indole was identified. Showing a chemical characterization of indole in Figure 2A is not sufficient for the evaluation of the results. Rather, a figure comparing the fraction 26th with standard indole by MS and NMR is more appealing.

      We appreciate the concerns of the reviewer. Activity-guided isolation was performed as follows: The crude extract of E. coli supernatant metabolites was divided into 45 fractions according to polarity using Ultimate 3000 HPLC (Thermofisher, Waltham, MA) coupled with automated fraction collector. After freeze-drying each fraction, 1 mg of metabolites were dissolved in DMSO for DAF-16 nuclear localization assay in worms (Please see new Supplementary Table S2). The 26th fraction with DAF-16 nuclear translocation-inducing activity was then separated on silica gel column (200-300 mesh) with a continuous gradient of decreasing polarity (100%, 70%, 50%, 30%, petroleum ether/acetone) to yield four fractions (26a-d). Only the fraction of 26b could induce DAF-16 nuclear translocation. Then the fraction was further separated using a Sephadex LH-20 column to yield 32 fractions. The 26b-11th fraction with DAF-16 nuclear translocation-inducing activity contained a single compound identified by thin layer chromatography, mass spectrometry and nuclear magnetic resonance (NMR). The compound exhibited a quasimolecular ion peak at m/z 181.0782 [M+H]+ in the positive APCI-MS, and was assigned to a molecular formula of C8H7N. A comparison of these 1H NMR and 13C NMR spectra with the data reported in the literature revealed that the compound was indole (Yagudaev, 1986). The figure shows the comparison of the 26b-11 fraction with the standard indole by MS (Author response image 1).

      Author response image 1.

      High resolution mass spectrum of the candidate compound and indole.

      2) DAF-16::GFP was mainly located in the cytoplasm of the intestine in worms expressing daf-16p::daf-16::gfp fed live E. coli OP50 on Day 1 (Figure 1A and 1B). The nuclear translocation of DAF-16 in the intestine was increased in worms fed live E. coli OP50 on Days 4 and 7, but not in age-matched WT worms fed heat-killed (HK) E. coli OP50 (Figure 1A and 1B). Since DAF-16 functions downstream of DAF-2, have the levels of DAF-2 been tested during aging on OP50 and (HK) OP50, or with and without indole supplementation?

      It has been shown that DAF-2 initiates a kinase cascade that leads to the phosphorylation and cytoplasmic retention of DAF-16. By contrast, a reduction in DAF-2 signaling leads to the dephosphorylation of DAF-16, allowing its nuclear translocation. In response to the reviewer’s suggestion, we used RT-PCR to test the expression of daf-2 in 4-day-old and 7-day-old worms fed with OP50 and (HK) OP50. We found that the mRNA levels of daf-2 were significantly increased in worms on days 4 and 7 in the presence of either live or dead E. coli OP50, compared with those in worms on day 1 (Author response image 2A). In addition, supplementation with indole did not alter the mRNA levels of daf-2 in young adult worms (Author response image 2B). We therefore conclude that the activation of DAF-16 is independent of DAF-2.

      Author response image 2.

      DAF-16 nuclear translocation is independent of DAF-2. (A) The mRNA levels of daf-2 were gradually increased in worms with age. P < 0.01; *P < 0.001; ns, not significant. (B) The mRNA levels of daf-2 were not altered after treatment with indole for 24 hours. ns, not significant.

      3) In lines 155-157, the author argued that the increase in the levels of indole in worms results from the intestinal accumulation of live E. coli OP50, rather than exogenous indole produced by E. coli OP50 on the NGM plates. However, the work also showed that supplementation with indole (50-200 μM) could significantly increase the indole levels in young adult worms on Day 1 (Figure 2-figure supplement 3B), which could induce nuclear translocation of DAF-16 in worms (Figure 2B). This result suggested that worms could take in indole from outside culturing environment. The concentration of indole in OP50 and (HK) OP50 could be measured.

      We appreciate the concerns of the reviewer. Reviewer #2 also pointed out this problem. In this study, our data showed that the levels of indole were 30.9, 71.9, and 105.9 nmol/g dry weight in worms fed live E. coli OP50 on days 1, 4, and 7, respectively (Figure 2C). This increase in the levels of indole in worms was accompanied by an increase in CFU of live E. coli OP50 in the intestine of worms with age (Figure 2C). In addition, we determined the levels of indole in worms fed HK E. coli OP50, and found that the levels of indole were 28.2, 31.6, and 36.1 nmol/g dry weight in worms fed HK E. coli OP50 on days 1, 4, and 7, respectively (Figure 2-figure supplement 3A). It should be noted that the levels of indole in worms fed dead E. coli OP50 on day 1 were comparable to those in worms fed live E. coli OP50 on day 1 (28.2 vs 30.9 nmol/g dry weight). However, the levels of indole were not increased in worms fed HK E. coli OP50 on days 4 and 7. Furthermore, the observation that DAF-16 was retained in the cytoplasm of the intestine in worms fed live E. coli OP50 on day 1 (Figure 1A and 1B) also indicated that indole produced by E. coli OP50 on the NGM plates is not enough to induce DAF-16 nuclear translocation. By contrast, supplementation with indole (50-200 μM) significantly increased the indole levels in worms on day 1 (Figure 2-figure supplement 3B), which could induce nuclear translocation of DAF-16 in worms (Figure 2B). Thus, the increase in the levels of indole in worms with age results from intestinal accumulation of live E. coli OP50, rather than indole produced by E. coli OP50 on the NGM plates.

      4) Recent work showed that the multicopy DAF-16 transgene acts differently from the single copy GFP knock in DAF-16 transgene. Which DAF-16 transgene was used in this work?

      The strain we used is TJ356. Its genotype has been described as zIs356 [daf-16p::daf-16a/b::GFP+rol-6(su1006)] (Lee, Hench, & Ruvkun, 2001; Lin, Hsin, Libina, & Kenyon, 2001), from the Caenorhabditis Genetics Center (CGC).

      5) In lines 190-193, the author argued that the supplementation with indole (100 μM) inhibited the CFU of E. coli K-12 in WT worms, but not daf-16(mu86) mutants, on Days 4 and 7 (Figure 3H and 3I). These results suggest that endogenous indole is involved in maintaining a normal lifespan in worms. This is overstating. The data here more likely suggest that indole could inhibit the proliferation of E. coli through DAF-16.

      We really appreciate this reviewer’s preciseness. In response to the reviewer’s suggestion, we have changed "...indole is involved in maintaining a normal lifespan in worms" to "...indole produced by bacteria in the gut could inhibit the proliferation of E. coli via DAF-16 in worms".

      6) Sonowal (2017) reported that AHR mediates indole-promoted lifespan extension at 16 °C. Yet this work argued that RNAi knockdown of ahr-1 did not affect the nuclear translocation of DAF-16 in worms fed E. coli K12 strain on Day 7 (Figure 4-figure supplement 1A) or young adult worms treated with indole (100 μM) for 24 h. The difference between these two works should be discussed.

      We really appreciate this reviewer’s preciseness. It has been shown that AHR-1 mediates indole-promoted lifespan extension in worms at 16 °C (Sonowal et al., 2017). However, our data show that AHR-1 is not involved in the indole-induced nuclear translocation of DAF-16 at 20 °C. This means that AHR-1-mediated and TRPA-1-mediated lifespan extension by indole are essentially different. In our study, indole is added to NGM plates when worms reach the young adult stage. In the study by Sonowal et al., indole is supplemented at the L1 larval stage. In addition, the lifespan of C. elegans varies at different temperatures (Xiao et al., 2013). Thus, indole may promote lifespan extension via different mechanisms, depending on exposure time and temperature.

      7) Sonowal (2017) conducted mRNA profiling for worms growing on K12 and K12△tnaA. Is TRPA1 in their de-regulated gene list? Have other de-regulated genes been tested in this work?

      We appreciate the concerns of the reviewer. We found that TRPA-1 is not included in the de-regulated gene list. Sonowal et al. focus on the gene expression profiles in worms from L1 larvae to young adults, whereas we pay attention to gene expression profiles in worms from young adults to aged worms. Thus, we did not test the de-regulated genes in their work.

      8) How does indole activate TRPA1? In the absence of trpa1, what is the concentration of indole in worms? Since TRPA1 is a channel, is there any possibility that TRPA1 is involved in the transport of indole? It is really interesting and surprising that neuronal TRPA-1, but not intestinal TRPA-1, mediates the beneficial effect of indole. How does indole specifically activate TRPA-1 in neurons to preserve the longevity of worms?

      We appreciate the concerns of the reviewer. TRPA1 is a nonselective cation channel permeable to Ca2+, Na+, and K+ (Zygmunt & Hogestatt, 2014). It is unlikely that TRPA1 is capable of transporting heterocyclic organic compounds, such as indole.

      In response to the reviewer’s suggestion, we detected the content of indole in trpa-1(ok999) worms. We found that the levels of indole in trpa-1(ok999) worms were slightly increased in worms on days 4 and 7, compared to those in WT worms on days 4 and 7 (Author response image 3).

      Recently, Ye et al. demonstrated that indole and indole-3-carboxaldehyde (IAld) are agonists of TRPA1, which is conserved in vertebrates (Ye et al., 2021). Thus, it is most likely that indole acts as an agonist of TRPA-1 in C. elegans by directly binding to TRPA-1. One possibility is that activation of TRPA-1 in neurons by indole induces a pathway that releases a neurotransmitter, which in turn triggers a signaling pathway that extends the lifespan of worms by activating DAF-16 in a non-cell-autonomous manner. In contrast, the activation of TRPA-1 in the intestine by indole is unable to release such a neurotransmitter. Indeed, TRPA1 induces the release of calcitonin gene-related peptide in perivascular sensory nerves, leading to membrane hyperpolarization and arterial dilation on smooth muscle cells (Talavera et al., 2020). Moreover, the activation of TRPA1 by indole and IAld induces the secretion of the neurotransmitter serotonin in zebrafish (Ye et al., 2021).

      Author response image 3.

      The indole levels in trpa-1 mutants are increased on days 4 and 7, compared with those in WT worms. *P < 0.05.

      9) How was neuronal- and intestinal-specific knockdown of trpa-1 by RNAi conducted? And what is the tissue-specific expression pattern of trpa-1? Speculating how indole was transported to neuron cells is pretty appealing.

      We appreciate the concerns of the reviewer. SID-1 is required cell-autonomously for systemic RNAi (Winston, Molodowitch, & Hunter, 2002). Thus, sid-1 mutants are resistant to RNAi. In the neuronal- and intestinal-specific RNAi strains, sid-1 was expressed under control of the neuronal-specific unc-119 and the intestinal-specific vha-6 promoters, respectively. Although it has been reported that TRPA-1 is expressed in neurons, muscles, hypodermal cells, and the intestine, Xiao et al. showed that only TRPA-1 expressed in the intestine and neurons contributes to life extension at low temperature (Xiao et al., 2013). The transporter of indole has not been identified. In Arabidopsis, ATP-binding cassette (ABC) transporter G family 37 (ABCG37) has been reported to transport a range of indole derivatives (Ruzicka et al., 2010). However, all fifteen C. elegans ABC transporters share less than 30% sequence identity with ABCG37. Thus, it is not possible to determine from sequence similarity which one is the transporter for indole and indole derivatives in C. elegans.

      10) Supplementation with indole only up-regulated the expression of lys-7 and lys-8 in worms subjected to intestinal-specific (Figure 7-figure supplement 2C), but not neuronal-specific, RNAi of trpa-1 (Figure 7-figure supplement 2D). If this is the case, should the addition of indole specifically induce the expression of lys-7p::gfp or lys-8p::gfp in neurons?

      We really appreciate this reviewer’s preciseness. Indeed, lys-7 and lys-8 are expressed in both neurons and the intestine (Author response image 4A and 4C). However, the expression of lys-7p::gfp and lys-8p::gfp in neurons was not altered in worms after treatment with indole or knockdown of trpa-1 by RNAi (Author response image 4B and 4D).

      Author response image 4.

      The expression of LYS-7 and LYS-8 in neurons is not altered after treatment with indole or knockdown of trpa-1 by RNAi. (A and C) Representative images of lys-7p::gfp (A) and lys-8p::gfp (C). Both lys-7 and lys-8 could be expressed in neurons and the intestine. (B and D) Quantification of fluorescent intensity of lys-7p::gfp (B) and lys-8p::gfp (D) in neurons. These results are means ± SD of three independent experiments. ns, not significant.

      11) The authors demonstrated that K-12△tnaA strain had undetectable tnaA mRNA or indole levels. Furthermore, the deletion of tnaA significantly inhibited the nuclear translocation of DAF-16 in worms. However, mutations in E. coli still have non-specific effects as there are several transposon insertions or polar mutations influencing downstream genes. The authors should demonstrate that only disruption of TnaA causes the failure of nuclear translocation of DAF-16.

      In response to the reviewer’s suggestion, we rescued the expression of tnaA in the K-12 △tnaA strain. As expected, the indole level in the supernatant of the K12 △tnaA::tnaA strain cultures was 34.1 μmol/L, which was comparable to that in the K12 strain cultures (42.5 μmol/L) (new Figure 2-figure supplement 4D). In addition, DAF-16 nuclear accumulation was increased in worms grown on the K12 △tnaA::tnaA strain on days 4 and 7 (new Figure 2-figure supplement 4E).

    1. Author Response

      Reviewer #1 (Public Review):

      “A sample size of 3 idiopathic seems underpowered relative to the many types of genetic changes that can occur in ASD. Since the authors carried out WGS, it would be useful to know what potential causative variants were found in these 3 individuals and even if not overlapping if they might expect to be in a similar biological pathway.

      If the authors randomly selected 3 more idiopathic cell lines from individuals with autism, would these cell lines also have altered mTOR signaling? And could a line have the same cell biology defects without a change in mTOR signaling? The authors argue that the sample size could be the reason for lack of overlap of the proteomic changes (unlike the phosphor-proteomic overlaps), which makes the overlapping cell biology findings even more remarkable. Or is the phenotyping simply too crude to know if the phenotypes truly are the same?”

      We appreciate these thoughtful comments and agree that, among several possible models, our studies indicate the possibility of mTOR alteration in multiple forms of ASD. As above, we are currently pursuing this hypothesis with newly acquired DOD support. With regard to the I-ASD population, we agree that a large variety of genetic changes can occur in genetically undefined ASDs. Indeed, this is precisely why we expected to see “personalized” phenotypes in each I-ASD individual when we embarked on this study. At that time, several years ago, we had planned to expand the analyses to more I-ASD individuals to assess for additional personalized phenotypes. However, as our studies progressed, we were surprised to find convergence in our I-ASD population in terms of neurite outgrowth and migration, and later in proteomic results showing convergence on mTOR. We found it particularly remarkable that this convergence was noted despite a sample size of 3. When we had the opportunity to extend our studies to the 16p11.2 deletion population, we were thrilled to conduct the first comparison between I-ASD and a genetically defined ASD and, as such, the scope of the paper turned towards this comparison. We do agree that analyses of the other I-ASD individuals would be a beneficial endeavor, both to understand how pervasive NPC migration and neurite deficits are in autism and to assess the presence of mTOR dysregulation. Furthermore, it would be important to see whether alterations in other pathways could also lead to similar cell biological deficits, though we know that other studies of neurodevelopmental disorders have found such cellular dysregulations without reporting concurrent mTOR dysregulation. Given our current grant funding to extend these analyses, such experiments within this manuscript would not be feasible.

      Regarding the phenotyping methods used, we decided to assess neurite outgrowth and migration as they are both cytoskeleton dependent processes that are critical for neurodevelopment and are often regulated by the same genes. Furthermore, similar analyses have been applied to Fragile-X Syndrome, 22q11.2 deletion syndrome, and schizophrenia NPCs (Shcheglovitov A. et al., 2013; Mor-Shaked H. et al., 2016; Urbach A. et al., 2010; Kelley D. J. et al., 2008; Doers M. E. et al., 2014; Brennand K. et al., 2015; Lee I. S. et al., 2015; Marchetto M. C. et al., 2011). As such, it seems that multiple underlying etiologies can lead to similar dysregulated cellular phenotypes that can contribute to a variety of neurodevelopmental disorders. On a more global level, there are only a few different cellular functions a developing neuron can undergo, and these include processes such as proliferation, survival, migration, and differentiation. Thus, to understand neurodevelopmental disorders, it is important to study the more “crude” or “global” cellular functions occurring during neurodevelopment to determine whether they are disrupted in disorders such as ASD. In our studies we find that there are indeed dysregulations in many of these basic developmental processes, indicating that the typical steps that occur for normal brain cytoarchitecture may be disrupted in ASD. To understand why, we then further utilized molecular studies to “zoom” in on potential mechanisms which implicated common dysregulation in mTOR signaling as one driver for these common cellular phenotypes. As suggested, we did complete WGS on all the I-ASD individuals and did not see any overlapping genetic variants between the three I-ASD individuals as mentioned in our manuscript. The genetic data was published in a larger manuscript incorporating the data (Zhou A. et al., 2023). 
However, there were variants that were unique to each I-ASD individual and not seen in their unaffected family members, and it is possible these variants could be contributing to the I-ASD phenotypes. We also used IPA to conduct pathway analysis on the WGS data, using the same approach as in our analysis of the p-proteome and proteome data. From the WGS data, we selected high read-quality variants that were found only in I-ASD individuals and had a functional impact on protein (i.e., excluding synonymous variants). The enriched pathways obtained from these data were strikingly different from the pathways we found in the p-proteome analysis and are now included in supplemental Figure 6 in the manuscript. Briefly, the top 5 enriched pathways were: O-linked glycosylation, MHC class 1 signaling, Interleukin signaling, Antigen presentation, and regulation of transcription.

      Reviewer #2 (Public Review):

      1) I found that interpreting how differential EF sensitivity is connected to the rest of the story difficult at times. First, it is unclear why these extracellular factors were picked. These are seemingly different in nature (a neuropeptide, a growth factor and a neuromodulator) targeting largely different pathways. This limits the interpretation of the ASD subtype-specific rescue results. One way of reframing that could help is that these are pro-migratory factors instead of EFs broadly defined that fail to promote migration in I-ASD lines due to a shared malfunctioning of the intracellular migration machinery or cell-cell interactions (possibly through tight junction signaling, Fig S2A). Yet, this doesn't explain the migration/neurite phenotypes in 16p11 lines where EF sensitivity is not altered, overall implying that divergent EF sensitivity independent of underlying mTOR state. What is the proposed model that connects all three findings (divergent EF sensitivity based on ASD subtypes, 2 mTOR classes, convergent cellular phenotypes)?

      We thank you for the kind assessment of our manuscript and for the thought-provoking questions posed. In terms of extracellular factors, for our study, we defined an extracellular factor as any growth factor, amino acid, neurotransmitter, or neuropeptide found in the extracellular environment of the developing cells. The EFs utilized were selected due to their well-established role in regulation of early neurodevelopmental phenotypes, their expression during the “critical window” of mid-fetal development (as determined by the Allen Brain Atlas), and in the case of 5-HT, its association with ASD (Abdulamir H. A. et al., 2018; Adamsen D. et al., 2014; Bonnin A. et al., 2011; Bonnin A. et al., 2007; Chen X. et al., 2015; El Marroun H. et al., 2014; Hammock E. et al., 2012; Yang C. J. et al., 2014; Dicicco-Bloom E. et al., 1998; Lu N. et al., 1998; Suh J. et al., 2001; Watanabe J. et al., 2016; Gilmore J. H. et al., 2003; Maisonpierre P. C. et al., 1990; Dincel N. et al., 2013; Levi-Montalcini R., 1987). Lastly, prior experiments in our lab with a mouse model of neurodevelopmental disorders had shown atypical responses to EFs (IGF-1, FGF, PACAP). As such, when we first chose to use EFs in human NPCs we wanted to know 1) whether human NPCs even responded to these EFs, 2) whether EFs regulated neurite outgrowth and migration, and 3) whether there would be a differential response in NPCs derived from those with ASD. Our studies were initiated on the I-ASD cohort and, given the heterogeneity of ASD, we had hypothesized we would get “personalized” neurite and migration phenotypes. For this reason, we also wanted to select multiple types of EFs that worked on different signaling pathways. Ultimately, instead of personalized phenotypes, we found that all the I-ASD NPCs did not respond to any of the EFs tested whereas the 16p11.2 deletion NPCs did; this was therefore the only difference we found between these two “forms” of ASD. 
As noted, in I-ASD the lack of response to EFs can be ameliorated by modulating mTOR. However, in the 16p11.2 deletion, despite similar mTOR dysregulation as seen in I-ASD, there is no EF impairment. We do not have a cohesive model to explain why the 16pDel individuals differ from the I-ASD model other than to point to the p-proteomes, which do show that the 16pDel NPCs are distinct from the I-ASD NPCs. It seems that mTOR alteration can contribute to impaired EF responsiveness in some NPCs, but perhaps there is an additional defect that needs to be present in order for this defect to manifest, or 16p11.2 deletion NPCs have specific compensatory features. For example, as noted in the thoughtful comment, the p-proteome canonical pathway analysis shows tight junction malfunction in I-ASD which is not present in the 16pDel NPCs, and it could be the combination of mTOR dysregulation and dysregulated tight junction signaling that has led to lack of response to EFs in I-ASD. Regardless, we do not think the differences between two genetically distinct ASDs diminish the convergent mTOR results we have uncovered. That is, regardless of whatever defects are present in the ASD NPCs, we are able to rescue them with mTOR modulation, which has fascinating implications for treatment and conceptualization of ASD. Lastly, we see our EF studies as an important inclusion as they show that in some subtypes of ASD, lack of response to appropriate EFs could be contributing to neurodevelopmental abnormalities. Moreover, lack of response to these EFs could have implications for treatment of individuals with ASD (for example, SSRIs are commonly used to treat co-morbid conditions in ASD, but if an individual is unresponsive to 5-HT, perhaps this treatment is less effective). We have edited the manuscript to include an additional discussion section to address the EFs more thoroughly and have included a few extra sentences in the introduction as well!

      2) A similar bidirectional migration phenotype has been described in hiPSC-derived human cortical interneurons generated from individuals with Timothy Syndrome (Birey et al 2022, Cell Stem Cell). Here, authors show that the intracellular calcium influx that is excessive in Timothy Syndrome or pharmacologically dampened in controls results in similar migration phenotypes. Authors can consider referring to this report in support of the idea that bimodal perturbations of cardinal signaling pathways can converge upon common cellular migration deficits.

      We thank you for pointing out the similar migration phenotype in the Timothy Syndrome paper and have now cited it in our manuscript. We have also expanded on the concept of “too much or too little” of a particular signaling mechanism leading to common outcomes.

      3) Given that authors have access to 8 I-ASD hiPSC lines, it'd be very informative to assay the mTOR state (e.g. pS6 westerns) in NPCs derived from all 8 lines instead of the 3 presented, even without assessing any additional cellular phenotypes, which authors have shown to be robust and consistent. This can help the readers better get a sense of the proportion of high-mTOR vs low-mTOR classes in a larger cohort.

      We have already addressed this in response to reviewer 1 and the essential revisions section, providing our reasoning for not expanding the study to all 8 I-ASD individuals.

      4) Does the mTOR modulation rescue EF-specific responses to migration as well (Figure 7)?

      We did not conduct sufficient replicates of the EF-specific migration rescue experiments due to the time-consuming and resource-intensive nature of the neurosphere experiments. Unlike the neurite experiments, the neurosphere experiments require significantly more cells, more time, selection of neurospheres based on a size criterion, and then manual trace measurements. We did one experiment in Family-1 where we utilized MK-2206 to abolish the response of Sib NPCs to PACAP. Likewise, adding SC-79 to I-ASD-1 neurospheres allowed for a response to PACAP.

      Author response image 1.

      Author response image 2.

      Reviewer #3: Public Review

      We appreciate the kind, detailed and very thorough review you provided for us!

      The results on the mTOR signaling pathway as a point of convergence in these particular ASD subtypes is interesting, but the discussion should address that this has been demonstrated for other autism syndromes, and in the present manuscript, there should be some recognition that other signaling pathways are also implicated as common factors between the ASD subtypes.

      With regard to the mTOR pathway, we had included the other ASD syndromes in which mTOR dysregulation has been seen, including tuberous sclerosis, Cowden Syndrome, NF-1, as well as Fragile-X, Angelman, Rett and Phelan McDermid, in the final paragraph of the discussion section “mTOR Signaling as a Point of Convergence in ASD”. We have now expanded our discussion to note that other signaling pathways, such as MAPK, cyclins, WNT, and reelin, have also been implicated as common factors between the ASD subtypes.

      The conclusions of this paper are mostly well supported by data, but for the cell migration assay, it is not clear if the authors control for initial differences in the inner cell mass area of the neurospheres in control vs ASD samples, which would affect the measurement of migration.

      Thank you for this thoughtful comment! When we first started collecting our migration data, inner cell mass size was indeed a major concern for which we controlled in our methods. First, when plating the neurospheres, we would only collect spheres when a majority of spheres were approximately 100 µm in diameter. Very large spheres often could not be imaged due to being out of focus and very small spheres would often disperse when plated. Thus, there were some constraints on the variability of inner cell mass size.

      Furthermore, when we initially collected data, we conducted a proof-of-principle test to see whether initial inner cell mass area (henceforth referred to as initial sphere size or ISS) influenced migration data. To do so, we obtained migration and ISS data from each diagnosis (Sib, NIH, I-ASD, 16pASD). Then we used RStudio to test whether there is a relationship between Migration and ISS in each diagnosis category using the call lm(Migration ~ ISS, data = bydiagnosis). In this call, lm indicates linear modeling, the tilde (~) specifies that Migration is modeled as a function of ISS, and the term data = bydiagnosis allows the data to be organized by diagnosis.
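
      The per-diagnosis regression described above can be sketched as follows. This is a minimal Python analogue of the R call, not the authors' script: the function names `fit_line` and `fit_by_group`, the record layout, and the numbers in the usage example are all illustrative assumptions. It reports the slope, intercept and R-squared for each diagnosis; the p-value that R's summary of an lm fit also reports is omitted because it needs a t-distribution that the Python standard library does not provide.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared). Hypothetical helper for
    illustration; x would be ISS and y would be Migration.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has no variance; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R-squared equals the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


def fit_by_group(records):
    """Fit Migration ~ ISS separately for each diagnosis.

    records: iterable of (diagnosis, iss, migration) tuples (assumed layout).
    Returns {diagnosis: (slope, intercept, r_squared)}.
    """
    groups = {}
    for diagnosis, iss, migration in records:
        xs, ys = groups.setdefault(diagnosis, ([], []))
        xs.append(iss)
        ys.append(migration)
    return {d: fit_line(xs, ys) for d, (xs, ys) in groups.items()}


if __name__ == "__main__":
    # Made-up numbers purely to show the call; not data from the study.
    demo = [
        ("Sib", 100.0, 52.0), ("Sib", 110.0, 49.0), ("Sib", 95.0, 51.0),
        ("I-ASD", 102.0, 30.0), ("I-ASD", 98.0, 33.0), ("I-ASD", 105.0, 31.0),
    ]
    for diagnosis, (slope, intercept, r2) in fit_by_group(demo).items():
        print(diagnosis, round(slope, 3), round(intercept, 3), round(r2, 3))
```

      An R-squared near zero for a diagnosis group means that ISS explains almost none of the variance in Migration for that group, which is the pattern the response reports.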

      The results were expressed as R-squared values, indicating the correlation between ISS and Migration for each diagnosis, and p-values showing the statistical significance of each comparison. As shown in Author response table 1, there is minimal correlation between Migration and ISS in each data set. Moreover, there are no statistically significant relationships between Migration and ISS, indicating that initial sphere size DOES NOT influence migration data in any of our data-sets.

      Author response table 1.

      Lastly, utilizing R, we modeled what predicted migration would be like for Sib, NIH, I-ASD, and 16pASD if we accounted for ISS in each group. Raw migration data was then plotted against the predicted data as in Author response image 3.

      Author response image 3.

      As shown in the graph, there are no statistically significant differences between the raw migration data (the data that we actually measured in the dish) and the modeled data in which ISS is accounted for as a variable. As such, we chose not to normalize to or account for ISS in our other experiments. We have now included the above RStudio analyses in our supplemental figures (Figure S1) as well.

      Also, in Fig 5 and 6, panels I and J omit the effects of drug on mTOR phosphorylation as shown for other conditions.

      Both SC-79 and MK-2206 were selected for our experiments after thorough analysis of their effects on human epithelial cells and other cultured cells (citations in manuscript). However, initially, we did not know whether either of these drugs would modulate the mTOR pathway in human NPCs; thus, in Figures 5A, 5D, 6A and 6D we chose to focus on two of our data-sets to establish the effect of these drugs in human NPCs. Our experiments in Family-1 and Family-2 showed us that SC-79 increases pS6 in human NPCs while MK-2206 downregulates it. Once this was established, we expected the drugs to have similar effects in the NPCs from the other families. Thus, we only conducted a proof-of-principle test to confirm that the drugs do indeed have the intended effect in I-ASD-3 and 16pDel. We have included these proof-of-principle westerns in Figures 5I, 5K, 6I and 6K to show that the effects of these drugs are reproducible across all our NPC lines. We did not include quantification since the data are only from our single proof-of-principle western.

    1. Author Response

      Reviewer #1 (Public Review):

      Using fMRI-based univariate and multivariate analyses, Root, Muret, et al. investigated the topography of face representation in the somatosensory cortex of typically developed two-handed individuals and individuals with a congenital and acquired missing hand. They provide clear evidence for an upright face topography in the somatosensory cortex in all three groups. Moreover, they find that one-handers, but not amputees, show shorter distances from lip representations to the hand area, suggesting a remapping of the lips. They also find a shift away of the upper face from the deprived hand area in one-handers, and significantly greater dissimilarity between face part representations in amputees and one-handers. The authors argue that this pattern of remapping is different to that of cortical neighborhood theories and points toward a remapping of face parts which have the ability to compensate for hand function, e.g., using the lips/mouth to manipulate an object.

      These findings provide interesting insights into the topographic organization of face parts and the principles of cortical (re)organization. The authors use several analytical approaches, including distance measures between hand- and face-part-responsive regions and representational similarity analysis (RSA). Particularly commendable is the rigorous statistical analysis, such as the use of Bayesian comparisons, and careful interpretation of absent group differences.

      We thank the reviewer for their positive and constructive feedback.

      Reviewer #2 (Public Review):

      After amputation, the deafferented limb representation in the somatosensory cortex is activated by stimulation of other body parts. A common belief is that the lower face, including the lips, preferentially "invades" deafferented cortex due to its proximity to cortex. In the present study, this hypothesis is tested by mapping the somatosensory cortex using fMRI as amputees, congenital one-handers, and controls moved their forehead, nose, lips or tongue. First, they found that, unlike its counterpart in monkeys, the representation of the face in the somatosensory cortex is right-side up, with the forehead most medial (and abutting the hand) and the lips most lateral. Second, there was little evidence of "reorganization" of the deafferented cortex in amputees, even when tested with movements across the entire face rather than only the lips. Third, congenital one-handers showed significant reorganization of deafferented cortex, characterized principally by the invasion of the lower face, in contrast to predictions from the hypothesis that proximity was the driving factor. Fourth, there was no relationship between phantom limb pain reports and reorganization.

      As a non-expert in fMRI, I cannot evaluate the methodology. That being said, I am not convinced that the current consensus is that the representation of the face in humans is flipped compared to that of monkeys. Indeed, the overwhelming majority of somatosensory homunculi I have seen for humans has the face right side up. My sense is that the fMRI studies that found an inverted (monkey-like) face representation contradict the consensus.

      Thank you for pointing this out. As we tried to emphasise in the introduction, very few neuroimaging studies actually investigated face somatotopy in humans, with inconsistent results. We agree the default consensus tends to be dominated by the upright depiction of Penfield’s homunculus (recently replicated by Roux et al, 2018). However, due to methodological and practical constraints, alignment across subjects in the case of intracortical recordings is usually difficult to achieve, which makes it difficult to assess the consistency in topographical organisation. Moreover, previous imaging studies did not manage to convincingly support Penfield’s homunculus. For these two key reasons, the spatial orientation of the human facial homunculus is still debated. A further limiting factor is that the vast majority of previous studies investigating face (re)mapping in humans focused solely on the lip representation, using the cortical proximity hypothesis to interpret their results. Consequently, as we highlight above in our response to the Editor, there is a widespread and false representation in the human literature of the lips neighbouring the hand area.

      To account for the reviewer’s critique and convey some of this context, we changed our title from: Reassessing face topography in primary somatosensory cortex and remapping following hand loss; to: Complex pattern of facial remapping in somatosensory cortex following congenital but not acquired hand loss. This was done to de-emphasise the novelty of face topography relative to our other findings.

      We also rewrote our introduction (lines 79-94) as follows:

      “The research focus on lip cortical remapping in amputees is based on the assumption that the lips neighbour the hand representation. However, this assumption goes against the classical upright orientation of the face in S1 [26–30], as first depicted in Penfield’s Homunculus and in later intracortical recordings and stimulation studies [26–29], with the upper-face (i.e., forehead) bordering the hand area. In contrast, neuroimaging studies in humans studying face topography provided contradictory evidence for the past 30 years. While a few neuroimaging studies provided partial evidence in support of the traditional upright face organisation [31], other studies supported the inverted (or ‘upside-down’) somatotopic organisation of the face, similar to that of non-human primates [32,33]. Other studies suggested a segmental organisation [34], or even a lack of somatotopic organisation [35–37], whereas some studies provided inconclusive or incomplete results [38–41]. Together, the available evidence does not successfully converge on face topography in humans. In line with the upright organisation originally suggested by Penfield, recent work reported that the shift in the lip representation towards the missing hand in amputees was minimal [42,43], and likely to reside within the face area itself. Surprisingly, there is currently no research that considers the representation of other facial parts, in particular the upper-face (e.g., the forehead), in relation to plasticity or PLP.”

      We also updated the discussion accordingly (lines 457, 469-477, 490-492).

      Similarly, it is not clear to me how the observations (1) of limited reorganization in amputees, (2) of significant reorganization in congenital one-handers, and (3) of the lack of relationship between PLP and reorganization is novel given the previous work by this group. Perhaps the authors could more clearly articulate the novelty of these results compared to their previous findings.

      Thank you for giving us the opportunity to clarify this important point. The novelty of these results can be summarised as follows:

      (1) Conceptually, it is crucial for us to understand if deprivation-triggered plasticity is constrained by the local neighbourhood, because this can give us clues regarding the mechanisms driving the remapping. We provide strong topographic evidence about the face orientation in controls, amputees and one-handers.

      (2) The vast majority of previous research on brain plasticity following hand loss (both congenital and acquired) in humans has exclusively focused on the lower face, and lips in particular. We provide systematic evidence for stable organisation and remapping of the neighbouring upper face, as well as the lower face. We also study topographic representation of the tongue (and nose) for the first time.

      (3) The vast majority of previous research on brain remapping following hand loss (both congenital and acquired, neuroimaging and electrophysiological) was focused on univariate activity measures, such as the spatial spread of units showing a similar feature preference, or the average activity level across individual units. We are going beyond remapping by using RSA, which allows us to ask not only if new information is available in the deprived cortex (as well as the native face area), but also whether this new information is structured consistently across individuals and groups. We show that representational content is enhanced in the deprived cortex of one-handers whereas it is stable in amputees relative to controls (and to their intact hand region).

      (4) Based on previous studies, the assumption was that reorganisation in congenital one-handers was relatively unspecific, affecting all tested body parts. Here, we provide evidence for a more complex pattern of remapping, with the forehead representation seemingly moving out of the missing hand region (and the nose representation being tentatively similar to controls). That is, we show not just “invasion” but also a shift of the neighbour away from the hand area which has never been documented (or in fact suggested).

      (5) Using Bayesian analyses we provide definitive evidence against a relationship between PLP and forehead remapping, providing first and conclusive evidence against the remapping hypothesis, based on cortical neighbourhood.

      Our inclination is not to add a summary paragraph of these points in our discussion, as it feels too promotional. Instead, we have re-written large sections of the introduction and discussion to better emphasise each of these points separately throughout the text, where the context is most appropriate. Given the public review strategy taken by eLife, the novelty summary provided above will be available for any interested reader, as part of the public review process. However, should the reviewer feel that a novelty summary paragraph is required (or an emphasis on any of the points summarised above), we will be happy to revise the manuscript accordingly.

      Finally, Jon Kaas and colleagues (notably Niraj Jain) have provided evidence in experiments with monkeys that much of the observed reorganization in the somatosensory cortex is inherited from plasticity in the brain stem. Jain did not find an increased propensity for axons to cross the septum between face and hand representations after (simulated) amputation. From this perspective, the relevant proximity would be that of the cuneate and trigeminal nuclei and it would be critical to map out the somatotopic organization of the trigeminal and cuneate nuclei to test hypotheses about the role of proximity in this remapping.

      Thank you for highlighting this very relevant point, which we are well aware of. We fully agree with the reviewer that this is an important goal for future study, but functional imaging of the brainstem in humans is particularly challenging and would require ultra high field imaging (7T) and specialised equipment. We have encountered much local resistance due to hypothetical issues for MRI safety for scanning amputees in this higher field strength, meaning we are unable to carry out this research ourselves. Our former lab member Sanne Kikkert, who is now running her independent research programme in Zurich, has been working towards this goal for the past 4 years. So we can say with confidence that this aim is well beyond the scope of the current study. In response to your comment, we mentioned this potential mechanism in the introduction (lines 98-101), we ensured that we only referred to “cortical proximity” throughout our manuscript, and we circle back to this important point in the discussion.

      Lines 539-543: “Moreover, even if the remapping we observed here goes against the theory of cortical proximity, it can still arise from representational proximity at the subcortical level, in particular at the brainstem level [44,45]. While challenging in humans, mapping both the cuneate and trigeminal nuclei would be critical to provide a more complete picture regarding the role of proximity in remapping.”

      Reviewer #3 (Public Review):

      In their study, the authors set up to challenge the long-held claim that cortical remapping in the somatosensory cortex in hand deprived cortical territories follows somatotopic proximity (the hand region gets invaded by cortical neighbors) as classically assumed. In contrast to this claim, the authors suggest that remapping may not follow cortical proximity but instead functional rules as to how the effector is used. Their data indeed suggest that the deprived hand area is not invaded by the forefront which is the cortical neighbor but instead by the lips which may compensate for hand loss in manipulating objects. Interestingly the authors suggest this is mostly the case for one-handers but not in amputees for who the reorganization seems more limited in general (but see my comments below on this last point).

      This is a remarkably ambitious study that has been skilfully executed on a strong number of participants in each group. The complementarity of state-of-the-art uni- and multi-variate analyses are in the service of the research question, and the paper is clearly written. The main contribution of this paper, relative to previous studies including those of the same group, resides in the mapping of multiple face parts all at once in the three groups.

      We are grateful to the reviewer for appreciating the immense effort that this study involved.

      In the winner takes all approach, the authors only include 3 face parts but exclude from the analyses the nose and the thumb. I am not fully convinced by the rationale for not including nose in univariate analyses - because it does not trigger reliable activity - while keeping it for representational similarity analyses. I think it would be better to include the nose in all analyses or demonstrate this condition is indeed "noisy" and then remove it from all the analyses. Indeed, if the activity triggered by nose movement is unreliable, it should also affect multivariate.

      Following this comment, we re-ran all univariate analyses to include the nose, and updated throughout the main text and supplemental results and related figures. In short, adding the nose did not change the univariate results, apart from a now significant group x hemisphere interaction for the CoG of the tongue when comparing amputees and controls, matching better the trends for greater surface coverage in the deprived hand ROI of amputees. Full details are provided in our response to Reviewer 1 above.

      The rationale for not including the hand is maybe more convincing as it seems to induce activity in both controls and amputees but not in one-handers. First, it would be great to visualize this effect, at least as supplemental material to support the decision. Then, this brings the interesting possibility that enhanced invasion of hand territory by lips in one-handers might link to the possibility to observe hand-related activity in the presupposed hand region in this population. Maybe the authors may consider linking these.

      Thank you for this comment. As we explain in our response to Reviewer 1 above, we did not intend the thumb condition in one-handers for analysis, as the task given to one-handers (imagine moving a body part you never had before) is inherently different to that given to the other groups (move - or at least attempt to move - your (phantom) hand). As such, we could not pursue the analysis suggested by the reviewer here. To reduce the discrepancy and following Reviewer 1’s advice, we decided to remove the hand-face dissimilarity analysis which we included in our original manuscript, and which might have sparked some of this interest. Upon reflection we agreed that this specific analysis does not directly relate to the question of remapping (but rather of shared representation), in addition to making the paper unbalanced. We will now feature this analysis in another paper that appears more appropriate in the context of referred sensations in amputees (Amoruso et al, 2022 MedRxiv).

      The use of the geodesic distance between the center of gravity in the Winner Take All (WTA) maps between each movement and a predefined cortical anchor is clever. More details about how the Center Of Gravity (COG) was computed on spatially disparate regions might deserve more explanations, however.

      We are happy to provide more detail on this analysis, which weights the CoG based on cluster size (using the workbench command -metric-weighted-stats). Consider the example shown here (Figure 1) for a single control participant, where each CoG is measured either without weighting (yellow vertices) or with cluster weighting (forehead CoG=red, lip CoG=dark blue, tongue CoG=dark red). When the movement produces a single cluster of activity (the lips in the non-dominant hemisphere, shown in blue), the CoG’s location was identical for both weighted (red) and unweighted (yellow) calculations. But other movements, such as the tongue (green), produced one large cluster (at the lateral end), with a few more disparate smaller clusters more medially. In this case, the larger cluster of maximal activity is weighted to a greater extent than the smaller clusters in the CoG calculation, meaning the CoG is slightly skewed towards it (dark red), relative to the smaller clusters.

      Figure 1. Centre-of-gravity calculation, weighted and unweighted by cluster size, in an example control participant. Here the winner-takes-all output for each facial movement (forehead=red, lips=blue, tongue=green) was used to calculate the centre-of-gravity (CoG) at the individual-level in both the dominant (left-hand side) and non-dominant (right-hand side) hemisphere, weighted by cluster size (forehead CoG=red, lip CoG=dark blue, tongue CoG=dark red), compared to an unweighted calculation (denoted by yellow dots within each movements’ winner-takes-all output).

      This is now explained in the methods (lines 760-765) as follows:

      “To assess possible shifts in facial representations towards the hand area, the centre-of-gravity (CoG) of each face-winner map was calculated in each hemisphere. The CoG was weighted by cluster size meaning that in the event of multiple clusters contributing to the calculation of a single CoG for a face-winner map, the voxels in the larger cluster are overweighted relative to those in the smaller clusters. The geodesic cortical distance between each movement’s CoG and a predefined cortical anchor was computed.”
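
      For illustration only, the cluster-size weighting described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the workbench implementation: coordinates are flat 2-D points rather than cortical surface vertices, the weight of each vertex is simply the number of vertices in its cluster, and the function names `weighted_cog` and `unweighted_cog` are hypothetical.

```python
def weighted_cog(clusters):
    """Centre of gravity with every vertex weighted by the size of its cluster.

    clusters: list of clusters, each a list of (x, y) vertex coordinates.
    Assumption: weight = number of vertices in the cluster the vertex belongs to.
    """
    total_w = 0.0
    sx = sy = 0.0
    for cluster in clusters:
        w = len(cluster)  # all vertices of a cluster share the same weight
        for x, y in cluster:
            sx += w * x
            sy += w * y
            total_w += w
    return (sx / total_w, sy / total_w)


def unweighted_cog(clusters):
    """Plain mean of all vertex coordinates, ignoring cluster membership."""
    pts = [p for cluster in clusters for p in cluster]
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)


# Hypothetical example: one large lateral cluster and one small medial cluster.
large = [(10.0, 0.0), (10.0, 0.0), (10.0, 0.0), (10.0, 0.0)]
small = [(0.0, 0.0)]
cog_w = weighted_cog([large, small])
cog_u = unweighted_cog([large, small])
```

      With a single cluster the two calculations coincide; with several clusters the weighted CoG is pulled towards the largest one, which is the behaviour described for the tongue in the example participant.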

      Moreover, imagine that for some reason the forefront region extends both dorsally and ventrally in a specific population (eg amputees), the COG would stay unaffected but the overlap between hand and forefront would increase. The analyses on the surface area within hand ROI for lips and forehead nicely complement the WTA analyses and suggest higher overlap for lips and lower overlap for forehead but none of the maps or graphs presented clearly show those results - maybe the authors could consider adding a figure clearly highlighting that there is indeed more lip activity IN the hand region.

      We agree with you on this limitation of the CoG and this is why we interpret all cortical distance analyses in tandem with the laterality indices. The laterality indices correspond to the proportion of surface area in the hand region for a given face part in the winner-maps.

      Nevertheless, to further convince the Reviewer, we extracted activity levels (beta values) within the hand region of congenitals and controls, and we ran (as for CoGs) a mixed ANOVA with the factors Hemisphere (deprived x intact) and Group (controls x one-handers).

      As expected from the laterality indices obtained for the Lips, we found a significant group x hemisphere interaction (F(1,41)=4.52, p=0.040, ηp²=0.099), arising from enhanced activity in the deprived hand region in one-handers compared to the non-dominant hand region in controls (t(41)=-2.674, p=0.011) and to the intact hand region in one-handers (t(41)=-3.028, p=0.004).

      Since this kind of analysis was the focus of previous studies (from which we are trying to get away) and since it is redundant with the proportion of face-winner surface coverage in the hand region, we decided not to include it in the paper. But we could add it as a Supplementary result if the Reviewer believes this strengthens our interpretation.

      In addition to overlap analyses between hand and other body parts, the authors may also want to consider doing some Jaccard similarity analyses between the maps of the 3 groups to support the idea that amputees are more alike controls than one-handers in their topographic activity, which again does not appear clear from the figures.

      We thank the reviewer for this clever suggestion. We now include the Jaccard similarity analysis, which quantified the degree of similarity (0=no overlap between maps; 1=fully overlapping) between winner-takes-all maps (which included the nose; akin to the revised univariate results) across groups. For each face part/amputee, the similarity with the 22 controls and 21 one-handers respectively was averaged. We utilised a linear mixed model which included fixed factors of Group (One-handers x Controls), Movement (Forehead x Nose x Lips x Tongue) and Hemisphere (Intact x Deprived) on Jaccard similarity values (similar to what we used for the RSA analysis). A random effect of participant, as well as a covariate of age, were also included in the model.
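
      As an illustration of the similarity measure, a minimal sketch is given below. It assumes each winner-takes-all map is represented as a set of vertex indices assigned to a given face part; the function names and the convention for two empty maps are assumptions made for the example, not details taken from the analysis pipeline.

```python
def jaccard(map_a, map_b):
    """Jaccard similarity between two maps given as collections of vertex indices.

    Returns 0.0 for no overlap and 1.0 for fully overlapping maps.
    """
    a, b = set(map_a), set(map_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when both maps are empty
    return len(a & b) / len(union)


def mean_similarity(target_map, group_maps):
    """Average similarity of one participant's map to every map in a group."""
    return sum(jaccard(target_map, m) for m in group_maps) / len(group_maps)


# Hypothetical vertex sets for one amputee and two comparison participants.
amputee_lips = {1, 2, 3}
group = [{2, 3, 4}, {1, 2, 3}]
avg = mean_similarity(amputee_lips, group)
```

      Averaging over a comparison group, as in `mean_similarity`, mirrors the description of averaging each amputee's similarity across the controls and across the one-handers before modelling.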

      Results showed a significant group x hemisphere interaction (F(240.0)=7.70, p=0.006; controlled for age; Fig. 5), indicating that amputees’ maps showed different similarity values to controls’ and one-handers’ depending on the hemisphere. Post-hoc comparisons (corrected alpha=0.025; uncorrected p-values reported) revealed significantly higher similarity to controls’ than to one-handers’ maps in the deprived hemisphere (t(240)=-3.892, p<.001). Amputees’ maps also showed higher similarity to controls’ maps in the deprived relative to the intact hemisphere (t(240)=2.991, p=0.003). Amputees, therefore, displayed greater similarity of facial somatotopy in the deprived hemisphere to controls, again suggesting less evidence for cortical remapping in amputees.

      We added these results at the end of the univariate analyses (lines 335-351) and in the discussion (lines 464-465 and 497-500).

      This brings to another concern I have related to the claim that the change in the cortical organization they observe is mostly observed in one-handers. It seems that most of this conclusion relies on the fact that some effects are observed in one-handers but not in amputees when compared to controls, however, no direct comparisons are done between amputees and one-handers so we may be in an erroneous inference about the interaction when this is actually not tested (Nieuwenhuis, 11). For instance, the shift away from the hand/face border of the forehead is also (mildly) significant in amputees (as observed more strongly in one-handers) so the conclusion (eg from the subtitle of the results section) that it is specific to one-hander might not fully be supported by the data. Similar to the invasion of the hand territory from the lips which is significant in amputees in terms of surface area. All together this calls for toning down the idea that plasticity is restricted to congenital deprivation (eg last sentence of the abstract). Even if numerically stronger, if I am not wrong, there are no stats showing remapping is indeed stronger in one-handers than in amputees and actually, amputees show significant effects when compared to controls along the lines as those shown (even if more strongly) in one-handers.

      Thank you for this very important comment. We fully agree – the RSA across-groups comparison is highly informative but insufficient to support our claims. We did not compare the groups directly to avoid multiple comparisons (both for statistical reasons and to manage the size of the results section). But the reviewer’s suggestion to perform a Jaccard similarity analysis complements the univariate and multivariate results very nicely and allows for a direct (and statistically lean) comparison between groups, to assess whether amputees are more similar to controls or to congenital one-handers, taking into account all aspects of their maps (both spatial location/CoG and surface coverage). We added the Jaccard analysis to the main text, at the end of the univariate results (lines 335-385). The Jaccard analysis suggests that amputees’ maps in the deprived hemisphere were more similar to the maps of controls than to the ones of congenital one-handers. This allowed us to obtain significant statistical results to support the claim that remapping is indeed stronger in one-handers than in amputees (lines 346-351). We also compared both amputees and one-handers to the control group. In line with our univariate results, this revealed that the only face part for which controls were more similar to one-handers than to amputees was the tongue (lines 379-381), and that the forehead remapping observed at the univariate level in amputees (surface area) is likely to arise from differences in the intact hemisphere (lines 381-383).

      Finally, we also added the post-hoc statistics comparing amputees to congenitals in the RSA analysis (lines 425-427): “While facial information in the deprived hand area was increased in one-handers compared with amputees, this effect did not survive our correction for multiple comparisons (t(70.7)=-2.117, p=0.038).”

      Regarding the univariate results mentioned by the reviewer, we would like to emphasise that we had no significant effect for the lips in amputees, though we agree the surface area appears in between controls and one-handers. But this laterality index was not different from zero. This test is now added at lines 189-190. Regarding the forehead, we fully agree with the Reviewer, and we adjusted the subtitle accordingly (lines 241-242). For consistency, we also added the t-test vs zero for the forehead surface area (non-significant, lines 251-253).

      Also, maybe the authors could explore whether there is actually a link between the number of years without hand and the remapping effects.

      To address this question, we explored our data using a correlation analysis. The only body part that showed some suggestive remapping effects was the tongue, and so we explored whether we could find a relationship (Pearson’s correlation) between years since amputation and the laterality index of the Tongue in amputees (r = 0.007, p=0.980, 95% CI [-0.475, 0.475]). We also explored amputees’ global Jaccard similarity values to controls in the deprived hemisphere (r = -0.010, p=0.970, 95% CI [-0.488, 0.473]), and could not find any relationship. Considering there was no strong remapping effect to explain, we find this result too exploratory to include in our manuscript.
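
      For readers who want to reproduce this kind of check, a Pearson correlation with an approximate 95% confidence interval can be sketched as below. The interval uses the Fisher z-transformation, which is one common choice; the response does not state which method produced the reported intervals, so this is an assumption, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transformation (needs n > 3)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical values, not data from the study.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
index = [0.2, -0.1, 0.3, 0.0, 0.1]
r = pearson_r(years, index)
low, high = fisher_ci(r, len(years))
```

      With small samples the interval is wide, so a near-zero r with an interval spanning roughly -0.5 to 0.5 indicates that the data cannot distinguish between no relationship and a moderate one.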

      One hypothesis generated by the data is that lips remap in the deprived hand area because lips serve compensatory functions. Actually, also in controls, lips and hands can be used to manipulate objects, in contrast to the forehead. One may thus wonder if the preferential presence of lips in the hand region is not latent even in controls as they both link in functions?

      We agree with the reviewer’s reasoning, and we think that the distributed representational content we recently found in two-handers (Muret et al, 2022) provides a first hint in this direction. It is worth noting that in that previous publication we did not find differences across face parts in the activity levels obtained in the hand region, except for slightly more negative values for the tongue. But we do think that such latent information is likely to provide a “scaffolding” for remapping. While the design of our face task does not allow us to assess information content for each face part (as done for the lips in Muret et al, 2022), this should be further investigated in follow-up studies.

      We added a sentence in the discussion to highlight this interesting notion: Lines 556-559: “Together with the recent evidence that lip information content is already significant in the hand area of two-handed participants (Muret et al, 2022), compensatory behaviour since developmental stages might further uncover (and even potentiate) this underlying latent activity.”

    1. Author Response

      Reviewer #1 (Public Review):

      The authors used data from extracellular recordings in mouse piriform cortex (PCx) by Bolding & Franks (2018), they examined the strength, timing, and coherence of gamma oscillations with respiration in awake mice. During "spontaneous" activity (i.e. without odor or light stimulation), they observed a large peak in gamma that was driven by respiration and aligned with the spiking of FBIs. TeLC, which blocks synaptic output from principal cells onto other principal cells and FBIs, abolishes gamma. Beta oscillations are evoked while gamma oscillations are induced. Odors strongly affect beta in PCx but have minimal (duration but not amplitude) effects on gamma. Unlike gamma, strong, odor-evoked beta oscillations are observed in TeLC. Using PCA, the authors found a small subset of neurons that conveyed most of the information about the odor (winner cells). Loser cells were more phase-locked to gamma, which matched the time course of inhibition. Odor decoding accuracy closely follows the time course of gamma power.

      We thank the reviewer for the accurate summary of our work.

      I think this is an interesting study that uses a publicly available dataset to good effect and advances the field elegantly, especially by selectively analyzing activity in identified principal neurons versus inhibitory interneurons, and by making use of defined circuit perturbations to causally test some of their hypotheses.

      We thank the reviewer for the positive appraisal.

      Major:

      • The authors show odor-specificity at the time of the gamma peak and imply that the gamma coupling is important for odor coding. Is this because gamma oscillations are important or because gamma is strongest when activity in PCx is strongest (i.e. both excitatory and inhibitory activity, which would cancel each other in the population PSTH, which peaks earlier)? To make this claim, the authors could show that odor decoding accuracy - with a small (~10 ms sliding window) - oscillates at approx. gamma frequencies. As is, Fig. 5 just shows that cells respond at slightly different times in the sniff cycle. What time window was used for computing the Odor Specificity Index? Put another way, is it meaningful that decoding is most accurate when gamma oscillations are strongest, or is this just a reflection of total population activity, i.e., when activity is greatest there is more gamma power, and odor decoding accuracy is best?

      We thank the reviewer for the critical comment. Please note that the employed decoding strategy (supervised learning with cross-validation) prevents us from quantifying a time series of decoding accuracy. Nevertheless, to overcome this difficulty, we divided the spike data (0-500 ms following the inhalation start) according to the gamma cycle into four non-overlapping gamma phase bins. Then we tested whether odor decoding accuracy varied as a function of the gamma cycle phase. Using this approach, we found that decoding depended on the gamma phase, as shown below:

      (The bottom plot shows the modulation of decoding accuracy within the gamma cycle [Real MI] compared to a surrogate distribution [Surr MI, obtained by circularly shifting the gamma phases by a random amount]).
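
      The phase-binning and surrogate steps described above can be sketched as follows. This is a hedged illustration, not the analysis code: it assumes the gamma phase is available as one value in radians per sample, that spikes are given as sample indices, and that the surrogate is built by rotating the phase series in time by a random number of samples (the response says only that phases were circularly shifted by a random amount). All function names are hypothetical, and the decoder itself is omitted.

```python
import math
import random


def phase_bin(phase, n_bins=4):
    """Map a phase in radians to a bin index 0..n_bins-1 covering [-pi, pi)."""
    wrapped = (phase + math.pi) % (2 * math.pi)
    return min(int(wrapped / (2 * math.pi / n_bins)), n_bins - 1)


def split_spikes_by_phase(spike_idx, phases, n_bins=4):
    """Group spike sample indices into non-overlapping gamma phase bins."""
    groups = [[] for _ in range(n_bins)]
    for i in spike_idx:
        groups[phase_bin(phases[i], n_bins)].append(i)
    return groups


def circular_shift(phases, shift):
    """Rotate the phase series by `shift` samples, wrapping around the end."""
    shift %= len(phases)
    if shift == 0:
        return list(phases)
    return phases[-shift:] + phases[:-shift]


def surrogate_phases(phases, rng):
    """One surrogate phase series: circular shift by a random non-zero amount."""
    return circular_shift(phases, rng.randrange(1, len(phases)))


# Hypothetical toy data: 8 samples of phase and three spikes.
phases = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, -3.0]
spikes = [0, 3, 5]
real_groups = split_spikes_by_phase(spikes, phases)
surr_groups = split_spikes_by_phase(spikes, surrogate_phases(phases, random.Random(0)))
```

      Decoding would then be run separately on the spikes of each bin, once with the real phases and many times with surrogate phases, so that the observed modulation across bins can be compared against the surrogate distribution.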

      We interpret this new result as indicative that gamma influences decoding accuracy directly and that our previous result was not only a reflection of total population activity. Moreover, please note that we only use the principal cell activity for computing the odor specificity index (Fig 5E) and decoding accuracy (Fig 7B). Both peak at ~150 ms following inhalation start, at a time window where the net principal cell activity is roughly similar to baseline levels (Fig 5A bottom panel).

      These new panels were added to revised Figure 7 and mentioned in the revised manuscript (page 8); we now also discuss the above considerations about maximal decoding not coinciding with the peak firing rate (page 10).

      Regarding the Odor Specificity Index computation, we apologize for not describing it appropriately in the corresponding Methods subsection. We employed the same sliding time window as in the population vector correlation and the decoding analyses (i.e., 100 ms window, 62.5 % overlap). This information has been added to the revised manuscript (page 15).
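
      To make the window arithmetic explicit: a 100 ms window with 62.5 % overlap advances by 100 × (1 − 0.625) = 37.5 ms per step. A minimal sketch that enumerates such windows is given below; the function name and the choice to drop a final partial window are assumptions for the example.

```python
def sliding_windows(t_start, t_stop, width=100.0, overlap=0.625):
    """Return (start, end) times in ms of sliding windows with fractional overlap.

    Only windows that fit entirely inside [t_start, t_stop] are kept.
    """
    step = width * (1.0 - overlap)  # 37.5 ms for the defaults
    windows = []
    k = 0
    while t_start + k * step + width <= t_stop:
        start = t_start + k * step
        windows.append((start, start + width))
        k += 1
    return windows


# Windows covering the first 250 ms after inhalation start.
wins = sliding_windows(0.0, 250.0)
```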

      • The authors say, "assembly recruitment would depend on excitatory-excitatory interactions among winner cells occurring simultaneously during gamma activity." Can the authors test this prediction by examining the TeLC recordings, in which excitatory-excitatory connections are abolished?

      We thank the reviewer for the relevant comment. We followed the reviewer's suggestion and analyzed odor assemblies in TeLC recordings. Interestingly, we found a greater increase in the firing rate of winner cells in TeLC recordings (see figure below), which therefore does not support our previous interpretation that assembly recruitment would depend on excitatory-excitatory local interactions.

      Thus, this new result suggests a much more critical role than we previously considered for the OB projections in determining winner neurons.

      Moreover, we found significant differences in the properties of loser cells. In particular, the TeLC-infected piriform cortex showed a decreased number of losing cells, which were significantly less inhibited than their contralateral counterparts:

      Furthermore, the reduced inhibition of losing cells was associated with an increased correlation of assembly weights across odors for the affected hemisphere:

      Therefore, we believe these results highlight the role of gamma oscillations in segregating cell assemblies and generating a sparse orthogonal odor representation in the piriform cortex. These findings are now included as new panels of Figure 6 and discussed on page 8. Of note, to conform with them, we modified our speculative sentence (page 9) "assembly recruitment would depend on excitatory-excitatory interactions among winner cells occurring simultaneously during gamma activity" to “(…) the assembly recruitment would depend on OB projections determining which winner cells “escape” gamma inhibition, highlighting the relevance of the OB-PCx interplay for olfaction (Chae et al., 2022; Otazu et al., 2015).”

      • The authors show that gamma oscillations are abolished in the TeLC condition and use this to claim that gamma arises in the PCx. However, PCx neurons also project back to the OB, where they form excitatory connections onto granule cells. Fukunaga et al (2012) showed that granule cells are essential for generating gamma oscillations in the bulb. Can the authors be sure that gamma is generated in the PCx, per se, rather than generated in the bulb by centrifugal inputs from the PCx, and then inherited from the bulb by the PCx?

      We thank the reviewer for the pertinent comment regarding gamma generation in the PCx. To address this point, we have performed current source density (CSD) analysis, which showed sinks and sources of low-gamma oscillations within the PCx and also a phase reversal:

      This result – shown as panel F in Figure 1 – suggests a local generation of gamma within the PCx. Along with the fact that PCx gamma tightly correlates with piriform FBI firing and that PCx gamma disappears in the TeLC ipsi hemisphere, which has intact OB projections, we deem it more parsimonious to assume that gamma does originate in the piriform circuit during feedback inhibition acting on principal cells and is not directly inherited from OB (though it depends on its drive). We have edited our text to incorporate the panel above (page 4). We now also relate our results with those of Fukunaga and colleagues for the OB gamma generation and discuss the alternative interpretation of inherited gamma (page 9).
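
      As background, a one-dimensional CSD estimate is commonly obtained from the second spatial difference of the field potential across equally spaced recording depths. The sketch below shows that standard estimator only; the exact CSD procedure used in the analysis is not described here, and the sign convention, the unit conductivity and the function name are assumptions.

```python
def csd_second_difference(lfp_by_depth, spacing, sigma=1.0):
    """Estimate 1-D CSD at the interior contacts of a linear probe.

    lfp_by_depth: potential at each contact at one time point, ordered by depth.
    spacing: distance between neighbouring contacts.
    sigma: tissue conductivity (assumed constant; 1.0 gives arbitrary units).
    Convention assumed here: CSD = -sigma * d2V/dz2.
    """
    out = []
    for i in range(1, len(lfp_by_depth) - 1):
        d2 = lfp_by_depth[i - 1] - 2 * lfp_by_depth[i] + lfp_by_depth[i + 1]
        out.append(-sigma * d2 / spacing ** 2)
    return out


# Hypothetical profile with a local negativity at the middle contact.
profile = [0.0, 0.0, -1.0, 0.0, 0.0]
csd = csd_second_difference(profile, spacing=1.0)
```

      A potential that varies linearly with depth yields zero CSD, whereas a localized deflection yields a value of one sign at its centre flanked by values of the opposite sign, which is the kind of spatial pattern taken as evidence for local generation.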

      Reviewer #2 (Public Review):

      This is a very interesting paper, in which the authors describe how respiration-driven gamma oscillations in the piriform cortex are generated. Using a published data set, they find evidence for a feedback loop between local principal cells and feedback interneurons (FBIs) as the main driver of respiration-driven gamma. Interestingly, odour-evoked gamma bursts coincide with the emergence of neuronal assemblies that activate when a given odour is presented. The results argue in favour of a winner-take-all mechanism of assembly generation that has previously been suggested on theoretical grounds.

      We thank the reviewer for his/her work and accurate summary of our results.

      The article is well-written and the claims are justified by the data. Overall, the manuscript provides novel key insights into the generation of gamma oscillations and a potential link to the encoding of sensory input by cell assemblies. I have only minor suggestions for additional analyses that could further strengthen the manuscript:

      We thank the reviewer for the positive appraisal.

      1) The authors' analysis of firing rates of FFIs and FBIs combined with TeLC experiments make a compelling case for respiration-driven gamma being generated in a pyramidal cell-FBI feedback mechanism. This conclusion could be further strengthened by analyzing the gamma phase-coupling of the three neuronal populations investigated. One would expect strong coupling for FBIs but not FFIs (assuming that enough spikes of these populations could be sampled during the respiration-triggered gamma bursts). An additional analysis to strengthen this conclusion could be to extract FBI- and FFI spike-triggered gamma-filtered signals. One might expect an increase in gamma amplitude following FBI but not FFI spiking (see e.g., Pubmed ID 26890123).

      We thank the reviewer for the comment. To address this point, we first computed spike-coupling strength (by means of the Mean Vector Length – MVL) for each neuronal subtype. As shown below, we did not find major differences in MVL values across subtypes (if anything, the FBIs actually displayed the lowest MVL, though it should be cautioned that this metric is sensitive to sample size, which differed among subtypes):
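
      For reference, the mean vector length of a set of spike phases is the magnitude of the average unit vector, MVL = |(1/N) Σ exp(iφ_k)|. A minimal sketch is shown below; the function name is hypothetical and the random-phase example only illustrates why the metric depends on the number of spikes.

```python
import cmath
import math
import random


def mean_vector_length(phases):
    """MVL of spike phases in radians: 1 = perfect locking, near 0 = no locking."""
    n = len(phases)
    return abs(sum(cmath.exp(1j * p) for p in phases)) / n


# Sample-size sensitivity: uniformly random phases have no true locking, yet the
# MVL of a small sample tends to be larger than that of a large sample.
rng = random.Random(1)
few = [rng.uniform(-math.pi, math.pi) for _ in range(10)]
many = [rng.uniform(-math.pi, math.pi) for _ in range(10000)]
mvl_few = mean_vector_length(few)
mvl_many = mean_vector_length(many)
```

      Because the expected MVL of unlocked spikes shrinks as the spike count grows, comparisons across subtypes with different spike counts are usually made after matching the number of spikes or against a shuffled baseline.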

      Of note, this result also translated to spike-triggered gamma-filtered signals, with FBIs having the lowest average. We don’t however believe these findings speak against a major role of FBIs in giving rise to field gamma, since it is expected that inhibited neurons will highly phase-lock to gamma (while more active neurons during gamma would show lower phase-locking). Nevertheless, we also computed the spike-triggered gamma amplitude envelope for all three neuronal subtypes. This analysis showed that gamma envelopes closely followed FBI spikes (and not FFIs or EXC cells), and thus this new result reinforces the idea that FBIs trigger gamma oscillations. This plot is now part of an inset of Figure 1G (described on page 5).
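
      A spike-triggered average of a precomputed amplitude envelope can be sketched as follows. This is an illustration under assumptions: the gamma amplitude envelope is taken as given (for instance from a band-pass filter followed by a Hilbert transform, which is not shown), spikes are sample indices, and spikes too close to the edges of the recording are skipped. The function name is hypothetical.

```python
def spike_triggered_average(signal, spike_idx, half_window):
    """Average the signal in a window of +/- half_window samples around each spike.

    Returns a list of length 2 * half_window + 1, or None if no spike had a
    complete window inside the signal.
    """
    acc = [0.0] * (2 * half_window + 1)
    n_used = 0
    for s in spike_idx:
        lo, hi = s - half_window, s + half_window + 1
        if lo < 0 or hi > len(signal):
            continue  # skip spikes whose window runs off the recording
        for k in range(lo, hi):
            acc[k - lo] += signal[k]
        n_used += 1
    if n_used == 0:
        return None
    return [v / n_used for v in acc]


# Hypothetical envelope and spike times (sample indices).
envelope = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sta = spike_triggered_average(envelope, [0, 2, 4], half_window=1)
```

      An envelope that rises after the trigger for one cell type but not for the others would appear as an asymmetry around the centre sample of the averaged window.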

      2) The authors utilize the neurons' weight in the first PC to assign them to odour-related assemblies. This method convincingly extracts an assembly for each odour (when odours are used individually), and these seem to be virtually non-overlapping. It would be informative to test whether a similar clear separation of the individual assemblies could be achieved by running the analysis on all odours simultaneously, perhaps by employing a procedure of assembly extraction that allows to deal with overlapping assembly membership better than a pure PCA approach (as used for instance in the work cited on page 11, including the authors' previous work)? I do not doubt the validity of the authors' approach here at all, but the suggested additional analysis might allow the authors to increase their confidence that individual neurons contribute mostly to an assembly related to a single odour.

      We thank the reviewer for the pertinent comment. In order to address it, we ran the ICA-based approach to detect cell assemblies (Lopes-dos-Santos et al., 2013) using the spike time series of all odors concatenated. The concatenation included time windows around the gamma peak (100-400 ms after inhalation start). We chose this window to prevent the ICA from picking temporal features of the response as different ICs instead of the spiking variations caused by the different odors. As a reference, we also calculated ICA for each odor independently during the gamma peak.

      We found that the results obtained from ICA computed using concatenated data from all odors show important resemblances to those from the single ICA per odor approach. For instance, we get similar sparsity and cell assembly membership (Figure 6-figure supplement 1A), orthogonality (Figure 6-figure supplement 1B), and odor specificity (Figure 6-figure supplement 1C) in the IC loadings through both approaches. Notably, the average absolute IC correlation between the six odors (computed separately) and the six first ICs (computed from the combined odor responses) was similar across animals and showed no significant differences (Figure 6-figure supplement 1C).

      We also directly tested odor selectivity and separation in the concatenated data approach by computing each odor’s mean assembly activity (i.e., “IC projection”). Regarding the former, we found that most assemblies coded for 1 or 2 odors (Figure 6-figure supplement 1D). Regarding the diversity of representations for the sampled neurons, we assessed odor separation by examining to which odor each IC is activated the most. Under this framework, we get that, on average, the first 6 ICs encode three to five different odors (Figure 6-figure supplement 1E).

      We have included this result as a new Figure 6-figure supplement 1 and mention it on page 8. Of note, we have also performed all of our previous assembly analyses (i.e., Figure 6) using ICA instead of PCA to be consistent throughout the manuscript and allow the reader to compare with the new supplementary figure. This led to a new and enhanced version of Figure 6.

      3) Do the authors observe a slow drift in assembly membership as predicted from previous work showing slowly changing odour responses of principal neurons (Schoonover et al., 2021)? This could perhaps be quantified by looking at the expression strengths of assemblies at individual odour presentations or by running the PCA separately on the first and last third of the odour presentations to test whether the same neurons are still 'winners'.

      We thank the reviewer for calling our attention to this point. We note, however, that the representation drift observed by Schoonover et al. occurred along several days of recordings, i.e., at a much slower time scale than the single-day recordings we analyzed here (of note, Schoonover et al. observed no drift within the same day [their Fig 2a]). But irrespective of this, we believe that the data at hand does not allow for a confident analysis of possible drifts. This is because each odor was only presented ~12 times; so, further subdividing the data into subsets of only 4 trials would not render a reliable analysis, unfortunately.

      4) Does the winner-take-all scenario involve the recruitment of specific sets of FBIs during the activation of the individual odour-selective assemblies? The authors could address this by testing whether the rate of FBIs changes differently with the activation of the extracted assemblies.

      Within each recording session, the number of recorded FBIs is very low, on average 3.6 FBIs per recording session. Thus, unfortunately such interesting analysis cannot be confidently performed.

      5) Given the dependence on local gamma oscillations, one might expect that odour-selective assemblies do not emerge in the TeLC-expressing hemisphere. This could be directly tested in the existing data set.

      We are thankful for the comment. We followed the reviewer's suggestion and analyzed odor assemblies in TeLC recordings, comparing the ipsilateral hemisphere (infected) with the contralateral one. Interestingly, we find an increased correlation of assembly weights across odors, suggesting that the formation/segregation of odor-selective assemblies is hindered when the principal cell synapses are abolished. This assembly selectivity reduction co-occurred as the number of losing neurons decreased, and the inhibition of the latter was also reduced. Consequently, decoding accuracy significantly decreased during the 150-250 ms window in the infected TeLC hemisphere compared to the contralateral cortex.

      Therefore, we believe these new results support the role of gamma oscillations in segregating cell assemblies and generating a sparse orthogonal odor representation. These findings are now included as new panels of Figure 6 and Figure 7 and discussed on page 8.

    1. Author Response

      Reviewer #1 (Public Review):

      By studying the effect of Treg depletion in a CD8+ T cell-dependent diabetes model the group around Ondrej Stepanek described that in the absence of Treg cells antigen-specific CD8+ OT-I T cells show an activated phenotype and accelerate the development of diabetes in mice. These cells - termed KILR cells - express CD8+ effector and NK cell gene signatures and are identified as CD49d- KLRK1+ CD127+ CD8+ T cells. The authors suggest that the generation of these cells is dependent on TCR stimulation and IL-2 signals, either provided due to the absence of Treg cells or by injection of IL-2 complexed to specific antiIL-2 mAbs. In vivo, these cells show improved target cell killing properties, while the authors report improved anti-tumor responses of combination treatments with doxorubicin combined with IL-2/JES6 complexes. Finally, the authors identified a similar human subset in publicly available scRNAseq datasets, supporting the translational potential of their findings.

      The conclusions are mostly well supported, except for the following two considerations:

      We are happy for the positive overall evaluation of our manuscript by both reviewers and we are thankful for their specific insightful comments, which helped us to improve the manuscript.

      1) From Fig. 4A and B it is not conclusively shown, that Tregs limit IL-2 necessary for the expansion of OT-I cells and subsequent induction of diabetes. An IL-2 depletion experiment (e.g. with combined injection of the S4B6 and JES6-1 antibodies) would further strengthen this claim. Along these lines, the authors claim "IL-2Rα expression on T cells can be induced by antigen stimulation or by IL-2 itself in a positive feedback loop [20]. Accordingly, downregulation of IL-2Rα in OT-I T cells in the presence of Tregs might be a consequence of the limited availability of IL-2.". The cited reference 20 did observe CD25 upregulation by IL-2 on T cells but the observed effect might only be caused by upregulation of CD25 on Treg cells, which increases the MFI for the whole T cell population. Did the authors observe significant upregulation of CD25 on effector CD4+ and CD8+ T cells in their experiments with IL-2/S4B6 or IL-2/JES6 treatment?

      We added another reference to support our claim (Sereti, I., et al., Clin Immunol, 2000. 97(3): p. 266-76.). Along this line, we also observed that addition of IL-2 in vitro leads to IL-2Rα upregulation on CD8+ T cells (shown in Fig. 4C), and that the IL-2Rα level was lower if Tregs were present. We also observed upregulation of IL-2Rα in vivo upon the stimulation of OT-I T cells with OVA and IL-2ic, which is now shown in the Fig. S6C of the revised manuscript.

      To further explore whether Tregs limit the expansion of OT-I T cells and diabetes progression by limiting IL-2, we performed the proposed experiment using a combined injection of S4B6 and JES6-1 anti-IL-2 antibodies. Initially, we were skeptical that we could completely block IL-2 using this approach, for the following reasons. First, IL-2 is produced locally in the spleen and lymph nodes and might not be easily accessible to the antibodies for a complete block. Second, IL-2 has a relatively rapid turnover and is continuously produced, but the half-life of the injected antibodies is unknown, which calls the duration of such a block into question. Third, it is possible that some IL-2 molecules would bind to only one of the two antibodies, which would turn them into hyper-stimulating immune complexes instead of neutralizing them.

      Nevertheless, we were curious enough to perform this experiment. We used a condition that, based on our experience, leads to diabetes manifestation in Treg-depleted, but not in Treg-replete mice (10 k OT-I T cells, OVA + LPS immunization). One additional group of Treg-depleted mice received a single dose of S4B6 and JES6-1 anti-IL-2 (200 µg of each antibody per mouse). We observed that this IL-2 blocking delayed, but did not prevent, the development of diabetes in most animals (Fig. 1 below).

      Overall, we believe that this experiment supports our conclusions concerning the importance of IL-2, although the effect is only partial. However, we decided not to include this experiment in the manuscript, because we do not have evidence of how efficient the IL-2 blocking was (see above), which makes the interpretation difficult. Because the reviews and the point-by-point response are public in eLife, we believe that showing the data here is appropriate.

      Figure 1. Role of IL-2 blocking on the development of experimental diabetes. Two independent experiments were performed. Statistical significance was calculated using Log-rank (Mantel-Cox) test for survival, and Kruskal-Wallis test for blood glucose (p-value is shown in italics).

      2) The anti-tumor efficacy of KILR cells is intriguing but currently, it is unclear if it is indeed mediated by KILR cells. Have KILR cells been identified by flow cytometry in the BCL1 and B16F10 models treated with doxorubicin and IL-2/JES6? Were specific KILR cell depletion studies conducted, e.g. with an anti-KLRK1 depleting antibody? Additional experiments addressing these questions would be desirable to further support the authors' claims.

      We are thankful to both reviewers for their similar comments concerning the analysis of CD8+ T cells in the tumor model. Addressing these comments led to very useful data and significantly improved our manuscript.

      We performed the analysis of splenic CD8+ T cells in the BCL1 leukemia model (the spleen is the major site of the leukemic cells in this model). We observed that KLRK1+ T cells represented almost half of CD8+ T cells in mice treated with DOX+IL-2, which was a much higher frequency than in the control and DOX-only treated mice. Although not all KLRK1+ cells were bona fide KILR cells, the frequencies of KLRK1+ IL-7R+ and KLRK1+ CD49d- cells were also strongly elevated in the DOX+IL-2ic-treated mice. Overall, the survival of DOX+IL-2ic-treated mice correlated with the frequencies of KILR T cells and KLRK1+ T cells. Moreover, GZMB was almost exclusively expressed by KLRK1+ T cells. We are showing these data in Fig. 7C and Fig. S7B in the revised manuscript.

      In the B16 melanoma model, we analyzed CD8+ T cells in the spleens and also in the tumors. We observed a large KLRK1+ GZMB+ CD8+ T-cell population in the spleen of DOX+IL-2ic-treated mice, but not in the untreated or DOX-only treated mice (Fig. 7F). Both KLRK1+ CD49d+ and KLRK1+ CD49d- CD8+ T cells were substantially more frequent in the DOX+IL-2ic-treated mice than in the untreated or DOX-only treated mice (Fig. S7F). In the tumor, the KLRK1+ CD49d- CD8+ T cells were found in large numbers only in the DOX+IL-2ic-treated mice (Fig. 7G). Moreover, these KLRK1+ CD49d- CD8+ T cells expressed high levels of IL-7R and GZMB only in DOX+IL-2ic-treated, but not in untreated and DOX-only treated mice (Fig. 7H).

      We believe that these new data provide evidence that the combination of immunogenic chemotherapy with IL-2 treatment induced KILR cells in the spleens and in the tumors and that this correlates with the better survival.

      Because the majority of non-naïve CD8+ T cells (and the vast majority of GZMB+ CD8+ T cells) in the spleens and tumors of the tumor-bearing mice treated with DOX+IL-2ic were KLRK1+ and because we have shown that the protective effect of the DOX+IL-2ic therapy is largely CD8+ T cell-dependent, we did not find it essential to perform the depletion of KLRK1+ T cells. We believe that it is almost inevitable that the depletion of KLRK1+ T cells would lead to increased tumor growth as it would probably deplete the majority of antigen-specific CD8+ T cells, mimicking the overall CD8+ T cell depletion. Moreover, we do not have this protocol established.

      Reviewer #2 (Public Review):

      In this study, the authors determine the superior cell killing abilities of KLRK1+ IL7R+ (KILR) CD8+ effector T cells in experimental diabetes and tumor mouse models. They also provide evidence that Tregs suppress the formation of this previously uncharacterized subset of CD8+ effector T cells by limiting IL-2.

      Strength and Limitation

      This study focuses on the relationship between Tregs and CD8+ T cells. They used different experimental diabetes mouse models to reveal that Tregs suppress the CD8+ effector T cells by limiting IL-2. They also found a unique subset of KLRK1+ IL7R+ (KILR) CD8+ effector T cells with superior cell killing abilities through single-cell sequencing, but these killing abilities could be inhibited by Tregs. They also tested their theory in an in vivo tumor model. The data, in general, support the conclusions; however, some issues need to be fully addressed, as detailed below.

      We are grateful for the positive overall evaluation of our manuscript by both reviewers and thankful for their specific, insightful comments, which helped us to improve the manuscript.

      1) This study used the concentration of urine glucose as the standard for diabetes (≥ 1000 mg/dl for two consecutive days). However, multiple reasons may lead to a high level of urine glucose. As a type I diabetes mouse model, authors could use immunohistological analysis of islet to show the proportion of T cells and islet cells in islet, which can display the geographic distribution of immune cells, severity and histology structure of damaged pancreas islet directly. If possible, different subsets of immune cells, especially CD4 vs CD8+ cells should be stained for their location.

      We added the histological examination of the pancreas in control, DEREG-, and DEREG+ mice using contrast H&E staining and immunofluorescence (Fig. 1D-E in the revised manuscript). We observed that the high urine and blood glucose levels are preceded by the destruction of the pancreatic islets (morphology and decreased insulin production) as well as by the infiltration of the islets with immune cells including CD4+ and CD8+ T cells.

      2) This article shows that KILR effector CD8+ T cells have strong cytotoxic properties. However, they do not describe the potential proliferation ability vs apoptosis of this subset from islets.

      We analyzed the proliferation (KI67 expression) and apoptosis (Annexin V, cleaved Caspase 3) in T cells isolated from the pancreas of DEREG- and DEREG+ mice on day 4 after the induction of diabetes using flow cytometry (Figure 2 below). We did not observe any differences between DEREG- and DEREG+ mice or among different subsets of OT-I T cells in the DEREG+ mice. Essentially, all T cells were proliferative (KI67+) and there was a very low percentage of Annexin V or cleaved Caspase 3 positive cells.

      Figure 2. Lymphocytes were isolated from the pancreas of DEREG- RIP.OVA and DEREG+ RIP.OVA mice on day 4 after the induction of diabetes, and analyzed using flow cytometry. Two independent experiments were performed. Gated on OT-I T cells. Top: proliferation rate based on Ki-67 staining. Representative histogram and MFI (median is shown). Middle: Apoptosis rate based on Annexin V staining. Representative histogram shows Annexin V staining in three populations of OT-I T cells from DEREG+ mouse (“AE” - CD49d+ KLRK1-, “++” - CD49d+ KLRK1+, KILR - CD49d- KLRK1+), total OT-I T cells from DEREG-, and a positive control: WT CD8+ T cells treated with hydrogen peroxide. Middle right: Percentage of Annexin V+ cells and MFI (median is shown). Bottom: Apoptosis rate based on cleaved Caspase 3 staining. Representative dot plots show cleaved Caspase 3 staining of OT-I T cells from DEREG+, DEREG-, and a positive control: WT CD8+ T cells treated with hydrogen peroxide. Bottom right: percentage of cleaved Caspase 3+ cells (median is shown).

      However, we found the question concerning the proliferation and apoptosis of KILR cells interesting and worth further investigation. For this reason, we assessed the proliferation, survival, and phenotypic stability of naïve, KILR, and effector T cells by their competitive transfer into CD3ε-/- mice. The phenotype of all three subsets remained stable for 4 days (Fig. 6F), documenting that KILR cells are not just a very transient stage. Moreover, the KILR cells were ~2-fold more abundant than effector cells 3 days after their 1:1 cotransfer into CD3ε-/- mice (Fig. 6G, Fig. 6SE). This was probably caused by their slight advantages in both proliferation and survival (Fig. 6SF-G).

      3) Figure 7 shows that the antitumor efficacy of IL-2 depends on CD8+ T cells. But in this part, there is no data to show the change of KLRK1+ IL7R+ CD8+ effector T cells in tumor tissue. Therefore, the article needs to add more data to verify that IL-2 enhances antitumor ability via KLRK1+ IL7R+ CD8+ effector T cells.

      We are thankful to both reviewers for their similar comments concerning the analysis of CD8+ T cells in the tumor model. Addressing these comments led to very useful data and significantly improved our manuscript.

      We performed the analysis of splenic CD8+ T cells in the BCL1 leukemia model (the spleen is the major site of the leukemic cells in this model). We observed that KLRK1+ T cells represented almost half of CD8+ T cells in mice treated with DOX+IL-2, which was a much higher frequency than in the control and DOX-only treated mice. Although not all KLRK1+ cells were bona fide KILR cells, the frequencies of KLRK1+ IL-7R+ and KLRK1+ CD49d- cells were also strongly elevated in the DOX+IL-2ic-treated mice. Overall, the survival of DOX+IL-2ic-treated mice correlated with the frequencies of KILR T cells and KLRK1+ T cells. Moreover, GZMB was almost exclusively expressed by KLRK1+ T cells. We are showing these data in Fig. 7C and Fig. S7B in the revised manuscript.

      In the B16 melanoma model, we analyzed CD8+ T cells in the spleens and also in the tumors. We observed a large KLRK1+ GZMB+ CD8+ T-cell population in the spleen of DOX+IL-2ic-treated mice, but not in the untreated or DOX-only treated mice (Fig. 7F). Both KLRK1+ CD49d+ and KLRK1+ CD49d- CD8+ T cells were substantially more frequent in the DOX+IL-2ic-treated mice than in the untreated or DOX-only treated mice (Fig. S7F). In the tumor, the KLRK1+ CD49d- CD8+ T cells were found in large numbers only in the DOX+IL-2ic-treated mice (Fig. 7G). Moreover, these KLRK1+ CD49d- CD8+ T cells expressed high levels of IL-7R and GZMB only in DOX+IL-2ic-treated, but not in untreated and DOX-only treated mice (Fig. 7H).

      We believe that these new data provide evidence that the combination of immunogenic chemotherapy with IL-2 treatment induced KILR cells in the spleens and in the tumors and that this correlates with the better survival.

      4) It is unclear why the authors chose Dox to combine with IL-2/JES6. The authors should provide a more rational introduction to bridge such a combination. Authors should also explain the reason why there is no antitumor effect of IL-2/JES6 treatment alone.

      The experiments with OT-I mice showed that the formation of KILR cells required both antigenic stimulation and IL-2 signals. We believe that there is only very weak antigenic stimulation by the tumor itself. For this reason, we combined the treatment with the chemotherapeutic doxorubicin, which is known to induce immunogenic cell death of the tumor cells (e.g., Casares et al. 2005, PMID: 16365148). We believe that doxorubicin induces the death of (some) tumor cells and the release and presentation of their tumor-specific antigens. Without it, the tumors are simply too “cold” to induce a sufficient T-cell response. We emphasized this in the revised version of the manuscript.

      Importantly, some of us observed a similar effect of IL-2ic in combination with checkpoint blockade therapy (without chemotherapy) in a different tumor model, which documents that the chemotherapy is not essential for this effect (unpublished data).

    1. Author Response

      Reviewer #1 (Public Review):

      Point 1: Many of the initial analyses of behavior metrics, for instance predicting reaction times, number of fixations, or fixation duration, use value difference as a regressor. However, given a limited set of values, value differences are highly correlated with the option values themselves, as well as the chosen value. For instance, in this task the only time when there will be a value difference of 4 drops is when the options are 1 and 5 drops, and given the high performance of these monkeys, this means the chosen value will overwhelmingly be 5 drops. Likewise, there are only two combinations that can yield a value difference of 3 (5 vs. 2 and 4 vs 1), and each will have relatively high chosen values. Given that value motivates behavior and attracts attention, it may be that some of the putative effects of choice difficulty are actually driven by value.
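
      The confound described in this point can be made concrete by enumerating the possible offer pairs. The sketch below assumes that offers range from 1 to 5 drops, that the two options on a trial always differ in value, and that the larger option is always chosen (an idealization of the monkeys' high performance); these assumptions and all variable names are illustrative rather than taken from the task specification.

```python
from collections import defaultdict
from itertools import combinations

# Assumed offer set: 1-5 drops of juice, two unequal options per trial.
values = range(1, 6)

# Map each value difference to the chosen values it can co-occur with,
# assuming the larger option is always chosen (idealized performance).
chosen_by_difference = defaultdict(list)
for low, high in combinations(values, 2):
    chosen_by_difference[high - low].append(high)

for difference in sorted(chosen_by_difference):
    chosen = chosen_by_difference[difference]
    mean_chosen = sum(chosen) / len(chosen)
    print(f"difference={difference}: chosen values={chosen}, mean={mean_chosen:.2f}")
```

      Under these assumptions the mean chosen value rises with the value difference, which is why the two regressors are hard to separate without an explicit decorrelation step.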

      To address this question, we have adapted the methods of Balewski and colleagues (Neuron, 2022) to isolate the unique contributions of chosen value and trial difficulty to reaction time and the number of fixations in a given trial (the two behaviors modulated by difficulty in the original paper). This new analysis reveals a double dissociation in which reaction time decreases as a function of chosen value but not difficulty, while the number of fixations in a trial shows the opposite pattern. Our interpretation is that reaction time largely reflects reward anticipation, whereas the number of fixations largely reflects the amount of information required to render a decision (i.e., choice difficulty). See lines 144-167 and Figure 2.

      Point 2: Related to point 1, the study found that duration of first fixations increased with fixated values, and second (middle) fixation durations decreased with fixated value but increased with relative value of the fixated versus other value. Can this effect be more concisely described as an effect of the value of the first fixated option carrying over into behavior during the second fixation?

      This is a valid interpretation of the results. To test this directly, we now include an analysis of middle fixation duration as a function of the not-currently-viewed target. Note that the vast majority of middle fixations are the second fixation in the trial, and therefore the value of the unattended target is typically the one that was viewed first. The analysis showed a negative correlation between middle fixation duration and the value of the unattended target, which is consistent with the first fixated value carrying over to the second fixation. See lines 243-246.

      Point 3: Given that chosen (and therefore anticipated) values can motivate responses, often measured as faster reaction times or more vigorous motor movements, it seems curious that terminal non-decision times were calculated as a single value for all trials. Shouldn't this vary depending at least on chosen values, and perhaps other variables in the trial?

      In all sequential sampling model formulations we are aware of, nondecision time is considered to be fixed across trial types. Examples can be found for perceptual decisions (e.g., Resulaj et al., 2009) and in the “bifurcation point” approach used in the recent value-based decision study by Westbrook et al. (2020).

      To further investigate this issue, we asked whether other post-decision processes were sensitive to chosen value in our paradigm. To do so, we measured the interval between the center lever lift and the left or right lever press, corresponding to the time taken to perform the reach movement in each trial (reach latency). We then fit a mixed effects model explaining reach latency as a function of chosen value. While the results showed significantly faster reach latencies with higher chosen values, the effect size was very small, showing on average a ~3ms decrease per drop of juice. In other words, between the highest and lowest levels of chosen value (5 vs. 1), there is only a difference of approximately 12ms. In contrast, the main RT measure used in the study (the interval between target onset and center lever lift) is an order of magnitude more sensitive to chosen value, decreasing ~40ms per drop of juice. These results are shown in Author response image 1.

      Author response image 1.

      This suggests that post-decision processes (NDT in standard models and the additive stage in the Westbrook paper) vary only minimally as a function of chosen value. We are happy to include this analysis as a supplemental figure upon request.
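
      The effect-size comparison above reduces to simple arithmetic, sketched here using the approximate slopes quoted in this response. The variable names are illustrative, and the total RT range is derived from the quoted slope rather than reported separately.

```python
# Approximate slopes quoted in the response (ms change per additional drop of juice).
reach_latency_slope_ms = -3   # reach latency: center lever lift to left/right press
reaction_time_slope_ms = -40  # main RT measure: target onset to center lever lift

# Span between the lowest and highest chosen values (1 vs. 5 drops).
value_span_drops = 5 - 1

reach_latency_range_ms = reach_latency_slope_ms * value_span_drops
reaction_time_range_ms = reaction_time_slope_ms * value_span_drops

print(f"reach latency change across the value range: {reach_latency_range_ms} ms")
print(f"reaction time change across the value range: {reaction_time_range_ms} ms")
```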

      Point 4: The paper aims to demonstrate similarities between monkey and human gaze behavior in value-based decisions, but focuses mainly on a series of results from one group of collaborators (Krajbich, Rangel and colleagues). Other labs have shown additional nuance that the present data could potentially speak to. First, Cavanaugh et al. (J Exp Psychol Gen, 2014) found that gaze allocation and value differences between options independently influence drift rates on different choices. Second, gaze can correlate with choice because attention to an option amplifies its value (or enhances the accumulation of value evidence) or because chosen options are attended more after the choice is implicitly determined but not yet registered. Westbrook et al. (Science, 2020) found that these effects can be dissociated, with attention influencing choice early in the trial and choice influencing attention later. The NDTs calculated in the present study allot a consistent time to translating a choice into a motor command, but as noted above don't account for potential influences of choice or value on gaze.

      The two-stage model of gaze effects put forth by Westbrook et al. (2020) is consistent with other observations of gaze behavior and choice (e.g., Thomas et al., 2019; Smith et al., 2018; Manohar & Husain, 2013). In this model, gaze effects early in the trial are best described by a multiplicative relationship between gaze and value, whereas gaze effects later in the trial are best described with an additive model term. To test the two-stage hypothesis, Westbrook and colleagues determined a ‘bifurcation point’ for each subject that represented the time at which gaze effects transitioned from multiplicative to additive. In our data, trial durations were typically very short (<1s), making it difficult to divide trials and fit separate models to them. We therefore took a different approach: we reasoned that if gaze effects transition from multiplicative to additive at the end of the trial, then the transition point could be estimated by removing data from the end of each trial and assessing the relative fit of a multiplicative vs. additive model. If the early gaze effects are predominantly multiplicative and late gaze effects are additive, the relative goodness of fit for an additive model should decrease as more data are removed from the end of the trial. To test this idea, we compared the relative model fit of additive vs. multiplicative models in the raw data, and for data in which successively larger epochs were removed from the end of the trial (50, 100, 150, 200, 300, and 400ms). The relative fit was assessed by computing the relative probability that each model accurately reflects the data. In addition, to identify significant differences in goodness of fit, we compared the WAIC values and their standard errors for each model (Supplemental File 3).
      As shown in Figure 4, the relative fit probability for both models is nonzero in the raw data (0 truncation), indicating that neither model provides a definitive best fit, potentially reflecting a mixture of the two processes. However, the relative fit of the additive model decreases sharply as data are removed, reaching zero at 100ms truncation. 100ms is also the point at which multiplicative models provide a significantly better fit, indicated by non-overlapping standard error intervals for the two models (Supplemental File 3). Together, these results suggest that the transition between early- and late-stage gaze effects likely occurs approximately 100ms before the RT.
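
      One common way to turn information criteria into relative model probabilities is Akaike-style weighting, sketched below. Whether this matches the exact computation behind Figure 4 is an assumption, the function name is hypothetical, and the WAIC values are placeholders rather than values from the study.

```python
import math


def relative_model_probabilities(waic_by_model):
    """Convert WAIC values (deviance scale, lower is better) into weights that sum to 1."""
    best = min(waic_by_model.values())
    # exp(-0.5 * delta) is the relative likelihood of each model versus the best one.
    relative_likelihood = {
        name: math.exp(-0.5 * (waic - best)) for name, waic in waic_by_model.items()
    }
    total = sum(relative_likelihood.values())
    return {name: value / total for name, value in relative_likelihood.items()}


# Hypothetical WAIC values, for illustration only (not taken from the study).
example = relative_model_probabilities({"multiplicative": 1000.0, "additive": 1002.0})
for name, probability in example.items():
    print(f"{name}: {probability:.3f}")
```

      With this weighting, a weight near zero for the additive model at a given truncation would correspond to the pattern described for the 100ms truncation.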

      To minimize the influence of post-decision gaze effects, the main results use data truncated by 100ms. However, because 100ms is only an estimate, we repeated the main analyses over truncation values between 0 and 400ms, reported in Figure 6 - figure supplement 1 & Figure 7 - figure supplement 1. These show significant gaze duration biases and final gaze biases in data truncated by up to 200ms.

      Reviewer #2 (Public Review):

      Recommendation 1: The only real issue that I see with the paper is fairly obvious: the authors find that the last fixations are longer than the rest, which is inconsistent with a lot of the human work. They argue that this is due to the reaching required in this task, and they take a somewhat ad-hoc approach to trying to correct for it. Specifically, they take the difference between final and non-final, second fixations, and then choose the 95th percentile of that distribution as the amount of time to subtract from the end of each trial. This amounts to about 200 ms being removed from the end of each trial. There are several issues with this approach. First, it assumes that final and non-final fixations should be the same length, when we know from other work that final fixations are generally shorter. Second, it seems to assume that this 200ms is "the latency between the time that the subject commits to the movement and the time that the movement is actually detected by the experimenter". However, there is a mismatch between that explanation and the details of the task. Those last 200ms are before the monkey releases the middle lever, not before the monkey makes a left/right choice. When the monkey releases the middle lever, the stimuli disappear and they then have 500ms to press the left or right lever. But, the reaction time and fixation data terminate when the monkey releases the middle lever. Consequently, I don't find it very likely that the monkeys are using those last 200ms to plan their hand movement after releasing the middle lever.

      Thanks for the opportunity to clarify these points. There are three related issues:

      First, with regards to fixation durations, in the updated Figure 3 we now show durations as a function of both the absolute order in the trial (first, second, third, fourth, etc.) and the relative order (final/nonfinal). We find that durations decrease as a function of absolute order in the trial, an effect also seen in humans (see Manohar & Husain, 2013). At the same time, while holding absolute order constant, final fixations are longer than non-final fixations. To explain the discrepancy with human final fixation durations, we note that monkeys make many fewer fixations per trial (~2.5) than humans do (~3.7, computed from publicly available data from Krajbich et al., 2010.) This means that compared to humans, monkeys’ final fixations occur earlier in the trial (e.g., second or third), and are therefore comparatively longer in duration. Note that studies with humans have not independently measured fixation durations by absolute and relative order, and therefore would not have detected the potential interaction between the two effects.

      Second, the comment suggests that the final 200ms before lever lift is not spent planning the left/right movement, given that the monkeys have time after the lever lift in which to execute the movement (400 or 500ms, depending on the monkey). The presumption appears to be that 400/500ms should be sufficient to plan a left/right reach. However, we think that these two suggestions are unlikely, and that our original interpretation is the most plausible. First, the 400/500ms deadline between lift and left/right press was set to encourage the monkeys to complete the reach as fast as possible, to minimize deliberations or changes of mind after lifting the lever. More specifically, these deadlines were designed so that on ~0.5% of trials, the monkeys actually fail to complete the reach within the deadline and fail to obtain a reward. This manipulation was effective at motivating fast reaches, as the average reach latency (time between lift and press) was 165 ± 20ms (SEM) for Monkey K, and 290 ± 100ms (SEM) for Monkey C.

      Therefore, given the time pressure imposed by the task, it is very unlikely that significant reach planning occurs after the lever lift. In addition to these empirical considerations, the idea that the final moments before the RT are used for motor planning is a standard assumption in many theoretical models of choice (including sequential sampling models, see Ratcliff & McKoon 2008, for review), and is also well-supported by studies of motor control and motor system neurophysiology. Based on these, we think the assumption of some form of terminal NDT is warranted.

      Third, we have changed our method for estimating the NDT interval. In brief, we sweep through a range of NDT truncation values (0-400ms) and identify the smallest interval (100ms) that minimizes the contribution of “additive” gaze effects, which are thought to reflect late-stage, post-decision gaze processes. See the response to Point 4 for Reviewer 1 above, Figure 4 and lines 267-325 in the main text. In addition, we report all of the major study results over a range of truncation values between 0 and 400ms.
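
      The truncation step underlying this sweep can be sketched as follows. The data layout (fixations as start/end times in milliseconds relative to target onset), the function name, and the example trial are all hypothetical; this is an illustration of the idea rather than the analysis code used in the study.

```python
def truncate_fixations(fixations, reaction_time_ms, truncation_ms):
    """Drop gaze data from the final `truncation_ms` before the reaction time.

    `fixations` is a list of (start_ms, end_ms, target) tuples in trial time.
    """
    cutoff_ms = reaction_time_ms - truncation_ms
    truncated = []
    for start_ms, end_ms, target in fixations:
        if start_ms >= cutoff_ms:
            continue  # fixation lies entirely inside the removed window
        # Clip fixations that straddle the cutoff.
        truncated.append((start_ms, min(end_ms, cutoff_ms), target))
    return truncated


# Hypothetical trial: two fixations and a 600 ms reaction time.
trial = [(0, 250, "left"), (250, 600, "right")]
for truncation_ms in (0, 100, 200, 400):
    print(truncation_ms, truncate_fixations(trial, 600, truncation_ms))
```

      Note that a large enough truncation can remove the final fixation entirely, which changes which fixation counts as "final" in any downstream analysis.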

    1. Author Response

      Reviewer #1 (Public Review):

      This paper describes the neural activity, measured by intrinsic optical imaging in reach-to-grasp, and reach-only conditions in relation to the Intra-cortical micro stimulation maps. The paper mostly describes a relatively unique and potentially useful data set. However, in the current version, no real hypotheses about the organization of M1 and PMd are tested convincingly. For example, the claim of "clustered neural activity" is not tested against any quantifiable alternative hypothesis of non-clustered activity, and support for this idea is therefore incomplete.

      The combination of intrinsic optical imaging and intra-cortical micro-stimulation of the motor system of two macaque monkeys promised to be a unique and highly interesting dataset. The experiments are carefully conducted. In the analysis and interpretation of the results, however, the paper was disappointing to me. The two main weaknesses in my mind were:

      a) The alternative hypotheses depicted in Figure 1B are not subjected to any quantifiable test. When is an activity considered to be clustered and when is it distributed? The fact that the observed actions only activate a small portion of the forelimb area (Figure 5G, H) is utterly unconvincing, as this analysis is highly threshold-dependent. Furthermore, it could be the case that the non-activated regions simply do not give a good intrinsic signal, as they are close to microvasculature (something that you actually seem to argue in Figure 6b). Until the authors can show that the other parts of the forelimb area are clearly activated for other forelimb actions (as you suggest on line 625), I believe the claim of cluster neural activity stands unsupported.

      We appreciate the reviewer’s concerns and we have made several revisions.

      (1) The two panels in Fig 1B should have been presented as potential outcomes as opposed to hypotheses in need of quantifiable testing. We revised the Introduction (lines 105-111) and the Results (lines 149-152) accordingly.

      (2) We agree that the thresholding procedure adopted in the original submission could have impacted the spatial measurements of cortical activity (i.e., Fig 5G-H in original submission). We have completely revised the thresholding procedure and it is now based on statistical comparisons that include all trials (instead of thresholding by number of sessions in the original submission). Thus, the thresholded maps in Fig 5G & 5J are now obtained from pixel-by-pixel comparisons (t-tests, p<1e-4) between frames acquired post-movement and frames acquired before movement. Nevertheless, even with this relatively relaxed threshold, the largest activity maps overlapped <40% of the forelimb representations.

      It is important to note that major vessels were excluded from the thresholded map and from the motor map. Thus, uncertainty about imaging in and around vessels was likely not a factor in the calculated overlap between thresholded maps and the motor map.

      (3) We agree that showing activation in other parts of the forelimb representations in response to actions other than reach-to-grasp would have supported some of the arguments that we previously put forth. Unfortunately, we do not have the supporting data and obtaining it would take months/years. We have therefore expanded the Discussion to include limitations of the behavioral task (lines 439-443).
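
      The overlap measure discussed in point (2) can be sketched as follows. The sketch assumes that the thresholded activity map, the forelimb motor map, and the vessel mask are already available as sets of pixel coordinates (the pixel-wise statistical thresholding is taken as given); the function name and the toy grid are hypothetical.

```python
def overlap_fraction(active_pixels, forelimb_pixels, vessel_pixels=frozenset()):
    """Fraction of the forelimb representation covered by thresholded activity.

    Each argument is a set of (row, col) pixel coordinates; vessel pixels are
    excluded from both maps before the overlap is computed.
    """
    forelimb = set(forelimb_pixels) - set(vessel_pixels)
    active = set(active_pixels) - set(vessel_pixels)
    if not forelimb:
        raise ValueError("forelimb map is empty after vessel exclusion")
    return len(active & forelimb) / len(forelimb)


# Toy 10 x 10 forelimb map with one column of vessel pixels and three active rows.
forelimb = {(r, c) for r in range(10) for c in range(10)}
vessels = {(r, 0) for r in range(10)}
active = {(r, c) for r in range(3) for c in range(10)}
fraction = overlap_fraction(active, forelimb, vessels)
print(f"{fraction:.0%} of the forelimb map is active")
```

      Excluding vessel pixels from both the numerator and the denominator keeps the fraction from being deflated by regions where the intrinsic signal is unreliable.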

      b) The most interesting part of the study (which cannot be easily replicated with human fMRI studies) is the correspondence between the evoked activity and intra-cortical stimulation maps. However, this is impeded by the subjective and low-dimensional description of the evoked movement during stimulation (mainly classifying the moving body part), and the relatively low-dimensional nature (4 conditions) of the evoked activity.

      We agree with the reviewer on all counts. We expanded the Discussion to consider the low dimensionality of the motor maps and the behavioral task (lines 439-449).

      Measuring cortical activity in a variety of motor tasks would likely have provided additional insight about movement-related cortical activity. Nevertheless, including additional tasks, even if it were possible to do so in the same monkeys, would have delayed study completion by months/years. The hidden challenge of the experimental design is that each monkey is trained to not move for many seconds to minimize contamination of ISOI signals. For example, from trial initiation to Go Cue, the monkey must hold its hand in the start position for 5 seconds. Similarly, after movement completion, the monkey must hold its hand in the start position for another 5 seconds. In between successful trials, a monkey must wait for ~12 seconds before it can initiate a new trial. These durations are >1 order of magnitude longer than in electrophysiological studies in comparable tasks. Achieving consistent task performance with the long durations used here took months of daily training. Moreover, our monkeys typically run out of steam after ~60-70 min of working on the task. This forces us to limit the overall number of task conditions tested in a session in order to obtain a large enough number of trials from each condition.

      c) Many details about the statistical analysis remain unclear and seem not well motivated.

      We address the reviewer’s specific concerns.

      Reviewer #2 (Public Review):

      Chehade and Gharbawie investigated motor and premotor cortex in macaque monkeys performing grasping and reaching tasks. They used intrinsic signal optical imaging (ISOI) covering an exceedingly large field-of-view extending from the IPS to the PS. They compared reaching and fine/power-grip grasping ISOI maps with "motor" maps which they obtained using extensive intracranial microstimulation. The grasping/reaching-induced activity activated relatively isolated portions of M1 and PMd, and did not cover the entire ICM-induced 'motor' maps of the upper limbs. The authors suggest that small subzones exist in M1 and PMd that are preferentially activated by different types of forelimb actions. In general, the authors address an important topic. The results are not only highly relevant for increasing our basic understanding of the functional architecture of the motor-premotor cortex and how it represents different types of forelimb actions, but also for the development of brain-machine interfaces. These are challenging experiments to perform and add to the existing yet complementary electrophysiology, fMRI, and optical imaging experiments that have been performed on this topic - due to the high sensitivity and large coverage of the particular ISOI methods employed by the authors. The manuscript is generally well written and the analyses seem overall adequate - but see below for some additional analyses that should be done. Although I'm generally enthusiastic about this manuscript, there are two major issues that should be clarified. These major questions relate mainly to potential thresholding issues and clustering issues.

      Major:

      1) The main claim of the authors is that specific forelimb actions activate only a small fraction of what they call the motor map (i.e., those parts of M1/PMd that evoke muscle contractions upon ICM). The action-related activity is measured by ISOI. When looking at the 'raw' reflectance maps, it is rather clear that relatively wide portions of the exposed cortex are activated by grasping/reaching, especially at later time points after the action. In fact, another reading of the results may be that there are two zones of 'deactivation' that split a large swath of motor-premotor cortex being activated by the grasping/reaching actions (e.g., at 6 seconds after the cue in Fig 3A, 5A). At first sight, the 'deactivated' regions seem to be located in the cortex representing the trunk/shoulder/face - hence regions not necessarily activated (or only weakly) during the grasping/reaching actions. If true, this means that most of the relevant M1/PMd cortex IS activated during the latter actions - opposing the 'clustering' claims of the authors. This raises the question of whether the 'granularity' claimed by the authors is

      a. threshold dependent. In this context, the authors should provide an analysis whereby 'granularity' is shown independent of statistical thresholds of the ISOI maps.

      We appreciate the reviewer’s concerns and have completely revised the analyses central to Fig 5. We believe that the figure now contains evidence from both thresholded and unthresholded ISOI data in support of limited spatial extent of cortical activation (i.e., “granularity” in the reviewer’s comments).

      For evidence from unthresholded ISOI data, we examined reflectance change time courses from ROIs of different sizes (lines 764-768). (A) Small circular ROIs (0.4 mm radius), which we placed in the M1 hand, M1 arm, and PMd arm zones (Fig 5B). (B) Large ROI inclusive of the M1 and PMd forelimb representations (Fig 5B). We reasoned that if cortical activity is spatially widespread, then the small and large ROIs would report similar time courses. In contrast, if cortical activity is spatially focal, then activity would be detected in the small ROI time courses but would be washed out in the large ROI time courses. Our results support the second possibility (Fig 5C-F). Thus, in the movement conditions, time courses from the small ROIs had a large negative peak after movement completion (Fig 5C-E). In contrast, the characteristic negative peak was absent in the time courses obtained from the large ROI (Fig 5F).
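
The small-versus-large ROI reasoning can be illustrated with a minimal sketch on synthetic data. The frame size, patch radius, and amplitude below are hypothetical values chosen for illustration; they are not taken from the imaging data, and the helper names `roi_mean` and `make_frame` are assumptions rather than the authors' analysis code.

```python
import math

def roi_mean(frame, center, radius):
    """Mean value of all pixels within `radius` (in pixels) of `center` (row, col)."""
    r0, c0 = center
    vals = [frame[r][c]
            for r in range(len(frame))
            for c in range(len(frame[0]))
            if math.hypot(r - r0, c - c0) <= radius]
    return sum(vals) / len(vals)

def make_frame(size, patch_center, patch_radius, amplitude):
    """Synthetic reflectance-change frame: zero everywhere except one focal patch."""
    r0, c0 = patch_center
    return [[amplitude if math.hypot(r - r0, c - c0) <= patch_radius else 0.0
             for c in range(size)]
            for r in range(size)]

# Hypothetical numbers: 100x100 pixel field of view, focal patch of radius 5 px
frame = make_frame(size=100, patch_center=(50, 50), patch_radius=5, amplitude=-1.0)

small = roi_mean(frame, center=(50, 50), radius=4)   # ROI inside the focal patch
large = roi_mean(frame, center=(50, 50), radius=45)  # ROI spanning most of the field

print(f"small ROI mean: {small:.3f}")
print(f"large ROI mean: {large:.3f}")
```

In this toy case the small ROI reports the full patch amplitude, while the large ROI averages the patch together with many inactive pixels and reports a value close to zero. Spatially widespread activity would instead give similar means in both ROIs.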

      Separately, we revised our thresholding approach to make those results less sensitive to thresholding effects (more details in our response to the first major point from Reviewer 1). The revised results – thresholded/ binarized maps – are consistent with focal cortical activity. Fig 5G & 5J show activity maps thresholded (t-test, p<0.0001) without correction for multiple comparisons, and therefore represent the least restrictive estimate of the spatial extent of cortical activity. Measurements from these maps showed that significantly active pixels overlapped <40% of the M1 & PMd forelimb representations. We interpret the thresholded results as evidence in support of focal cortical activity.
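
The overlap measurement described here amounts to counting significantly active pixels that fall inside the forelimb representation. A minimal sketch follows, assuming both maps are available as sets of pixel coordinates; the masks and the function name `overlap_fraction` are hypothetical and only illustrate the calculation.

```python
def overlap_fraction(active, representation):
    """Fraction of representation pixels that are also significantly active."""
    if not representation:
        raise ValueError("representation mask is empty")
    return len(active & representation) / len(representation)

# Hypothetical masks as sets of (row, col) pixel coordinates
forelimb = {(r, c) for r in range(10) for c in range(10)}    # 100 pixels
active = {(r, c) for r in range(3, 9) for c in range(2, 8)}  # 36 pixels, all inside
active |= {(20, 20)}                                         # one active pixel outside the map

frac = overlap_fraction(active, forelimb)
print(f"{frac:.0%} of the forelimb representation is active")  # prints 36%
```

Active pixels outside the representation do not affect the fraction, because the denominator is the size of the representation rather than the size of the activity map.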

      This raises the question of whether the 'granularity' claimed by the authors is

      b. dependent on the time-point one assesses the maps. Given the sluggish hemodynamic responses, it is unclear which part of the ISOI maps conveys the most information relative to the cue and arm/hand movements. I suspect that timepoints > 6 s will reveal even larger 'homogeneous' activations compared to the maps < 6s.

      We agree with the reviewer that the lag in hemodynamic signals complicates frame selection. Nevertheless, it is unlikely that cortical activity maps would have been larger at time points >6s from Cue. We provide three supporting arguments.

      (1) In the imaging sessions used in Fig 4, we acquired images for 9s per trial and systematically varied Cue onset time. The time courses in Fig 4A-B show that for all Cue onset conditions, the negative peak occurred <6s from Cue. This observation from unthresholded results does not support the notion of greater cortical activity at time points >6s from Cue.

      (2) From the same experiment, Fig 4C shows 9 thresholded/binarized maps generated from different time points in relation to Cue. We measured the size of each map (i.e., overlap with the M1/PMd forelimb representations). We present the results in Author response image 1. The largest maps came from an average frame captured +5.8-6.0s from Cue. Those maps are on the diagonal in Fig 4E (top left to bottom right). This result from thresholded data therefore does not support the notion of greater cortical activity at time points >6s from Cue.

      Author response image 1.

      (3) In all other sessions, we acquired images for 7s per trial (-1.0 to +6.0 s from Cue) without varying Cue onset time. At every time point (100 ms), we measured the size of the thresholded/binarized map in relation to the size of the M1 and PMd forelimb representations. The results are presented in Fig 5I & 5L and indicate that thresholded maps plateau in size by 5.0-5.5 s from Cue. At peak size, the maps overlapped <50% of the M1 and PMd forelimb representations. This result indicates that it is unlikely that we underreported the size of activity maps by not measuring map size beyond 6s from Cue.

      In fact, Fig 5F (which is highly thresholded) shows a surprisingly good match between the different forelimb actions, which argues against the existence of small subzones that are preferentially activated by different types of forelimb actions - the main claim of the authors.

      Our original proposal should have been more clearly stated. We were proposing that the thresholded maps, which had similar spatial organizations across conditions as the reviewer suggested, reported on subzones tuned for reach-to-grasp actions. Adjacent to those subzones could be other subzones that are preferentially active during other types of forelimb actions (e.g., pulling, pushing, grooming). We could not test this possibility in our study because the behavioral task examined a narrow range of arm and hand actions. We therefore revised the Discussion to state the limitations of our task and to lean more on published work that supports the present proposal (lines 439-443 and 504-508).

      2) Related to the previous point, the ROI selections/definitions for the time course analyses seem highly arbitrary. As indicated in the introduction, the clustering hypothesis dictates that "an arm function would be concentrated in subzones of the motor arm zones. Neural activity in adjacent subzones would be tuned for other arm functions." To test this hypothesis directly in a straightforward manner, the authors could use the results from the ICM experiment to construct independent ROIs and to evaluate the ISOI responses for the different actions. In that case, the authors could do a straightforward ANOVA (if the data permits parametric analyses) with ROI, action, and time point (and possibly subject) as factors.

      We agree with the reviewer, and we now leverage the ICMS map for guiding ROI placement. All time courses are now derived from 1 of 2 types of ROIs. (1) Small ROIs (0.4 mm radius) placed in zones defined from ICMS (e.g., M1 hand zone). (2) Large ROIs that include the entire forelimb representations in M1 or in PMd (Fig 5B).

    1. Author Response

      Reviewer #1 (Public Review):

      However, the authors are cautioned to tone down some of the sentences with the human diabetic samples as they rely heavily on extrapolation rather than experimental tests.

      Thank you for this feedback. We have added an experimental test to support the CellChat results. We found that, in accordance with the CellChat analysis, more macrophage Gas6 expression is observed in diabetic wounds via IF. These data are now included in Figures 3C-D. We have additionally edited the text relating to Figure 3 to indicate that these results are not fully conclusive.

      For instance, the antibody inhibition of Axl had minimal effect on the clearance of apoptotic cells in the wound and this would be expected with the redundancy endowed by other TAM receptors.

      Thank you for this point. We have made a note of this in the text in lines 289-291.

      For instance, in Figure 6, the number of TUNEL+ cells seem to be higher in the IgG samples compared to the anti-Timd4 treatment, but this is not the case in the quantification

      Thank you for this comment. We have replaced these with more representative images, which appear in Figure 6A. We also repeated the staining with antibodies for cleaved caspase 3, which appear in Fig. 6 – Fig. supplement 1A, which showed similar results.

      Reviewer #2 (Public Review):

      I suggest repeating the quantification of cells containing active caspase-3 with an anti-cleaved caspase-3 antibody. Here the authors use an antibody recognizing phospho S150, which is far from generally accepted to be a marker for active caspase-3. It would also be good to quantify the apoptotic cells observed in the sections (Fig 1 I and J) and compare to control treatment on sections. It is not clear from the data presented whether the number of apoptotic cells increases or not in the time frame analyzed since the controls are lacking.

      Thank you for this important suggestion. We have repeated the IF staining using an antibody for cleaved caspase 3 (Cell Signaling 9661S) and quantified the apoptotic cells present. We found that apoptotic cells were rare but present at both 24h and 48h after injury, and that significantly more cleaved caspase 3+ cells were present in 48h wounds than 24h wounds. These data are now included in Figure 1H-J and Fig. 1 – Fig. supplement 1F. We have also used this antibody in IF staining in Fig. 5 – Fig. supplement 1B and Fig. 6 – Fig. supplement 1A.

      In a FACS analysis (Fig S1 H), the authors show that there is no increase in dead cells in a time frame of 48 hrs. Could it be that the majority of the cells that may have died in vivo, were lost during the procedure of tissue digestions. Dead cells tend to aggregate.

      Based on these comments and the inconsistency in these data due to potential technical challenges, we have removed the FACS data quantifying Annexin V. We now include the quantification of cleaved caspase 3 and an efferocytosis assay to analyze the kinetics of efferocytosis.

      On line 104 the authors refer to the apoptosis-inducing activity of G0s2. Please, realize that there is little or no in vivo evidence for a role of G0s2 in apoptosis.

      Thank you for this helpful comment. We have removed this gene from our analysis and text.

      The authors state that Axl is uniquely expressed in DC and fibroblasts (Fig 2). Are the Axl-positive cells in panel G (red, Fig 2) that do not stain for the Pdgfra marker (green) then all DCs? Please clarify or show with a triple staining that these cells are indeed DCs.

      Thank you for this comment. To clarify, our intention was to show that both DCs and fibroblasts express Axl, not to say conclusively that only DCs and fibroblasts express Axl. Indeed, in Figure 5, we show that a portion of macrophages also express Axl (at day 3), so some of the Axl+ cells in 2G may be macrophages rather than DCs. We have made this more clear in the text in lines 163-166.

      In addition, it is not clear to me to what reference level exactly the expression levels are compared in Fig 2A. Is this between the 24 and 48h time points after wounding (as mentioned in the legend)? If so, the analysis may indicate up or down regulation but not necessarily expression or no expression.

      Thank you for making this point. The heatmaps display scaled log-normalized mRNA counts for the entire dataset, not a comparison between the two timepoints. We have clarified this in the figure legends.
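
To make the phrase "scaled log-normalized counts" concrete, the sketch below shows one common single-cell convention: per-cell log-normalization followed by a per-gene z-score across the whole dataset. This convention is an assumption; the response does not state the exact normalization, scale factor, or software used, and the count matrix and function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def log_normalize(counts, scale_factor=10_000):
    """Per-cell log-normalization: log1p(count / cell_total * scale_factor)."""
    out = []
    for cell in counts:
        total = sum(cell)
        out.append([math.log1p(c / total * scale_factor) for c in cell])
    return out

def scale_genes(matrix):
    """Z-score each gene (column) across all cells (rows)."""
    n_genes = len(matrix[0])
    cols = [[row[g] for row in matrix] for g in range(n_genes)]
    scaled_cols = []
    for col in cols:
        mu, sd = mean(col), pstdev(col)
        scaled_cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [[scaled_cols[g][i] for g in range(n_genes)] for i in range(len(matrix))]

# Hypothetical raw counts: 4 cells (rows) x 3 genes (columns)
raw = [[10, 0, 90], [20, 5, 75], [0, 50, 50], [5, 5, 90]]
scaled = scale_genes(log_normalize(raw))

for row in scaled:
    print([round(v, 2) for v in row])
```

Because each gene is centered and scaled across all cells, a heatmap of these values shows expression relative to the dataset-wide mean for that gene. It does not by itself show a contrast between two timepoints, nor absolute presence or absence of expression.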

      2) Human diabetic wounds display increased and altered efferocytosis signaling via Axl. This conclusion is solely based on CellChat analysis and should be tuned down or validated.

      Thank you for this suggestion. We have experimentally validated this conclusion using IF staining for Gas6. We found more Gas6 staining in CD68+ macrophages in diabetic foot ulcers than in nondiabetic foot wounds. These data are now included in Figure 3C-D.

      The authors conclude that anti-Axl treatment leads to healing defects based on lack of granulation tissue and larger scabs, a reduction of fibroblast repopulation and revascularization. The differences in the last two parameters mentioned above are obvious, however the other parameters, as granulation tissue and scabs are less clear to me. Is this quantified in any way? In Fig S4 D there is also a large scab visible in the control treatment image. Therefore, it would be good if these parameters could be better substantiated.

      Thank you for this comment. We have edited the text in lines 301-304 to de-emphasize these qualitative changes.

      In view of the lack of revascularization, are there differences in the mRNA expression levels of angiogenic factors such as VEGF and others at this time point? Does revascularization occur at later stages?

      Thank you for this helpful suggestion. We have used qPCR to measure Vegfa mRNA expression, and these data are now included in Figure 5I. We found no significant difference in Vegfa expression 5 days after injury.

      Based on the FACS analysis the authors claim that there are no differences at the level of DCs. However, the plots shown in Fig 5C do not convincingly show the detection of DC (as boxed in the lower panel). Based on the density plots one would presume this is just the continuation of the CD11b+ population and not a separate CD11c+ population. To get a better view of that, it would be better to show dot plots instead of density plots.

      Thank you for this insightful comment. We have created new plots as suggested to demonstrate that this is not exactly the case. In the wound bed, in contrast to blood isolates, full separation of populations is often elusive, so we use single-stain controls to set the gates. Nonetheless, we provide in Author response image 1 the same data as dot plots, as requested, to show that this is not the case, alongside the single-stain control to show that the gating strategy is adequate. We acknowledge that in dissociated tissues the population outlines are sometimes not as clean as those obtained from immunological samples.

      Author response image 1.

      Finally, the authors state (line 265-266) that anti-Axl treatment leads to non-significantly increased expression of IL1alpha and IL6 after one day of injury (Fig S4C). If the difference between the control-treated and the anti-Axl-treated group is statistically not significant I would not conclude there is an increase. Please adapt phrasing or include more mice in the experiment (now only 4) to substantiate the observation and clarify whether it is increased or not.

      Thank you for this comment. We have altered the text in lines 286-289 to better reflect this.

      The authors conclude that overall healing was not affected but that the wound beds appeared more fragile. What is meant with 'appeared more fragile' is not clear. In addition, this seems to me a quite subjective interpretation. What are the objective parameters to come this conclusion?

      Thank you for this point. We have altered the text to remove this subjective language.

      Similar to inhibition of Axl, inhibition of Timd4 led to a defect in revascularization as witnessed by the absence of CD31 staining. Also in this experiment one can raise similar questions as in the anti-Axl experiment: 1) does revascularization occur at a later timepoint; 2) what about the expression of angiogenic factors?

      Thank you for this helpful suggestion. To further investigate the impact of Axl inhibition on angiogenesis, we have assayed for Vegfa by qPCR. We found no significant difference in Vegfa expression 5 days after injury. These data are now included in Figure 5I.

      In the anti-Timd4 treated wounds the authors observe more TUNEL-positive cells and conclude that this is due to a defect in efferocytosis. However, the formal experimental proof for this in the current model is lacking. How do the authors exclude the possibility that anti-Timd4 treatment attracts more infiltrating cells that then undergo treatment, or that the treatment with anti-Timd4 leads to more apoptosis of certain cells in the wound bed. What is the nature of these apoptotic cells (neutrophils, T cells, others)? It has been shown that Timd4 can have stimulatory effects on other cells, such as T cells. Could deprivation of Timd4 signaling in certain conditions lead to more dying cells in this model?

      Thank you for this insightful comment. To investigate this, we have repeated this experiment with IF staining for cleaved caspase 3 and found similar results, indicating the increase in apoptosis upon Timd4 inhibition (Fig. 6 – Fig. supplement 1A). We have also included text to acknowledge the possibility of an increase in apoptosis in lines 326-327.

      Reviewer #3 (Public Review):

      They never do show that there is an increase in apoptotic cells in the wounds, which then go down (which would be a sign that the cells are being cleared via efferocytosis). In addition, they are looking for apoptotic cells at very early time points (24-48 hours), times at which large numbers of apoptotic cells would not be expected. As an example, neutrophil infiltration peaks at 24-48 hours and efferocytosis of apoptotic neutrophils would be expected after that. Other types of apoptotic cells would likely be cleared even later. Finally, several of the panels showing apoptotic cells were done with a very small number of samples (1-3 per group) in some cases so it is unclear how rigorous the data are. I would recommend that the authors at the very least soften the wording related to these conclusions and discuss the limitations of their experimental design; ideally data from more samples would be included to provide clear support for those statements.

      Thank you for raising this important point. In order to support these claims, we have undertaken two additional experiments. Firstly, we have repeated the immunofluorescence staining with a new antibody for activated caspase 3 and quantified the number of apoptotic cells present in 24h and 48h wound beds. We found that apoptotic cells significantly increased in 48h wound beds compared to 24h wounds (Figure 1H and Fig. 1 – Fig. supplement 1F).

      We have also undertaken a new experiment to show the temporal regulation of efferocytosis. We injected stained apoptotic neutrophils into 1D, 3D, and 5D wound beds and quantified the stained cells remaining after 1 hour in order to quantify the clearance of cells from the wound bed at different timepoints. We found that significantly more stained cells undergoing efferocytosis remained in 5D wounds, and that the rate of efferocytosis was approximately constant over this timeline. These data are now included in Figures 2H-M.

      While we would be interested to determine the identities of cells engaging in efferocytosis of the labeled apoptotic neutrophils, we found that co-staining for additional cell markers was impossible while maintaining the fluorescent labeling on the injected neutrophils.

      2) The human RNA-seq data is also quite limited, as non-diabetic wound tissue was all from one patient. Again, this limitation should be acknowledged.

      Thank you for this feedback. We have analyzed new data sets that include 5 individuals with diabetic foot ulcers and 4 individuals with non-diabetic wounds. These data are now included in Figure 3.

      Also, there are some important published papers by Sashwati Roy's group indicating that there are defects in efferocytosis in diabetic wounds, which may go against what the authors are showing here to some degree. The authors' work should be discussed in relation to these other studies.

      Thank you for this suggestion. We have included discussion of this work in the text in lines 192-193.

      3) For anti-Axl and anti-Timd4 experiments, the authors conclude that inhibition of Axl does not affect TUNEL+ cells and that Timd4 does not affect reepithelialization. However, in some cases the sample size was only 3 mice per group when measuring these parameters. That is a very small number of samples to draw conclusions about apoptotic cells or reepithelialization since these parameters are key for the overall conclusions of the experiments. Given that these are key data, it would be important to include more than n=3. Additionally, as stated above, a time point later than 24 h may be necessary to actually see changes in apoptotic cells.

      Thank you for this suggestion. We have repeated the staining for apoptotic cells using a new antibody for cleaved caspase 3 and stained wound beds from additional mice. In the anti-Axl experiments, we now show data for cleaved caspase 3 staining of 3- and 5-day wound beds with N=4 (Fig. 5 – Fig. supplement 1B). In the anti-Timd4 experiments, we now have N=6-11 for the TUNEL staining at 5 days after injection and injury (Figure 6B).

      4) In Fig 6, there look to be many more TUNEL+ cells in the wound bed of IgG control samples compared to anti-Timd4-treated samples, which contradicts the graph. Perhaps the authors could clarify where they were taking their measurements for panels with image analysis results.

      Thank you for this helpful point. We have updated this figure to be more representative of the quantification (Figure 6A-B), as well as repeated the staining with antibodies for cleaved caspase 3 (Fig. 6 – Fig. supplement 1A).

      Another question related to this experiment is how it is possible that efferocytosis is so drastically different yet there are no changes in wound healing (this is one reason why a larger sample size for reepithelialization may be critical) - this would seem to suggest that efferocytosis is not important in wound healing, which is confusing. Further discussion on this might be useful.

      Thank you for this point. Indeed, we see that there is a defect to revascularization when Timd4 is inhibited (Figure 6E-F), which indicates that efferocytosis is important to normal healing. This is discussed in lines 333-335.

    1. Author response:

      Reviewer #1 (Public Review):

      Reviewer #1, comment #1: The study is thorough and systematic, and in comparing three well-separated hypotheses about the mechanism leading from grid cells to hexasymmetry it takes a neutral stand above the fray which is to be particularly appreciated. Further, alternative models are considered for the most important additional factor, the type of trajectory taken by the agent whose neural activity is being recorded. Different sets of values, including both "ideal" and "realistic" ones, are considered for the parameters most relevant to each hypothesis. Each of the three hypotheses is found to be viable under some conditions, and less so in others. Having thus given a fair chance to each hypothesis, nevertheless, the study reaches the clear conclusion that the first one, based on conjunctive grid-by-head-direction cells, is much more plausible overall; the hypothesis based on firing rate adaptation has intermediate but rather weak plausibility; and the one based on clustering of cells with similar spatial phases in practice would not really work. I find this conclusion convincing, and the procedure to reach it, a fair comparison, to be the major strength of the study.

      Response: Thanks for your positive assessment of our manuscript.

      Reviewer #1, comment #2: What I find less convincing is the implicit a priori discarding of a fourth hypothesis, that is, that the hexasymmetry is unrelated to the presence of grid cells. Full disclosure: we have tried unsuccessfully to detect hexasymmetry in the EEG signal from vowel space and did not find any (Kaya, Soltanipour and Treves, 2020), so I may be ranting off my disappointment, here. I feel, however, that this fourth hypothesis should be at least aired, for a number of reasons. One is that a hexasymmetry signal has been reported also from several other cortical areas, beyond entorhinal cortex (Constantinescu et al, 2016); true, also grid cells in rodents have been reported in other cortical areas as well (Long and Zhang, 2021; Long et al, bioRxiv, 2021), but the exact phenomenology remains to be confirmed.

      Response: Thank you for the suggestion to add the hypothesis that the neural hexasymmetry observed in previous fMRI and intracranial EEG studies may be unrelated to grid cells. Following your suggestion, we have now mentioned at the end of the fourth paragraph of the Introduction that “the conjunctive grid by head-direction cell hypothesis does not necessarily depend on an alignment between the preferred head directions with the grid axes”. Furthermore, at the end of section “Potential mechanisms underlying hexadirectional population signals in the entorhinal cortex” (in the Discussion) we write: “However, none of the three hypotheses described here may be true and another mechanism may explain macroscopic grid-like representations. This includes the possibility that neural hexasymmetry is completely unrelated to grid-cell activity, previously summarized as the ‘independence hypothesis' (Kunz et al., 2019). For example, a population of head-direction cells whose preferred head directions occur at offsets of 60 degrees from each other could result in neural hexasymmetry in the absence of grid cells. The conjunctive grid by head-direction cell hypothesis thus also works without grid cells, which may explain why grid-like representations have been observed (using fMRI) in regions outside the entorhinal cortex, where rodent studies have not yet identified grid cells (Doeller et al., 2010; Constantinescu et al., 2016). In that case, however, another mechanism would be needed that could explain why the preferred head directions of different head-direction cells occur at multiples of 60 degrees. Attractor-network structures may be involved in such a mechanism, but this remains speculative at the current stage.” We now also mention the results from Long and Zhang (second paragraph of the Introduction): “Surprisingly, grid cells have also been observed in the primary somatosensory cortex in foraging rats (Long and Zhang, 2021).”
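
The example in the quoted Discussion text, a population of head-direction cells with preferred directions at 60-degree offsets, can be sketched numerically. The code below is an illustration under stated assumptions and not the authors' model: cells have von Mises tuning with a hypothetical concentration of 4, the population size is arbitrary, and the six-fold modulation is measured as the magnitude of the sixth Fourier component of the summed rate divided by the mean rate, which may differ from the hexasymmetry measure used in the manuscript.

```python
import cmath
import math
import random

def population_rate(theta, preferred, kappa=4.0):
    """Summed von Mises-tuned firing of all cells for movement direction theta (rad)."""
    return sum(math.exp(kappa * (math.cos(theta - p) - 1.0)) for p in preferred)

def fold_amplitude(preferred, fold, n_dirs=360):
    """Magnitude of the `fold`-fold Fourier component of the population rate,
    normalized by the mean rate over all sampled directions."""
    thetas = [2 * math.pi * k / n_dirs for k in range(n_dirs)]
    rates = [population_rate(t, preferred) for t in thetas]
    coeff = sum(r * cmath.exp(-1j * fold * t) for r, t in zip(rates, thetas)) / n_dirs
    return abs(coeff) / (sum(rates) / n_dirs)

rng = random.Random(0)
n_cells = 600  # hypothetical population size

# Preferred directions restricted to multiples of 60 degrees
aligned = [math.radians(60 * rng.randrange(6)) for _ in range(n_cells)]
# Preferred directions drawn uniformly at random
uniform = [rng.uniform(0, 2 * math.pi) for _ in range(n_cells)]

a6 = fold_amplitude(aligned, 6)
u6 = fold_amplitude(uniform, 6)
print("6-fold amplitude, aligned:", a6)
print("6-fold amplitude, uniform:", u6)
```

With aligned preferred directions, every cell contributes to the six-fold component with the same phase, so the contributions add up. With uniformly distributed preferred directions the phases are random and largely cancel, leaving a much smaller six-fold amplitude. No grid cells are involved in either population.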

      Regarding your EEG study, we have added a reference to it in the manuscript and state that it is an example for a study that did not find evidence for neural hexasymmetry (end of first paragraph of the Discussion): “We note though that some studies did not find evidence for neural hexasymmetry. For example, a surface EEG study with participants “navigating” through an abstract vowel space did not observe hexasymmetry in the EEG signal as a function of the participants’ movement direction through vowel space (Kaya et al., 2020). Another fMRI study did not find evidence for grid-like representations in the ventromedial prefrontal cortex while participants performed value-based decision making (Lee et al., 2021). This raises the question whether the detection of macroscopic grid-like representations is limited to some recording techniques (e.g., fMRI and iEEG but not surface EEG) and to what extent they are present in different tasks.”

      Reviewer #1, comment #3: Second, as the authors note, the conjunctive mechanism is based on the tight coupling of a narrow head direction selectivity to one of the grid axes. They compare "ideal" with "Doeller" parameters, but to me the "Doeller" ones appear rather narrower than commonly observed and, crucially, they are applied to all cells in the simulations, whereas in reality only a proportion of cells in mEC are reported to be grid cells, only a proportion of them to be conjunctive, and only some of these to be narrowly conjunctive. Further, Gerlei et al (2020) find that conjunctive grid cells may have each of their fields modulated by different head directions, a truly surprising phenomenon that, if extensive, seems to me to cast doubts on the relation between mass activity hexasymmetry and single grid cells.

      Response: We have revised the manuscript in several ways to address the different aspects of this comment.

      Firstly, we agree with the reviewer that our “Doeller” parameter for the tuning width is narrower than commonly observed. We have therefore reevaluated the concentration parameter κ_c in the ‘realistic’ case from 10 rad⁻² (corresponding to a tuning width of 18°) to 4 rad⁻² (corresponding to a tuning width of 29°). We chose this value by referring to Supplementary Figure 3 of Doeller et al. (2010). In their figure, the tuning curves usually cover between one sixth and one third of a circle. Since stronger head-direction tuning contributes the most to the resulting hexasymmetry, we chose a value of κ_c = 4 for the tuning parameter, which corresponds to a tuning width (= half width) of 29° (full width of roughly one sixth of a circle). Regarding the coupling of the preferred head directions to the grid axes, the specific value of the jitter σ_c = 3 degrees that quantifies the coupling of the head-direction preference to the grid axes was extracted from the 95% confidence interval given in the third row of the Table in Supplementary Figure 5b of Doeller et al. (2010). We now better explain the origin of these values in our new Methods section “Parameter estimation” and provide an overview of all parameter values in Table 1.
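
The two tuning widths quoted above are consistent with taking the width as 1/sqrt(κ) radians, which is the Gaussian approximation of a von Mises tuning curve. That relation is inferred here from the quoted numbers; the response does not state the formula, so treat the function below as an assumption.

```python
import math

def tuning_width_deg(kappa):
    """Approximate half-width (degrees) of a von Mises tuning curve,
    assuming width = 1 / sqrt(kappa) radians."""
    return math.degrees(1.0 / math.sqrt(kappa))

for kappa in (10, 4):
    print(f"kappa = {kappa:>2} rad^-2 -> width ~ {tuning_width_deg(kappa):.0f} deg")
# kappa = 10 rad^-2 -> width ~ 18 deg
# kappa =  4 rad^-2 -> width ~ 29 deg
```

Under this assumption, a full width of twice 29 degrees is about 57 degrees, close to one sixth of a circle (60 degrees), which matches the description of the narrowest tuning curves in the paragraph above.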

      Furthermore, in response to your comment, we have revised Figure 2E to show neural hexasymmetries for a larger range of values of the jitter (σc from 0 to 30 degrees), going way beyond the values that Doeller et al. suggested. We have also added a new supplementary figure (Figure 2 – figure supplement 1) where we further extend the range of tuning widths (parameter κ_c) to 60 degrees. This provides the reader with a comprehensive understanding of what parameter values are needed to reach a particular hexasymmetry.

      Regarding your comments on the prevalence of conjunctive grid by head-direction cells, we have revised the manuscript to make it explicit that the actual percentage of conjunctive cells with the necessary properties may be low in the entorhinal cortex (first paragraph of section “A note on our choice of the values of model parameters” of the Discussion): “Empirical studies in rodents found a wide range of tuning widths among grid cells ranging from broad to narrow (Doeller et al., 2010; Sargolini et al., 2006). The percentage of conjunctive cells in the entorhinal cortex with a sufficiently narrow tuning may thus be low. Such distributions (with a proportionally small amount of narrowly tuned conjunctive cells) lead to low values in the absolute hexasymmetry. The neural hexasymmetry in this case would be driven by the subset of cells with sufficiently narrow tuning widths. If this causes the neural hexasymmetry to drop below noise levels, the statistical evaluation of this hypothesis would change.” In addition, in Figure 5, we have applied the coupling between preferred head directions and grid axes to only one third of all grid cells (parameter pc= ⅓ in Table 1), following the values reported by Boccara et al. 2010 and Sargolini et al. 2006. To strengthen the link between Figure 5 and Figure 2, we now state the hexasymmetry when using pc= ⅓ along with a ‘realistic’ tuning width and jitter for head-direction modulated grid cells in Figure 2H. Additionally, we performed new simulations where we observed a linear relationship (above the noise floor) between the proportion of conjunctive cells and the hexasymmetry. This shall help the reader understand the effect of a reduced percentage of conjunctive cells on the absolute hexasymmetry values. We have added these results as a new supplementary figure (Figure 2 – figure supplement 2).
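
      The dependence of the hexasymmetry on the proportion of coupled cells can be illustrated with a toy population. The sketch below sums von Mises head-direction tuning curves and reads out the magnitude of the 6-fold Fourier component; it is a deliberately simplified stand-in, not the grid-cell model or the hexasymmetry measure of the manuscript. The function name, the population size of 600 and the peak-normalised tuning curve are all hypothetical choices.

```python
import cmath
import math
import random

def sixfold_amplitude(n_cells, p_coupled, kappa, jitter_deg, rng, n_dirs=120):
    """Magnitude of the 6-fold Fourier component of summed population activity.

    Toy model: every unit is purely head-direction tuned (no spatial grid).
    A fraction p_coupled has its preferred direction locked to one of six
    axes 60 degrees apart (plus Gaussian jitter); the rest are uniform.
    """
    axes = [k * math.pi / 3 for k in range(6)]
    jitter = math.radians(jitter_deg)
    n_coupled = round(p_coupled * n_cells)
    preferred = []
    for i in range(n_cells):
        if i < n_coupled:   # coupled unit: axis-aligned preference plus jitter
            preferred.append(rng.choice(axes) + rng.gauss(0.0, jitter))
        else:               # uncoupled unit: uniformly random preference
            preferred.append(rng.uniform(0.0, 2 * math.pi))
    coeff = 0j
    for j in range(n_dirs):
        theta = 2 * math.pi * j / n_dirs
        # Summed von Mises tuning curves, each with peak value 1
        activity = sum(math.exp(kappa * (math.cos(theta - p) - 1.0)) for p in preferred)
        coeff += activity * cmath.exp(-6j * theta)
    return abs(coeff) / n_dirs

rng = random.Random(0)
for p in (0.0, 1 / 3, 2 / 3, 1.0):
    amp = sixfold_amplitude(600, p, kappa=4.0, jitter_deg=3.0, rng=rng)
    print(f"p_c = {p:.2f} -> 6-fold amplitude = {amp:.3f}")
```

      In this toy setting the uncoupled units contribute only a small random 6-fold component, which plays the role of a noise floor, while the contribution of the coupled units grows with their number.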

      Finally, regarding your comment on the findings by Gerlei et al. 2020, we now reference this study in our manuscript and discuss the possible implications (second paragraph of section “A note on our choice of the values of model parameters” of the Discussion): “Additionally, while we assumed that all conjunctive grid cells maintain the same preferred head direction between different firing fields, conjunctive grid cells have also been shown to exhibit different preferred head directions in different firing fields (Gerlei et al., 2020). This could lead to hexadirectional modulation if the different preferred head directions are offset by 60° from each other, but will not give rise to hexadirectional modulation if the preferred head directions are randomly distributed. To the best of our knowledge, the distribution of preferred head directions was not quantified by Gerlei et al. (2020), thus this remains an open question.”

      Reviewer #1, comment #4: Finally, a variant of the fourth hypothesis is that the hexasymmetry might be produced by a clustering of head direction preferences across head direction cells similar to that hypothesized in the first hypothesis, but without such cells having to fire in grid patterns. If head direction selectivity is so clustered, who needs the grids? This would explain why hexasymmetry is ubiquitous, and could easily be explored computationally by, in fact, a simplification of the models considered in this study.

      Response: We fully agree with you. We now explain this possibility in the Introduction where we introduce the conjunctive grid by head-direction cell hypothesis (fourth paragraph of the Introduction) and return to it in the Discussion (section “Potential mechanisms underlying hexadirectional population signals in the entorhinal cortex”). There, we now also explain that in such a case another mechanism would be needed to ensure that the preferred head directions of head-direction cells exhibit six-fold rotational symmetry.

      Reviewer #2 (Public Review):

      Reviewer #2, comment #1: Grid cells - originally discovered in single-cell recordings from the rodent entorhinal cortex, and subsequently identified in single-cell recordings from the human brain - are believed to contribute to a range of cognitive functions including spatial navigation, long-term memory function, and inferential reasoning. Following a landmark study by Doeller et al. (Nature, 2010), a plethora of human neuroimaging studies have hypothesised that grid cell population activity might also be reflected in the six-fold (or 'hexadirectional') modulation of the BOLD signal (following the six-fold rotational symmetry exhibited by individual grid cell firing patterns), or in the amplitude of oscillatory activity recorded using MEG or intracranial EEG. The mechanism by which these network-level dynamics might arise from the firing patterns of individual grid cells remains unclear, however.

      In this study, Khalid and colleagues use a combination of computational modelling and mathematical analysis to evaluate three competing hypotheses that describe how the hexadirectional modulation of population firing rates (taken as a simple proxy for the BOLD, MEG, or iEEG signal) might arise from the firing patterns of individual grid cells. They demonstrate that all three mechanisms could account for these network-level dynamics if a specific set of conditions relating to the agent's movement trajectory and the underlying properties of grid cell firing patterns are satisfied.

      The computational modelling and mathematical analyses presented here are rigorous, clearly motivated, and intuitively described. In addition, these results are important both for the interpretation of hexadirectional modulation in existing data sets and for the design of future experiments and analyses that aim to probe grid cell population activity. As such, this study is likely to have a significant impact on the field by providing a firmer theoretical basis for the interpretation of neuroimaging data. To my mind, the only weakness is the relatively limited focus on the known properties of grid cells in rodent entorhinal cortex, and the network level activity that these firing patterns might be expected to produce under each hypothesis. Strengthening the link with existing neurobiology would further enhance the importance of these results for those hoping to assay grid cell firing patterns in recordings of ensemble-level neural activity.

      Response: Thank you very much for reviewing our manuscript and your positive assessment. Following your comments, we have revised the manuscript to more closely link our simulations to known properties of grid cells in the rodent entorhinal cortex.

      Reviewer #3 (Public Review):

      Reviewer #3, comment #1: This is an interesting and carefully carried out theoretical analysis of potential explanations for hexadirectional modulation of neural population activity that has been reported in the human entorhinal cortex and some other cortical regions. The previously reported hexadirectional modulation is of considerable interest as it has been proposed to be a proxy for the activation of grid cell networks. However, the extent to which this proposal is consistent with the known firing properties of grids hasn't received the attention it perhaps deserves. By comparing the predictions of three different models this study imposes constraints on possible mechanisms and generates predictions that can be tested through future experimentation.

      Overall, while the conclusions of the study are convincing, I think the usefulness to the field would be increased if null hypotheses were more carefully considered and if the authors' new metric for hexadirectional modulation (H) could be directly contrasted with previously used metrics. For example, if the effect sizes for hexadirectional modulation in the previous fMRI and EEG data could be more directly compared with those of the models here, then this could help in establishing the extent to which the experimental hexadirectional modulation stands out from path hexasymmetry and how close it comes to the striking modulation observed with the conjunctive models. It could also be helpful to consider scenarios in which hexadirectional modulation is independent of grid firing, for example perhaps with appropriate coordination of head direction cell firing.

      Response: Thanks for reviewing our manuscript and for the overall positive assessment. The new Methods section “Implementation of previously used metrics” starts with the following sentences: “We applied three previously used metrics to our framework: the Generalized Linear Model (GLM) method by Doeller et al. 2010; the GLM method with binning by Kunz et al. 2015; and the circular-linear correlation method by Maidenbaum et al. 2018.” We have created a new supplementary figure (Figure 5 – figure supplement 4) in which we compare the results from these other methods to the results of our new method. Overall, the results are highly similar, indicating that all these methods are equally suited to test for a hexadirectional modulation of neural activity.

      In section “Implementation of previously used metrics” we then explain: “In brief, in the GLM method (e.g. used in Doeller et al., 2010), the hexasymmetry is found in two steps: the orientation of the hexadirectional modulation is first estimated on the first half of the data by using the regressors β1 cos(6θ_t) and β2 sin(6θ_t) on the time-discrete fMRI activity (Equation 9), with θ_t being the movement direction of the subject in time step t. The amplitude of the signal is then estimated on the second half of the data using a single regressor cos(6(θ_t − φ)), where φ is the orientation estimated on the first half. The hexasymmetry is then evaluated from this amplitude.

      The GLM method with binning (e.g. used in Kunz et al., 2015) uses the same procedure as the GLM method for estimating the grid orientation in the first half of the data, but the amplitude is estimated differently on the second half by a regressor that has a value of 1 if θ_t is aligned with a peak of the hexadirectional modulation (alignment being determined with the modulo operator) and a value of -1 if θ_t is misaligned. The hexasymmetry is then calculated from the amplitude in the same way as in the GLM method.

      The circular-linear correlation method (e.g. used in Maidenbaum et al., 2018) is similar to the GLM method in that it uses the regressors β1 cos(6θ_t) and β2 sin(6θ_t) on the time-discrete mean activity, but instead of using β1 and β2 to estimate the orientation of the hexadirectional modulation, the beta values are directly used to estimate the hexasymmetry.”
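
      The two-step GLM procedure and the direct use of the beta values can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it adds an intercept column, takes the orientation as φ = atan2(β2, β1)/6, and uses √(β1² + β2²) as the sinusoid magnitude. The exact expressions used in the manuscript are not given here, and all function names are hypothetical. A small normal-equations solver is written out so the example needs only the standard library; a numerical library's least-squares routine would normally be used instead.

```python
import math
import random

def least_squares(columns, y):
    """Ordinary least squares via the normal equations (Gauss-Jordan elimination)."""
    k = len(columns)
    # Augmented matrix [X^T X | X^T y]
    a = [[sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(ci * yi for ci, yi in zip(columns[i], y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def estimate_orientation(theta, activity):
    """Fit activity ~ b0 + b1*cos(6*theta) + b2*sin(6*theta).

    Returns (orientation, magnitude) with orientation = atan2(b2, b1) / 6 and
    magnitude = sqrt(b1^2 + b2^2); both formulas are assumptions of this sketch.
    """
    ones = [1.0] * len(theta)
    cos6 = [math.cos(6 * t) for t in theta]
    sin6 = [math.sin(6 * t) for t in theta]
    _, b1, b2 = least_squares([ones, cos6, sin6], activity)
    return math.atan2(b2, b1) / 6.0, math.hypot(b1, b2)

def split_half_amplitude(theta, activity):
    """Estimate orientation on the first half, amplitude on the second half."""
    half = len(theta) // 2
    phi, _ = estimate_orientation(theta[:half], activity[:half])
    aligned = [math.cos(6 * (t - phi)) for t in theta[half:]]
    _, beta = least_squares([[1.0] * len(aligned), aligned], activity[half:])
    return phi, beta

# Synthetic example: 6-fold modulated activity with additive Gaussian noise
rng = random.Random(0)
phi_true = math.radians(10.0)
theta = [rng.uniform(0.0, 2 * math.pi) for _ in range(2000)]
activity = [1.0 + 0.5 * math.cos(6 * (t - phi_true)) + rng.gauss(0.0, 0.1) for t in theta]
phi_hat, amplitude = split_half_amplitude(theta, activity)
print(f"orientation ~ {math.degrees(phi_hat):.1f} deg, amplitude ~ {amplitude:.2f}")
```

      Because the amplitude is fitted against an orientation estimated on independent data, it can come out negative when the two halves disagree, which is what allows a test against zero.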

      For each of the three previously used metrics and our new method, we estimated the resulting hexasymmetry (new Figure 5 – figure supplement 4 in the manuscript). In the Methods section “Implementation of previously used metrics” we then continue with our explanations: “Regarding the statistical evaluation, each method evaluates the size of the neural hexasymmetry differently. Specifically, the new method developed in our manuscript compares the neural hexasymmetry to path hexasymmetry to test whether neural hexasymmetry is significantly above path hexasymmetry. For the two generalized linear model (GLM) methods, we compare the hexasymmetry to zero (using the Mann-Whitney U test) to establish significance. Hexasymmetry values can be negative in these approaches, allowing the statistical comparison against 0. Negative values occur when the estimated grid orientation from the first data half does not match the grid orientation from the second data half. Regarding the statistical evaluation of the circular-linear correlation method, we calculated a z-score by comparing each empirical observation of the hexasymmetry to hexasymmetries from a set of surrogate distributions (as in Maidenbaum et al., 2018). We then calculate a p-value by comparing the distribution of z-scores versus zero using a Mann-Whitney U test. We use the z-scores instead of the hexasymmetry for the circular-linear correlation method to match the procedure used in Maidenbaum et al. (2018). We obtained the surrogate distributions by circularly shifting the vector of movement directions relative to the time-dependent vector of firing rates. For random walks, the vector is shifted by a random number drawn from a uniform distribution whose range equals the number of time points in the vector of movement directions. For the star-like walks and piecewise linear walks, the shift is a random integer multiplied by the number of time points in a linear segment. Circularly shifting the vector of movement directions scrambles the correlations between movement direction and neural activity while preserving their temporal structure.”
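
      The surrogate procedure described above can be sketched as follows. This is a minimal illustration with hypothetical function names; the number of surrogates and the statistic that is recomputed on each shifted vector are left to the caller.

```python
import random
import statistics

def circular_shift(values, shift):
    """Rotate a sequence by `shift` positions, wrapping around the end."""
    shift %= len(values)
    return list(values[-shift:]) + list(values[:-shift]) if shift else list(values)

def draw_shift(n_timepoints, rng, segment_length=None):
    """Draw a random shift.

    Random walks: any offset in [0, n_timepoints).
    Star-like or piecewise linear walks: a whole number of linear segments,
    so that segment boundaries stay aligned after the shift.
    """
    if segment_length is None:
        return rng.randrange(n_timepoints)
    return rng.randrange(n_timepoints // segment_length) * segment_length

def z_score(observed, surrogate_values):
    """Standardise an observed statistic against its surrogate distribution."""
    mean = statistics.fmean(surrogate_values)
    return (observed - mean) / statistics.stdev(surrogate_values)

# Example: movement directions (degrees) in three linear segments of 3 time points
directions = [0, 0, 0, 60, 60, 60, 120, 120, 120]
shift = draw_shift(len(directions), random.Random(0), segment_length=3)
print(shift, circular_shift(directions, shift))
```

      Each surrogate keeps the temporal structure of the movement directions intact and changes only their alignment with the neural time series.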

      The results of these simulations, i.e. the comparison of our new method to previously used metrics, are summarized in Figure 5 – figure supplement 4 and show qualitatively identical findings when using the different methods. We have added this information also to the manuscript in the third paragraph of section “Quantification of hexasymmetry of neural activity and trajectories” of the Methods: “Empirical (fMRI/iEEG) studies (e.g. Doeller et al., 2010; Kunz et al., 2015; Maidenbaum et al., 2018) addressed this problem of trajectories spuriously contributing to hexasymmetry by fitting a Generalized Linear Model (GLM) to the time discrete fMRI/iEEG activity. In contrast, our new approach to hexasymmetry in Equation (12) quantifies the contribution of the path to the neural hexasymmetry explicitly, and has the advantage that it allows an analytical treatment (see next section). Comparing our new method with previous methods for evaluating hexasymmetry led to qualitatively identical statistical effects (Figure 5 – figure supplement 4).” We have also added a pointer to this new supplementary figure in the caption of Figure 5 in the manuscript: “For a comparison between our method and previously used methods for evaluating hexasymmetry, see Figure 5 – figure supplement 4.”

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript will interest cognitive scientists, neuroimaging researchers, and neuroscientists interested in the systems-level organization of brain activity. The authors describe four brain states that are present across a wide range of cognitive tasks and determine that the relative distribution of the brain states shows both commonalities and differences across task conditions.

      The authors characterized the low-dimensional latent space that has been shown to capture the major features of intrinsic brain activity using four states obtained with a Hidden Markov Model. They related the four states to previously-described functional gradients in the brain and examined the relative contribution of each state under different cognitive conditions. They showed that states related to the measured behavior for each condition differed, but that a common state appears to reflect disengagement across conditions. The authors bring together a state-of-the-art analysis of systems-level brain dynamics and cognitive neuroscience, bridging a gap that has long needed to be bridged.

      The strongest aspect of the study is its rigor. The authors use appropriate null models and examine multiple datasets (not used in the original analysis) to demonstrate that their findings replicate. Their thorough analysis convincingly supports their assertion that common states are present across a variety of conditions, but that different states may predict behavioural measures for different conditions. However, the authors could have better situated their work within the existing literature. It is not that a more exhaustive literature review is needed; rather, some of their results are unsurprising given the work reported in other manuscripts; some of their work reinforces or is reinforced by prior studies; and some of their work is not compared to similar findings obtained with other analysis approaches. While space is not unlimited, some of these gaps are important enough that they are worth addressing:

      We appreciate the reviewer’s thorough read of our manuscript and positive comments on its rigor and implications. We agree that the original version of the manuscript insufficiently situated this work in the existing literature. We have made extensive revisions to better place our findings in the context of prior work. These changes are described in detail below.

      1) The authors' own prior work on functional connectivity signatures of attention is not discussed in comparison to the latest work. Neither is work from other groups showing signatures of arousal that change over time, particularly in resting state scans. Attention and arousal are not the same things, but they are intertwined, and both have been linked to large-scale changes in brain activity that should be captured in the HMM latent states. The authors should discuss how the current work fits with existing studies.

      Thank you for raising this point. We agree that the relationship between low-dimensional latent states and predefined activity and functional connectivity signatures is an important and interesting question in both attention research and more general contexts. Here, we did not empirically relate the brain states examined in this study and functional connectivity signatures previously investigated in our lab (e.g., Rosenberg et al., 2016; Song et al., 2021a) because the research question and methodological complexities deserved separate attention that goes beyond the scope of this paper. Therefore, we conceptually addressed the reviewer’s question on how functional connectivity signatures of attention are related to the brain states that were observed here. Next, we asked how arousal relates to the brain states by indirectly predicting arousal levels of each brain state based on its activity patterns’ spatial resemblance to the predefined arousal network template (Goodale et al., 2021).

      Latent states and dynamic functional connectivity

      Previous work suggested that, on medium time scales (~20-60 seconds), changes in functional connectivity signatures of sustained attention (Rosenberg et al., 2020) and narrative engagement (Song et al., 2021a) predicted changes in attentional states. How do these attention-related functional connectivity dynamics relate to latent state dynamics, measured on a shorter time scale (1 second)?

      Theoretically, there are reasons to think that these measures are related but not redundant. Both HMM and dynamic functional connectivity provide summary measures of the whole-brain functional interactions that evolve over time. Whereas HMM identifies recurring low-dimensional brain states, dynamic functional connectivity used in our and others’ prior studies captures high-dimensional dynamical patterns. Furthermore, while the Gaussian mixture function utilized to infer emission probability in our HMM infers the states from both the BOLD activity patterns and their interactions, functional connectivity considers only pairwise interactions between regions of interest. Thus, on the theoretical ground that brain states can be characterized at multiple scales and with different methods (Greene et al., 2023), we can hypothesize that both measures could (and perhaps should be able to) capture brain-wide latent state changes. For example, if we were to apply k-means clustering methods on the sliding window-based dynamic functional connectivity as in Allen et al. (2014), the resulting clusters could arguably be similar to the latent states derived from the HMM.

      However, there are practical reasons why the correspondence between our prior dynamic functional connectivity models and current HMM states is difficult to test directly. A time point-by-time point matching of the HMM state sequence and dynamic functional connectivity is not feasible because, in our prior work, dynamic functional connectivity was measured in a sliding time window (~20-60 seconds), whereas the HMM state identification is conducted at every TR (1 second). An alternative would be to concatenate all time points that were categorized as each HMM state to compute representative functional connectivity of that state. This “splicing and concatenating” method, however, disrupts continuous BOLD-signal time series and has not previously been validated for use with our dynamic connectome-based predictive models. In addition, the difference in time series lengths across states would make comparisons of the four states’ functional connectomes unfair.

      One main focus of our manuscript was to relate brain dynamics (HMM state dynamics) to static manifold (functional connectivity gradients). We agree that a direct link between two measures of brain dynamics, HMM and dynamic functional connectivity, is an important research question. However, due to some intricacies that needed to be addressed to answer this question, we felt that it was beyond the scope of our paper. We are eager, however, to explore these comparisons in future work which can more thoroughly address the caveats associated with comparing models of sustained attention, narrative engagement, and arousal defined using different input features and methods.

      Arousal, attention, and latent neural state dynamics

      Next, the reviewer posed an important question about the relationship between arousal, attention, and latent states. The current study was designed to assess the relationship between attention and latent state dynamics. However, previous neuroimaging work showed that low-dimensional brain dynamics reflect fluctuations in arousal (Raut et al., 2021; Shine et al., 2016; Zhang et al., 2023). Behavioral studies showed that attention and arousal hold a non-linear relationship: for example, mind-wandering states are associated with lower arousal and externally distracted states are associated with higher arousal, even though both states indicate low attention (Esterman and Rothlein, 2019; Unsworth and Robison, 2018, 2016).

      To address the reviewer’s suggestion, we wanted to test if our brain states reflected changes in arousal, but we did not collect relevant behavioral or physiological measures. Therefore, to indirectly test for relationships, we predicted levels of arousal in brain states by applying the “arousal network template” defined by Dr. Catie Chang’s group (Chang et al., 2016; Falahpour et al., 2018; Goodale et al., 2021). The arousal network template was created from resting-state fMRI data to predict arousal levels indicated by eye monitoring and electrophysiological signals. In the original study, the arousal level at each time point was predicted by the correlation between the BOLD activity pattern of each TR and the arousal template. The more similar the whole-brain activation pattern was to the arousal network template, the more aroused the participant was predicted to be at that moment. This activity pattern-based model was generalized to fMRI data during tasks (Goodale et al., 2021).
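
      The template-matching step amounts to a spatial Pearson correlation between an activity pattern and the template. The sketch below shows the computation on made-up five-parcel vectors; the numbers, state labels and function names are hypothetical and do not correspond to any real template or state map.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def template_similarity(patterns, template):
    """Correlate each named activity pattern (one value per parcel) with the template."""
    return {name: pearson(pattern, template) for name, pattern in patterns.items()}

# Made-up example values: one number per parcel
template = [0.8, -0.2, 0.5, -0.6, 0.1]
patterns = {
    "state_A": [0.6, 0.0, 0.4, -0.5, 0.2],    # resembles the template
    "state_B": [-0.7, 0.3, -0.2, 0.6, 0.0],   # roughly the opposite pattern
}
for name, r in template_similarity(patterns, template).items():
    print(f"{name}: r = {r:+.2f}")
```

      A positive correlation is read as higher predicted arousal and a negative one as lower predicted arousal, which is the logic applied to the HMM state patterns.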

      We correlated the arousal template to the activity patterns of the four brain states that were inferred by the HMM. The DMN state was positively correlated with the arousal template (r=0.264) and the SM state was negatively correlated with the arousal template (r=-0.303) (Author response image 1). These values were not tested for significance because they were single observations. While speculative, this may suggest that participants are in a high arousal state during the DMN state and a low arousal state during the SM state. Together with our results relating brain states to attention, it is possible that the SM state is a common state indicating low arousal and low attention. On the other hand, the DMN state, a signature of a highly aroused state, may benefit gradCPT task performance but not necessarily engagement with a sitcom episode. However, because this was a single observation and we did not collect a physiological measure of arousal to validate this indirect prediction result, we did not include the result in the manuscript. We hope to more directly test this question in future work with behavioral and physiological measures of arousal.

      Author response image 1.

      Changes made to the manuscript

      Importantly, we agree with the reviewer that a theoretical discussion about the relationships between functional connectivity, latent states, gradients, as well as attention and arousal was a critical omission from the original Discussion. We edited the Discussion to highlight past literature on these topics and encourage future work to investigate these relationships.

      [Manuscript, page 11] “Previous studies showed that large-scale neural dynamics that evolve over tens of seconds capture meaningful variance in arousal (Raut et al., 2021; Zhang et al., 2023) and attentional states (Rosenberg et al., 2020; Yamashita et al., 2021). We asked whether latent neural state dynamics reflect ongoing changes in attention in both task and naturalistic contexts.”

      [Manuscript, page 17] “Previous work showed that time-resolved whole-brain functional connectivity (i.e., paired interactions of more than a hundred parcels) predicts changes in attention during task performance (Rosenberg et al., 2020) as well as movie-watching and story-listening (Song et al., 2021a). Future work could investigate whether functional connectivity and the HMM capture the same underlying “brain states” to bridge the results from the two literatures. Furthermore, though the current study provided evidence of neural state dynamics reflecting attention, the same neural states may, in part, reflect fluctuations in arousal (Chang et al., 2016; Zhang et al., 2023). Complementing behavioral studies that demonstrated a nonlinear relationship between attention and arousal (Esterman and Rothlein, 2019; Unsworth and Robison, 2018, 2016), future studies collecting behavioral and physiological measures of arousal can assess the extent to which attention explains neural state dynamics beyond what can be explained by arousal fluctuations.”

      2) The 'base state' has been described in a number of prior papers (for one early example, see https://pubmed.ncbi.nlm.nih.gov/27008543). The idea that it might serve as a hub or intermediary for other states has been raised in other studies, and discussion of the similarity or differences between those studies and this one would provide better context for the interpretation of the current work. One of the intriguing findings of the current study is that the incidence of this base state increases during sitcom watching, which is the strongest evidence to date that it has a cognitive role and is not merely a configuration of activity that the brain must pass through when making a transition.

      We greatly appreciate the reviewer’s suggestion of prior papers. We were not aware of previous findings of the base state at the time of writing the manuscript, so it was reassuring to see consistent findings. In the Discussion, we highlighted the findings of Chen et al. (2016) and Saggar et al. (2022). Both studies highlighted the role of the base state as a “hub”-like transition state. However, as the reviewer noted, these studies did not address the functional relevance of this state to cognitive states because both were based on resting-state fMRI.

      In our revised Discussion, we write that our work replicates previous findings of the base state that consistently acted as a transitional hub state in macroscopic brain dynamics. We also note that our study expands this line of work by characterizing what functional roles the base state plays in multiple contexts: The base state indicated high attentional engagement and exhibited the highest occurrence proportion as well as longest dwell times during naturalistic movie watching. The base state’s functional involvement was comparatively minor during controlled tasks.

      [Manuscript, page 17-18] “Past resting-state fMRI studies have reported the existence of the base state. Chen et al. (2016) used the HMM to detect a state that had “less apparent activation or deactivation patterns in known networks compared with other states”. This state had the highest occurrence probability among the inferred latent states, was consistently detected by the model, and was most likely to transition to and from other states, all of which mirror our findings here. The authors interpret this state as an “intermediate transient state that appears when the brain is switching between other more reproducible brain states”. The observation of the base state was not confined to studies using HMMs. Saggar et al. (2022) used topological data analysis to represent a low-dimensional manifold of resting-state whole-brain dynamics as a graph, where each node corresponds to brain activity patterns of a cluster of time points. Topologically focal “hub” nodes were represented uniformly by all functional networks, meaning that no characteristic activation above or below the mean was detected, similar to what we observe with the base state. The transition probability from other states to the hub state was the highest, demonstrating its role as a putative transition state.

      However, the functional relevance of the base state to human cognition had not been explored previously. We propose that the base state, a transitional hub (Figure 2B) positioned at the center of the gradient subspace (Figure 1D), functions as a state of natural equilibrium. Transitioning to the DMN, DAN, or SM states reflects incursion away from natural equilibrium (Deco et al., 2017; Gu et al., 2015), as the brain enters a functionally modular state. Notably, the base state indicated high attentional engagement (Figure 5E and F) and exhibited the highest occurrence proportion (Figure 3B) as well as the longest dwell times (Figure 3—figure supplement 1) during naturalistic movie watching, whereas its functional involvement was comparatively minor during controlled tasks. This significant relevance to behavior verifies that the base state cannot simply be a byproduct of the model. We speculate that susceptibility to both external and internal information is maximized in the base state—allowing for roughly equal weighting of both sides so that they can be integrated to form a coherent representation of the world—at the expense of the stability of a certain functional network (Cocchi et al., 2017; Fagerholm et al., 2015). When processing rich narratives, particularly when a person is fully immersed without having to exert cognitive effort, a less modular state with high degrees of freedom to reach other states may be more likely to be involved. The role of the base state should be further investigated in future studies.”
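
      The summary measures referred to above (occurrence proportion, dwell time and transitions between states) can all be derived from a decoded state sequence. The sketch below is a minimal illustration on a made-up sequence. Counting only switches between different states when computing transition probabilities is an assumption of this example, and the function names are hypothetical.

```python
from collections import Counter
from itertools import groupby

def occurrence_proportion(states):
    """Fraction of time points assigned to each state."""
    counts = Counter(states)
    return {s: c / len(states) for s, c in counts.items()}

def mean_dwell_time(states, tr_seconds=1.0):
    """Mean duration (seconds) of uninterrupted visits to each state."""
    runs = {}
    for state, run in groupby(states):
        runs.setdefault(state, []).append(sum(1 for _ in run))
    return {s: tr_seconds * sum(r) / len(r) for s, r in runs.items()}

def transition_probabilities(states):
    """P(next state | current state), counting only switches between different states."""
    counts = Counter((a, b) for a, b in zip(states, states[1:]) if a != b)
    totals = Counter()
    for (a, _), c in counts.items():
        totals[a] += c
    return {(a, b): c / totals[a] for (a, b), c in counts.items()}

# Made-up state sequence, one label per TR
sequence = ["base", "base", "DMN", "base", "DAN", "DAN", "base", "SM", "base"]
print(occurrence_proportion(sequence))
print(mean_dwell_time(sequence))
print(transition_probabilities(sequence))
```

      In a sequence like this one, a hub-like state shows up as the state that every other state transitions into and out of.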

      3) The link between latent states and functional connectivity gradients should be considered in the context of prior work showing that the spatiotemporal patterns of intrinsic activity that account for most of the structure in resting state fMRI also sweep across functional connectivity gradients (https://pubmed.ncbi.nlm.nih.gov/33549755/). In fact, the spatiotemporal dynamics may give rise to the functional connectivity gradients (https://pubmed.ncbi.nlm.nih.gov/35902649/). HMM states bear a marked resemblance to the high-activity phases of these patterns and are likely to be closely linked to them. The spatiotemporal patterns are typically obtained during rest, but they have been reported during task performance (https://pubmed.ncbi.nlm.nih.gov/30753928/) which further suggests a link to the current work. Similar patterns have been observed in anesthetized animals, which also reinforces the conclusion of the current work that the states are fundamental aspects of the brain's functional organization.

      We appreciate the comments that relate spatiotemporal patterns, functional connectivity gradients, and the latent states derived from the HMM. Our work was also inspired by the papers that the reviewer suggested, especially Bolt et al. (2022), which compared the results of numerous dimensionality reduction and clustering algorithms and suggested three spatiotemporal patterns that seemed to be commonly supported across algorithms. We originally cited these studies throughout the manuscript, but did not discuss them comprehensively. We have revised the Discussion to situate our findings in past work that used resting-state fMRI to study low-dimensional latent brain states.

      [Manuscript, page 15-16] “This perspective is supported by previous work that has used different methods to capture recurring low-dimensional states from spontaneous fMRI activity during rest. For example, to extract time-averaged latent states, early resting-state analyses identified task-positive and task-negative networks using seed-based correlation (Fox et al., 2005). Dimensionality reduction algorithms such as independent component analysis (Smith et al., 2009) extracted latent components that explain the largest variance in fMRI time series. Other lines of work used time-resolved analyses to capture latent state dynamics. For example, variants of clustering algorithms, such as co-activation patterns (Liu et al., 2018; Liu and Duyn, 2013), k-means clustering (Allen et al., 2014), and HMM (Baker et al., 2014; Chen et al., 2016; Vidaurre et al., 2018, 2017), characterized fMRI time series as recurrences of and transitions between a small number of states. Time-lag analysis was used to identify quasiperiodic spatiotemporal patterns of propagating brain activity (Abbas et al., 2019; Yousefi and Keilholz, 2021). A recent study extensively compared these different algorithms and showed that they all report qualitatively similar latent states or components when applied to fMRI data (Bolt et al., 2022). While these studies used different algorithms to probe data-specific brain states, this work and ours report common latent axes that follow a long-standing theory of large-scale human functional systems (Mesulam, 1998). Neural dynamics span principal axes that dissociate unimodal to transmodal and sensory to motor information processing systems.”

      Reviewer #2 (Public Review):

      In this study, Song and colleagues applied a Hidden Markov Model to whole-brain fMRI data from the unique SONG dataset and a grad-CPT task, and in doing so observed robust transitions between low-dimensional states that they then attributed to specific psychological features extracted from the different tasks.

      The methods used appeared to be sound and robust to parameter choices. Whenever choices were made regarding specific parameters, the authors demonstrated that their approach was robust to different values, and also replicated their main findings on a separate dataset.

      I was mildly concerned that similarities in some of the algorithms used may have rendered some of the inter-measure results as somewhat inevitable (a hypothesis that could be tested using appropriate null models).

      This work is quite integrative, linking together a number of previous studies into a framework that allows for interesting follow-up questions.

      Overall, I found the work to be robust, interesting, and integrative, with a wide-ranging citation list and exciting implications for future work.

      We appreciate the reviewer’s comments on the study’s robustness and future implications. Our work was highly motivated by the reviewer’s prior work.

      Reviewer #3 (Public Review):

      My general assessment of the paper is that the analyses done after they find the model are exemplary and show some interesting results. However, the method they use to find the number of states (Calinski-Harabasz score instead of log-likelihood), the model they use generally (HMM), and the fact that they don't show how they find the number of states on HCP, with the Schaefer atlas, and do not report their R^2 on a test set are a little concerning. I don't think this per se impedes their results, but it is something that they can improve. They argue that the states they find align with long-standing ideas about the functional organization of the brain and align with other research, but they can improve their selection for their model.

      We appreciate the reviewer’s thorough read of the paper, evaluation of our analyses linking brain states to behavior as “exemplary”, and important questions about the modeling approach. We have included detailed responses below and updated the manuscript accordingly.

      Strengths:

      • Use multiple datasets, multiple ROIs, and multiple analyses to validate their results

      • Figures are convincing in the sense that patterns clearly synchronize between participants

      • Authors select the number of states using the optimal model fit (although this turns out to be a little more questionable due to what they quantify as 'optimal model fit')

      We address this concern on page 30-31 of this response letter.

      • Replication with the Schaefer atlas makes results more convincing

      • The analyses around the fact that the base state acts as a flexible hub are well done and well explained

      • Their comparison of synchrony is well done. Comparing it to resting state, which does not show any significant synchrony among participants, is an obvious choice but still a good baseline.

      • Their results with respect to similar narrative engagement being correlated with similar neural state dynamics are well done and interesting.

      • Their results on event boundaries are compelling and well done. However, I do not find their Chang et al. results convincing (Figure 4B). It could just be that the different medium explains the differences in DMN response, but to me these seem like altogether different patterns that cannot be fully explained by their method or results.

      We entirely agree with the reviewer that the Chang et al. (2021) data are different in many ways from our own SONG dataset. Whereas data from Chang et al. (2021) were collected while participants listened to an audio-only narrative, participants in the SONG sample watched and listened to audiovisual stimuli. They were scanned at different universities in different countries with different protocols by different research groups for different purposes. That is, there are numerous reasons why we would expect the model should not generalize. Thus, we found it compelling and surprising that, despite all of these differences between the datasets, the model trained on the SONG dataset generalized to the data from Chang et al. (2021). The results highlighted a robust increase in the DMN state occurrence and a decrease in the base state occurrence after the narrative event boundaries, irrespective of whether the stimulus was an audiovisual sitcom episode or a narrated story. This external model validation was a way that we tested the robustness of our own model and the relationship between neural state dynamics and cognitive dynamics.

      • Their result that, when there is no event, about 50% of transitions into the DMN state come from the base state is interesting and strong. However, it is unclear whether this holds only for the sitcom or also for Chang et al.'s data.

      We apologize for the lack of clarity. We show the statistical results of the two sitcom episodes as well as Chang et al.’s (2021) data in Figure 4—figure supplement 2 in our original manuscript. Here, we provide the exact values of the base-to-DMN state transition probability, and how they differ across moments after event boundaries compared to non-event boundaries.

      For sitcom episode 1, the probability of base-to-DMN state transition was 44.6 ± 18.8 % at event boundaries versus 62.0 ± 10.4 % at non-event boundaries (FDR-p = 0.0013). For sitcom episode 2, the probability of base-to-DMN state transition was 44.1 ± 18.0 % at event boundaries versus 62.2 ± 7.6 % at non-event boundaries (FDR-p = 0.0006). For the Chang et al. (2021) dataset, the probability of base-to-DMN state transition was 33.3 ± 15.9 % at event boundaries versus 58.1 ± 6.4 % at non-event boundaries (FDR-p < 0.0001). Thus, our result, “At non-event boundaries, the DMN state was most likely to transition from the base state, accounting for more than 50% of the transitions to the DMN state” (pg 11, line 24-25), holds true for both the internal and external datasets.
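
      As a minimal illustration of the quantity reported above, the sketch below counts, for a single state sequence, which state each transition into a target state came from. The integer labels and the function name are hypothetical, and the restriction to time points around event boundaries and the averaging across participants are left out.

```python
import numpy as np

def transition_sources_into(states, target, n_states):
    """Proportion of transitions into `target` that originate from each state."""
    states = np.asarray(states)
    prev, curr = states[:-1], states[1:]
    # A transition into `target` is a step where the state changes to `target`.
    into_target = (curr == target) & (prev != target)
    counts = np.bincount(prev[into_target], minlength=n_states).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else np.full(n_states, np.nan)

# Toy sequence with hypothetical labels: 0 = base, 1 = DMN, 2 = DAN, 3 = SM
seq = [0, 0, 1, 1, 2, 1, 0, 1, 3, 3, 0, 1]
props = transition_sources_into(seq, target=1, n_states=4)
print(props)  # share of DMN entries coming from each state
```

      In the toy sequence, three of the four entries into state 1 come from state 0, so the base-to-DMN share is 0.75.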

      • The involvement of the base state as being highly engaged during the comedy sitcom and the movie are interesting results that warrant further study into the base state theory they pose in this work.

      • It is good that they make sure SM states are not just because of head motion (P 12).

      • Their comparison between functional gradient and neural states is good, and their results are generally well-supported, intuitive, and interesting enough to warrant further research into them. Their findings on the context-specificity of their DMN and DAN state are interesting and relate well to the antagonistic relationship in resting-state data.

      Weaknesses:

      • Authors should train the model on part of the data and validate on another

      Thank you for raising this issue. To the best of our knowledge, past work that applied the HMM to the fMRI data has conducted training and inference on the same data, including initial work that implemented HMM on the resting-state fMRI (Baker et al., 2014; Chen et al., 2016; Vidaurre et al., 2018, 2017) as well as more recent work that applied HMMs to the task or movie-watching fMRI (Cornblath et al., 2020; Taghia et al., 2018; van der Meer et al., 2020; Yamashita et al., 2021). That is, the parameters—emission probability, transition probability, and initial probability—were estimated from the entire dataset and the latent state sequence was inferred using the Viterbi algorithm on the same dataset.
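
      For readers less familiar with the inference step, the following is a generic log-space Viterbi decoder, given as a sketch rather than the code used in the study. It assumes that the per-time-point log emission likelihoods have already been computed from the fitted emission model; all variable names are illustrative.

```python
import numpy as np

def viterbi(log_emission, log_transition, log_initial):
    """Most likely state path.

    log_emission:   (T, K) log p(observation at t | state k)
    log_transition: (K, K) log p(state j at t+1 | state i at t)
    log_initial:    (K,)   log p(state k at t = 0)
    """
    T, K = log_emission.shape
    score = np.empty((T, K))
    backptr = np.zeros((T, K), dtype=int)
    score[0] = log_initial + log_emission[0]
    for t in range(1, T):
        # cand[i, j]: best log score of being in i at t-1 and moving to j at t
        cand = score[t - 1][:, None] + log_transition
        backptr[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emission[t]
    # Trace back from the best final state
    path = np.empty(T, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = backptr[t + 1, path[t + 1]]
    return path

# Toy two-state example with sticky transitions
log_e = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
log_a = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
log_pi = np.log(np.array([0.5, 0.5]))
path = viterbi(log_e, log_a, log_pi)
```

      Working in log space avoids numerical underflow on long time series, which is why the sketch adds log probabilities instead of multiplying probabilities.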

      However, we were also aware of the potential problem this may have. Therefore, in our recent work asking a different research question in another fMRI dataset (Song et al., 2021b), we trained an HMM on a subset of the dataset (moments when participants were watching movie clips in the original temporal order) and inferred latent state sequence of the fMRI time series in another subset of the dataset (moments when participants were watching movie clips in a scrambled temporal order). To the best of our knowledge, this was the first paper that used different segments of the data to fit and infer states from the HMM.

      In the current study, we wanted to capture brain states that underlie brain activity across contexts. Thus, we presented the same-dataset training and inference procedure as our primary result. However, for every main result, we also showed results where we separated the data used for model fitting and state inference. That is, we fit the HMM on the SONG dataset, primarily report the inference results on the SONG dataset, but also report inference on the external datasets that were not included in model fitting. The datasets used were the Human Connectome Project dataset (Van Essen et al., 2013), Chang et al. (2021) audio-listening dataset, Rosenberg et al. (2016) gradCPT dataset, and Chen et al. (2017) Sherlock dataset.

      However, to further address the reviewer's concern about whether the HMM fit is reliable when applied to held-out data, we assessed the reliability of the HMM inference with cross-validation and a split-half reliability analysis.

      (1) Cross-validation

      To separate the dataset used for HMM training and inference, we conducted cross-validation on the SONG dataset (N=27) by training the model with the data from 26 participants and inferring the latent state sequence of the held-out participant.

      First, we compared the robustness of the model training by comparing the mean activity patterns of the four latent states fitted at the group level (N=27) with the mean activity patterns of the four states fitted across cross-validation folds. Pearson’s correlations between the group-level vs. cross-validated latent states’ mean activity patterns were r = 0.991 ± 0.010, with a range from 0.963 to 0.999.

      Second, we compared the robustness of model inference by comparing the latent state sequences that were inferred at the group level vs. from held-out participants in a cross-validation scheme. All fMRI conditions had mean similarity higher than 90%; Rest 1: 92.74 ± 5.02 %, Rest 2: 92.74 ± 4.83 %, GradCPT face: 92.97 ± 6.41 %, GradCPT scene: 93.27 ± 5.76 %, Sitcom ep1: 93.31 ± 3.92 %, Sitcom ep2: 93.13 ± 4.36 %, Documentary: 92.42 ± 4.72 %.

      Third, with the latent state sequences inferred from cross-validation, we replicated the analysis of Figure 3 to test for synchrony of the latent state sequences across participants. The cross-validated results were highly similar to manuscript Figure 3, which was generated from the group-level analysis. Mean synchrony of latent state sequences was as follows: Rest 1: 25.90 ± 3.81 %, Rest 2: 25.75 ± 4.19 %, GradCPT face: 27.17 ± 3.86 %, GradCPT scene: 28.11 ± 3.89 %, Sitcom ep1: 40.69 ± 3.86 %, Sitcom ep2: 40.53 ± 3.13 %, Documentary: 30.13 ± 3.41 %.

      Author response image 2.
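
      The two percentages reported in the cross-validation analysis can be illustrated with a short sketch. It assumes that state labels have already been aligned across fits (for example, state 0 always denotes the same state) and that synchrony is defined as the mean pairwise agreement between participants' sequences; the exact definition used in the manuscript may differ, and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def sequence_agreement(seq_a, seq_b):
    """Percentage of time points at which two state sequences share a label."""
    seq_a, seq_b = np.asarray(seq_a), np.asarray(seq_b)
    return 100.0 * np.mean(seq_a == seq_b)

def mean_pairwise_synchrony(sequences):
    """Mean agreement (%) over all participant pairs; input is (n_subjects, T)."""
    sequences = np.asarray(sequences)
    pairs = combinations(range(sequences.shape[0]), 2)
    return float(np.mean([sequence_agreement(sequences[i], sequences[j])
                          for i, j in pairs]))

# Toy data: three participants, four time points
toy = np.array([[0, 0, 1, 1],
                [0, 0, 1, 2],
                [0, 3, 1, 2]])
similarity = sequence_agreement(toy[0], toy[1])   # group-level vs. held-out style comparison
synchrony = mean_pairwise_synchrony(toy)          # across-participant synchrony
```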

      (2) Split-half reliability

      To test for the internal robustness of the model, we randomly assigned SONG dataset participants to two groups and fit the HMM separately in each. Similarities (Pearson’s correlations) between the two groups’ activation patterns were DMN: 0.791, DAN: 0.838, SM: 0.944, base: 0.837. The similarities of the covariance patterns were DMN: 0.995, DAN: 0.996, SM: 0.994, base: 0.996.

      Author response image 3.

      We further validated the split-half reliability of the model using the HCP dataset, which contains data of a larger sample (N=119). Similarity (Pearson’s correlation) between the two groups’ activation patterns were DMN: 0.998, DAN: 0.997, SM: 0.993, base: 0.923. The similarity of the covariance patterns were DMN: 0.995, DAN: 0.996, SM: 0.994, base: 0.996.
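
      Because two independently fitted HMMs return states in arbitrary order, a split-half comparison needs a matching step before correlations can be reported per state. The sketch below shows one way to do this, by choosing the permutation that maximizes the summed Pearson correlation. This matching rule is an assumption for illustration, not necessarily the procedure used in the study.

```python
import numpy as np
from itertools import permutations

def match_states(patterns_a, patterns_b):
    """Align states of two fits by maximizing the summed Pearson correlation.

    patterns_a, patterns_b: (K, n_parcels) mean activity pattern per state.
    Returns (order, r), where state i of fit A is matched to state order[i]
    of fit B, and r[i] is their Pearson correlation.
    """
    K = patterns_a.shape[0]
    # corr[i, j] = Pearson r between state i of fit A and state j of fit B
    corr = np.corrcoef(patterns_a, patterns_b)[:K, K:]
    best = max(permutations(range(K)),
               key=lambda p: sum(corr[i, p[i]] for i in range(K)))
    return list(best), np.array([corr[i, best[i]] for i in range(K)])

# Toy data: fit B is a shuffled, slightly noisy copy of fit A
rng = np.random.default_rng(0)
half_a = rng.standard_normal((4, 25))
half_b = half_a[[2, 0, 3, 1]] + 0.05 * rng.standard_normal((4, 25))
order, r = match_states(half_a, half_b)
```

      Exhaustive search over permutations is cheap for K = 4 (24 candidates); for many more states an assignment solver would be the more practical choice.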

      Together the cross-validation and split-half reliability results demonstrate that the HMM results reported in the manuscript are reliable and robust to the way we conducted the analysis. The result of the split-half reliability analysis is added in the Results.

      [Manuscript, page 3-4] “Neural state inference was robust to the choice of 𝐾 (Figure 1—figure supplement 1) and the fMRI preprocessing pipeline (Figure 1—figure supplement 5) and consistent when conducted on two groups of randomly split-half participants (Pearson’s correlations between the two groups’ latent state activation patterns: DMN: 0.791, DAN: 0.838, SM: 0.944, base: 0.837).”

      • Comparison with just PCA/functional gradients is weak in establishing whether HMMs are good models of the timeseries. Especially given that the HMM does not explain a lot of variance in the signal (~0.5 R^2 for only 27 brain regions) for PCA. I think they don't report their own R^2 of the timeseries

      We agree with the reviewer that the PCA that we conducted to compare with the explained variance of the functional gradients was not directly comparable because PCA and gradients utilize different algorithms to reduce dimensionality. To make more meaningful comparisons, we removed the data-specific PCA results and replaced them with data-specific functional gradients (derived from the SONG dataset). This allows us to directly compare SONG-specific functional gradients with predefined gradients (derived from the resting-state HCP dataset from Margulies et al. [2016]). We found that the degrees to which the first two predefined gradients explained whole-brain fMRI time series (SONG: r² = 0.097, HCP: 0.084) were comparable to the amount of variance explained by the first two data-specific gradients (SONG: r² = 0.100, HCP: 0.086). Thus, the predefined gradients explain as much variance in the SONG data time series as SONG-specific gradients do. This supports our argument that the low-dimensional manifold is largely shared across contexts, and that the common HMM latent states may tile the predefined gradients.

      These analyses and results were added to the Results, Methods, and Figure 1—figure supplement 8. Here, we only attach changes to the Results section for simplicity, but please see the revised manuscript for further changes.

      [Manuscript, page 5-6] “We hypothesized that the spatial gradients reported by Margulies et al. (2016) act as a low-dimensional manifold over which large-scale dynamics operate (Bolt et al., 2022; Brown et al., 2021; Karapanagiotidis et al., 2020; Turnbull et al., 2020), such that traversals within this manifold explain large variance in neural dynamics and, consequently, cognition and behavior (Figure 1C). To test this idea, we situated the mean activity values of the four latent states along the gradients defined by Margulies et al. (2016) (see Methods). The brain states tiled the two-dimensional gradient space with the base state at the center (Figure 1D; Figure 1—figure supplement 7). The Euclidean distances between these four states were maximized in the two-dimensional gradient space, compared to a chance where the four states were inferred from circular-shifted time series (p < 0.001). For the SONG dataset, the DMN and SM states fell at more extreme positions of the primary gradient than expected by chance (both FDR-p values = 0.004; DAN and SM states, FDR-p values = 0.171). For the HCP dataset, the DMN and DAN states fell at more extreme positions on the primary gradient (both FDR-p values = 0.004; SM and base states, FDR-p values = 0.076). No state was consistently found at the extremes of the secondary gradient (all FDR-p values > 0.021).

      We asked whether the predefined gradients explain as much variance in neural dynamics as latent subspace optimized for the SONG dataset. To do so, we applied the same nonlinear dimensionality reduction algorithm to the SONG dataset’s ROI time series. Of note, the SONG dataset includes 18.95% rest, 15.07% task, and 65.98% movie-watching data whereas the data used by Margulies et al. (2016) was 100% rest. Despite these differences, the SONG-specific gradients closely resembled the predefined gradients, with significant Pearson’s correlations observed for the first (r = 0.876) and second (r = 0.877) gradient embeddings (Figure 1—figure supplement 8). Gradients identified with the HCP data also recapitulated Margulies et al.’s (2016) first (r = 0.880) and second (r = 0.871) gradients. We restricted our analysis to the first two gradients because the two gradients together explained roughly 50% of the entire variance of functional brain connectome (SONG: 46.94%, HCP: 52.08%), and the explained variance dropped drastically from the third gradients (more than 1/3 drop compared to second gradients). The degrees to which the first two predefined gradients explained whole-brain fMRI time series (SONG: r² = 0.097, HCP: 0.084) were comparable to the amount of variance explained by the first two data-specific gradients (SONG: r² = 0.100, HCP: 0.086; Figure 1—figure supplement 8). Thus, the low-dimensional manifold captured by Margulies et al. (2016) gradients is highly replicable, explaining brain activity dynamics as well as data-specific gradients, and is largely shared across contexts and datasets. This suggests that the state space of whole-brain dynamics closely recapitulates low-dimensional gradients of the static functional brain connectome.”

      The reviewer also pointed out that the PCA-gradient comparison was weak in establishing whether HMMs are good models of the time series. However, we would like to point out that the purpose of the comparison was not to validate the performance of the HMM. Instead, we wanted to test whether the gradients introduced by Margulies et al. (2016) could act as a generalizable low-dimensional manifold of brain state dynamics. To argue that the predefined gradients are a shared manifold, these gradients should explain SONG data fMRI time series as much as the principal components derived directly from the SONG data. Our results showed comparable r², both in predefined gradient vs. data-specific PC comparisons and predefined gradient vs. data-specific gradient comparisons, which supported our argument that the predefined gradients could be the shared embedding space across contexts and datasets.
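
      One plausible way to quantify how much of a parcel-wise time series a set of gradient maps explains is sketched below. It assumes that each time point's spatial pattern is regressed onto the gradient maps (plus an intercept) and that r² is pooled over all time points. This is an illustrative definition, which may differ from the computation described in the Methods, and the toy gradients are random stand-ins.

```python
import numpy as np

def variance_explained_by_gradients(timeseries, gradients):
    """Pooled r^2 of least-squares fits of spatial patterns onto gradient maps.

    timeseries: (T, n_parcels); gradients: (n_parcels, n_gradients).
    """
    n_parcels = gradients.shape[0]
    design = np.column_stack([np.ones(n_parcels), gradients])  # intercept + gradients
    patterns = timeseries.T                                    # (n_parcels, T)
    # Solve design @ beta = pattern for every time point at once
    beta, *_ = np.linalg.lstsq(design, patterns, rcond=None)
    residual = patterns - design @ beta
    centered = patterns - patterns.mean(axis=0, keepdims=True)
    return 1.0 - (residual ** 2).sum() / (centered ** 2).sum()

# Toy data: 25 parcels, 2 hypothetical gradient maps, 200 time points
rng = np.random.default_rng(1)
grads = rng.standard_normal((25, 2))
on_manifold = rng.standard_normal((200, 2)) @ grads.T   # lies exactly in gradient space
noise = rng.standard_normal((200, 25))                  # unrelated to the gradients
r2_signal = variance_explained_by_gradients(on_manifold, grads)
r2_noise = variance_explained_by_gradients(noise, grads)
```

      Under this definition, two sets of gradient maps can be compared on the same time series simply by swapping the `gradients` argument.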

      The reviewer pointed out that the r² of ~0.5 is not explaining enough variance in the fMRI signal. However, we respectfully disagree with this point because there is no established criterion for what constitutes a high or low r² for this type of analysis. Of note, previous literature that also applied PCA to fMRI time series (Author response image 4A and 4B) (Lynn et al., 2021; Shine et al., 2019) also found that the cumulative explained variance of top 5 principal components is around 50%. Author response image 4C shows cumulative variances to which gradients explain the functional connectome of the resting-state fMRI data (Margulies et al., 2016).

      Author response image 4.

      Finally, the reviewer pointed out that the r² of the HMM-derived latent sequence to the fMRI time series should be reported. However, there is no standardized way of measuring the explained variance of the HMM inference. There is no report of explained variance in the traditional HMM-fMRI papers (Baker et al., 2014; Chen et al., 2016; Vidaurre et al., 2018, 2017). Rather than r², the HMM computes the log likelihood of the model fit. However, because log likelihood values are dependent on the number of data points, studies do not report log likelihood values nor do they use these metrics to interpret the goodness of model fit.

      To ask whether the goodness of the HMM fit was significantly above chance, we compared the log likelihood of the HMM to the log likelihood distribution of null HMM fits. First, we extracted the log likelihood of the HMM fit to the real fMRI time series. We then fit null HMMs to circular-shifted fMRI time series, repeating this 1,000 times to build a chance distribution. The log likelihood of the real model was significantly higher than the chance distribution, with a z-value of 2182.5 (p < 0.001). This indicates that the HMM captured structure in our fMRI time series data significantly above chance.
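
      The logic of a circular-shift null can be sketched as follows. To keep the example self-contained, the HMM log likelihood is replaced by a stand-in statistic (the mean correlation between ROIs), and each ROI is shifted by an independent random offset; both choices are assumptions made for illustration only.

```python
import numpy as np

def circular_shift(timeseries, rng):
    """Shift each ROI's time series by an independent random offset (wrapping)."""
    T, n_rois = timeseries.shape
    shifted = np.empty_like(timeseries)
    for roi in range(n_rois):
        shifted[:, roi] = np.roll(timeseries[:, roi], rng.integers(1, T))
    return shifted

def mean_offdiag_corr(timeseries):
    """Stand-in statistic: mean Pearson correlation between distinct ROIs."""
    corr = np.corrcoef(timeseries.T)
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    return corr[off_diagonal].mean()

def z_against_null(observed, null_values):
    """z-score of the observed statistic relative to the null distribution."""
    null_values = np.asarray(null_values, dtype=float)
    return (observed - null_values.mean()) / null_values.std(ddof=1)

# Toy data: five ROIs sharing a common signal
rng = np.random.default_rng(0)
shared = rng.standard_normal(300)
data = shared[:, None] + 0.5 * rng.standard_normal((300, 5))

observed = mean_offdiag_corr(data)
null = [mean_offdiag_corr(circular_shift(data, rng)) for _ in range(200)]
z = z_against_null(observed, null)
# One-sided empirical p-value with the usual +1 correction
p_value = (1 + np.sum(np.asarray(null) >= observed)) / (1 + len(null))
```

      Circular shifting preserves each ROI's own values and autocorrelation while breaking the temporal alignment between ROIs, which is what makes it a useful null for covariance-driven models.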

      • Authors do not specify whether they also did cross-validation for the HCP dataset to find 4 clusters

      We apologize for the lack of clarity. When we computed the Calinski-Harabasz score with the HCP dataset, three was chosen as the most optimal number of states (Author response image 5A). When we set K as 3, the HMM inferred the DMN, DAN, and SM states (Author response image 5C). The base state was included when K was set to 4 (Author response image 5B). The activation pattern similarities of the DMN, DAN, and SM states were r = 0.981, 0.984, 0.911 respectively.

      Author response image 5.

      We did not use K = 3 for the HCP data replication because we were not trying to test whether these four states would be the optimal set of states in every dataset. Although the Calinski-Harabasz score chose K = 3 because it showed the best clustering performance, this does not mean that the base state is not meaningful to this dataset. Likewise, the latent states that are inferred when we increase/decrease the number of states are also meaningful states. For example, in Figure 1—figure supplement 1, we show an example of the SONG dataset’s latent states when we set K to 7. The seven latent states included the DAN, SM, and base states, the DMN state was subdivided into DMN-A and DMN-B states, and the FPN state and DMN+VIS state were included. Setting a higher number of states like K = 7 would mean that we are capturing brain state dynamics in a higher dimension than when using K = 4. Because we are utilizing a higher number of states, a model set to K = 7 would inevitably capture a larger variance of fMRI time series than a model set to K = 4.
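
      For reference, the Calinski-Harabasz score is the ratio of between-cluster to within-cluster dispersion, each normalized by its degrees of freedom. The sketch below computes it directly from data points and cluster labels. Treating fMRI time points as samples and inferred HMM states as cluster labels is an assumption about how the score would be applied here.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz score: [B / (k - 1)] / [W / (n - k)].

    X: (n_samples, n_features); labels: (n_samples,) cluster assignment.
    """
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n_samples = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # B: size-weighted squared distance of centroids from the overall mean
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # W: squared distance of members from their own centroid
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n_samples - k))

# Toy data: two well-separated groups on a line
X = np.array([[0.0], [2.0], [10.0], [12.0]])
good = calinski_harabasz(X, [0, 0, 1, 1])   # labels follow the true grouping
bad = calinski_harabasz(X, [0, 1, 0, 1])    # labels cut across the grouping
```

      A higher score indicates tighter, better-separated clusters, so candidate values of K are compared by the score each one yields. The score rewards cluster separation and is therefore a different criterion from model likelihood.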

      The purpose of latent state replication with the HCP dataset was to validate the generalizability of the DMN, DAN, SM, and base states. Before characterizing these latent states’ relevance to cognition, we needed to verify that these latent states were not simply overfit to the SONG dataset. The fact that the HMM revealed a similar set of latent states when applied to the HCP dataset suggested that the states were not merely specific to SONG data.

      To make our points clearer in the manuscript, we emphasized that we are not arguing for the four states to be the exclusive states. We made edits to Discussion as follows.

      [Manuscript, page 16] “Our study adopted the assumption of low dimensionality of large-scale neural systems, which led us to intentionally identify only a small number of states underlying whole-brain dynamics. Importantly, however, we do not claim that the four states will be the optimal set of states in every dataset and participant population. Instead, latent states and patterns of state occurrence may vary as a function of individuals and tasks (Figure 1—figure supplement 2). Likewise, while the lowest dimensions of the manifold (i.e., the first two gradients) were largely shared across datasets tested here, we do not argue that it will always be identical. If individuals and tasks deviate significantly from what was tested here, the manifold may also differ along with changes in latent states (Samara et al., 2023). Brain systems operate at different dimensionalities and spatiotemporal scales (Greene et al., 2023), which may have different consequences for cognition. Asking how brain states and manifolds—probed at different dimensionalities and scales—flexibly reconfigure (or not) with changes in contexts and mental states is an important research question for understanding complex human cognition.”

      • One of their main contributions is the base state but the correlation between the base state in their Song dataset and the HCP dataset is only 0.399

      This is a good point. However, there is precedent for lower spatial pattern correlation of the base state compared to other states in the literature.

      Compared to the DMN, DAN, and SM states, the base state did not show characteristic activation or deactivation of functional networks. Most of the functional networks showed activity levels close to the mean (z = 0). With this flattened activation pattern, relatively low activation pattern similarity was observed between the SONG base state and the HCP base state.

      In Figure 1—figure supplement 6, we write, “The DMN, DAN, and SM states showed similar mean activity patterns. We refrained from making interpretations about the base state’s activity patterns because the mean activity of most of the parcels was close to z = 0”.

      A similar finding has been reported in a previous work by Chen et al. (2016) that discovered the base state with HMM. State 9 (S9) of their results is comparable to our base state. They report that even though the spatial correlation coefficient of the brain state from the split-half reliability analysis was the lowest for S9 due to its low degrees of activation or deactivation, S9 was stably inferred by the HMM. The following is a direct quote from their paper:

      “To the best of our knowledge, a state similar to S9 has not been presented in previous literature. We hypothesize that S9 is the “ground” state of the brain, in which brain activity (or deactivity) is similar for the entire cortex (no apparent activation or deactivation as shown in Fig. 4). Note that different groups of subjects have different spatial patterns for state S9 (Fig. 3A). Therefore, S9 has the lowest reproducible spatial pattern (Fig. 3B). However, its temporal characteristics allowed us to distinguish it consistently from other states.” (Chen et al., 2016)

      Thus, we believe our data and prior results support the existence of the “base state”.

      • Figure 1B: Parcellation is quite big but there seems to be a gradient within regions

      This is a function of the visualization software. Mean activity (z) is the same for all voxels within a parcel. To visualize the 3D contours of the brain, we chose an option in the nilearn python function that smooths the mean activity values based on the surface reconstructed anatomy.

      In the original manuscript, the Methods state, “The brain surfaces were visualized with nilearn.plotting.plot_surf_stat_map. The parcel boundaries in Figure 1B are smoothed from the volume-to-surface reconstruction.”

      • Figure 1D: Why are the DMNs further apart between SONG and HCP than the other states

      To address this question, we first tested whether the position of the DMN states in the gradient space is significantly different for the SONG and HCP datasets. We generated surrogate HMM states from the circular-shifted fMRI time series and positioned the four latent states and the null DMN states in the 2-dimensional gradient space (Author response image 6).

      Author response image 6.

      We next tested whether the Euclidean distance between the SONG dataset’s DMN state and the HCP dataset’s DMN state is larger than would be expected by chance (Author response image 7). To do so, we took the difference between the DMN state positions and compared it to the 1,000 differences generated from the surrogate latent states. The DMN states of the SONG and HCP datasets did not significantly differ in the Gradient 1 dimension (two-tailed test, p = 0.794). However, as the reviewer noted, the positions differed significantly in the Gradient 2 dimension (p = 0.047). The DMN state leaned more towards the Visual gradient in the SONG dataset, whereas it leaned more towards the Somatosensory-Motor gradient in the HCP dataset.

      Author response image 7.

      Though we cannot claim an exact reason for this across-dataset difference, we note a distinctive difference between the SONG and HCP datasets. Both datasets largely included resting-state, controlled tasks, and movie watching. The SONG dataset included 18.95% of rest, 15.07% of task, and 65.98% of movie watching. The task only contained the gradCPT, i.e., sustained attention task. On the other hand, the HCP dataset included 52.71% of rest, 24.35% of task, and 22.94% of movie watching. There were 7 different tasks included in the HCP dataset. It is possible that different proportions of rest, task, and movie watching, and different cognitive demands involved with each dataset may have created data-specific latent states.

      • Page 5 paragraph starting at L25: Their hypothesis that functional gradients explain large variance in neural dynamics needs to be explained more, is non-trivial especially because their R^2 scores are so low (Fig 1. Supplement 8) for PCA

      We address this concern on page 21-23 of this response letter.

      • Generally, I do not find the PCA analysis convincing and believe they should also compare to something like ICA or a different model of dynamics. They do not explain their reasoning behind assuming an HMM, which is an extremely simplified idea of brain dynamics meaning they only change based on the previous state.

      We appreciate this perspective. We replaced the comparison between Margulies et al.’s (2016) gradients and SONG-specific PCA with a more direct comparison between Margulies et al.’s (2016) gradients and SONG-specific gradients, as described on page 21-23 of this response letter.

      More broadly, we elected to use HMM because of recent work showing correspondence between low-dimensional HMM states and behavior (Cornblath et al., 2020; Taghia et al., 2018; van der Meer et al., 2020; Yamashita et al., 2021). We also found the model’s assumption—a mixture Gaussian emission probability and first-order Markovian transition probability—to be the most suited to analyzing the fMRI time series data. We do not intend to claim that other data-reduction techniques would not also capture low-dimensional, behaviorally relevant changes in brain activity. Instead, our primary focus was identifying a set of latent states that generalize (i.e., recur) across multiple contexts and understanding how those states reflect cognitive and attentional states.

      Although a comparison of possible data-reduction algorithms is out of the scope of the current work, an exhaustive comparison of different models can be found in Bolt et al. (2022). The authors compared dozens of latent brain state algorithms spanning zero-lag analysis (e.g., principal component analysis, principal component analysis with Varimax rotation, Laplacian eigenmaps, spatial independent component analysis, temporal independent component analysis, hidden Markov model, seed-based correlation analysis, and co-activation patterns) to time-lag analysis (e.g., quasi-periodic pattern and lag projections). Bolt et al. (2022) writes “a range of empirical phenomena, including functional connectivity gradients, the task-positive/task-negative anticorrelation pattern, the global signal, time-lag propagation patterns, the quasiperiodic pattern and the functional connectome network structure, are manifestations of the three spatiotemporal patterns.” That is, many previous findings that used different methods essentially describe the same recurring latent states. A similar argument was made in previous papers (Brown et al., 2021; Karapanagiotidis et al., 2020; Turnbull et al., 2020).

      We agree that the HMM is a simplified account of brain dynamics. We do not argue that four states can fully explain the complexity and flexibility of cognition. Instead, we hoped to show that brain systems can operate at different dimensionalities, which may have different consequences for cognition. We “simplified” neural dynamics to a discrete sequence of a small number of states. What is fascinating, however, is that these overly “simplified” brain state dynamics can explain certain cognitive and attentional dynamics, such as event segmentation and sustained attention fluctuations. We highlight this point in the Discussion.

      [Manuscript, page 16] “Our study adopted the assumption of low dimensionality of large-scale neural systems, which led us to intentionally identify only a small number of states underlying whole-brain dynamics. Importantly, however, we do not claim that the four states will be the optimal set of states in every dataset and participant population. Instead, latent states and patterns of state occurrence may vary as a function of individuals and tasks (Figure 1—figure supplement 2). Likewise, while the lowest dimensions of the manifold (i.e., the first two gradients) were largely shared across datasets tested here, we do not argue that it will always be identical. If individuals and tasks deviate significantly from what was tested here, the manifold may also differ along with changes in latent states (Samara et al., 2023). Brain systems operate at different dimensionalities and spatiotemporal scales (Greene et al., 2023), which may have different consequences for cognition. Asking how brain states and manifolds—probed at different dimensionalities and scales—flexibly reconfigure (or not) with changes in contexts and mental states is an important research question for understanding complex human cognition.”

      • For the 25-ROI replication it seems like they again do not try multiple K values for the number of states to validate that 4 states are in fact the correct number.

      In the manuscript, we do not argue that four will be the optimal number of states in every dataset. (We in fact predict that the optimal number may differ depending on the amount of data, participant population, tasks, etc.) Instead, we claim that the four states identified in the SONG dataset are not specific (i.e., overfit) to that sample, but rather recur in independent datasets as well. More broadly, we argue that the complexity and flexibility of human cognition stem from the fact that computation occurs at multiple dimensionalities and that the low-dimensional states observed here are robustly related to cognitive and attentional states. To prevent misunderstanding of our results, we emphasized in the Discussion that we are not arguing for a fixed number of states. A paragraph included in our response to the previous comment (page 16 in the manuscript) illustrates this point.

      • Fig 2B: Colorbar goes from -0.05 to 0.05 but values are up to 0.87

      We apologize for the confusion. The current version of the figure is correct. The figure legend states, “The values indicate transition probabilities, such that values in each row sums to 1. The colors indicate differences from the mean of the null distribution where the HMMs were conducted on the circular-shifted time series.”

      We recognize that this complicates the interpretation of the figure. However, after much consideration, we decided that it was valuable to show both the actual transition probabilities (values) and their difference from the mean of null HMMs (colors). The values demonstrate the Markovian property of latent state dynamics, with a high probability of remaining in the same state at consecutive moments and a low probability of transitioning to a different state. The colors indicate that the base state is a transitional hub state by illustrating that the DMN, DAN, and SM states are more likely to transition to the base state than would be expected by chance.

      • P 16 L4 near-critical, authors need to be more specific in their terminology here especially since they talk about dynamic systems, where near-criticality has a specific definition. It is unclear which definition they are looking for here.

      We agree that our explanation was vague. Because we do not have evidence for this speculative proposal, we removed the mention of near-criticality. Instead, we focus on our observation that the base state is the transitional hub state within a metastable system.

      [Manuscript, page 17-18] “However, the functional relevance of the base state to human cognition had not been explored previously. We propose that the base state, a transitional hub (Figure 2B) positioned at the center of the gradient subspace (Figure 1D), functions as a state of natural equilibrium. Transitioning to the DMN, DAN, or SM states reflects incursion away from natural equilibrium (Deco et al., 2017; Gu et al., 2015), as the brain enters a functionally modular state. Notably, the base state indicated high attentional engagement (Figure 5E and F) and exhibited the highest occurrence proportion (Figure 3B) as well as the longest dwell times (Figure 3—figure supplement 1) during naturalistic movie watching, whereas its functional involvement was comparatively minor during controlled tasks. This significant relevance to behavior verifies that the base state cannot simply be a byproduct of the model. We speculate that susceptibility to both external and internal information is maximized in the base state—allowing for roughly equal weighting of both sides so that they can be integrated to form a coherent representation of the world—at the expense of the stability of a certain functional network (Cocchi et al., 2017; Fagerholm et al., 2015). When processing rich narratives, particularly when a person is fully immersed without having to exert cognitive effort, a less modular state with high degrees of freedom to reach other states may be more likely to be involved. The role of the base state should be further investigated in future studies.”

      • P16 L13-L17 unnecessary

      We prefer to have the last paragraph as a summary of the implications of this paper. However, if the length of this paper becomes a problem as we work towards publication with the editors, we are happy to remove these lines.

      • I think this paper is solid, but my main issue is with using an HMM, never explaining why, not showing inference results on test data, not reporting an R^2 score for it, and not comparing it to other models. Secondly, they use the Calinski-Harabasz score to determine the number of states, but not the log-likelihood of the fit. This clearly creates a bias in what types of states you will find, namely states that are far away from each other, which likely also leads to the functional gradient and PCA results they have. Where they specifically talk about how their states are far away from each other in the functional gradient space and correlated to (orthogonal) components. It is completely unclear to me why they used this measure because it also seems to be one of many scores you could use with respect to clustering (with potentially different results), and even odd in the presence of a log-likelihood fit to the data and with the model they use (which does not perform clustering).

      (1) Showing inference results on test data

      We address this concern on pages 19-21 of this response letter.

      (2) Not reporting an R^2 score

      We address this concern on pages 21-23 of this response letter.

      (3) Not comparing the HMM model to other models

      We address this concern on pages 27-28 of this response letter.

      (4) The use of the Calinski-Harabasz score to determine the number of states rather than the log-likelihood of the model fit

      To our knowledge, the log-likelihood of the model fit is not used to select the number of states in the HMM literature, because the log-likelihood tends to increase monotonically as the number of states increases. Baker et al. (2014) illustrate this problem, writing:

      “In theory, it should be possible to pick the optimal number of states by selecting the model with the greatest (negative) free energy. In practice however, we observe that the free energy increases monotonically up to K = 15 states, suggesting that the Bayes-optimal model may require an even higher number of states.”

      The following figure shows the log-likelihood estimated from the SONG dataset. Consistent with the findings of Baker et al. (2014), the log-likelihood increased monotonically with the number of states (Author response image 8, right). Measures such as AIC and BIC, which account for the number of parameters, show the same monotonic increase.

      Author response image 8.

      Because there is “no straightforward data-driven approach to model order selection” (Baker et al., 2014), past work has used different approaches to decide on the number of states. For example, Vidaurre et al. (2018) iterated over a range of the number of states to repeat the same HMM training and inference procedures 5 times using the same hyperparameters. They selected the number of states that showed the highest consistency across iterations. Gao et al. (2021) tested the clustering performance of the model output using the Calinski-Harabasz score. The number of states that showed the highest within-cluster cohesion compared to the across-cluster separation was selected as the number of states. Chang et al. (2021) applied HMM to voxels of the ventromedial prefrontal cortex using a similar clustering algorithm, writing: “To determine the number of states for the HMM estimation procedure, we identified the number of states that maximized the average within-state spatial similarity relative to the average between-state similarity”. In our previous paper (Song et al., 2021b), we reported both the reliability and clustering performance measures to decide on the number of states.

      In the current manuscript, the model consistency criterion from Vidaurre et al. (2018) was ineffective because the HMM inference was extremely robust (i.e., always inferring the exact same sequence) due to a large number of data points. Thus, we used the Calinski-Harabasz score as our criterion for the number of states selected.
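
The Calinski-Harabasz criterion mentioned above can be illustrated with a minimal sketch. This is not the authors' code: the function name `calinski_harabasz` and the toy data are assumptions for illustration. It computes the standard ratio of between-cluster to within-cluster dispersion, where each "cluster" would correspond to the time points assigned to one inferred state.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz score: [B / (k - 1)] / [W / (n - k)].

    points: list of equal-length numeric sequences (one per observation).
    labels: list of hashable cluster (state) labels, one per observation.
    Hypothetical helper for illustration only.
    """
    n = len(points)
    dim = len(points[0])

    # Group observations by their assigned label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")

    def mean(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand_mean = mean(points)
    between = 0.0  # B: dispersion of cluster centroids around the grand mean
    within = 0.0   # W: dispersion of points around their own centroid
    for rows in clusters.values():
        centroid = mean(rows)
        between += len(rows) * sq_dist(centroid, grand_mean)
        within += sum(sq_dist(row, centroid) for row in rows)

    if within == 0.0:
        return float("inf")  # perfectly tight clusters
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two well-separated groups along the first axis.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = calinski_harabasz(toy_points, [0, 0, 1, 1])
bad_score = calinski_harabasz(toy_points, [0, 1, 0, 1])
```

In a model-order search, one would compute this score for the state assignments obtained at each candidate number of states and compare the values; higher scores indicate tighter, better-separated states.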

      We agree with the reviewer that the selection of the number of states is critical to any study that implements HMM. However, the field lacks a consensus on how to decide on the number of states in the HMM, and the Calinski-Harabasz score has been validated in previous studies. Most importantly, the latent states’ relationships with behavioral and cognitive measures give strong evidence that the latent states are indeed meaningful states. Again, we are not arguing that the optimal set of states in any dataset will be four nor are we arguing that these four states will always be the optimal states. Instead, the manuscript proposes that a small number of latent states explains meaningful variance in cognitive dynamics.

      • Grammatical error: P24 L29 rendering seems to have gone wrong

      The original text was as we intended. To avoid confusion, we changed “(number of participantsC2 iterations)” to “(${}_{N}C_{2}$ iterations, where N = number of participants)” (page 26 in the manuscript).

      Questions:

      • Comment on subject differences, it seems like they potentially found group dynamics based on stimuli, but interesting to see individual differences in large-scale dynamics, and do they believe the states they find mostly explain global linear dynamics?

      We agree with the reviewer that whether low-dimensional latent state dynamics explain individual differences, above and beyond what could be explained by the high-dimensional, temporally static neural signatures of individuals (e.g., Finn et al., 2015), is an important research question. However, because the SONG dataset was collected in a single lab, with a focus on covering diverse contexts (rest, task, and movie watching) over 2 sessions, we were only able to collect data from 27 participants. Due to this small sample size, we focused on investigating group-level, shared temporal dynamics and across-condition differences, rather than on investigating individual differences.

      Past work has studied individual differences (e.g., behavioral traits like well-being, intelligence, and personality) using the HMM (Vidaurre et al., 2017). In the lab, we are working on a project that investigates latent state dynamics in relation to individual differences in clinical symptoms using the Healthy Brain Network dataset (Ji et al., 2022, presented at SfN; Alexander et al., 2017).

      Finally, the reviewer raises an interesting question about whether the latent state sequence that was derived here mostly explains global linear dynamics as opposed to nonlinear dynamics. We have two responses: one methodological and one theoretical. First, methodologically, we defined the emission probabilities as a linear mixture of Gaussian distributions for each input dimension with the state-specific mean (mean fMRI activity patterns of the networks) and variance (functional covariance across networks). Therefore, states are modeled with an assumption of linearity of feature combinations. Second, theoretically, recent work argues in favor of nonlinearity of large-scale neural dynamics, especially as tasks get richer and more complex (Cunningham and Yu, 2014; Gao et al., 2021). However, whether low-dimensional latent states should be modeled nonlinearly (that is, whether linear algorithms are insufficient at capturing latent states compared to nonlinear algorithms) is still unknown. We agree with the reviewer that the assumption of linearity is an interesting topic in systems neuroscience. However, together with prior work showing that numerous algorithms, whether linear or nonlinear, recapitulated a common set of latent states, we argue that the HMM provides a strong low-dimensional model of large-scale neural activity and interaction.

      • P19 L40 why did the authors interpolate incorrect or no-responses for the gradCPT runs? It seems more logical to correct their results for these responses or to throw them out since interpolation can induce huge biases in these cases because the data is likely not missing at completely random.

      Interpolating the RTs of the trials without responses (omission errors and incorrect trials) is a standardized protocol for analyzing gradCPT data (Esterman et al., 2013; Fortenbaugh et al., 2018, 2015; Jayakumar et al., 2023; Rosenberg et al., 2013; Terashima et al., 2021; Yamashita et al., 2021). This analysis choice rests on the assumption that sustained attention is a continuous attentional state; the RT, a proxy for the attentional state in the gradCPT literature, is a noisy measure of a smoothed, continuous attentional state. Thus, the RTs of the trials without responses are interpolated and the RT time courses are smoothed by convolving with a Gaussian kernel.
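
The interpolate-then-smooth procedure described above can be sketched as follows. This is an illustrative sketch rather than the pipeline used in the cited studies: the function names, the use of linear interpolation, the handling of gaps at the edges, and the kernel width `sigma_trials` are all assumptions, since the response letter does not specify them.

```python
import math


def interpolate_missing(rts):
    """Linearly interpolate trials without a usable RT (marked as None).

    Gaps at the start or end take the nearest valid RT (an assumption).
    """
    valid = [i for i, rt in enumerate(rts) if rt is not None]
    if not valid:
        raise ValueError("at least one valid RT is required")

    filled = []
    for i, rt in enumerate(rts):
        if rt is not None:
            filled.append(float(rt))
            continue
        prev_idx = max((j for j in valid if j < i), default=None)
        next_idx = min((j for j in valid if j > i), default=None)
        if prev_idx is None:
            filled.append(float(rts[next_idx]))
        elif next_idx is None:
            filled.append(float(rts[prev_idx]))
        else:
            weight = (i - prev_idx) / (next_idx - prev_idx)
            filled.append(rts[prev_idx] + weight * (rts[next_idx] - rts[prev_idx]))
    return filled


def gaussian_smooth(values, sigma_trials=2.0):
    """Smooth a series with a Gaussian kernel truncated at 3 sigma.

    Weights are renormalized near the edges so the output stays on the
    same scale as the input. sigma_trials is a hypothetical default.
    """
    radius = int(math.ceil(3 * sigma_trials))
    smoothed = []
    for i in range(len(values)):
        weighted_sum = 0.0
        weight_total = 0.0
        for j in range(max(0, i - radius), min(len(values), i + radius + 1)):
            w = math.exp(-0.5 * ((j - i) / sigma_trials) ** 2)
            weighted_sum += w * values[j]
            weight_total += w
        smoothed.append(weighted_sum / weight_total)
    return smoothed


# Toy RT series in seconds; None marks omission errors or incorrect trials.
toy_rts = [0.80, None, 0.90, 0.70, None, None, 1.00]
rt_course = gaussian_smooth(interpolate_missing(toy_rts))
```

The resulting `rt_course` has one value per trial, which is what allows it to be treated as a continuous proxy for attentional state.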

      References

      Abbas A, Belloy M, Kashyap A, Billings J, Nezafati M, Schumacher EH, Keilholz S. 2019. Quasiperiodic patterns contribute to functional connectivity in the brain. Neuroimage 191:193–204.

      Alexander LM, Escalera J, Ai L, Andreotti C, Febre K, Mangone A, Vega-Potler N, Langer N, Alexander A, Kovacs M, Litke S, O’Hagan B, Andersen J, Bronstein B, Bui A, Bushey M, Butler H, Castagna V, Camacho N, Chan E, Citera D, Clucas J, Cohen S, Dufek S, Eaves M, Fradera B, Gardner J, Grant-Villegas N, Green G, Gregory C, Hart E, Harris S, Horton M, Kahn D, Kabotyanski K, Karmel B, Kelly SP, Kleinman K, Koo B, Kramer E, Lennon E, Lord C, Mantello G, Margolis A, Merikangas KR, Milham J, Minniti G, Neuhaus R, Levine A, Osman Y, Parra LC, Pugh KR, Racanello A, Restrepo A, Saltzman T, Septimus B, Tobe R, Waltz R, Williams A, Yeo A, Castellanos FX, Klein A, Paus T, Leventhal BL, Craddock RC, Koplewicz HS, Milham MP. 2017. Data Descriptor: An open resource for transdiagnostic research in pediatric mental health and learning disorders. Sci Data 4:1–26.

      Allen EA, Damaraju E, Plis SM, Erhardt EB, Eichele T, Calhoun VD. 2014. Tracking whole-brain connectivity dynamics in the resting state. Cereb Cortex 24:663–676.

      Baker AP, Brookes MJ, Rezek IA, Smith SM, Behrens T, Probert Smith PJ, Woolrich M. 2014. Fast transient networks in spontaneous human brain activity. Elife 3:e01867.

      Bolt T, Nomi JS, Bzdok D, Salas JA, Chang C, Yeo BTT, Uddin LQ, Keilholz SD. 2022. A Parsimonious Description of Global Functional Brain Organization in Three Spatiotemporal Patterns. Nat Neurosci 25:1093–1103.

      Brown JA, Lee AJ, Pasquini L, Seeley WW. 2021. A dynamic gradient architecture generates brain activity states. Neuroimage 261:119526.

      Chang C, Leopold DA, Schölvinck ML, Mandelkow H, Picchioni D, Liu X, Ye FQ, Turchi JN, Duyn JH. 2016. Tracking brain arousal fluctuations with fMRI. Proc Natl Acad Sci U S A 113:4518–4523.

      Chang CHC, Lazaridi C, Yeshurun Y, Norman KA, Hasson U. 2021. Relating the past with the present: Information integration and segregation during ongoing narrative processing. J Cogn Neurosci 33:1–23.

      Chang LJ, Jolly E, Cheong JH, Rapuano K, Greenstein N, Chen P-HA, Manning JR. 2021. Endogenous variation in ventromedial prefrontal cortex state dynamics during naturalistic viewing reflects affective experience. Sci Adv 7:eabf7129.

      Chen J, Leong YC, Honey CJ, Yong CH, Norman KA, Hasson U. 2017. Shared memories reveal shared structure in neural activity across individuals. Nat Neurosci 20:115–125.

      Chen S, Langley J, Chen X, Hu X. 2016. Spatiotemporal Modeling of Brain Dynamics Using Resting-State Functional Magnetic Resonance Imaging with Gaussian Hidden Markov Model. Brain Connect 6:326–334.

      Cocchi L, Gollo LL, Zalesky A, Breakspear M. 2017. Criticality in the brain: A synthesis of neurobiology, models and cognition. Prog Neurobiol 158:132–152.

      Cornblath EJ, Ashourvan A, Kim JZ, Betzel RF, Ciric R, Adebimpe A, Baum GL, He X, Ruparel K, Moore TM, Gur RC, Gur RE, Shinohara RT, Roalf DR, Satterthwaite TD, Bassett DS. 2020. Temporal sequences of brain activity at rest are constrained by white matter structure and modulated by cognitive demands. Commun Biol 3:261.

      Cunningham JP, Yu BM. 2014. Dimensionality reduction for large-scale neural recordings. Nat Neurosci 17:1500–1509.

      Deco G, Kringelbach ML, Jirsa VK, Ritter P. 2017. The dynamics of resting fluctuations in the brain: Metastability and its dynamical cortical core. Sci Rep 7:3095.

      Esterman M, Noonan SK, Rosenberg M, Degutis J. 2013. In the zone or zoning out? Tracking behavioral and neural fluctuations during sustained attention. Cereb Cortex 23:2712–2723.

      Esterman M, Rothlein D. 2019. Models of sustained attention. Curr Opin Psychol 29:174–180.

      Fagerholm ED, Lorenz R, Scott G, Dinov M, Hellyer PJ, Mirzaei N, Leeson C, Carmichael DW, Sharp DJ, Shew WL, Leech R. 2015. Cascades and cognitive state: Focused attention incurs subcritical dynamics. J Neurosci 35:4626–4634.

      Falahpour M, Chang C, Wong CW, Liu TT. 2018. Template-based prediction of vigilance fluctuations in resting-state fMRI. Neuroimage 174:317–327.

      Finn ES, Shen X, Scheinost D, Rosenberg MD, Huang J, Chun MM, Papademetris X, Constable RT. 2015. Functional connectome fingerprinting: Identifying individuals using patterns of brain connectivity. Nat Neurosci 18:1664–1671.

      Fortenbaugh FC, Degutis J, Germine L, Wilmer JB, Grosso M, Russo K, Esterman M. 2015. Sustained attention across the life span in a sample of 10,000: Dissociating ability and strategy. Psychol Sci 26:1497–1510.

      Fortenbaugh FC, Rothlein D, McGlinchey R, DeGutis J, Esterman M. 2018. Tracking behavioral and neural fluctuations during sustained attention: A robust replication and extension. Neuroimage 171:148–164.

      Fox MD, Snyder AZ, Vincent JL, Corbetta M, Van Essen DC, Raichle ME. 2005. The human brain is intrinsically organized into dynamic, anticorrelated functional networks. Proc Natl Acad Sci U S A 102:9673–9678.

      Gao S, Mishne G, Scheinost D. 2021. Nonlinear manifold learning in functional magnetic resonance imaging uncovers a low-dimensional space of brain dynamics. Hum Brain Mapp 42:4510–4524.

      Goodale SE, Ahmed N, Zhao C, de Zwart JA, Özbay PS, Picchioni D, Duyn J, Englot DJ, Morgan VL, Chang C. 2021. fMRI-based detection of alertness predicts behavioral response variability. Elife 10:1–20.

      Greene AS, Horien C, Barson D, Scheinost D, Constable RT. 2023. Why is everyone talking about brain state? Trends Neurosci.

      Greene DJ, Marek S, Gordon EM, Siegel JS, Gratton C, Laumann TO, Gilmore AW, Berg JJ, Nguyen AL, Dierker D, Van AN, Ortega M, Newbold DJ, Hampton JM, Nielsen AN, McDermott KB, Roland JL, Norris SA, Nelson SM, Snyder AZ, Schlaggar BL, Petersen SE, Dosenbach NUF. 2020. Integrative and Network-Specific Connectivity of the Basal Ganglia and Thalamus Defined in Individuals. Neuron 105:742-758.e6.

      Gu S, Pasqualetti F, Cieslak M, Telesford QK, Yu AB, Kahn AE, Medaglia JD, Vettel JM, Miller MB, Grafton ST, Bassett DS. 2015. Controllability of structural brain networks. Nat Commun 6:8414.

      Jayakumar M, Balusu C, Aly M. 2023. Attentional fluctuations and the temporal organization of memory. Cognition 235:105408.

      Ji E, Lee JE, Hong SJ, Shim W (2022). Idiosyncrasy of latent neural state dynamic in ASD during movie watching. Poster presented at the Society for Neuroscience 2022 Annual Meeting.

      Karapanagiotidis T, Vidaurre D, Quinn AJ, Vatansever D, Poerio GL, Turnbull A, Ho NSP, Leech R, Bernhardt BC, Jefferies E, Margulies DS, Nichols TE, Woolrich MW, Smallwood J. 2020. The psychological correlates of distinct neural states occurring during wakeful rest. Sci Rep 10:1–11.

      Liu X, Duyn JH. 2013. Time-varying functional network information extracted from brief instances of spontaneous brain activity. Proc Natl Acad Sci U S A 110:4392–4397.

      Liu X, Zhang N, Chang C, Duyn JH. 2018. Co-activation patterns in resting-state fMRI signals. Neuroimage 180:485–494.

      Lynn CW, Cornblath EJ, Papadopoulos L, Bertolero MA, Bassett DS. 2021. Broken detailed balance and entropy production in the human brain. Proc Natl Acad Sci 118:e2109889118.

      Margulies DS, Ghosh SS, Goulas A, Falkiewicz M, Huntenburg JM, Langs G, Bezgin G, Eickhoff SB, Castellanos FX, Petrides M, Jefferies E, Smallwood J. 2016. Situating the default-mode network along a principal gradient of macroscale cortical organization. Proc Natl Acad Sci U S A 113:12574–12579.

      Mesulam MM. 1998. From sensation to cognition. Brain 121:1013–1052.

      Munn BR, Müller EJ, Wainstein G, Shine JM. 2021. The ascending arousal system shapes neural dynamics to mediate awareness of cognitive states. Nat Commun 12:1–9.

      Raut R V., Snyder AZ, Mitra A, Yellin D, Fujii N, Malach R, Raichle ME. 2021. Global waves synchronize the brain’s functional systems with fluctuating arousal. Sci Adv 7.

      Rosenberg M, Noonan S, DeGutis J, Esterman M. 2013. Sustaining visual attention in the face of distraction: A novel gradual-onset continuous performance task. Attention, Perception, Psychophys 75:426–439.

      Rosenberg MD, Finn ES, Scheinost D, Papademetris X, Shen X, Constable RT, Chun MM. 2016. A neuromarker of sustained attention from whole-brain functional connectivity. Nat Neurosci 19:165–171.

      Rosenberg MD, Scheinost D, Greene AS, Avery EW, Kwon YH, Finn ES, Ramani R, Qiu M, Todd Constable R, Chun MM. 2020. Functional connectivity predicts changes in attention observed across minutes, days, and months. Proc Natl Acad Sci U S A 117:3797–3807.

      Saggar M, Shine JM, Liégeois R, Dosenbach NUF, Fair D. 2022. Precision dynamical mapping using topological data analysis reveals a hub-like transition state at rest. Nat Commun 13.

      Schaefer A, Kong R, Gordon EM, Laumann TO, Zuo X-N, Holmes AJ, Eickhoff SB, Yeo BTT. 2018. Local-Global Parcellation of the Human Cerebral Cortex from Intrinsic Functional Connectivity MRI. Cereb Cortex 28:3095–3114.

      Shine JM. 2019. Neuromodulatory Influences on Integration and Segregation in the Brain. Trends Cogn Sci 23:572–583.

      Shine JM, Bissett PG, Bell PT, Koyejo O, Balsters JH, Gorgolewski KJ, Moodie CA, Poldrack RA. 2016. The Dynamics of Functional Brain Networks: Integrated Network States during Cognitive Task Performance. Neuron 92:544–554.

      Shine JM, Breakspear M, Bell PT, Ehgoetz Martens K, Shine R, Koyejo O, Sporns O, Poldrack RA. 2019. Human cognition involves the dynamic integration of neural activity and neuromodulatory systems. Nat Neurosci 22:289–296.

      Smith SM, Fox PT, Miller KL, Glahn DC, Fox PM, Mackay CE, Filippini N, Watkins KE, Toro R, Laird AR, Beckmann CF. 2009. Correspondence of the brain’s functional architecture during activation and rest. Proc Natl Acad Sci 106:13040–13045.

      Song H, Finn ES, Rosenberg MD. 2021a. Neural signatures of attentional engagement during narratives and its consequences for event memory. Proc Natl Acad Sci 118:e2021905118.

      Song H, Park B-Y, Park H, Shim WM. 2021b. Cognitive and Neural State Dynamics of Narrative Comprehension. J Neurosci 41:8972–8990.

      Taghia J, Cai W, Ryali S, Kochalka J, Nicholas J, Chen T, Menon V. 2018. Uncovering hidden brain state dynamics that regulate performance and decision-making during cognition. Nat Commun 9:2505.

      Terashima H, Kihara K, Kawahara JI, Kondo HM. 2021. Common principles underlie the fluctuation of auditory and visual sustained attention. Q J Exp Psychol 74:705–715.

      Tian Y, Margulies DS, Breakspear M, Zalesky A. 2020. Topographic organization of the human subcortex unveiled with functional connectivity gradients. Nat Neurosci 23:1421–1432.

      Turnbull A, Karapanagiotidis T, Wang HT, Bernhardt BC, Leech R, Margulies D, Schooler J, Jefferies E, Smallwood J. 2020. Reductions in task positive neural systems occur with the passage of time and are associated with changes in ongoing thought. Sci Rep 10:1–10.

      Unsworth N, Robison MK. 2018. Tracking arousal state and mind wandering with pupillometry. Cogn Affect Behav Neurosci 18:638–664.

      Unsworth N, Robison MK. 2016. Pupillary correlates of lapses of sustained attention. Cogn Affect Behav Neurosci 16:601–615.

      van der Meer JN, Breakspear M, Chang LJ, Sonkusare S, Cocchi L. 2020. Movie viewing elicits rich and reliable brain state dynamics. Nat Commun 11:1–14.

      Van Essen DC, Smith SM, Barch DM, Behrens TEJ, Yacoub E, Ugurbil K. 2013. The WU-Minn Human Connectome Project: An overview. Neuroimage 80:62–79.

      Vidaurre D, Abeysuriya R, Becker R, Quinn AJ, Alfaro-Almagro F, Smith SM, Woolrich MW. 2018. Discovering dynamic brain networks from big data in rest and task. Neuroimage, Brain Connectivity Dynamics 180:646–656.

      Vidaurre D, Smith SM, Woolrich MW. 2017. Brain network dynamics are hierarchically organized in time. Proc Natl Acad Sci U S A 114:12827–12832.

      Yamashita A, Rothlein D, Kucyi A, Valera EM, Esterman M. 2021. Brain state-based detection of attentional fluctuations and their modulation. Neuroimage 236:118072.

      Yeo BTT, Krienen FM, Sepulcre J, Sabuncu MR, Lashkari D, Hollinshead M, Roffman JL, Smoller JW, Zöllei L, Polimeni JR, Fisch B, Liu H, Buckner RL. 2011. The organization of the human cerebral cortex estimated by intrinsic functional connectivity. J Neurophysiol 106:1125–1165.

      Yousefi B, Keilholz S. 2021. Propagating patterns of intrinsic activity along macroscale gradients coordinate functional connections across the whole brain. Neuroimage 231:117827.

      Zhang S, Goodale SE, Gold BP, Morgan VL, Englot DJ, Chang C. 2023. Vigilance associates with the low-dimensional structure of fMRI data. Neuroimage 267.

    1. Author Response

      Reviewer #2 (Public Review):

      "The cellular architecture of memory modules in Drosophila supports stochastic input integration" is a classical biophysical compartmental modelling study. It takes advantage of some simple current injection protocols in a massively complex mushroom body neuron called MBON-a3 and compartmental models that simulate the electrophysiological behaviour given a detailed description of the anatomical extent of its neurites.

      This work is interesting in a number of ways:

      • The input structure information comes from EM data (Kenyon cells), although this is not discussed much in the paper.

      • The paper predicts a potentially novel normalization of the throughput of KC inputs at the level of the proximal dendrite and soma.

      • It claims a new computational principle in dendrites; this didn’t become very clear to me.

      Problems I see:

      • The current injections did not last long enough to reach steady state (e.g. Figure 1FG), and the model current injection traces have two time constants but the data only one (Figure 2DF). This does not make me very confident in the results and conclusions.

      These are two important but separate questions that we would like to address in turn.

      As for the first, in our new recordings using cytoplasmic GFP to identify MBON-alpha3, we performed both 200 ms current injections and prolonged 400 ms recordings to reach steady state (for all 4 new cells, 1’-4’). For comparison with the original dataset, we mainly present the raw traces for the 200 ms recordings in Figure 1 Supplement 2. In addition, we now provide a direct comparison of these recordings (200 ms versus 400 ms) and did not observe significant differences in tau between them (Figure 1 Supplement 2 K). This comparison illustrates that the 200 ms current injection reaches a maximum voltage deflection close to the steady-state level of the prolonged protocol. Importantly, the critical parameter (tau) did not change between these datasets.

      Regarding the second question, the two different time constants, we thank the reviewer for pointing this out. Indeed, while the simulated voltage follows an approximately exponential decay which is, by design, essentially identical to the measured value (τ ≈ 16 ms, from Table 1; see Figure 1 Supplement 2 for details), the voltage decays and rises much faster immediately following the onset and offset of the current injections. We believe that this is due to the morphology of this neuron. Current injection and voltage recording both take place at the soma, which is connected to the remainder of the neuron by a long and thin neurite. This ’remainder’ is, of course, much larger than the soma in linear size, volume and surface (membrane) area; see Fig 2A. As a result, a current injection will first quickly charge up the membrane of the soma, resulting in the initial fast voltage changes seen in Fig 2D,F, before the membrane in the remainder of the cell is charged, with the cell’s time constant τ.

      We confirmed this intuition by running various simplified simulations in Neuron which indeed show a much more rapid change at step changes in injected current than over the long-term. Indeed, we found that the pattern even appears in the simplest possible two-compartment version of the neuron’s equivalent circuit which we solved in an all-purpose numerical simulator of electrical circuitry (https://www.falstad.com/circuit). The circuit is shown in Figure 1. We chose rather generic values for the circuit components, with the constraints that the cell capacitance, chosen as 15pF, and membrane resistance, chosen as 1GΩ, are in the range of the observed data (as is, consequently, its time constant which is 15ms with these choices); see Table 1 of the manuscript. We chose the capacitance of the soma as 1.5pF, making the time constant of the soma (1.5ms) an order of magnitude shorter than that of the cell.

      Figure 1: Simplified circuit of a small soma (left parallel RC circuit) and the much larger remainder of a cell (right parallel RC circuit) connected by a neurite (right 100MΩ resistor). A current source (far left) injects constant current into the soma through the left 100MΩ resistor.

      Figure 2 shows the somatic voltage in this circuit (i.e., at the upper terminal of the 1.5pF capacitor) while a -10pA current is injected for about 4.5ms, after which the current is set back to zero. The combination of an initial rapid change, followed by a gradual change with a time constant of ≈ 15ms, is visible at both onset and offset of the current injection. Figure 3 shows the voltage traces plotted for a duration of approximately one time constant, and Fig 4 shows the detailed shape right after current onset.

      Figure 2: Somatic voltage in the circuit in Fig. 1 with current injection for about 4.5ms, followed by zero current injection for another ≈ 3.5ms.

      Figure 3: Somatic voltage in the circuit, as in Fig. 2 but with current injected for approx. 15ms.

      While we did not try to quantitatively assess the deviation from a single-exponential shape of the voltage in Fig. 2E, a more rapid increase at the onset and offset of the current injection is clearly visible in this figure. This deviation from a single exponential is smaller than what we see in the simulation (both in Fig 2D of the manuscript, and in the results of the simplified circuit here in the rebuttal). We believe that the effect is smaller in Fig. 2E because it shows the average over many traces. It is much more visible in the ’raw’ (not averaged) traces. Two randomly selected traces from the first of the recorded neurons are shown in Figure 2 Supplement 2 C. While the non-averaged traces are plagued by artifacts and noise, the rapid voltage changes are visible at essentially all onsets and offsets of the current injection.

      Figure 4: Somatic voltage in the circuit, as in Fig. 2 but showing only for the time right after current onset, about 2.3ms.
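
      The two-compartment circuit described above can also be sketched numerically. The following is a minimal illustration, not the circuit file used for the figures: the soma resistance (1GΩ) is inferred here from the quoted 1.5ms somatic time constant and 1.5pF capacitance, the resistor in series with the ideal current source is omitted because it does not affect the injected current, and all function and variable names are hypothetical.

```python
import math

# Assumed component values (SI units). Given in the text: cell C = 15 pF,
# cell R = 1 GOhm, soma C = 1.5 pF, neurite R = 100 MOhm.
# The soma resistance is an assumption: 1.5 ms / 1.5 pF = 1 GOhm.
R_SOMA, C_SOMA = 1e9, 1.5e-12
R_CELL, C_CELL = 1e9, 15e-12
R_NEURITE = 100e6


def mode_time_constants(rs=R_SOMA, cs=C_SOMA, rd=R_CELL, cd=C_CELL, ra=R_NEURITE):
    """Return (tau_fast, tau_slow) in seconds from the 2x2 system matrix."""
    a = -(1 / rs + 1 / ra) / cs   # soma self-term
    b = 1 / (ra * cs)             # influence of the cell voltage on the soma
    c = 1 / (ra * cd)             # influence of the soma voltage on the cell
    d = -(1 / rd + 1 / ra) / cd   # cell self-term
    trace, det = a + d, a * d - b * c
    root = math.sqrt(trace * trace - 4 * det)
    # Eigenvalues are (trace -/+ root) / 2; time constants are -1/eigenvalue.
    return -2 / (trace - root), -2 / (trace + root)


def somatic_step_response(i_inj=-10e-12, t_end=50e-3, dt=1e-6,
                          rs=R_SOMA, cs=C_SOMA, rd=R_CELL, cd=C_CELL, ra=R_NEURITE):
    """Forward-Euler somatic voltage (V) for a current step starting at t = 0."""
    vs = vd = 0.0
    trace = [vs]
    for _ in range(int(round(t_end / dt))):
        i_axial = (vs - vd) / ra                  # current from soma into the cell
        dvs = (i_inj - vs / rs - i_axial) / cs    # soma charging
        dvd = (i_axial - vd / rd) / cd            # remainder-of-cell charging
        vs += dt * dvs
        vd += dt * dvd
        trace.append(vs)
    return trace


tau_fast, tau_slow = mode_time_constants()
v = somatic_step_response()
# Steady-state somatic voltage: soma leak in parallel with (neurite + cell leak).
v_inf = -10e-12 / (1 / R_SOMA + 1 / (R_CELL + R_NEURITE))
print(f"tau_fast = {tau_fast * 1e3:.2f} ms, tau_slow = {tau_slow * 1e3:.2f} ms")
print(f"fraction of steady state reached after 0.5 ms: {v[500] / v_inf:.2f}")
```

      With these assumed values the slow mode need not coincide with the ≈ 15ms quoted above, since it depends on resistances that are not fully specified in the text. The qualitative point is only that a fast somatic phase precedes the slow charging of the rest of the cell.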

      We have added a short discussion of this at the end of Section 2.3 to briefly point out this observation and its explanation. We there also refer to the simplified circuit simulation and comparison with raw voltage traces which is now shown in the new Figure 2 Supplement 2.

      • The time constant in Table 1 is much shorter than in Figure 1FG?

      No, these values are in agreement. To facilitate the comparison we now include a graphical measurement of tau from our traces in Figure 1 Supplement 2 J.

      • Related to this, the capacitance values are very low maybe this can be explained by the model’s wrong assumption of tau?

      Indeed, the measured time constants are somewhat lower than what might be expected. We believe that this is because after a step change of the injected current, an initial rapid voltage change occurs in the soma, where the recordings are taken. The measured time constant is a combination of the ’actual’ time constant of the cell and the ’somatic’ (very short) time constant of the soma. Please see our explanations above.

      Importantly, the value for tau from Table 1 is not used explicitly in the model as the parameters used in our simulation are determined by optimal fits of the simulated voltage curves to experimentally obtained data.
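
      The statement that the measured time constant combines a fast and a slow component can be made explicit under the assumption that the somatic step response is a sum of two exponentials, as it is for a linear two-compartment circuit. The symbols A, τ_fast and τ_slow below are introduced only for this illustration and do not appear in the manuscript.

```latex
% Assumed two-exponential somatic step response, with 0 < A < 1 and
% \tau_{\mathrm{fast}} < \tau_{\mathrm{slow}}:
V_s(t) = V_\infty \left[ 1 - A\, e^{-t/\tau_{\mathrm{fast}}}
                           - (1 - A)\, e^{-t/\tau_{\mathrm{slow}}} \right]

% Remaining fraction of the deflection at t = \tau_{\mathrm{slow}}:
1 - \frac{V_s(\tau_{\mathrm{slow}})}{V_\infty}
  = A\, e^{-\tau_{\mathrm{slow}}/\tau_{\mathrm{fast}}} + (1 - A)\, e^{-1}
  < e^{-1}
```

      Because the remaining fraction is already below 1/e at t = τ_slow, the 63% level is crossed earlier, so a time constant read off a single-exponential criterion is shorter than τ_slow whenever A > 0.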

      • That latter in turn could be because of either space clamp issues in this hugely complex cell or bad model predictions due to incomplete reconstructions, bad match between morphology and electrophysiology (both are from different datasets?), or unknown ion channels that produce non-linear behaviour during the current injections.

      Please see our detailed discussion above. Furthermore, we now provide additional recordings using cytoplasmic GFP as a marker for the identification of MBON-alpha3 and confirm our findings. We agree that space-clamp issues could interfere with our recordings in such a complex cell. However, our approach using electrophysiological data should still be superior to any other approach (picking text book values). As we injected negative currents for our analysis at least voltage-gated ion channels should not influence our recordings.

      • The PRAXIS method in NEURON seems too ad hoc. Passive properties of a neuron should probably rather be explored in parameter scans.

      We are a bit at a loss as to what is meant by the PRAXIS method being "too ad hoc." The PRAXIS method is essentially a conjugate gradient optimization algorithm (since no explicit derivatives are available, it makes the assumption that the objective function is quadratic). This seems to us a systematic way of doing a parameter scan, and the procedure has been used in other related models, e.g. the cited Gouwens & Wilson (2009) study.

      Questions I have:

      • Computational aspects were previously addressed by e.g. Larry Abbott and Gilles Laurent (sparse coding), how do the findings here distinguish themselves from this work

      In contrast to the work by Abbott and Laurent, which addressed the principal relevance and suitability of sparse and random coding for the encoding of sensory information in decision making, here we address the cellular and computational role that an individual node (KC>MBON) plays within the circuitry. As we use functionally and morphologically relevant data, this study builds upon the prior work but significantly extends the general models to a specific case. We think this is essential for the further exploration of the topic.

      • What is valence information?

      Valence information indicates whether a stimulus is good (positive valence, e.g. sugar in appetitive memory paradigms) or bad (negative valence, e.g. the electric shock in aversive olfactory conditioning). Valence information is provided by the dopaminergic system. Dopaminergic neurons are in direct contact with the KC>MBON circuitry and modify its synaptic connectivity when olfactory information is paired with a positive or negative stimulus.

      • It seems that Martin Nawrot’s work would be relevant to this work

      We are aware of the work by the Nawrot group that provided important insights into the processing of information within the olfactory mushroom body circuitry. We now highlight some of his work. His recent work will certainly be relevant for our future studies when we try to extend our work from an individual cell to networks.

      • Compactification and democratization could be related to other work like Otopalik et al 2017 eLife but also passive normalization. The equal efficiency in line 427 reminds me of dendritic/synaptic democracy and dendritic constancy

      Many thanks for pointing this out. This is in line with the comments from reviewer 1 and we now highlight these papers in the relevant paragraph in the discussion (line 442ff).

      • The morphology does not obviously seem compact, how unusual would it be that such a complex dendrite is so compact?

      We should have been more careful in our terminology, making clear that when we write ’compact’ we always mean ’electrotonically compact’, in the sense that the physical dimensions of the neuron are small compared to its characteristic electrotonic length (usually called λ). The degree to which a dendritic structure is electrotonically compact is determined by the interaction of morphology, size and conductances (across the membrane and along the neurites). We don’t believe that one of these factors alone (e.g. morphology) is sufficient to characterize the electrical properties of a dendritic tree. We have now clarified this in the relevant section.
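
      For reference, the standard cable-theory expression behind this notion of compactness can be written as follows. It assumes a uniform passive cylinder, which a real dendritic tree only approximates, and the symbols R_m (specific membrane resistance), R_a (axial resistivity), d (neurite diameter) and ℓ (physical length) are introduced here for illustration.

```latex
% Electrotonic length constant of a uniform passive cylinder:
\lambda = \sqrt{\frac{R_m}{R_a}\,\frac{d}{4}}

% Electrotonic (dimensionless) length of a neurite of physical length \ell:
L = \frac{\ell}{\lambda}, \qquad \text{electrotonically compact if } L \ll 1
```

      The expression shows why morphology alone is not decisive: the same physical length ℓ can correspond to a small or a large L depending on the membrane and axial resistances and on the neurite diameter.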

      • What were the advantages of using the EM circuit?

      The purpose of our study is to provide a "realistic" model of a KC>MBON node within the memory circuitry. We started our simulations with random synaptic locations but wondered whether such a stochastic model is correct, or whether taking into account the detailed locations and numbers of synaptic connections of individual KCs would make a difference to the computation. Therefore we repeated the simulations using the EM data. We now address the point between random vs realistic synaptic connectivity in Figure 4F. We do not observe a significant difference but this may become more relevant in future studies if we compute the interplay between MBONs activated by overlapping sets of KCs. We simply think that utilizing the EM data gets us one step closer to realistic models.

      • Isn’t Fig 4E rather trivial if the cell is compact?

      We believe this figure is a visually striking illustration that shows how electrotonically compact the cell is. Such a finding may be trivial in retrospect, once the data is visualized, but we believe it provides a very intuitive description of the cell behavior.

      Overall, I am worried that the passive modelling study of the MBON-a3 does not provide enough evidence to explain the electrophysiological behaviour of the cell and to make accurate predictions of the cell’s responses to a variety of stochastic KC inputs.

      In our view our model adequately describes the behavior of the MBON with the most minimal (passive) model. Our approach tries to make the least assumptions about the electrophysiological properties of the cell. We think that based on the current knowledge our approach is the best possible approach as thus far no active components within the dendritic or axonal compartments of Drosophila MBONs have been described. As such, our model describes the current status which explains the behavior of the cell very well. We aim to refine this model in the future if experimental evidence requires such adaptations.

      Reviewer #3 (Public Review):

      This manuscript presents an analysis of the cellular integration properties of a specific mushroom body output neuron, MBON-α3, using a combination of patch clamp recordings and data from electron microscopy. The study demonstrates that the neuron is electrotonically compact permitting linear integration of synaptic input from Kenyon cells that represent odor identity.

      Strengths of the manuscript:

      1) The study integrates morphological data about MBON-α3 along with parameters derived from electrophysiological measurements to build a detailed model. 2) The modeling provides support for existing models of how olfactory memory is related to integration at the MBON.

      Weaknesses of the manuscript:

      The study does not provide experimental validation of the results of the computational model.

      The goal of our study is to use computational approaches to provide insights into the computation of the MBON as part of the olfactory memory circuitry. Our data is in agreement with the current model of the circuitry. Our study therefore forms the basis for future experimental studies; those would however go beyond the scope of the current work.

      The conclusion of the modeling analysis is that the neuron integrates synaptic inputs almost completely linearly. All the subsequent analyses are straightforward consequences of this result.

      We do, indeed, find that synaptic integration in this neuron is almost completely linear. We demonstrate that this result holds in a variety of different ways. All analyses in the study serve this purpose. These results are in line with the findings by Hige and Turner (2013) who demonstrated that also synaptic integration at PN>KC synapses is highly linear. As such our data points to a feature conservation to the next node of this circuit.

      The manuscript does not provide much explanation or intuition as to why this linear conclusion holds.

      We respectfully disagree. We demonstrate that this linear integration is a combination of the size of the cell and the combination of its biophysical parameters, mainly the conductances across and along the neurites. As to why it holds, our main argument is that results based on the linear model agree with all known (to us) empirical results, and this is the simplest model.

      In general, there is a clear takeaway here, which is that the dendritic tree of MBON-α3 in the lobes is highly electrotonically compact. The authors did not provide much explanation as to why this is, and the paper would benefit from a clearer conclusion. Furthermore, I found the results of Figures 4 and 5 rather straightforward given this previous observation. I am sceptical about whether the tiny variations in, e.g. Figs. 3I and 5F-H, are meaningful biologically.

      Please see the comment above as to the ’why’ we believe the neuron is electrotonically compact: a model with this assumption agrees well with empirically found results.

      We agree that the small variations in Fig 5F-H are likely not biologically meaningful. We state this now more clearly in the figure legends and in the text. This result is important to show, however. It is precisely because these variations are small compared to the voltage differences between different numbers of activated KCs (Fig 5D) or different levels of activated synapses (Fig 5E) that we can conclude that a 25% change in either synaptic strength or number can represent clearly distinguishable internal states, and that both changes have the same effect. It is important to show these data, to allow the reader to compare the differences that DO matter (Fig 5D,E) and those that DON’T (Fig 5F-H).

      The same applies to Fig 3I. The reviewer is entirely correct: the differences in the somatic voltage shown in Figure 3I are minuscule, less than a microvolt, and it is very unlikely that these differences have any biological meaning. The point of this figure is exactly to show this! It is to demonstrate quantitatively the transformation of the large differences between voltages in the dendritic tree into the nearly completely uniform voltage at the soma. We feel that this shows very clearly the extreme "democratization" of the synaptic input!

    1. Author Response

      Reviewer #1 (Public Review):

      Nicotine preference is highly variable between individuals. The paper by Mondoloni et al. provided some insight into the potential link between IPN nAchR heterogeneity and male nicotine preference behavior. They scored mice using the amount of nicotine consumption, as well as the animals' preference for the drug, in a two-bottle choice experiment. An interesting heterogeneity in nicotine-drinking profiles was observed in adult male mice, with about half of the mice ceasing nicotine consumption at high concentrations. They observed a negative association of nicotine intake with nicotine-evoked currents in the interpeduncular nucleus (IPN). They also identified beta4-containing nicotinic acetylcholine receptors, which exhibit an association with nicotine aversion. The behavioral differentiation of av vs. n-avs and identification of IPN variability, both in behavioral and electrophysiological aspects, add an important candidate for analyzing individual behavior in addiction.

      The native existence of beta4-nAchR heterogeneity is an important premise that supports the molecules to be the candidate substrate of variabilities. However, only knockout and re-expression models were used, which is insufficient to mimic the physiological state that leads to variability in nicotine preference.

      We’d like to thank reviewer 1 for his/her positive remarks and for suggesting important control experiments. Regarding the reviewer’s latest comment on the link between b4 and variability, we would like to point out that the experiment in which mice were put under chronic nicotine can be seen as another way to manipulate the physiological state of the animal. Indeed, we found that chronic nicotine downregulates b4 nAChR expression levels (but has no effect on residual nAChR currents in b4-/- mice) and reduces nicotine aversion. Therefore, these results also point toward a role of IPN b4 nAChRs in nicotine aversion. We have now performed additional experiments and analyses to address these concerns and to reinforce our demonstration.

      Reviewer #2 (Public Review):

      In the current study, Mondoloni and colleagues investigate the neural correlates contributing to nicotine aversion and its alteration following chronic nicotine exposure. The question asked is important to the field of individual vulnerability to drug addiction and has translational significance. First, the authors identify individual nicotine consumption profiles across isogenic mice. Further, they employed in vivo and ex vivo physiological approaches to define how the interpeduncular nucleus (IPn) neuronal response to nicotine is associated with nicotine avoidance. Additionally, the authors determine that chronic nicotine exposure impairs the normal IPn neuronal response to nicotine, thus contributing to higher amounts of nicotine consumption. Finally, they used transgenic and viral-mediated gene expression approaches to establish a causal link between b4 nicotine receptor function and nicotine avoidance processes.

      The manuscript and experimental strategy are well designed and executed; the current dataset requires supplemental analyses and details to exclude possible alternatives. Overall, the results are exciting and provide helpful information to the field of drug addiction research, individual vulnerability to drug addiction, and neuronal physiology. Below are some comments aiming to help the authors improve this interesting study.

      We would like to thank the reviewer for his/her positive remarks and we hope the new version of the manuscript will clarify his/her concerns.

      1) The authors used a two-bottle choice behavioral paradigm to investigate the neurophysiological substrate contributing to nicotine avoidance behaviors. While the data set supporting the author's interpretation is compelling and the experiments are well-conducted, a few supplemental control analyses will strengthen the current manuscript.

      a) The bitter taste of nicotine might generate confounds in the data interpretation: are the mice avoiding the bitterness or the nicotine-induced physiological effect? To address this question, the authors mixed nicotine with saccharine, thus covering the bitterness of nicotine. Additionally, the authors show that all the mice exposed to quinine avoid it, and in comparison, the N-Av don't avoid the bitterness of the nicotine-saccharine solution. Yet it is unclear if Av and N-Av have different taste discrimination capacities and if such taste discrimination capacities drive the N-Av to consume less nicotine. Would Av and N-Av mice avoid quinine differently after the 20-day nicotine paradigm? Would the authors observe individual nicotine drinking behaviors if nicotine/quinine vs. quinine were offered to the mice?

      As requested by all three reviewers, we have now performed a two-bottle choice experiment to verify whether different sensitivities to the bitterness of the nicotine solution could explain the different sensitivities to the aversive properties of nicotine. Indeed, even though we used saccharine to mask the bitterness of the nicotine solution, we cannot fully exclude the possibility that the taste capacity of the mice could affect their nicotine consumption. Reviewers 1 and 2 suggested to perform nicotine/quinine versus quinine preference tests, but we were afraid that forcing mice to drink an aversive, quinine-containing solution might affect the total volume of liquid consumed per day, and also might create a “generalized conditioned aversion to drinking water - detrimental to overall health and a confounding factor” as pointed out by reviewer 3. Therefore, we designed the experiment a little differently.

      In this two-bottle choice experiment, mice were first proposed a high concentration of nicotine (100 µg/ml) which has previously been shown to induce avoidance behavior in mice (Figure 3C). Then, mice were offered three increasing concentrations of quinine: 30, 100 and 300 µM. Quinine avoidance was dose dependent, as expected: it was moderate for 30 µM but almost absolute for 300 µM quinine. We then investigated whether nicotine and quinine avoidances were linked. We found no correlation between nicotine and quinine preference (new Figure: Figure 1- supplementary figure 1D). This new experiment strongly suggests that aversion to the drug is not directly tied to the sensitivity of mice to the bitter taste of nicotine.

      Other results reinforce this conclusion. First, none of the b4-/- mice (0/13) showed aversion to nicotine, whereas about half of the virally-rescued animals (8/17, b4 re-expressed in the IPN of b4-/- mice) showed nicotine aversion, a proportion similar to the one observed in WT mice. This experiment makes a clear, direct link between the expression of b4 nAChRs in the IPN and aversion to the drug.

      Furthermore, we also verified that the sensitivity of b4-/- mice to bitterness is not different from that of WT mice (new Figure 4 – figure supplement 1B). This new result indicates that the reason why b4-/- mice consume more nicotine than WT mice is not because they have a reduced sensitivity to bitterness.

      Together, these new experiments strongly suggest that interindividual differences in sensitivity to the bitterness of nicotine play little role in nicotine consumption behavior in mice.

      b) Metabolic variabilities amongst isogenic mice have been observed. Thus, while the mice consume different amounts of nicotine, changes in metabolic processes, thus blood nicotine concentrations, could explain differences in nicotine consumption and neurophysiology across individuals. The authors should control if the blood concentration of nicotine metabolites between N-Av and Av are similar when consuming identical amounts of nicotine (50ug/ml), different amounts (200ug/ml), and in response to an acute injection of a fixed nicotine quantity.

      We agree with the reviewer that metabolic variabilities could explain (at least in part) the differences observed between avoiders and non-avoiders. But other factors could also play a role, such as stress level (there is a strong interaction between stress and nicotine addiction, as shown by our group (PMID: 29155800, PMID: 30361503) and others), hierarchical ranking, epigenetic factors etc… Our goal in this study is not to examine all possible sources of variability. What is striking about our results is that deletion of a single gene (encoding the nAChR b4 subunit) is sufficient to eliminate nicotine avoidance, and that re-expression of this receptor subunit in the IPN is sufficient to restore nicotine avoidance. In addition, we observe a strong correlation between the amplitude of nicotine-induced current in the IPN and nicotine consumption. Therefore, the expression level of b4 in the IPN is sufficient to explain most of the behavioral variability we observe. We do not feel the need to explore variations in metabolic activities, which are (by the way) very expensive experiments. However, we have added a sentence in the discussion to mention metabolic variabilities as a potential source of variability in nicotine consumption.

      2) Av mice exposed to nicotine_200ug/ml display minimal nicotine_50ug/ml consumption, yet would Av mice restore a percent nicotine consumption >20 when exposed to a more extended session at 50ug/kg? Such a data set will help identify and isolate learned avoidance processes from dose-dependent avoidance behaviors.

      We have now performed an additional two-bottle choice experiment to examine an extended time at 50 µg/ml. But we also performed the experiment a little differently. We directly offered a high nicotine concentration to mice (200 µg/ml), followed by 8 days at 50 µg/ml. We found that, overall, mice avoided the 200 µg/ml nicotine solution, and that the following increase in nicotine preference was slow and gradual throughout the eight days at 50 µg/ml (Figure 2-figure supplement 1C). This slow adjustment to a lower dose contrasts with the rapid (within a day) change in intake observed when nicotine concentration increases (Figure 1-figure supplement 1A). About half of the mice (6/13) retained a steady, low nicotine preference (< 20%) throughout the eight days at 50 µg/ml, resembling what was observed for avoiders in Figure 2D. Together, these results suggest that some of the mice, the non-avoiders, rapidly adjust their intake to adapt to changes in nicotine concentration in the bottle. For avoiders, aversion for nicotine seems to involve a learning mechanism that, once triggered, results in prolonged cessation of nicotine consumption.

      3) The author should further investigate the basal properties of IPn neuron in vivo firing rate activity recorded and establish if their spontaneous activity determines their nicotine responses in vivo, such as firing rate, ISI, tonic, or phasic patterns. These analyses will provide helpful information to the neurophysiologist investigating the function of IPn neurons and will also inform how chronic nicotine exposure shapes the IPn neurophysiological properties.

      We have performed additional analyses of the in vivo recordings. First, we have built maps of the recorded neurons, and we show that there is no anatomical bias in our sampling between the different groups. The only condition for which we did not sample neurons similarly is when we compare the responses to nicotine in vivo in WT and b4-/- mice (Figure 4E). The two groups were not distributed similarly along the dorso-ventral axis (Figure 4-figure supplement 2B). Yet, we do not think that the difference in nicotine responses observed between WT and b4-/- mice is due to a sampling bias. Indeed, we found no link between the response to nicotine and the dorsoventral coordinates of the neurons, in any of the groups (MPNic and MP Sal in Figure 3-figure supplement 1D; WT and b4-/- mice in Figure 4-figure supplement 2C). Therefore, our different groups are directly comparable, and the conclusions drawn in our study are fully justified.

      As requested, we have looked at whether the basal firing rate of IPN neurons determines the response to nicotine and indeed, neurons with a higher firing rate show a greater change in firing frequency upon nicotine injection (Figure 3-figure supplement 1G and Figure 4-figure supplement 2F). We have also looked at the effect of chronic nicotine on the spontaneous firing rate of IPN neurons (Figure 3-figure supplement 1F) but found no evidence for a change in basal firing properties. Similarly, the deletion of b4 had no effect on the spontaneous activity of the recorded neurons (Figure 4-figure supplement 2F). Finally, we found no evidence for any link between the anatomical coordinates of the neurons and their basal firing rate (Figure 3-figure supplement 1E and Figure 4-figure supplement 2D).

      Reviewer #3 (Public Review):

      The manuscript by Mondoloni et al characterizes two-bottle choice oral nicotine consumption and associated neurobiological phenotypes in the interpeduncular nucleus (IPN) using mice. The paper shows that mice exhibit differential oral nicotine consumption and correlate this difference with nicotine-evoked inward currents in neurons of the IPN. The beta4 nAChR subunit is likely involved in these responses. The paper suggests that prolonged exposure to nicotine results in reduced nAChR functional responses in IPN neurons. Many of these results or phenotypes are reversed or reduced in mice that are null for the beta4 subunit. These results are interesting and will add a contribution to the literature. However, there are several major concerns with the nicotine exposure model and a few other items that should be addressed.

      Strengths:

      Technical approaches are well-done. Oral nicotine, electrophysiology, and viral re-expression methods were strong and executed well. The scholarship is strong and the paper is generally well-written. The figures are high-quality.

      We would like to thank the reviewer for his/her comments and suggestions on how to improve the manuscript.

      Weaknesses:

      Two bottle choice (2BC) model. 2BC does not examine nicotine reinforcement, which is best shown as a volitional preference for the drug over the vehicle. Mice in this 2BC assay (and all such assays) only ever show indifference to nicotine at best - not preference. This is seen in the maximal 50% preference for the nicotine-containing bottle. 2BC assays using tastants such as saccharin are confounded. Taste responses can very likely differ from primary reinforcement and can be related to peripheral biology in the mouth/tongue rather than in the brain reward pathway.

      The two-bottle nicotine drinking test is a commonly used method to study addiction in mice (Matta, S. G. et al. 2006. Guidelines on nicotine dose selection for in vivo research. Psychopharmacology 190, 269–319). Like all methods, it has its limitations, but it also allows for different aspects to be addressed than those covered by self-administration protocols. The two-bottle nicotine drinking test simply measures the animals' preference for a solution containing nicotine over a control solution without nicotine: the animals are free to choose nicotine or not, which allows sensitivity and avoidance thresholds to be evaluated. What we show in this paper is precisely that despite interindividual differences in the way the drug is used (passively or actively), a significant proportion of the animals avoids the nicotine bottle at a certain concentration, suggesting that we are dealing with individual characteristics that are interesting to identify in the context of addiction and vulnerability. We agree that the two-bottle choice test cannot provide as much information about the reinforcing effects of the drug as self-administration procedures. We are aware of the limitations of the method and were careful not to interpret our data in terms of reinforcement to the drug. For instance, mice that consume nicotine were called “non-avoiders” and not “consumers”. We added a few sentences at the beginning of the discussion to highlight these limitations.

      The reviewer states that the mice in this 2BC assay (and all such assays) “only ever show indifference to nicotine at best - not preference”. This is seen in the maximal 50% preference for the nicotine-containing bottle. While this is true on average, it isn’t when we look at individual profiles, as we did here. We clearly observed that some mice have a strong preference for nicotine and, conversely, that some mice actively avoid nicotine after a certain concentration is proposed in the bottle.

      Regarding tastants, we indeed used saccharine to hide the bitter taste of nicotine and prevent taste-related side bias. This is a classical (though not perfect) paradigm in the field of nicotine research (Matta, S. G. et al. 2006. Guidelines on nicotine dose selection for in vivo research. Psychopharmacology 190, 269–319). To evaluate whether different sensitivities to the bitterness of nicotine may explain the interindividual differences in nicotine consumption we performed new experiments (as suggested by all three reviewers). In this two-bottle choice experiment, mice were first proposed a high concentration of nicotine (100 µg/ml) which has previously been shown to induce avoidance behavior in mice (Figure 3C). Then, mice were offered three increasing concentrations of quinine: 30, 100 and 300 µM. Quinine avoidance was dose dependent, as expected: it was moderate for 30 µM but almost absolute for 300 µM quinine. We then investigated whether nicotine and quinine avoidances were linked. We found no correlation between nicotine and quinine preference (new Figure: Figure 1- supplementary figure 1D). This new experiment strongly suggests that aversion to the drug is not directly tied to the sensitivity of mice to the bitter taste of nicotine. Other results reinforce this conclusion. First, none of the b4-/- mice (0/13) showed aversion to nicotine, whereas about half of the virally-rescued animals (8/17, b4 re-expressed in the IPN of b4-/- mice) showed nicotine aversion, a proportion similar to the one observed in WT mice. This experiment makes a clear, direct link between the expression of b4 nAChRs in the IPN and aversion to the drug. Furthermore, we also verified that the sensitivity of b4-/- mice to bitterness is not different from that of WT mice (new Figure 4 - figure supplement 1B). This new result indicates that the reason why b4-/- mice consume more nicotine than WT mice is not because they have a reduced sensitivity to bitterness.

      Together, these new experiments strongly suggest that interindividual differences in sensitivity to the bitterness of nicotine play little role in nicotine consumption behavior in mice.

      Moreover, this assay does not test free choice, as nicotine is mixed with water which the mice require to survive. Since most concentrations of nicotine are aversive, this may create a generalized conditioned aversion to drinking water - detrimental to overall health and a confounding factor.

      Mice are given a choice between two bottles, only one of which contains nicotine. Hence, even though their choices are not fully free (they are being presented with a limited set of options), mice can always decide to avoid nicotine and drink from the bottle containing water only. We do not understand how this situation may create a generalized aversion to drinking. In fact, we have never observed any mouse losing weight or with deteriorated health condition in this test, so we don’t think it is a confounding factor.

      What plasma concentrations of nicotine are achieved by 2BC? When nicotine is truly reinforcing, rodents and humans titrate their plasma concentrations up to 30-50 ng/mL. The Discussion states that oral self-administration in mice mimics administration in human smokers (lines 388-389). This is unjustified and should be removed. Similarly, the paragraph in lines 409-423 is quite speculative and difficult or impossible to test. This paragraph should be removed or substantially changed to avoid speculation. Overall, the 2BC model has substantial weaknesses, and/or it is limited in the conclusions it will support.

      The reviewer must have read another version of our article, because these sentences and paragraphs are not present in our manuscript.

      Regarding the actual concentration of nicotine in the plasma, this is indeed a good question. We have actually measured the plasma concentrations of nicotine for another study (article in preparation). The results from this experiment can be found below. The half-life of nicotine is very short in the blood and brain of mice (about 6 min, see Matta, S. G. et al. 2006. Guidelines on nicotine dose selection for in vivo research. Psychopharmacology 190, 269–319), making plasma nicotine very hard to assess. Therefore, we also assessed the plasma concentration of cotinine, the main metabolite of nicotine. We compared 4 different conditions: home-cage (forced drinking of 100 µg/ml nicotine solution); osmotic minipump (OP, 10 mg/kg/d, as in our current study); Souris-city (a large social environment developed by our group, see Torquet et al. Nat. Comm. 2018); and the two-bottle choice procedure (when a solution of nicotine 100 µg/ml was proposed). The concentrations of plasma nicotine found were very low for all groups that drank nicotine, but not for the group that received nicotine through the osmotic minipump. This is most likely because mice did not drink any nicotine in the hour prior to being sampled and all nicotine was metabolized. Indeed, when we look at the plasma concentration of cotinine, we see that cotinine was present in all of the groups. The plasma concentration of cotinine was similar in the groups for which “consumption” was forced: forced drinking in the home cage (HC) or infusion through osmotic minipump. This indicates that the plasma concentration of cotinine is similar whether mice drink nicotine (100 µg/ml) or whether nicotine is infused with the minipump (10 mg/kg/d). For Souris-city and the two-bottle choice procedure, the cotinine concentrations were in the same range (mostly between 0-100 ng/ml).
Globally, the concentrations of nicotine and cotinine found in the plasma of mice that underwent the two-bottle choice procedure are in the range of what has been previously described (Matta, S. G. et al. 2006. Guidelines on nicotine dose selection for in vivo research. Psychopharmacology 190, 269–319).

      Regarding the limitations of the two-bottle choice test, we discuss them more extensively in the current version of the manuscript.

      Statistical testing on subgroups. Mice are run through an assay and assigned to subgroups based on being classified as avoiders or non-avoiders. The authors then perform statistical testing to show differences between the avoiders and non-avoiders. It is circular to do so. When the authors divided the mice into avoiders and non-avoiders, this implies that the mice are different or from different distributions in terms of nicotine intake. Conducting a statistical test within the null hypothesis framework, however, implies that the null hypothesis is being tested. The null hypothesis, by definition, is that the groups do NOT differ. Obviously, the authors will find a difference between the groups in a statistical test when they pre-sorted the mice into two groups, to begin with. Comparing effect sizes or some other comparison that does not invoke the null hypothesis would be appropriate.

      Our analysis, which can be summarized as follows, is fairly standard (see Krishnan, V. et al. (2007) Molecular adaptations underlying susceptibility and resistance to social defeat in brain reward regions. Cell 131, 391–404). Firstly, the mice are segregated into two groups based on their consumption profile, using the variability in their behavior. The two groups are obviously statistically different when comparing their consumption. This first analytical step allows us to highlight the variability and to establish the properties of each sub-population in terms of consumption. Our analysis could support the reviewer's comment if it ended at this point. However, our analysis doesn't end here and moves on to the second step. The separation of the mice into two groups (which is now a categorical variable) is used to compare the distribution of other variables, such as mouse choice strategy and current amplitude, between the two categories. The null hypothesis tested is that the value of these other variables is not different between groups. There is no a priori obvious reason for the currents recorded in the IPN to be different in the two groups. These approaches allow us to show correlations between the variables. Finally, in the third and last step, one (or several) variable(s) are manipulated to check whether nicotine consumption is modified accordingly. Manipulation was performed by exposing mice to chronic nicotine, by using mutant mice with decreased nicotinic currents, and by re-expressing the deleted nAChR subunit only in the IPN. This procedure is fairly standard, and cannot be considered a circular analysis with a data selection problem, as explained in (Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F. & Baker, C. I. (2009) Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience 12, 535-540).

      Decreased nicotine-evoked currents following passive exposure to nicotine in minipumps are inconsistent with published results showing that similar nicotine exposure enhances nAChR function via several measures (Arvin et al, J Neurosci, 2019). The paper does acknowledge this previous paper and suggests that the discrepancy is explained by the fact that they used a higher concentration of nicotine (30 uM) that was able to recruit the beta4-containing receptor (whereas Arvin et al used a caged nicotine that was unable to do so). This may be true, but the citation of 30 uM nicotine undercuts the argument a bit because 30 uM nicotine is unlikely to be achieved in the brain of a person using tobacco products; nicotine levels in smokers are 100-500 nM. It should be noted in the paper that it is unclear whether the down-regulated receptors would be active at concentrations of nicotine found in the brain of a smoker.

      We indeed find opposite results compared to Arvin et al., and we give possible explanations for this discrepancy in the discussion. To be honest we don’t fully understand why we have opposite results. However, we clearly observed a decreased response to nicotine, both in vitro (with 30 µM nicotine on brain slices) and in vivo (with a classical dose of 30 µg/kg nicotine i.v.), while Arvin et al. only tested nicotine in vitro.

      Regarding the reviewer’s comment about the nicotine concentration used (30 µM): we used that concentration in vitro to measure nicotine-induced currents (it’s a concentration close to the EC50 for heteromeric receptors, which will likely recruit low affinity a3b4 receptors) and to evaluate the changes in nAChR current following nicotine exposure. We did not use that concentration to induce nAChR desensitization, so we don’t really understand the argument regarding the levels of nicotine in smokers. For inducing desensitization, we used a minipump that delivers a daily dose of 10 mg/kg/day, which is the amount of nicotine mice drink in our assay.

      The statement in lines 440-41 ("we show that concentrations of nicotine as low as 7.5 ug/kg can engage the IPN circuitry") is misleading, as the concentration in the water is not the same as the concentration in the CSF since the latter would be expected to build up over time. The paper did not provide measurements of nicotine in plasma or CSF, so concluding that the water concentration of nicotine is related to plasma concentrations of nicotine is only speculative.

      The sentence “we show that concentrations of nicotine as low as 7.5 ug/kg can engage the IPN circuitry" is not in the manuscript so the reviewer must have read another version of the paper.

      The results in Figure 2E do not appear to be from a normal distribution. For example, results cluster at low (~100 pA) responses, and a fraction of larger responses drive the similarities or differences.

      Indeed, that is why we performed a non-parametric Mann-Whitney test for comparing the two groups, as indicated in the legend of figure 2E.
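
      The rank-based comparison mentioned above can be illustrated with a minimal sketch of the Mann-Whitney U statistic, written in plain Python. The function names and the current amplitudes are hypothetical and are not the data of Figure 2E; the sketch only shows why the test does not assume normally distributed responses: it uses ranks, so a few very large currents carry no more weight than their rank order.

```python
def rank_with_ties(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y), the Mann-Whitney U statistics of the two samples."""
    ranks = rank_with_ties(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical current amplitudes (pA) for two groups of neurons.
group_a = [90, 100, 110, 120]
group_b = [150, 300, 450]
u_a, u_b = mann_whitney_u(group_a, group_b)
print(u_a, u_b)
```

      Replacing 450 by 4500 in `group_b` leaves both U values unchanged, which is the property that makes the test suitable for skewed response distributions. In practice a library routine such as `scipy.stats.mannwhitneyu` would also return the p-value.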

      10 mg/kg/day in mice or rats is likely a non-physiological exposure to nicotine. Most rats take in 1.0 to 1.5 mg/kg over a 23-hour self-administration period (O'Dell, 2007). Mice achieve similar levels during SA (Fowler, Neuropharmacology 2011). Forced exposure to 10 mg/kg/day is therefore 5 to 10-fold higher than rodents would ever expose themselves to if given the choice. This should be acknowledged in a limitations section of the Discussion.

      The two-bottle choice task is very different from nicotine self-administration procedures in terms of administration route: oral versus injected (in the blood or in the brain), respectively. Therefore, the quantities of drug consumed cannot be directly compared. In our manuscript, mice consume on average 10 mg/kg/day of nicotine at the highest nicotine concentration tested, which is fully consistent with what was already published in many studies (20 mg/kg/day in Frahm et al. Neuron 2013, 5-10 mg/kg/day in Bagdas et al., NP 2020, 10-20 mg/kg/day in Bagdas et al. NP2019, to cite a few...). Hence, we used that concentration of nicotine (10 mg/kg/d) for chronic administration of nicotine using minipumps. This is also a nicotine concentration that is classically used in osmotic minipumps for chronic administration of nicotine: 10 mg/kg/d in Dongelmans et al. Nat. Com 2021 (our lab), 12 mg/kg/d in Arvin et al. J. Neuro. 2019 (Drenan lab), 12 mg/kg/d in Lotfipour et al. J. Neuro. 2013 (Boulter lab) etc… Therefore, we do not see the issue here.

      Are the in vivo recordings in IPN enriched or specific for cells that have a spontaneous firing at rest? If so, this may or may not be the same set/type of cells that are recorded in patch experiments. The results could be biased toward a subset of neurons with spontaneous firing. There are MANY different types of neurons in IPN that are largely intermingled (see Ables et al, 2017 PNAS), so this is a potential problem.

      It is true that there are many types of neurons in the IPN. In-vivo electrophysiology and slice electrophysiology should be considered as two complementary methods to obtain detailed properties of IPN neurons. The populations sampled by these two methods are certainly not identical (IPR in patch-clamp versus mostly IPR and IPC in vivo), and indeed only spontaneously active neurons are recorded in in-vivo electrophysiology. The question is whether or not this is a potential problem. The results we obtained using in-vivo and brain-slice electrophysiology are consistent (i.e., a decreased response to nicotine), which indicates that our results are robust and do not depend on the selection of a particular subpopulation. In addition, we now provide the maps of the neurons recorded both in slices and in vivo (see supplementary figures, and response to the other two referees). We show that, overall, there is no sampling bias between the different groups. Together, these new analyses strongly suggest that the differences we observe between the groups are not due to sampling issues. We have added the Ables 2017 reference and discuss neuron variability more extensively in the revised manuscript.

      Related to the above issue, which of the many different IPN neuron types did the group re-express beta4? Could that be controlled or did beta4 get re-expressed in an unknown set of neurons in IPN? There is insufficient information given in the methods for verification of stereotaxic injections.

      Re-expression of b4 was achieved with a strong, ubiquitous promoter (pGK), hence all cell types should in principle be transduced. This is now clearly stated in the results section, the figure legend and the methods section. Unfortunately, we had no access to a specific mouse line to restrict expression of b4 to b4-expressing cells, since the b4-Cre line of GENSAT is no longer available. This mouse line was problematic anyway, because expression levels of the a3, a5 and b4 nAChR subunits, which belong to the same gene cluster, were reported to be affected. Yet, we show in this article that deleting b4 leads to a strong reduction of nicotine-induced currents in the IPR (80%, patch-clamp), and of the response to nicotine in vivo (65%). These results indicate that b4 is strongly expressed in the IPN, likely in a large majority of IPR and IPC neurons (see also our response to reviewer 1). In addition, we show that our re-expression strategy restores nicotine-induced currents in patch-clamp experiments and also the response to nicotine in vivo (new Figure 5C). Non-native expression levels (e.g. overexpression) could potentially be reached, but this is not what we observed: responses to nicotine were restored to WT levels (in slices and in vivo). Importantly, this strategy also rescued the WT phenotype in terms of nicotine consumption. Expression of b4 alone in cells that do not express any other nAChR subunit (as, presumably, in the lateral parts of the IPN, see GENSAT images above) should not produce any functional nAChR, since alpha subunits are mandatory to produce functional receptors. As specified in the manuscript, proper transduction of the IPN was verified using post-hoc immunochemistry, and mice with transduction of b4 in the VTA were excluded from the analyses.

      Data showing that alpha3 or beta4 disruption alters MHb/IPN nAChR function and nicotine 2BC intake is not novel. In fact, some of the same authors were involved in a paper in 2011 (Frahm et al., Neuron) showing that enhanced alpha3beta4 nAChR function was associated with reduced nicotine consumption. The present paper would therefore seem to somewhat contradict prior findings from members of the research group.

      Frahm et al used a transgenic mouse line (called TABAC) in which the expression of a3b4 receptor is increased, and they observed reduced nicotine consumption. We do the exact opposite: we reduce (a3)b4 receptor expression (using the b4 knock-out line, or by putting mice under chronic nicotine), and observe increased consumption. There is thus no contradiction. In fact, we discuss our findings in the light of Frahm et al. in the discussion section.

      Sex differences. All studies were conducted in male mice, therefore nothing was reported regarding female nicotine intake or physiology responses. Nicotine-related biology often shows sex differences, and there should be a justification provided regarding the lack of data in females. A limitations section in the Discussion section is a good place for this.

      We agree with the reviewer. We added a sentence in the discussion.

    1. Author Response

      Reviewer #1 (Public Review):

      1) Although I found the introduction well written, I think it lacks some information or needs to develop more on some ideas (e.g., differences between the cerebellum and cerebral cortex, and folding patterns of both structures). For example, after stating that "Many aspects of the organization of the cerebellum and cerebrum are, however, very different" (1st paragraph), I think the authors need to develop more on what these differences are. Perhaps just rearranging some of the text/paragraphs will help make it better for a broad audience (e.g., authors could move the next paragraph up, i.e., "While the cx is unique to mammals (...)").

      We have added additional context to the introduction and developed the differences between cerebral and cerebellar cortex, also re-arranging the text as suggested.

      2) Given that the authors compare the folding patterns between the cerebrum and cerebellum, another point that could be mentioned in the introduction is the fact that the cerebellum is convoluted in every mammalian species (and non-mammalian spp as well) while the cerebrum tends to be convoluted in species with larger brains. Why is that so? Do we know about it (check Van Essen et al., 2018)? I think this is an important point to raise in the introduction and to bring it back into the discussion with the results.

      We now mention in the introduction the fact that the cerebellum is folded in mammals, birds and some fishes, and provide references to the relevant literature. We have also expanded our discussion about the reasons for cortical folding in the discussion, which now contains a subsection addressing the subject (this includes references to the work of Van Essen).

      3) In the results, first paragraph, what do the authors mean by the volume of the medial cerebellum? This needs clarification.

      We have modified the relevant section in the results and made the definition of the medial cerebellum clearer, indicating that we refer to the vermal region of the cerebellum.

      4) In the results: When the authors mention 'frequency of cerebellar folding', do they mean the degree of folding in the cerebellum? At least in non-mammalian species, many studies have tried to compare the 'degree or frequency of folding' in the cerebellum by different proxies/measurements (see Iwaniuk et al., 2006; Yopak et al., 2007; Lisney et al., 2007; Yopak et al., 2016; Cunha et al., 2022). Perhaps change the phrase in the second paragraph of the result to: "There are no comparative analyses of the frequency of cerebellar folding in mammals, to our knowledge".

      We have modified the subsection in the methods referring to the measurement of folial width and folial perimeter to make the difference more clear. The folding indices that have been used previously (which we cite) are based on Zilles’s gyrification index. This index provides only a global idea of degree of folding, but it’s unable to distinguish a cortex with profuse shallow folds from one with a few deep ones. An example of this is now illustrated in Fig. 3d, where we also show how that problem is solved by the use of our two measurements (folial width and perimeter). The problem is also discussed in the section about the measurement of folding in the discussion section:

      “Previous studies of cerebellar folding have relied either on a qualitative visual score (Yopak et al. 2007, Lisney et al. 2008) or a “gyrification index” based on the method introduced by Zilles et al. (1988, 1989) for the study of cerebral folding (Iwaniuk et al. 2006, Cunha et al. 2020, 2021). Zilles’s gyrification index is the ratio between the length of the outer contour of the cortex and the length of an idealised envelope meant to reflect the length of the cortex if it were not folded. For instance, a completely lissencephalic cortex would have a gyrification index close to 1, while a human cerebral cortex typically has a gyrification index of ~2.5 (Zilles et al. 1988). This method has certain limitations, as highlighted by various researchers (Germanaud et al. 2012, 2014, Rabiei et al. 2018, Schaer et al. 2008, Toro et al. 2008, Heuer et al. 2019). One important drawback is that the gyrification index produces the same value for contours with wide variations in folding frequency and amplitude, as illustrated in Fig. 3d. In reality, folding frequency (inverse of folding wavelength) and folding amplitude represent two distinct dimensions of folding that cannot be adequately captured by a single number confusing both dimensions. To address this issue we introduced 2 measurements of folding: folial width and folial perimeter. These measurements can be directly linked to folding frequency and amplitude, and are comparable to the folding depth and folding wavelength we introduced previously for cerebral 3D meshes (Heuer et al. 2019). By using these measurements, we can differentiate folding patterns that could be confused when using a single value such as the gyrification index (Fig. 3d). Additionally, these two dimensions of folding are important, because they can be related to the predictions made by biomechanical models of cortical folding, as we will discuss now.”
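
      The limitation described in the quoted passage can be reproduced with a toy contour. The sketch below assumes an idealised sinusoidal contour over a unit-length flat envelope, which is a simplification of Zilles's procedure and not the measurement pipeline of the manuscript; all names are hypothetical. Because the length of a sinusoid over whole periods depends only on the product of amplitude and frequency, a contour with a few deep folds and one with many shallow folds yield the same index.

```python
import math


def contour_length(amplitude, frequency, n_steps=20000):
    """Polyline length of y = amplitude * sin(2*pi*frequency*x) for x in [0, 1]."""
    length = 0.0
    prev_x, prev_y = 0.0, 0.0
    for i in range(1, n_steps + 1):
        x = i / n_steps
        y = amplitude * math.sin(2 * math.pi * frequency * x)
        length += math.hypot(x - prev_x, y - prev_y)
        prev_x, prev_y = x, y
    return length


def gyrification_index(amplitude, frequency):
    """Ratio of folded contour length to envelope length."""
    envelope_length = 1.0  # assumption: the unfolded envelope is a flat unit segment
    return contour_length(amplitude, frequency) / envelope_length


# Two toy contours: few deep folds versus many shallow folds.
few_deep = gyrification_index(amplitude=0.10, frequency=5)
many_shallow = gyrification_index(amplitude=0.05, frequency=10)
print(few_deep, many_shallow)
```

      The two printed values agree up to discretisation error although the folding frequency differs by a factor of two, which is why a pair of measurements (one tied to frequency, one to amplitude) is needed to tell such patterns apart.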

      5) Sultan and Braitenberg (1993) measured cerebella that were sagittally sectioned (instead of coronal), right? Do you think this difference in the plane of the section could be one of the reasons explaining different results on folial width between studies? Why does the foliation index calculated by Sultan and Braitenberg (1993) not provide information about folding frequency?

      The measurement of foliation should be similar as long as enough folds are sectioned perpendicular to their main axis. This will be the case for folds in the medial cerebellum (vermis) sectioned sagittally, and for folds in the lateral cerebellum sectioned coronally. The foliation index of Sultan and Braitenberg does not provide an account of folding frequency similar to ours because they only measure groups of folia (what some have called lamellae), whereas we measure individual folia. It is not easy to understand from their paper exactly how Sultan and Braitenberg proceeded. We contacted Prof. Fahad Sultan (we acknowledge his help in our manuscript). Author response image 1 provides a clearer description of their procedure:

      Author response image 1.

      As Author response image 1 shows, each of the structures that they call a fold is composed of several folia, and so their measurements are not comparable with ours which measure individual folia (a). The flattened representation (b) is made by stacking the lengths of the fold axes (dashed lines), separating them by the total length of each fold (the solid lines), which each may contain several folia.

      6) Another point that needs to be clarified is the log transformation of the data. Did the authors use log-transformed data for all types of analyses done in the study? Write this information in the material and methods.

      Yes, we used the log10 transformation for all our measurements. This is now mentioned in the methods section, and again in the section concerning allometry. We are including a link to all our code to facilitate exact replication of our entire method, including this transformation.
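
      As an illustration of how the log10 transformation enters an allometric analysis, here is a minimal sketch that fits a straight line to log10-transformed values by ordinary least squares. This is a simplification: the manuscript uses phylogenetic comparative models, which this sketch does not implement, and the function name and data are hypothetical.

```python
import math


def allometric_fit(x, y):
    """Least-squares fit of log10(y) = intercept + slope * log10(x)."""
    log_x = [math.log10(v) for v in x]
    log_y = [math.log10(v) for v in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data following y = 2 * x^(2/3) exactly.
sizes = [1, 10, 100, 1000]
measurements = [2 * v ** (2 / 3) for v in sizes]
slope, intercept = allometric_fit(sizes, measurements)
print(slope, intercept)
```

      On log10 axes a power law becomes a line whose slope is the allometric exponent, so the fit recovers a slope close to 2/3 and an intercept close to log10(2) for these synthetic values.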

      7) The discussion needs to be expanded. The focus of the paper is on the folding pattern of the cerebellum (among different mammalian species) and its relationship with the anatomy of the cerebrum. Therefore, the discussion on this topic needs to be better developed, in my opinion (especially given the interesting results of this paper). For example, with the findings of this study, what can we say about how the folding of the cerebellum is determined across mammals? The authors found that the folial width, folial perimeter, and thickness of the molecular layer increase at a relatively slow rate across the species studied. Does this mean that these parameters have little influence on the cerebellar folding pattern? What mostly defines the folding patterns of the cerebellum given the results? Is it the interaction between section length and area? Can the authors explain why size does not seem to be a "limiting factor" for the folding of the cerebellum (for example, even relatively small cerebella are folded)? Is that because the 'white matter' core of the cerebellum is relatively small (thus more stress on it)?

      We have expanded the discussion as suggested, with subsections detailing the measuring of folding, the modelling of folding for the cerebrum and the cerebellum, and the role that cerebellar folding may play in its function. We refer to the literature on cortical folding modelling, and we discuss our results in terms of the factors that this research has highlighted as critical for folding. From the discussion subsection on models of cortical folding:

      “The folding of the cerebral cortex has been the focus of intense research, both from the perspective of neurobiology (Borrell 2018, Fernández and Borrell 2023) and physics (Toro and Burnod 2005, Tallinen et al. 2014, Kroenke and Bayly 2018). Current biomechanical models suggest that cortical folding should result from a buckling instability triggered by the growth of the cortical grey matter on top of the white matter core. In such systems, the growing layer should first expand without folding, increasing the stress in the core. But this configuration is unstable, and if growth continues stress is released through cortical folding. The wavelength of folding depends on cortical thickness, and folding models such as the one by Tallinen et al. (2014) predict a neocortical folding wavelength which corresponds well with the one observed in real cortices. Tallinen et al. (2014) provided a prediction for the relationship between folding wavelength $\lambda$ and the mean thickness ($t$) of the cortical layer: $\lambda = 2\pi t \left(\mu / (3\mu_s)\right)^{1/3}$. (...)”
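
      The wavelength prediction quoted above can be evaluated directly. The sketch below assumes that µ and µ_s denote the stiffness (shear modulus) of the cortical layer and of the underlying substrate respectively, which the truncated quotation does not spell out; the function name and the numerical values are hypothetical and chosen only to show the scaling.

```python
import math


def folding_wavelength(thickness, mu_layer, mu_substrate):
    """Wavelength 2*pi*t*(mu / (3*mu_s))**(1/3) for a layer of thickness t.

    Assumption: mu_layer and mu_substrate are the stiffnesses of the cortical
    layer and of the substrate, expressed in the same units.
    """
    return 2 * math.pi * thickness * (mu_layer / (3 * mu_substrate)) ** (1 / 3)


# Hypothetical values: same stiffness ratio, two different layer thicknesses.
thick_layer = folding_wavelength(thickness=2.0, mu_layer=1.0, mu_substrate=1.0)
thin_layer = folding_wavelength(thickness=0.5, mu_layer=1.0, mu_substrate=1.0)
print(thick_layer, thin_layer)
```

      For a fixed stiffness ratio the wavelength is proportional to thickness, so a layer four times thinner folds with a wavelength four times shorter. This is the scaling behind the argument that a thin cortical layer can fold finely even in a small structure.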

      From this biomechanical framework, our answers to the questions of the Reviewer would be:

      • How is the folding of the cerebellum determined across mammals? By the expansion of a layer of reduced thickness on top of an elastic layer (the white matter)

      • Folial width, folial perimeter, and thickness of the molecular layer increase at a relatively slow rate across the species studied. Does this mean that these parameters have little influence on the cerebellar folding pattern? On the contrary, that indicates that the shape of individual folia is stable, providing the smallest level of granularity of a folding pattern. In the extreme case where all folia had exactly the same size, a small cerebellum would have enough space to accommodate only a few folia, whereas a large cerebellum would accommodate many more.

      • What mostly defines the folding patterns of the cerebellum given the results? Is it the interaction between section length and area? It’s the mostly 2D expansion of the cerebellar cortical layer and its thickness.

      • Can the authors explain why size does not seem to be a "limiting factor" for the folding of the cerebellum? Because even a cerebellum of very small volume would fold if its cortex were thin enough and expanded sufficiently. That’s why the cerebellum folds even while being smaller than the cerebrum: because its cortex is much thinner.

      8) One caveat or point to be raised is the fact that the authors use the median of the variables measured for the whole cerebellum (e.g., median width and median perimeter across all folia). Although the cerebellum is highly uniform in its gross internal morphology and circuitry's organization across most vertebrates, there is evidence showing that the cerebellum may be organized in different functional modules. In that way, different regions or folia of the cerebellum would have different olivo-cortico-nuclear circuitries, forming, each one, a single cerebellar zone. Although it is not completely clear how these modules/zones are organized within the cerebellum, I think the authors could acknowledge this at the end of their discussion, and raise potential ideas for future studies (e.g., analyse folding of the cerebellum within the brain structure - vermis vs lateral cerebellum, for example). I think this would be a good way to emphasize the importance of the results of this study and what are the main questions remaining to be answered. For example, the expansion of the lateral cerebellum in mammals is suggested to be linked with the evolution of vocal learning in different clades (see Smaers et al., 2018). An interesting question would be to understand how foliation within the lateral cerebellum varies across mammalian clades and whether this has something to do with the cellular composition or any other aspect of the microanatomy as well as the evolution of different cognitive skills in mammals.

      We now address this point in a subsection of the discussion which details the implications of our methodological decisions and the limitations of our approach. It is true that the cerebellum is regionally variable. Our measurements of folial width, folial perimeter and molecular layer thickness are local, and we should be able to use them in the future to study regional variation. However, this comes with a number of difficulties. First, it would require sampling all the cerebellum (and the cerebrum) and not just one section. But even if that were possible that would increase the number of phenotypes, beyond the current scope of this study. Our central question about brain folding in the cerebellum compared to the cerebrum is addressed by providing data for a substantial number of mammalian species. As indicated by Reviewer #3, adding more variables makes phylogenetic comparative analyses very difficult because the models to fit become too large.

      Reviewer #2 (Public Review):

      1) The methods section does not address all the numerical methods used to make sense of the different brain metrics.

      We now provide more detailed descriptions of our measurements of foliation, phylogenetic models, analysis of partial correlations, phylogenetic principal components, and allometry. We have added illustrations (to Figs. 3 and 5), examples and references to the relevant literature.

      2) In the results section, it sometimes makes it difficult for the reader to understand the reason for a sub-analysis and the interpretation of the numerical findings.

      The revised version of our manuscript includes motivations for the different types of analyses, and we have also added a paragraph providing a guide to the structure of our results.

      3) The originality of the article is not sufficiently brought forward:

      a) the novel method to detect the depth of the molecular layer is not contextualized in order to understand the shortcomings of previously-established methods. This prevents the reader from understanding its added value and hinders its potential re-use in further studies.

      The revised version of the manuscript provides additional context which highlights the novelty of our approach, in particular concerning the measurement of folding and the use of phylogenetic comparative models. The limitations of the previous approaches are stated more clearly, and illustrated in Figs. 3 and 5.

      b) The numerous results reported are not sufficiently addressed in the discussion for the reader to get a full grasp of their implications, hindering the clarity of the overall conclusion of the article.

      Following the Reviewer’s advice, we have thoroughly restructured our results and discussion section.

      Reviewer #3 (Public Review):

      1) The first problem relates to their use of the Ornstein-Uhlenbeck (OU) model: they try fitting three evolutionary models, and conclude that the Ornstein-Uhlenbeck model provides the best fit. However, it has been known for a while that OU models are prone to bias and that the apparent superiority of OU models over Brownian Motion is often an artefact, a problem that increases with smaller sample sizes. (Cooper et al (2016) Biological Journal of the Linnean Society, 2016, 118, 64-77).

      Cooper et al.’s (2016) article “A Cautionary Note on the Use of Ornstein Uhlenbeck Models in Macroevolutionary Studies” suggests that comparing evolutionary models using the model’s likelihood often leads to incorrectly selecting OU over BM, even for data generated from a BM process. However, Grabowski et al. (2023), in their article ‘A Cautionary Note on “A Cautionary Note on the Use of Ornstein Uhlenbeck Models in Macroevolutionary Studies”’, suggest that Cooper et al.’s (2016) claim may be misleading. The work of Clavel et al. (2019) and Clavel and Morlon (2017) shows that the penalised framework implemented in mvMORPH can successfully recover the parameters of a multivariate OU process. To address the Reviewer’s concern more directly, we used simulations to evaluate the chances that we would decide for an OU model when the correct model was BM (a procedure similar to the one used by Cooper et al. (2016)). However, instead of using the likelihood of the fitted models directly as Cooper et al. (2016) did (which does not control for the number of parameters in the model), we used the Akaike Information Criterion corrected for small sample sizes: AICc. The standard Akaike Information Criterion takes the number of parameters of the model into account, but this is not sufficient when the sample size is small. AICc provides a score which takes both aspects into account: model complexity and sample size. This information has been added to the manuscript:

      “We selected the best fitting model using the Akaike Information Criterion (AIC), corrected for small sample sizes (AICc). AIC takes into account the number of parameters $p$ in the model: $\mathrm{AIC} = -2\log(\mathrm{likelihood}) + 2p$. This approximation is insufficient when the sample size is small, in which case an additional correction is required, leading to the corrected AIC: $\mathrm{AICc} = \mathrm{AIC} + (2p^2 + 2p)/(n - p - 1)$, where $n$ is the sample size.”
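      As a minimal sketch of how the AIC and AICc definitions quoted above penalise an extra parameter, the following computes both scores for two candidate models. The log-likelihoods, the parameter counts, and the choice of n = 56 are hypothetical values chosen for illustration; they are not taken from the fitted models in the manuscript.

```python
def aic(log_likelihood, p):
    """AIC = -2 log(likelihood) + 2p, where p is the number of parameters."""
    return -2.0 * log_likelihood + 2.0 * p


def aicc(log_likelihood, p, n):
    """AICc = AIC + (2p^2 + 2p) / (n - p - 1), where n is the sample size."""
    if n - p - 1 <= 0:
        raise ValueError("AICc requires n > p + 1")
    return aic(log_likelihood, p) + (2.0 * p ** 2 + 2.0 * p) / (n - p - 1)


# Hypothetical values for illustration only (not results from the study).
n = 56
aicc_bm = aicc(log_likelihood=-100.0, p=2, n=n)  # simpler model
aicc_ou = aicc(log_likelihood=-99.5, p=3, n=n)   # one extra parameter

print(f"BM AICc = {aicc_bm:.3f}, OU AICc = {aicc_ou:.3f}")
print("preferred:", "BM" if aicc_bm < aicc_ou else "OU")
```

      In this made-up case the model with the extra parameter has the slightly higher likelihood, yet the simpler model obtains the lower (better) AICc, which is the behaviour that a comparison based on raw likelihood alone would miss.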

      In 1000 simulations of 9 correlated multivariate traits for 56 species (i.e., 56*9 data points) using our phylogenetic tree, we would have decided for OU when the real model was BM in only 0.7% of cases.

      2) Second, for the partial correlations (e.g. fig 7) and Principal Components (fig 8) there is a concern about over-fitting: there are 9 variables and only 56 data points (violating the minimal rule of thumb that there should be >10 observations per parameter). Added to this, the inclusion of variables lacks a clear theoretical rationale. The high correlations between most variables will be in part because they are to some extent measuring the same things, e.g. the five different measures of cerebellar anatomy which include two measures of folial size. This makes it difficult to separate their effects. I get that the authors are trying to tease apart different aspects of size, but in practice, I think these results (e.g. the presence of negative coefficients in Fig 7) are really hard or impossible to interpret. The partial correlation network looks like a "correlational salad" rather than a theoretically motivated hypothesis test. It isn't clear to me that the PC analyses solve this problem, but it partly depends on the aims of these analyses, which are not made very clear.

      PCA is simply a rigid rotation of the data: distances among multivariate data points are all conserved. Because neither our PCA nor our partial correlation analysis involves model fitting, the concept of overfitting does not apply. PCA and partial correlations are also not used here for hypothesis testing, but as exploratory methods which provide a transformation of the data aiming at capturing the main trends of multivariate change. The aim of our analysis of correlation structure is precisely to avoid the “correlational salad” that the Reviewer mentions. The Reviewer is correct: all our variables are correlated to a varying degree (note that there are 56 data points per variable = 56*9 data points, not just 56 data points). Partial correlations and PCA aim at providing a principled way in which correlated measurements can be explored. In the revised version of the manuscript we include a more detailed description of partial correlations and (phylogenetic) PCA. Whenever variables measure the same thing, they will be combined into the same principal component (these are the combinations shown in Fig. 8 b and d). Additionally, two variables may be correlated because of their correlation with a third variable (or more). Partial correlations address this possibility by looking at the correlations between the residuals of each pair of variables after all other variables have been covaried out. We provide a simple example which should make this clear, providing in particular an intuition for the meaning of negative correlations:

      “All our phenotypes were strongly correlated. We used partial correlations to better understand pairwise relationships. The partial correlation between 2 vectors of measurements a and b is the correlation between their residuals after the influence of all other measurements has been covaried out. Even if the correlation between a and b is strong and positive, their partial correlation could be 0 or even negative. Consider, for example, 3 vectors of measurements a, b, c, which result from the combination of uncorrelated random vectors x, y, z. Suppose that a = 0.5 x + 0.2 y + 0.1 z, b = 0.5 x - 0.2 y + 0.1 z, and c = x. The measurements a and b will be positively correlated because of the effect of x and z. However, if we compute the residuals of a and b after covarying the effect of c (i.e., x), their partial correlation will be negative because of the opposite effect of y on a and b. The statistical significance of each partial correlation being different than 0 was estimated using the edge exclusion test introduced by Whittaker (1990).”
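      The a, b, c example quoted above can be checked numerically. The sketch below assumes that x, y and z are independent standard normal vectors (the quoted text only states that they are uncorrelated) and uses the first-order partial correlation formula, which is equivalent to correlating the residuals of a and b after regressing each on c. The helper names are illustrative and are not part of the authors' code.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def partial_corr(u, v, w):
    """Correlation between u and v after covarying out a single variable w."""
    r_uv, r_uw, r_vw = pearson(u, v), pearson(u, w), pearson(v, w)
    return (r_uv - r_uw * r_vw) / math.sqrt((1 - r_uw ** 2) * (1 - r_vw ** 2))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20000
# Assumption: x, y, z are independent standard normal vectors.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]

a = [0.5 * xi + 0.2 * yi + 0.1 * zi for xi, yi, zi in zip(x, y, z)]
b = [0.5 * xi - 0.2 * yi + 0.1 * zi for xi, yi, zi in zip(x, y, z)]
c = x

r_ab = pearson(a, b)
r_ab_given_c = partial_corr(a, b, c)
print(f"correlation(a, b) = {r_ab:.2f}")
print(f"partial correlation(a, b | c) = {r_ab_given_c:.2f}")
```

      Under the unit-variance assumption the expected values are 0.22/0.30 ≈ 0.73 for the plain correlation and −0.03/0.05 = −0.6 for the partial correlation, so the sign flips once the shared effect of x is removed.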

      The rationale for our analyses has been made more clear in the revised version of the manuscript, aided by the more detailed description of our methods. In particular, we describe better the reason for our 2 measurements of folial shape – width and perimeter – which measure independent dimensions of folding (this is illustrated in Fig. 3d).

      3) The claim of concerted evolution between cortical and cerebellar values (P 11-12) seems to be based on analyses that exclude body size and brain size. It, therefore, seems possible - or even likely - that all these analyses reveal overall size effects that similarly influence the cortex and cerebellum. When the authors state that they performed a second PC analysis with body and brain size removed "to better understand the patterns of neuroanatomical evolution" it isn't clear to me that is what this achieves. A test would be a model something like [cerebellar measure ~ cortical measure + rest of the brain measure], and this would deal with the problem of 'correlation salad' noted below.

      The answer to this question is in the partial correlation diagram in Fig. 7c. This analysis does not exclude body weight or brain weight. It shows that the strong correlation between cerebellar area and length is supported by a strong positive partial correlation, as is the link between cerebral area and length. There is a significant positive partial correlation between cerebellar section area and cerebral section length. That is, even after covarying everything else, there is still a correlation between cerebellar section area and cerebral section length (this partial correlation is equivalent to the suggestion of the Reviewer). Additionally, there is a positive partial correlation between body weight and cerebellar section area, but no significant partial correlation between body weight and cerebral section area or length. Our approach aims at obtaining a general view of all the relationships in the data. Testing an individual model would certainly decrease the number of correlations; however, it would provide only a partial view of the problem.

      4) It is not quite clear from fig 6a that the result does indeed support isometry between the data sets (predicted 2/3 slope), and no coefficient confidence intervals are provided.

      We have now added the numerical values of the CIs to all our plots in addition to the graphical representations (grey regions) in the previous version of the manuscript. The isometry slope (0.67) is either within the CIs (both for the linear and orthogonal regressions) or at the margin, indicating that if the relationships are not isometric, they are very close to it.
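      For readers unfamiliar with where a 2/3 isometric slope comes from, the sketch below illustrates the general scaling argument: for geometrically similar shapes, an area-like measurement scales with the 2/3 power of a volume-like measurement, so a log-log regression has slope 2/3. The cubes used here are a stand-in chosen for illustration; which pair of measurements is compared in Fig. 6a is not restated in this response, so this is the generic argument rather than a reproduction of that analysis.

```python
import math


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Geometrically similar cubes of increasing side length (illustrative shapes).
sides = [1.0, 2.0, 4.0, 8.0]
log_volume = [math.log10(s ** 3) for s in sides]    # volume ~ L^3
log_area = [math.log10(6 * s ** 2) for s in sides]  # surface area ~ L^2

slope = ols_slope(log_volume, log_area)
print(f"log-log slope of area against volume: {slope:.4f}")
```

      A fitted slope whose confidence interval contains 2/3 is therefore compatible with shape being preserved as size changes, which is the sense in which the slopes reported above are described as isometric or close to it.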

      Referencing/discussion/attribution of previous findings

      5) With respect to the discussion of the relationship between cerebellar architecture and function, and given the emphasis here on correlated evolution with cortex, Ramnani's excellent review paper goes into the issues in considerable detail, which may also help the authors develop their own discussion: Ramnani (2006) The primate cortico-cerebellar system: anatomy and function. Nature Reviews Neuroscience 7, 511-522 (2006)

      We have added references to the work of Ramnani.

      6) The result that humans are outliers with a more folded cerebellum than expected is interesting and adds to recent findings highlighting evolutionary changes in the hominin human cerebellum, cerebellar genes, and epigenetics. Whilst Sereno et al (2020) are cited, it would be good to explain that they found that the human cerebellum has 80% of the surface area of the cortex.

      We have added this information to the introduction:

      “In humans, the cerebellum has ~80% of the surface area of the cerebral cortex (Sereno et al. 2020), and contains ~80% of all brain neurons, although it represents only ~10% of the brain mass (Azevedo et al. 2009)”

      7) It would surely also be relevant to highlight some of the molecular work here, such as Harrison & Montgomery (2017). Genetics of Cerebellar and Neocortical Expansion in Anthropoid Primates: A Comparative Approach. Brain Behav Evol. 2017;89(4):274-285. doi: 10.1159/000477432. Epub 2017 (especially since this paper looks at both cerebellar and cortical genes); also Guevara et al (2021) Comparative analysis reveals distinctive epigenetic features of the human cerebellum. PLoS Genet 17(5): e1009506. https://doi.org/10.1371/journal.pgen.1009506. Also relevant here is the complex folding anatomy of the dentate nucleus, which is the largest structure linking cerebellum to cortex: see Sultan et al (2010) The human dentate nucleus: a complex shape untangled. Neuroscience. 2010 Jun 2;167(4):965-8. doi: 10.1016/j.neuroscience.2010.03.007.

      The information is certainly important, and could have provided a wider perspective on cerebellar evolution, but we would prefer to keep a focus on cerebellar anatomy and address genetics only indirectly through phylogeny.

      8) The authors state that results confirm previous findings of a strong relationship between cerebellum and cortex (P 3 and p 16): the earliest reference given is Herculano-Houzel (2010), but this pattern was discovered ten years earlier (Barton & Harvey 2000 Nature 405, 1055-1058. https://doi.org/10.1038/35016580; Fig 1 in Barton 2002 Nature 415, 134-135 (2002). https://doi.org/10.1038/415134a) and elaborated by Whiting & Barton (2003) whose study explored in more detail the relationship between anatomical connections and correlated evolution within the cortico-cerebellar system (this paper is cited later, but only with reference to suggestions about the importance of functions of the cerebellum in the context of conservative structure, which is not its main point). In fact, Herculano-Houzel's analysis, whilst being the first to examine the question in terms of numbers of neurons, was inconclusive on that issue as it did not control for overall size or rest of the brain (A subsequent analysis using her data did, and confirmed the partially correlated evolution - Barton 2012, Philos Trans R Soc Lond B Biol Sci. 367:2097-107. doi: 10.1098/rstb.2012.0112.)

      We apologise for this oversight, these references are now included.

    1. Author Response

      Reviewer #2 (Public Review):

      The manuscript by Ma et al, "Two RNA-binding proteins mediate the sorting of miR223 from mitochondria into exosomes" examines the contribution of two RNA-binding proteins on the exosomal loading of miR223. The authors conclude that YBX1 and YBAP1 work in tandem to traffic and load miR223 into the exosome. The manuscript is interesting and potentially impactful. It proposes the following scenario regarding the exosomal loading of miR223: (1) YBAP1 sequesters miR223 in the mitochondria, (2) YBAP1 then transfers miR223 to YBX1, and (3) YBX1 then delivers miR223 into the early endosome for eventual secretion within an exosome. While the authors propose plausible explanations for this phenomenon, they do not specifically test them and no mechanism by which miR223 is shuttled between YBAP1 and YBX1, and the exosome is shown. Thus, the paper is missing critical mechanistic experiments that could have readily tested the speculative conclusions that it makes.

      Comments:

      1) The major limitation of this paper is that it fails to explore the mechanism of any of the major changes it describes. For example, the authors propose that miR223 shuttles from mitochondrially localized YBAP1 to P-body-associated YBX1 to the exosome. This needs to be tested directly and could be easily addressed by showing a transfer of miR223 from YBAP1 to YBX1 to the exosome.

      Testing this idea using fluorescently labeled miR223 would indeed be an ideal experiment. However, miRNA imaging presents challenges. As reviewer 1 pointed out, and we have now confirmed, the atto-647 dye itself localizes to mitochondria. We will continue our efforts to identify a suitable fluorescent label for miR223 in order to be in a position to evaluate the temporal relationship between mitochondrial and endosomal miR223.

      2) If YBAP1 retains miR223 in mitochondria, what is the trigger for YBAP1 to release it and pass it off to YBX1? The authors speculate in their discussion that sequestration of mito-miR223 plays a "role in some structural or regulatory process, perhaps essential for mitochondrial homeostasis, controlled by the selective extraction of unwanted miRNA into RNA granules and further by secretion in exosomes...". This is readily testable by altering mitochondria dynamics and/or integrity.

      A previous study has reported that YBAP1 can be released from mitochondria to the cytosol during HSV-1 infection (Song et al., 2021). However, due to restrictions, we are unable to conduct experiments using HSV to verify this condition. We attempted to induce mitochondrial stress by using different concentrations of CCCP, but we did not observe the release of YBAP1 from mitochondria after CCCP treatment. We speculate that not all mitochondrial stress conditions can trigger YBAP1 release. Investigating the mechanism of mito-miR223 release from mitochondria is one of our interests that we aim to explore in future studies.

      3) Much of the miRNA RT-PCR analysis is presented as a ratio of exosomal/cellular. This particular analysis assumes that cellular miRNA is unaffected by treatments. For example, Figure 1a shows that the presence of exosomal miR223 is significantly reduced when YBX1 is knocked out. This analysis does not consider the possibility that YBX1-KO alters (up or down-regulates) intracellular miR223 levels. Should that be the case, the ratiometric analysis is greatly skewed by intracellular miRNA changes. It would be better to not only show the intracellular levels of the miRs but also normalize the miRNA levels to the total amount of RNA isolated or an irrelevant/unchanged miRNA.

      Our previous publications demonstrated that miR223 levels are increased in YBX1-KO cells and decreased in exosomes derived from YBX1 KO cells. However, no significant changes were observed in miR190 levels (Liu et al., 2021; Shurtleff et al., 2016). The repeated data has been included in Figure 1a.

      For the analysis of other miRNAs by RT-PCR, we assessed changes in intracellular and exosomal miRNA levels in the corresponding figures. In the qPCR analysis, miRNA levels were normalized to the total amount of RNA.
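      The concern about ratiometric presentation raised in comment 3 can be made concrete with a small numerical sketch. All values below are hypothetical normalized levels in arbitrary units, not measurements from the study; they only show that an exosomal/cellular ratio can halve while the exosomal level itself is unchanged, which is why reporting both levels separately is informative.

```python
def ev_to_cell_ratio(exosomal, cellular):
    """Exosomal level divided by cellular level (both in the same units)."""
    if cellular <= 0:
        raise ValueError("cellular level must be positive")
    return exosomal / cellular


# Hypothetical miRNA levels normalized to total RNA (arbitrary units).
control = {"cellular": 1.0, "exosomal": 1.0}
treated = {"cellular": 2.0, "exosomal": 1.0}  # only the cellular level changes

ratio_control = ev_to_cell_ratio(control["exosomal"], control["cellular"])
ratio_treated = ev_to_cell_ratio(treated["exosomal"], treated["cellular"])

fold_change_ratio = ratio_treated / ratio_control
fold_change_exosomal = treated["exosomal"] / control["exosomal"]

print(f"fold change of the EV/cell ratio: {fold_change_ratio}")
print(f"fold change of the exosomal level: {fold_change_exosomal}")
```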

      4) In figure 1, the authors show that in YBX1-KO cells, miR223 levels are decreased in the exosome. They further suggest this is because YBX1 binds with high affinity to miR223. This binding is compared to miR190 which the authors state is not enriched in the exosome. However, no data showing that miR190 is not present in the exosome is shown. A figure showing the amount of cellular and exosomal miR223 and 190 should be shown together on the same graph.

      In previous publications we demonstrated that miR190 is not localized in exosomes and not significantly changed in YBX1 knockout (KO) cells and exosomes derived from YBX1 KO cells (Liu et al., 2021; Shurtleff et al., 2016). The repeated data has been included in Figure 1a.

      5) Figure 2 Supplement 1 - As to determine the nucleotides responsible for interacting with YBX1, the authors made several mutations within the miR223 sequence. However, no explanation is given regarding the mutant sequences used or what the ratios mean. Mutant sequences need to be included. How do the authors conclude that UCAGU is important when the locations of the mutations are unclear? Also, the interpretation of this data would benefit from a binding affinity curve as shown in Fig 2C.

      The ratio is of labeled miR223/unlabeled miR223 (wt and mutant). All mutant sequences of miR223 have been included in Figure 2 supplement 1.

      6) While the binding of miR223mut to YBX1 is reduced, there is still significant binding. Does this mean that the 5nt binding motif is not exact? Do the authors know if there are multiple nucleotide possibilities at these positions that could facilitate binding? Perhaps confirming binding "in vivo" via RIP assay would further solidify the UCAGU motif as critical for binding to YBX1.

      The binding affinity of miR223mut with YBX1 is reduced approximately 27-fold compared to miR223. We speculate that the secondary structure of miR223 may contribute to the interaction with YBX1.

      Our EMSA data, in vitro packaging data, and exosome analysis reinforce the conclusion that UCAGU is critical for YBX1 binding. These findings suggest that the presence of the UCAGU motif in miR223 is crucial for its interaction with YBX1 and subsequent sorting into exosomes.

      7) Figures 2g, h - It would be nice to show that miR190mut also packages in the cell-free system. This would confirm that the sequence is responsible. Also, to confirm that the sorting of miR223 is YBX1-dependent, a cell-free reaction using cytosol and membranes from YBX1 KO cells is needed.

      Although we have not performed the suggested experiment, we purified exosomes from cells overexpressing miR190sort and observed an increase in the enrichment of miR190sort in exosomes compared to miR190. This finding confirmed that the UCAGU motif facilitates miRNA sorting into exosomes.

      Regarding the in vitro packaging assay, our previously published paper demonstrated that cytosol from YBX1 knockout (KO) cells significantly reduces the protection of miR223 from RNase digestion. We concluded that the sorting of miR223 into exosomes is dependent on YBX1 (Shurtleff et al., 2016).

      8) In Figure 3a, the authors show that miR223 is mitochondrially localized. Does the sequence of miR223 (WT or Mut) matter for localization? Does it matter for shuttling between YBAP1 and YBX1?

      The localization of miR223mut has not been tested in our current study. We plan to conduct these experiments in the future.

      9) Supplement 3c - Is it strange that miR190 is not localized to any particular compartment? Is miR190 present ubiquitously and equally among all intracellular compartments?

      Most mature miRNAs are predominantly localized in the cytoplasm. Although there is no specific subcellular localization reported for miR190 in the literature, our experimental findings indicate a relatively high expression of miR190 in 293T cells. It is likely that most of miR190 is localized in the cytosol. However, it is also possible that a small fraction of miR190 may associate with a membrane, which could explain its distribution in various subcellular structures. Importantly, we did not observe enrichment of miR190 in the mitochondria or exosomes.

      10) Figure 3h - Why would the miR223 levels increase if you remove mitochondria? Does CCCP also cause miR223 upregulation? I would have thought miR223 would just be mis-localized to the cytosol.

      We report that the levels of cytoplasmic miR223 increase following the removal of mitochondria using CCCP treatment. While we cannot rule out the possibility that upregulation of miR223 is directly caused by CCCP treatment, we suggest that miR223 becomes mis-localized to the cytosol upon mitochondrial removal. Our data suggests that mitochondria contribute to the secretion of miR223 into exosomes. When mitochondria are removed by mitophagy, cytosolic miR223 is not efficiently secreted, which provides an alternative explanation for the observed increase in miR223 level after mitochondrial removal.

      11) Figure 3i - What is the meaning of "Urd" in the figure label? This isn't mentioned anywhere.

      “Urd” represents Uridine. Uridine is now spelled out in figure 3i. The absence of mitochondria can impact the function of the mitochondrial enzyme dihydroorotate dehydrogenase, which plays a role in pyrimidine synthesis. To address this issue, one approach is to supplement the cell culture medium with Urd. A previous study demonstrated that primary fibroblasts showed positive responses when Urd was added to the cell culture medium, resulting in improved cell viability for extended periods of time (Correia-Melo et al., 2017).

      12) Figure 3j - The data is presented as a ratio of EV/cell. Again, this inaccurately represents the amount of miR223 in the EV. This issue is apparent when looking at Figures 3h and 3j. In 3h, CCCP causes an upregulation of intracellular miR223. As such, the presumed decrease in EV miR223 after CCCP (3j) could be an artifact due to increased levels of intracellular miR223. Both intracellular and EV levels of miRs need to be shown.

      Both the intracellular and exosomal levels of miR223 have been included in Figure 3j.

      13) In Figure 4, the authors show that when overexpressed, YBX1 will pulldown YBAP1. Can the authors comment as to why none of the earlier purifications show this finding (Figure 1 for example)? Even more curious is that when YBAP1 is purified, YBX1 does not co-purify (Figure 4 supplement 1a, b).

      In Figure 4a-b, human YBX1 fused with a Strep II tag was purified from 293T cells using Strep-Tactin® Sepharose® resin in a one-step purification process. Our data has shown that YBAP1 is expressed in 293T cells.

      In Figure 1 and Figure 4 Supplement 1a, human YBX1 or YBAP1 fused with His and MBP tags were purified from insect cells using a three-step purification process involving Ni-NTA His-Pur resin, amylose resin, and Superdex-200 gel filtration chromatography.

      One possibility is that human YBX1 or YBAP1 may not interact well with insect YBAP1 or YBX1, which could result in separate tagged forms of YBX1 or YBAP1 isolated from insect cells.

      Another possibility is that the expression levels of insect YBAP1 and YBX1 may be too low. Consequently, tagged forms of YBX1 or YBAP1 expressed in insect cells may copurify with partners not readily detected by Coomassie blue stain. However, in Figure 4 Supplement 1b, human YBX1 fused with His and MBP tags was co-expressed with non-tagged human YBAP1, and both bands of YBX1 and YBAP1 were visible on the Coomassie blue gel after purification using Ni-NTA His-Pur resin, amylose resin, and Superdex-200 gel filtration chromatography.

      14) Figure 4f, g - The text associated with these figures is very confusing, as is the labeling for the input. Also, what is "miR223 Fold change" in this regard? Seeing as your IgG should not have IP'd anything, normalizing to IgG can amplify noise. As such, RIP assays are typically presented as % input or fold enrichment.

      The RIP assay results have been calculated and presented as a % input in Figure 4g.
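      For orientation, one widely used convention for expressing qPCR-based RIP results as % input is sketched below. It assumes roughly 100% amplification efficiency (a doubling per cycle) and that the input sample represents a known fraction of the material used for the IP. The Ct values and the 10% input fraction are hypothetical, and this is not necessarily the exact calculation used for Figure 4g.

```python
import math


def percent_input(ct_ip, ct_input, input_fraction):
    """Percent input, assuming a doubling of product per PCR cycle.

    The input Ct is first adjusted to what it would be for 100% of the
    starting material, then compared with the Ct of the IP sample.
    """
    if not 0 < input_fraction <= 1:
        raise ValueError("input_fraction must be in (0, 1]")
    adjusted_input_ct = ct_input - math.log2(1.0 / input_fraction)
    return 100.0 * 2.0 ** (adjusted_input_ct - ct_ip)


# Hypothetical Ct values, for illustration only.
pct = percent_input(ct_ip=27.0, ct_input=25.0, input_fraction=0.10)
print(f"{pct:.2f}% of input recovered in the IP")
```

      Unlike normalization to an IgG control, this expression does not divide by a value that is expected to be close to background, which is the noise-amplification issue raised in comment 14.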

      15) Figure 4h - The authors show binding between miR223 and YBAP1 however it is not clear how significant this binding is. There is more than a 30-fold difference in binding affinity between miR223 and YBX1 than between miR223 and YBAP1. Even more, when comparing the EMSAs and fraction bound from figures 1 and 2 to those of Figure 4h, the binding between miR223 and YBAP1 more closely resembles that of miR190 and YBX1, which the authors state is a non-binder of YBX1. The authors will need to reconcile these discrepancies.

      We agree that YBAP1 and YBX1 differ quite significantly in the affinity of their interaction with miR223. It is difficult to draw conclusions from a comparison of the affinities of YBX1 for miR190 and YBAP1 for miR223. Nonetheless, a quantitative difference in the interaction of YBAP1 with miR223 and miR190 is apparent (Fig. 4 h, i, j), and we observed no enrichment of miR190 in isolated mitochondria (Fig. 3 supplement 1a), whereas YBAP1 selectively IP’d miR223 from isolated mitochondria (Fig. 4 f and g).

      16) Can the authors present the Kd values for EMSA data?

      The Kd values for the EMSA data have been added to the respective figures.

      17) Figure 5 - Does YBAP1-KO affect mitochondrial protein integrity or numbers?

      We generated stable cell lines expressing 3xHA-GFP-OMP25 in both 293T WT and YBAP1-KO cells, but we did not observe any alterations in mitochondrial morphology (Author response image 1).

      Author response image 1.

      Additionally, we compared different mitochondrial markers by immunoblot in 293T WT cells and YBAP1-KO cells and did not observe any changes in these markers (these data have been included in Figure 5b).

      18) Figure 6a - Are the authors using YBAP1 as their mitochondrial marker? Please include TOM20 and/or 22.

      In Figure 4c and 4e, our data clearly demonstrate that the majority of YBAP1 is localized in the mitochondria.

      To further validate this localization, we performed immunofluorescence staining using antibodies against endogenous Tom20 and YBX1. The immunofluorescence images document YBX1 associated with mitochondria (Author response image 2 and new Fig 6a.).

      Author response image 2.

      19) Figure 6b - Rab5 is an early endosome marker and may not fully represent the organelles that become MVBs. Co-localization at this point does not suggest that associating proteins will be present in the exosome, and it is possible that the authors are looking at the precursor of a recycling endosome. Even more, exosome loading does not occur at the early endosome, but instead at the MVB. Perhaps looking at markers of the late endosome such as Rab7 or ideally markers of the MVB such as M6P or CD63 would help draw an association between YBX1, YBAP1, and the exosome. Also, If the authors want to make the claim that interactions at the early endosome leads to secretion as an exosome, the authors should show that isolated EVs from Rab5Q79L-expressing cells contain miR223.

      We have previously used overexpressed Rab5(Q79L) to monitor the localization of exosomal content, specifically CD63 and YBX1, in enlarged endosomes (Liu et al. 2021, Fig. 4A, B). These endosomes exhibit a mixture of early and late endocytic markers, including CD63 (Wegner et al., 2010). Hence, the presence of Rab5(Q79L)-positive enlarged endosomes does not solely indicate early endosomes.

      20) The mentioning of P-bodies is interesting but at no time is an association addressed. This is therefore an overly speculative conclusion. Either show an association or leave this out of the manuscript.

      In a previous paper we demonstrated that YBX1 puncta colocalize with P-body markers EDC4, Dcp1 and DDX6 (Liu et al., 2021).

      21) In lines 55-58, the authors make the comment "However, many of these studies used sedimentation at ~100,000 g to collect EVs, which may also collect RNP particles not enclosed within membranes which complicates the interpretation of these data." Do RNPs not dissolve when secreted? Can the authors give a reference for this statement?

      In a previous paper, we demonstrated that the RNP Ago2 does not dissolve in the conditioned medium and is not in vesicles but sediments to the bottom of a density gradient (Temoche-Diaz et al., 2019).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Strengths:

      The study was designed as a 6-month follow-up, with repeated behavioral and EEG measurements through disease development, providing valuable and interesting findings on AD progression and the effect of early-life choline supplementation. Moreover, the behavioral data that suggest an adverse effect of low choline in WT mice are interesting and important beyond the context of AD.

      Thank you for identifying several strengths.

      Weaknesses:

      (1) The multiple headings and subheadings, focusing on the experimental method rather than the narrative, reduce the readability.

      We have reduced the number of headings.

      (2) Quantification of NeuN and FosB in WT littermates is needed to demonstrate rescue of neuronal death and hyperexcitability by high choline supplementation and also to gain further insights into the adverse effect of low choline on the performance of WT mice in the behavioral test.

      We agree and have added WT data for the NeuN and ΔFosB analyses. These data are included in the text and figures. For NeuN, the Figure is Figure 6. For ΔFosB it is Figure 7. In brief, the high choline diet restored NeuN and ΔFosB to the levels of WT mice.

      Below is Figure 6 and its legend to show the revised presentation of data for NeuN. Afterwards is the revised figure showing data for ΔFosB. After that are the sections of the Results that have been revised.

      Author response image 1.

      Choline supplementation improved NeuN immunoreactivity (ir) in hilar cells in Tg2576 animals. A. Representative images of NeuN-ir staining in the anterior DG of Tg2576 animals. (1) A section from a Tg2576 mouse fed the low choline diet. The area surrounded by a box is expanded below. Red arrows point to NeuN-ir hilar cells. Mol=molecular layer, GCL=granule cell layer, HIL=hilus. Calibration for the top row, 100 µm; for the bottom row, 50 µm. (2) A section from a Tg2576 mouse fed the intermediate diet. Same calibrations as for 1. (3) A section from a Tg2576 mouse fed the high choline diet. Same calibrations as for 1. B. Quantification methods. Representative images demonstrate the thresholding criteria used to quantify NeuN-ir. (1) A NeuN-stained section. The area surrounded by the white box is expanded in the inset (arrow) to show 3 hilar cells. The 2 NeuN-ir cells above threshold are marked by blue arrows. The 1 NeuN-ir cell below threshold is marked by a green arrow. (2) After converting the image to grayscale, the cells above threshold were designated as red. The inset shows that the two cells that were marked by blue arrows are red while the cell below threshold is not. (3) An example of the threshold menu from ImageJ showing the way the threshold was set. Sliders (red circles) were used to move the threshold to the left or right of the histogram of intensity values. The final position of the slider (red arrow) was positioned at the onset of the steep rise of the histogram. C. NeuN-ir in Tg2576 and WT mice. Tg2576 mice had either the low, intermediate, or high choline diet in early life. WT mice were fed the standard diet (intermediate choline). (1) Tg2576 mice treated with the high choline diet had significantly more hilar NeuN-ir cells in the anterior DG compared to Tg2576 mice that had been fed the low choline or intermediate diet. 
The values for Tg2576 mice that received the high choline diet were not significantly different from WT mice, suggesting that the high choline diet restored NeuN-ir. (2) There was no effect of diet or genotype in the posterior DG, probably because the low choline and intermediate diet did not appear to lower hilar NeuN-ir.

      Author response image 2.

      Choline supplementation reduced ∆FosB expression in dorsal GCs of Tg2576 mice. A. Representative images of ∆FosB staining in GCL of Tg2576 animals from each treatment group. (1) A section from a low choline-treated mouse shows robust ∆FosB-ir in the GCL. Calibration, 100 µm. Sections from intermediate (2) and high choline (3)-treated mice. Same calibration as 1. B. Quantification methods. Representative images demonstrating the thresholding criteria established to quantify ∆FosB. (1) A ∆FosB-stained section shows strongly-stained cells (white arrows). (2) A strict thresholding criterion was used to make only the darkest stained cells red. C. Use of the strict threshold to quantify ∆FosB-ir. (1) Anterior DG. Tg2576 mice treated with the choline supplemented diet had significantly less ∆FosB-ir compared to the Tg2576 mice fed the low or intermediate diets. Tg2576 mice fed the high choline diet were not significantly different from WT mice, suggesting a rescue of ∆FosB-ir. (2) There were no significant differences in ∆FosB-ir in posterior sections. D. Methods are shown using a threshold that was less strict. (1) Some of the stained cells that were included are not as dark as those used for the strict threshold (white arrows). (2) All cells above the less conservative threshold are shown in red. E. Use of the less strict threshold to quantify ∆FosB-ir. (1) Anterior DG. Tg2576 mice that were fed the high choline diet had fewer ΔFosB-ir pixels than the mice that were fed the other diets. There were no differences from WT mice, suggesting restoration of ∆FosB-ir by choline enrichment in early life. (2) Posterior DG. There were no significant differences between Tg2576 mice fed the 3 diets or WT mice.

      Results, Section C1, starting on Line 691:

      “To ask if the improvement in NeuN after MCS in Tg2576 restored NeuN to WT levels, we used WT mice. For this analysis we used a one-way ANOVA with 4 groups: Low choline Tg2576, Intermediate Tg2576, High choline Tg2576, and Intermediate WT (Figure 5C). Tukey-Kramer multiple comparisons tests were used as the post hoc tests. The WT mice were fed the intermediate diet because it is the standard mouse chow, and this group was intended to reflect normal mice. The results showed a significant group difference for anterior DG (F(3,25)=9.20; p=0.0003; Figure 5C1) but not posterior DG (F(3,28)=0.867; p=0.450; Figure 5C2). Regarding the anterior DG, there were more NeuN-ir cells in high choline-treated mice than both low choline (p=0.046) and intermediate choline-treated Tg2576 mice (p=0.003). WT mice had more NeuN-ir cells than Tg2576 mice fed the low (p=0.011) or intermediate diet (p=0.003). Tg2576 mice that were fed the high choline diet were not significantly different from WT (p=0.827).”

      Results, Section C2, starting on Line 722:

      “There was strong expression of ∆FosB in Tg2576 GCs in mice fed the low choline diet (Figure 7A1). The high choline diet and intermediate diet appeared to show less GCL ΔFosB-ir (Figure 7A2-3). A two-way ANOVA was conducted with the experimental group (Tg2576 low choline diet, Tg2576 intermediate choline diet, Tg2576 high choline diet, WT intermediate choline diet) and location (anterior or posterior) as main factors. There was a significant effect of group (F(3,32)=13.80, p<0.0001) and location (F(1,32)=8.69, p=0.006). Tukey-Kramer post-hoc tests showed that Tg2576 mice fed the low choline diet had significantly greater ΔFosB-ir than Tg2576 mice fed the high choline diet (p=0.0005) and WT mice (p=0.0007). Tg2576 mice fed the low and intermediate diets were not significantly different (p=0.275). Tg2576 mice fed the high choline diet were not significantly different from WT (p>0.999). There were no differences between groups for the posterior DG (all p>0.05).”

      “∆FosB quantification was repeated with a lower threshold to define ∆FosB-ir GCs (see Methods) and results were the same (Figure 7D). Two-way ANOVA showed a significant effect of group (F(3,32)=14.28, p<0.0001) and location (F(1,32)=7.07, p=0.0122) for anterior DG but not posterior DG (Figure 7D). For anterior sections, Tukey-Kramer post hoc tests showed that low choline mice had greater ΔFosB-ir than high choline mice (p=0.0024) and WT mice (p=0.005) but not Tg2576 mice fed the intermediate diet (p=0.275; Figure 7D1). Mice fed the high choline diet were not significantly different from WT (p=0.993; Figure 7D1). These data suggest that high choline in the diet early in life can reduce neuronal activity of GCs in offspring later in life. In addition, low choline has an opposite effect, suggesting low choline in early life has adverse effects.”

      (3) Quantification of the discrimination ratio of the novel object and novel location tests can facilitate the comparison between the different genotypes and diets.

      We have added the discrimination index for novel object location to the paper. The data are in a new figure: Figure 3. In brief, the results for discrimination index are the same as the results done originally, based on the analysis of percent of time exploring the novel object.

      Below is the new Figure and legend, followed by the new text in the Results.

      Author response image 3.

      Novel object location results based on the discrimination index. A. Results are shown for the 3 months-old WT and Tg2576 mice based on the discrimination index. (1) Mice fed the low choline diet showed object location memory only in WT. (2) Mice fed the intermediate diet showed object location memory only in WT. (3) Mice fed the high choline diet showed memory both for WT and Tg2576 mice. Therefore, the high choline diet improved memory in Tg2576 mice. B. The results for the 6 months-old mice are shown. (1-2) There was no significant memory demonstrated by mice that were fed either the low or intermediate choline diet. (3) Mice fed a diet enriched in choline showed memory whether they were WT or Tg2576 mice. Therefore, choline enrichment improved memory in all mice.

      Results, Section B1, starting on line 536:

      “The discrimination indices are shown in Figure 3 and results led to the same conclusions as the analyses in Figure 2. For the 3 months-old mice (Figure 3A), the low choline group did not show the ability to perform the task for WT or Tg2576 mice. Thus, a two-way ANOVA showed no effect of genotype (F(1,74)=0.027, p=0.870) or task phase (F(1,74)=1.41, p=0.239). For the intermediate diet-treated mice, there was no effect of genotype (F(1,50)=3.52, p=0.067) but there was an effect of task phase (F(1,50)=8.33, p=0.006). WT mice showed a greater discrimination index during testing relative to training (p=0.019) but Tg2576 mice did not (p=0.664). Therefore, Tg2576 mice fed the intermediate diet were impaired. In contrast, high choline-treated mice performed well. There was a main effect of task phase (F(1,68)=39.61, p<0.001) with WT (p<0.0001) and Tg2576 mice (p=0.0002) showing preference for the moved object in the test phase. Interestingly, there was a main effect of genotype (F(1,68)=4.50, p=0.038) because the discrimination index for WT training was significantly different from Tg2576 testing (p<0.0001) and Tg2576 training was significantly different from WT testing (p=0.0003).”

      “The discrimination indices of 6 months-old mice led to the same conclusions as the results in Figure 2. There was no evidence of discrimination in low choline-treated mice by two-way ANOVA (no effect of genotype, F(1,42)=3.25, p=0.079; no effect of task phase, F(1,42)=0.278, p=0.601). The same was true of mice fed the intermediate diet (genotype, F(1,12)=1.44, p=0.253; task phase, F(1,12)=2.64, p=0.130). However, both WT and Tg2576 mice performed well after being fed the high choline diet (effect of task phase, F(1,52)=58.75, p=0.0001, but not genotype, F(1,52)=1.197, p=0.279). Tukey-Kramer post-hoc tests showed that both WT (p<0.0001) and Tg2576 mice that had received the high choline diet (p=0.0005) had elevated discrimination indices for the test session.”
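      The formula for the discrimination index is not given in this response. As a minimal sketch, assuming the commonly used definition (time with the novel or moved object minus time with the familiar object, divided by total exploration time), the index and the percent-time measure can be computed as follows. The function names and the example times are hypothetical.

```python
def discrimination_index(novel_s: float, familiar_s: float) -> float:
    """Assumed definition: (novel - familiar) / (novel + familiar), range -1 to 1."""
    total = novel_s + familiar_s
    if total <= 0:
        raise ValueError("total exploration time must be positive")
    return (novel_s - familiar_s) / total


def percent_novel(novel_s: float, familiar_s: float) -> float:
    """Percent of total exploration time spent on the novel (or moved) object."""
    total = novel_s + familiar_s
    if total <= 0:
        raise ValueError("total exploration time must be positive")
    return 100.0 * novel_s / total


# Hypothetical mouse: 30 s exploring the moved object, 10 s exploring the unmoved one.
di = discrimination_index(30.0, 10.0)
pct = percent_novel(30.0, 10.0)
```

      Under this assumed definition the index is a linear rescaling of the percent-time measure (index = 2 × percent/100 − 1), which would be consistent with the two analyses leading to the same conclusions.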

      (4) The longitudinal analyses enable the performance of multi-level correlations between the discrimination ratio in NOR and NOL, NeuN and Fos levels, multiple EEG parameters, and premature death. Such analysis can potentially identify biomarkers associated with AD progression. These can be interesting in different choline supplementation, but also in the standard choline diet.

      We agree and added correlations to the paper in a new figure (Figure 9). Below is Figure 9 and its legend. Afterwards is the new Results section.

      Author response image 4.

      Correlations between IIS, Behavior, and hilar NeuN-ir. A. IIS frequency over 24 hrs is plotted against the preference for the novel object in the test phase of NOL. A greater preference is reflected by a greater percentage of time exploring the novel object. (1) The mice fed the high choline diet (red) showed greater preference for the novel object when IIS were low. These data suggest IIS impaired object location memory in the high choline-treated mice. The low choline-treated mice had very weak preference and very few IIS, potentially explaining the lack of correlation in these mice. (2) There were no significant correlations for IIS and NOR. However, there were only 4 mice for the high choline group, which is a limitation. B. IIS frequency over 24 hrs is plotted against the number of dorsal hilar cells expressing NeuN. The dorsal hilus was used because there was no effect of diet on the posterior hilus. (1) Hilar NeuN-ir is plotted against the preference for the novel object in the test phase of NOL. There were no significant correlations. (2) Hilar NeuN-ir was greater for mice that had better performance in NOR, both for the low choline (blue) and high choline (red) groups. These data support the idea that hilar cells contribute to object recognition (Kesner et al. 2015; Botterill et al. 2021; GoodSmith et al. 2022).

      Results, Section F, starting on Line 801:

      “F. Correlations between IIS and other measurements

      As shown in Figure 9A, IIS were correlated to behavioral performance in some conditions. For these correlations, only mice that were fed the low and high choline diets were included because mice that were fed the intermediate diet did not have sufficient EEG recordings in the same mouse where behavior was studied. IIS frequency over 24 hrs was plotted against the preference for the novel object in the test phase (Figure 9A). For NOL, IIS were significantly less frequent when behavior was the best, but only for the high choline-treated mice (Pearson’s r, p=0.022). In the low choline group, behavioral performance was poor regardless of IIS frequency (Pearson’s r, p=0.933; Figure 9A1). For NOR, there were no significant correlations (low choline, p=0.202; high choline, p=0.680) but few mice were tested in the high choline-treated group (Figure 9B2).

      We also tested whether there were correlations between dorsal hilar NeuN-ir cell numbers and IIS frequency. In Figure 9B, IIS frequency over 24 hrs was plotted against the number of dorsal hilar cells expressing NeuN. The dorsal hilus was used because there was no effect of diet on the posterior hilus. For NOL, there was no significant correlation (low choline, p=0.273; high choline, p=0.159; Figure 9B1). However, for NOR, there were more NeuN-ir hilar cells when the behavioral performance was strongest (low choline, p=0.024; high choline, p=0.016; Figure 9B2). These data support prior studies showing that hilar cells, especially mossy cells (the majority of hilar neurons), contribute to object recognition (Botterill et al. 2021; GoodSmith et al. 2022).”
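      The correlations quoted above use Pearson's r on paired per-mouse values. The following is a minimal sketch of that calculation, not the authors' analysis code. The function names and the example values are hypothetical. The p-value would be obtained from a t distribution with n − 2 degrees of freedom, which is not computed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length, n >= 3")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical per-mouse values: IIS per 24 hrs and % time exploring the novel object.
iis_per_24h = [1, 2, 3, 4, 5]
preference = [2, 1, 4, 3, 5]
r = pearson_r(iis_per_24h, preference)
t = t_statistic(r, len(iis_per_24h))
```

      With only a few mice per group, as noted for the high choline NOR group, the t statistic has very few degrees of freedom, so a non-significant result is weak evidence for the absence of a correlation.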

      We also noted that not all mice could be included, either because they died or for other reasons, such as loss of the headset (Results, Section A, Lines 463-464): Some mice were not possible to include in all assays either because they died before reaching 6 months or for other reasons.

      Reviewer #2 (Public Review):

      Strengths:

      The strength of the group was the ability to monitor the incidence of interictal spikes (IIS) over the course of 1.2-6 months in the Tg2576 Alzheimer's disease model, combined with meaningful behavioral and histological measures. The authors were able to demonstrate MCS had protective effects in Tg2576 mice, which was particularly convincing in the hippocampal novel object location task.

      We thank the Reviewer for identifying several strengths.

      Weaknesses:

      Although choline deficiency was associated with impaired learning and elevated FosB expression, consistent with increased hyperexcitability, IIS was reduced with both low and high choline diets. Although not necessarily a weakness, it complicates the interpretation and requires further evaluation.

      We agree and we revised the paper to address the evaluations that were suggested.

      Reviewer #1 (Recommendations For The Authors):

      (1) A reference directing to genotyping of Tg2576 mice is missing.

      We apologize for the oversight and added that the mice were genotyped by the New York University Mouse Genotyping core facility.

      Methods, Section A, Lines 210-211: “Genotypes were determined by the New York University Mouse Genotyping Core facility using a protocol to detect APP695.”

      (2) Which software was used to track the mice in the behavioral tests?

      We manually reviewed videos. This has been clarified in the revised manuscript. Methods, Section B4, Lines 268-270: Videos of the training and testing sessions were analyzed manually. A subset of data was analyzed by two independent blinded investigators and they were in agreement.

      (3) Unexpectedly, a low choline diet in AD mice was associated with reduced frequency of interictal spikes yet increased mortality and spontaneous seizures. The authors attribute this to postictal suppression.

      We did not intend to suggest that postictal depression was the only cause. It was one of many potential explanations for why seizures would influence IIS frequency: postictal depression could transiently reduce IIS. We have clarified the text so this is clear (Discussion, starting on Line 960):

      If mice were unhealthy, IIS might have been reduced due to impaired excitatory synaptic function. Another reason for reduced IIS is that the mice that had the low choline diet had seizures which interrupted REM sleep. Thus, seizures in Tg2576 mice typically started in sleep. Less REM sleep would reduce IIS because IIS occur primarily in REM. Also, seizures in the Tg2576 mice were followed by a depression of the EEG (postictal depression; Supplemental Figure 3) that would transiently reduce IIS. A different, radical explanation is that the intermediate diet promoted IIS rather than low choline reducing IIS. Instead of choline, a constituent of the intermediate diet may have promoted IIS.

      However, reduced spike frequency is already evident at 5 weeks of age, a time point with a low occurrence of premature death. A more comprehensive analysis of EEG background activity may provide additional information if the epileptic activity is indeed reduced at this age.

      We did not intend to suggest that premature death caused reduced spike frequency. We have clarified the paper accordingly. We agree that a more in-depth EEG analysis would be useful but is beyond the scope of the study.

      (4) Supplementary Fig. 3 depicts far more spikes / 24 h compared to Fig. 7B (at least 100 spikes/24h in Supplementary Fig. 3 and less than 10 spikes/24h in Fig. 7B).

      We would like to clarify that before and after a seizure the spike frequency is unusually high. Therefore, there are far more spikes than in prior figures.

      We clarified this issue by adding more data to the Supplemental Figure. The additional data are from mice without a seizure, showing their spikes are low in frequency.

      All recordings lasted several days. We included the data from mice with a seizure on one of the days and mice without any seizures. For mice with a seizure, we graphed IIS frequency for the day before, the day of the seizure, and the day after. For mice without a seizure, IIS frequency is plotted for 3 consecutive days. When there was a seizure, the day before and after showed high numbers of spikes. When there was no seizure on any of the 3 days, spikes were infrequent on all days.

      The revised figure and legend are shown below. It is Supplemental Figure 4 in the revised submission.

      Author response image 5.

      IIS frequency before and after seizures. A. Representative EEG traces recorded from electrodes implanted in the skull over the left frontal cortex, right occipital cortex, left hippocampus (Hippo) and right hippocampus during a spontaneous seizure in a 5 months-old Tg2576 mouse. Arrows point to the start (green arrow) and end of the seizure (red arrow), and postictal depression (blue arrow). B. IIS frequency was quantified from continuous video-EEG for mice that had a spontaneous seizure during the recording period and mice that did not. IIS frequency is plotted for 3 consecutive days, starting with the day before the seizure (designated as day 1), and ending with the day after the seizure (day 3). A two-way RMANOVA was conducted with the day and group (mice with or without a seizure) as main factors. There was a significant effect of day (F(2,4)=46.95, p=0.002) and group (seizure vs no seizure; F(1,2)=46.01, p=0.021) and an interaction of factors (F(2,4)=46.68, p=0.002). Tukey-Kramer post-hoc tests showed that mice with a seizure had significantly greater IIS frequencies than mice without a seizure for every day (day 1, p=0.0005; day 2, p=0.0001; day 3, p=0.0014). For mice with a seizure, IIS frequency was higher on the day of the seizure than the day before (p=0.037) or after (p=0.010). For mice without a seizure, there were no significant differences in IIS frequency for day 1, 2, or 3. These data are similar to prior work showing that from one day to the next mice without seizures have similar IIS frequencies (Kam et al., 2016).

      In the text, the revised section is in the Results, Section C, starting on Line 772:

      “At 5-6 months, IIS frequencies were not significantly different in the mice fed the different diets (all p>0.05), probably because IIS frequency becomes increasingly variable with age (Kam et al. 2016). One source of variability is seizures, because there was a sharp increase in IIS during the day before and after a seizure (Supplemental Figure 4). Another reason that the diets failed to show differences was that the IIS frequency generally declined at 5-6 months. This can be appreciated in Figure 8B and Supplemental Figure 6B. These data are consistent with prior studies of Tg2576 mice where IIS increased from 1 to 3 months but then waxed and waned afterwards (Kam et al., 2016).”

      (5) The data indicating the protective effect of high choline supplementation are valuable, yet some of the claims are not completely supported by the data, mainly as the analysis of littermate WT mice is not complete.

      We added WT data to show that the high choline diet restored cell loss and ΔFosB expression to WT levels. These data strengthen the argument that the high choline diet was valuable. See the response to Reviewer #1, Public Review Point #2.

      • Line 591: "The results suggest that choline enrichment protected hilar neurons from NeuN loss in Tg2576 mice." A comparison to NeuN expression in WT mice is needed to make this statement.

      These data have been added. See the response to Reviewer #1, Public Review Point #2.

      • Line 623: "These data suggest that high choline in the diet early in life can reduce hyperexcitability of GCs in offspring later in life. In addition, low choline has an opposite effect, again suggesting this maternal diet has adverse effects." Also here, FosB quantification in WT mice is needed.

      These data have been added. See the response to Reviewer #1, Public Review Point #2.

      (7) Was the effect of choline associated with reduced tauopathy or Aβ levels?

      The mice have no detectable hyperphosphorylated tau. The mice do have intracellular Aβ before 6 months. This is especially the case in hilar neurons, but GCs have little (Criscuolo et al., eNeuro, 2023). However, in neurons that have reduced NeuN, we found previously that antibodies generally do not work well. We think it is because the neurons become pyknotic (Duffy et al., 2015), a condition associated with oxidative stress which causes antigens like NeuN to change conformation due to phosphorylation. Therefore, we did not conduct a comparison of hilar neurons across the different diets.

      (8) Since the mice were tested at 3 months and 6 months, it would be interesting to see the behavioral difference per mouse and the correlation with EEG recording and immunohistological analyses.

      We agree that would be valuable and this has been added to the paper. Please see response to Reviewer #1, Public Review Point #4.

      Reviewer #2 (Recommendations For The Authors):

      There were several areas that could be further improved, particularly in the areas of data analysis (particularly with images and supplemental figures), figure presentation, and mechanistic speculation.

      Major points:

      (1) It is understandable that, for the sake of labor and expense, WT mice were not implanted with EEG electrodes, particularly since previous work showed that WT mice have no IIS (Kam et al. 2016). However, from a standpoint of full factorial experimental design, there are several flaws - purists would argue are fatal flaws. First, the lack of WT groups creates underpowered and imbalanced groups, constraining statistical comparisons and likely reducing the significance of the results. Also, it is an assumption that diet does not influence IIS in WT mice. Secondly, with a within-subject experimental design (as described in Fig. 1A), 6-month-old mice are not naïve if they have previously been tested at 3 months. Such an experimental design may reduce effect size compared to non-naïve mice. These caveats should be included in the Discussion. It is likely that these caveats reduce effect size and that the actual statistical significance, were the experimental design perfect, would be higher overall.

      We agree and have added these points to the Limitations section of the Discussion. Starting on Line 1050: In addition, groups were not exactly matched. Although WT mice do not have IIS, a WT group for each of the Tg2576 groups would have been useful. Instead, we included WT mice for the behavioral tasks and some of the anatomical assays. Related to this point is that several mice died during the long-term EEG monitoring of IIS.

      (2) Since behavior, EEG, NeuN and FosB experiments seem to be done on every Tg2576 animal, it seems that there are missed opportunities to correlate behavior/EEG and histology on a per-mouse basis. For example, rather than speculate in the discussion, why not (for example) directly examine relationships between IIS/24 hours and FosB expression?

      We addressed this point above in responding to Reviewer #1, Public Review Point #4.

      (3) Methods of image quantification should be improved. Background subtraction should be considered in the analysis workflow (see Fig. 5C and Fig. 6C background). It would be helpful to have a Methods figure illustrating intermediate processing steps for both NeuN and FosB expression.

      We added more information to improve the methods of quantification. We did use a background subtraction approach where ImageJ provides a histogram of intensity values, and it determines when there is a sharp rise in staining relative to background. That point is where we set the threshold. We think it is the procedure with the least subjectivity.

      We added these methods to the Methods section and expanded the first figure about image quantification, Figure 6B. That figure and legend are shown above in response to Reviewer #1, Point #2.

      This is the revised section of the Methods, Section C3, starting on Line 345:

      “Photomicrographs were acquired using ImagePro Plus V7.0 (Media Cybernetics) and a digital camera (Model RET 2000R-F-CLR-12, Q-Imaging). NeuN and ∆FosB staining were quantified from micrographs using ImageJ (V1.44, National Institutes of Health). All images were first converted to grayscale and in each section, the hilus was traced, defined by zone 4 of Amaral (1978). A threshold was then calculated to identify the NeuN-stained cell bodies but not background. Then NeuN-stained cell bodies in the hilus were quantified manually. Note that the threshold was defined in ImageJ using the distribution of intensities in the micrograph. A threshold was then set using a slider in the histogram provided by ImageJ. The slider was pushed from the low level of staining (similar to background) to the location where staining intensity made a sharp rise, reflecting stained cells. Cells with labeling that was above threshold were counted.”
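      In the procedure quoted above, the threshold was placed by hand with the ImageJ slider. Purely as an illustration of what "the onset of the steep rise of the histogram" can mean numerically, the sketch below scans a histogram of intensity counts and returns the first bin whose bin-to-bin increase reaches a chosen fraction of the largest increase. This is a hypothetical automated analogue, not the authors' method and not an ImageJ function. The name `steep_rise_onset`, the `fraction` parameter, the scan direction, and the example counts are all assumptions.

```python
def steep_rise_onset(counts, fraction=0.1):
    """Return the index of the first bin where the histogram starts to rise steeply.

    A rise is "steep" when the increase to the next bin is at least
    `fraction` of the largest bin-to-bin increase in the histogram.
    """
    if len(counts) < 2:
        raise ValueError("need at least two histogram bins")
    rises = [counts[i + 1] - counts[i] for i in range(len(counts) - 1)]
    max_rise = max(rises)
    if max_rise <= 0:
        raise ValueError("histogram has no rising edge")
    for i, rise in enumerate(rises):
        if rise >= fraction * max_rise:
            return i
    raise ValueError("fraction must be at most 1")


# Hypothetical 10-bin histogram: a flat background level, then a sharp rise.
example_counts = [0, 0, 1, 1, 2, 10, 40, 80, 60, 20]
onset_bin = steep_rise_onset(example_counts)
```

      A smaller `fraction` moves the threshold toward the background level and a larger one moves it toward the peak, loosely analogous to the less strict and strict thresholds used for ∆FosB.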

      (4) This reviewer is surprised that the authors do not speculate more about ACh-related mechanisms. For example, choline deficiency would likely reduce ACh release, which could have the same effect on IIS as muscarinic antagonism (Kam et al. 2016), and could potentially explain the paradoxical effects of a low choline diet on reducing IIS. Some additional mechanistic speculation would be helpful in the Discussion.

      We thank the Reviewer for noting this so we could add it to the Discussion. We had not done so because we were concerned about space limitations.

      The Discussion has a new section starting on Line 1009:

      “Choline and cholinergic neurons

      There are many suggestions for the mechanisms that allow MCS to improve health of the offspring. One hypothesis that we are interested in is that MCS improves outcomes by reducing IIS. Reducing IIS would potentially reduce hyperactivity, which is significant because hyperactivity can increase release of Aβ. IIS would also be likely to disrupt sleep since it represents aberrant synchronous activity over widespread brain regions. The disruption to sleep could impair memory consolidation, since it is a notable function of sleep (Graves et al. 2001; Poe et al. 2010). Sleep disruption also has other negative consequences such as impairing normal clearance of Aβ (Nedergaard and Goldman 2020). In patients, IIS and similar events, IEDs, are correlated with memory impairment (Vossel et al. 2016).

      How would choline supplementation in early life reduce IIS of the offspring? It may do so by making BFCNs more resilient. That is significant because BFCN abnormalities appear to cause IIS. Thus, the cholinergic antagonist atropine reduced IIS in vivo in Tg2576 mice. Selective silencing of BFCNs reduced IIS also. Atropine also reduced elevated synaptic activity of GCs in young Tg2576 mice in vitro. These studies are consistent with the idea that early in AD there is elevated cholinergic activity (DeKosky et al. 2002; Ikonomovic et al. 2003; Kelley et al. 2014; Mufson et al. 2015; Kelley et al. 2016), while later in life there is degeneration. Indeed, the chronic overactivity could cause the degeneration.

      Why would MCS make BFCNs resilient? There are several possibilities that have been explored, based on genes upregulated by MCS. One attractive hypothesis is that neurotrophic support for BFCNs is retained after MCS but in aging and AD it declines (Gautier et al. 2023). The neurotrophins, notably nerve growth factor (NGF) and brain-derived neurotrophic factor (BDNF) support the health of BFCNs (Mufson et al. 2003; Niewiadomska et al. 2011).”

      Minor points:

      (1) The vendor is Dyets Inc., not Dyets.

      Thank you. This correction has been made.

      (2) Anesthesia chamber not specified (make, model, company).

      We have added this information to the Methods, Section D1, starting on Line 375: The animals were anesthetized by isoflurane inhalation (3% isoflurane, 2% oxygen for induction) in a rectangular transparent plexiglas chamber (18 cm long x 10 cm wide x 8 cm high) made in-house.

      (3) It is not clear whether software was used for the detection of behavior. Was position tracking software used or did blind observers individually score metrics?

      We have added the information to the paper. Please see the response to Reviewer #1, Recommendations for Authors, Point #2.

      (4) It is not clear why rat cages and not a true Open Field Maze were used for NOL and NOR.

      We used mouse cages because in our experience that is what is ideal to detect impairments in Tg2576 mice at young ages. We think it is why we have been so successful in identifying NOL impairments in young mice. Before our work, most investigators thought behavior only became impaired later. We would like to add that, in our experience, an Open Field Maze is not the most common cage that is used.

      (5) Figure 1A is not mentioned.

      It had been mentioned in the Introduction. Figure 1B-D was the first figure mentioned in the Results, which is why it might have been missed. We have now added it to the first section of the Results, Line 457, so it is easier to find.

      (6) Although Fig. 7 results are somewhat complicated compared to Fig. 5 and 6 results, EEG comes chronologically earlier than NeuN and FosB expression experiments.

      We have kept the order as is because as the Reviewer said, the EEG is complex. For readability, we have kept the EEG results last.

      (7) Though the statistical analysis involved parametric and nonparametric tests, it is not clear which normality tests were used.

      We have added the names of the normality tests in the Methods, Section E, Line 443: Tests for normality (Shapiro-Wilk) and homogeneity of variance (Bartlett’s test) were used to determine if parametric statistics could be used. After this sentence we also added a clarification: When data were not normal, non-parametric statistics were used. When there was significant heteroscedasticity of variance, data were log transformed. If log transformation did not resolve the heteroscedasticity, non-parametric statistics were used. Because we added correlations and analysis of survival curves, we also added the following (starting on Line 451): For correlations, Pearson’s r was calculated. To compare survival curves, a Log rank (Mantel-Cox) test was performed.
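      The decision rule described above (normality first, then homogeneity of variance, then a log transform as a fallback) can be written out as a short sketch. This is an illustration of the stated logic, not the authors' analysis code. The function name `choose_statistics`, the 0.05 cutoff, and the two callables `normality_p` and `variance_p` (each returning a p-value) are assumptions. The Methods text does not say whether normality was rechecked after the log transform, so the sketch does not recheck it.

```python
import math


def choose_statistics(groups, normality_p, variance_p, alpha=0.05):
    """Pick parametric or non-parametric statistics for a list of groups.

    normality_p(group) -> p-value of a normality test on one group.
    variance_p(groups) -> p-value of a homogeneity-of-variance test.
    Returns (label, data), where data may be log transformed.
    """
    # Step 1: any non-normal group means non-parametric statistics.
    if any(normality_p(g) < alpha for g in groups):
        return "non-parametric", groups
    # Step 2: variances are homogeneous, so parametric statistics are used.
    if variance_p(groups) >= alpha:
        return "parametric", groups
    # Step 3: heteroscedastic data are log transformed (positive values only)
    # and the variance test is repeated on the transformed data.
    if all(x > 0 for g in groups for x in g):
        logged = [[math.log(x) for x in g] for g in groups]
        if variance_p(logged) >= alpha:
            return "parametric (log-transformed)", logged
    # Step 4: the transform did not help, so fall back to non-parametric.
    return "non-parametric", groups
```

      If SciPy is available, `normality_p` could wrap `scipy.stats.shapiro(group).pvalue` and `variance_p` could wrap `scipy.stats.bartlett(*groups).pvalue`, matching the tests named in the Methods.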

      Figures:

      (1) In Fig. 1A, Anatomy should be placed above the line.

      We changed the figure so that the word “Anatomy” is now aligned, and the arrow that was angled is no longer needed.

      In Fig. 1C and 1D, the objects seem to be moved into the cage, not the mice. This schematic does not accurately reflect the Fig. 1C and 1D figure legend text.

      Thank you for the excellent point. The figure has been revised. We also updated it to show the objects more accurately.

      Please correct the punctuation in the Fig. 1D legend.

      Thank you for mentioning the errors. We corrected the legend.

      For ease of understanding, Fig. 1C and 1D should have training and testing labeled in the figure.

      Thank you for the suggestion. We have revised the figure as suggested.

      Author response image 6.

      (2) In Figure 2, error bars for population stats (bar graphs) are not obvious or missing. Same for Figure 3.

      We added two supplemental figures to show error bars, because adding the error bars to the existing figures made the symbols, colors, connecting lines and error bars hard to distinguish. For novel object location (Fig. 2) the error bars are shown in Supp. Fig. 2. For novel object recognition, the error bars are shown in Supplemental Fig. 3.

      (3) The authors should consider a Methods figure for quantification of NeuN and deltaFOSB (expansions of Fig. 5C and Fig. 6C).

      Please see Reviewer #1, Public Review Point #2.

      (4) In Figure 5, A should be omitted and mentioned in the Methods/figure legend. B should be enlarged. C should be inset, zoomed-in images of the hilus, with an accompanying analysis image showing a clear reduction in NeuN intensity in low choline conditions compared to intermediate and high choline conditions. In D, X axes could delineate conditions (figure legend and color unnecessary). Figure 5C should be moved to a Methods figure.

      We thank the reviewer for the excellent suggestions. We removed A as suggested. We expanded B and included insets. We used different images to show a more obvious reduction of cells for the low choline group. We expanded the Methods schematics. The revised figure is Figure 6 and shown above in response to Reviewer 1, Public Review Point #2.

      (5) In Figure 6, A should be eliminated and mentioned in the Methods/figure legend. B should be greatly expanded with higher and lower thresholds shown on subsequent panels (3x3 design).

      We removed A as suggested. We expanded B as suggested. The higher and lower thresholds are shown in C. The revised figure is Figure 7 and shown above in response to Reviewer 1, Public Review Point #2.

      (6) In Figure 7, A2 should be expanded vertically. A3 should be expanded both vertically and horizontally. B 1 and 2 should be increased, particularly B1 where it is difficult to see symbols. Perhaps colored symbols offset/staggered per group so that the spread per group is clearer.

      We added a panel (A4) to show an expansion of A2 and A3. However, we did not see that a vertical expansion would add information so we opted not to add that. We expanded B1 as suggested but opted not to expand B2 because we did not think it would enhance clarity. The revised figure is below.

      Author response image 7.

      (7) Supplemental Figure 1 could possibly be combined with Figure 1 (use rounded corner rat cage schematic for continuity).

      We opted not to combine figures because it would make one extremely large figure. As a result, the parts of the figure would be small and difficult to see.

      (8) Supplemental Figure 2 - there does not seem to be any statistical analysis associated with A mentioned in the Results text.

      We added the statistical information. It is now Supplemental Figure 4:

      Author response image 8.

      Mortality was high in mice treated with the low choline diet. A. Survival curves are shown for mice fed the low choline diet and mice fed the high choline diet. The mice fed the high choline diet had a significantly less severe survival curve. B. Left: A photo of a mouse after sudden unexplained death. The mouse was found in a posture consistent with death during a convulsive seizure. The area surrounded by the red box is expanded below to show the outstretched hindlimb (red arrow). Right: A photo of a mouse that did not die suddenly. The area surrounded by the box is expanded below to show that the hindlimb is not outstretched.

      The revised text is in the Results, Section E, starting on Line 793:

      “The reason that low choline-treated mice appeared to die in a seizure was that they were found in a specific posture in their cage which occurs when a severe seizure leads to death (Supplemental Figure 5). They were found in a prone posture with extended, rigid limbs (Supplemental Figure 5). Regardless of how the mice died, there was greater mortality in the low choline group compared to mice that had been fed the high choline diet (Log-rank (Mantel-Cox) test, Chi square 5.36, df 1, p=0.021; Supplemental Figure 5A).”
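
      As a quick consistency check on the reported log-rank result, the p-value implied by a chi-square statistic with one degree of freedom can be computed with the standard library alone. This is a minimal sketch; the function name is an assumption, and it checks only the conversion from the statistic to the p-value, not the survival analysis itself.

```python
import math


def chi2_sf_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # With df = 1 the statistic is the square of a standard normal variable,
    # so P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


p = chi2_sf_df1(5.36)  # statistic quoted in the revised Results text
print(round(p, 3))
```

      The value rounds to 0.021, in agreement with the p-value quoted for the Mantel-Cox test.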

      Also, why isn't intermediate choline also shown?

      We do not have the data from the animals. Records of death were not kept, regrettably.

      Perhaps labeling of male/female could also be done as part of this graph.

      We agree this would be very interesting but do not have all sex information.

      B is not very convincing, though it is understandable once one reads about posture.

      We have clarified the text and figure, as well as the legend. They are above.

      Are there additional animals that were seen to be in a specific posture?

      There are many examples, and we added them to hopefully make it more convincing.

      We also added posture in WT mice when there is a death to show how different it is.

      Is there any relationship between seizures detected via EEG, as shown in Supplemental Figure 3, and death?

      Several mice died during a convulsive seizure, which is the type of seizure that is shown in the Supplemental Figure.

      (9) Supplemental Figure 3 seems to display an isolated case in which EEG-detected seizures correlate with increased IIEs. It is not clear whether there are additional documented cases of seizures that could be assembled into a meaningful population graph. If this data does not exist or is too much work to include in this manuscript, perhaps it can be saved for a future paper.

      We have added other cases and revised the graph. This is now Supplemental Figure 4 and is shown above in response to Reviewer #1, Recommendation for Authors Point #4.

      Frontal is misspelled.

      We checked and our copy is not showing a misspelling. However, we are very grateful to the Reviewer for catching many errors and reading the manuscript carefully.

      (10) Supplemental Figure 4 seems incomplete in that it does not include EEG data from months 4, 5, and 6 (see Fig. 7B).

      We have added data for these ages to the Supplemental Figure (currently Supplemental Figure 6) as part B. In part A, which had been the original figure, only 1.2, 2, and 3 months-old mice were shown because there were insufficient numbers of each sex at other ages. However, by pooling 1.2 and 2 months (Supplemental Figure 6B1), 3 and 4 months (B2) and 5 and 6 months (B3) we could do the analysis of sex. The results are the same – we detected no sex differences.

      Author response image 9.

      IIS frequency was similar for each sex. A. IIS frequency was compared for females and males at 1.2 months (1), 2 months (2), and 3 months (3). Two-way ANOVA was used to analyze the effects of sex and diet. Female and male Tg2576 mice were not significantly different. B. Mice were pooled at 1.2 and 2 months (1), 3 and 4 months (2) and 5 and 6 months (3). Two-way ANOVA analyzed the effects of sex and diet. There were significant effects of diet for (1) and (2) but not (3). There were no effects of sex at any age. (1) There were significant effects of diet (F(2,47)=46.21, p<0.0001) but not sex (F(1,47)=0.106, p=0.746). Female and male mice fed the low choline diet or high choline diet were significantly different from female and male mice fed the intermediate diet (all p<0.05, asterisk). (2) There were significant effects of diet (F(2,32)=10.82, p=0.0003) but not sex (F(1,32)=1.05, p=0.313). Both female and male mice of the low choline group were significantly different from male mice fed the intermediate diet (both p<0.05, asterisk) but no other pairwise comparisons were significant. (3) There were no significant differences (diet, F(2,23)=1.21, p=0.317; sex, F(1,23)=0.844, p=0.368).

      The data are discussed in the Results, Section G, starting on Line 843:

      In Supplemental Figure 6B we grouped mice at 1-2 months, 3-4 months and 5-6 months so that there were sufficient females and males to compare each diet. A two-way ANOVA with diet and sex as factors showed a significant effect of diet (F(2,47)=46.21; p<0.0001) at 1-2 months of age, but not sex (F(1,47)=0.11, p=0.758). Post-hoc comparisons showed that the low choline group had fewer IIS than the intermediate group, and the same was true for the high choline-treated mice. Thus, female mice fed the low choline diet differed from the females (p<0.0001) and males (p<0.0001) fed the intermediate diet. Male mice that had received the low choline diet differed from females (p<0.0001) and males (p<0.0001) fed the intermediate diet. Female mice fed the high choline diet differed from females (p=0.002) and males (p<0.0001) fed the intermediate diet, and males fed the high choline diet differed from females (p<0.0001) and males (p<0.0001) fed the intermediate diet.

      For the 3-4 months-old mice there was also a significant effect of diet (F(2,32)=10.82, p=0.0003) but not sex (F(1,32)=1.05, p=0.313). Post-hoc tests showed that low choline females were different from males fed the intermediate diet (p=0.007), and low choline males were also significantly different from males that had received the intermediate diet (p=0.006). There were no significant effects of diet (F(2,23)=1.21, p=0.317) or sex (F(1,23)=0.84, p=0.368) at 5-6 months of age.

    1. Author Response

      Reviewer #1 (Public Review):

      Weaknesses:

      Gene expression level as a confounding factor was not well controlled throughout the study. Higher gene expression often makes genes less dispensable after gene duplication. Gene expression level is also a major determining factor of evolutionary rates (reviewed in http://www.ncbi.nlm.nih.gov/pubmed/26055156). Some proposed theories explain why gene expression level can serve as a proxy for gene importance (http://www.ncbi.nlm.nih.gov/pubmed/20884723, http://www.ncbi.nlm.nih.gov/pubmed/20485561). In that sense, many genomic/epigenomic features (such as replication timing and repressed transcriptional regulation) that were assumed "neutral" or intrinsic by the authors (or more accurately, independent of gene dispensability) cannot be easily distinguished from the effect of gene dispensability.

      We thank the reviewer for this important comment. We totally agree that transcriptomic and epigenomic features cannot be easily distinguished from gene dispensability and do not think that these features of the elusive genes can be explained solely by intrinsic properties of the genomes. Our motivation for investigating the expression profiles of the elusive gene is to understand how they lost their functional indispensability (original manuscript L285-286 in Results). We also discussed the possibility that sequence composition and genomic location of elusive genes may be associated with epigenetic features for expression depression, which may result in a decrease of functional constraints (original manuscript L470-474 in Discussion). Nevertheless, we think that the original manuscript may have contained misleading wordings, and thus we have edited them to better convey our view that gene expression and epigenomic features are related to gene function.

      (P.2, Introduction) This evolutionary fate of a gene can also be affected by factors independent of gene dispensability, including the mutability of genomic positions, but such features have not been examined well.

      (P6, Introduction) These data assisted us to understand how intrinsic genomic features may affect gene fate, leading to gene loss by decreasing the expression level and eventually relaxing the functional importance of ʻelusiveʼ genes.

      (P33, Discussion) Another factor is the spatiotemporal suppression of gene expression via epigenetic constraints. Previous studies showed that lowly expressed genes reduce their functional dispensability (Cherry, 2010; Gout et al., 2010), and so do the elusive genes.

      Additionally, in response to the advice from Reviewers 1 and 2 [Rev1minor7 and Rev2-Major4], we have added a new section, Elusive gene orthologs in the chicken microchromosomes, in which we describe the relationship between the elusive genes and chicken microchromosomes. In this section, we also discuss the relationship between the genomic features of the elusive genes and their transcriptomic and epigenomic characteristics. In the chicken genome, elusive genes showed neither reduced pleiotropy of gene expression nor the epigenetic features relevant to such a reduction, consistent with the moderation of nucleotide substitution rates. This also suggests that the relaxation of the ‘elusiveness’ is associated with the increase of functional indispensability.

      (P27, Elusive gene orthologs in the chicken microchromosomes in Results) Our analyses indicate that the genomic features of the elusive genes such as high GC and high nucleotide substitutions do not always correlate with a reduction in pleiotropy of gene expression that potentially leads to an increase in functional dispensability, although these features have been well conserved across vertebrates. In addition, the avian orthologs of the elusive genes did not show higher KA and KS values than those of the non-elusive genes (Figure 3; Figure 3–figure supplement 1), likely consistent with similar expression levels between them (Figure 5–figure supplement 1) (Cherry, 2010; Zhang and Yang, 2015). With respect to the chicken genome, the sequence features of the elusive genes themselves might have been relaxed during evolution.

      Ks was used by the authors to indicate mutation rates. However, synonymous mutations substantially affect gene expression levels (https://pubmed.ncbi.nlm.nih.gov/25768907/, https://pubmed.ncbi.nlm.nih.gov/35676473/). Thus, synonymous mutations cannot be simply assumed as neutral ones and may not be suitable for estimating local mutation rates. If introns can be aligned, they are better sequences for estimating the mutability of a genomic region.

      We appreciate the reviewer for this meaningful suggestion. As a response, we have computed the differences in intron sequences between the human and chimpanzee genomes and compared them between the elusive and non-elusive genes. As expected, we found larger sequence differences in introns for the elusive genes than for the non-elusive genes. In Figure 2c of the revised manuscript, we have included the distribution of KI, sequence differences in introns between the human and chimpanzee genomes for the elusive and non-elusive genes. Additionally, we have added the corresponding texts to Results and the procedure to Methods as shown below.

      (P11, Identification of human ‘elusive’ genes in Results) In addition, we computed nucleotide substitution rates for introns (KI) between human and chimpanzee (Pan troglodytes) orthologs and compared them between the elusive and non-elusive genes.

      (P11, Identification of human ‘elusive’ genes in Results) Our analysis further illuminated larger KS and KI values for the elusive genes than in the non-elusive genes (Figure 2b, c; Figure 2–figure supplement 1). Importantly, the higher rate of synonymous and intronic nucleotide substitutions, which may not affect changes in amino acid residues, indicates that the elusive genes are also susceptible to genomic characteristics independent of selective constraints on gene functions.

      (P39, Methods) To compute nucleotide sequence differences of the individual introns, we extracted 473 elusive and 4,626 non-elusive genes that harbored introns aligned with the chimpanzee genome assembly. The nucleotide differences were calculated via the whole genome alignments of hg38 and panTro6 retrieved from the UCSC genome browser.
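
      For illustration, a per-intron sequence difference of this kind could be computed from a pairwise alignment as the fraction of differing sites. The sketch below is an assumption about the general approach, not the authors' pipeline: the function name and the toy sequences are hypothetical, and the choice to skip gapped and ambiguous positions is an assumption, since the response does not say how such sites were handled.

```python
def intron_difference(seq_a, seq_b):
    """Fraction of aligned, unambiguous sites that differ between two sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip alignment gaps and ambiguous bases (assumed handling).
        if a in "ACGT" and b in "ACGT":
            compared += 1
            mismatches += a != b
    return mismatches / compared if compared else float("nan")


# Hypothetical aligned intron fragments (not real hg38/panTro6 sequence).
human = "ACGTACGT-ACGTNACG"
chimp = "ACGTACAT-ACGTAACC"
print(intron_difference(human, chimp))  # 2 mismatches over 15 compared sites
```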

      The term "elusive gene" is not necessarily intuitive to readers.

      We previously published a paper reporting the group of genes that we refer to as ‘elusive genes,’ lost in mammals and aves independently but retained by reptiles, in the gecko genome assembly (Hara et al., 2018, BMC Biology). We initially gave them a more intuitive name (‘loss-prone genes’) but changed it because one of our peer reviewers did not agree with this name. We have since used the term consistently, including in another paper (Hara et al., 2018, Nat. Ecol. Evol.). In addition, some other groups have used the word ‘elusive’ with a similar intention to ours (Prokop et al, 2014, PLOS ONE, doi: 10.1371/journal.pone.0092751; Ribas et al., 2011, BMC Genomics, doi: 10.1186/1471-2164-12-240). We would appreciate the reviewer’s understanding of this naming to ensure the consistency of our research on gene loss. In the revised manuscript, we have added sentences to provide a more intuitive guide to ‘elusive genes’:

      (P6, Introduction) We previously referred to the nature of genes prone to loss as ‘elusive’(Hara et al., 2018a, 2018b). In the present study, we define the elusive genes as those that are retained by modern humans but have been lost independently in multiple mammalian lineages. As a comparison of the elusive genes, we retrieved the genes that were retained by almost all of the mammalian species examined and defined them as ‘non-elusive’, representing those persistent in the genomes.
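
      The definitions above can be made concrete with a toy classifier. This is a sketch under stated assumptions: the function names, taxon groupings, species, and gene labels are hypothetical; "multiple lineages" is read as at least two taxa; and losses in distinct taxa are treated as independent, whereas the study infers loss events from molecular phylogeny.

```python
def lost_in_taxon(gene, species_list, presence):
    """A gene counts as lost in a taxon only if it is absent from every species in it."""
    return all(gene not in presence[sp] for sp in species_list)


def classify(gene, taxa, presence, min_losses=2):
    """Label a human gene 'elusive', 'non-elusive', or 'unclassified'."""
    losses = sum(lost_in_taxon(gene, sp_list, presence) for sp_list in taxa.values())
    if losses >= min_losses:
        return "elusive"
    all_species = [sp for sp_list in taxa.values() for sp in sp_list]
    if all(gene in presence[sp] for sp in all_species):
        return "non-elusive"
    return "unclassified"


# Hypothetical taxa and ortholog presence sets.
taxa = {
    "rodents": ["mouse", "rat"],
    "carnivores": ["dog", "cat"],
    "bats": ["bat1", "bat2"],
}
presence = {
    "mouse": {"g1", "g2"}, "rat": {"g1", "g2"},
    "dog": {"g1"}, "cat": {"g1", "g3"},
    "bat1": {"g1"}, "bat2": {"g1", "g3"},
}
for gene in ("g1", "g2", "g3"):
    print(gene, classify(gene, taxa, presence))
```

      Genes that fall into neither class (here `g3`, lost in only one taxon) are left out of the comparison rather than forced into one of the two sets.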

      Reviewer #3 (Public Review):

      Overall, the study is descriptive and adds incremental evidence to an existing body of extensive gene loss literature. The topic is specialised and will be of interest to a niche audience. The text is highly redundant, repeating the same false positive issue in the introduction, methods, and discussion sections, while no clear conclusion or interpretation of their main findings are presented.

      Major comments

      While some of the false discovery rate issues of gene loss detection were addressed in the presented pipeline, the authors fail to test one of the most severe cases of mis-annotating gene loss events: frameshift mutations which cause gene annotation pipelines to fail reporting these genes in the first place. Running a blastx or diamond blastx search of their elusive and non-elusive gene sets against all other genomes, should further enlighten the robustness of their gene loss detection approach

      For the revised manuscript, we have refined the elusive gene set as the reviewer suggested. In the genome assemblies, we have searched for the orthologs of the elusive genes for the species in which they were missing. The search has been conducted by querying amino acid sequences of the elusive genes with tblastn as well as MMSeqs2, which outperformed tblastn in sensitivity and computational speed. In addition, in response to another comment by Reviewer 3, we have searched for the orthologs by referring to existing ortholog annotations. We used the ortholog annotations implemented in RefSeq instead of those from the TOGA pipeline: both employ synteny conservation. We have coordinated the identified orthologs with our gene loss criteria–absence from all the species used in a particular taxon–and excluded 268 genes from the original elusive gene set. These genes include those missing in the previous gene annotations used in the original manuscript but present in the latest ones, as well as those falsely missing due to incorrect inference of gene trees. Finally, the refined set of 813 elusive genes was subjected to comparisons with the non-elusive genes. Importantly, these comparisons retained the significantly different trends of the particular genomic, transcriptomic, and epigenomic features between them except for very few cases (Table R1 included below). This indicates that both initial and revised sets of the elusive genes reflect the nature of the ‘elusiveness,’ though the initial set contained some noise. We have modified the numbers of elusive genes in the corresponding parts of the manuscript including figures and tables. Additionally, we have added the validation procedures in Methods.

      Table R1. Difference in statistical significance across different elusive gene sets.

      *The other features showed significantly different trends between the elusive and non-elusive genes for all of the elusive gene sets and thus are not included in this table.

      (P38 in Methods) The gene loss events inferred by molecular phylogeny were further assessed by synteny-based ortholog annotations implemented in RefSeq, as well as a homolog search in the genome assemblies (Table S2) with TBLASTN v2.11.0+ (Altschul et al., 1997) and MMSeqs2 (Steinegger and Söding, 2017) referring to the latest RefSeq gene annotations (last accessed on 2 Dec, 2022). This procedure resulted in the identification of 813 elusive genes that harbored three or fewer duplicates. Similarly, we extracted 8,050 human genes whose orthologs were found in all the mammalian species examined and defined them as non-elusive genes.

      The reviewer also suggested that we investigate genes that are falsely missing due to frameshift mutations (here we assume the reviewer meant genome assemblies that erroneously contain frameshifts). This requires searching for the orthologs in the sequencing reads, because such frameshifts are sometimes caused by indels from erroneous basecalling. We selected five elusive genes and searched for fragments of their orthologs in sequencing reads for the species in which they are missing. We retrieved the sequencing reads corresponding to the genome assemblies from NCBI SRA and performed a sequence similarity search with the program Diamond against the amino acid sequences of the elusive genes. We could not find any frameshift that potentially causes mis-annotation of the elusive genes.

      Along this line, we noticed that when annotation files were pooled together via CD-Hit clustering, a 100% identity threshold was chosen (Methods). Since some of the pooled annotations were drawn from less high quality assemblies which yield higher likelihoods of mismatches between annotations, enforcing a 100% identity threshold will artificially remove genes due to this strict constraint. It will be paramount for this study to test the robustness of their findings when 90% and 95% identity thresholds were selected.

      cd-hit clustering with 100% sequence identity only clusters those with identical (and sometimes truncated) sequences, and, in the cluster, the sequences other than the representative are discarded. This means that the sequences remain if they are not identical to the other ones. If the similarity threshold is lowered, both identical and highly similar sequences are clustered with each other, and more sequences are discarded. Therefore, our approach that employs clustering with 100% similarity may minimize false positive gene loss.
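
      The argument can be illustrated with a toy greedy clustering. This is not the CD-HIT algorithm (which uses short-word filtering and alignment); the function names, the ungapped identity measure, and the sequences are hypothetical and serve only to show that lowering the threshold discards more sequences, not fewer.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (ungapped toy measure)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n


def cluster(seqs, threshold):
    """Greedy clustering: keep a sequence only if it is below the threshold to every representative."""
    reps = []
    # Longest first, so a truncated copy is compared against its full-length version.
    for s in sorted(set(seqs), key=lambda x: (-len(x), x)):
        if not any(identity(s, r) >= threshold for r in reps):
            reps.append(s)
    return reps


# Hypothetical protein annotations pooled from several sources.
proteins = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYI", "MKTAYIAKQK", "MSSLLPWQAV"]
print(len(cluster(proteins, 1.0)))  # identical and truncated copies are merged
print(len(cluster(proteins, 0.9)))  # the near-identical variant is discarded as well
```

      In this toy set, the 100% threshold keeps three representatives while the 90% threshold keeps two, so the stricter threshold retains more annotated sequences.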

      While some statistical tests were applied (although we do recommend consulting a professional statistician, since some identical distributions tend to show significantly low p-values), the authors fail to discuss the fact that their elusive gene set comprises ~5% of all human genes (assuming 21,000 genes), while their non-elusive set represents ~40% of all genes. In other words, the authors compare their sequence and genomic features against the genomic background rather than a biological signal (non-elusiveness). An analysis whereby 1,081 genes (same number as elusive set) are randomly sampled from the 21,000 gene pool and compared against the elusive and non-elusive distributions for all presented results will reveal whether the non-elusive set follows a background distribution (noise) or not.

      Our study aims to elucidate the characteristics of genes that differentiate their fates, retention or loss. To achieve this, we framed the characterization as a comparison between the elusive and non-elusive genes. This comparison clearly highlighted different phylogenetic signals for gene loss between elusive and non-elusive genes, allowing us to extract the features associated with the loss-prone nature. The random sampling set suggested by the Reviewer may largely consist of the remainders that were classified as neither elusive nor non-elusive genes. However, these remainders may contain a considerable number of genes with distinctive phylogenetic signatures rather than intermediates between the elusive and non-elusive genes: the genes with multiple loss events in more restricted taxa than our criterion, the ones with frequent duplication, etc. Therefore, we think that a comparison of the elusive genes with the random-sampling set does not achieve our objective: the comparison of the clearly different phylogenetic signals.

      We also wondered whether the authors considered testing the links between recombination rate / LD and the genomic locations of their elusive genes (again compared against randomly sampled genes)?

      We have retrieved fine-scale recombination rate data of males and females from https://www.decode.com/addendum/ (Suppl. Data of Kong, A et al., Nature, 467:1099–1103, 2010) and have compared them between the gene regions of the elusive and non-elusive genes. Neither comparison shows a significant difference: for males, averages of 0.829 and 0.900 recombinations/kb for the elusive and non-elusive genes, respectively (p=0.898); for females, averages of 0.836 and 0.846 recombinations/kb for the elusive and non-elusive genes, respectively (p=0.256).

      Given the evidence presented in Figure 6b, we do not agree with the statement (l.334-336): "These observations suggest that the elusive genes are unlikely to be regulated by distant regulatory elements". Here, a data population of ~1k genes is compared against a data population of ~8k genes and the presented difference between distributions could be a sample size artefact. We strongly recommend retesting this result with the ~1k randomly sampled genes from the total ~21,000 gene pool and then compare the distributions.

      Analogous random sampling analysis should be performed for Fig 6a,d

      As described above, our study does not intend to extract signals from background. To make the comparison objectives clear, we have revised the corresponding sentence as below.

      (P22, Transcriptomic natures of elusive genes in Results) These observations suggest that the elusive genes are unlikely to be regulated by distant regulatory elements compared with the non-elusive genes (Figure 6b).

      We didn't see a clear pattern in Figure 7. Please quantify enrichments with statistical tests. Even if there are enriched regions, why did the authors choose a Shannon entropy cutoff configuration of <1 (low) and >1 (high)? What was the overall entropy value range? If the maximum entropy value was 10 or 100 or even more, then denoting <1 as low and >1 as high seems rather biased.

      To use Figure 7 in a new section in Results, we have added an ideogram showing the distribution of the genes that retain the chicken orthologs in microchromosomes. In response to the comment by Reviewer 2, we have performed statistical tests and found that the elusive genes were significantly more abundant in orthologs in microchromosomes than the non-elusive genes. Furthermore, the observation that the elusive genes prefer to be located in gene-rich regions was already statistically supported (Figure 2f).

      As shown in Figure 5, Shannon’s H' ranged from zero to approximately 4 (exact maximum value is 3.97) and 5 (5.11) for the GTEx and Descartes gene expression datasets, respectively. Although the threshold H'=1 was set arbitrarily, we think that it is reasonable for separating the genes with high pleiotropy from those with low pleiotropy.
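
      For readers unfamiliar with the index, the sketch below shows how a Shannon entropy of this kind behaves for a single gene's expression profile. The function name and the toy expression vectors are hypothetical, and the logarithm base is an assumption (natural log by default), since the response does not state which base was used.

```python
import math


def shannon_h(expression, base=math.e):
    """Shannon entropy of one gene's expression distribution across tissues."""
    total = sum(expression)
    if total == 0:
        return 0.0
    h = 0.0
    for value in expression:
        if value > 0:
            p = value / total  # share of the gene's expression in this tissue
            h -= p * math.log(p) / math.log(base)
    return h


tissue_specific = [10.0, 0.0, 0.0, 0.0]  # expressed in a single tissue
broad = [5.0, 5.0, 5.0, 5.0]             # equal expression in four tissues
print(shannon_h(tissue_specific))
print(shannon_h(broad))
```

      A gene expressed in one tissue scores 0, and a gene expressed equally in n tissues scores the maximum, log(n), so the attainable range depends on the number of tissues or cell types in each dataset.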

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Wei & Robles et al seek to estimate the heritability contribution of Neanderthal Informative Markers (NIM) relative to SNPs that arose in modern humans (MH). This is a question that has received a fair amount of attention in recent studies, but persistent statistical limitations have made some prior results difficult to interpret. Of particular concern is the possibility that heritability (h^2) attributed to Neanderthal markers might be tagging linked variants that arose in modern humans, resulting in overestimation of h^2 due to Neanderthal variants. Neanderthal variants also tend to be rare, and estimating the contribution of rare alleles to h^2 is challenging. In some previous studies, rare alleles have been excluded from h^2 estimates.

      Wei & Robles et al develop and assess a method that estimates both total heritability and per-SNP heritability of NIMs, allowing them to test whether NIM contributions to variation in human traits are similar or substantially different than modern human SNPs. They find an overall depletion of heritability across the traits that they studied, and found no traits with enrichment of heritability due to NIMs. They also developed a 'fine-mapping' procedure that aims to find potential causal alleles and report several potentially interesting associations with putatively functional variants.

      Strengths of this study include rigorous assessment of the statistical methods employed with simulations and careful design of the statistical approaches to overcome previous limitations due to LD and frequency differences between MH and NIM variants. I found the manuscript interesting and I think it makes a solid contribution to the literature that addresses limitations of some earlier studies.

      My main questions for the authors concern potential limitations of their simulation approach. In particular, they describe varying genetic architectures corresponding to the enrichment of effects among rare alleles or common alleles. I agree with the authors that it is important to assess the impact of (unknown) architecture on the inference, but the models employed here are ad hoc and unlikely to correspond to any mechanistic evolutionary model. It is unclear to me whether the contributions of rare and common alleles (and how these correspond with levels of LD) in real data will be close enough to these simulated schemes to ensure good performance of the inference.

      In particular, the common allele model employed makes 90% of effect variants have frequencies above 5% -- I am not aware of any evolutionary model that would result in this outcome, which would suggest that more recent mutations are depleted for effects on traits (of course, it is true that common alleles explain much more h^2 under neutral models than rare alleles, but this is driven largely by the effect of frequency on h^2, not the proportion of alleles that are effect alleles). Likewise, the rare allele model has the opposite pattern, with 90% of effect alleles having frequencies under 5%. Since most alleles have frequencies under 5% anyway (~58% of MH SNPs and ~73% of NIM SNPs) this only modestly boosts the prevalence of low frequency effect alleles relative to their proportion. Some selection models suggest that rare alleles should have much bigger effects and a substantially higher likelihood of being effect alleles than common alleles. I'm not sure this situation is well-captured by the simulations performed. With LD and MAF annotations being applied in relatively wide quintile bins, do the authors think their inference procedure will do a good job of capturing such rare allele effects? This seems particularly important to me in the context of this paper, since the claim is that Neanderthal alleles are depleted for overall h^2, but Neanderthal alleles are also disproportionately rare, meaning they could suffer a bigger penalty. This concern could be easily addressed by including some simulations with additional architectures to those considered in the manuscript.

      We thank the reviewer for the thoughtful comments regarding rare alleles, and we agree that our RARE simulations only moderately boosted the enrichment of rare alleles in causal mutations. To address this, we added new simulations, ULTRA RARE, in which SNPs with MAF < 0.01 constitute 90% of the causal variants. As in our previous simulations, we use 100,000 and 10,000 causal variants to mimic highly polygenic and moderately polygenic phenotypes, and heritabilities of 0.5 and 0.2 to mimic highly and moderately heritable phenotypes. We again ran three replicate simulations for each combination and partitioned the heritability with Ancestry only annotation, Ancestry+MAF annotation, Ancestry+LD annotation, and Ancestry+MAF+LD annotation. Our Ancestry+MAF+LD annotation remains calibrated in this setting (see Figure below). We believe this experiment strengthens our paper and have added it as Fig S2.
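
      The selection of causal variants in an architecture such as ULTRA RARE can be sketched as follows. This is a minimal illustration, not the simulation code used in the study: the function name, the seed, and the toy MAF vector are assumptions, and effect-size sampling and phenotype generation are omitted.

```python
import random


def pick_causal(mafs, n_causal, maf_cutoff=0.01, rare_fraction=0.9, seed=0):
    """Choose causal SNP indices so that `rare_fraction` of them have MAF < `maf_cutoff`."""
    rng = random.Random(seed)
    rare = [i for i, m in enumerate(mafs) if m < maf_cutoff]
    common = [i for i, m in enumerate(mafs) if m >= maf_cutoff]
    n_rare = round(n_causal * rare_fraction)
    # Sampling without replacement keeps every causal variant distinct.
    return rng.sample(rare, n_rare) + rng.sample(common, n_causal - n_rare)


# Hypothetical MAF spectrum: 10,000 SNPs, of which 100 have MAF = 0.005.
mafs = [(i % 100 + 1) / 200 for i in range(10_000)]
causal = pick_causal(mafs, n_causal=100)
n_rare_causal = sum(mafs[i] < 0.01 for i in causal)
print(n_rare_causal, len(causal))
```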

      While we agree that these architectures are ad hoc and unlikely to correspond to realistic evolutionary scenarios, we chose them to span the range of possible architectures, so that the skews towards common or rare alleles that we explored are extreme. The finding that our estimates are calibrated across this range leads us to conclude that our inferences should be robust.

      More broadly, we concur with the reviewer that our results (as well as others in the field) may need to be revisited as our view of the genetic architecture of complex traits evolves. The methods that we propose in this paper are general enough to explore such architectures in the future by choosing a sufficiently large set of annotations that match the characteristics across NIMs and MH SNPs. A practical limitation to this strategy is that the use of a large number of annotations can result in some annotations being assigned a small number of SNPs which would, in turn, reduce the precision of our estimates. This limitation is particularly relevant due to the smaller number of NIMs compared to MH SNPs (around 250K vs around 8M).

      Reviewer #2 (Public Review):

      The goal of the work described in this paper is to comprehensively describe the contribution of Neanderthal-informative mutations (NIMs) to complex traits in modern human populations. There are some known challenges in studying these variants, namely that they are often uncommon, and have unusually long haplotype structures. To overcome these, the authors customized a genotyping array to specifically assay putative Neanderthal haplotypes, and used a recent method of estimating heritability that can explicitly account for differences in MAF and LD.

      This study is well thought-out, and the ability to specifically target the genotyping array to the variants in question and then use that information to properly control for population structure is a massive benefit. The methodology also allowed them to include rarer alleles that were generally excluded from previous studies. The simulations are thorough and convincingly show the importance of accounting for both MAF and LD in addition to ancestry. The fine-mapping done to disentangle effects between actual Neanderthal variants and Modern human ones on the same haplotype also seems reasonable. They also strike a good balance between highlighting potentially interesting examples of Neanderthal variants having an effect on phenotype without overinterpreting association-based findings.

      The main weakness of the paper is in its description of the work, not the work itself. The paper currently places a lot of emphasis on comparing these results to prior studies, particularly on its disagreement with McArthur, et al. (2021), a study on introgressed variant heritability that was also done primarily in UK Biobank. While they do show that the method used in that study (LDSR) does not account for MAF and LD as effectively as this analysis, this work does not support the conclusion that this is a major problem with previous heritability studies. McArthur et al. in fact largely replicate these results that Neanderthal variants (and more generally regions with Neanderthal variants) are depleted of heritability, and agree with the interpretation that this is likely due to selection against Neanderthal alleles. I actually find this a reassuring point, given the differences between the variant sets and methods used by the two studies, but it isn't mentioned in the text. Where the two studies differ is in specifics, mainly which loci have some association with human phenotypes; McArthur et al. also identified a couple groups of traits that were exceptions to the general rule of depleted heritability. While this work shows that not accounting for MAF and LD can lead to underestimating NIM heritability, I don't follow the logic behind the claim that this could lead to a false positive in heritability enrichment (a false negative would be more likely, surely?). There are also more differences between this and previous heritability studies than just the method used to estimate heritability, and the comparisons done here do not sufficiently account for these. A more detailed discussion to reconcile how, despite its weaknesses, LDSR picks up similar broad patterns while disagreeing in specifics is merited.

      We agree with the reviewer that our results are generally concordant with those of McArthur et al. 2021 and this concordance is reassuring given the differences across our studies. The differences across the studies, wherein McArthur et al. 2021 identify a few traits with elevated heritability while we do not, could arise due to reasons beyond the methodological differences such as differences in the sets of variants analyzed. We have partially explored this possibility in the revised manuscript by analyzing the set of introgressed variants identified by the Sprime method (which was used in McArthur et al. 2021) using our method: we continue to observe a pattern of depletion with no evidence for enrichment. We hypothesize that the reason why LDSR picks up similar overall patterns despite its limitations is indicative of the nature of selection on introgressed alleles (which, in turn, influences the dependence of effect size on allele frequency and LD). Investigating this hypothesis will require a detailed understanding of the LDSR results on parameters such as the MAF threshold on the regression SNPs and the LD reference SNPs and the choice of the LD reference panel.

      Not accounting for MAF and LD can underestimate NIM heritability but can both underestimate and overestimate heritability at MH SNPs. Hence, tests that compare per-SNP heritability at NIMs to that at MH SNPs can lead to false positives in the direction of both enrichment and depletion.

      We have now written in the Discussion: “In spite of these differences in methods and NIMs analyzed, our observation of an overall pattern of depletion in the heritability of introgressed alleles is consistent with the findings of McArthur et al. The robustness of this pattern might provide insights into the nature of selection against introgressed alleles”

      In general this work agrees with the growing consensus in the field that introgressed Neanderthal variants were selected against, such that those that still remain in human populations do not generally have large effects on phenotypes. There are exceptions to this, but for the most part observed phenotypic associations depend on the exact set of variants being considered, and, like those highlighted in this study, still lack more concrete validation. While this paper does not make a significant advance in this general understanding of introgressed regions in modern populations, it does increase our knowledge in how best to study them, and makes a good attempt at addressing issues that are often just mentioned as caveats in other studies. It includes a nice quantification of how important these variables are in interpreting heritability estimates, and will be useful for heritability studies going forward.

    1. Author Response:

      Reviewer #1 (Public Review):

      In this manuscript, the authors leverage novel computational tools to detect, classify and extract information underlying sharp-wave ripples, and synchronous events related to memory. They validate the applicability of their method to several datasets and compare it with a filtering method. In summary, they found that their convolutional neural network detection captures more events than the commonly used filter method. This particular capability of capturing additional events which traditional methods don't detect is very powerful and could open important new avenues worth further investigation. The manuscript in general will be very useful for the community as it will increase the attention towards new tools that can be used to solve ongoing questions in hippocampal physiology.

      We thank the reviewer for the constructive comments and appreciation of the work.

      Additional minor points that could improve the interpretation of this work are listed below:

      • Spectral methods could also be used to capture the variability of events if used properly or run several times through a dataset. I think adjusting the statements where the authors compare CNN with traditional filter detections could be useful as it can be misleading to state otherwise.

      We thank the reviewer for this suggestion. We would like to emphasize that we do not advocate at all for disusing filters. We feel that a combination of methods is required to improve our understanding of the complex electrophysiological processes underlying SWR. We have adjusted the text as suggested. In particular, a) we removed the misleading sentence from the abstract, and instead declared the need for new automatic detection strategies; b) we edited the introduction similarly, and clarified the need for improved online applications.

      • The authors show that their novel method is able to detect "physiological relevant processes" but no further analysis is provided to show that this is indeed the case. I suggest adjusting the statement to "the method is able to detect new processes (or events)".

      We have corrected the text as suggested. In particular, we declare that “The new method, in combination with community tagging efforts and optimized filter, could potentially facilitate discovery and interpretation of the complex neurophysiological processes underlying SWR.” (page 12).

      • In Fig.1 the authors show how they tune the parameters that work best for their CNN method and from there they compare it with a filter method. In order to offer a more fair comparison analogous tuning of the filter parameters should be tested alongside to show that filters can also be tuned to improve the detection of "ground truth" data.

      Thank you for this comment. As explained before, see below the results of the parameter study for the filter in the very same sessions used for training the CNN. The parameters chosen (100-300 Hz band, order 2) provided maximal performance in the test set. Both methods are therefore similarly optimized during training. This is now included (page 4): “In order to compare CNN performance against spectral methods, we implemented a Butterworth filter, which parameters were optimized using the same training set (Fig.1-figure supplement 1D).”

      • Showing a manual score of the performance of their CNN method detection with false positive and false negative flags (and plots) would be clarifying in order to get an idea of the type of events that the method is able to detect and fails to detect.

      We have added information on the categories of False Positives for both the CNN and the filter in the new Fig.4F. We have also prepared an executable figure to show examples and to facilitate understanding of how the CNN works. See the new Fig.5 and the executable notebook https://colab.research.google.com/github/PridaLab/cnn-ripple-executable-figure/blob/main/cnn-ripple-false-positive-examples.ipynb

      • In fig 2E the authors show the differences between CNN with different precision and the filter method, while the performance is better the trends are extremely similar and the numbers are very close for all comparisons (except for the recall where the filter clearly performs worse than CNN).

      This refers to the external dataset (Grosmark and Buzsaki 2016), which is now in the new Fig.3E. To address this point and to improve statistical reporting, we have added more data, resulting in 5 sessions from 2 rats. The data confirm better performance of the CNN model versus the filter. The purpose of this figure is to show the effect of the definition of the ground truth on the performance of the different methods, and also the proper performance of the CNN on external datasets without retraining. Please note that in Grosmark and Buzsaki, SWR detection was conditioned on the coincidence of both population synchrony and LFP definition, thus providing a “partial ground truth” (i.e. SWR without population firing were not annotated in the dataset).

      • The authors acknowledge that various forms of SWRs not consistent with their common definition could be captured by their method. But theoretically, it could also be the case that, due to the spectral continuum of the LFP signals, noisy features of the LFP could also be passed as "relevant events"? Discussing this point in the manuscript could help with the context of where the method might be applied in the future.

      As suggested, we have mentioned this point in the revised version. In particular: “While we cannot discard noisy detection from a continuum of LFP activity, our categorization suggest they may reflect processes underlying buildup of population events (de la Prida et al., 2006). In addition, the ability of CA3 inputs to bring about gamma oscillations and multi-unit firing associated with sharp-waves is already recognized (Sullivan et al., 2011), and variability of the ripple power can be related with different cortical subnetworks (Abadchi et al., 2020; Ramirez-Villegas et al., 2015). Since the power spectral level operationally defines the detection of SWR, part of this microcircuit intrinsic variability may be escaping analysis when using spectral filters” (page 16).

      • In fig. 5 the authors claim that there are striking differences in firing rate and timings of pyramidal cells when comparing events detected in different layers (compare to SP layer). This is not very clear from the figure as the plots 5G and 5H show that the main differences are when compare with SO and SLM.

      We apologize for generating confusion. We meant that the analysis was performed by comparing properties of SWR detected at SO, SR and SLM (using z-values scored by SWR detected at SP only). We clarified this point in the revised version: “We found larger sinks and sources for SWR that can be detected at SLM and SR versus those detected at SO (Fig.7G; z-scored by mean values of SWR detected at SP only).” (page 14).

      • Could the above differences be related to the fact that the performance of the CNN could have different percentages of false-positive when applied to different layers?

      The rate of FP is similar across layers: 0.52 ± 0.21 for SO, 0.50 ± 0.21 for SR and 0.46 ± 0.19 for SLM. This is now mentioned in the text: “No difference in the rate of False Positives between SO (0.52 ± 0.21), SR (0.50 ± 0.21) and SLM (0.46 ± 0.19) can account for this effect.” (page 12)

      Alternatively, could the variability be related to the occurrence (and detection) of similar events in neighboring spectral bands (i.e., gamma events)? Discussion of this point in the manuscript would be helpful for the readers.

      We have discussed this point: “While we cannot discard noisy detection from a continuum of LFP activity, our categorization suggest they may reflect processes underlying buildup of population events (de la Prida et al., 2006). In addition, the ability of CA3 inputs to bring about gamma oscillations and multi-unit firing associated with sharp-waves is already recognized (Sullivan et al., 2011), and variability of the ripple power can be related with different cortical subnetworks (Abadchi et al., 2020; Ramirez-Villegas et al., 2015).” (Page 16)

      Overall, I think the method is interesting and could be very useful to detect more nuance within hippocampal LFPs and offer new insights into the underlying mechanisms of hippocampal firing and how they organize in various forms of network events related to memory.

      We thank the reviewer for constructive comments and appreciation of the value of our work.

      Reviewer #2 (Public Review):

      Navas-Olive et al. provide a new computational approach that implements convolutional neural networks (CNNs) for detecting and characterizing hippocampal sharp-wave ripples (SWRs). SWRs have been identified as important neural signatures of memory consolidation and retrieval, and there is therefore interest in developing new computational approaches to identify and characterize them. The authors demonstrate that their network model is able to learn to identify SWRs by showing that, following the network training phase, performance on test data is good. Performance of the network varied by the human expert whose tagging was used to train it, but when experts' tags were combined, performance of the network improved, showing it benefits from multiple input. When the network trained on one dataset is applied to data from different experimental conditions, performance was substantially lower, though the authors suggest that this reflected erroneous annotation of the data, and once corrected performance improved. The authors go on to analyze the LFP patterns that nodes in the network develop preferences for and compare the network's performance on SWRs and non-SWRs, both providing insight and validation about the network's function. Finally, the authors apply the model to dense Neuropixels data and confirmed that SWR detection was best in the CA1 cell layer but could also be detected at more distant locations.

      The key strengths of the manuscript lay in a convincing demonstration that a computational model that does not explicitly look for oscillations in specific frequency bands can nevertheless learn to detect them from tagged examples. This provides insight into the capabilities and applications of convolutional neural networks. The manuscript is generally clearly written and the analyses appear to have been carefully done.

      We thank the reviewer for the summary and for highlighting the strengths of our work.

      While the work is informative about the capabilities of CNNs, the potential of its application for neuroscience research is considerably less convincing. As the authors state in the introduction, there are two potential key benefits that their model could provide (for neuroscience research): 1. improved detection of SWRs and 2. providing additional insight into the nature of SWRs, relative to existing approaches. To this end, the authors compare the performance of the CNN to that of a Butterworth filter. However, there are a number of major issues that limit the support for the authors' claims:

      Please see below the answers to specific questions, which we hope clarify the validity of our approach.

      • Putting aside the question of whether the comparison between the CNN and the filter is fair (see below), it is unclear if even as is, the performance of the CNN is better than a simple filter. The authors argue for this based on the data in Fig. 1F-I. However, the main result appears to be that the CNN is less sensitive to changes in the threshold, not that it does better at reasonable thresholds.

      This comment now refers to the new Fig.2A (offline detection) and Fig.2C,D (online detection). Starting with offline detection: yes, the CNN is less sensitive to the threshold than the filter, and that has major consequences both offline and online. For the filter to reach its best performance, the threshold has to be tuned, which is a time-consuming process. Importantly, this is only doable when the ground truth is known. In practical terms, most labs run a semi-automatic detection approach in which events are first detected and then manually validated. The fact that the filter is more sensitive to thresholds makes this process very tedious. The CNN, in contrast, is more stable.

      In trying to be fair, we also compared the CNN and the filter at their best performance (i.e. looking for the threshold providing the best match with the ground truth). This is shown in Fig.3A. There are no differences between methods, indicating that the CNN meets the gold standard provided the filter is optimized. Note again that this is only possible if the ground truth is known, because optimization is based on looking for the best threshold per session.

      Importantly, both methods reach their best performance at the expert’s limit (gray band in Fig.3A,B). They cannot be better than the individual ground truth. This is why we advocate for community tagging collaborations to consolidate sharp-wave ripple definitions.

      Moreover, the mean performance of the filter across thresholds appears dramatically dampened by its performance on particularly poor thresholds (Fig. F, I, weak traces). How realistic these poorly tested thresholds are is unclear. The single direct statistical test of difference in performance is presented in Fig. 1H but it is unclear if there is a real difference there as graphically it appears that animals and sessions from those animals were treated as independent samples (and comparing only animal averages or only sessions clearly do not show a significant difference).

      Please note that this refers to online detection. We are not sure we understand the comment on whether the thresholds are realistic. To clarify, we detect SWR online using thresholds that we optimize similarly for the filter and the CNN over the course of the experiment. This is reported in Fig.2C both per session and per animal, reaching statistical significance in both cases (we added more experiments to increase statistical power). Since thresholds defined online may still not be the best, we then annotated these data and ran an additional post hoc offline optimization analysis, which is presented in Fig.2D. We hope this is now clearer in the revised version.

      Finally, the authors show in Fig. 2A that for the best threshold the CNN does not do better than the filter. Together, these results suggest that the CNN does not generally outperform the filter in detecting SWRs, but only that it is less sensitive to usage of extreme thresholds.

      We hope this is now clarified. See our response to your first bullet point.

      Indeed, I am not convinced that a non-spectral method could even theoretically do better than a spectral method to detect events that are defined by their spectrum, assuming all other aspects are optimized (such as combining data from different channels and threshold setting)

      As can be seen in the responses to the editor synthesis, we have optimized the filter parameters similarly (new Fig.1-supp-1D) and there is no improvement from using more channels (see below). In any case, we would like to emphasize that we do not advocate at all for disusing filters. We feel that a combination of methods is required to improve our understanding of the complex electrophysiological processes underlying SWR.

      • The CNN network is trained on data from 8 channels but it appears that the compared filter is run on a single channel only. This is explicitly stated for the online SWR detection and presumably, that is the case for the offline as well. This unfair comparison raises the possibility that whatever improved performance the CNN may have may be due to considerably richer input and not due to the CNN model itself. The authors state that a filter on the data from a single channel is the standard, but many studies use various "consensus" heuristics, e.g. in which elevated ripple power is required to be detected on multiple channels simultaneously, which considerably improves detection reliability. Even if this weren't the case, because the CNN learns how to weight each channel, to argue that better performance is due to the nature of the CNN it must be compared to an algorithm that similarly learns to optimize these weights on filtered data across the same number of channels. It is very likely that if this were done, the filter approach would outperform the CNN as its performance with a single channel is comparable.

      We appreciate this comment. Using one channel to detect SWR is very common for offline detection followed by manual curation. In some cases, a second channel is used either to veto spurious detections (using a non-ripple channel) or to confirm detection (using a second ripple channel and/or a sharp-wave) (Fernandez-Ruiz et al., 2019). Many others use detection of population firing together with the filter to identify replay (such as in Grosmark and Buzsaki 2016, where ripples were conditioned on the coincidence of both population firing and LFP detected ripples). To address this comment, we compared performance using different combinations of channels, from the standard detection at the SP layer (pyr) up to 4 and 8 channels around SP using the consensus heuristics. As can be seen, filter performance is consistent across configurations, and using 8 channels does not improve detection. We clarify this in the revised version: “We found no effect of the number of channels used for the filter (1, 4 and 8 channels), and chose that with the higher ripple power” (see caption of Fig.1-supp-1D).
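
      A "consensus" heuristic of the kind mentioned above can be sketched as requiring the ripple-band envelope to exceed a threshold on several channels at once. This is a minimal sketch under stated assumptions, not the detection code used in the study: the function name `consensus_detect`, the z-scored envelope inputs, the threshold of 3 and the toy values are all hypothetical.

```python
def consensus_detect(envelopes, threshold=3.0, min_channels=2):
    """Return (start, end) sample intervals (end exclusive) where at least
    `min_channels` channels exceed `threshold` at the same time.

    `envelopes` is a list of per-channel, z-scored ripple-band envelopes of
    equal length (assumed to be computed beforehand by band-pass filtering).
    """
    n_samples = len(envelopes[0])
    events, start = [], None
    for t in range(n_samples):
        n_above = sum(channel[t] > threshold for channel in envelopes)
        if n_above >= min_channels:
            if start is None:
                start = t  # a new candidate event begins
        elif start is not None:
            events.append((start, t))  # the event ends at the first sub-threshold sample
            start = None
    if start is not None:
        events.append((start, n_samples))
    return events


# Toy envelopes for two channels.
ch1 = [0, 4, 4, 4, 0, 0, 5, 0]
ch2 = [0, 0, 4, 4, 0, 0, 0, 0]
single_channel_events = consensus_detect([ch1, ch2], min_channels=1)
consensus_events = consensus_detect([ch1, ch2], min_channels=2)
```

      With `min_channels=1` the isolated threshold crossing on the first channel is kept, whereas the two-channel consensus discards it.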

      • Related to the point above, for the proposed CNN model to be a useful tool in the neuroscience field it needs to be amenable to the kind of data and computational resources that are common in the field. As the network requires 8 channels situated in close proximity, the network would not be relevant for numerous studies that use fewer or spaced channels. Further, the filter approach does not require training and it is unclear how generalizable the current CNN model is without additional network training (see below). Together, these points raise the concern that even if the CNN performance is better than a filter approach, it would not be usable by a wide audience.

      Thank you for this comment. To handle different input channel configurations, we have developed an interpolation approach that transforms any data into 8-channel inputs. We are currently applying the CNN without retraining to data from several labs using different electrode numbers and configurations, including tetrodes, linear silicon probes and wires. Results confirm the performance of the CNN. Since we cannot disclose these third-party data here, we have looked for a new dataset from our own lab to illustrate the case. See below results from 16ch silicon probes (100 um inter-electrode separation), where the CNN performed better than the filter (F1: p=0.0169; Precision, p=0.0110; 7 sessions, from 3 mice). We found that the performance of the CNN depends on the laminar LFP profile, as Neuropixels data illustrate.
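
      The response does not specify how the interpolation is done. One simple possibility, shown here purely as an assumption, is linear interpolation along the channel axis so that any number of ordered channels is resampled to 8. The function name `interpolate_channels` and the toy data are hypothetical.

```python
def interpolate_channels(lfp, n_out=8):
    """Linearly resample `lfp` (a list of channels, each a list of samples,
    ordered along the probe) to `n_out` evenly spaced virtual channels."""
    if n_out < 2:
        raise ValueError("n_out must be at least 2")
    n_in = len(lfp)
    if n_in == 1:
        # A single channel carries no spatial profile: replicate it.
        return [list(lfp[0]) for _ in range(n_out)]
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)  # fractional position on the input grid
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo  # weight of the deeper neighbour
        out.append([(1 - w) * a + w * b for a, b in zip(lfp[lo], lfp[hi])])
    return out


# Toy example: two channels, two samples each, expanded to 8 channels.
lfp2 = [[0.0, 0.0], [7.0, 14.0]]
lfp8 = interpolate_channels(lfp2)
```

      The first and last virtual channels reproduce the original recordings, and the intermediate ones are weighted averages of their two nearest neighbours.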

      • A key point is whether the CNN generalizes well across new datasets as the authors suggest. When the model trained on mouse data was applied to rat data from Grosmark and Buzsaki, 2016, precision was low. The authors state that "Hence, we evaluated all False Positive predictions and found that many of them were actually unannotated SWR (839 events), meaning that precision was actually higher". How were these events judged as SWRs? Was the test data reannotated?

      We apologize for not explaining this better in the original version. We chose Grosmark and Buzsaki 2016 because it provides an “incomplete ground truth”, since (citing their Methods) “Ripple events were conditioned on the coincidence of both population synchrony events, and LFP detected ripples”. This means there are LFP ripples not included in their GT. This dataset provides a very good example of how the experimental goal (examining replay and thus relying on population firing plus LFP definitions) may limit the ground truth.

      Please, note we use the external dataset for validation purposes only. The CNN model was applied without retraining, so it also helps to exemplify generalization. Consistent with a partial ground truth, the CNN and the filter recalled most of the annotated events, but precision was low. By manually validating False Positive detections, we re-annotated the external dataset and both the CNN and the filter increased precision.
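
      The effect of a partial ground truth on precision can be made concrete with a toy calculation. This is a minimal sketch, not the authors' evaluation code: the overlap-based matching rule, the function names `overlaps` and `score`, and all event times are assumptions chosen for illustration.

```python
def overlaps(a, b):
    """True if intervals a = (start, end) and b = (start, end) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def score(detections, ground_truth):
    """Precision, recall and F1 using interval overlap as the matching rule."""
    matched_det = sum(any(overlaps(d, g) for g in ground_truth) for d in detections)
    matched_gt = sum(any(overlaps(g, d) for d in detections) for g in ground_truth)
    precision = matched_det / len(detections) if detections else 0.0
    recall = matched_gt / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical event times in seconds.
detections = [(1.02, 1.08), (3.00, 3.05), (5.01, 5.09), (7.00, 7.04)]
partial_gt = [(1.00, 1.10), (5.00, 5.10)]
# Manual validation finds that one "false positive" was an unannotated SWR.
reannotated_gt = partial_gt + [(3.00, 3.06)]

p_before, r_before, f1_before = score(detections, partial_gt)
p_after, r_after, f1_after = score(detections, reannotated_gt)
```

      In this toy case recall is 1.0 with both ground truths, while precision rises from 0.5 to 0.75 once the unannotated event is added, which mirrors the pattern described in the response.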

      To make the case clearer, we now include more sessions to increase the data size and test for statistical effects (Fig.3E). We also changed the example to show more cases of re-annotated events (Fig.3D). We have clarified the text: “In that work, SWR detection was conditioned on the coincidence of both population synchrony and LFP definition, thus providing a “partial ground truth” (i.e. SWR without population firing were not annotated in the dataset).” (see page 7).

      • The argument that the network improves with data from multiple experts while the filter does not requires further support. While Fig. 1B shows that the CNN improves performance when the experts' data is combined and the filter doesn't, the final performance on the consolidated data does not appear better in the CNN. This suggests that performance of the CNN when trained on data from single experts was lower to start with.

      This comment refers to the new Fig.3B. We apologize for not having included a between-method comparison in the original version. To address this, we now include a one-way ANOVA for the effect of the type of ground truth on each method, and an independent one-way ANOVA for the effect of the method in the consolidated ground truth. To increase statistical power we have added more data. We also detected a mistake with duplicated data in the original figure, which has been corrected. Importantly, the rationale behind the experts’ consolidated data is that there is about 70% consistency between experts, and so many SWR remain unannotated in the individual ground truths. These are typically ambiguous events that may generate discussion between experts, such as sharp-waves with population firing and few ripple cycles. Because the CNN is better at detecting these events, its performance improves when data from multiple experts are integrated.

      Further, regardless of the point in the bullet point above, the data in Fig. 1E does not convincingly show that the CNN improves while the filter doesn't as there are only 3 data points per comparison and no effect on F1.

      Fig.1E shows an example, so we guess the reviewer refers to the new Fig.2C, which shows data on online operation, where we originally reported the analysis per session and per animal separately with only 3 mice. We have run more experiments to increase the data size and test for statistical effects (8 sessions, 5 mice; per session p=0.0047; per mouse p=0.033; t-test). This is now corrected in the text and Fig.1C, caption. Please note that a post hoc offline evaluation of these online sessions confirmed better performance of the CNN versus the filter, for all normalized thresholds (Fig.2D).

      • Apart from the points above regarding the ability of the network to detect SWRs, the insight into the nature of SWRs that the authors suggest can be achieved with CNNs is limited. For example, the data in Fig. 3 is a nice analysis of what the components of the CNN learn to identify, but the claim that "some predictions not consistent with the current definition of SWR may identify different forms of population firing and oscillatory activities associated to sharp-waves" is not thoroughly supported. The data in Fig. 4 is convincing in showing that the network better identifies SWRs than non-SWRs, but again the insight is about the network rather than about SWRs.

      In the revised version, we now include validation of all false positives detected by the CNN and the filter (Fig.4F). To help the reader examine examples of True Positive and False Positive detections, we also include a new figure (Fig.5), which comes with executable code (see page 9). We also include comparisons of the features of TP events detected by both methods (Fig.2B), which show that SWR events detected by the CNN exhibited features more similar to those of the ground truth (GT) than those detected by the filter. We feel the entire manuscript provides support for these claims.

      Finally, the application of the model on Neuropixels data also nicely demonstrates the applicability of the model on this kind of data but does not provide new insight regarding SWRs.

      We respectfully disagree. Please note that the application to ultra-dense Neuropixels recordings not only applies the model to an entirely new dataset without retraining, but also shows that some SWRs with larger sinks and sources can actually be detected at input layers (SO, SR and SLM). Importantly, those events result in different firing dynamics, providing mechanistic support for heterogeneous behavior underlying, for instance, replay.

      In summary, the authors have constructed an elegant new computational tool and convincingly shown its validity in detecting SWRs and applicability to different kinds of data. Unfortunately, I am not convinced that the model convincingly achieves either of its stated goals: exceeding the performance of SWR detection or providing new insights about SWRs as compared to considerably simpler and more accessible current methods.

      We thank you again for your constructive comments. We hope you are now convinced of the value of the new method in light of the newly added data.

    1. Author Response

      Reviewer #1 (Public Review):

      This paper by Zhuang and colleagues seeks to answer an important clinical question by trying to come up with novel predictive biomarkers to predict high-risk T1 colorectal cancers that are at risk for nodal involvement. The current clinical features may both miss patients who underwent local therapy and who should have gone on to have surgery and patients for whom surgery was done based on risk features but perhaps unnecessarily. Using a training and validation set, they developed a protein-based classifier with an AUC of 0.825 based on mass spec analyses and proteomic analyses of patients with and without LN importantly linking biological rationale to the proteomic discoveries.

      In the training cohort, they took 105 candidate proteins reduced to 55, and did a validation in the training cohort first and then in two validation cohorts (one of which was prospective). They also looked at a 9-protein classifier which also performed well and furthermore looked at IHC for clinical ease.

      We thank the reviewer for the positive review and valuable comments. We have revised the manuscript according to the comments.

      Reviewer #2 (Public Review):

      The authors utilized a label-free LC-MS/MS analysis in formalin-fixed paraffin-embedded (FFPE) tumors from 143 LNM-negative and 78 LNM-positive patients with T1 CRC to identify protein biomarkers to determine LNM in T1 CRC.

      The authors used a fair number of clinical samples for the proteomics investigation. The experimental design is reasonable, and the statistical methods used in this manuscript are solid.

      The authors largely achieved their aims and the results supported their conclusion. The method used in this proteomic study can also be used for the proteomics analysis of other cancer types to identify diagnostic and prognostic biomarkers. In addition, the 9-marker panel has a potential clinical diagnosis practice in determining LNM in T1 CRC.

      Nevertheless, the authors need to justify their standards in selecting the biomarkers. For example, a p-value cut-off of 0.1 is not a usual criterion in similar proteomic studies. In addition, an identification frequency of 30% in patients seems not preferable for biomarker identification. The authors also need to justify the definition of fold change in the three subtypes with the Kruskal-Wallis test. The authors need to describe more details on how they identified the 13 proteins from a 55-protein database. In addition, what is the connection between the final 9 proteins and the 19 proteins? What is the criterion to select 5 proteins for IHC validation from the 9 proteins?

      We thank the reviewer for the positive review and valuable comments. We have revised the manuscript according to the comments.

      The criteria and details of our biomarker selection standards are as follows.

      1) About p-value cut-off of 0.1:

      The purpose of this step is to screen appropriate variables for subsequent machine learning, rather than to compare differences between groups. A p-value cut-off of 0.1 is also a reliable strategy for variable selection in proteomics research. For example, it has been used in studies predicting the response to tumor necrosis factor-α inhibitors in rheumatoid arthritis (PMID: 28650254), in research on the circadian clock in mouse liver (PMID: 29674717), in proteomic biomarker discovery in atherosclerosis (PMID: 15496433), and in proteomic and transcriptomic analysis of Bacillus subtilis (PMID: 19948795).

      Based on the reviewer’s suggestion, we also used a p-value cut-off of 0.05 to screen for variables. In a training set of 70 lymph node-negative and 62 lymph node-positive cases, we identified 355 protein markers. We then incorporated these proteins into a lasso regression analysis and ultimately developed a lymph node metastasis prediction model consisting of 52 protein markers. We validated the model in VC1 and VC2; the AUC values were 1.000, 0.824, and 0.918 for the training set, VC1, and VC2, respectively. The predictive performance was slightly inferior to that of the model developed in this study (Figure 3- figure supplement 1C).

      2) About identification frequency of 30%:

      Restricting the analysis to proteins identified in > 30% of the samples has been applied in previously published studies. For instance, in a study that used proteomic biomarkers to build a diagnostic model in lung cancer (PMID: 29576497), proteins identified in > 30% of cohort samples were used for downstream analysis. In a study on the impact of Reptin on protein-protein interactions (PMID: 30862565), proteins were required to be detected in > 30% of samples in order to be included in the proteome dataset.

      We compared our cohort with the studies of Jun Qin et al. and of Bing Zhang et al. published in Nature (PMID: 25043054), according to the number of proteins detected in more than 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% of samples, respectively (Figure 2- figure supplement 1). The proportions of proteins detected at each sample cutoff in the three cohorts were: 10% (0.60, 0.94, 0.48), 20% (0.52, 0.83, 0.38), 30% (0.46, 0.75, 0.31), 40% (0.41, 0.69, 0.26), 50% (0.37, 0.63, 0.23), 60% (0.33, 0.57, 0.18), 70% (0.29, 0.52, 0.15), 80% (0.25, 0.45, 0.11), 90% (0.19, 0.37, 0.11), and 100% (0.07, 0.23, 0.10). The results showed that our cohort was reliable.
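
      The cutoff comparison above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `fraction_detected`, the toy matrix, and the convention that `None` marks a sample in which a protein was not detected are all hypothetical.

```python
def fraction_detected(matrix, cutoff):
    """Fraction of proteins detected in more than `cutoff` (0-1) of samples.

    `matrix` maps a protein name to its per-sample values; None marks a
    sample in which the protein was not detected (an assumption made here).
    """
    kept = 0
    for values in matrix.values():
        detected = sum(v is not None for v in values)
        if detected / len(values) > cutoff:
            kept += 1
    return kept / len(matrix)


# Hypothetical toy data: 4 proteins measured in 4 samples.
toy = {
    "P1": [1.0, 2.0, 3.0, 4.0],       # detected in 100% of samples
    "P2": [1.0, None, 3.0, None],     # detected in 50%
    "P3": [None, None, None, 2.0],    # detected in 25%
    "P4": [None, None, None, None],   # never detected
}

for cutoff in (0.1, 0.3, 0.5):
    print(cutoff, fraction_detected(toy, cutoff))
```

      The comparison is strict ("more than"), so a protein seen in exactly 50% of samples is excluded at the 50% cutoff; a different convention would shift the proportions slightly.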

      To investigate the impact of the protein identification frequency cutoff in our study, we performed comparative pathway enrichment analysis of the differentially expressed proteins (LNM+ vs. LNM-: p-value < 0.05, Wilcoxon rank-sum test) under different detection thresholds: proteins detected in more than 10%, more than 30%, and more than 50% of samples. The three thresholds (10%, 30% and 50%) yielded similar pathway enrichment: the mTOR signaling pathway and amino acid metabolism pathways were dominant in LNM-negative patients, whereas coagulation cascades and lipid metabolism pathways were overrepresented in LNM-positive patients (Figure 2- figure supplement 1).

      Based on the reviewer’s suggestion, we also used a cutoff of 50% identification frequency for variables. Lasso regression was carried out in the training cohort (70 LNM-negative and 62 LNM-positive), with an AUC of 0.999. The model was validated in VC1 and VC2, with AUCs of 0.812 and 0.886, respectively (Figure 2- figure supplement 1).

      3) About identification of the 13 proteins and the criterion to select 5 proteins for IHC validation from 55-protein database:

      The process of reducing the number of proteins from 55 to 13 and finally establishing a 5-molecule classifier based on the IHC score is shown in Figure 1- figure supplement 2 in the revision. From the 55 proteins, we first selected 19 proteins with log2FC > 1 or < -1 and p<0.05 (Wilcoxon rank-sum test) between the LNM-negative and LNM-positive groups in 221 patients. We then searched for antibodies to these 19 proteins and finally obtained 13 antibodies for immunohistochemistry. We performed immunohistochemical staining of the FFPE samples with the 13 antibodies and obtained the IHC score of each protein to build single-molecule prediction models using ROC curves in SPSS. Because the principles of MS-based proteomics and IHC staining differ, not all identified proteins can be translated into IHC markers. Finally, 5 IHC markers with an IHC score p-value below 0.05 (Student’s t-test) were selected to build the IHC classifier using logistic regression. We also updated the description in the “Result” section of the revised manuscript (line 718-722, page 34-35 in the revision).
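
      The first filtering step described above (log2FC > 1 or < -1 together with p<0.05) can be sketched as follows. This is only an illustration: the function `select_candidates`, the toy values, and the choice to compute fold change as a ratio of group means are assumptions, and the p-values are taken as already computed (for example by a Wilcoxon rank-sum test run elsewhere).

```python
import math


def select_candidates(stats, lfc_cutoff=1.0, p_cutoff=0.05):
    """Return proteins with |log2FC| > lfc_cutoff and p < p_cutoff.

    `stats` maps a protein name to (mean in LNM+, mean in LNM-, p-value).
    The p-values are assumed to be precomputed.
    """
    selected = []
    for name, (mean_pos, mean_neg, p_value) in stats.items():
        log2fc = math.log2(mean_pos / mean_neg)
        if abs(log2fc) > lfc_cutoff and p_value < p_cutoff:
            selected.append(name)
    return sorted(selected)


# Hypothetical toy statistics for four proteins.
toy_stats = {
    "A": (8.0, 2.0, 0.01),   # log2FC = 2, significant: kept
    "B": (1.0, 4.0, 0.03),   # log2FC = -2, significant: kept
    "C": (3.0, 2.0, 0.001),  # log2FC below 1: dropped
    "D": (8.0, 1.0, 0.20),   # log2FC = 3 but p too large: dropped
}

print(select_candidates(toy_stats))
```

      Both criteria must hold at once, so a large fold change with a weak p-value (protein D) and a strong p-value with a small fold change (protein C) are both excluded.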

      4) About the connection between the final 9 proteins and the 19 proteins:

      To facilitate the clinical translation of the model, multiple logistic regression was used to obtain 9 core proteins from the 19 proteins (Figure 1- figure supplement 2 in the revision). We first performed logistic regression on the 19 proteins, eliminated the 10 proteins with non-significant coefficients (Pr(>|z|) > 0.05), and retained the 9 proteins with Pr(>|z|) < 0.05. We then ran binary logistic regression again with the 9 proteins to build the simplified classifier. We also updated the description in the “Materials and methods” section of the revised manuscript (line 1092, page 51 in the revision).

      5) About the definition of fold change in the three subtypes with the Kruskal-Wallis test:

      The fold change in the three subtypes is the ratio of the mean expression in each group (well to moderately differentiated adenocarcinoma, poorly differentiated adenocarcinoma and mucinous adenocarcinoma) to the mean of the other two groups. The Kruskal-Wallis test was performed across the three subtypes.
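
      A minimal sketch of this one-versus-rest fold change is given below. The function name, group labels and values are hypothetical, and one point is an assumption: the "mean of the other two groups" is computed here over the pooled samples of those groups, whereas averaging the two group means would be an equally plausible reading and gives a different value when group sizes differ.

```python
from statistics import mean


def one_vs_rest_fold_change(groups):
    """Ratio of each group's mean to the mean of all samples in the other groups.

    `groups` maps a subtype label to a list of expression values for one protein.
    Pooling the other groups' samples is an assumption of this sketch.
    """
    result = {}
    for name, values in groups.items():
        rest = [v for other, vals in groups.items() if other != name for v in vals]
        result[name] = mean(values) / mean(rest)
    return result


# Hypothetical expression values for a single protein.
toy_groups = {
    "well_to_moderate": [2.0, 4.0],
    "poor": [6.0, 6.0],
    "mucinous": [12.0, 12.0],
}

fold_changes = one_vs_rest_fold_change(toy_groups)
for subtype, fc in fold_changes.items():
    print(subtype, round(fc, 3))
```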

      We also updated the description in the “Result” section in the revised manuscript (line 506-517, page 25 in the revision), and “Figure 1- figure supplement 2H in the revision”.

      Reviewer #3 (Public Review):

      This work provides a proteomic analysis of 132 early-stage (pT1) colorectal cancers (CRC) to attempt to identify proteins (or a signature pattern thereof) that might be used to predict the patient risk of lymph node metastases (LNM) and potentially stratify patients for further treatment or surveillance. The generated dataset is extensive and the methods appear solid. The work identifies a 55-protein signature that is strongly predictive of LNM in the training cohort and two validation cohorts and then generates two simplified classifiers: a 9-protein proteomic and a 5-protein immunohistochemical classifier. These also perform very well in predicting LNM. Loss of the small GTPase RHOT2 is identified as a poor prognostic factor and validated in a migration assay. The findings could allow better prognostication in CRC and, if confirmed and better validated and contextualized, might impact patient care.

      Strengths:

      A large training cohort of resected early-stage (pT1M0) CRCs was analyzed by rigorous methods including careful quantitative analysis. The data generated are unbiased and potentially useful. A number of proteins are found to be different between CRCs with and without lymph node metastases, which are used to train a machine learning model that performs flawlessly in predicting LNM in the training cohort and very well in predicting LNM in two validation cohorts. The authors then develop two simplified classifiers that might be more readily extended into clinical care: a 9-protein proteomic assay and a 5-protein immunohistochemical assay; both of these also perform well in predicting LNM. Because LNM is a key prognostic factor, and colectomy (which includes removal of lymph nodes needed to assess LNM) carries significant risk and morbidity, particularly in rectal cancer, classifiers like these are potentially interesting. Finally, the authors identify the loss of expression of RHOT2 as a novel prognostic factor.

      Weaknesses:

      Major points:

      The data are limited by a number of assumptions about metastasis, minimal contextualization of the results, and claims that are too strong given the data. Critically, the authors use the presence or absence of LNM as the study's only outcome; while LNM is a key predictor in CRC, it is uncommon in T1 CRC (generally 3-10%, 12% in this study), stochastic, inefficient, and incompletely identified by histologic evaluation. Larger resection (here, colectomy) removes both identified and occult LNM, which is probably best studied in randomized trials of lymphadenectomy in Japanese gastric cancer cohorts and should be better discussed. Critically, patient survival or disease-free survival would be more relevant outcomes. Further, absent longer-term data, many patients without identified LNM might nonetheless be high-risk and skew the cohorts. It is also not clear whether these findings would be generalizable to other early-stage colon cancers.

      The data are also not correlated with the genetics of the cases, which were not discussed.

      The results would benefit from the inclusion of standard-of-care MSI status. The classifiers would also be much more impactful if they were generalizable beyond T1 CRCs; this could be readily tested in public datasets.

      The authors explain the data as mechanistic, but, aside from one experiment modulating RHOT2 levels, they are fundamentally correlative and should be described as such.

      Although they focused on areas containing >80% tumor as judged by the reading pathologist, it is unclear whether the identified proteomic changes originate from the tumor or the microenvironment.

      The authors fail to properly contextualize the results or overstate the novelty of their study. A number of examples - the study is claimed as "the first proteomic study of T1 CRC" and "the first comprehensive proteomics study to focus on LNM in patients with submucosal T1 CRCs"; neither of these appears to be true, for example, Steffen et al. (Journal of Proteome Research, 2021, reference 18) may satisfy both of these, although the numbers are smaller. Many other results are reported without context, for example, proteomic characterization of mucinous carcinomas has been performed previously, a modest correlation in mucinous carcinoma is ascribed a large mechanistic role, and PDPN is discussed but is not contextualized as a protein that has been well-studied in the context of metastasis.

      The data on RHOT2 are promising but very preliminary. RHOT2 is described as ubiquitous in colorectal cancer cell lines; a brief search in Human Protein Atlas shows RHOT2 RNA and proteins are ubiquitously expressed throughout the body. While its loss appears potentially prognostic, it is unclear whether this is simply a surrogate for other features, such as loss of differentiation state, and whether this is unique to CRC; multivariate analysis would be important.

      We thank the reviewer for the constructive and insightful comments, which helped to improve the quality of this manuscript. We summarize the reviewer’s comments as follows: (1) lack of longer-term data and micrometastasis; (2) testing the classifier in public datasets; (3) inclusion of standard genetics and gene alterations; (4) the tumor purity of all tumor samples and whether the results were influenced by the tumor microenvironment; (5) contextualizing the results; (6) multivariate analysis of RHOT2.

      1) Lack of longer-term data and micrometastasis:

      We thank the reviewer for the comments. We fully acknowledge the limitations of our study, including the uncertainty associated with the detection of lymph node micrometastasis and the lack of long-term survival data, which can impact the strength of our conclusions. We agree that LNM is a key predictor in CRC and that it is uncommon in T1 CRC, with a reported incidence of 3-10%. We acknowledge that larger resections, such as colectomy, are generally recommended for patients with T1 CRC with LNM due to the potential risk of metastasis. However, our study aimed to establish a predictive model for LNM in T1 CRC, which could potentially help guide clinical decision-making on whether additional surgery is needed after endoscopic resection, according to the current NCCN guidelines.

      We have taken the following steps to address these limitations:

      • We used propensity-score matching to reduce confounding biases in our training cohort, and patients were prospectively enrolled in our validation cohort, which was designed as a single-blinded prospective study to enhance the rigor and reliability of our findings.

      • Regarding the influence of micrometastases in our study: following the reviewer's suggestion, we discussed the reports on lymph node micrometastases in Japanese gastric cancer cohorts (PMID: 17377930, 9070482) and also consulted articles on micrometastases in T1 CRC (PMID: 17661146, 16412600). About 5% of pT1N0 gastric cancer patients have isolated tumor cells (ITCs) in LN, compared with 10% in pT1Nx CRC. The effect of micrometastases (MMs) on prognosis in pT1N0 CRC is still unclear. The presence of ITCs/MMs in LN may explain why nearly 13% (29 of 221) LNM-negative patients were classified into the high-risk group by the prediction model in our study.

      We have also added a section to the “Discussion” of the revised manuscript to discuss the potential impact of these limitations on the interpretation of our findings (line 856-873, page 41 in the revision), as follows:

      “In this study, to ensure the accuracy of LN status of the enrolled patients, the dissected number of LN in all patients including both surgical resection and ESD was more than 12. However, the longer-term follow-up data, including DFS, PFS, etc., are not available, due to limitations in sample collection time and the prognosis of such patients needs to be tracked over long periods of time, and may impact the strength of our conclusions. To address this limitation, we used propensity-score matching to reduce confounding biases in our training cohort. Patients were prospectively enrolled in our validation cohort (VC2), which was designed as a single-blinded prospective study to enhance the rigor and reliability of our findings. Furthermore, the presence of isolated tumor cells (ITCs) or micrometastases (MMs) within regional LN are not considered, due to conventional histopathologic examination cannot detected them. According to previous studies, there were about 5% pT1N0 gastric cancer patients have ITCs in LN, and 10% in pT1Nx CRC. The effect of MMs on prognosis in pT1N0 CRC is still unclear. The present of ITCs/MMs in LN may explain why there are nearly 13% (29 of 221) LNM-negative patients were classified into high-risk group by the prediction model in our study. Our study would provide a valuable database and could help for clinical decision-making in the context of T1 CRC. We will continuously follow the prognosis of the patients, and the ITCs/MMs in LN also need to be further validated in the future studies.”

      In conclusion, we appreciate the reviewer’s comments and acknowledge the limitations of our study. We believe that our study provides valuable insights into the development of a predictive model for LNM in T1 CRC, which could potentially aid in clinical decision-making according to the current NCCN guidelines.

      2) Test the classifier in public datasets:

      Following the reviewer’s suggestions, we tested our classifier in two different public datasets: the colon and rectal cancer study from CPTAC published in Nature (PMID: 25043054) and the metastatic colorectal cancer study published in Cancer Cell (PMID: 32888432). The details are discussed further in “point-to-point responses R3 Q2”.

      3) Standard genetics and gene alterations:

      Following the reviewer’s suggestions, we assessed MSI status and CRC-associated gene mutations (RAS, BRAF and PIK3CA) in our cohort. The details are discussed further in “point-to-point responses R3 Q1”.

      4) The influence of microenvironment:

      We apologize for not explaining this clearly. To address whether the differences between the two groups (LNM+ and LNM-) arise from the tumor microenvironment or from the tumor tissue, we first used xCell (PMID: 29141660) to study the composition of the tumor microenvironment (Figure2-source data 4 in the revision). The overall immune and microenvironment scores did not differ between the LNM-positive and LNM-negative groups (P > 0.05, Wilcoxon rank-sum test) (Figure RL1A). However, when we compared the xCell algorithm-based cell deconvolution results between the LNM-positive and -negative groups, we found that 8 microenvironment-associated cell features differed between the two groups (p<0.05) (Figure RL1B). LNM-positive patients were characterized by chondrocytes and Th1 cells, and the remaining 6 features were all higher in LNM-negative patients, including B cells, cDC, and myocytes. Correspondingly, 7 immune cell markers were also significantly different between the two groups (Log2FC>1 or <-1, P < 0.05, Wilcoxon rank-sum test) (Figure RL1C).

      Secondly, we checked the expression profiles of the signature proteins detected in our study in the Human Protein Atlas (HPA). Among 9404 identified proteins, 7852 (83.4%) have CRC IHC staining data in HPA, and 6249 (79.6%) of these showed medium to high tumor-specific staining in CRC samples (Figure RL1D). Of the signature proteins up-regulated in LNM-positive patients (LNM+ vs. LNM-: log2FC > 1 and p<0.05, Wilcoxon rank-sum test), 76 of 84 (90.5%) have IHC staining data in HPA, and 63 (82.9%) showed medium to high tumor-specific staining in CRC samples (Figure RL1E). Of the signature proteins up-regulated in LNM-negative patients (LNM+ vs. LNM-: log2FC <-1 and p<0.05, Wilcoxon rank-sum test), 72 of 82 (87.8%) have IHC staining data in HPA, and 60 (83.3%) showed medium to high tumor-specific staining in CRC samples (Figure RL1F).

      Finally, we again reviewed all H&E-stained slides of tumor tissues from the patients in the study and added tumor purity values for all patients' tumor samples to Figure1-source data 1. We compared tumor purity between the LNM-positive (average 87.75%) and LNM-negative patients (average 88.27%). There was no difference between the two groups (P = 0.46, Student’s t-test), demonstrating the high purity and quality of the tumor tissues (Figure1-supplementary figure 1J in the revision).

      These results indicate that, in our study, the differences between the LNM-positive and LNM-negative groups are mainly caused by the tumor tissue. However, the tumor microenvironment may also play a critical but indirect role in T1 CRC development and progression.

      Figure RL1. A. Comparison of xCell scores of immune and microenvironment between the LNM-negative group (n= 143) and LNM-positive group (n= 78). B&C. Immune/stromal signatures identified from xCell, together with derived relative abundance of immune and stromal cell types. D, E, F. Identified signature proteins (D), LNM-positive group up-regulated proteins (E) and LNM-negative group up-regulated proteins (F) were mostly validated by HPA IHC Staining Data. G. Barplot for tumor purity between LNM-negative and -positive patients.

      5) Contextualize the results:

      According to the reviewer’s advice, we have made corresponding adjustments in the revised manuscript, for example:

      • “We have made a comprehensive proteomic study of T1 CRC and provides a reliable data source for future research.” (line 342, page 17 in the revision)

      • “Here, we present a comprehensive proteomic study to focus on LNM in patients with submucosal T1 CRCs.” (line 788, page 37 in the revision)

      Regarding results that were reported without context, we have provided supplementary descriptions of the context in the “Result” section of the revised manuscript, for example:

      • “Mucinous adenocarcinoma was considered to be a significant risk factor of LNM in T1 CRC (PMID: 31620912).” (line 498, page 24 in the revision)

      • “Mucinous adenocarcinoma of the colorectal is a lethal cancer with unknown molecular etiology and a high propensity to lymph node metastasis. Previous proteomic studies on mucinous adenocarcinoma have found the proteins associated with treatment response in rectal mucinous adenocarcinoma and mechanisms of metastases in mucinous salivary adenocarcinoma.” (PMID: 34990823, 28249646) (line 534-538, page 26 in the revision)

      • “Previous studies have shown that PDPN expression correlated with LNM in numerous cancers, especially in early oral squamous cell carcinomas (PMID: 21105028).” (line 570, page 27 in the revision)

      6) Multivariate analysis of RHOT2:

      RHOT2 and its paralog RHOT1 play an important role in mitochondrial trafficking (PMID: 16630562). Although the function of RHOT2 in cancer is still unknown, the expression of RHOT1 affects metastasis in a variety of tumors, including pancreatic cancer (PMID: 26101710), gastric cancer (PMID: 35170374), and small cell lung cancer (PMID: 33515563). In addition, previous studies have found that Myc regulation of mitochondrial trafficking through RHOT1 and RHOT2 enables tumor cell motility and metastasis (PMID: 31061095).

      As shown in Figure 4, in the analysis of the previous version, we found that RHOT2 was significantly down-regulated (Log2FC=-1.35; p=0.003, Wilcoxon rank-sum test) in LNM-positive patients compared with LNM-negative patients in our T1 CRC cohort, and that a low level of RHOT2 is associated with lower overall survival of patients with colon cancer in the TCGA cohort. Knockdown of RHOT2 expression markedly enhanced the migration ability of colon cancer cells.

      To further explore the influence of RHOT2 on T1 CRC LNM, in addition to the previous results, we carried out the following analyses, as shown in Figure4 in the revision.

      We first calculated the correlations between the expression of RHOT2 and other proteins in our cohort (Figure 4). In total, 1,508 proteins were significantly correlated with RHOT2 (P < 0.05, Spearman): 1,354 proteins showed a positive correlation (coefficient >0) and 154 proteins a negative correlation (coefficient <0). However, when we performed GSEA on RHOT2-associated proteins to identify biological signatures impacted by RHOT2, most of the enriched pathways (p<0.01) had an NES below 0, meaning that these pathways were mainly enriched in the RHOT2-negatively-correlated group; only “mitochondrion” (GOCC) had a positive correlation (Figure 4). RHOT2 is known to be an important protein in the regulation of mitochondrial dynamics and mitophagy (PMID: 16630562). This result suggests that the involvement of RHOT2 in the regulation of mitochondrial function might contribute to the pathogenesis of metastasis in cancer, especially in early-stage CRC. Consistent with the previous results, the RHOT2-negatively-correlated group was significantly enriched for EMT (HALLMARK) and complement and coagulation cascades pathways. Proteins up-regulated in the LNM-positive group (LNM+ vs. LNM-: Log2FC >0; p<0.05, Wilcoxon rank-sum test) were negatively correlated with RHOT2 (p < 0.05, coefficient<0, Spearman), including CAP2, COL6A3, COL6A2, TNC, DPYSL3, PCOLCE and BGN in the EMT pathway, and GUCY1B3, VWF and F13A1 in the complement and coagulation cascades pathway (Figure 2E, L; Figure 4D in the revision). ECM, focal adhesion and dilated cardiomyopathy (DCM) pathways were also enriched in the negatively correlated group. Degradation of RHOT2 has already been reported to be associated with DCM (PMID: 31455181). Overall, combined with the previous results, RHOT2 may play an important role in T1 CRC LNM (Figure 4D in the revision).

      As the reviewer mentioned, the data on RHOT2 are promising, but our understanding of it is preliminary. More analytical studies and experiments are needed in our future research to understand the specific role and mechanism of RHOT2 in the process of tumor metastasis. In the revision, we discussed these limitations of our research.

    1. Author Response

      Reviewer #1 (Public Review):

      The central claim that the R400Q mutation causes cardiomyopathy in humans require(s) additional support.

      We regret that the reviewer interpreted our conclusions as described. Because of the extreme rarity of the MFN2 R400Q mutation our clinical data are unavoidably limited and therefore insufficient to support a conclusion that it causes cardiomyopathy “in humans”. Importantly, this is a claim that we did not make and do not believe to be the case. Our data establish that the MFN2 R400Q mutation is sufficient to cause lethal cardiomyopathy in some mice (Q/Q400a; Figure 4) and predisposes to doxorubicin-induced cardiomyopathy in the survivors (Q/Q400n; new data, Figure 7). Based on the clinical association we propose that R400Q may act as a genetic risk modifier in human cardiomyopathy.

      To avoid further confusion, we modified the manuscript title to “A human mitofusin 2 mutation can cause mitophagic cardiomyopathy” and provide a more detailed discussion of the implications and limitations of our study (page 11).

      First, the claim of an association between the R400Q variant (identified in three individuals) and cardiomyopathy has some limitations based on the data presented. The initial association is suggested by comparing the frequency of the mutation in three small cohorts to that in a large database gnomAD, which aggregates whole exome and whole genome data from many other studies including those from specific disease populations. Having a matched control population is critical in these association studies.

      We have added genotyping data from the matched non-affected control population (n=861) of the Cincinnati Heart study to our analyses (page 4). The conclusions did not change.

      For instance, according to gnomAD the MFN2 Q400P variant, while not observed in those of European ancestry, has a 10-fold higher frequency in the African/African American and South Asian populations (0.0004004 and 0.0003266, respectively). If the authors data in table one is compared to the gnomAD African/African American population the p-value drops to 0.029262, which would not likely survive correction for multiple comparison (e.g., Bonferroni).

      Thank you for raising the important issue of racial differences in mutant allele prevalence and its association with cardiomyopathy. Sample size for this type of sub-group analysis is limited, but we are able to provide African-derived population allele frequency comparisons for both the gnomAD population and our own non-affected control group.

      As now described on page 4, and just as in the gnomAD population, we did not observe MFN2 R400Q in any Caucasian individuals, either cardiomyopathy or control. Its (heterozygous only) prevalence in African American (AA) cardiomyopathy is 3/674. Thus, the R400Q minor allele frequency of 3/1,345 in AA cardiomyopathy compares to 10/24,962 in African gnomAD, reflecting a statistically significant increase in this specific population group (p=0.003308; Chi2 statistic 8.6293). Moreover, all African American non-affected controls in the case-control cohort were wild-type for MFN2 (0/452 minor alleles).
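
      For readers who wish to check the arithmetic, the sketch below computes a Pearson chi-square statistic without continuity correction on a 2×2 allele-count table. Reading the counts as 3 versus 1,345 alleles in AA cardiomyopathy and 10 versus 24,962 alleles in African gnomAD is an assumption made here, not something the response states explicitly; under that reading the statistic and p-value come out close to the values quoted above. The function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Assumed layout: rows are cohorts, columns are (minor alleles, other alleles).
chi2, p = chi_square_2x2(3, 1345, 10, 24962)
print(round(chi2, 4), round(p, 6))
```

      With an expected count below 1 in one cell, an exact test (for example Fisher's exact test) would be the more conservative choice for counts this small; the sketch only mirrors the chi-square calculation reported.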

      (The source and characteristics of the subjects used by the authors in Table 1 is not clear from the methods.)

      The details of our study cohorts were inadvertently omitted during manuscript preparation. As now reported on pages 3 and 4, the Cincinnati Heart Study is a case-control study consisting of 1,745 cardiomyopathy (1,117 Caucasian and 628 African American) subjects and 861 non-affected controls (625 Caucasian and 236 African American) (Liggett et al Nat Med 2008; Matkovich et al JCI 2010; Cappola et al PNAS 2011). The Houston hypertrophic cardiomyopathy cohort (which has been screened by linkage analysis, candidate gene sequencing or clinical genetic testing) included 286 subjects (240 Caucasians and 46 African Americans) (Osio A et al Circ Res 2007; Li L et al Circ Res 2017).

      Relatedly, evaluation in a knock-in mouse model is offered as a way of bolstering the claim for an association with cardiomyopathy. Some caution should be offered here. Certain mutations that cause a cardiomyopathy in mice when knocked in have not been observed to do so in humans with the same mutation. A recent example is the p.S59L variant in the mitochondrial protein CHCHD10, which causes cardiomyopathy in mice but not in humans (PMID: 30874923). While phenocopy is suggestive, there are differences between humans and mice, which make the correlation imperfect.

      We understand that a mouse is not a man, and as noted above we view the in vitro data in multiple cell systems and the in vivo data in knock-in mice as supportive for, not proof of, the concept that MFN2 R400Q can be a genetic cardiomyopathy risk modifier. As indicated in the following responses, we have further strengthened the case by including results from 2 additional, previously undescribed human MFN2 mutation knock-in mice.

      Additionally, the argument that the Mfn2 R400Q variant causes a dominant cardiomyopathy in humans would be better supported by observing of a cardiomyopathy in the heterozygous Mfn2 R400Q mice and not just in the homozygous Mfn2 R400Q mice.

      We are intrigued that in the previous comment the reviewer warns that murine phenocopies are not 100% predictive of human disease, and in the next sentence he/she requests that we show that the gene dose-phenotype response is the same in mice and humans. And, we again wish to note that we never argued that MFN2 R400Q “causes a dominant cardiomyopathy in humans.” Nevertheless, we understand the underlying concerns and in the revised manuscript we present data from new doxorubicin challenge experiments comparing cardiomyopathy development and myocardial mitophagy in WT, heterozygous, and surviving (Q/Q400n) homozygous Mfn2 R400Q KI mice (new Figure 7, panels E-G). Homozygous, but not heterozygous, R400Q mice exhibited an amplified cardiomyopathic response (greater LV dilatation, reduced LV ejection performance, exaggerated LV hypertrophy) and an impaired myocardial mitophagic response to doxorubicin. These in vivo data recapitulate new in vitro results in H9c2 rat cardiomyoblasts expressing MFN2 R400Q, which exhibited enhanced cytotoxicity (cell death and TUNEL labelling) to doxorubicin associated with reduced reactive mitophagy (Parkin aggregation and mitolysosome formation) (new Figure 7, panels A-D). Thus, under the limited conditions we have explored to date we do not observe cardiomyopathy development in heterozygous Mfn2 R400Q KI mice. However, we have expanded the association between R400Q, mitophagy and cardiomyopathy thereby providing the desired additional support for our argument that it can be a cardiomyopathy risk modifier.

      Relatedly, it is not clear what the studies in the KI mouse prove over what was already known. Mfn2 function is known to be essential during the neonatal period and the authors have previously shown that the Mfn2 R400Q disrupts the ability of Mfn2 to mediate mitochondrial fusion, which is its core function. The results in the KI mouse seem consistent with those two observations, but it's not clear how they allow further conclusions to be drawn.

      We strenuously disagree with the underlying proposition of this comment, which is that “mitochondrial fusion (is the) core function” of mitofusins. We also believe that our previous work, alluded to but not specified, is mischaracterized.

      Our seminal study defining an essential role for Mfn2 for perinatal cardiac development (Gong et al Science 2015) reported that an engineered MFN2 mutation that was fully functional for mitochondrial fusion, but incapable of binding Parkin (MFN2 AA), caused perinatal cardiomyopathy when expressed as a transgene. By contrast, another engineered MFN2 mutant transgene that potently suppressed mitochondrial fusion, but constitutively bound Parkin (MFN2 EE) had no adverse effects on the heart.

      Our initial description of MFN2 R400Q and observation that it exhibited impaired fusogenicity (Eschenbacher et al PLoS One 2012) reported results of in vitro studies and transgene overexpression in Drosophila. Importantly, a role for MFN2 in mitophagy was unknown at that time and so was not explored.

      A major point both of this manuscript and our work over the last decade on mitofusin proteins has been that their biological importance extends far beyond mitochondrial fusion. As introduced/discussed throughout our manuscript, MFN2 plays important roles in mitophagy and mitochondrial motility. Because this central point seems to have been overlooked, we have gone to great lengths in the revised manuscript to unambiguously show that impaired mitochondrial fusion is not the critical functional aspect that determines disease phenotypes caused by Mfn2 mutations. To accomplish this, we’ve re-structured the experiments so that R400Q is compared at every level to two other natural MFN2 mutations linked to a human disease, the peripheral neuropathy CMT2A. These comparators are MFN2 T105M in the GTPase domain and MFN2 M376A/V in the same HR1 domain as MFN2 R400Q. Each of these human MFN2 mutations is fusion-impaired, but the current studies reveal that their spectrum of dysfunction differs in other ways as summarized in Author response table 1:

      Author response table 1.

      We understand that it sounds counterintuitive for a mutation in a “mitofusin” protein to evoke cardiac disease independent of its appellative function, mitochondrial fusion. But the KI mouse data clearly relate the occurrence of cardiomyopathy in R400Q mice to the unique mitophagy defect provoked in vitro and in vivo by this mutation. We hope the reviewer will agree that the KI models provide fresh scientific insight.

      Additionally, the authors conclude that the effect of R400Q on the transcriptome and metabolome in a subset of animals cannot be explained by its effect on OXPHOS (based on the findings in Figure 4H). However, an alternative explanation is that R400Q is a loss-of-function variant but does not act in a dominant negative fashion. According to this view, mice homozygous for R400Q (which have no wild-type copies of Mfn2) lack Mfn2 function and consequently have an OXPHOS defect giving rise to the observed transcriptomic and metabolomic changes. But in the rat heart cell line with endogenous rat Mfn2, exogenous expression of MFN2 R400Q has no effect, as it is loss of function and is not dominant negative.

      Our results in the original submission, which are retained in Figures 1D and 1E and Figure 1 Figure Supplement 1 of the revision, exclude the possibility that R400Q is a functional null mutant for, but not a dominant suppressor of, mitochondrial fusion. We have added additional data for M376A in the revision, but the original results are retained in the main figure panels and a new supplemental figure:

      Figure 1D reports results of mitochondrial elongation studies (the morphological surrogate for mitochondrial fusion) performed in Mfn1/Mfn2 double knock-out (DKO) MEFs. The baseline mitochondrial aspect ratio in DKO cells infected with control (b-gal containing) virus is ~2 (white bar), and increases to ~6 (i.e. ~normal) by forced expression of WT MFN2 (black bar). By contrast, aspect ratio in DKO MEFs expressing MFN2 mutants T105M (green bar), M376A and R400Q (red bars in main figure), R94Q and K109A (green bars in the supplemental figure) is only 3-4. For these results the reviewer’s and our interpretation agree: all of the MFN2 mutants studied are non-functional as mitochondrial fusion proteins.

      Importantly, Figure 1E (left panel) reports the results of parallel mitochondrial elongation studies performed in WT MEFs, i.e. in the presence of normal endogenous Mfn1 and Mfn2. Here, baseline mitochondrial aspect ratio is already normal (~6, white bar), and increases modestly to ~8 when WT MFN2 is expressed (black bar). By comparison, aspect ratio is reduced below baseline by expression of four of the five MFN2 mutants, including MFN2 R400Q (main figure and accompanying supplemental figure; green and red bars). Only MFN2 M376A failed to suppress mitochondrial fusion promoted by endogenous Mfns 1 and 2. Thus, MFN2 R400Q dominantly suppresses mitochondrial fusion. We have stressed this point in the text on page 5, first complete paragraph.

      Additionally, as the authors have shown MFN2 R400Q loses its ability to promote mitochondrial fusion, and this is the central function of MFN2, it is not clear why this can't be the explanation for the mouse phenotype rather than the mitophagy mechanism the authors propose.

      Please see our response #7 above beginning “We strenuously disagree...”

      Finally, it is asserted that the MFN2 R400Q variant disrupts Parkin activation by interfering with MFN2 acting as a receptor for Parkin. The support for this in cell culture, however, is limited. Additionally, there is no assessment of mitophagy in the hearts of the KI mouse model.

      The reviewer may have overlooked the studies reported in original Figure 5, in which Parkin localization to cultured cardiomyoblast mitochondria is linked both to mitochondrial autophagy (LC3-mitochondria overlay) and to formation of mito-lysosomes (MitoQC staining). These results have been retained and expanded to include MFN2 M376A in Figure 6 B-E and Figure 6 Figure Supplement 1 of the revised manuscript. Additionally, selective impairment of Parkin recruitment to mitochondria was shown in mitofusin null MEFs in current Figure 3C and Figure 3 Figure Supplement 1, panels B and C.

      The in vitro and in vivo doxorubicin studies performed for the revision further strengthen the mechanistic link between cardiomyocyte toxicity, reduced parkin recruitment and impaired mitophagy in MFN2 R400Q expressing cardiac cells: MFN2 R400Q-amplified doxorubicin-induced H9c2 cell death is associated with reduced Parkin aggregation and mitolysosome formation in vitro, and the exaggerated doxorubicin-induced cardiomyopathic response in MFN2 Q/Q400 mice was associated with reduced cardiomyocyte mitophagy in vivo, measured with adenoviral Mito-QC (new Figure 7).

      Reviewer #2 (Public Review):

      In this manuscript, Franco et al show that the mitofusin 2 mutation MFN2 Q400 impairs mitochondrial fusion with normal GTPase activity. MFN2 Q400 fails to recruit Parkin and further disrupts Parkin-mediated mitophagy in cultured cardiac cells. They also generated MFN2 Q400 knock-in mice to show the development of lethal perinatal cardiomyopathy, which had an impairment in multiple metabolic pathways.

      The major strength of this manuscript is the in vitro study, which provides a thorough understanding of the characteristics of the MFN2 Q400 mutant with respect to MFN2 function and of its effect on mitochondrial function. However, the in vivo MFN2 Q/Q400 knock-in mice are more troubling given the split phenotype of MFN2 Q/Q400a vs MFN2 Q/Q400n subtypes. Their main findings towards impaired metabolism in mutant hearts fail to distinguish between the two subtypes.

      Thanks for the comments. We do not fully understand the statement that “impaired metabolism in mutant hearts fails to distinguish between the two (in vivo) subtypes.” The data in current Figure 5 and its accompanying figure supplements show that impaired metabolism measured both as metabolomic and transcriptomic changes in the subtypes (orange Q400n vs red Q400a in Figure 5 panels A and D) are reflected in the histopathological analyses. Moreover, newly presented data on ROS-modifying pathways (Figure 5C) suggest that a central difference between Mfn2 Q/Q400 hearts that can compensate for the underlying impairment in mitophagic quality control (Q400n) vs those that cannot (Q400a) is the capacity to manage downstream ROS effects of metabolic derangements and mitochondrial uncoupling. Additional support for this idea is provided in the newly performed doxorubicin challenge experiments (Figure 7), demonstrating that mitochondrial ROS levels are in fact increased at baseline in adult Q400n mice.

      While the data support the conclusion that MFN2 Q400 causes cardiomyopathy, several experiments are needed to further understand mechanism.

      We thank the reviewer for agreeing with our conclusion that MFN2 Q400 can cause cardiomyopathy, which was the major issue raised by R1. As detailed below we have performed a great deal of additional experimentation, including on two completely novel MFN2 mutant knock-in mouse models, to validate the underlying mechanism.

      This manuscript will likely impact the field of MFN2 mutation-related diseases and show how MFN2 mutation leads to perinatal cardiomyopathy in support of previous literature.

      Thank you again. We think our findings have relevance beyond the field of MFN2 mutant-related disease as they provide the first evidence (to our knowledge) that a naturally occurring primary defect in mitophagy can manifest as myocardial disease.

    1. Author Response

      Reviewer #1 (Public Review):

      This work introduces a novel framework for evaluating the performance of statistical methods that identify replay events. This is challenging because hippocampal replay is a latent cognitive process, where the ground truth is inaccessible, so methods cannot be evaluated against a known answer. The framework consists of two elements:

      1) A replay sequence p-value, evaluated against shuffled permutations of the data, such as radon line fitting, rank-order correlation, or weighted correlation. This element determines how trajectory-like the spiking representation is. The p-value threshold for all accepted replay events is adjusted based on an empirical shuffled distribution to control for the false discovery rate.

      2) A trajectory discriminability score, also evaluated against shuffled permutations of the data. In this case, there are two different possible spatial environments that can be replayed, so the method compares the log odds of track 1 vs. track 2.
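      As an illustration of the first element, a sequence score can be compared against shuffled copies of the same event to obtain a Monte Carlo p-value. The sketch below is a simplified example rather than the authors' implementation: it assumes one spike per cell, uses a Spearman rank-order correlation as the sequence score, and permutes which place-field position is assigned to which cell. All function and variable names are hypothetical.

```python
import math
import random


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rank correlation; returns 0.0 if either input is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def shuffle_p_value(field_pos, spike_time, n_shuffles=1000, seed=0):
    """Fraction of shuffles whose |score| is at least the observed |score|."""
    rng = random.Random(seed)
    observed = abs(spearman(field_pos, spike_time))
    shuffled = list(field_pos)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # reassign place-field positions across cells
        if abs(spearman(shuffled, spike_time)) >= observed:
            exceed += 1
    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return (exceed + 1) / (n_shuffles + 1)


# Toy event: ten cells firing in the order of their place-field positions.
positions = [10, 25, 40, 55, 70, 85, 100, 115, 130, 145]
times = [0.01 * i for i in range(10)]
print(shuffle_p_value(positions, times))
```

      With this estimator the smallest attainable p-value is 1/(n_shuffles + 1), so the number of shuffles limits how strict a threshold can be applied.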

      The authors then use this framework (accepted number of replay events and trajectory discriminability) to study the performance of replay identification methods. They conclude that sharp wave ripple power is not a necessary criterion for identifying replay event candidates during awake run behavior if you have high multiunit activity, a higher number of permutations is better for identifying replay events, linear Bayesian decoding methods outperform rank-order correlation, and there is no evidence for pre-play.

      The authors tackle a difficult and important problem for those studying hippocampal replay (and indeed all latent cognitive processes in the brain) with spiking data: how do we understand how well our methods are doing when the ground truth is inaccessible? Additionally, systematically studying how the variety of methods for identifying replay perform is important for understanding the sometimes contradictory conclusions from replay papers. It helps consolidate the field around particular methods, leading to better reproducibility in the future. The authors' framework is also simple to implement and understand and the code has been provided, making it accessible to other neuroscientists. Testing for track discriminability, as well as the sequentiality of the replay event, is a sensible additional data point to eliminate "spurious" replay events.

      However, there are some concerns with the framework as well. The novelty of the framework is questionable as it consists of a log odds measure previously used in two prior papers (Carey et al. 2019 and the authors' own Tirole & Huelin Gorriz, et al., 2022) and a multiple comparisons correction, albeit a unique empirical multiple comparisons correction based on shuffled data.

      With respect to the log odds measure itself, as presented, it is reliant on having only two options to test between, limiting its general applicability. Even in the data used for the paper, there are sometimes three tracks, which could influence the conclusions of the paper about the validity of replay methods. This also highlights a weakness of the method in that it assumes that the true model (spatial track environment) is present in the set of options being tested. Furthermore, the log odds measure itself is sensitive to the defined ripple or multiunit start and end times, because it marginalizes over both position and time, so any inclusion of place cells that fire for the animal's stationary position could influence the discriminability of the track. Multiple track representations during a candidate replay event would also limit track discriminability. Finally, the authors call this measure "trajectory discriminability", which seems a misnomer as the time and position information are integrated out, so there is no notion of trajectory.

      The authors also fail to make the connection with the control of the false discovery rate via false positives on empirical shuffles with existing multiple comparison corrections that control for false discovery rates (such as the Benjamini and Hochberg procedure or Storey's q-value). Additionally, the particular type of shuffle used will influence the empirically determined p-value, making the procedure dependent on the defined null distribution. Shuffling the data is also considerably more computationally intensive than the existing multiple comparison corrections.

      Overall, the authors make interesting conclusions with respect to hippocampal replay methods, but the utility of the method is limited in scope because of its reliance on having exactly two comparisons and having to specify the null distribution to control for the false discovery rate. This work will be of interest to electrophysiologists studying hippocampal replay in spiking data.

      We would like to thank the reviewer for the feedback.

      Firstly, we would like to clarify that it is not our intention to present this tool as a novel replay detection approach. It is indeed merely a novel tool for evaluating different replay detection methods. Also, while we previously used log odds metrics to quantify contextual discriminability within replay events (Tirole et al., 2021), this framework is novel in how it is used (to compare replay detection methods), and the use of empirically determined FPR-matched alpha levels. We have now modified the manuscript to make this point more explicit.

      Our use of the term trajectory-discriminability is now changed to track-discriminability in the revised manuscript, given we are summing over time and space, as correctly pointed out by the reviewer.

      While this approach requires two tracks in its current implementation, we have also been able to apply this approach to three tracks, with a minor variation in the method; however, this is beyond the scope of our current manuscript. Prior experience on other tracks not analysed in the log odds calculation should not pose any issue, given that the animal likely replays many experiences of the day (e.g. the homecage). These “other” replay events likely contribute to candidate replay events that fail to have a statistically significant replay score on either track.

      With regard to using a cell-id randomized dataset to empirically estimate false-positive rates, we have provided a detailed explanation behind our choice of using an alpha level correction in our response to the essential revisions above. This approach is not used to examine the effect of multiple comparisons, but rather to measure the replay detection error due to non-independence and a non-uniform p value distribution. Therefore, we do not believe that existing multiple comparison corrections such as the Benjamini and Hochberg procedure are applicable here (Author response image 1-3). Given the potential issues raised with a session-based cell-id randomization, we demonstrate above that the null distribution is sufficiently independent from the four shuffle-types used for replay detection (the same was not true for a place field randomized dataset) (Author response image 4).
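      For reference, the Benjamini and Hochberg step-up procedure mentioned above can be sketched in a few lines. This is a generic textbook version, not code from the manuscript, and the function name `benjamini_hochberg` is hypothetical. Note that its false discovery rate guarantee is usually stated for independent (or positively dependent) tests, which is the assumption questioned in the response.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis whose p-value is among the k_max smallest.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Toy example with ten p-values.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example, q=0.05))
```

      Unlike a fixed alpha of 0.05, which would accept five of these ten toy p-values, the step-up rule accepts only those that fall below a rank-dependent threshold.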

      Author response image 1.

      Distribution of Spearman’s rank order correlation score and p value for false events with random sequence where each neuron fires one (left), two (middle) or three (right) spikes.

      Author response image 2.

      Distribution of Spearman’s rank order correlation score and p value for mixture of 20% true events and 80% false events where each neuron fires one (left), two (middle) or three (right) spikes.

      Author response image 3.

      Number of true events (blue) and false events (yellow) detected based on alpha level 0.05 (upper left), empirical false positive rate 5% (upper right) and false discovery rate 5% (lower left, based on BH method)

      Author response image 4.

      Proportion of false events detected when using dataset with within and cross experiment cell-id randomization and place field randomization. The detection was based on single shuffle including time bin permutation shuffle, spike train circular shift shuffle, place field circular shift shuffle, and place bin circular shift shuffle.

      Reviewer #2 (Public Review):

      This study proposes to evaluate and compare different replay methods in the absence of "ground truth" using data from hippocampal recordings of rodents that were exposed to two different tracks on the same day. The study proposes to leverage the potential of Bayesian methods to decode replay and reactivation in the same events. They find that events that pass a higher threshold for replay typically yield a higher measure of reactivation. On the other hand, events from the shuffled data that pass thresholds for replay typically don't show any reactivation. While well-intentioned, I think the result is highly problematic and poorly conceived.

      The work presents a lot of confusion about the nature of null hypothesis testing and the meaning of p-values. The prescription arrived at, to correct p-values by putting animals on two separate tracks and calculating a "sequence-less" measure of reactivation, is impractical from an experimental point of view, and unsupportable from a statistical point of view. Many of the observations are presented as solutions for the field, but are in fact highly dependent on distinct features of the dataset at hand. The most interesting observation is that despite the existence of apparent sequences in the PRE-RUN data, no reactivation is detectable in those events, suggesting that in fact they represent spurious events. I would recommend the authors focus on this important observation and abandon the rest of the work, as it has the potential to further befuddle and promote poor statistical practices in the field.

      The major issue is that the manuscript conveys much confusion about the nature of hypothesis testing and the meaning of p-values. It's worth stating here the definition of a p-value: the conditional probability of rejecting the null hypothesis given that the null hypothesis is true. Unfortunately, in places, this study appears to confound the meaning of the p-value with the probability of rejecting the null hypothesis given that the null hypothesis is NOT true, i.e. in their recordings from awake replay on different mazes. Most of their analysis is based on the observation that events that have higher reactivation scores, as reflected in the mean log odds differences, have lower p-values resulting from their replay analyses. Shuffled data, in contrast, does not show any reactivation but can still show spurious replays depending on the shuffle procedure used to create the surrogate dataset. The authors suggest using this to test different practices in replay detection. However, another important point that seems lost in this study is that the surrogate dataset that is contrasted with the actual data depends very specifically on the null hypothesis that is being tested. That is to say, each different shuffle procedure is in fact testing a different null hypothesis. Unfortunately, most studies, including this one, are not very explicit about which null hypothesis is being tested with a given resampling method, but the p-value obtained is only meaningful insofar as the null that is being tested and related assumptions are clearly understood. From a statistical point of view, it makes no sense to adjust the p-value obtained by one shuffle procedure according to the p-value obtained by a different shuffle procedure, which is what this study inappropriately proposes. Other prescriptions offered by the study are highly dataset and method dependent and discuss minutiae of event detection, such as whether or not to require power in the ripple frequency band.

      We would like to thank the reviewer for their feedback. The purpose of this paper is to present a novel tool for evaluating replay sequence detection using an independent measure that does not depend on the sequence score. As the reviewer stated, in this study, we are detecting replay events based on a set alpha threshold (0.05), based on the conditional probability of rejecting the null hypothesis given that the null hypothesis is true. All replay events detected during PRE, RUN or POST are classified as track 1 or track 2 replay events by comparing each event’s sequence score relative to the shuffled distribution. Then, the log odds measure was only applied to track 1 and track 2 replay events selected using sequence-based detection. It’s important to clarify that we never use log odds to select events to examine their sequenceness p value. Therefore, we disagree with the reviewer’s claim that for awake replay events detected on different tracks, we are quantifying the probability of rejecting the null hypothesis given that the null hypothesis is not true.

      However, we fully understand the reviewer’s concerns with a cell-id randomization, and the potential caveats associated with using this approach for quantifying the false positive rate. First of all, we would like to clarify that the purpose of alpha level adjustment was to facilitate comparison across methods by finding the alpha level with matching false-positive rates determined empirically. Without doing this, it is impossible to compare two methods that differ in strictness (e.g. whether two different shuffles are needed rather than a single shuffle procedure). This means we are interested in comparing the performance of different methods at the equivalent alpha level where each method detects 5% spurious events per track rather than an arbitrary alpha level of 0.05 (which is difficult to interpret if statistical tests are run on non-independent samples). Once the false positive rate is matched, it is possible to compare two methods to see which one yields more events and/or has better track discriminability.
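      The idea of an alpha level with a matched false-positive rate can be sketched as follows. This is a minimal illustration under stated assumptions, not the code released with the manuscript: it assumes that a list of p-values has already been obtained by running a detection method on a randomized (e.g. cell-id shuffled) dataset, and it searches a hypothetical grid of candidate thresholds for the largest one at which no more than 5% of those null events are detected. The names `false_positive_rate` and `matched_alpha` are hypothetical.

```python
def false_positive_rate(null_p_values, alpha):
    """Proportion of events from the randomized dataset detected at threshold alpha."""
    return sum(p <= alpha for p in null_p_values) / len(null_p_values)


def matched_alpha(null_p_values, target_fpr=0.05, candidates=None):
    """Largest candidate alpha whose empirical false-positive rate is <= target_fpr.

    Returns None if no candidate threshold is strict enough.
    """
    if candidates is None:
        # Hypothetical grid: 0.001, 0.002, ..., 0.200.
        candidates = [k / 1000 for k in range(1, 201)]
    best = None
    for alpha in sorted(candidates):
        if false_positive_rate(null_p_values, alpha) <= target_fpr:
            best = alpha
    return best


# Toy null p-values skewed towards small values, mimicking a method whose
# p-values are not uniform when applied to randomized data.
null_p = [(i / 100) ** 2 for i in range(1, 101)]
print(false_positive_rate(null_p, 0.05))  # rate at the nominal alpha
print(matched_alpha(null_p))              # stricter alpha that matches 5%
```

      Two methods can then be compared at their respective matched thresholds, in terms of the number of events each detects in the real data and the track discriminability of those events.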

      We agree with the reviewer that the choice of data randomization is crucial. When a null distribution of a randomized dataset is very similar to the null distribution used for detection, this should lead to a 5% false positive rate (as a consequence of circular reasoning). In our response to the essential revisions, we have discussed the effect of data randomization on replay detection. We observed that while place field circularly shifted dataset and cell-id randomized dataset led to similar false-positive rates when shuffles that disrupt temporal information were used for detection, a place field circularly shifted dataset but not a cell-id randomized dataset was sensitive to shuffle methods that disrupted place information (Author response image 4). We would also like to highlight one of our findings from the manuscript that the discrepancy between different methods can be substantially reduced when alpha level was adjusted to match false-positive rates (Figure 6B). This result directly supports the utility of a cell-id randomized dataset in finding the alpha level with equivalent false positive rates across methods. Hence, while imperfect, we argue cell-id randomization remains an acceptable method as it is sufficiently different from the four shuffles we used for replay detection compared to place field randomized dataset (Author response image 4).

      While the use of two linear tracks was crucial for our current framework to calculate log odds for evaluating replay detection, we acknowledge that it limits the applicability of this framework. At the same time, the conclusions of the manuscript with regard to ripples, replay methods, and preplay should remain valid on a single track. A second track just provides a useful control for how place cells can realistically remap within another environment. However, with modification, it may be applied to a maze with different arms or subregions, although this is beyond the scope of our current study.

      Last but not least, we partly agree with the reviewer that the result can be dataset-specific such that the result may vary depending on the animal’s behavioural state and experimental design. However, our results highlight the fact that there is a very wide distribution of both the track discriminability and the proportion of significant events detected across methods that are currently used in the field. And while we see several methods that appear comparable in their effectiveness in replay detection, there are also other methods that are deeply flawed (that have previously been used in peer-reviewed publications) if the alpha level is not sufficiently strict. Regardless of the method used, most methods can be corrected with an appropriate alpha level (e.g. using all spikes for a rank order correlation). Therefore, while the exact result may be dataset-specific, we feel that this is most likely due to the number of cells and properties of the track more than the use of two tracks. Reporting of the empirically determined false-positive rate and use of alpha level with matching false-positive rate (such as 0.05) for detection does not require a second track, and the adoption of this approach by other labs would help to improve the interpretability and generalizability of their replay data.

      Reviewer #3 (Public Review):

      This study tackles a major problem with replay detection, which is that different methods can produce vastly different results. It provides compelling evidence that the source of this inconsistency is that biological data often violates assumptions of independent samples. This results in false positive rates that can vary greatly with the precise statistical assumptions of the chosen replay measure, the detection parameters, and the dataset itself. To address this issue, the authors propose to empirically estimate the false positive rate and control for it by adjusting the significance threshold. Remarkably, this reconciles the differences in replay detection methods, as the results of all the replay methods tested converge quite well (see Figure 6B). This suggests that by controlling for the false positive rate, one can get an accurate estimate of replay with any of the standard methods.

      When comparing different replay detection methods, the authors use a sequence-independent log-odds difference score as a validation tool and an indirect measure of replay quality. This takes advantage of the two-track design of the experimental data, and its use here relies on the assumption that a true replay event would be associated with good (discriminable) reactivation of the environment that is being replayed. The other way replay "quality" is estimated is by the number of replay events detected once the false positive rate is taken into account. In this scheme, "better" replay is in the top right corner of Figure 6B: many detected events associated with congruent reactivation.

      There are two possible ways the results from this study can be integrated into future replay research. The first, simpler, way is to take note of the empirically estimated false positive rates reported here and simply avoid the methods that result in high false positive rates (weighted correlation with a place bin shuffle or all-spike Spearman correlation with a spike-id shuffle). The second, perhaps more desirable, way is to integrate the practice of estimating the false positive rate when scoring replay and to take it into account. This is very powerful as it can be applied to any replay method with any choice of parameters and get an accurate estimate of replay.

      How does one estimate the false positive rate in their dataset? The authors propose to use a cell-ID shuffle, which preserves all the firing statistics of replay events (bursts of spikes by the same cell, multi-unit fluctuations, etc.) but randomly swaps the cells' place fields, and to repeat the replay detection on this surrogate randomized dataset. Of course, there is no perfect shuffle, and it is possible that a surrogate dataset based on this particular shuffle may result in one underestimating the true false positive rate if different cell types are present (e.g. place field statistics may differ between CA1 and CA3 cells, or deep vs. superficial CA1 cells, or place cells vs. non-place cells if inclusion criteria are not strict). Moreover, it is crucial that this validation shuffle be independent of any shuffling procedure used to determine replay itself (which may not always be the case, particularly for the pre-decoding place field circular shuffle used by some of the methods here) lest the true false-positive rate be underestimated. Once the false positive rate is estimated, there are different ways one may choose to control for it: adjusting the significance threshold as the current study proposes, or directly comparing the number of events detected in the original vs surrogate data. Either way, with these caveats in mind, controlling for the false positive rate to the best of our ability is a powerful approach that the field should integrate.
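
      The procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the toy p-values are hypothetical, and it assumes that a per-event p-value has already been computed on a cell-ID shuffled surrogate dataset. The empirical false positive rate is the fraction of surrogate events that pass a given threshold, and the adjusted threshold is the largest one that keeps that fraction at or below a target.

```python
import numpy as np


def empirical_false_positive_rate(shuffled_p_values, alpha):
    """Fraction of surrogate (cell-ID shuffled) events with p <= alpha."""
    shuffled_p_values = np.asarray(shuffled_p_values, dtype=float)
    return float(np.mean(shuffled_p_values <= alpha))


def adjusted_alpha(shuffled_p_values, target_fpr=0.05):
    """Largest observed threshold whose empirical false positive rate
    stays at or below target_fpr (returns 0.0 if none qualifies)."""
    candidates = np.unique(np.asarray(shuffled_p_values, dtype=float))
    best = 0.0
    for alpha in candidates:  # np.unique returns sorted values
        # The empirical rate can only grow with alpha, so stop at the
        # first candidate that exceeds the target.
        if empirical_false_positive_rate(shuffled_p_values, alpha) <= target_fpr:
            best = float(alpha)
        else:
            break
    return best


# Hypothetical surrogate p-values: 4 of 50 shuffled events look "significant".
shuffled = [0.001, 0.002, 0.01, 0.03] + [0.5] * 46
fpr_at_nominal = empirical_false_positive_rate(shuffled, 0.05)
stricter_alpha = adjusted_alpha(shuffled, target_fpr=0.05)
```

      In this toy case the nominal threshold of 0.05 lets through more surrogate events than intended, so the adjusted threshold is stricter. The same adjusted threshold would then be applied to the p-values from the original data.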

      Which replay detection method performed the best? If one does not control for varying false positive rates, there are two methods that resulted in strikingly high (>15%) false positive rates: these were weighted correlation with a place bin shuffle and Spearman correlation (using all spikes) with a spike-id shuffle. However, after controlling for the false positive rate (Figure 6B) all methods largely agree, including those with initially high false positive rates. There is no clear "winner" method, because there is a lot of overlap in the confidence intervals, and there also are some additional reasons for not overly interpreting small differences in the observed results between methods. The confidence intervals are likely to underestimate the true variance in the data because the resampling procedure does not involve hierarchical statistics and thus fails to account for statistical dependencies on the session and animal level. Moreover, it is possible that methods that involve shuffles similar to the cross-validation shuffle ("wcorr 2 shuffles", "wcorr 3 shuffles" both use a pre-decoding place field circular shuffle, which is very similar to the pre-decoding place field swap used in the cross-validation procedure to estimate the false positive rate) may underestimate the false positive rate and therefore inflate adjusted p-value and the proportion of significant events. We should therefore not interpret small differences in the measured values between methods, and the only clear winner and the best way to score replay is using any method after taking the empirically estimated false positive rate into account.

      The authors recommend excluding low-ripple power events in sleep, because no replay was observed in events with low (0-3 z-units) ripple power specifically in sleep, but that no ripple restriction is necessary for awake events. There are problems with this conclusion. First, ripple power is not the only way to detect sharp-wave ripples (the sharp wave is very informative in detecting awake events). Second, when talking about sequence quality in awake non-ripple data, it is imperative for one to exclude theta sequences. The authors' speed threshold of 5 cm/s is not sufficient to guarantee that no theta cycles contaminate the awake replay events. Third, a direct comparison of the results with and without exclusion is lacking (selecting for the lower ripple power events is not the same as not having a threshold), so it is unclear how crucial it is to exclude the minority of the sleep events outside of ripples. The decision of whether or not to select for ripples should depend on the particular study and experimental conditions that can affect this measure (electrode placement, brain state prevalence, noise levels, etc.).

      Finally, the authors address a controversial topic of de-novo preplay. With replay detection corrected for the false positive rate, none of the detection methods produce evidence of preplay sequences nor sequenceless reactivation in the tested dataset. This presents compelling evidence in favour of the view that the sequence of place fields formed on a novel track cannot be predicted by the sequential structure found in pre-task sleep.

      We would like to thank the reviewer for the positive and constructive feedback.

      We agree with the reviewer that the conclusion about the effect of ripple power is dataset-specific and is not intended to be a one-size-fits-all recommendation for wider application. But it does raise a concern that individual studies should address. The criteria used for selecting candidate events will impact the overall fraction of detected events and make the comparison between studies using different methods more difficult. We have updated the manuscript to emphasize this point.

      “These results emphasize that a ripple power threshold is not necessary for RUN replay events in our dataset but may still be beneficial, as long as it does not excessively eliminate too many good replay events with low ripple power. In other words, depending on the experimental design, it is possible that a stricter p-value with no ripple threshold can be used to detect more replay events than using a less strict p-value combined with a strict ripple power threshold. However, for POST replay events, a threshold at least in the range of a z-score of 3-5 is recommended based on our dataset, to reduce inclusion of false-positives within the pool of detected replay events.”

      “We make six key observations: 1) A ripple power threshold may be more important for replay events during POST compared to RUN. For our dataset, the POST replay events with ripple power below a z-score of 3-5 were indistinguishable from spurious events. While the exact ripple z-score threshold to implement may differ depending on the experimental condition (e.g. electrode placement, behavioural paradigm, noise level, etc.) and experimental aim, our findings highlight the benefit of using a ripple power threshold for detecting replay during POST. 2) ”

    1. Author Response

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public Review):

      The authors present a number of deep learning models to analyse the dynamics of epithelia. In this way they want to overcome the time-consuming manual analysis of such data and also remove a potential operator bias. Specifically, they set up models for identifying cell division events and cell division orientation. They apply these tools to the epithelium of the developing Drosophila pupal wing. They confirm a linear decrease of the division density with time and identify a burst of cell division after healing of a wound that they had induced earlier. These division events happen a characteristic time after and a characteristic distance away from the wound. These characteristic quantities depend on the size of the wound.

      Strengths:

      The methods developed in this work achieve the goals set by the authors and are a very helpful addition to the toolbox of developmental biologists. They could potentially be used on various developing epithelia. The evidence for the impact of wounds on cell division is compelling.

      The methods presented in this work should prove to be very helpful for quantifying cell proliferation in epithelial tissues.

      We thank the reviewer for the positive comments!

      Reviewer #2 (Public Review):

      In this manuscript, the authors propose a computational method based on deep convolutional neural networks (CNNs) to automatically detect cell divisions in two-dimensional fluorescence microscopy timelapse images. Three deep learning models are proposed to detect the timing of division, predict the division axis, and enhance cell boundary images to segment cells before and after division. Using this computational pipeline, the authors analyze the dynamics of cell divisions in the epithelium of the Drosophila pupal wing and find that a wound first induces a reduction in the frequency of division followed by a synchronised burst of cell divisions about 100 minutes after its induction.

      Comments on revised version:

      Regarding Reviewer 1's comment on the architecture details, I have now understood that the precise architecture (number/type of layers, activation functions, pooling operations, skip connections, upsampling choice...) might have remained relatively hidden to the authors themselves, as the U-net is built automatically by the fast.ai library from a given classical choice of encoder architecture (ResNet34 and ResNet101 here) to generate the decoder part and skip connections.

      Regarding the Major point 1, I raised the question of the generalisation potential of the method. I do not think, for instance, that the optimal number of frames to use, nor the optimal choice of their time-shift with respect to the division time (t-n, t+m) (not systematically studied here) may be generic hyperparameters that can be directly transferred to another setting. This implies that the method proposed will necessarily require re-labeling, re-training and re-optimizing the hyperparameters which directly influence the network architecture for each new dataset imaged differently. This limits the generalisation of the method to other datasets, and this may be seen as in contrast to other tools developed in the field for other tasks such as cellpose for segmentation, which has proven a true potential for generalisation on various data modalities. I was hoping that the authors would try themselves testing the robustness of their method by re-imaging the same tissue with slightly different acquisition rate for instance, to give more weight to their work.

      We thank the referee for the comments. Regarding this particular biological system, due to photobleaching over long imaging periods (and the availability of imaging systems during the project), we would have difficulty imaging at much higher rates than the 2 minute time frame we currently use. These limitations are true for many such systems, and it is rarely possible to rapidly image for long periods of time in real experiments. Given this upper limit in framerate, we could, in principle, sample this data at a lower framerate by removing time points from the videos, but this typically leads to worse results. With some pilot data, we have tried to use fewer time intervals for our analysis, but they always gave worse results. We found we need to feed the maximum amount of information available into the model to get the best results (i.e. the fastest frame rate possible, given the data available). Our goal is to teach the neural net to identify dynamic space-time localised events from time lapse videos, in which the duration of an event is a key parameter. Our division events take 10 minutes or less to complete; therefore, we used 5 timepoints in the videos for the deep learning model. If we considered another system with dynamic events of duration T, then we would use T/t timepoints, where t is the minimum time interval (for our data t = 2 min). For example, if we could image every minute, we would use 10 timepoints. As discussed below, we do envision that other users with different imaging setups and requirements may need to retrain the model for their own data and, to help with this, we have now provided more detailed instructions on how to do this (see later).
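
      The T/t rule of thumb above can be written as a small helper. This is an illustrative sketch only: the function name is hypothetical, and rounding up when T is not a multiple of t is an assumption that the response does not state.

```python
import math


def timepoints_for_event(event_duration_min, frame_interval_min):
    """Number of frames needed to span one event of duration T imaged
    every t minutes, i.e. T / t (rounded up, which is an assumption)."""
    if event_duration_min <= 0 or frame_interval_min <= 0:
        raise ValueError("duration and interval must be positive")
    return math.ceil(event_duration_min / frame_interval_min)


# The two cases mentioned in the response: 10-minute divisions imaged
# every 2 minutes, and the same events imaged every minute.
frames_at_2_min = timepoints_for_event(10, 2)
frames_at_1_min = timepoints_for_event(10, 1)
```

      With these inputs the helper reproduces the 5 and 10 timepoints quoted above.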

      In this regard, and because the authors claimed to provide clear instructions on how to reuse their method or adapt it to a different context, I delved deeper into the code and, to my surprise, felt that we are far from the coding practice of what a well-documented and accessible tool should be.

      To start with, one has to be relatively accustomed with Napari to understand how the plugin must be installed, as the only thing given is a pip install command (that could be typed in any terminal without installing the plugin for Napari, but has to be typed inside the Napari terminal, which is mentioned nowhere). Surprisingly, the plugin was not uploaded on Napari hub, nor on PyPI by the authors, so it is not searchable/findable directly, one has to go to the Github repository and install it manually. In that regard, no description was provided in the copy-pasted templated files associated to the napari hub, so exporting it to the hub would actually leave it undocumented.

      We thank the referee for suggesting the example of (DeXtrusion, Villars et al. 2023). We have endeavoured to produce similarly-detailed documentation for our tools. We now have clear instructions for installation requiring only minimal coding knowledge, and we have provided a user manual for the napari plug-in. This includes information on each of the options for using the model and the outputs they will produce. The plugin has been tested by several colleagues using both Windows and Mac operating systems.

      Author response image 1.

      Regarding now the python notebooks, one can fairly say that the "clear instructions" that were supposed to enlighten the code are really minimal. Only one notebook "trainingUNetCellDivision10.ipynb" has actually some comments, the other have (almost) none nor title to help the unskilled programmer delving into the script to guess what it should do. I doubt that a biologist who does not have a strong computational background will manage adapting the method to its own dataset (which seems to me unavoidable for the reasons mentioned above).

      Within the README file, we have now included information on how to retrain the models with helpful links to deep learning tutorials (which, indeed, some of us have learnt from) for those new to deep learning. All Jupyter notebooks now include more comments explaining the models.

      Finally regarding the data, none is shared publicly along with this manuscript/code, such that if one doesn't have a similar type of dataset - that must be first annotated in a similar manner - one cannot even test the networks/plugin for its own information. A common and necessary practice in the field - and possibly a longer lasting contribution of this work - could have been to provide the complete and annotated dataset that was used to train and test the artificial neural network. The basic reason is that a more performant, or more generalisable deep-learning model may be developed very soon after this one and for its performance to be fairly compared, it requires to be compared on the same dataset. Benchmarking and comparison of methods performance is at the core of computer vision and deep-learning.

      We thank the referee for these comments. We have now uploaded all the data used to train the models and to test them, as well as all the data used in the analyses for the paper. This includes many videos that were not used for training but were analysed to generate the paper’s results. The link to these data sets is provided on our GitHub page (https://github.com/turleyjm/cell-division-dl-plugin/tree/main). In the folder for the data sets and in the GitHub repository, we have included the Jupyter notebooks used to train the models, and these can be used for retraining. We have made our data publicly available as a Zenodo dataset, https://zenodo.org/records/10846684 (added to the last paragraph of the discussion). We have also included scripts that can be used to compare the model output with ground truth, including outputs highlighting false positives and false negatives. Together with these scripts, models can be compared and contrasted, both in general and in individual videos. Overall, we very much appreciate the reviewer’s advice, which has made the plugin much more user-friendly and, hopefully, easier for other groups to train their own models. Our contact details are provided, and we would be happy to advise any groups that would like to use our tools.
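
      A comparison of model output with ground truth of the kind mentioned above could look roughly like the sketch below. It is not the script from the repository. The function name, the (t, x, y) event format, the greedy matching rule and the tolerance values are all assumptions made for illustration.

```python
import math


def compare_detections(predicted, ground_truth, max_dist=10.0, max_dt=1):
    """Greedy one-to-one matching of (t, x, y) events.

    A prediction counts as a true positive if an unmatched ground-truth
    event lies within max_dt frames and max_dist pixels of it.
    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched_truth = list(ground_truth)
    true_pos, false_pos = [], []
    for t, x, y in predicted:
        match = None
        for cand in unmatched_truth:
            ct, cx, cy = cand
            if abs(ct - t) <= max_dt and math.hypot(cx - x, cy - y) <= max_dist:
                match = cand
                break
        if match is None:
            false_pos.append((t, x, y))
        else:
            unmatched_truth.remove(match)  # each truth event is used once
            true_pos.append((t, x, y))
    # Ground-truth events that no prediction claimed are false negatives.
    return true_pos, false_pos, unmatched_truth


# Hypothetical events: one correct detection, one spurious, one missed.
predicted = [(10, 50.0, 50.0), (20, 100.0, 100.0)]
truth = [(10, 52.0, 51.0), (30, 10.0, 10.0)]
tp, fp, fn = compare_detections(predicted, truth)
```

      Returning the unmatched events themselves, and not only their counts, makes it possible to highlight false positives and false negatives in individual videos.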


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The authors present a number of deep-learning models to analyse the dynamics of epithelia. In this way, they want to overcome the time-consuming manual analysis of such data and also remove a potential operator bias. Specifically, they set up models for identifying cell division events and cell division orientation. They apply these tools to the epithelium of the developing Drosophila pupal wing. They confirm a linear decrease of the division density with time and identify a burst of cell division after the healing of a wound that they had induced earlier. These division events happen a characteristic time after and a characteristic distance away from the wound. These characteristic quantities depend on the size of the wound.

      Strength:

      The methods developed in this work achieve the goals set by the authors and are a very helpful addition to the toolbox of developmental biologists. They could potentially be used on various developing epithelia. The evidence for the impact of wounds on cell division is solid.

      Weakness:

      Some aspects of the deep-learning models remained unclear, and the authors might want to think about adding details. First of all, for readers not being familiar with deep-learning models, I would like to see more information about ResNet and U-Net, which are at the base of the new deep-learning models developed here. What is the structure of these networks?

      We agree with the Reviewer and have included additional information on page 8 of the manuscript, outlining some background information about the architecture of ResNet and U-Net models.

      How many parameters do you use?

      We apologise for this omission and have now included the number of parameters and layers in each model in the methods section on page 25.

      What is the difference between validating and testing the model? Do the corresponding data sets differ fundamentally?

      The difference between ‘validating’ and ‘testing’ the model is that the validation data are used during training to determine whether the model is overfitting. If the model performs well on the training data but not on the validation data, this is a key signal that the model is overfitting, and changes will need to be made to the network or training method to prevent this. The testing data are used only after all training has been completed, to test the performance of the model on fresh data it has not been trained on. We have removed reference to the validating data in the main text to make it simpler and have added this explanation to the Methods. There is no fundamental (or experimental) difference between the labelled data sets; rather, they are collected from different biological samples. We have now included this information in the Methods text on page 24.
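
      The roles of the three data sets can be illustrated with a short sketch. It assumes that labelled videos are grouped by biological sample and that per-epoch losses are available. The function names, the seed, the `patience` parameter and the loss values are hypothetical and are not taken from the manuscript.

```python
import random


def split_by_sample(sample_ids, n_val, n_test, seed=0):
    """Assign whole biological samples to train, validation and test sets."""
    ids = sorted(set(sample_ids))
    if n_val + n_test >= len(ids):
        raise ValueError("need at least one training sample")
    random.Random(seed).shuffle(ids)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


def looks_overfit(train_losses, val_losses, patience=3):
    """True if validation loss has not improved over the last `patience`
    epochs while training loss kept decreasing."""
    if len(train_losses) != len(val_losses):
        raise ValueError("loss histories must have equal length")
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    val_stalled = min(val_losses[-patience:]) >= best_before
    train_improving = train_losses[-1] < train_losses[-patience - 1]
    return val_stalled and train_improving


# Hypothetical loss curves: training keeps improving, validation does not.
train_curve = [1.0, 0.8, 0.6, 0.4, 0.3, 0.2]
val_curve = [1.0, 0.85, 0.7, 0.75, 0.8, 0.9]
overfitting = looks_overfit(train_curve, val_curve)
```

      The test samples returned by the split are never consulted by the overfitting check, which matches the description above: they are held back until training is complete.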

      How did you assess the quality of the training data classification?

      These data were generated and hand-labelled by an expert with many years of experience in identifying cell divisions in imaging data, to give the ground truth for the deep learning model.

      Reviewer #1 (Recommendations For The Authors):

      You repeatedly use 'new', 'novel' as well as 'surprising' and 'unexpected'. The latter are rather subjective and it is not clear based on what prior knowledge you make these statements. Unless indicated otherwise, it is understood that the results and methods are new, so you can delete these terms.

      We have deleted these words, as suggested, for almost all cases.

      p.4 "as expected" add a reference or explain why it is expected.

      A reference has now been included in this section, as suggested.

      p.4 "cell divisions decrease linearly with time" Only later (p.10) it turns out that you think about the density of cell divisions.

      This has been changed to "cell division density decreases linearly with time".

      p.5 "imagine is largely in one plane" while below "we generated a 3D z-stack" and above "our in vivo 3D image data" (p.4). Although these statements are not strictly contradictory, I still find them confusing. Eventually, you analyse a 2D image, so I would suggest that you refer to your in vivo data as being 2D.

      We apologise for the confusion here; the imaging data was initially generated using 3D z-stacks but this 3D data is later converted to a 2D focused image, on which the deep learning analysis is performed. We are now more careful with the language in the text.

      p.7 "We have overcome (...) the standard U-Net model" This paragraph remains rather cryptic to me. Maybe you can explain in two sentences what a U-Net is or state its main characteristics. Is it important to state which class you have used at this point? Similarly, what is the exact role of the ResNet model? What are its characteristics?

      We have included more details on both the ResNet and U-Net models and how our model incorporates properties from them on Page 8.

      p.8 Table 1 Where do I find it? Similarly, I could not find Table 2.

      These were originally located in the supplemental information document, but have been moved to the main manuscript.

      p.9 "developing tissue in normal homeostatic conditions" Aren't homeostatic and developing contradictory? In one case you maintain a state, in the other, it changes.

      We agree with the Reviewer and have removed the word ‘homeostatic’.

      p.9 "Develop additional models" I think 'models' refers to deep learning models, not to physical models of epithelial tissue development. Maybe you can clarify this?

      Yes, this is correct; we have phrased this better in the text.

      p.12 "median error" median difference to the manually acquired data?

      Yes, and we have made this clearer in the text, too.

      p.12 "we expected to observe a bias of division orientation along this axis" Can you justify the expectation? Elongated cells are not necessarily aligned with the direction of a uniaxially applied stress.

      Although this is not always the case, we have now included additional references to previous work from other groups which demonstrated that wing epithelial cells do become elongated along the P/D axis in response to tension.

      p.14 "a rather random orientation" Please, quantify.

      The division orientations are quantified in Fig. 4F,G; we have now changed our description from ‘random’ to ‘unbiased’.

      p.17 "The theories that must be developed will be statistical mechanical (stochastic) in nature" I do not understand. Statistical mechanics refers to systems at thermodynamic equilibrium, stochastic to processes that depend on, well, stochastic input.

      We have clarified that we are referring to non-equilibrium statistical mechanics (the study of macroscopic systems far from equilibrium, a rich field of research with many open problems and applications in biology).

      Reviewer #2 (Public Review):

      In this manuscript, the authors propose a computational method based on deep convolutional neural networks (CNNs) to automatically detect cell divisions in two-dimensional fluorescence microscopy timelapse images. Three deep learning models are proposed to detect the timing of division, predict the division axis, and enhance cell boundary images to segment cells before and after division. Using this computational pipeline, the authors analyze the dynamics of cell divisions in the epithelium of the Drosophila pupal wing and find that a wound first induces a reduction in the frequency of division followed by a synchronised burst of cell divisions about 100 minutes after its induction.

      In general, novelty over previous work does not seem particularly important. From a methodological point of view, the models are based on generic architectures of convolutional neural networks, with minimal changes, and on ideas already explored in general. The authors seem to have missed much (most?) of the literature on the specific topic of detecting mitotic events in 2D timelapse images, which has been published in more specialized journals or Proceedings. (TPAMI, CVPR, etc.; see references below). Even though the image modality or biological structure may be different (non-fluorescent images sometimes), I don't believe it makes a big difference. How the authors' approach compares to this previously published work is not discussed, which prevents me from objectively assessing the true contribution of this article from a methodological perspective.

      On the contrary, some competing works have proposed methods based on newer - and generally more efficient - architectures specifically designed to model temporal sequences (Phan 2018, Kitrungrotsakul 2019, 2021, Mao 2019, Shi 2020). These natural candidates (recurrent networks, long-short-term memory (LSTM) gated recurrent units (GRU), or even more recently transformers), coupled to CNNs are not even mentioned in the manuscript, although they have proved their generic superiority for inference tasks involving time series (Major point 2). Even though the original idea/trick of exploiting the different channels of RGB images to address the temporal aspect might seem smart in the first place - as it reduces the task of changing/testing a new architecture to a minimum - I guess that CNNs trained this way may not generalize very well to videos where the temporal resolution is changed slightly (Major point 1). This could be quite problematic as each new dataset acquired with a different temporal resolution or temperature may require manual relabeling and retraining of the network. In this perspective, recent alternatives (Phan 2018, Gilad 2019) have proposed unsupervised approaches, which could largely reduce the need for manual labeling of datasets.

      We thank the reviewer for their constructive comments. Our goal is to develop a cell detection method that has a very high accuracy, which is critical for practical and effective application to biological problems. The algorithms need to be robust enough to cope with the difficult experimental systems we are interested in studying, which involve densely packed epithelial cells within in vivo tissues that are continuously developing, as well as repairing. In response to the above comments of the reviewer, we apologise for not including these important papers from the division detection and deep learning literature, which are now discussed in the Introduction (on page 4).

      A key novelty of our approach is the use of multiple fluorescent channels to increase information for the model. As the referee points out, our method benefits from using and adapting existing highly effective architectures. Hence, we have been able to incorporate deeper models than some others have previously used. An additional novelty is using this same model architecture (retrained) to detect cell division orientation. For future practical use by us and other biologists, the models can easily be adapted and retrained to suit experimental conditions, including different fluorescent channels or numbers of time points. Unsupervised approaches are very appealing due to the potential time saved compared to manual hand labelling of data. However, the accuracy of unsupervised models is currently much lower than that of supervised models (as shown in Phan 2018) and, most importantly, well below the levels needed for practical use in analysing inherently variable (and challenging) in vivo experimental data.

      Regarding the other convolutional neural networks described in the manuscript:

      (1) The one proposed to predict the orientation of mitosis performs a regression task, predicting a probability for the division angle. The architecture, which must be different from a simple Unet, is not detailed anywhere, so the way it was designed is difficult to assess. It is unclear if it also performs mitosis detection, or if it is instead used to infer orientation once the timing and location of the division have been inferred by the previous network.

      The neural network used for U-NetOrientation has the same architecture as U-NetCellDivision10 but has been retrained to complete a different task: finding division orientation. Our workflow is as follows: firstly, U-NetCellDivision10 is used to find cell divisions; secondly, U-NetOrientation is applied locally to determine the division orientation. These points have now been clarified in the main text on Page 14.

      (2) The one proposed to improve the quality of cell boundary images before segmentation is nothing new, it has now become a classic step in segmentation, see for example Wolny et al. eLife 2020.

      We have cited similar segmentation models in our paper and thank the referee for this additional one. We had made an improvement to the segmentation models, using GFP-tagged E-cadherin, a protein localised in a thin layer at the apical boundary of cells. So, while this is primarily a 2D segmentation problem, some additional information is available in the z-axis as the protein is visible in 2-3 separate z-slices. Hence, we supplied this 3-focal plane input to take advantage of the 3D nature of this signal. This approach has been made more explicit in the text (Pages 14, 15) and Figure (Fig. 2D).

      As a side note, I found it a bit frustrating to realise that all the analysis was done in 2D while the original images are 3D z-stacks, so a lot of the 3D information had to be compressed and has not been used. A novelty, in my opinion, could have resided in the generalisation to 3D of the deep-learning approaches previously proposed in that context, which are exclusively 2D, in particular, to predict the orientation of the division.

      Our experimental system is a relatively flat 2D tissue with the orientation of the cell divisions consistently in the xy-plane. Hence, a 2D analysis is most appropriate for this system. With the successful application of the 2D methods already achieving high accuracy, we envision that extension to 3D would only offer a slight increase in effectiveness as these measurements have little room for improvement. Therefore, we did not extend the method to 3D here. However, of course, this is the next natural step in our research as 3D models would be essential for studying 3D tissues; such 3D models will be computationally more expensive to analyse and more challenging to hand label.

      Concerning the biological application of the proposed methods, I found the results interesting, showing the potential of such a method to automatise mitosis quantification for a particular biological question of interest, here wound healing. However, the deep learning methods/applications that are put forward as the central point of the manuscript are not particularly original.

      We thank the referee for their constructive comments. Our aim was not only to show the accuracy of our models but also to show how they might be useful to biologists for automated analysis of large datasets, which is a—if not the—bottleneck for many imaging experiments. The ability to process large datasets will improve robustness of results, as well as allow additional hypotheses to be tested. Our study also demonstrated that these models can cope with real in vivo experiments where additional complications such as progressive development, tissue wounding and inflammation must be accounted for.

      Major point 1: generalisation potential of the proposed method.

      The neural network model proposed for mitosis detection relies on a 2D convolutional neural network (CNN), more specifically on the Unet architecture, which has become widespread for the analysis of biology and medical images. The strategy proposed here exploits the fact that the input of such an architecture is natively composed of several channels (originally 3 to handle the 3 RGB channels, which is actually a holdover from computer vision, since most medical/biological images are gray images with a single channel), to directly feed the network with 3 successive images of a timelapse at a time. This idea is, in itself, interesting because no modification of the original architecture had to be carried out. The latest 10-channel model (U-NetCellDivision10), which includes more channels for better performance, required minimal modification to the original U-Net architecture but also simultaneous imaging of cadherin in addition to histone markers, which may not be a generic solution.

      We believe we have provided a general approach for practical use by biologists that can be applied to a range of experimental data, whether that is based on varying numbers of fluorescent channels and/or timepoints. We envisioned that experimental biologists are likely to have several different parameters permissible for measurement based on their specific experimental conditions e.g., different fluorescently labelled proteins (e.g. tubulin) and/or time frames. To accommodate this, we have made it easy and clear in the code on GitHub how these changes can be made. While the model may need some alterations and retraining, the method itself is a generic solution as the same principles apply to very widely used fluorescent imaging techniques.

      Since CNN-based methods accept only fixed-size vectors (fixed image size and fixed channel number) as input (and output), the length or time resolution of the extracted sequences should not vary from one experiment to another. As such, the method proposed here may lack generalization capabilities, as it would have to be retrained for each experiment with a slightly different temporal resolution. The paper should have compared results with slightly different temporal resolutions to assess its inference robustness toward fluctuations in division speed.

      If multiple temporal resolutions are required for a set of experiments, we envision that the model could be trained over a range of these different temporal resolutions. The temporal resolution that requires the largest vector would then set the model's fixed number of input channels. Given the depth of the models used, and the potential to easily increase this by replacing resnet34 with resnet50 or resnet101, the model would likely be able to cope with this, although we have not specifically tested this. (page 27)
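      The fixed-channel idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from our repository: the helper name `pad_to_fixed_channels`, the choice of zero-padding for shorter sequences, and the use of nested lists instead of an array library are all hypothetical.

```python
def pad_to_fixed_channels(frames, n_channels, fill=0.0):
    """Return a list of exactly n_channels 2D frames.

    Hypothetical helper: `frames` is a time series of equally sized 2D
    images (lists of rows). Sequences shorter than n_channels are padded
    with blank frames so every input has the same number of channels.
    """
    if not frames:
        raise ValueError("need at least one frame")
    if len(frames) > n_channels:
        raise ValueError("sequence is longer than the fixed channel count")
    height, width = len(frames[0]), len(frames[0][0])
    # Copy the real frames first, then append blank frames as padding.
    stacked = [[list(row) for row in frame] for frame in frames]
    for _ in range(n_channels - len(frames)):
        stacked.append([[fill] * width for _ in range(height)])
    return stacked


# Example: a 3-frame sequence of 2x2 images padded to 5 channels.
short_sequence = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]
padded = pad_to_fixed_channels(short_sequence, n_channels=5)
```

      Whether zero-padding is an appropriate way to mix temporal resolutions would need to be tested; the sketch only shows how one fixed input size could accommodate sequences of different lengths.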

      Another approach (not discussed) consists in directly convolving several temporal frames using a 3D CNN (2D+time) instead of a 2D, in order to detect a temporal event. Such an idea shares some similarities with the proposed approach, although in this previous work (Ji et al. TPAMI 2012 and for split detection Nie et al. CCVPR 2016) convolution is performed spatio-temporally, which may present advantages. How does the authors' method compare to such an (also very simple) approach?

      We thank the Reviewer for this insightful comment. The text now discusses this (on Pages 8 and 17). Key differences between the models include our incorporation of multiple light channels and the use of much deeper models. We suggest that our method allows for an easy and natural extension to use deeper models for even more demanding tasks e.g. distinguishing between healthy and defective divisions. We also tested our method with ‘difficult conditions’ such as when a wound is present; despite the challenges imposed by the wound (including the discussed reduction in fluorescent intensities near the wound edge), we achieved higher accuracy than Nie et al., who used a low-density in vitro system (their accuracy of 78.5% compared to our F1 score of 0.964).
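      For reference, the two metrics quoted above are defined differently: accuracy counts true negatives, whereas the F1 score does not. A minimal sketch of both definitions follows; the function names and the example counts are hypothetical and are not taken from either study.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there are no positives."""
    if min(tp, fp, fn) < 0:
        raise ValueError("counts must be non-negative")
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / all predictions; returns 0.0 for empty input."""
    if min(tp, tn, fp, fn) < 0:
        raise ValueError("counts must be non-negative")
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts for a detector evaluated on one video.
example_f1 = f1_score(tp=90, fp=5, fn=5)
example_accuracy = accuracy(tp=90, tn=900, fp=5, fn=5)
```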

      Major point 2: innovatory nature of the proposed method.

      The authors' idea of exploiting existing channels in the input vector to feed successive frames is interesting, but the natural choice in deep learning for manipulating time series is to use recurrent networks or their newer and more stable variants (LSTM, GRU, attention networks, or transformers). Several papers exploiting such approaches have been proposed for the mitotic division detection task, but they are not mentioned or discussed in this manuscript: Phan et al. 2018, Mao et al. 2019, Kitrungrotsakul et al. 2019, Shi et al. 2020.

      An obvious advantage of an LSTM architecture combined with CNN is that it is able to address variable length inputs, therefore time sequences of different lengths, whereas a CNN alone can only be fed with an input of fixed size.

      LSTM architectures may produce similar accuracy to the models we employ in our study; however, given the high degree of accuracy we already achieve with our methods, it is hard to see how they would improve the understanding of the biology of wound healing that we have uncovered. Hence, they may provide an alternative way to achieve similar results from analyses of our data. It would also be interesting to see how LSTM architectures would cope with the noisy and difficult wounded data that we have analysed. We agree with the referee that these alternate models could allow an easier inclusion of differences in division time (see discussion on Page 20). Nevertheless, we imagine that after selecting a sufficiently large number of input time points and fluorescent channels, biologists could likely train our model to cope with a range of division lengths.

      Another advantage of some of these approaches is that they rely on unsupervised learning, which can avoid the tedious relabeling of data (Phan et al. 2018, Gilad et al. 2019).

      While these are very interesting ideas, we believe these unsupervised methods would struggle under the challenging conditions found in our and others' experimental imaging data. The epithelial tissue examined in the present study possesses a particularly high density of cells with overlapping nuclei compared to the other experimental systems these unsupervised methods have been tested on. Another potential problem with these unsupervised methods is the difficulty in distinguishing dynamic debris and immune cells from mitotic cells. Once again, despite our experimental data being more complex and difficult, our methods perform better than other methods designed for simpler systems, as in Phan et al. 2018 and Gilad et al. 2019; for example, analysis performed on lower density in vitro and unwounded tissues gave best F1 scores for a single video of 0.768 and 0.829 for unsupervised and supervised methods, respectively (Phan et al. 2018). We envision that having an F1 score above 0.9 (and preferably above 0.95) would be crucial for practical use by biologists, hence we believe supervision is currently still required. We expect that retraining our models for use in other experimental contexts will require smaller hand labelled datasets, as they will be able to take advantage of transfer learning (see discussion on Page 4).

      References:

      We have included these additional references in the revised version of our Manuscript.

      Ji, S., Xu, W., Yang, M., & Yu, K. (2012). 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1), 221-231.

      Nie, W. Z., Li, W. H., Liu, A. A., Hao, T., & Su, Y. T. (2016). 3D convolutional networks-based mitotic event detection in time-lapse phase contrast microscopy image sequences of stem cell populations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 55-62).

      Phan, H. T. H., Kumar, A., Feng, D., Fulham, M., & Kim, J. (2018). Unsupervised two-path neural network for cell event detection and classification using spatiotemporal patterns. IEEE Transactions on Medical Imaging, 38(6), 1477-1487.

      Gilad, T., Reyes, J., Chen, J. Y., Lahav, G., & Riklin Raviv, T. (2019). Fully unsupervised symmetry-based mitosis detection in time-lapse cell microscopy. Bioinformatics, 35(15), 2644-2653.

      Mao, Y., Han, L., & Yin, Z. (2019). Cell mitosis event analysis in phase contrast microscopy images using deep learning. Medical image analysis, 57, 32-43.

      Kitrungrotsakul, T., Han, X. H., Iwamoto, Y., Takemoto, S., Yokota, H., Ipponjima, S., ... & Chen, Y. W. (2019). A cascade of 2.5 D CNN and bidirectional CLSTM network for mitotic cell detection in 4D microscopy image. IEEE/ACM transactions on computational biology and bioinformatics, 18(2), 396-404.

      Shi, J., Xin, Y., Xu, B., Lu, M., & Cong, J. (2020, November). A Deep Framework for Cell Mitosis Detection in Microscopy Images. In 2020 16th International Conference on Computational Intelligence and Security (CIS) (pp. 100-103). IEEE.

      Wolny, A., Cerrone, L., Vijayan, A., Tofanelli, R., Barro, A. V., Louveaux, M., ... & Kreshuk, A. (2020). Accurate and versatile 3D segmentation of plant tissues at cellular resolution. Elife, 9, e57613.

    1. Author Response

      Reviewer #1 (Public Review):

      Castelán-Sánchez et al. analyzed SARS-CoV-2 genomes from Mexico collected between February 2020 and November 2021. This period spans three major spikes in daily COVID-19 cases in Mexico and the rise of three distinct variants of concern (VOCs; B.1.1.7, P.1, and B.1.617.2). The authors perform careful phylogenetic analyses of these three VOCs, as well as two other lineages that rose to substantial frequency in Mexico, focusing on identifying periods of cryptic transmission (before the lineage was first detected) and introductions to and from the neighboring United States. The figures are well presented and described, and the results add to our understanding of SARS-CoV-2 in Mexico. However, I have some concerns and questions about sampling that could affect the results and conclusions. The authors do not provide any details on the distribution of samples across the various Mexican States, making it hard to evaluate several key conclusions. Although this information is provided in Supplementary Data 2, it is not presented in a way that enables the reader to evaluate if lineages were truly predominant in certain regions of the country, or if these results are attributable purely to sampling bias. Specifically, each lineage is said to be dominant in a particular state or region, but it was not clear to me if sampling across states was even at all time points. For example, the authors state that most B.1.1.7 genome sampling is from the state of Chihuahua, but it is not clear if this was due to more sequenced samples from that region during the time that B.1.1.7 was circulating, or if the effects of B.1.1.7 were truly differential across the country. The authors do mention sequencing biases several times, but need to be more specific about the nature of this bias and how it could affect their conclusions.
It is surprising to see in this manuscript that the B.1.1.7 lineage did not rise above 25% prevalence in the data presented, despite its rapid rise in prevalence in many other parts of the world. This calls into question if the presented frequencies of each lineage are truly representative of what was circulating in Mexico at the time, especially since the coordinated sampling and surveillance program across Mexico did not start until May 2021.

      We thank the reviewer for the constructive comments. We recognize the need to better explain how the sequencing efforts in the country were set up and carried out, and this has now been clarified throughout the main text (L43-51, L95-105). A new figure comparing the overall cumulative proportion of genomes generated per state between 2020-2021 is now available as Supplementary Figure 1c. The cumulative proportions of genomes sampled across states per lineage of interest, corresponding to the period of circulation of the given lineage, were originally provided as maps in Figures 2-4. This has been further clarified in the Results section and in the corresponding figure legends. We also now provide additional maps representing the geographic distribution of the clades identified per lineage, integrating in the figures the information previously available in Supplementary Data 2, Supplementary Figures 4 and 5. As a note, for our analyses, we used the total cumulative genome data available from the country (and not only that generated by CoViGen-Mex, representing one third of the SARS-CoV-2 genomes from Mexico). This is expected to reduce any sampling biases related to the scheme adopted by CoViGen-Mex, and is now clearly stated in the main text.

      However, we believe that there has been a misunderstanding about the genome sampling scheme adopted by CoViGen-Mex, namely that the ‘coordinated sampling and surveillance program across Mexico did not start until May 2021’. Although it is true that further improvements were implemented after this date (enabling genome sampling and sequencing to become more homogenous across the country), the overall virus genome sequencing in Mexico was already sufficient from February 2021. This is shown by the cumulative number of viral genomes sequenced throughout 2020-2021 (both by CoViGen-Mex and other contributing institutions), which correlates with the number of cases officially reported in the country during this time (see Supplementary Figure 1 a). This has now been clarified in the Results section (L94-105). Therefore, we hold that “SARS-CoV-2 sequencing in Mexico has been sufficient to explore the spatial and temporal frequency of viral lineages across national territory, and now to further investigate the number of lineage-specific introduction events, and to characterize the extension and geographic distribution of associated transmission chains, as we present in this study” (L102-105). In this context, “a more homogenous sampling across the country is unlikely to impact our main findings, but could i) help pinpoint additional clades we are currently unable to detect, ii) provide further details on the geographic distribution of clades across other regions of the country, and iii) deliver a higher resolution for the viral spread reconstructions we present” (discussed in L466-470).

      For the B.1.1.7 lineage in Mexico, we have clarified the issue raised as follows: “during its circulation period, most B.1.1.7 genomes from Mexico were generated from the state of Chihuahua, with these representing the earliest B.1.1.7-assigned genomes from the country. However, our phylodynamic analysis revealed that only a small proportion of these grouped within a larger clade denoting an extended transmission chain (C2a), with the rest falling within minor clusters, or representing singleton events. Relative to other states, Chihuahua generated an overall lower proportion of viral genomes throughout 2020-2021. Thus, more viral genomes sequenced from a particular state does not necessarily translate into more well-supported clades denoting extended transmission chains, whilst the geographic distribution of clades is somewhat independent to the genome sampling across the country.” (L202-211). Again, these observations are supported by a sufficient overall genome sampling from Mexico.

      We would further like to make clear that “our results confirm that the B.1.1.7 lineage reached an overall lower sampling frequency of up to 25% (relative to other virus lineages circulating in the country), as was noted prior to this study (for example, see Zárate et al. 2022)” (L189-193). As similar observations were independently made for other Latin American countries such as Brazil, Chile, and Peru (some with better genome representation than others, like Brazil https://www.gisaid.org/), it is possible that “the overall epidemiological dynamics of the B.1.1.7 in Latin America may have substantially differed from what was observed in the USA and UK. Such differences could be partly explained by competition between cocirculating lineages, exemplified in Mexico by the regional co-circulation of B.1.1.7, P.1 and B.1.1.519. Nonetheless, the lack of a representative number of viral genomes for most of these countries prevents exploring such hypothesis at a larger scale, and further highlights the need to strengthen genomic epidemiology-based surveillance across the region” (now discussed in L372-379). We hope the reviewer considers that the issues raised have now been resolved.

      Reviewer #2 (Public Review):

      The authors use a series of subsampling methods based on phylogenetic placement and geographic setting, informed by human movement data to control for differences in sampling of SARS-CoV-2 genomes across countries. Of note, the authors show that 2 variants likely arose in Mexico and spread via multiple introductions globally, while other variant waves were driven by repeat introductions into Mexico from elsewhere. Finally, they use human mobility data to assess the impact of movement on transmission within Mexico. Overall, the study is well done and provides nice data on an under-studied country. The authors take a thoughtful approach to subsampling and provide a very thorough analysis. Because of the care given to subsampling and the great challenge that proper subsampling represents for the field of phylodynamics, the paper would benefit from a more thorough exploration of how their migration-informed subsampling procedure impacts their results. This would not only help strengthen the findings of the paper, but would likely provide a useful reference for others doing similar studies. Additionally, I would suggest the authors provide a bit more discussion of this subsampling approach and how it may be useful to others in the discussion section of the paper.

      We thank the reviewer for the constructive comments, and appreciate the recognition of our sub-sampling scheme as a valuable tool with potential application in other studies. We acknowledge the need for a ‘more thorough exploration and discussion of how a different migration-informed subsampling approach could impact our results’. To address this issue, “we further sought to validate our migration-informed genome subsampling scheme (applied to B.1.617.2+, representing the best sampled lineage in Mexico). For this, an independent dataset was built using a different migration sub-sampling approach, comprising all countries represented by B.1.617.2+ sequences deposited in GISAID (available up to November 30th 2021). In order to compare the number of introduction events, the new dataset was analysed independently under a time-scaled DTA (as described in Methods Section 4).” (L517-524). In the new dataset, <100 genome sequences from the USA were retained for further analysis (Supplementary Figure 2b), compared to approximately 2000 ‘USA’ genome sequences included in the original B.1.617.2+ alignment. Thus, we expected a lower number of inferred introduction events into Mexico, as an undersampling of viral genome sequences from the USA is likely to result in ‘Mexico’ clades not fully segregating (particularly impacting C5d).

      Our original results revealed a minimum number of 142 introduction events into Mexico (95% HPD interval = [125-148]), with 6 clades identified as denoting extended transmission chains. The DTA results derived from the new dataset (subsampling all countries) revealed a minimum number of 84 introduction events into Mexico (95% HPD interval = [81-87]), with again 6 major clades identified. Thus, a significantly lower number of introduction events into Mexico was inferred, as expected. On the other hand, the number of clades identified was consistent between both datasets, supporting the robustness of our phylogenetic methodological approach. However, in the new dataset, we observe that C5d displayed a reduced diversity (represented by the AY.113 and AY.100 genomes from Mexico, but excluding the B.1.617.2 genome sampled from the USA). This highlights the relevance of our genome sub-sampling using migration data as a proxy.
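      For readers unfamiliar with the interval reported above, a 95% HPD (highest posterior density) interval can be approximated from posterior samples as the narrowest window that contains 95% of them. The sketch below is a generic illustration of that definition, not the routine used in our analysis; the function name `hpd_interval` and the sample values are hypothetical.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Return the narrowest interval that contains `mass` of the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < mass <= 1:
        raise ValueError("mass must be in (0, 1]")
    ordered = sorted(samples)
    n = len(ordered)
    k = max(1, math.ceil(mass * n))  # number of samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    best = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[best], ordered[best + k - 1]


# Hypothetical posterior counts of introduction events (not real data).
hypothetical_counts = [0, 10, 11, 30]
narrowest_half = hpd_interval(hypothetical_counts, mass=0.5)
```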

      In further agreement with these observations, publicly available data on global human mobility (https://migration-demography-tools.jrc.ec.europa.eu/data-hub/index.html?state=5d6005b30045242cabd750a2) shows that migration into Mexico is mostly represented by movements from the USA, followed by Indonesia, Guatemala, Belize and Colombia. However, the volume of movements from the USA into Mexico is much higher (up to 6 orders of magnitude above the volumes recorded into Mexico from any other country).

      Given time constraints related to performing additional analyses, we decided to exclude the subsampling scheme for ‘top ten countries’ suggested by the reviewer. However, we consider that the results derived from the comparison between the original and the new dataset (top-5 vs all countries) are sufficient to support our migration-informed subsampling approach. A full description of the methodology and the results obtained, as well as a short discussion, is now available as Supplementary Text 2, and Supplementary Figure 2b and 2c. We hope the reviewer considers that the issues raised have been addressed.

    1. Author Response

      Reviewer #1 (Public Review):

      High resolution mechanistic studies would be instrumental in driving the development of Cas7-11 based biotechnology applications. This work is unfortunately overshadowed by a recent Cell publication (PMID: 35643083) describing the same Cas7-11 RNA-protein complex. However, given the tremendous interest in these systems, it is my opinion that this independent study will still be well cited, if presented well. The authors obviously have been trying to establish a unique angle for their story, by probing deeper into the mechanism of crRNA processing and target RNA cleavage. The study is carried out rigorously. The current version of the manuscript appears to have been rushed out. It would benefit from clarification and text polishing.

      We thank the reviewer for the positive and helpful comments that have made the manuscript more impactful.

      To summarize the revisions, we have resolved the metal-dependence issue, updated the maps in both main and supplementary figures that support the model, re-organized the labels for clarity, and added a comparison between our structure and that of Kato et al.

      In addition, we describe a new result with an isolated C7L.1 fragment that retains the processing and crRNA binding activities.

      Reviewer #2 (Public Review):

      In this manuscript, Goswami et al. solved a cryo-EM structure of Desulfonema ishimotonii Cas7-11 (DiCas7-11) bound to a guiding CRISPR RNA (crRNA) and target RNA. Cas7-11 is of interest due to its unusual architecture as a single polypeptide, in contrast to other type III CRISPR-Cas effectors that are composed of several different protein subunits. The authors have obtained a high-quality cryo-EM map at 2.82 angstrom resolution, allowing them to build a structural model for the protein, crRNA and target RNA. The authors used the structure to clearly identify a catalytic histidine residue in the Cas7-11 Cas7.1 domain that is important for crRNA processing activity. The authors also investigated the effects of metal ions and crRNA-target base pairing on target RNA cleavage. Finally, the authors used their structure to guide engineering of a compact version of Cas7-11 in which an insertion domain that is disordered in the cryo-EM map was removed. This compact Cas7-11 appears to have comparable cleavage activity to the full-length protein.

      The cryo-EM map presented in this manuscript is generally of high quality and the manuscript is very well illustrated. However, some of the map interpretation requires clarification (outlined below). This structure will be valuable as there is significant interest in DiCas7-11 for biotechnology. Indeed, the authors have begun to engineer the protein based on observations from the structure. Although characterization of this engineered Cas7-11 is limited in this study and similar engineering was also performed in a recently published paper (PMID 35643083), this proof-of-principle experiment demonstrates the importance of having such structural information.

      The biochemistry experiments presented in the study identify an important residue for crRNA processing, and suggest that target RNA cleavage is not fully metal-ion dependent. Most of these conclusions are based on straightforward structure-function experiments. However, some results related to target RNA cleavage are difficult to interpret as presented. Overall, while the cryo-EM data presented in this work is of high quality, both the structural model and the biochemical results require further clarification as outlined below.

      We thank the reviewer for the positive and helpful comments that have made the manuscript more impactful.

      To summarize the revisions, we have resolved the metal-dependence issue, updated the maps in both main and supplementary figures that support the model, re-organized the labels for clarity, and added a comparison between our structure and that of Kato et al.

      In addition, we describe a new result with an isolated C7L.1 fragment that retains the processing and crRNA binding activities.

      1. The DiCas7-11 structure bound to target RNA was also recently reported by Kato et al. (PMID 35643083). The authors have not cited this work or compared the two structures. While the structures are likely quite similar, it is notable that the structure reported in the current paper is for the wild-type protein and the sample was prepared under reactive conditions, resulting in a partially cleaved target. Kato et al. used a catalytically dead version of Cas7-11 in which the target RNA should remain fully intact. Are there differences in the Cas7-11 structure observed in the presence of a partially cleaved target RNA in comparison to the Kato et al. structure? Such a comparison is appropriate given the similarities between the two reports. A figure comparing the two structures could be included in the manuscript.

      We have added a paragraph on page 12 that describes the differences in preparation of the two complexes and their structures. We observed minor differences in the overall protein structure (r.m.s.d. 0.918 Å for 8114 atoms) but did observe quite different interactions between the protein and the first 5’-tag nucleotide (U(-15) vs. G(-15)) due to the different constructs in pre-crRNA, which suggests the importance of U(-15) in forming the processing-competent active site. We added Figure 2-figure supplement 3, which illustrates the similarities and the differences.
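      The r.m.s.d. quoted above is the root-mean-square distance between matched atom positions in the two structures. A minimal sketch of that formula follows, assuming the two coordinate sets have already been superposed and are listed in the same atom order; the function name and the toy coordinates are hypothetical, and this is not the software used for the reported value.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z)."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy example: every atom is displaced by 1 Å along z.
toy_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
toy_rmsd = rmsd(toy_a, toy_b)
```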

      2. The cryo-EM density map is of high quality, but some of the structural model is not fully supported by the experimental data (e.g. protein loops from the AlphaFold model were not removed despite lack of cryo-EM density). Most importantly, there is little density for the target RNA beyond the site 1 cleavage site, suggesting that the RNA was cleaved and the product was released. However, this region of the RNA was included in the structural model. It is unclear what density this region of the target RNA model was based on. Further discussion of the interpretation of the partially cleaved target RNA is necessary. Were 3D classes observed in various states of RNA cleavage and with varied density for the product RNAs?

      We should have made it clear in the Methods that multiple maps were used in building the structure, although only the post-processed map was submitted to reviewers. When using the map generated by local resolution estimation in Relion 4.0, we observed sufficient density for some of the regions the reviewer is referring to. For instance, the density does support the model for the two nucleotides beyond the site 1 cleavage site (see the revised Figure 1 & Figure 1-figure supplement 3).

      However, some protein loops still lack convincing density. These include residues 134-141 and 1316-1329, which are now removed from the final coordinates.

      The “partially cleaved target RNA” phrase reflects weak density for nucleotides downstream of site 1 (+2 and +3) but clear density flanking site 2. This feature indicates that cleavage had likely taken place at site 1 but not site 2 in most of the particles that went into the reconstruction. To further clarify this phrase, we added “The PFS region plus the first base paired nucleotide (+1*) are not observed.” on page 4 and better indicate which nucleotides are or are not built in our model in Figure 1.

      1. The authors argue that site 1 cleavage of target RNA is independent of metal ions. This is a potentially interesting result, but it is difficult to determine whether it is supported by the evidence provided in the manuscript. The Methods section only describes a buffer containing 10 mM MgCl2, but does not describe conditions containing EDTA. How much EDTA was added and was MgCl2 omitted from these samples? In addition, it is unclear whether the site 1 product is visible in Figures 2d and 3d. To my eye, the products that are present in the EDTA conditions on these gels migrated slightly slower than the typical site 1 product. This may suggest an alternate cleavage site or chemistry (e.g. cyclic phosphate is maintained following cleavage). Further experimental details and potentially additional experiments are required to fully support the conclusion that site 1 cleavage may be metal independent.

      As we pointed out in response to Reviewer 1’s #8 comment, this conclusion may have been a result of using an older batch of DiCas7-11 that contains degraded fragments.

      As shown in the attached figure below, “batch Y” was an older prep from our in-house clone and “batch X” is a newer prep from the clone purchased from Addgene (gel on right), and they consistently produce metal-independent (batch Y) or metal-dependent (batch X) cleavage (gel on left). It is possible that the degraded fragments in batch Y carry a metal-independent cleavage activity that is absent in the purer batch X.

      We further performed mass spectrometry analysis of two of the degraded fragments from batch Y (indicated by arrows below) and discovered that these are indeed part of DiCas7-11. We, however, cannot rationalize, without more experimental evidence, why these fragments might have generated metal-independent cleavage at site 1. Therefore, we simply updated all our cleavage results from the new and cleaner prep (batch X) (For instance, Figure 3c). As a result, all references to “metal-independence” were removed.

      With regard to the nature of cleaved products, we found both sites could be inhibited by specific 2’-deoxy modifications, consistent with the previous observation that Type III systems generate a 2’, 3’-cyclic product in spite of the metal dependence (for instance, see Hale, C. R., Zhao, P., Olson, S., Duff, M. O., Graveley, B. R., Wells, L., ... & Terns, M. P. (2009). RNA-guided RNA cleavage by a CRISPR RNA-Cas protein complex. Cell, 139(5), 945-956.)

      We added this rationale based on the new results and believe that these characterizations are now thorough and conclusive.

      1. The authors performed an experiment investigating the importance of crRNA-target base pairing on cleavage activity (Figure 3e). However, negative controls for the RNA targets in the absence of crRNA and Cas7-11 were not included in this experiment, making it impossible to determine which bands on the gel correspond to substrates and which correspond to products. This result is therefore not interpretable by the reader and does not support the conclusions drawn by the authors.

      Our original gel image (below) does contain these controls but we did not include them for the figure due to space considerations (we should have included it as a supplementary figure). We have now completely updated Figure 3e with much better quality and controls. Both the older and the updated experiments show the same results.

      Original gel for Figure 3e containing controls.

    1. Author Response

      Reviewer #1 (Public Review):

      The idea that because the hippocampal code generates responses that match the most needed variable for each task (time or distance) makes it a predictive code is not fully proved with the analyses provided in the manuscript. For example, in the elapsed time task, there are also place cells and in the fixed-distance travel there are also cells that encode other features. This, rather than a predictive code, can be a regular sample of the environment with an overrepresentation of the more salient variable that animals need to get in order to collect rewards.

      We concur with the Reviewer’s reservation. Claims about predictive coding were removed, and the following possible explanation for the over-representation was suggested instead:

      "These results underscore the flexible coding capabilities of the hippocampus, which are shaped by over-representation of salient variables associated with reward conditions. " (page 1 line 23, page 4 line 27)

      In addition, the analysis provided in the manuscript are rather simple, and better controls could be provided. Improving the analytical quantification of the results is necessary to support the main claim.

      We improved the quantification, as suggested below by specific comments of the reviewer.

      What is the relationship of each type of cell with the speed of the animal?

      The cells were assigned to the different types according to their responses while running across all speeds. However, we checked how the speed of the animal affects the peak firing rate of the cells, for each type of cell. Results of this analysis are presented in Author response image 1. Bars represent maximum firing rate of all cells of a given type across runs with the specified speed range (mean ± SEM).

      Author response image 1.

      We did not find a significant interaction effect of the speed and the cell-type over the max firing rate (2-way Anova p>0.98).

      What is the relationship with the n of trial that the animal has run (first 10 trials, last 10 trials..)?

      Some of the animals were subjected to only one type of session. Moreover, they were sometimes trained without recording. Therefore, to answer this question we restricted our analysis to recording sessions where the animal switched from fixed-time to fixed-distance or vice versa. We compared the first 20 runs with the last 20 runs (data from 10 runs is not powerful enough for analysis). See the results in Author response table 1.

      Author response table 1.

      To assess the dynamics of the coding flexibility, we defined the Time-Distance index (TDI), which quantifies the balance between the proportion of distance cells and of time cells at a given time, as (NDistanceCells − NTimeCells)/(NDistanceCells + NTimeCells). The TDI is in the range [0, 1] if the majority of cells are classified as distance cells, and in the range [-1, 0] if the majority of cells are classified as time cells. Chi-square testing for differences in proportions did not reveal significant differences (after correction for multiple comparisons).
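      As a minimal sketch of the index described above (not the authors' code), the TDI can be computed from the two cell counts. The function name and the example counts are hypothetical, and the sketch assumes the difference form of the index, which is the form consistent with the stated range of [-1, 1].

```python
def time_distance_index(n_distance_cells: int, n_time_cells: int) -> float:
    """Return TDI = (N_dist - N_time) / (N_dist + N_time).

    Positive values mean distance cells are the majority,
    negative values mean time cells are the majority.
    """
    total = n_distance_cells + n_time_cells
    if total == 0:
        # The index is undefined when no cell was classified.
        raise ValueError("TDI is undefined when no cells are classified")
    return (n_distance_cells - n_time_cells) / total


# Hypothetical counts, for illustration only.
print(time_distance_index(30, 10))  # distance-dominated population
print(time_distance_index(5, 15))   # time-dominated population
```

      Equal counts of the two cell types give a TDI of 0, and a population made up of a single type gives +1 or -1.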

      The shaded boxes in Author response table 1 indicate the sessions which followed a transition between session types.

      What is the average firing rate of each neuron?

      This information has now been added to the titles of the panels in Figure 2 and Figure 2-figure supplement 1.

      Is there any relationship between intrinsic firing rate and the type of coding that the cell develops in each task?

      Author response image 2 compares the firing rates of the Time cells and the Distance cells.

      The distributions are similar (p = 0.975 and p = 0.675 for peak firing rate and mean firing rate, respectively; Kolmogorov-Smirnov (KS) test).

      Author response image 2.

      This figure was added to the supplementary figures (Figure 3-figure supplement 3).

      What is the relation of the units of each type with LFP features (theta phase, ripple recruitment)?

      We had LFP recordings for 15 out of 18 sessions. A large proportion of the cells showed phase precession (see Author response table 2). An example is shown in Author response image 3. We could not find a significant relation between phase precession and the cell type or the trial type.

      The table on the left shows the total cells analyzed, and on the right we show the percentage of cells that had a significant linear fit of the theta phase within 80% of the field width, when analyzed per time (top-right) or per distance (bottom-right). FDist/FTime are fixed-distance and fixed-time trials, and Dist/Time are the cell types.

      We did not identify ripple events during treadmill runs.

      Author response table 2

      Author response image 3

      Reviewer #3 (Public Review):

      Weaknesses:

      The original study of Kraus et al. consisted of 3 rats for which all sessions, including both training and recording, were of one type. Another 3 rats had a hybrid mixture of distance and time sessions. This is mentioned very briefly in the main text.

      It would appear that the theory of reward might lead to different predictions that could be verified by comparing these animals session to session at a finer grain. For example, are there examples of cells switching or transforming their “predictive” representations when a large number of trials in one session type is followed by a large number of trials of the opposite type?

      For another example, the transition from training to recording could give similar opportunities. It seems at least possible that ignoring these issues could cause a loss of power.

      We could not compare a particular cell for switching between encodings since the different types of trial were performed on different days. As an alternative, we compared the populations of cells within the first 20 vs. last 20 trials in recording sessions where the animal switched from fixed-time to fixed-distance or vice versa (see table below). The “Time-Distance balance index” (TDI) is defined as (#DistanceCells − #TimeCells)/(#DistanceCells + #TimeCells); it ranges between 0 and 1 if the majority of cells are classified as distance cells, and between -1 and 0 if the majority of cells are classified as time cells.

      In all three animals there seems to be a change between the first 20 runs and last 20 runs of the same session, following a switch between trial types. However, this change is significant and with the expected trend only in one of the animals (BK49, p=0.02, chi-square test).
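      For readers who want to reproduce this kind of comparison, the following is a minimal sketch of a chi-square test for a difference in proportions between two blocks of runs. It is not the authors' analysis code: it assumes a 2x2 contingency table (time cells vs. distance cells, first 20 vs. last 20 runs), uses Pearson's statistic without continuity correction, and applies no correction for multiple comparisons. The function name and the counts are hypothetical.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test on the 2x2 table [[a, b], [c, d]].

    Rows could be 'first 20 runs' and 'last 20 runs'; columns could be
    'distance cells' and 'time cells'. Returns (chi2, p_value), df = 1.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one cell")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p = chi_square_2x2(10, 20, 20, 10)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

      With small cell counts an exact test may be more appropriate than this approximation; that choice is left to the analyst.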

      The grayed boxes in Author response table 1 indicate the sessions which followed a transition between session types.

      Some circularities in the construction and interpretation of the time-cell and distance-cell classifiers are not clearly addressed. The classifiers currently appear to be fit to predict the type of session a cell’s response patterns are observed within. But it is tautological to use the session type to define the cell type. I sense this is ultimately reasonable because of how the classifier is built, but this concern is not addressed or explained.

      We regret that the term ‘classifiers’ was not sufficiently precise. We used this term to describe the metrics designed to express the relation between the firing-time and the velocity, in order to classify cells, rather than classifiers that are fit to predict the type of session. We believe this to be the source of the apparent circularity. To avoid this confusion, we have now replaced every occurrence of the term “classifier” with the term “metric”.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, May et al use H2B overexpression driven by Keratin14 Cre-mediated excision of a loxPstop cassette to quantify bulk chromatin dynamics in the live epidermis. They observe heterogeneity of H2B distribution within the basal stem cell layer and a change in distribution when the stem cells delaminate into the suprabasal layers. They further show that these chromatin rearrangements precede cell fate commitment, as detected by adding another Cre-mediated transgene on top (tetO-Cre mediated Keratin10 reporter). Finally, they generate an MST stem-loop transgene for the keratin 10 transcript and observe transcriptional bursting.

      We would like to clarify for the reviewer that the H2B system used is a transgenic allele of histone-2B-GFP that is driven directly by the Keratin-14 promoter (Kanda et al., 1998; Tumbar et al., 2004). This system does not rely on any Cre-mediated excision of the LoxP-stop cassette, and these mice do not carry Cre alleles. We will touch on this point below when addressing the comment on Cre expression in cells and the raised question on whether it influences the quantifications of chromatin compaction.

      The manuscript uses elegant in vivo imaging approaches to describe a set of observations that are logically based on a panel of studies that have used genetic approaches to dissect the role of heterochromatin and histone/DNA modifications in epidermal state transitions. In addition, the MST stem-loop analysis is a nice technical advance, confirming transcriptional bursting as a general phenomenon of how transcription is regulated in cells (see work from Daniel Larsson, Jonathan Chubb, Arjun Raj, and others).

      We thank the reviewer for their recognition of our contribution to the transcription field. To deepen the connection between our data and previous characterizations of transcriptional dynamics in other systems, we have added new analyses of K10MS2 transcriptional bursting on a finer temporal scale (Fig 5G-K). We find pervasive “transcriptional bursting,” consistent with findings in vitro and in other model organisms, and a surprising variation of burst durations. We believe these additional analyses significantly strengthen our conclusions and the relevance of our study to the overall transcription field.

      The value of the study in my view is recapitulating these known phenomena in a live tissue setting with high-quality imaging and careful quantification. Overall, the analyses appear thorough, although the overall changes appear relatively minor, which is perhaps to be expected from imaging bulk H2B distribution as a proxy for chromatin states.

      There is one major technical concern that might impact the interpretation of the data. The authors combine Cre lines for their key conclusions (Krt10 reporter and SRF KO) and analyze single cells that thus express very high levels of Cre. Knowing that Cre will target non-loxP sites and is genotoxic, it is possible that the effect of chromatin is due to high levels of Cre expression in single cells rather than specific effects due to cell state transitions. I would encourage the authors to carefully quantify the dose-dependent effects of the Cre protein (independent of the LoxP sites) on chromatin organization. Along these lines, is the phenotype of the SRF KO similar in the presence of two Cre alleles versus just one?

      Thank you for these kind words. This is an important potential caveat to consider. We believe that Cre activity does not significantly affect the chromatin compaction profiles for several reasons. First, we interrogated Cre activity. The quantifications in Figure 1A-E and Figure 2B-C are from mice that contain the K14H2B-GFP allele alone and do not carry any Cre allele. When these data were compared to those from mice that had been treated with a high dose of tamoxifen to induce Cre-mediated recombination in the vast majority of cells, the chromatin compaction profiles were not significantly different (Supp Fig 3C). We have added this comparison to Supplemental Figure 3 and addressed this point in the text (page 9). To further determine whether Cre-mediated recombination affects our measurement of chromatin compaction, we also analyzed adjacent basal cells with and without Cre activity in the same animal. K14H2B-GFP; K14CreER; tdTomato mice were induced with a low dose of tamoxifen such that roughly 65% of epidermal cells underwent Cre recombination as demonstrated by expression of the tdTomato fluorescent reporter (Gallini et al., 2022). They also received a punch biopsy performed on the unimaged ear. Three days post injury and six days after Cre induction, the chromatin compaction profiles of cells positive and negative for Cre-mediated recombination were also not significantly different (Rebuttal Figure 1). Together, these direct comparisons between cells exposed to Cre activity and cells not exposed to Cre activity indicate that Cre activity at levels comparable to those used in our experiments has no measurable effect on our measurements of chromatin compaction.

      Rebuttal Figure 1: Effect of Cre expression on chromatin compaction profiles

      The second issue is the conclusion of "chromatin spinning". Concluding that chromatin is spinning would in my view require that the authors demonstrate that the nuclear envelope is not moving or is moving less than the chromatin. To support this conclusion the authors should do double imaging for example with LINC complex proteins, an ER/outer nuclear membrane marker, or equivalent.

      This is an excellent point. While we expect that the entire nucleus is spinning based on observations others have made in in vitro fibroblast systems, we describe our observation as “chromatin spinning” instead of “nuclear spinning” because the K14H2B-GFP allele only allows us to directly visualize chromatin itself (Kumar et al., 2014; Zhu et al., 2018).

      Unfortunately, LINC complex proteins and nuclear membrane proteins have not been fluorescently tagged in mice, which prevents us from visualizing their dynamics in vivo. To establish these new tools and perform experiments would take more than a year, making it therefore beyond the scope of this current paper. Additionally, their relatively uniform distribution across the nuclear membrane would not allow us to visualize potential spinning of these components. We have made efforts towards the reviewer’s question by asking whether other compartments within the cell also spin in delaminating cells. To do this, we leveraged a mouse line developed by Claudio Franco’s lab (Barbacena et al., 2019), which fluorescently labels both the chromatin (H2B-GFP) and the Golgi (GTS-mCherry). As expected, this model showed a perinuclear and polarized Golgi in skin fibroblasts (Rebuttal Figure 2). However, this tool is incompatible with our questions in epidermal cells for a few reasons. First, the system is toxic to epithelial cells in vivo, resulting in apoptosis, nuclear fragmentation, and binucleate cells. Second, the Golgi is not discretely polarized (or even perinuclear) in epithelial cells (Rebuttal Figure 2). As such, although we observe chromatin spinning in delaminating basal cells, we are uncertain as to whether the whole nucleus or any other cellular compartments are spinning in these cells.

      Rebuttal Figure 2: Interrogation of intracellular spinning

      Given the above reasoning and efforts, we have altered the text and specified that we only have the capacity to visualize chromatin through the H2B-GFP allele and that we hypothesize the entire nucleus is spinning (page 11).

      Reviewer #2 (Public Review):

      In this work entitled "Live imaging reveals chromatin compaction transitions and dynamic transcriptional bursting during stem cell differentiation in vivo" the authors use a combination of genetic and imaging tools to characterize dynamic changes in chromatin compaction of cells undergoing epidermal stem cell differentiation and to relate chromatin compaction to transcriptional regulation in vivo. They track this phenomenon by imaging the epithelium at the ear of live mice, thus in a physiological context. By following individual nuclei expressing H2B-GFP along time ranges of hours and up to 3 days, they develop a strategy to quantify the profile of chromatin compaction across different epidermal layers based on normalized intensity profiles of H2B-GFP. They observe that cells belonging to the basal stem cell layer display a considerable level of internuclear variability in chromatin compaction that is cell-cycle independent. Instead, intercellular variability in chromatin compaction appears more related to the differentiation status of the cells as it is stable in the hours range but dynamic in the days range. The authors show that differentiated nuclei in the spinous layer exhibit higher chromatin compaction. They also identified a subset of cells in the basal stem layer with an intermediate profile of chromatin compaction and with the dynamic expression of the early differentiation marker keratin 10. Lastly, they show that the expression of keratin-10 precedes the chromatin compaction establishing relevant temporal relationships in the process of epidermal differentiation.

      This work includes a number of challenging approaches and techniques since it is carried out in living mice. Also, it provides nice tools and methods to study chromatin structure in vivo during multiple days and within a differentiation physiological system. On the other hand, the results are descriptive and, in some respect, expected in line with previous observations.

      Thank you very much for this great summary, kind words, and the recommendations listed below. We will address each of them specifically. We have also deepened the analysis of transcriptional dynamics in ways that are more comparable with how other groups have studied transcription and included those results in Figure 5.

      References

      Kanda, T., Sullivan, K.F., and Wahl, G.M. (1998). Histone–GFP fusion protein enables sensitive analysis of chromosome dynamics in living mammalian cells. Current Biology 8, 377–385. 10.1016/S0960-9822(98)70156-3.

      Tumbar, T., Guasch, G., Greco, V., Blanpain, C., Lowry, W.E., Rendl, M., and Fuchs, E. (2004). Defining the epithelial stem cell niche in skin. Science 303, 359–363. 10.1126/science.1092436.

      Kumar, A., Maitra, A., Sumit, M., Ramaswamy, S., and Shivashankar, G.V. (2014). Actomyosin contractility rotates the cell nucleus. Sci Rep 4, 3781. 10.1038/srep03781.

      Zhu, R., Liu, C., and Gundersen, G.G. (2018). Nuclear positioning in migrating fibroblasts. Seminars in Cell & Developmental Biology 82, 41–50. 10.1016/j.semcdb.2017.11.006.

      Sara Gallini, Nur-Taz Rahman, Karl Annusver, David G. Gonzalez, Sangwon Yun, Catherine Matte-Martone, Tianchi Xin, Elizabeth Lathrop, Kathleen C. Suozzi, Maria Kasper, Valentina Greco. Injury suppresses Ras cell competitive advantage through enhanced wild-type cell proliferation. bioRxiv 2022.01.05.475078; doi: https://doi.org/10.1101/2022.01.05.475078

      Pedro Barbacena, Marie Ouarné, Jody J Haigh, Francisca F Vasconcelos, Anna Pezzarossa, Claudio A Franco. GNrep mouse: A reporter mouse for front-rear cell polarity. Genesis 2019 Jun. DOI: 10.1002/dvg.23299

      Cristiana M Pineda, Sangbum Park, Kailin R Mesa, Markus Wolfel, David G Gonzalez, Ann M Haberman, Panteleimon Rompolas, Valentina Greco. Intravital imaging of hair follicle regeneration in the mouse. Nature Protocols 2015 July. DOI: 10.1038/nprot.2015.070

    1. Author Response

      Reviewer #1 (Public Review):

      The work by Yijun Zhang and Zhimin He et al. analyzes the role of HDAC3 within DC subsets. Using an inducible ERT2-cre mouse model they observe the dependency of pDCs but not cDCs on HDAC3. The requirement of this histone modifier appears to be early during development around the CLP stage. Tamoxifen treated mice lack almost all pDCs besides lymphoid progenitors. Through bulk RNA seq experiment the authors identify multiple DC specific target genes within the remaining pDCs and further using Cut and Tag technology they validate some of the identified targets of HDAC3. Collectively the study is well executed and shows the requirement of HDAC3 on pDCs but not cDCs, in line with the recent findings of a lymphoid origin of pDC.

      1) While the authors provide extensive data on the requirement of HDAC3 within progenitors, the high expression of HDAC3 in mature pDCs may underlie a functional requirement. Have you tested IFN production in CD11c cre pDCs? Are there transcriptional differences between pDCs from HDAC CD11c cre and WT mice?

      We greatly appreciate the reviewer’s point. We have confirmed that Hdac3 can be efficiently deleted in pDCs of Hdac3fl/fl-CD11c Cre mice (Figure 5-figure supplement 1 in revised manuscript). Furthermore, in those Hdac3fl/fl-CD11c Cre mice, we have observed significantly decreased expression of key cytokines (Ifna, Ifnb, and Ifnl) by pDCs upon activation by CpG ODN (shown in Author response image 1). Therefore, HDAC3 is also required for proper pDC function. However, we have yet to conduct RNA-seq analysis comparing pDCs from HDAC CD11c cre and WT mice.

      Author response image 1.

      Cytokine expression in Hdac3 deficient pDCs upon activation

      2) A more detailed characterization of the progenitor compartment that is compromised following depletion would be important, as also suggested in the specific points.

      We thank the reviewer for this constructive suggestion. We have performed a thorough analysis of the phenotype of hematopoietic stem cells and progenitor cells at various developmental stages in the bone marrow of Hdac3 deficient mice, based on the gating strategy from the recommended reference. Briefly, we analyzed the subpopulations of progenitors described in the published report by "Pietras et al. 2015", namely MPP2, MPP3 and MPP4, using the same gating strategy for hematopoietic stem/progenitor cells. As shown in Author response image 2 and Author response image 3, we found that the number of LSK cells was increased in Hdac3 deficient mice, especially the subpopulations of MPP2 and MPP3, whereas MPP4 showed no significant change. In contrast, the numbers of LT-HSC, ST-HSC and CLP were all dramatically decreased. This result has been optimized and added as Figure 3A in the revised manuscript. The relevant description has been added and underlined in the revised manuscript, Page 6 Line 164-168.

      Author response image 2.

      Gating strategy for hematopoietic stem/progenitor cells in bone marrow.

      Author response image 3.

      Hematopoietic stem/progenitor cells in Hdac3 deficient mice

      Reviewer #2 (Public Review):

      In this article Zhang et al. report that the Histone Deacetylase-3 (HDAC3) is highly expressed in mouse pDC and that pDC development is severely affected both in vivo and in vitro when using mice harbouring conditional deletion of HDAC3. However, pDC numbers are not affected in Hdac3fl/fl Itgax-Cre mice, indicating that HDAC3 is dispensable in CD11c+ late stages of pDC differentiation. Indeed, the authors provide wide experimental evidence for a role of HDAC3 in early precursors of pDC development, by combining adoptive transfer, gene expression profiling and in vitro differentiation experiments. Mechanistically, the authors have demonstrated that HDAC3 activity represses the expression of several transcription factors promoting cDC1 development, thus allowing the expression of genes involved in pDC development. In conclusion, these findings reveal HDAC3 as a key epigenetic regulator of the expression of the transcription factors required for pDC vs cDC1 developmental fate.

      These results are novel and very promising. However, supplementary information and eventual further investigations are required to improve the clarity and the robustness of this article.

      Major points

      1) The gating strategy adopted to identify pDC in the BM and in the spleen should be entirely described and shown, at least as a Supplementary Figure. For the BM the authors indicate in the M & M section that they negatively selected cells for CD8a and B220, but both markers are actually expressed by differentiated pDC. However, in the Figures 1 and 2 pDC has been shown to be gated on CD19- CD11b- CD11c+. What is the precise protocol followed for pDC gating in the different organs and experiments?

      We apologize for not clearly describing the protocols used in this study. Please see the detailed gating strategy for pDC in bone marrow, and for pDC and cDC in spleen (Figure 4 and Figure 5). This information has now been added to Figure1−figure supplement 3. The relevant description has been underlined in Page 5 Line 113-116 of the revised manuscript.

      We would like to clarify that in our study, we used two different panels of antibody cocktails, one for bone marrow Lin- cells, including mAbs to CD2/CD3/TER-119/Ly6G/B220/CD11b/CD8/CD19; the other for DC enrichment, including mAbs to CD3/CD90/TER-119/Ly6G/CD19. We included B220 in the Lineage cocktails to deplete B cells and pDCs, in order to enrich for the progenitor cells from bone marrow. However, when enriching for the pDC and cDC, B220 and CD8a were not included in the cocktail, to avoid depletion of pDC and cDC1 subsets. For the flow cytometry analysis of pDCs, we gated pDCs as the CD19−CD11b−CD11c+B220+SiglecH+ population in both bone marrow and spleen. The relevant description has been underlined in the revised manuscript, Page 16 Line 431-434.

      2) pDC identified in the BM as SiglecH+ B220+ can actually contain DC precursors, that can express these markers, too. This could explain why the impact of HDAC3 deletion appears stronger in the spleen than in the BM (Figures 1A and 2A). Along the same line, I think that it would be important to show the phenotype of pDC in control vs HDAC3-deleted mice for the different pDC markers used (SiglecH, B220, Bst2) and I would suggest to include also Ly6D, taking also in account the results obtained in Figures 4 and 7. Finally, as HDAC3 deletion induces downregulation of CD8a in cDC1 and pDC express CD8a, it would be important to analyse the expression of this marker on control vs HDAC3-deleted pDC.

      We agree with the reviewer’s points. In the revised manuscript, we incorporated major surface markers, including Siglec H, B220, Ly6D, and PDCA-1, all of which consistently demonstrated a substantial decrease in the pDC population in Hdac3 deficient mice. Moreover, we did notice that Ly6D+ pDCs showed a higher degree of decrease in Hdac3 deficient mice. Additionally, the percentage and number of both CD8+ pDC and CD8- pDC were decreased in Hdac3 deficient mice (Author response image 4). These results are shown in Figure1−figure supplement 4 of the revised manuscript. The relevant description has been added and underlined in the revised manuscript, Page 5 Line 121-125.

      Author response image 4.

      Bone marrow pDCs in Hdac3 deficient mice revealed by multiple surface markers

      3) How do the authors explain that in the absence of HDAC3 cDC2 development increased in vivo in chimeric mice, but reduced in vitro (Figures 2B and 2E)?

      Please see our response to Minor point 5 of Reviewer #1. Briefly, we suggest that the variability may be explained by the timing of analysis after HDAC3 deletion. In Figure 2C, we analyzed cells from the recipients one week after the final tamoxifen treatment and observed no significant change in the percentage of cDC2 when all the experimental data were pooled. In Figure 2E, where tamoxifen was administered at Day 0 in Flt3L-mediated DC differentiation in vitro, the DC subsets generated were then analyzed at different time points. We observed no significant changes in cDCs and cDC2 at Day 5, but decreases in the percentage of cDC2 were observed at Day 7 and Day 9. This suggested that the cDC subsets at Day 5 might have originated from progenitors at a later stage, while those at Day 7 and Day 9 might originate from the earlier progenitors. Therefore, based on these in vitro and in vivo experiments, we believe that the variation in the cDC2 phenotype might be attributed to the progenitors at different stages that generated these cDCs.

      4) More generally, as reported also by authors (line 207), the reconstitution with HDAC3-deleted cells is poorly efficient. Although cDC seem not to be impacted, are other lymphoid or myeloid cells affected? This should be expected as HDAC3 regulates T and B development, as well as macrophage function. This should be important to know, although this does not call into question the results shown, as obtained in a competitive context.

      In this study, we found no significant influence on T cells, mature B cells or NK cells, but immature B cells were significantly decreased, in Hdac3-ERT2-Cre mice after tamoxifen treatment (Figure 6). However, in the bone marrow chimera experiments, the numbers of major lymphoid cells were decreased due to the impaired reconstitution capacity of Hdac3 deficient progenitors. Consistent with our finding, it has been reported that HDAC3 was required for T cell and B cell generation, in HDAC3-VavCre mice (Summers et al., 2013), and was necessary for T cell maturation (Hsu et al., 2015). Moreover, HDAC3 is also required for the expression of inflammatory genes in macrophages upon activation (Chen et al., 2012; Nguyen et al., 2020).

      5) What are the precise gating strategies used to identify the different hematopoietic precursors in the Figure 4 ? In particular, is there any lineage exclusion performed?

      We apologize for not describing the experimental procedures clearly. In this study we enriched the lineage negative (Lin−) cells from the bone marrow using a Lineage-depleting antibody cocktail including mAbs to CD2/CD3/TER-119/Ly6G/B220/CD11b/CD8/CD19. We also provide the gating strategy implemented for sorting LSK and CDP populations from the Lin− cells in the bone marrow (Author response image 5), shown in the Figure 3A and Figure4−figure supplement 1 of revised manuscript.

      Author response image 5.

      Gating strategy for LSK, CD115+ CDP and CD115− CDP in bone marrow

      6) Moreover, what is the SiglecH+ CD11c- population appearing in the spleen of mice reconstituted with HDAC3-deleted CDP, in Fig 4D?

      We also noticed the appearance of a SiglecH+CD11c− cell population in the spleen of recipient mice reconstituted with HDAC3-deficient CD115−CDPs, while the presence of this population was not as significant in the HDAC3-Ctrl group, as shown in Figure 4D. We speculate that this SiglecH+CD11c− cell population might represent some cells at a differentiation stage earlier than pre-DCs. Alternatively, the relatively increased percentage of this population derived from HDAC3-deficient CD115−CDP might be due to the substantially decreased total numbers of DCs. This could be clarified by further analysis using additional cell surface markers.

      7) Finally, in Fig 4H, how do the authors explain that Hdac3fl/fl express Il7r, while they are supposed to be sorted CD127- cells?

      This is indeed an interesting question. In this study, we confirmed that CD115−CDPs were isolated from the surface CD127− cell population for RNA-seq analysis, and the purity of the sorted cells was checked (Author response image 6), as shown in Figure 4−figure supplement 1 of the revised manuscript.

      A possible explanation for the expression of Il7r mRNA in some HDAC3fl/fl CD115−CDPs, as revealed in Figure 4H by RNA-seq analysis, is a very low level of cell surface CD127 expression; such cells could not be efficiently excluded by sorting for surface CD127− cells.

      Author response image 6.

      CD115−CDPs sorting from Hdac3-Ctrl and Hdac3-KO mice

      8) What is known about the expression of HDAC3 in the different hematopoietic precursors analysed in this study? This information is available only for a few of them in Supplementary Figure 1. If not yet studied, they should be addressed.

      We conducted additional analysis to address the expression of Hdac3 in various hematopoietic progenitor cells at different stages, based on the RNA-seq analysis. The data revealed a relatively consistent level of Hdac3 expression in progenitor populations, including HSC, MMP4, CLP, CDP and BM pDCs (Author response image 7). This suggests that HDAC3 may play an important role in the regulation of hematopoiesis at multiple stages. This information has now been added to Figure 1−figure supplement 1B of the revised manuscript.

      Author response image 7.

      Hdac3 expression in hematopoietic progenitor cells

      9) It would be highly informative to extend CUT and Tag studies to Irf8 and Tcf4, if this is technically feasible.

      We totally agree with the reviewer. We did attempt a CUT and Tag study to compare the binding sites of IRF8 and TCF4 in wild-type and Hdac3-deficient pDCs. However, it proved technically unfeasible to obtain reliable results because of the limited number of cells we could obtain from the HDAC3-deficient mice. We are committed to exploring alternative approaches or technologies in future studies to address this issue.

    1. Author response:

      Reviewer #1 (Public Review):

      How does the brain respond to the input of different complexity, and does this ability to respond change with age?

      The study by Lalwani et al. tried to address this question by pulling together a number of neuroscientific methodologies (fMRI, MRS, drug challenge, perceptual psychophysics). A major strength of the paper is that it is backed up by robust sample sizes and careful choices in data analysis, translating into a more rigorous understanding of the sensory input as well as the neural metric. The authors apply a novel analysis method developed in human resting-state MRI data on task-based data in the visual cortex, specifically investigating the variability of neural response to stimuli of different levels of visual complexity. A subset of participants took part in a placebo-controlled drug challenge and functional neuroimaging. This experiment showed that increases in GABA have differential effects on participants with different baseline levels of GABA in the visual cortex, possibly modulating the perceptual performance in those with lower baseline GABA. A caveat is that no single cohort has taken part in all study elements, i.e., visual discrimination with drug challenge and neuroimaging. Hence the causal relationship is limited to the neural variability measure and does not extend to visual performance. Nevertheless, the consistent use of visual stimuli across approaches (computational, behavioural, and fMRI all draw from the same set of images) permits an exceptionally high level of comparability across modalities. The conclusions that can be made on such a coherent data set are strong.

      The community will benefit from the technical advances, esp. the calculation of BOLD variability, in the study when described appropriately, encouraging further linkage between complementary measures of brain activity, neurochemistry, and signal processing.

      Thank you for your review. We agree that a future study with a single cohort would be an excellent follow-up.

      Reviewer #2 (Public Review):

      Lalwani et al. measured BOLD variability during the viewing of houses and faces in groups of young and old healthy adults and measured ventrovisual cortex GABA+ at rest using MR spectroscopy. The influence of the GABA-A agonist lorazepam on BOLD variability during task performance was also assessed, and baseline GABA+ levels were considered as a mediating variable. The relationship of local GABA to changes in variability in BOLD signal, and how both properties change with age, are important and interesting questions. The authors feature the following results: 1) younger adults exhibit greater task-dependent changes in BOLD variability and higher resting visual cortical GABA+ content than older adults, 2) greater BOLD variability scales with GABA+ levels across the combined age groups, 3) administration of a GABA-A agonist increased condition differences in BOLD variability in individuals with lower baseline GABA+ levels but decreased condition differences in BOLD variability in individuals with higher baseline GABA+ levels, and 4) resting GABA+ levels correlated with a measure of visual sensory ability derived from a set of discrimination tasks that incorporated a variety of stimulus categories.

      Strengths of the study design include the pharmacological manipulation for gauging a possible causal relationship between GABA activity and task-related adjustments in BOLD variability. The consideration of baseline GABA+ levels for interpreting this relationship is particularly valuable. The assessment of feature-richness across multiple visual stimulus categories provided support for the use of a single visual sensory factor score to examine individual differences in behavioral performance relative to age, GABA, and BOLD measurements.

      Weaknesses of the study include the absence of an interpretation of the physiological mechanisms that contribute to variability in BOLD signal, particularly for the chosen contrast that compared viewing houses with viewing faces.

      Whether any of the observed effects can be explained by patterns in mean BOLD signal, independent of variability would be useful to know.

      One of the first pre-processing steps of computing SDBOLD involves subtracting the block-mean from the fMRI signal for each task-condition. Therefore, patterns observed in BOLD signal variability are not driven by the mean-BOLD differences. Moreover, as noted above, to further confirm this, we performed an additional mean-BOLD-based analysis (see Supplementary Materials, p. 3). Results suggest that ∆⃗ MEANBOLD is actually larger in older adults vs. younger adults (∆⃗ SDBOLD exhibited the opposite pattern), but more importantly ∆⃗ MEANBOLD is not correlated with GABA or with visual performance. This is also consistent with prior research (Garrett et al., 2011, 2013, 2015, 2020) that found MEANBOLD to be relatively insensitive to behavioral performance.
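
      The block-mean subtraction described above can be illustrated with a minimal sketch. This is not the processing code used in the study: the function name `sd_bold`, the toy signal values, and the use of the population standard deviation are all assumptions made for illustration. It shows only that once each block's mean is removed, adding an arbitrary offset to a block leaves the variability estimate unchanged.

```python
from statistics import mean, pstdev


def sd_bold(blocks):
    """SD of a voxel time series after removing each block's own mean.

    `blocks` is a list of lists, one inner list of signal values per task
    block of a single condition (hypothetical layout, for illustration).
    """
    centered = []
    for block in blocks:
        block_mean = mean(block)
        # Remove the block mean so that mean differences cannot drive the SD.
        centered.extend(value - block_mean for value in block)
    # Population SD of the concatenated, mean-centered blocks (an assumption).
    return pstdev(centered)


# Two toy blocks from one condition (made-up numbers).
blocks = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]

# The same blocks with large, block-specific mean offsets added.
shifted = [[v + 100.0 for v in blocks[0]], [v - 50.0 for v in blocks[1]]]

print(sd_bold(blocks), sd_bold(shifted))
```

      Both calls return the same value, which is the sense in which mean-BOLD differences between blocks do not contribute to the variability measure in this toy setting.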

      The positive correlation between resting GABA+ levels and the task-condition effect on BOLD variability reaches significance at the total group level, when the young and old groups are combined, but not separately within each group. This correlation may be explained by age-related differences since younger adults had higher values than older adults for both types of measurements. This is not to suggest that the relationship is not meaningful or interesting, but that it may be conceptualized differently than presented.

      Thank you for this important point. The relationship between GABA and ∆⃗ SDBOLD shown in Figure 3 is also significant within each age-group separately (Line 386-388). The model used both age-group and GABA as predictors of ∆⃗ SDBOLD and found that both had a significant effect, while the Age-group x GABA interaction was not significant. The effect of age on ∆⃗ SDBOLD therefore does not completely explain the observed relationship between GABA and ∆⃗ SDBOLD because this latter effect is significant in both age-groups individually and in the whole sample even when variance explained by age is accounted for. The revision clarifies this important point (Ln 488-492). Thanks for raising it.
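
      The logic of separating an age-group effect from a GABA effect can be sketched with synthetic numbers. The values below are invented for illustration and are not data from the study; the helper names `slope` and `center` are likewise hypothetical. The sketch relies on a standard property of linear regression: in a model with an age-group term and a GABA term, the GABA coefficient equals the slope obtained after centering both variables within each age group.

```python
from statistics import mean


def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def center(values):
    """Subtract the group mean from every value."""
    m = mean(values)
    return [v - m for v in values]


# Synthetic, made-up values: (GABA, delta SD-BOLD) for each age group.
young_gaba, young_delta = [3.0, 3.2, 3.4], [0.50, 0.60, 0.70]
old_gaba, old_delta = [2.0, 2.2, 2.4], [0.10, 0.20, 0.30]

# Slope within each age group separately.
within_young = slope(young_gaba, young_delta)
within_old = slope(old_gaba, old_delta)

# Pooled slope that ignores age group (mixes within- and between-group effects).
pooled = slope(young_gaba + old_gaba, young_delta + old_delta)

# Slope after removing group means, i.e. with age group accounted for.
adjusted = slope(center(young_gaba) + center(old_gaba),
                 center(young_delta) + center(old_delta))

print(within_young, within_old, pooled, adjusted)
```

      In this toy example the relationship survives the adjustment for age group, which is the pattern described in the response; a relationship driven purely by group differences would instead shrink toward zero after centering.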

      Two separate dosages of lorazepam were used across individuals, but the details of why and how this was done are not provided, and the possible effects of the dose are not considered.

      Good point. We utilized two dosages to maximize our chances of finding a dosage that had a robust effect. The specific dosage was randomly assigned across participants and the dosage did not differ across age-groups or baseline GABA levels. We also controlled for the drug-dosage when examining the role of drug-related shift in ∆⃗ SDBOLD. We have clarified these points in the revision and highlighted the analysis that found no effect of dosage on drug-related shift in ∆⃗ SDBOLD (Line 407-418).

      The observation of greater BOLD variability during the viewing of houses than faces may be specific to these two behavioral conditions, and lingering questions about whether these effects generalize to other types of visual stimuli, or other non-visual behaviors, in old and young adults, limit the generalizability of the immediate findings.

      We agree that examining the factors that influence BOLD variability is an important topic for future research. In particular, although it is increasingly well known that variability modulation itself can occur in a host of different tasks and research contexts across the lifespan (see Garrett et al., 2013; Waschke et al., 2021), to address the question of whether variability modulation occurs directly in response to stimulus complexity in general, it will be important for future work to examine a range of stimulus categories beyond faces and houses. Doing so is indeed an active area of research in Dr. Garrett’s group, where visual stimuli from many different categories are examined (e.g., for a recent approach, see Waschke et al., 2023, bioRxiv). Regardless, only face and house stimuli were available in the current dataset. We therefore exploited the finding that BOLD variability tends to be larger for house stimuli than for face stimuli (in line with the HMAX model output) to demonstrate that the degree to which a given individual modulates BOLD variability in response to stimulus category is related to their age, to GABA levels, and to behavioral performance.

      The observed age-related differences in patterns of BOLD activity and ventrovisual cortex GABA+ levels along with the investigation of GABA-agonist effects in the context of baseline GABA+ levels are particularly valuable to the field, and merit follow-up. Assessing background neurochemical levels is generally important for understanding individualized drug effects. Therefore, the data are particularly useful in the fields of aging, neuroimaging, and vision research.

      Thank you, we agree!

      Reviewer #3 (Public Review):

      The role of neural variability in various cognitive functions is one of the focal contentions in systems and computational neuroscience. In this study, the authors used a large-scale cohort dataset to investigate the relationship between neural variability measured by fMRI and several factors, including stimulus complexity, GABA levels, aging, and visual performance. Such investigations are valuable because neural variability, as an important topic, is by far mostly studied within animal neurophysiology. There is little evidence in humans. Also, the conclusions are built on a large-scale cohort dataset that includes multimodal data. Such a dataset per se is a big advantage. Pharmacological manipulations and MRS acquisitions are rare in this line of research. Overall, I think this study is well-designed, and the manuscript reads well. I listed my comments below and hope my suggestions can further improve the paper.

      Strength:

      1) The study design is astonishingly rich. The authors used task-based fMRI, MRS technique, population contrast (aging vs. control), and psychophysical testing. I appreciate the motivation and efforts for collecting such a rich dataset.

      2) The MRS part is good. I am not an expert in MRS so cannot comment on MRS data acquisition and analyses. But I think linking neural variability to GABA in humans is in general a good idea. There has been a long interest in the cause of neural variability, and inhibition of local neural circuits has been hypothesized as one of the key factors.

      3) The pharmacological manipulation is particularly interesting as it provides at least some evidence for the causal effects of GABA on deltaSDBOLD. I think this is quite novel.

      Weakness:

      1) I am concerned about the definition of neural variability. In electrophysiological studies, neural variability can be defined as Poisson-like spike count variability. In the fMRI world, however, there is no consensus on what neural variability is. There are at least three definitions. One is the variability (e.g., std) of the voxel response time series as used here and in the resting fMRI world. The second is to regress out the stimulus-evoked activation and only calculate the std of residuals (e.g., background variability). The third is to calculate the trial-by-trial variability of beta estimates from general linear modeling. The relations between these three types of variability and other factors currently remain unclear. It also remains unclear the links between neuronal variability and voxel variability. I don't think the computational principles discovered in neuronal variability also apply to voxel responses. I hope the authors can acknowledge their differences and discuss their differences.

      These are very important points, thank you for raising them. Although we agree that the majority of the single cell electrophysiology world indeed seems to prefer Poisson-like spiking variability as an easy and tractable estimate, it is certainly not the only variability approach in that field (e.g., entropy; see our most recent work in humans where spiking entropy outperforms simple spike counts to predict memory performance; Waschke et al., 2023, bioRxiv). In LFP, EEG/MEG and fMRI, there is indeed no singular consensus on what variability “is”, and in our opinion, that is a good thing. We have reported at length in past work about entire families of measures of signal variability, from simple variance, to power, to entropy, and beyond (see Table 1 in Waschke et al, 2021, Neuron). In principle, these measures are quite complementary, obviating the need to establish any single-measure consensus per se. Rather than viewing the three measures of neural variability that the reviewer mentioned as competing definitions, we prefer to view them as different sources of variance. For example, from each of the three sources of variance the reviewer suggests, any number of variability measures could be computed.

      The current study focuses on using the standard deviation of concatenated blocked time series separately for face and house viewing conditions (this is the same estimation approach used in our very earliest studies on signal variability; Garrett et al., 2010, JNeurosci). In those early studies, and nearly every one thereafter (see Waschke et al., 2021, Neuron), there is no ostensible link between SDBOLD (as we normally compute it) and average BOLD from either multivariate or GLM models; as such, in past work we have not found any clear difference in SDBOLD results whether or not average “evoked” responses are removed. This is perhaps also why removing ERPs from EEG time series rarely influences estimates of variability in our work (e.g., Kloosterman et al., 2020, eLife).

      The third definition the reviewer notes refers to variability of beta estimates over trials. Our most recent work has done exactly this (e.g., Skowron et al., 2023, bioRxiv), calculating the SD even over single time point-wise beta estimates so that we may better control the extraction of time points prior to variability estimation. Although direct comparisons have not yet been published by us, variability over single TR beta estimates and variability over the time series without beta estimation are very highly correlated in our work (in the .80 range; e.g., Kloosterman et al., in prep).

      Re: the reviewer’s point that “It also remains unclear the links between neuronal variability and voxel variability. I don’t think the computational principles discovered in neuronal variability also apply to voxel responses. I hope the authors can acknowledge their differences and discuss their differences.” If we understand correctly, the reviewer may be asking about within-person links between single-cell neuronal variability (to allow Poisson-like spiking variability) and voxel variability in fMRI? To our knowledge, no such study has been conducted to date (such data barely exist). Or rather, perhaps the reviewer is noting a more general point regarding the “computational principles” of variability in these different domains? If that is true, then a few points are worth noting. First, there is absolutely no expectation of Poisson distributions in continuous brain imaging-based time series (LFP, E/MEG, fMRI). To our knowledge, such distributions (which have equivalent means and variances, allowing e.g., Fano factors to be estimated) are mathematically possible in spiking because of the binary nature of spikes; when mean rates rise, so too do variances given that activity pushes away from the floor (of no activity). In continuous time signals, there is no effective “zero”, so a mathematical floor does not exist outright. This is likely why means and variances are not well coupled in continuous time signals (see Garrett et al., 2013, NBR; Waschke et al., 2021, Neuron); anything can happen. Regardless, convergence is beginning to be revealed between the effects noted from spiking and continuous time estimates of variability. For example, we show that spiking variability can show a similar, behaviourally relevant coupling to the complexity of visual input (Waschke et al., 2023, bioRxiv) as seen in the current study and in past work (e.g., Garrett et al., 2020, NeuroImage). Whether such convergence reflects common computational principles of variability remains to be seen in future work, despite known associations between single cell recordings and BOLD overall (e.g., Logothetis and colleagues, 2001, 2002, 2004, 2008).
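
      The mean-variance coupling of spike counts mentioned above can be illustrated with a small simulation. This is a hedged sketch rather than an analysis from the study: the function names, bin counts, spike probabilities and seeds are all assumptions. Spikes are modeled as independent binary events per time bin, so the counts are binomial, which approximates a Poisson distribution when the per-bin probability is small (the exact Fano factor of a binomial count is 1 minus the spike probability).

```python
import random
from statistics import mean, variance


def spike_counts(p_spike, n_bins=200, n_trials=1000, seed=0):
    """Simulate spike counts per trial from independent binary bins."""
    rng = random.Random(seed)
    return [sum(rng.random() < p_spike for _ in range(n_bins))
            for _ in range(n_trials)]


def fano_factor(counts):
    """Variance-to-mean ratio of spike counts across trials."""
    return variance(counts) / mean(counts)


# Hypothetical low-rate and high-rate neurons (expected counts of 2 and 10).
low = spike_counts(0.01, seed=1)
high = spike_counts(0.05, seed=2)

print(mean(low), variance(low), fano_factor(low))
print(mean(high), variance(high), fano_factor(high))
```

      In this simulation the variance rises with the mean while the Fano factor stays near 1. A continuous signal has no such constraint: adding a constant offset to a continuous time series changes its mean but not its variance.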

      Given the intricacies of these arguments, we don’t currently include this discussion in the revised text. However, we would be happy to include aspects of this content in the main paper if the reviewer sees fit.

      2) If I understand it correctly, the positive relationship between stimulus complexity and voxel variability has been found in the author's previous work. Thus, the claims in the abstract in lines 14-15, and section 1 in results are exaggerated. The results simply replicate the findings in the previous work. This should be clearly stated.

      Good point. Since this finding was a replication and an extension, we reported these results mostly in the supplementary materials. The stimulus set used for the current study is different from that of Garrett et al. (2020), and therefore a replication is important. Moreover, we have extended these findings across young and older adults (previous work was based on older adults alone). We have modified the text to clarify which parts of the current study are a replication and which are extensions or novel (Line 14, 345 and 467). Thanks for the suggestion.

      3) It is difficult for me to comprehend the U-shaped account of baseline GABA and shift in deltaSDBOLD. If deltaSDBOLD per se is good, as evidenced by the positive relationship between brainscore and visual sensitivity as shown in Fig. 5b and the discussion in lines 432-440, why the brain should decrease deltaSDBOLD ?? or did I miss something? I understand that "average is good, outliers are bad". But a more detailed theory is needed to account for such effects.

      When GABA levels are increased beyond optimal levels, neuronal firing rates are reduced, effectively dampening neural activity and limiting dynamic range; in the present study, this resulted in reduced ∆⃗ SDBOLD. Thus, the observed drug-related decrease in ∆⃗ SDBOLD was most present in participants with already high levels of GABA. We have now added an explanation for the expected inverted-U (Line 523-546). The following figure tries to explain this with a hypothetical curve diagram and how different parts of Fig 4 might be linked to different points in such a curve.

      Author response image 1.

      Line 523-546 – “We found in humans that the drug-related shift in ∆⃗ SDBOLD could be either positive or negative, while being negatively related to baseline GABA. Thus, boosting GABA activity with drug during visual processing in participants with lower baseline GABA levels and low levels of ∆⃗ SDBOLD resulted in an increase in ∆⃗ SDBOLD (i.e., a positive change in ∆⃗ SDBOLD on drug compared to off drug). However, in participants with higher baseline GABA levels and higher ∆⃗ SDBOLD, when GABA was increased presumably beyond optimal levels, participants experienced no change or even a decrease in ∆⃗ SDBOLD on drug. These findings thus provide the first evidence in humans for an inverted-U account of how GABA may link to variability modulation.

      Boosting low GABA levels in older adults helps increase ∆⃗ SDBOLD, but why does increasing GABA levels lead to reduced ∆⃗ SDBOLD in others? One explanation is that higher than optimal levels of inhibition in a neuronal system can lead to dampening of the entire network. The reduced neuronal firing decreases the number of states the network can visit and decreases the dynamic range of the network. Indeed, some anesthetics work by increasing GABA activity (for example, propofol, a general anesthetic, modulates activity at GABAA receptors) and GABA is known for its sedative properties. Previous research showed that propofol leads to a steeper power spectral slope (a measure of the “construction” of signal variance) in monkey ECoG recordings (Gao et al., 2017). Networks function optimally only when dynamics are stabilized by sufficient inhibition. Thus, there is an inverted-U relationship between ∆⃗ SDBOLD and GABA that is similar to that observed with other neurotransmitters.”
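
      The inverted-U account can be made concrete with a toy curve. This is a hypothetical sketch and not a model fitted in the study: the quadratic form, the optimum, the curvature and the size of the drug-related GABA boost are all assumed values. With a curve of the form peak − k·(g − g_opt)², a fixed boost d changes the output by −k·(2d·(g − g_opt) + d²), which decreases linearly with baseline g. That is the qualitative pattern described above: a positive shift at low baseline GABA and a negative shift at high baseline GABA.

```python
def modulation(gaba, optimum=1.0, curvature=1.0, peak=1.0):
    """Hypothetical inverted-U: variability modulation as a function of GABA."""
    return peak - curvature * (gaba - optimum) ** 2


def drug_shift(baseline, boost=0.2, **curve):
    """Change in modulation when a drug adds `boost` to baseline GABA."""
    return modulation(baseline + boost, **curve) - modulation(baseline, **curve)


# Assumed baseline GABA levels, in arbitrary units, from low to high.
baselines = [0.5, 0.8, 1.1, 1.4]
shifts = [drug_shift(b) for b in baselines]

for b, s in zip(baselines, shifts):
    print(f"baseline={b:.1f}  shift={s:+.2f}")
```

      Under these assumptions the shift changes sign near the optimum and falls monotonically as baseline GABA rises, so a negative relationship between baseline GABA and the drug-related shift follows directly from the curve shape.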

      4) Related to the 3rd question, can you show the relationship between the shift of deltaSDBOLD (i.e., the delta of deltaSDBOLD) and visual performance?

      We did not have data on visual performance from the same participants that completed the drug-based part of the study (Subset 1 vs. 3; see Figure 1); therefore, we unfortunately cannot directly investigate the relationship between the drug-related shift of ∆⃗ SDBOLD and visual performance. We have now highlighted this as a limitation of the current study (Line 589-592), where we state: One limitation of the current study is that participants who received the drug-manipulation did not complete the visual discrimination task, thus we could not directly assess how the drug-related change in ∆⃗ SDBOLD impacted visual performance.

      5) Is the dataset openly available? I didn't find a data availability statement.

      An Excel sheet with all the processed data needed to reproduce the figures and results has been included in the source data submitted with the manuscript, together with a data dictionary for the various columns. The raw MRI, MRS and fMRI data used in the current manuscript were collected as part of a larger (MIND) study and will eventually be made publicly available on completion of the study (around 2027). Before that time, the raw data can be obtained for research purposes upon reasonable request. Processing code will be made available on GitHub.

    1. Author Response

      Reviewer #1 (Public Review):

      The manuscript by Lujan and colleagues describes a series of cellular phenotypes associated with the depletion of TANGO2, a gene product that is poorly characterized but relevant to neurological and muscular disorders. The authors report that TANGO2 associates with membrane-bound organelles, mainly mitochondria, impacting lipid metabolism and the accumulation of reactive oxygen species. Based on these observations the authors speculate that TANGO2 functions in acyl-CoA metabolism.

      The observations are generally convincing and most of the conclusions appear logical. While the function of TANGO2 remains unclear, the finding that it interferes with lipid metabolism is novel and important. This observation was not developed to a great extent and based on the data presented, the link between TANGO2 and acyl-CoA, as proposed by the authors, appears rather speculative.

      We thank you for your advice and now include additional data that lends support to the role of TANGO2 in lipid metabolism. We have changed the title accordingly.

      1) The data with overexpressed TANGO2 looks convincing but I wonder if the authors analyzed the localization of endogenous TANGO2 by immunofluorescence using the antibody described in Figure S2. The idea that TANGO2 localizes to membrane contact sites between mitochondria and the ER and LDs would also be strengthened by experiments including multiple organelle markers.

      We agree that most of the data on TANGO2 localization are based on the overexpression of the protein. As suggested by the reviewer, and despite the lack of commercial antibodies for immunofluorescence-based evaluation (see the following chart), we tested the commercial antibody described in Figure 2 on HepG2 and U2OS cells. Moreover, we used Förster resonance energy transfer (FRET) technology to analyze the proximity of TANGO2 and Tom20, a specific outer mitochondrial membrane protein. In addition, we visualized cells expressing tagged TANGO2 and tagged VAP-B, an integral ER protein in the mitochondria-associated membranes (doi:10.1093/hmg/ddr559), or tagged TANGO2 and tagged GPAT4-Hairpin, an integral LD protein (doi:10.1016/j.devcel.2013.01.013). These data strengthen our proposal and are presented in the revised manuscript.

      As suggested by the reviewer, we have also visualized two additional cell lines (HepG2 and U2OS) with the anti-TANGO2 antibody (from Novus Biologicals) that has been used for western blot (see chart above). As shown in the following figure, the commercial antibody shows a lot of staining in addition to mitochondria, especially in U2OS cells, where it also appears to label the nucleus.

      2) The changes in LD size in TANGO2-depleted cells are very interesting and consistent with the role of TANGO2 in lipid metabolism. From the lipidomics analysis, it seems that the relative levels of the main neutral lipids in TANGO2-depleted cells remain unaltered (TAG) or even decrease (CE). Therefore, it would be interesting to explore further the increase in LD size for example analyze/display the absolute levels of neutral lipids in the various conditions.

      We agree with the reviewer and now present the absolute levels of lipids of interest in the various conditions of the lipidomics analyses (Figure S3).

      3) Most of the lipidomics changes in TANGO2-depleted cells are observed in lipid species present in very low amounts while the relative abundance of major phospholipids (PC, PE, PI) remains mostly unchanged. It would be good to also display the absolute levels of the various lipids analyzed. This is an important point to clarify as it would be unlikely that these major phospholipids are unaffected by an overall defect in Acyl-CoA metabolism, as proposed by the authors.

      As stated above, we have now included the absolute levels of lipids of interest in the various conditions of the lipidomics analyses (Figure S3).
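
      Why absolute levels matter alongside relative abundances can be shown with a toy calculation. The lipid amounts below are invented for illustration and are not values from the lipidomics data; the function name `mol_percent` and the units are assumptions. If every lipid class changes by the same factor, the relative composition is identical even though the absolute amounts differ.

```python
def mol_percent(amounts):
    """Convert absolute lipid amounts into percentages of the total."""
    total = sum(amounts.values())
    return {name: 100.0 * value / total for name, value in amounts.items()}


# Made-up absolute amounts (arbitrary units, e.g. nmol per mg protein).
control = {"PC": 50.0, "PE": 30.0, "TAG": 20.0}

# Hypothetical condition in which every class doubles in absolute terms.
depleted = {name: 2.0 * value for name, value in control.items()}

print(mol_percent(control))
print(mol_percent(depleted))
```

      The two printed compositions match, so a relative-abundance view alone would report no change in TAG even though its absolute amount doubled in this toy case.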

    1. Author Response

      Reviewer #2 (Public Review):

      I have only one concern with the study. I am not fully convinced that the disruption of behavioral updating is specifically due to NA signaling within OFC. In the first two studies, they observed a non-specific anatomical effect, likely due to the ablation of fibers of passage through OFC. The DREADD experiment is claimed to allay this concern. However, the DCZ was injected systemically. This means that any collaterals of LC NA neurons outside OFC will also be suppressed. While the lack of effect with the mPFC projection is interesting, this does not preclude an effect mediated in other target regions. Overall, I believe that none of the experiments truly demonstrate a specific effect of NA in OFC. A few experimental options that could be considered are injection of DCZ directly in OFC, optogenetic inhibition of fibers in OFC, or pharmacological disruption of NA signaling in OFC.

      The other options are to measure the effect of the toxin ablations from experiments 1 and 2 not just in mPFC but in other regions. If the non-specific effect is truly only in mPFC outside of OFC, that would lead to more confidence that mPFC projection is the only other viable pathway mediating the effect.

      As requested, we have quantified the effect of toxin ablations in neighbouring cortical regions known to be involved in goal-directed behavior, namely the insular cortex (IC; e.g., Balleine & Dickinson, 2000; Parkes & Balleine, 2013), the medial orbitofrontal cortex (MO; e.g., Bradfield et al., 2015; Gourley et al., 2016) and the secondary motor cortex (M2; Gremel et al., 2016). Briefly, we found that injection of the saporin toxin in the VO and LO (Experiment 1) led to a significant decrease in NA fiber density in all examined regions. Injection of 6-OHDA also produced significant loss of NA fibres in MO and M2 but not in the insular cortex. These results are presented in Suppl. Figures 1 and 3 (pages 28 and 30) and the statistics are reported in the main text (page 6 and page 11).

      We have also added the following to our discussion on the reason for the off-target depletions that we observed and acknowledged the potential role of collateral LC neurons:

      Page 21, line starting 374: “The use of the saporin toxin led to a dramatic decrease of NA fiber density in all analysed cortical areas (Suppl Fig 1). This may be due to diffusion of the toxin from the injection site, the existence of collateral LC neurons and/or fibers passing through the ventral portion of the OFC but targeting other cortical areas (Cerpa et al 2019). However, injection of 6-OHDA led to much less offsite NA depletion, suggesting that a large part of the previous observation is toxin-specific. Indeed, no significant loss of NA fibers was visible in the insular cortex, which has been previously implicated in goal-directed behaviour (Balleine & Dickinson, 2000; Parkes et al., 2013; 2015; 2017). We did nevertheless observe an offsite depletion in more proximal prefrontal areas (prelimbic and medial orbitofrontal cortices), albeit a more modest depletion than what was observed using the saporin toxin. Several studies have described the projection pattern of LC cells. These studies, using various techniques, indicate that LC cells mainly target a single region, and that only a small proportion of LC neurons collateralize to minor targets (Plummer et al., 2020, Kebschull et al 2016, Uematsu et al 2017, Chandler et al 2014). Therefore, even if the OFC noradrenergic innervation is presumably specific (Chandler et al 2013), we cannot rule out a possible collateralization of some neurons toward neighbouring prefrontal areas (PL and MO). We have previously discussed that the posterior ventral portion of the OFC is an entry point for LC fibers en passant, which ultimately target other prefrontal areas (Cerpa et al 2019).

      To achieve a greater anatomical selectivity we used a CAV-2 vector carrying the noradrenergic promoter PRS to target either the LC:A32 or the LC:OFC pathways (Hayat et al., 2020; Hirschberg et al., 2017). It has been shown that the CAV-2 vector can infect axons-of-passage, however the vector does not spread more than 200 µm from the injection site (Schwarz et al 2015). Therefore, when targeting the OFC we injected anteriorly to the level where the highest density of fibers of passage is expected (Cerpa et al 2019) in order to minimize infection of such fibers and restrict inhibition to our pathway of interest.

      Overall, the current behavioural results are in line with our previous work showing that the ability to associate new outcomes to previously acquired actions is impaired following chemogenetic inhibition of the VO and LO (Parkes et al., 2018) or disconnection of the VO and LO from the submedius thalamic nucleus (Fresno et al 2019). These results point to a necessary role of the ventral and lateral parts of the OFC and its noradrenergic innervation for updating A-O associations. However, it is worth mentioning that different subregions of the OFC, both along the medio-lateral and antero-posterior axes of OFC, display clear functional heterogeneities (Dalton et al 2016, Izquierdo 2017, Panayi & Killcross, 2018, Bradfield et al 2018, Barreiros et al 2021). Therefore, while we have previously focused on the anatomical heterogeneity of the noradrenergic innervation in these prefrontal subregions (Cerpa et al 2019), a thorough characterization of its functional role in each of these subregions still needs to be addressed.”

      One last concern is that the lack of the effect due to disruption of the mPFC projection is not guaranteed to not be from experimental issues. If the authors have some evidence that the mPFC projection disruption produced some other behavioral effect, that would make the lack of effect in this case more convincing.

      Unfortunately, we do not provide evidence in the current paper that disrupting the LC:mPFC (now termed LC:A32 in the current study, based on the recommendation of reviewer 1) projection produces some other behavioural effect. However, in an on-going series of experiments, using the same tools as the current study, we found that inhibiting the LC:A32, but not LC:OFC, pathway impairs Pavlovian contingency degradation as shown in the figure below. We therefore believe that the failure of LC:mPFC pathway inhibition to affect outcome identity reversal in the present study is not due to experimental issues. Please note that in the figure below mPFC is referred to as area 32 (A32), as requested by reviewer 1.

      Figure 1. A) Experimental timeline for the Pavlovian contingency degradation procedure. Prior to behavioural training, rats were injected with CAV2-PRS-hM4D-mCherry into either the vlOFC or area 32 (A32). Number of food port entries during the non-degraded CS and degraded CS for rats injected with vehicle and rats injected with DCZ during degradation training (B, D) and the test in extinction (C, E). Inhibition of the LC:vlOFC had no effect on Pavlovian contingency degradation, whereas inhibition of LC:A32 during degradation training rendered rats insensitive to the change in the causal relationship between the CS and the US.

      Reviewer #3 (Public Review):

      I would be curious about the authors' thoughts regarding the recent Duan ... Robbins Neuron paper (https://pubmed.ncbi.nlm.nih.gov/34171290/), in which marmosets displayed paradoxical responses to VLO inactivation and stimulation in contingency degradation tasks. Are there ways to reconcile these reports?

      We previously argued that the updating processes underlying changes in causal contingency versus outcome identity may be supported by different prefrontal regions (Cerpa et al., 2021, Behav Neurosci). Unfortunately, the tasks used in the current study do not allow us to test if our rats are sensitive to changes in the action-outcome contingency. In fact, the effect of inactivation (or overactivation) of the ventral and lateral regions of OFC on an instrumental contingency degradation task similar to that used in Duan et al (2022) has not yet been examined in rats.

      Indeed, while it is stated in Duan et al (2022) that rats with lesions of lateral OFC are insensitive to contingency degradation, none of the citations provided support this conclusion (Balleine & Dickinson, 1998; Corbit & Balleine, 2003; Ostlund and Balleine, 2007; Yin et al., 2005). Balleine and Dickinson (1998) assessed the effect of prelimbic and insular cortex lesions (insular anteroposterior coordinate +1.2), with only the former affecting instrumental contingency degradation. Ostlund and Balleine (2007) assessed the effect of orbitofrontal lesions on Pavlovian contingency degradation (degradation of the S-O contingency) not instrumental contingency degradation. Finally, Corbit and Balleine (2003) and Yin et al (2005) assessed the effect of prelimbic and dorsomedial striatum lesions, respectively. Nevertheless, there are some reports on the effect of chemogenetic inhibition of VO/LO on degradation in a nose-poke response task but the results are conflicting (e.g., Whyte et al., 2019; Zimmerman et al., 2017; 2018). It would be very interesting to study the impact of both inactivation and overactivation of VO and LO in rats to compare with the results found in marmosets, using comparable tasks.

      We have added the following to our discussion, which cites Duan et al (2022) and the need to better understand the role of VO and LO in contingency degradation.

      Page 24, line starting 450: “However, it is not yet clear if the NA-OFC system is also involved in detecting the causal relationship between an action and its outcome (see Cerpa et al., 2021 for a discussion). Some have reported impaired adaptation to contingency changes following inhibition of VO and LO or BDNF-knockdown in these regions (Whyte et al., 2019; Zimmerman et al., 2017), while another study shows that inhibition of VO/LO leaves sensitivity to degradation intact, at least during an initial test (Zimmerman et al., 2018). Interestingly, a recent paper in marmosets demonstrates that inactivation of anterior OFC (area 11) improves instrumental contingency degradation, whereas overactivation impairs degradation (Duan et al., 2022). The potential role of the rodent ventral and lateral regions of OFC, and the NA innervation of OFC, in adapting to degradation of instrumental contingencies requires further investigation.”

    1. Author Response:

      Reviewer #1 (Public Review):

      There is growing precedent for the utility of GWAS-type analyses in elucidating otherwise cryptic genotypic associations with specific Mtb phenotypes, most commonly drug resistance. This study represents the latest instalment of this type of approach, utilizing a large set of WGS data from clinical Mtb isolates and refining the search for DR-associated alleles by restricting the set to those predicted (or known) to be phenotypically DR. This revealed a number of potential candidate mutations, including some in nucleotide excision repair (uvrA, uvrB), in base excision repair (mutY), and homologous recombination (recF). In validating these leads with functional assays, the authors present evidence supporting the impact of the identified mutations on antibiotic susceptibility in vitro and in macrophage and animal infection models. These results extend the number of candidate mutations associated with Mtb drug resistance; however, the following must be considered:

      (i) The GWAS analysis is the basis of this study, yet the description of the approach used and presentation of results obtained is occasionally obscure; for example, the authors report the use of known drug resistance phenotypes (where available) or inferences of drug-resistance from genotypic data to enhance the potential to identify other mutations that might be implicated in enabling the DR mutations, yet their list of known DR mutations seem to be predominantly rare or unusual mutations, not those commonly associated with clinical DR-TB. In addition, the distribution of the identified resistance-associated mutations across the different lineages need to be explained more clearly.

      In the revised manuscript, we have performed a phylogenetic analysis of the strains used. A phylogenetic tree was generated using Mycobacterium canettii as an outgroup (Figure 1b). The phylogenetic analysis suggests that the strains cluster into lineages 1, 2, 3, and 4. Lineages 2, 3, and 4 cluster together, and lineage 1 is monophyletic, as reported previously. The genome sequence data of 2773 clinical strains were downloaded from NCBI. These strains were also part of the GWAS analyses performed by Coll et al. (https://pubmed.ncbi.nlm.nih.gov/29358649/) and Manson et al. (https://pubmed.ncbi.nlm.nih.gov/28092681/). The phenotypes of the strains used for the association analysis were reported in the previous studies. We have not performed other predictions. The supplementary tables provide the lineage origin of each strain used in the study (Supplementary File 1 & 2). The distributions of resistance-associated mutations in different strains are shown (Figure 2-figure supplement 6a-h). As suggested, we have performed an analysis wherein we looked for the direct target mutations in strains that harbor mutations in the DNA repair genes (Figure 2-figure supplement 6i-k).

      We identified mostly rare mutations for the following reasons:

      1. We looked for the mutations that were present only in the multidrug resistant strains as compared to the susceptible strains for association mapping. This strategy exclusively gave most variants associated with multidrug resistant phenotype.

      2. We have used Mixed Linear Model (MLM) for association analysis. MLM removes all the population-specific SNPs based on PCA and kinship corrections. The false discovery rate (FDR) adjusted p-values in the GAPIT software are stringent as it corrects the effects of each marker based on the population structure (Q) as well as kinship (K) values. Therefore the probability of identifying the false-positive SNP is very low. We combined it with the Bonferroni corrections to identify markers associated with the drug resistant phenotype.
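      The two multiple-testing corrections named above can be illustrated with a minimal sketch. This is not GAPIT's implementation (GAPIT is an R package, and the mixed-model step with the Q and K terms is not reproduced here); it only shows, for a hypothetical list of per-marker p-values, how a Bonferroni threshold and Benjamini-Hochberg FDR-adjusted p-values are commonly computed. The function names and the example p-values are assumptions made for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag markers whose p-value passes the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg FDR-adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-marker p-values (not taken from the study).
p = [0.001, 0.008, 0.039, 0.041, 0.20]
passes_bonferroni = bonferroni(p)
fdr_adjusted = benjamini_hochberg(p)
```

      Requiring a marker to pass both criteria, as described in the response, is the stricter of the two options, since Bonferroni controls the family-wise error rate rather than the false discovery rate.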

      (ii) By combining target gene deletions with different complementation alleles, the authors provide compelling microbiological evidence supporting the inferred role of the mutY and uvrB mutations in enhanced survival under antibiotic treatment. The experimental work, however, is limited to assessments of competitive survival in various models, with/without antibiotic selection, or to mutant frequency analyses; there is no direct evidence provided in support of the proposed mechanism.

      To ascertain if the better survival of the RvDmutY, or RvDmutY::mutY-R262Q, is indeed due to the acquisition of mutations in the direct target of antibiotics, we performed WGS of the strain from the ex vivo evolution experiment (Figure 5). Genomic DNA extracted from ten independent colonies (grown in vitro) was mixed in equal proportions before library preparation. Only those SNPs present in >20% of reads were retained for the analysis. Analysis of Rv sequences grown in vitro suggested that the laboratory strain has accumulated 100 SNPs compared with the reference strain. The sequence of the Rv laboratory strain was used as the reference for the subsequent analysis. WGS data for RvDmutY, RvDmutY::mutY, and RvDmutY::mutY-R262Q strains grown in vitro did not show the presence of a mutation in the antibiotic target genes. In a similar vein, ten independent colonies, each from the 7H11-OADC plates, after the final round of ex vivo selection in the presence or absence of antibiotics, were selected for WGS. Data indicated that in the absence of antibiotics, no direct target mutations were identified in the ex vivo passaged strains (Figure 6a & e). In the presence of isoniazid, we found mutations in katG (Ser315Thr or Ser315Ile) in Rv and RvDmutY but not in RvDmutY::mutY and RvDmutY::mutY-R262Q (Figure 6b & e). These findings are in congruence with the ex vivo evolution CFU analysis, wherein we did not observe a significant increase in the survival of RvDmutY and RvDmutY::mutY-R262Q in the presence of isoniazid (Figure 5). In the presence of ciprofloxacin and rifampicin, direct target mutations were identified in gyrA and rpoB (Figure 6c-e). Asp94Glu/Asp94Gly mutations were identified in gyrA, and His445Tyr/Ser450Leu mutations were identified in rpoB of RvDmutY and RvDmutY::mutY-R262Q, respectively. No direct target mutations were identified in Rv and RvDmutY::mutY, suggesting that the perturbed DNA repair aids in acquiring the drug resistance-conferring mutations in Mtb (Figure 6c-e & Supplementary File 8).
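      The read-fraction filter described above (retain only SNPs supported by more than 20% of reads) can be sketched as follows. The record layout, the field names (`alt_reads`, `depth`) and the site labels are hypothetical, chosen for illustration; the response does not state which variant caller or file format was used.

```python
def filter_snps(snps, min_fraction=0.20):
    """Keep SNP calls whose alternate-allele read fraction exceeds min_fraction."""
    kept = []
    for snp in snps:
        if snp["depth"] == 0:
            # No coverage at this site, so no fraction can be computed.
            continue
        fraction = snp["alt_reads"] / snp["depth"]
        if fraction > min_fraction:
            kept.append(snp)
    return kept


# Hypothetical calls from a pooled sample (not data from the study).
calls = [
    {"site": "site_a", "alt_reads": 30, "depth": 100},  # 30% of reads, retained
    {"site": "site_b", "alt_reads": 20, "depth": 100},  # exactly 20%, not retained
    {"site": "site_c", "alt_reads": 5, "depth": 50},    # 10% of reads, not retained
    {"site": "site_d", "alt_reads": 0, "depth": 0},     # no coverage, skipped
]
retained_sites = [snp["site"] for snp in filter_snps(calls)]
```

      The comparison is strict (`>`), matching the ">20% of reads" wording; a pipeline that uses "at least 20%" would need `>=` instead.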

      To determine if the better survival of the RvDmutY, or RvDmutY::mutY-R262Q, in the guinea pig infection experiment (Figure 8) is due to the accumulation of mutations in the host, we performed WGS of the strain isolated from guinea pig lungs. Analysis revealed that specific genes such as cobQ1, smc, espI, and valS were mutated only in RvDmutY and RvDmutY::mutY-R262Q but not in Rv and RvDmutY::mutY. Besides, tcrA and gatA were mutated only in RvDmutY, whereas rv0746 was mutated exclusively in RvDmutY::mutY (Figure 8-figure supplement 2). However, we did not observe any direct target mutations; this may be because the guinea pigs were not subjected to antibiotic treatment. The data suggest that continued long-term selection pressure is necessary for bacilli to acquire mutations.

      (iii) The low drug concentrations used (especially of rifampicin against M. smegmatis) suggest the identified mutations confer low-level resistance to multiple antimycobacterial agents - in turn implying tolerance rather than resistance. If correct, it would be interesting to know how broadly tolerant strains containing these mutations are; that is, whether susceptibility is decreased to a broad range of antibiotics with different mechanisms of action (including both cidal and static agents), and whether the extent of the decrease be determined quantitatively (for example, as change in MIC value).

      To evaluate the effect of different drugs on the survival of RvDmutY or RvDmutY::mutY-R262Q, we performed killing kinetics in the presence and absence of isoniazid, rifampicin, ciprofloxacin, and ethambutol (Figure 4a). In the absence of antibiotics, the growth kinetics of Rv, RvDmutY, RvDmutY::mutY, and RvDmutY::mutY-R262Q were similar (Figure 4b). In the presence of isoniazid, a ~2 log-fold decrease in bacterial survival was observed on day 3 in Rv and RvDmutY::mutY; however, in RvDmutY and RvDmutY::mutY-R262Q, the decrease was limited to ~1.5 log-fold (Figure 4c). A similar trend was apparent on days 6 and 9, suggesting a ~5-fold increase in the survival of RvDmutY and RvDmutY::mutY-R262Q compared with Rv and RvDmutY::mutY (Figure 4c). Interestingly, in the presence of ethambutol, we did not observe any significant difference (Figure 4d). In the presence of rifampicin and ciprofloxacin, we observed a ~10-fold increase in the survival of RvDmutY and RvDmutY::mutY-R262Q compared with Rv and RvDmutY::mutY (Figure 4e-f). Thus, the results suggest that the absence of mutY or the presence of the mutY variant aids in subverting the antibiotic stress.

      Reviewer #2 (Public Review):

      This interesting manuscript uses a collection of whole genome sequences of TB isolates to associate specific sequence polymorphisms with MDR/XDR strains, and having found certain mutations in DNA repair pathways, does a detailed analysis of several mutations. The evaluation of the MutY polymorphism reveals it is loss of function and TB strains carrying this mutation have a higher mutation frequency and enhanced survival in serial passage in macrophages. The strengths of the manuscript are the leveraging of a large sequence dataset to derive interesting candidate mutations in DNA repair pathway and the demonstration that at least one of these mutations has a detectable effect on mutagenicity and pathogenesis. The weaknesses of the manuscript are a lack of experimental exploration of the mechanism by which loss of a DNA repair pathway would enhance survival in vivo. The model presented is that these phenotypes are due to hypermutagenicity and thereby evolution of enhanced pathogenesis, but this is not actually directly tested or investigated. There are also some technical concerns for some of the experimental data which can be strengthened.

      This paper presents the following data:

      • Analyzed whole-genome sequences of 2773 clinical strains: 160 000 SNPs identified
      • 1815 drug-susceptible/422 MDR/XDR strains: 188 mutations correlated with Drug resistance.
      • Novel mutations associated with the drug resistance have been found in base excision repair (BER), nucleotide excision repair (NER), and homologous recombination (HR) pathway genes (mutY, uvrA, uvrB, and recF).
      • Specific mutations mutY-R262Q and uvrB-A524V were studied.
      • mutY-R262Q and uvrB-A524V mutations behave as loss of function alleles in vivo, as measured by non-complementation of the increased mutation frequency measured by resistance to Rif and INH.
      • The mutY deletion and the mutY-R262Q mutation increase Mtb survival over WT in macrophages when Mtb has not been submitted to previous rounds of macrophage infection.
      • This advantage is exacerbated in presence of antibiotic (Rif and Cipro but not INH).
      • The MutY deletion and the MutY-R262Q mutation result in an enhanced survival of Mtb during guinea pig infection.

      Major issues:

      The finding that mutations in MutY confers an advantage during macrophage infection is convincing based on the macrophage experiments, but it is premature to conclude that the mechanism of this effect is due to hypermutagenesis and selection of fitter bacterial clones. It is described in E. coli (Foti et al., 2012) and recently in mycobacteria (Dupuy et al., 2020) that the MutY/MutM excision pathways can increase the lethality of antibiotic treatment because of double-strand breaks caused by Adenine/oxoG excisions. The higher survival of the mutY mutant during antibiotic treatment could more be due to lower Adenine/oxoG excision in the mutant rather than acquisition of advantageous mutations, or some other mechanism. The same hypothesis cannot be excluded for the Guinea pig experiments (no antibiotics, but oxidative stress mediated by host defenses could also increase oxoG) and should at least be discussed. Experiments that would support the idea that the in vivo advantage is due to hypermutagenesis would be whole genome sequencing of the output vs input populations to directly document increased mutagenesis. Similarly, is the ΔmutY survival advantage after rounds of macrophage infections dependent on macrophage environment? What happens if the ΔmutY strain is cultivated in vitro in 7H9 (same number of generations) before infecting macrophages?

      We thank the reviewer for the insightful comments. To ascertain if the better survival of the RvDmutY, or RvDmutY::mutY-R262Q, is indeed due to the acquisition of mutations in the direct target of antibiotics, we performed WGS of the strain from the ex vivo evolution experiment (Figure 5). Genomic DNA extracted from ten independent colonies (grown in vitro) was mixed in equal proportions prior to library preparation. For the analysis, only those SNPs that were present in >20% of reads were retained. Analysis of Rv sequences grown in vitro suggested that the laboratory strain has accumulated 100 SNPs compared with the reference strain. The sequence of the Rv laboratory strain was used as the reference for the subsequent analysis. WGS data for RvDmutY, RvDmutY::mutY, and RvDmutY::mutY-R262Q strains grown in vitro did not show the presence of a mutation in the antibiotic target genes. In a similar vein, ten independent colonies, each from the 7H11-OADC plates, after the final round of ex vivo selection in the presence or absence of antibiotics, were selected for WGS. Data indicated that in the absence of antibiotics, no direct target mutations were identified in the ex vivo passaged strains (Figure 6a & e). In the presence of isoniazid, we found mutations in katG (Ser315Thr or Ser315Ile) in Rv and RvDmutY but not in RvDmutY::mutY and RvDmutY::mutY-R262Q (Figure 6b & e). These findings are in congruence with the ex vivo evolution CFU analysis, wherein we did not observe a significant increase in the survival of RvDmutY and RvDmutY::mutY-R262Q in the presence of isoniazid (Figure 5). In the presence of ciprofloxacin and rifampicin, direct target mutations were identified in gyrA and rpoB (Figure 6c-e). Asp94Glu/Asp94Gly mutations were identified in gyrA, and His445Tyr/Ser450Leu mutations were identified in rpoB of RvDmutY and RvDmutY::mutY-R262Q, respectively. No direct target mutations were identified in Rv and RvDmutY::mutY, suggesting that the perturbed DNA repair aids in acquiring the drug resistance-conferring mutations in Mtb (Figure 6c-e & Supplementary File 8).

      To determine if the better survival of the RvDmutY, or RvDmutY::mutY-R262Q, in the guinea pig infection experiment (Figure 8) is due to the accumulation of mutations in the host, we performed WGS of the strain isolated from guinea pig lungs. Analysis revealed that specific genes such as cobQ1, smc, espI, and valS were mutated only in RvDmutY and RvDmutY::mutY-R262Q but not in Rv and RvDmutY::mutY. Besides, tcrA and gatA were mutated only in RvDmutY, whereas rv0746 was mutated exclusively in RvDmutY::mutY (Figure 8-figure supplement 2). However, we did not observe any direct target mutations; this may be because the guinea pigs were not subjected to antibiotic treatment. The data suggest that continued long-term selection pressure is necessary for bacilli to acquire mutations.

      • It would be useful to present more data about the strain relatedness and genome characteristics of the DNA repair mutant strains in the GWAS. For example, the model would suggest that strains carrying DNA repair mutations should have higher SNP load than control strains. Additionally, it would be helpful to know whether the identified DNA repair pathway mutations are from epidemiologically linked strains in the collection to deduce whether these events are arising repeatedly or are a founder effect of a single mutant since for each mutation, the number of strains is small.

      We analyzed the genomes of the clinical strains that possess DNA repair gene mutations to determine the additional polymorphisms. The number of SNPs in the strains harboring DNA repair mutations and in the drug-susceptible strains appears to be similar. The marginal differences, if any, were not statistically significant.

      We agree with the reviewer that these strains might be epidemiologically linked. In the present study, all the strains harboring a mutation in mutY belong to lineage 4. We observed that all the mutY mutation-containing strains were either MDR or pre-XDR compared with drug-susceptible strains of the same clade.

      • Some of the mutation frequency, survival and competition data could be strengthened by more experimental replicates. Data Lines 370-372 (mutation frequency), lines 387-388 (Survival of strains ex vivo), line 394 (competition experiment) : "Two biologically independent experiments were performed. Each experiment was performed in technical triplicates. Data represent one of the two biological experiments." Two biological replicates is insufficient for the phenotypes presented and all replicates should be included in the analysis. In addition, the definition of "technical triplicates" should be given, does this mean the same culture sampled in triplicate?

      We thank the reviewer for the comment. We performed at least two independent experiments with biological triplicates (not technical triplicates). We apologize for writing this incorrectly. We have reported data from one independent experiment consisting of at least biological triplicates. For mutation rate analysis, we performed the experiment using six independent colonies. These points are mentioned in the methods and legends of the revised manuscript.

      • MutY phenotypes. One caveat to the conclusion that the MutY R262Q mutant is nonfunctional is the lack of examination of the expression of the complementing protein. It would be informative to comment on the location of this mutation in relation to the known structures of MutY proteins. Similarly, for the UvrB polymorphism, this null strain has a clear UV sensitivity phenotype in the literature, so a fuller interrogation for UV killing would be informative re: the A524V mutation.

      We have now included the western blot data on both complementation strains (Figure 3-figure supplement 1). We agree with the reviewer that the uvrB null mutant may have a UV sensitivity phenotype, but we have not performed the experiment in the present study.

      Reviewer #3 (Public Review):

      STRENGTHS

      • This ambitious study is broad in scope, beginning with a bacterial GWAS study and extending all the way to in vivo guinea pig infection models.

      • Numerous reports have attempted to identify Mtb strains with elevated mutation rates, and the results are conflicting. The present study sets out to thoroughly evaluate one such mutation that may produce a mutator phenotype, mutY-Arg262Gln.

      WEAKNESSES

      • While the authors' follow-up experiments with the mutY-Arg262Gln allele are all consistent with the conclusion that this mutation elevates the mutation rate in Mtb and thus could promote the evolution of drug resistance, further work is needed to unambiguously demonstrate this link.

      • The authors highlight five mutations in genes associated with DNA replication and/or repair from their GWAS analysis:

      o dnaA-Arg233Gln: as the authors note in the Discussion, Hicks et al. associate SNPs in dnaA with low-level isoniazid resistance, as a result of lowered katG expression. Since this is unrelated to their focus on DNA repair genes whose mutation could elevate mutation rates, I would consider removing this allele from the Table.

      As suggested, we have removed the dnaA from Table 3.

      o mutY-Arg262Gln: querying publicly available whole genome sequences of clinical Mtb isolates, this SNP appears to be restricted to lineage 4.3 (L4.3). All of these L4.3 strains appear to be drug-resistant. How many times did the mutY-Arg262Gln mutation evolve in the authors' dataset? If there is evidence of homoplastic evolution, this would strengthen their case. If not, it doesn't mean the authors' findings are incorrect, but it does elevate the risk that this mutation could be a passenger (i.e. not driver) mutation. To address this, the authors could attempt to date when the mutY-Arg262Gln arose. If it was before the evolution of drug-resistance conferring alleles in these L4.3 strains, that is consistent with (but not proof of) a driver mutation. If mutY-Arg262Gln arose after, this is much more consistent with a passenger mutation.

      As pointed out by the reviewer, the mutY-Arg262Gln mutation is restricted to lineage 4. We have checked the mutY gene sequence from the strains harboring mutY Arg262Gln mutation and sensitive strains of the same clade. We identified only the reported mutation in the drug-resistant strains, and there was no synonymous mutation that could be used for performing molecular clock analysis. To ascertain whether it is a passenger or a driver mutation, we have performed multiple experiments that suggest that identified mutation aids in the acquisition of drug resistance.

      o uvrB-Ala524Val: curiously we don't see this SNP in our dataset of publicly available whole genome sequences of clinical Mtb isolates (~45,000 genomes).

      We have rechecked this SNP in our dataset. This SNP was present in 87 drug-resistant strains that belong to lineage 2.

      o uvrA-Gln135Lys: this SNP also appears to be restricted to lineage 4.3. Same question as for mutY-Arg262Gln.

      As pointed out by the reviewer, the uvrA-Gln135Lys mutation is restricted to lineage 4. We identified only the reported mutation in the drug-resistant strains, and there was no synonymous mutation that could be used for performing molecular clock analysis.

      o recF-Gly269Gly: this is a very common mutation, is it unique to lineage 2.2.1? Same question as for mutY-Arg262Gln.

      The recF-Gly269Gly mutation was present in the lineage 2 strains. Here also, we identified only the reported mutation in the drug-resistant strains, and there was no synonymous mutation that could be used for performing molecular clock analysis.

      • The CRYPTIC consortium recently published a number of preprints on biorxiv detailing very large GWAS studies in Mtb. Did any of these reports also associate drug resistance with mutY? If yes, this should be stated. If not, the potential reasons for this discrepancy should be discussed.

      We have checked the recently published CRYPTIC consortium article (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001721#sec012) for mutY-Arg262Gln. We did not find the mutY-Arg262Gln mutation in their analysis; this is likely due to the different strains used in the study. However, we identified the recF-Gly269Gly mutation in their dataset.

      • Based on the authors' follow-up studies in vivo, MutY-Arg262Gln is presumed to be a loss-of-function allele. If the authors could convincingly demonstrate this biochemically with recombinant proteins, this would significantly strengthen their case.

      Experiments performed in Msm and Mtb mutant strains suggest that the MutY variant is a loss-of-function allele. We have not performed in vitro assays to confirm the same.

      • If the authors are correct and mutY-Arg262Gln strains have elevated mutation rates, presumably there would be evidence of this in the clinical strain sequencing data. Do mutY-Arg262Gln containing strains have elevated C→G or C→A mutations in their genomes? Presumably such strains would also have a higher number of SNPs than closely related strains WT for mutY- is this the case?

      We analyzed the genomes of the clinical strains that possess DNA repair gene mutations to determine the additional polymorphisms. The number of SNPs in the strains harboring DNA repair mutations appears to be higher than in the drug-susceptible strains. We have also looked for the C→T and C→G mutations in the same strains. C→T mutations are higher in the strains harboring the mutY variant compared with the susceptible strains (Figure 2-figure supplement 6l). However, we could not perform statistical analysis as the number of strains that harbor the mutY variant is limited to 8. Thus, the data suggest that, empirically, the strains harboring the mutY variant show more SNPs elsewhere and more C→T mutations. We are not stating these conclusions strongly in the manuscript as the data are not statistically significant.
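      Tallying substitution types, as in the comparison of C→T and C→G counts above, can be sketched as follows. This is only an illustration under assumptions: the input is a hypothetical list of (reference base, alternate base) pairs, and each change is folded onto the pyrimidine reference base, so that a G→A call on one strand is counted together with C→T. The response does not state whether the authors folded strands in this way.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_spectrum(snps):
    """Count substitution classes, folding each call onto a C or T reference base."""
    counts = Counter()
    for ref, alt in snps:
        if ref in ("A", "G"):
            # Express a purine-reference change as its complementary-strand equivalent.
            ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        counts[f"{ref}>{alt}"] += 1
    return counts


# Hypothetical SNP calls (not data from the study).
example_snps = [("C", "T"), ("G", "A"), ("C", "G"), ("G", "C"),
                ("T", "C"), ("A", "G"), ("C", "A")]
spectrum = substitution_spectrum(example_snps)
```

      Comparing such counts between strain groups would still need a per-strain normalisation and a test suited to small group sizes, which the response notes was not possible with only 8 strains carrying the mutY variant.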

      • While more work, mutation rates as measured by Luria-Delbruck fluctuation analysis are more accurate than mutation frequencies. I would recommend repeating key experiments by Luria-Delbruck fluctuation analysis. It is also important to report both drug-resistant colony counts and total CFU in these sorts of experiments. Given the clumpy nature of mycobacteria, mutation rates can appear to be artificially elevated due to low total CFU and not an increase in the number of drug-resistant colonies.

      As suggested, we determined the mutation rate in the presence of isoniazid, rifampicin, and ciprofloxacin (Figure 3g-j). The fold increase in the mutation rate relative to Rv for RvDmutY, RvDmutY::mutY, and RvDmutY::mutY-R262Q was 2.90, 0.76, and 3.0 in the presence of isoniazid; 5.62, 1.13, and 5.10 in the presence of rifampicin; and 9.14, 1.57, and 8.71 in the presence of ciprofloxacin, respectively (Figure 3).
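      For readers unfamiliar with fluctuation analysis, the following minimal sketch shows one classical way to turn per-culture resistant-colony counts into a mutation rate and a fold change relative to a reference strain. It uses the p0 method of Luria and Delbrück (m = -ln p0, rate = m / N), which is an assumption made for illustration; the response does not state which estimator was used. The colony counts and the total CFU per culture are hypothetical.

```python
import math


def p0_mutation_rate(resistant_counts, total_cfu_per_culture):
    """Estimate the mutation rate per cell from the fraction of cultures with no mutants."""
    n_cultures = len(resistant_counts)
    n_zero = sum(1 for count in resistant_counts if count == 0)
    if n_zero == 0 or n_zero == n_cultures:
        raise ValueError("p0 method needs some, but not all, cultures without mutants")
    # Expected number of mutational events per culture, from the Poisson zero class.
    m = -math.log(n_zero / n_cultures)
    return m / total_cfu_per_culture


def fold_change(rate, reference_rate):
    """Fold increase of a strain's mutation rate relative to the reference strain."""
    return rate / reference_rate


# Hypothetical resistant-colony counts from six parallel cultures per strain.
reference_counts = [0, 0, 0, 1, 2, 0]
mutant_counts = [0, 1, 3, 0, 2, 5]
total_cfu = 1e8  # hypothetical total viable cells per culture

reference_rate = p0_mutation_rate(reference_counts, total_cfu)
mutant_rate = p0_mutation_rate(mutant_counts, total_cfu)
fold = fold_change(mutant_rate, reference_rate)
```

      Because the rate is divided by total CFU, the estimate depends directly on the plating counts that the reviewer asks to be reported alongside the resistant-colony counts.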

      • Figure 4 would appear to be measuring drug tolerance, not resistance? Are the elevated CFU in the presence of drugs in the mutY-Arg262Gln strain due to an increase in the number of drug resistant strains or drug sensitive strains? This could be assessed by quantifying the resulting CFU in the presence or absence of the indicated drugs.

      To ascertain whether the better survival is due to the acquisition of mutations in the direct targets of the antibiotics or to drug tolerance, we performed WGS of the strains from the ex vivo evolution experiment (Figure 5). Genomic DNA extracted from ten independent colonies (grown in vitro) was mixed in equal proportion prior to library preparation. Only those SNPs present in >20% of reads were retained for the analysis. Analysis of the sequences of Rv grown in vitro suggested that the laboratory strain has accumulated 100 SNPs compared with the reference strain. The sequence of the Rv laboratory strain was therefore used as the reference for the subsequent analysis. WGS data for the RvΔmutY, RvΔmutY::mutY, and RvΔmutY::mutY-R262Q strains grown in vitro did not show any mutations in the antibiotic target genes. Similarly, ten independent colonies from each of the 7H11-OADC plates, after the final round of ex vivo selection in the presence or absence of antibiotics, were selected for WGS. The data indicated that, in the absence of antibiotics, no direct target mutations were present in the ex vivo passaged strains (Figure 6a & e). In the presence of isoniazid, we found mutations in katG (Ser315Thr or Ser315Ile) in Rv and RvΔmutY but not in RvΔmutY::mutY and RvΔmutY::mutY-R262Q (Figure 6b & e). These findings are congruent with the ex vivo evolution CFU analysis, wherein we did not observe a significant increase in the survival of RvΔmutY and RvΔmutY::mutY-R262Q in the presence of isoniazid (Figure 5). In the presence of ciprofloxacin and rifampicin, direct target mutations were identified in gyrA and rpoB (Figure 6c-e). Asp94Glu/Asp94Gly mutations were identified in gyrA, and His445Tyr/Ser450Leu mutations were identified in rpoB, of RvΔmutY and RvΔmutY::mutY-R262Q, respectively. No direct target mutations were identified in Rv and RvΔmutY::mutY, suggesting that perturbed DNA repair aids in acquiring drug resistance-conferring mutations in Mtb (Figure 6c-e & Supplementary File 8).
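
      The read-fraction filter described above (retaining only SNPs present in >20% of reads) can be sketched as follows. This is an illustration only: the `Variant` record, the `retain_variants` helper, and the read counts are hypothetical and do not reproduce the actual variant-calling pipeline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record; field names are assumptions."""
    gene: str
    change: str
    alt_reads: int
    total_reads: int

    @property
    def allele_fraction(self) -> float:
        # Fraction of reads supporting the variant; 0.0 if the site is uncovered.
        return self.alt_reads / self.total_reads if self.total_reads else 0.0

def retain_variants(variants, min_fraction=0.20):
    """Keep variants supported by strictly more than min_fraction of reads."""
    return [v for v in variants if v.allele_fraction > min_fraction]

# Illustrative read counts, not data from the study.
calls = [
    Variant("katG", "Ser315Thr", 62, 100),  # 62% of reads: retained
    Variant("gyrA", "Asp94Gly", 20, 100),   # exactly 20%: dropped by the strict ">" cutoff
    Variant("rpoB", "Ser450Leu", 9, 90),    # 10%: dropped
]

kept = retain_variants(calls)
print([f"{v.gene} {v.change}" for v in kept])
```

      Assuming clonal colonies and even coverage across a pool of ten colonies mixed in equal proportion, a variant would need to be present in at least three of the ten colonies to exceed a strict 20% cutoff.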

      To determine whether the better survival of RvΔmutY or RvΔmutY::mutY-R262Q in the guinea pig infection experiment (Figure 8) is due to the accumulation of mutations in the host, we performed WGS of the strains isolated from guinea pig lungs. The analysis revealed that specific genes such as cobQ1, smc, espI, and valS were mutated only in RvΔmutY and RvΔmutY::mutY-R262Q but not in Rv and RvΔmutY::mutY. In addition, tcrA and gatA were mutated only in RvΔmutY, whereas rv0746 was mutated exclusively in RvΔmutY::mutY (Figure 2-figure supplement 6). However, we did not observe any direct target mutations; this may be because the guinea pigs were not subjected to antibiotic treatment. The data suggest that continued long-term selection pressure is necessary for bacilli to acquire mutations.

    1. Author Response

      Reviewer #3 (Public Review):

      In this ms Li et al. examine the molecular interaction of Rabphilin 3A with the SNARE complex protein SNAP25 and its potential impact in SNARE complex assembly and dense core vesicle fusion.

      Overall, the literature on rabphilin as a major rab3/27 effector in synaptic function has been quite enigmatic. After its cloning and initial biochemical analysis, rather little new has been found about rabphilin, in particular since loss-of-function analysis has shown rather few synaptic phenotypes (Schluter 1999, Deak 2006), arguing against a crucial role for rabphilin in synaptic function.

      While the interaction of rabphilin with SNAP25 via the bottom part of its C2 domain has already been described biochemically and structurally in Deak et al. 2006 and elsewhere, the authors make significant efforts to further map the interactions between SNAP25 and rabphilin, and indeed identify additional binding motifs in the first 10 amino acids of SNAP25 that appear critical for the rabphilin interaction.

      Using KD-rescue experiments for SNAP25, TIRF-based imaging analysis of labeled dense core vesicles showed that the N-terminus of SN25 is absolutely essential for SV membrane proximity and release. Similar, somewhat weaker phenotypes were observed when binding-deficient rabphilin mutants were overexpressed in PC12 cells coexpressing WT rabphilin. The loss-of-function phenotypes in the SN25 and rabphilin interaction mutants led the authors to claim that rabphilin-SN25 interactions are critical for docking and exocytosis. The role of these interaction sites was subsequently tested in SNARE assembly assays, which were largely supportive of rabphilin accelerating SNARE assembly in an SN25 N-terminal-dependent way.

      Regarding the impact of this work, the transition of synaptic vesicles to form fusion-competent trans-SNARE complexes is critical to our understanding of regulated vesicle exocytosis, and the authors put forward an attractive model in which rabphilin aids in catalyzing SNARE complex assembly by controlling the α-helicity of the SNAP25 SNARE motif. This would provide a regulatory mechanism similar to that put forward for the other two SNARE proteins via their interactions with Munc18 and intersectin, respectively.

      We thank Reviewer #3 for the summary of the paper and for the praise of our work. Our point-by-point replies are as follows:

      While the discovery of the novel interaction site of rabphilin with the N-terminus of SNAP25 is interesting, I have issues with the functional experiments. The key question for the paper is whether it provides convincing data on the functional role of the interactions, given the history of loss-of-function phenotypes for rabphilin. First, the authors use PC12 cells and dense core vesicle docking and fusion assays. Primary neurons, where rabphilin function has been tested before, have unfortunately not been utilized, reducing the impact of the docking and fusion phenotypes.

      We have discussed these questions as mentioned in our response to Essential Revisions 3 and added this corresponding passage to the Discussion section (pp.18-19, lines 407-427).

      In particular, the loss-of-function phenotype in Figure 3 of the N-terminally deleted SNAP25 in docking and fusion is profound, and at a level similar to the complete loss of the SNARE protein itself. This is of concern, as it is in stark contrast to the phenotypes in mammalian neurons, where SNAP25 loss is very severe while rabphilin loss has almost no effect on secretion. This would argue that the N-terminus of SNAP25 has other critical functions besides interacting with rabphilin. In addition, it could argue that the N-terminal SNAP25 deletion mutant may be made in the cell (as indicated by the western blot) but may not be properly trafficked to the site of release.

      To test whether the N-peptide deletion mutant of SN25 can properly target to the plasma membrane, we overexpressed SN25 FL or SN25 (11–206) with a C-terminal EGFP tag in PC12 cells and monitored the localization of SN25 FL-EGFP and SN25 (11–206)-EGFP near the plasma membrane by TIRF microscopy. We observed that the average fluorescence intensity of SN25 (11–206)-EGFP did not differ significantly from that of SN25 FL-EGFP (see below), suggesting that deleting the N-peptide may not affect the trafficking of SN25 to the plasma membrane.

      (A) TIRF imaging assay to monitor the localization of SN25-EGFP near the plasma membrane. Overexpression of SN25 FL-EGFP (left) and SN25 (11–206)-EGFP (right) using pEGFP-N3 vector in PC12 cells. Scale bars, 10 μm. (B) Quantification of the average fluorescence intensity of SN25-EGFP near the plasma membrane in (A). Data are presented as mean ± SEM (n ≥ 10 cells in each). Statistical significance and P values were determined by Student’s t-test. ns, not significant.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors present a PyTorch-based simulator for prosthetic vision. The model takes in the anatomical location of a visual cortical prosthesis as well as a series of electrical stimuli to be applied to each electrode, and outputs the resulting phosphenes. To demonstrate the usefulness of the simulator, the paper reproduces psychometric curves from the literature and uses the simulator in the loop to learn optimized stimuli.

      One of the major strengths of the paper is its modeling work - the authors make good use of existing knowledge about retinotopic maps and psychometric curves that describe phosphene appearance in response to single-electrode stimulation. Using PyTorch as a backbone is another strength, as it allows for GPU integration and seamless integration with common deep learning models. This work is likely to be impactful for the field of sight restoration.

      1) However, one of the major weaknesses of the paper is its model validation - while some results seem to be presented for data the model was fit on (as opposed to held-out test data), other results lack quantitative metrics and a comparison to a baseline ("null hypothesis") model. On the one hand, it appears that the data presented in Figs. 3-5 was used to fit some of the open parameters of the model, as mentioned in Subsection G of the Methods. Hence it is misleading to present these as model "predictions", which are typically presented for held-out test data to demonstrate a model's ability to generalize. Instead, this is more of a descriptive model than a predictive one, and its ability to generalize to new patients remains yet to be demonstrated.

      We agree that the original presentation of the model fits might cause confusion. In the revision, we have adapted the fit of the thresholding mechanism to include a 3-fold cross-validation, in which part of the data was excluded during fitting and used as a test set to calculate the model’s performance. The results of the cross-validation are now presented in panel D of Figure 3. Fitting the brightness and temporal dynamics parameters using cross-validation was not feasible due to the limited amount of quantitative data describing temporal dynamics and phosphene size and brightness for intracortical electrodes. To avoid confusion, we have adapted the corresponding text and figure captions to specify that we are using a fit as a description of the data.
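
      The idea of a 3-fold cross-validation of a threshold fit can be sketched in a few lines. This is not the simulator's fitting code: the logistic psychometric function, its fixed slope, the grid search, and the data points are all assumptions chosen to keep the example self-contained.

```python
import math

def predict(charge_nc, threshold_nc, slope=0.2):
    """Hypothetical psychometric function: P(phosphene) vs. charge per phase."""
    return 1.0 / (1.0 + math.exp(-slope * (charge_nc - threshold_nc)))

def fit_threshold(points, candidates):
    """Grid search for the threshold minimizing squared error on `points`."""
    def sse(threshold):
        return sum((predict(c, threshold) - p) ** 2 for c, p in points)
    return min(candidates, key=sse)

def k_fold_indices(n, k):
    """Interleaved folds: fold i holds indices i, i+k, i+2k, ..."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(points, k=3):
    """Return the held-out mean squared error for each of the k folds."""
    candidates = [float(t) for t in range(0, 101)]
    errors = []
    for held_out in k_fold_indices(len(points), k):
        held = set(held_out)
        train = [p for i, p in enumerate(points) if i not in held]
        test = [points[i] for i in held_out]
        threshold = fit_threshold(train, candidates)  # fitted without the test fold
        mse = sum((predict(c, threshold) - p) ** 2 for c, p in test) / len(test)
        errors.append(mse)
    return errors

# Illustrative (charge per phase in nC, detection probability) pairs, not study data.
data = [(10, 0.0), (20, 0.0), (30, 0.05), (40, 0.2), (50, 0.5),
        (60, 0.8), (70, 0.95), (80, 1.0), (90, 1.0)]

fold_errors = cross_validate(data, k=3)
print([round(e, 4) for e in fold_errors])
```

      The key property is that each error is computed on data the fit never saw, which is what separates a prediction from a description of the data.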

      We note that the goal of the simulator is not to provide a single set of parameters that describes precise phosphene perception for all patients but that it could also be used to capture variability among patients. Indeed, the model can be tailored to new patients based on a small data set. Figure 3-figure supplement 1 exemplifies how our simulator can be tailored to several data sets collected from patients with surface electrodes. Future clinical experiments might be used to verify how well the simulator can be tailored to the data of other patients.

      Specifically, we have made the following changes to the manuscript:

      • Caption Figure 2: the fitted peak brightness levels reproduced by our model

      • Caption Figure 3: The model's probability of phosphene perception is visualized as a function of charge per phase

      • Caption Figure 3: Predicted probabilities in panel (d) are the results of a 3-fold cross-validation on held-out test data.

      • Line 250: we included biologically inspired methods to model the perceptual effects of different stimulation parameters

      • Line 271: Each frame, the simulator maps electrical stimulation parameters (stimulation current, pulse width and frequency) to an estimated phosphene perception

      • Lines 335-336: such that 95% of the Gaussian falls within the fitted phosphene size.

      • Lines 469-470: Figure 4 displays the simulator's fit on the temporal dynamics found in a previously published study by Schmidt et al. (1996).

      • Lines 922-925: Notably, the trade-off between model complexity and accurate psychophysical fits or predictions is a recurrent theme in the validation of the components implemented in our simulator.
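
      The statement that 95% of the Gaussian falls within the fitted phosphene size can be made concrete with a short sketch. The response does not say whether the 95% refers to a one-dimensional profile or to the mass of a two-dimensional isotropic Gaussian, so both conventions are shown; the function names and the 0.5 degree size are hypothetical.

```python
import math
from statistics import NormalDist

def sigma_1d(diameter, coverage=0.95):
    """Sigma such that `coverage` of a 1D Gaussian lies within +/- diameter/2."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)  # about 1.96 for 95%
    return (diameter / 2.0) / z

def sigma_2d(diameter, coverage=0.95):
    """Sigma such that `coverage` of an isotropic 2D Gaussian lies within radius diameter/2.

    The mass within radius r is 1 - exp(-r^2 / (2 sigma^2)).
    """
    k = math.sqrt(-2.0 * math.log(1.0 - coverage))  # about 2.45 for 95%
    return (diameter / 2.0) / k

def mass_within_radius_2d(radius, sigma):
    """Fraction of an isotropic 2D Gaussian inside a disc of the given radius."""
    return 1.0 - math.exp(-radius ** 2 / (2.0 * sigma ** 2))

phosphene_size_deg = 0.5  # hypothetical phosphene diameter in degrees of visual angle
s1 = sigma_1d(phosphene_size_deg)
s2 = sigma_2d(phosphene_size_deg)
print(f"sigma (1D convention): {s1:.4f} deg")
print(f"sigma (2D convention): {s2:.4f} deg")
```

      The two conventions differ by roughly 25%, so the choice is worth stating explicitly when reporting phosphene sizes.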

      2) On the other hand, the results presented in Fig. 8 as part of the end-to-end learning process are not accompanied by any sort of quantitative metrics or comparison to a baseline model.

      We now realize that the presentation of the end-to-end results might have given the impression that we present novel image processing strategies. However, the development of a novel image processing strategy is outside the scope of the study. Instead, the study aims to provide an improved simulation that can be used for a more realistic assessment of different stimulation protocols. The simulator needs to fit experimental data, and it should run fast (so it can be used in behavioral experiments). Importantly, as demonstrated in our end-to-end experiments, the model can be used in differentiable programming pipelines (so it can be used in computational optimization experiments), which is a valuable contribution in itself because it lends itself to many machine learning approaches that can improve the realism of the simulation.

      We have rephrased our study aims in the discussion to improve clarity.

      • Lines 275-279: In the sections below, we discuss the different components of the simulator model, followed by a description of some showcase experiments that assess the ability to fit recent clinical data and the practical usability of our simulator in simulation experiments.

      • Lines 810-814: Computational optimization approaches can also aid in the development of safe stimulation protocols, because they allow a faster exploration of the large parameter space and enable task-driven optimization of image processing strategies (Granley et al., 2022; Fauvel et al., 2022; White et al., 2019; Küçükoglü et al. 2022; de Ruyter van Steveninck et al., 2022; Ghaffari et al., 2021).

      • Lines 814-819: Ultimately, the development of task-relevant scene-processing algorithms will likely benefit both from computational optimization experiments as well as exploratory SPV studies with human observers. With the presented simulator we aim to contribute a flexible toolkit for such experiments.

      • Lines 842-853: Eventually, the functional quality of the artificial vision will not only depend on the correspondence between the visual environment and the phosphene encoding, but also on the implant recipient's ability to extract that information into a usable percept. The functional quality of end-to-end generated phosphene encodings in daily life tasks will need to be evaluated in future experiments. Regardless of the implementation, it will always be important to include human observers (both sighted experimental subjects and actual prosthetic implant users) in the optimization cycle to ensure subjective interpretability for the end user (Fauvel et al., 2022; Beyeler & Sanchez-Garcia, 2022).

      3) The results seem to assume that all phosphenes are small Gaussian blobs, and that these phosphenes combine linearly when multiple electrodes are stimulated. Both assumptions are frequently challenged by the field. For all these reasons, it is challenging to assess the potential and practical utility of this approach as well as get a sense of its limitations.

      The reviewer raises a valid point and a similar point was raised by a different reviewer (our response is duplicated). As pointed out in the discussion, many aspects of multi-electrode phosphene perception are still unclear. On the one hand, the literature is in agreement that there is some degree of predictability: some papers explicitly state that phosphenes produced by multiple patterns are generally additive (Dobelle & Mladejovsky, 1974), that the locations are predictable (Bosking et al., 2018) and that multi-electrode stimulation can be used to generate complex, interpretable patterns of phosphenes (Chen et al., 2020, Fernández et al., 2021). On the other hand, however, in some cases, the stimulation of multiple electrodes is reported to lead to brighter phosphenes (Fernández et al., 2021), fused or displaced phosphenes (Schmidt et al., 1996, Bak et al., 1990) or unpredicted phosphene patterns (Fernández et al., 2021). It is likely that the probability of these interference patterns decreases when the distance between the stimulated electrodes increases. An empirical finding is that the critical distance for intracortical stimulation is approximately 1 mm (Ghose & Maunsell, 2012).

      We note that our simulator is not restricted to the simulation of linearly combined Gaussian blobs. Some irregularities, such as elongated phosphene shapes were already supported in the previous version of our software. Furthermore, we added a supplementary figure that displays a possible approach to simulate some of the more complex electrode interactions that are reported in the literature, with only minor adaptations to the code. Our study thereby aims to present a flexible simulation toolkit that can be adapted to the needs of the user.

      Adjustments:

      • Added Figure 1-figure supplement 3 on irregular phosphene percepts.

      • Lines 957-970: Furthermore, in contrast to the assumptions of our model, interactions between simultaneous stimulation of multiple electrodes can have an effect on the phosphene size and sometimes lead to unexpected percepts (Fernández et al., 2021, Dobelle & Mladejovsky 1974, Bak et al., 1990). Although our software supports basic exploratory experimentation of non-linear interactions (see Figure 1-figure supplement 3), by default, our simulator assumes independence between electrodes. Multi-phosphene percepts are modeled using linear summation of the independent percepts. These assumptions seem to hold for intracortical electrodes separated by more than 1 mm (Ghose & Maunsell, 2012), but may underestimate the complexities observed when electrodes are nearer. Further clinical and theoretical modeling work could help to improve our understanding of these non-linear dynamics.
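
      The default assumption described above, linear summation of independent Gaussian percepts, can be illustrated with a minimal sketch. It is written in plain Python rather than with the simulator's own PyTorch code; the `render` function, the image size, and the phosphene parameters are hypothetical.

```python
import math

def gaussian_blob(x, y, cx, cy, sigma, brightness):
    """Brightness contribution of one phosphene at pixel (x, y)."""
    d2 = (x - cx) ** 2 + (y - cy) ** 2
    return brightness * math.exp(-d2 / (2.0 * sigma ** 2))

def render(phosphenes, size=16):
    """Sum independent Gaussian phosphenes into one image (rows of floats).

    Each phosphene is a tuple (cx, cy, sigma, brightness). Under the
    independence assumption the percepts simply add, with no interaction term.
    """
    image = [[0.0] * size for _ in range(size)]
    for cx, cy, sigma, brightness in phosphenes:
        for row in range(size):
            for col in range(size):
                image[row][col] += gaussian_blob(col, row, cx, cy, sigma, brightness)
    return image

# Two hypothetical phosphenes in pixel coordinates.
a = (5.0, 5.0, 1.5, 1.0)
b = (9.0, 6.0, 2.0, 0.6)

both = render([a, b])
only_a = render([a])
only_b = render([b])

# Linearity check: the combined image equals the pixelwise sum of the parts.
max_diff = max(
    abs(both[r][c] - (only_a[r][c] + only_b[r][c]))
    for r in range(16) for c in range(16)
)
print(f"largest deviation from linear summation: {max_diff:.2e}")
```

      A non-linear interaction (for example fusion or brightening of nearby phosphenes) would show up here as a systematic deviation from the pixelwise sum.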

      4) Another weakness of the paper is the term "biologically plausible", which appears throughout the manuscript but is not clearly defined. In its current form, it is not clear what makes this simulator "biologically plausible" - it certainly contains a retinotopic map and is fit on psychophysical data, but it does not seem to contain any other "biological" detail.

      We thank the reviewer for the remark. We improved our description of what makes the simulator “biologically plausible” in the introduction (line 78): “Biological plausibility, in our work's context, points to the simulation's ability to capture essential biological features of the visual system in a manner consistent with empirical findings: our simulator integrates quantitative findings and models from the literature on cortical stimulation in V1 [...]”. In addition, we mention in the discussion (lines 611-621): “The aim of this study is to present a biologically plausible phosphene simulator, which takes realistic ranges of stimulation parameters, and generates a phenomenologically accurate representation of phosphene vision using differentiable functions. In order to achieve this, we have modeled and incorporated an extensive body of work regarding the psychophysics of phosphene perception. From the results presented in section H, we observe that our simulator is able to produce phosphene percepts that match the descriptions of phosphene vision that were gathered in basic and clinical visual neuroprosthetics studies over the past decades.”

      5) In fact, for the most part the paper seems to ignore the fact that implanting a prosthesis in one cerebral hemisphere will produce phosphenes that are restricted to one half of the visual field. Yet Figures 6 and 8 present phosphenes that seemingly appear in both hemifields. I do not find this very "biologically plausible".

      We agree with the reviewer that contemporary experiments with implantable electrodes usually test electrodes in a single hemisphere. However, future clinically useful approaches should use bilaterally implanted electrode arrays. Our simulator can present phosphene locations in either one or both hemifields.

      We have made the following textual changes:

      • Fig. 1 caption: Example renderings after initializing the simulator with four 10 × 10 electrode arrays (indicated with roman numerals) placed in the right hemisphere (electrode spacing: 4 mm, in correspondence with the commonly used 'Utah array' (Maynard et al., 1997)).

      • Lines 518-525: The simulator is initialized with 1000 possible phosphenes in both hemifields, covering a field of view of 16 degrees of visual angle. Note that the simulated electrode density and placement differs from current prototype implants and the simulation can be considered to be an ambitious scenario from a surgical point of view, given the folding of the visual cortex and the part of the retinotopic map in V1 that is buried in the calcarine sulcus.

      • Lines 546-547: with the same phosphene coverage as the previously described experiment

      Reviewer #2 (Public Review):

      Van der Grinten and De Ruyter van Steveninck et al. present a design for simulating cortical-visual-prosthesis phosphenes that emphasizes features important for optimizing the use of such prostheses. The characteristics of simulated individual phosphenes were shown to agree well with data published from the use of cortical visual prostheses in humans. By ensuring that functions used to generate the simulations were differentiable, the authors permitted and demonstrated integration of the simulations into deep-learning algorithms. In concept, such algorithms could thereby identify parameters for translating images or videos into stimulation sequences that would be most effective for artificial vision. There are, however, limitations to the simulation that will limit its applicability to current prostheses.

      The verification of how phosphenes are simulated for individual electrodes is very compelling. Visual-prosthesis simulations often do ignore the physiologic foundation underlying the generation of phosphenes. The authors' simulation takes into account how stimulation parameters contribute to phosphene appearance and show how that relationship can fit data from actual implanted volunteers. This provides an excellent foundation for determining optimal stimulation parameters with reasonable confidence in how parameter selections will affect individual-electrode phosphenes.

      We thank the reviewer for these supportive comments.

      Issues with the applicability and reliability of the simulation are detailed below:

      1) The utility of this simulation design, as described, unfortunately breaks down beyond the scope of individual electrodes. To model the simultaneous activation of multiple electrodes, the authors' design linearly adds individual-electrode phosphenes together. This produces relatively clean collections of dots that one could think of as pixels in a crude digital display. Modeling phosphenes in such a way assumes that each electrode and the network it activates operate independently of other electrodes and their neuronal targets. Unfortunately, as the authors acknowledge and as noted in the studies they used to fit and verify individual-electrode phosphene characteristics, simultaneous stimulation of multiple electrodes often obscures features of individual-electrode phosphenes and can produce unexpected phosphene patterns. This simulation does not reflect these nonlinearities in how electrode activations combine. Nonlinearities in electrode combinations can be as subtle as the phosphenes becoming brighter while still remaining distinct, or as problematic as generating only a single small phosphene that is indistinguishable from the activation of a subset of the electrodes activated, or that of a single electrode.

      If a visual prosthesis happens to generate some phosphenes that can be elicited independently, a simulator of this type could perhaps be used by processing stimulation from independent groups of electrodes and adding their phosphenes together in the visual field.

      The reviewer raises a valid point and a similar point was raised by a different reviewer (our response is duplicated). As pointed out in the discussion, many aspects of multi-electrode phosphene perception are still unclear. On the one hand, the literature is in agreement that there is some degree of predictability: some papers explicitly state that phosphenes produced by multiple patterns are generally additive (Dobelle & Mladejovsky, 1974), that the locations are predictable (Bosking et al., 2018) and that multi-electrode stimulation can be used to generate complex, interpretable patterns of phosphenes (Chen et al., 2020, Fernández et al., 2021). On the other hand, however, in some cases, the stimulation of multiple electrodes is reported to lead to brighter phosphenes (Fernández et al., 2021), fused or displaced phosphenes (Schmidt et al., 1996, Bak et al., 1990) or unpredicted phosphene patterns (Fernández et al., 2021). It is likely that the probability of these interference patterns decreases when the distance between the stimulated electrodes increases. An empirical finding is that the critical distance for intracortical stimulation is approximately 1 mm (Ghose & Maunsell, 2012).

      We note that our simulator is not restricted to the simulation of linearly combined Gaussian blobs. Some irregularities, such as elongated phosphene shapes were already supported in the previous version of our software. Furthermore, we added a supplementary figure that displays a possible approach to simulate some of the more complex electrode interactions that are reported in the literature, with only minor adaptations to the code. Our study thereby aims to present a flexible simulation toolkit that can be adapted to the needs of the user.

      Adjustments:

      • Lines 957-970: Furthermore, in contrast to the assumptions of our model, interactions between simultaneous stimulation of multiple electrodes can have an effect on the phosphene size and sometimes lead to unexpected percepts (Fernández et al., 2021, Dobelle & Mladejovsky 1974, Bak et al., 1990). Although our software supports basic exploratory experimentation of non-linear interactions (see Figure 1-figure supplement 3), by default, our simulator assumes independence between electrodes. Multi-phosphene percepts are modeled using linear summation of the independent percepts. These assumptions seem to hold for intracortical electrodes separated by more than 1 mm (Ghose & Maunsell, 2012), but may underestimate the complexities observed when electrodes are nearer. Further clinical and theoretical modeling work could help to improve our understanding of these non-linear dynamics.

      • Added Figure 1-figure supplement 3 on irregular phosphene percepts.
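
      One practical consequence of the 1 mm critical distance is that an electrode layout can be screened for pairs where the independence assumption may not hold. The following is a minimal sketch under that reading; the `close_pairs` helper and the electrode coordinates are hypothetical and are not part of the simulator.

```python
import math
from itertools import combinations

def close_pairs(electrodes_mm, critical_distance_mm=1.0):
    """Return index pairs of electrodes separated by at most the critical distance.

    Independence is reported to hold for separations of more than about 1 mm,
    so pairs at or below that distance are flagged for closer inspection.
    """
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(electrodes_mm), 2):
        if math.dist(a, b) <= critical_distance_mm:
            flagged.append((i, j))
    return flagged

# Hypothetical cortical-surface coordinates in millimetres.
layout = [(0.0, 0.0), (0.4, 0.0), (2.0, 0.0), (2.0, 0.9)]

print(close_pairs(layout))
```

      Flagged pairs could then be stimulated sequentially rather than simultaneously, or handled with an explicit interaction model, depending on the experiment.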

      2) Verification of how the simulation renders individual phosphenes based on stimulation parameters is an important step in confirming agreement between the simulation and the function of implanted devices. That verification was well demonstrated. The end use of a visual-prosthesis simulation, however, would likely not be optimizing just the appearance of phosphenes, but predicting and optimizing functional performance in visual tasks. Investigating whether this simulator can suggest visual-task performance, either with sighted volunteers or a decoder model, that is similar to published task performance from visual-prosthesis implantees would be a necessary step for true validation.

      We agree with the reviewer that it will be vital to investigate the utility of the simulator in tasks. However, the literature on the performance of users of a cortical prosthesis in visually-guided tasks is scarce, making it difficult to compare task performance between simulated versus real prosthetic vision.

      Secondly, the main objective of the current study is to propose a simulator that emulates the sensory / perceptual experience, i.e. the low-level perceptual correspondence. Once more behavioral data from prosthetic users become available, studies can use the simulator to make these comparisons.

      Regarding the comparison to simulated prosthetic vision in sighted volunteers, there are some fundamental limitations. For instance, sighted subjects are exposed for a shorter duration to the (simulated) artificial percept and lack the experience and training that prosthesis users get. Furthermore, sighted subjects may be unfamiliar with compensation strategies that blind individuals have developed. It will therefore be important to conduct clinical experiments.

      To convey more clearly that our experiments are performed to verify the practical usability in future behavioral experiments, we have incorporated the following textual adjustments:

      • Lines 275-279: In the sections below, we discuss the different components of the simulator model, followed by a description of some showcase experiments that assess the ability to fit recent clinical data and the practical usability of our simulator in simulation experiments.

      • Lines 842-853: Eventually, the functional quality of the artificial vision will not only depend on the correspondence between the visual environment and the phosphene encoding, but also on the implant recipient's ability to extract that information into a usable percept. The functional quality of end-to-end generated phosphene encodings in daily life tasks will need to be evaluated in future experiments. Regardless of the implementation, it will always be important to include human observers (both sighted experimental subjects and actual prosthetic implant users) in the optimization cycle to ensure subjective interpretability for the end user (Fauvel et al., 2022; Beyeler & Sanchez-Garcia, 2022).

      3) A feature of this simulation is being able to convert stimulation of V1 to phosphenes in the visual field. If used, this feature would likely only be able to simulate a subset of phosphenes generated by a prosthesis. Much of V1 is buried within the calcarine sulcus, and electrode placement within the calcarine sulcus is not currently feasible. As a result, stimulation of visual cortex typically involves combinations of the limited portions of V1 that lie outside the sulcus and higher visual areas, such as V2.

      We agree that some areas (most notably the calcarine sulcus) are difficult to access in a surgical implantation procedure. A realistic simulation of state-of-the-art cortical stimulation should only partially cover the visual field with phosphenes. However, it may be predicted that some of these challenges will be addressed by new technologies. We chose to make the simulator as generally applicable as possible and users of the simulator can decide which phosphene locations are simulated. To demonstrate that our simulator can be flexibly initialized to simulate specific implantation locations using third-party software, we have now added a supplementary figure (Figure 1-figure supplement 1) that displays a demonstration of an electrode grid placement on a 3D brain model, generating the phosphene locations from receptive field maps. However, the simulator is general and can also be used to guide future strategies that aim to e.g. cover the entire field with electrodes, compare performance between upper and lower hemifields etc.

      Reviewer #3 (Public Review):

      The authors are presenting a new simulation for artificial vision that incorporates many recent advances in our understanding of the neural response to electrical stimulation, specifically within the field of visual prosthetics. The authors succeed in integrating multiple results from other researchers on aspects of V1 response to electrical stimulation to create a system that more accurately models V1 activation in a visual prosthesis than other simulators. The authors then attempt to demonstrate the value of such a system by adding a decoding stage and using machine-learning techniques to optimize the system to various configurations.

      1) While there is merit to being able to apply various constraints (such as maximum current levels) and have the system attempt to find a solution that maximizes recoverable information, the interpretability of such encodings to a hypothetical recipient of such a system is not addressed. The authors demonstrate that they are able to recapitulate various standard encodings through this automated mechanism, but the advantages to using it as opposed to mechanisms that directly detect and encode, e.g., edges, are insufficiently justified.

      We thank the reviewer for this constructive remark. Our simulator is designed for more realistic assessment of different stimulation protocols in behavioral experiments or in computational optimization experiments. The presented end-to-end experiments are a demonstration of the practical usability of our simulator in computational experiments, building on a previously existing line of research. In fact, our simulator is compatible with any arbitrary encoding strategy.

      As our paper is focused on the development of a novel tool for this existing line of research, we do not aim to make claims about the functional quality of end-to-end encoders compared to alternative encoding methods (such as edge detection). That said, we agree with the reviewer that it is useful to discuss the benefits of end-to-end optimization compared to e.g. edge detection.

      We have incorporated several textual changes to give a more nuanced overview and to acknowledge that many benefits remain to be tested. Furthermore, we have restated our study aims more clearly in the discussion to clarify the distinction between the goals of the current paper and the various encoding strategies that remain to be tested.

      • Lines 275-279: In the sections below, we discuss the different components of the simulator model, followed by a description of some showcase experiments that assess the ability to fit recent clinical data and the practical usability of our simulator in simulation experiments

      • Lines 810-814: Computational optimization approaches can also aid in the development of safe stimulation protocols, because they allow a faster exploration of the large parameter space and enable task-driven optimization of image processing strategies (Granley et al., 2022; Fauvel et al., 2022; White et al., 2019; Küçükoglü et al. 2022; de Ruyter van Steveninck, Güçlü et al., 2022; Ghaffari et al., 2021).

      • Lines 842-853: Eventually, the functional quality of the artificial vision will not only depend on the correspondence between the visual environment and the phosphene encoding, but also on the implant recipient's ability to extract that information into a usable percept. The functional quality of end-to-end generated phosphene encodings in daily life tasks will need to be evaluated in future experiments. Regardless of the implementation, it will always be important to include human observers (both sighted experimental subjects and actual prosthetic implant users) in the optimization cycle to ensure subjective interpretability for the end user (Fauvel et al., 2022; Beyeler & Sanchez-Garcia, 2022).

      2) The authors make a few mistakes in their interpretation of biological mechanisms, and the introduction lacks appropriate depth of review of existing literature, giving the reader the mistaken impression that this simulator is the only attempt ever made at biologically plausible simulation, rather than merely the most recent refinement that builds on decades of work across the field.

      We thank the reviewer for this insight. We have improved the coverage of the previous literature to give credit where credit is due, and to address the long history of simulated phosphene vision.

      Textual changes:

      • Lines 64-70: Although the aforementioned SPV literature has provided us with major fundamental insights, the perceptual realism of electrically generated phosphenes and some aspects of the biological plausibility of the simulations can be further improved by integrating existing knowledge of phosphene vision and its underlying physiology.

      • Lines 164-190: The aforementioned studies used varying degrees of simplification of phosphene vision in their simulations. For instance, many included equally-sized phosphenes that were uniformly distributed over the visual field (informally referred to as the ‘scoreboard model’). Furthermore, most studies assumed either full control over phosphene brightness or used binary levels of brightness (e.g. 'on' / 'off'), but did not provide a description of the associated electrical stimulation parameters. Several studies have explicitly made steps towards more realistic phosphene simulations, by taking into account cortical magnification or using visuotopic maps (Fehervari et al., 2010; Li et al., 2013; Srivastava et al., 2009; Paraskevoudi et al., 2021), simulating noise and electrode dropout (Dagnelie et al., 2007), or using varying levels of brightness (Vergnieux et al., 2017; Sanchez-Garcia et al., 2022; Parikh et al., 2013). However, no phosphene simulations have modeled temporal dynamics or provided a description of the parameters used for electrical stimulation. Some recent studies developed descriptive models of the phosphene size or brightness as a function of the stimulation parameters (Winawer et al., 2016; Bosking et al., 2017). Another very recent study has developed a deep-learning based model for predicting a realistic phosphene percept for single stimulating electrodes (Granley et al., 2022). These studies have made important contributions to improve our understanding of the effects of different stimulation parameters. The present work builds on these previous insights to provide a full simulation model that can be used for the functional evaluation of cortical visual prosthetic systems.

      • Lines 137-140: Due to the cortical magnification (the foveal information is represented by a relatively large surface area in the visual cortex as a result of variation of retinal RF size) the size of the phosphene increases with its eccentricity (Winawer & Parvizi, 2016, Bosking et al., 2017).

      • Lines 883-893: Even after loss of vision, the brain integrates eye movements for the localization of visual stimuli (Reuschel et al., 2012), and in cortical prostheses the position of the artificially induced percept will shift along with eye movements (Brindley & Lewin, 1968, Schmidt et al., 1996). Therefore, in prostheses with a head-mounted camera, misalignment between the camera orientation and the pupillary axes can induce localization problems (Caspi et al., 2018; Paraskevoudi & Pezaris, 2019; Sabbah et al., 2014; Schmidt et al., 1996). Previous SPV studies have demonstrated that eye-tracking can be implemented to simulate the gaze-coupled perception of phosphenes (Cha et al., 1992; Sommerhalder et al., 2004; Dagnelie et al., 2006; McIntosh et al., 2013, Paraskevoudi & Pezaris, 2021; Rassia & Pezaris 2018, Titchener et al., 2018, Srivastava et al., 2009)
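      The eccentricity dependence of phosphene size quoted above (Lines 137-140) can be illustrated with a minimal sketch. It assumes an inverse-linear cortical magnification model, M(e) = k / (e + a) in mm of cortex per degree of visual field, and a fixed diameter of activated cortex; the constants `k`, `a` and the 1 mm activation diameter are illustrative assumptions, not values taken from the manuscript, and the function names are hypothetical.

```python
import numpy as np

def cortical_magnification(ecc_deg, k=17.3, a=0.75):
    """Magnification in mm/deg under an assumed inverse-linear model M(e) = k / (e + a)."""
    return k / (np.asarray(ecc_deg, dtype=float) + a)

def phosphene_size_deg(ecc_deg, activation_mm=1.0, k=17.3, a=0.75):
    """Approximate phosphene diameter (deg): activated cortex (mm) divided by M(e)."""
    return activation_mm / cortical_magnification(ecc_deg, k, a)

if __name__ == "__main__":
    for ecc in (0.0, 5.0, 20.0):
        print(f"eccentricity {ecc:5.1f} deg -> size {phosphene_size_deg(ecc):.3f} deg")
```

      Under these assumptions the same patch of activated cortex maps to a larger region of the visual field at higher eccentricity, which is the qualitative behavior the quoted passage describes.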

      3) The authors have importantly not included gaze position compensation which adds more complexity than the authors suggest it would, and also means the simulator lacks a basic, fundamental feature that strongly limits its utility.

      We agree with the reviewer that the inclusion of gaze position to simulate gaze-centered phosphene locations is an important requirement for a realistic simulation. We have made several textual adjustments to section M1 to improve the clarity of the explanation and we have added several references to address the simulation literature that took eye movements into account.

      In addition, we included a link to some demonstration videos in which we illustrate that the simulator can be used for gaze-centered phosphene simulation. The simulation models the phosphene locations based on the gaze direction, and updates the input with changes in the gaze direction. The stimulation pattern is chosen to encode the visual environment at the location where the gaze is directed. Gaze contingent processing has been implemented in prior simulation studies (for instance: Paraskevoudi et al., 2021; Rassia et al., 2018; Titchener et al., 2018) and even in the clinical setting with users of the Argus II implant (Caspi et al., 2018). From a modeling perspective, it is relatively straightforward to simulate gaze-centered phosphene locations and gaze contingent image processing (our code will be made publicly available). At the same time, however, seen from a clinical and hardware engineering perspective, the implementation of eye-tracking in a prosthetic system for blind individuals might come with additional complexities. This is now acknowledged explicitly in the manuscript.
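      As a rough illustration of the two ideas described above (gaze-centered phosphene locations and gaze-contingent image processing), the following sketch uses plain NumPy. It is not the simulator's actual API: the function names, the degree-based coordinate convention and the square crop are assumptions made for this example only.

```python
import numpy as np

def phosphenes_in_head_frame(retinotopic_xy_deg, gaze_xy_deg):
    """Shift gaze-centered phosphene centers (deg) by the current gaze direction (deg)."""
    return (np.asarray(retinotopic_xy_deg, dtype=float)
            + np.asarray(gaze_xy_deg, dtype=float))

def gaze_contingent_crop(frame, gaze_px, size):
    """Return a size x size patch centered on the gaze pixel (row, col).

    The center is clamped so the patch always lies inside the frame;
    the frame is assumed to be at least size x size.
    """
    h, w = frame.shape[:2]
    half = size // 2
    r = int(np.clip(gaze_px[0], half, h - size + half))
    c = int(np.clip(gaze_px[1], half, w - size + half))
    return frame[r - half:r - half + size, c - half:c - half + size]

if __name__ == "__main__":
    centers = [[1.0, -2.0], [0.0, 0.0]]          # phosphene centers relative to gaze
    print(phosphenes_in_head_frame(centers, [3.0, 1.0]))
    camera = np.zeros((100, 120))                # stand-in for a camera frame
    print(gaze_contingent_crop(camera, (5, 118), 32).shape)
```

      In such a scheme the cropped patch, rather than the full head-centered camera frame, would be passed to whichever encoder generates the stimulation pattern.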

      Textual adjustment:

      Lines 883-910: Even after loss of vision, the brain integrates eye movements for the localization of visual stimuli (Reuschel et al., 2012), and in cortical prostheses the position of the artificially induced percept will shift along with eye movements (Brindley & Lewin, 1968, Schmidt et al., 1996). Therefore, in prostheses with a head-mounted camera, misalignment between the camera orientation and the pupillary axes can induce localization problems (Caspi et al., 2018; Paraskevoudi & Pezaris, 2019; Sabbah et al., 2014; Schmidt et al., 1996). Previous SPV studies have demonstrated that eye-tracking can be implemented to simulate the gaze-coupled perception of phosphenes (Cha et al., 1992; Sommerhalder et al., 2004; Dagnelie et al., 2006, McIntosh et al., 2013; Paraskevoudi et al., 2021; Rassia et al., 2018; Titchener et al., 2018; Srivastava et al., 2009). Note that some of the cited studies implemented a simulation condition where not only the simulated phosphene locations, but also the stimulation protocol depended on the gaze direction. More specifically, instead of representing the head-centered camera input, the stimulation pattern was chosen to encode the external environment at the location where the gaze was directed. While further research is required, there is some preliminary evidence that such a gaze-contingent image processing can improve the functional and subjective quality of prosthetic vision (Caspi et al., 2018; Paraskevoudi et al., 2021; Rassia et al., 2018; Titchener et al., 2018). Some example videos of gaze-contingent simulated prosthetic vision can be retrieved from our repository (https://github.com/neuralcodinglab/dynaphos/blob/main/examples/). Note that an eye-tracker will be required to produce gaze-contingent image processing in visual prostheses and there might be unforeseen complexities in the clinical implementation thereof. The study of oculomotor behavior in blind individuals (with or without a visual prosthesis) is still an ongoing line of research (Caspi et al., 2018; Kwon et al., 2013; Sabbah et al., 2014; Hafed et al., 2016).

      4) Finally, the computational capacity required to run the described system is substantial and is not one that would plausibly be used as part of an actual device, suggesting that there may be difficulties with converting results from this simulator to an implantable system.

      The software runs in real time with affordable, consumer-grade hardware. In Author response image 1 we present the results of performance testing with a 2016 model MSI GeForce GTX 1080 (priced around €600).

      Author response image 1.

      Note that the GPU is used only for the computation and rendering of the phosphene representations from given electrode stimulation patterns, which will never be part of any prosthetic device. The choice of encoder to generate the stimulation patterns will determine the required processing capacity that needs to be included in the prosthetic system, which is unrelated to the simulator’s requirements.

      The following addition was made to the text:

      • Lines 488-492: Notably, even on a consumer-grade GPU (e.g. a 2016 model GeForce GTX 1080) the simulator still reaches real-time processing speeds (>100 fps) for simulations with 1000 phosphenes at 256x256 resolution.

      5) With all of that said, the results do represent an advance, and one that could have wider impact if the authors were to reduce the computational requirements, and add gaze correction.

      We appreciate the kind compliment from the reviewer and sincerely hope that our revised manuscript meets their expectations. Their feedback has been critical to reshape and improve this work.

    1. Author Response

      Reviewer #3 (Public Review):

      In this manuscript, the authors studied the erythropoiesis and hematopoietic stem/progenitor cell (HSPC) phenotypes in a ribosome gene Rps12 mutant mouse model. They found that RpS12 is required for both steady and stress hematopoiesis. Mechanistically, RpS12+/- HSCs/MPPs exhibited increased cycling, loss of quiescence, protein translation rate, and apoptosis rates, which may be attributed to ERK and Akt/mTOR hyperactivation. Overall, this is a new mouse model that sheds light into our understanding of Rps gene function in murine hematopoiesis. The phenotypic and functional analysis of the mice are largely properly controlled, robust, and analyzed.

      A major weakness of this work is its descriptive nature, without a clear mechanism that explains the phenotypes observed in RpS12+/- mice. It is possible that the counterintuitive activation of ERK/mTOR pathway and increased protein synthesis rate is a compensatory negative feedback. Direct mechanism of Rps12 loss could be studied by the acute loss of Rps12, which is doable using their floxed mice. At the minimum, this can be done in mammalian hematopoietic cell lines.

      We thank the reviewer for pointing this out. We have addressed this question by developing a new inducible conditional knockout Rps12 mouse model (see response below to major point 1).

      Below are some specific concerns that need to be addressed.

      1) Line 226. The authors conclude that "Together, these results suggest that RpS12 plays an essential role in HSC function, including self-renewal and differentiation." The reviewer has three concerns regarding this conclusion and corresponding Figure 3. 1) The data shows that RpS12+/- mice have decreased number of both total BM cells and multiple subpopulations of HSPCs. The frequency of HSPC subpopulations should also be shown to clarify if the decreased HSPC numbers arise from decreased total BM cellularity or from a proportional decrease in frequency. 2) This figure characterizes phenotypic HSPC in BM by flow and lineage cells in PB by CBC. HSC function and differentiation are not really examined in this figure, except for the colony assay in Figure 3K. BMT data in Figure 4 is actually for HSC function and differentiation. So the conclusion here should be rephrased. 3) Since all LT-, ST-HSCs, as well as all MPPs are decreased in number, how can the authors conclude that Rps12 is important for HSC differentiation? No experiments presented here were specifically designed to address HSC differentiation.

      We thank the reviewer for this excellent point. We think that the main defect is in HSC and progenitor maintenance, rather than in HSC differentiation. This is consistent with the decrease in multiple HSC and progenitor populations, as observed both by calculating absolute numbers and by frequency of the parent population (see new Supplementary Figures S2C-S2C). We have removed any references to altered differentiation from the text.

      We added data on population frequency to Supplementary Figure 2 and to the corresponding text (see lines 221-235).

      2) Figure 3A and 5E. The flow cytometry gating of HSC/MPP is not well performed or presented, especially HSC plot. Populations are not well separated by phenotypic markers. This concerns the validity of the quantification data.

      We chose a more representative HSC plot and included it in Figure 3A.

      3) It is very difficult to read bone marrow cytospin images in Fig 6F without annotation of cell types shown in the figure. It appears that WT and +/- looked remarkably different in terms of cell size and cell types. This mouse may have other profound phenotypes that need detailed examination, such as lineage cells in the BM and spleen, and colony assays for different types of progenitors, etc.

      The purpose of the bone marrow cytospin images in Figure 6F was to show the high number of apoptotic cells in the bone marrow of Rps12 KO/+ mice compared with controls. The differences in apoptosis in the LSK and myeloid progenitor populations are quantified in the flow cytometry data shown in Figure 6G-H. A detailed quantitative analysis of different bone marrow cell populations and their relative frequencies is also shown in Figures 2 and 3. In Rps12 KO/+ bone marrow, we observed a significant decrease in multiple stem cell and progenitor populations.

      4) For all the intracellular phospho-flow shown in Fig 7, both a negative control of a fluorescent 2nd antibody only and a positive stimulus should be included. It is very concerning that the histograms show no significant changes of pAKT and pERK signaling (MFI) after SCF stimulation in WT LSKs. There are no distinct peaks that indicate non-phospho-proteins and phosphoproteins. This casts doubt on the validity of results. It is possible though that Rps12+/- have very high basal level of activation of pAKT/mTOR and pERK pathway. This again may point to a negative feedback mechanism of Rps12 haploinsufficiency.

      It is true that we did not observe an increase in pAKT, p4EBP1, or pERK in control cells in every case. This is often an issue with these specific phospho-flow cytometry antibodies, as they are not very sensitive, and the response to SCF is very time-dependent. We did observe an increase in pS6 with SCF in both LSK cells and progenitors (Figure 7B, E). However, the main point of this experiment was to assess the basal level of signaling in Rps12 KO/+ vs control cells. We did not observe hypersensitivity of RpS12 cells to SCF, but we did observe significant increases in pAKT, pS6, p4EBP1, and pERK in Rps12 KO/+ LSK cells.

      To address the concern about the validity of staining, please see the requested flow histograms for unstained vs individual Phospho-antibodies (Ab): p4EBP1, pERK, pS6 and pAKT (Figure R1 for reviewers) below. Additionally, since staining with the surface antibodies can potentially change the peak, we are including an additional control comparing the cell surface antibodies alone with the full sample stained with surface antibodies and Phospho-Ab: p4EBP1, pERK, pS6 and pAKT. We can include this figure in the Supplementary Data if requested.

      5) The authors performed in vitro OP-Puro assay to assess the global protein translation in different HSPC subpopulations. 1) Can the authors provide more information about the incubation media, any cytokine or serum included? The incubation media with supplements may boost the overall translation status, although cells from WT and RpS12+/- are cultured side by side. Based on this, in vivo OP-Puro assay should be performed in both genotypes. 2) Polysome profiling assay should be performed in primary HSPCs, or at least in hematopoietic cell lines. It is plausible that RpS12 haploinsufficiency may affect the content of translational polysome fractions.

      We are including these details in the methods section: for the in vitro OP-Puro assay (lines 555-565), cells were resuspended in DMEM (Corning 10-013-CV) media supplemented with 50 µM β-mercaptoethanol (Sigma) and 20 µM OPP (Thermo Scientific C10456). Cells were incubated for 45 minutes at 37°C and then washed with Ca2+ and Mg2+ free PBS. No additional cytokines were added.

      We did not perform polysome profiles. Polysome profiling of mutant stem and progenitor cells would be very challenging, as their numbers are much reduced. We now deem this of reduced interest, given the conclusion of the revised manuscript that RpS12 haploinsufficiency reduces overall translation. Also, because in the RpS12-floxed/+;SCL-CRE-ERT mouse model with acute deletion of RpS12 we observed the expected decrease in translation in HSCs using the same ex vivo OPP protocol, we did not follow up with in vivo OPP treatment.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors have used many cleverly chosen mouse models (periodontitis models; various models that lead to an on-switch of genes) and methods (immune localizations of high quality; single cell RNA sequencing) for the quest of elucidating a role for telocytes. They describe that more telocytes are present around teeth in mice that had periodontitis. These cells proliferated, and they expressed a pattern of genes that allowed macrophages to differentiate into a different direction. In particular, they showed that telocytes in periodontitis express HGF, a molecule that steers macrophage differentiation towards a less inflammatory cell type, paving the way for recovery. As a weakness, one could state that an attempt to extrapolate to human cells is missing.

      In the Discussion, we have a sentence that states further investigation in human periodontitis is required (see page 20, paragraph 416).

      Reviewer #3 (Public Review):

      Zhao and Sharpe identified telocytes in the periodontium. To address their contribution to periodontal diseases, they conducted scRNA-seq analysis and lineage tracing in mice. They demonstrated that telocytes are activated in periodontitis. The activated telocytes send HGF signals to surrounding macrophages, converting M2 to M1/M2 hybrid status. The study points to targeting telocytes and HGF signaling as a potential treatment for periodontitis.

      The significance of the study could be improved by authors testing if targeting telocytes or HGF signals could ameliorate periodontitis in the mouse model. The current form of the manuscript lacks the data that demonstrate the actual contribution of telocytes in the homeostasis of periodontium or progression of periodontitis.

      Major comments:

      1) I see the genetic validation of the role of telocytes or HGF signals as crucial to assure the significance of this manuscript. I recommend either of two experiments. a. testing the role of HGF signals by deleting the Hgf gene in telocytes. Using Wnt11-Cre; Hgf f/f mice, the authors could address the role of HGF signals in periodontitis. CX3CR1-Cre; cMet f/f mice will delete HGF signals in monocyte-derived macrophages. This will be another verification, but I am not sure if the PDL macrophages are derived from yolk sac or monocytes. b. measuring the contribution of telocytes in the homeostasis or disease progression. Although the mouse model could be challenging, the system, if achieved, would be very informative. The authors could first check the expression of telocyte enriched genes, such as Lgr5 or Foxl1 reported previously in other tissue telocytes. Delete those genes under the Wnt1-Cre driver and check if telocyte lineage is removed. The system would be very useful for next-level study. DTA model could be an alternative, but Wnt1-Cre is vastly expressed in neural crest lineage.

      These are good suggestions but unfortunately not feasible as we do not have all the mouse lines (e.g., Hgf f/f mice). Lgr5 and Foxl1 are used in intestine but are not suitable for PDL tissue. CD34;DTA show CD34+ cells, however, we encountered challenges associated with induced genetic heterogeneity when using this model, preventing us from making concrete conclusions from the experiments using the CD34;DTA model. Lgr5/Foxl1 are either not expressed or overlap with CD34 and therefore do not seem suitable for us to pursue.

      2) This paper points out that the M1/M2 hybrid state of macrophages appears upon periodontitis. The authors could further characterize the hybrid macrophages by the expression of more markers, production of cytokines, and morphology. Need to clarify if this means some macrophages are in M1 state and others are in M2 state, or one macrophage possesses both M1 and M2 phenotype. Please conduct either FACS or immunofluorescence to demonstrate if one macrophage expresses both markers. Please introduce more information about the M1/M2 hybrid state of macrophage based on other present literature.

      Unlike our single cell sequencing data, we were unsuccessful in determining if one macrophage possesses both M1 and M2 phenotype by immunolabelling.

      3) In the introduction part, the author lists several markers that can be used for telocyte identification, such as CD34+CD31-, CD34+c-Kit+, CD34+Vim+, CD34+PDGFRα+. Could authors explain why they chose CD34 CD31, but not other markers?

      As shown in the cluster images below, the other markers do not overlap very well with CD34 cells or, in the case of Vim, are expressed more ubiquitously. We generated a new supplementary figure (Supp Fig2) and explained this in the text (page 12, lines 235-238).

      4) In figure 5g, I don't think the yellow color cell shows the reduction trend in the Tivantinib treatment group compared with a control group. Please validate the observation by gene expression analysis, WB, etc. In addition, please show c-Met+ cells level in the Tivantinib treatment group and control group.

      New Supp Fig4 is included to show Met expression in homeostasis and periodontitis.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Li et al characterize sex differences in the impact of macrophage RELMa in protection against diet-induced obesity [DIO]. This is a key area of interest as obesity studies in mice have generally focused exclusively on male animals, as they tend to gain more weight, faster than female mice. The authors use a combination of flow cytometry, adoptive transfer, and single-cell transcriptomics to characterize the mechanism of action for female-specific DIO protection. They identify a potential role for eosinophils in mediating female DIO protection downstream of RELMa production by macrophage. They also use the transcriptomic characterization of the stromal vascular fraction of the adipose tissue to evaluate molecular and cellular drivers of this sex-specific DIO protection.

      Although the authors provide solid evidence for many claims in the manuscript, there is generally not enough information about the studies' methods (especially on the computational/data analysis aspects) for a careful evaluation of the result's robustness at this stage.

      We have significantly expanded the methodology, especially of the scRNAseq, and deposited the script and raw data in public repositories. We also validated our methods and can confirm that the analysis presented is robust. This resubmission contains new Fig 7 and new supplementary material with this methodology and validation.

      Reviewer #2 (Public Review):

      In the study by Li et al., the authors hypothesize that RELMa, a macrophage-derived protein, plays a sex-dimorphic role as a protective factor in obesity in females vs males. The authors perform largely in vivo studies utilizing male and female WT and RELMa KO mice on a high-fat diet and perform an in-depth analysis of immune cell composition, gene expression, and single-cell RNA Sequencing. The authors find that WT females are protected from obesity and inflammation vs males, and this protection is lost in female RELMa KO mice. Further analysis by the authors including flow cytometry of the visceral fat SVF in female WT mice showed reduced macrophage infiltration, higher levels of eosinophils, and Th2 cytokine expression compared to WT male mice and female KO mice. The authors show that protection from obesity and inflammation in female RELMa KO mice can be rescued with an injection of eosinophils and recombinant RELMa. Lastly, the authors use single-cell RNA-Sequencing to further analyze SVF cells in WT and KO male and female mice on a high-fat diet.

      Overall, we find that the study represents an important finding in the immunometabolism field showing that RELMa is a key myeloid-derived factor that helps influence the macrophage-eosinophil function in female mice and protects from diet-induced obesity and inflammation in a sexually dimorphic manner. Overall, the study provides strong and convincing data supporting the authors' hypothesis and conclusion.

      We thank the reviewer for their positive review of our manuscript and their helpful feedback which we address below.

      Reviewer #3 (Public Review):

      Li, Ruggiero-Ruff et al. examine the role of RELMα, an anti-inflammatory macrophage signature gene, in mediating sex differences in high-fat diet (HFD)-induced obesity in young mice. Specifically, the authors hypothesize that RELMα protects females against HFD-induced obesity. Comparisons between RELMα-knockout (KO) and wildtype (WT) mice of both sexes revealed sex- and RELMα-specific differences in weight gain, immune cell populations, and inflammatory signaling in response to HFD. RELMα-deficiency in females led to increased weight gain, expansion of pro-inflammatory macrophage populations, and eosinophil loss in response to HFD. Female RELMα-deficiency could be rescued by RELMα treatment or eosinophil transfer. Single-cell RNA-sequencing (scRNA-seq) of adipose stromal vascular fraction (SVF) revealed sex- and RELMα-dependent differences under HFD conditions and identified potential "pro-obesity" and "anti-obesity" genes in a cell-type-specific manner. Using trajectory analysis, the authors suggest dysregulation of macrophage-to-monocyte transition in RELMα-deficient mice.

      The conclusions of this paper are mostly well supported by the data, but some aspects of the statistical and single-cell analyses will need to be corrected, clarified, and extended to enhance the report.

      We thank Dr. Ocanas for their positive comments and for the helpful feedback to improve our study. We have addressed all the comments and significantly revised the manuscript.

      Strengths:

      The authors use several orthogonal approaches (i.e., flow cytometry, immunohistochemistry, scRNA-Seq) and models to support their hypotheses.

      The authors demonstrate that phenotypes observed in HFD-fed females with RELMα-deficiency (i.e., weight gain, loss of eosinophils, a gain of M1 macrophages) can be rescued by RELMα treatment or eosinophil transfer.

      The authors recognized the complexity of macrophage activation that is beyond the 'M1/M2' paradigm and informed readers in the introduction as to why this paradigm was used in this study. During the scRNA-seq analyses, the authors further sub-cluster macrophages to include more granularity.

      Weaknesses:

      1) There are several instances in the text where the authors claim that there is a significant difference between the two groups, but the statistics for these comparisons are not shown in the figure.

      Because we are dealing with three variables: genotype, diet and sex, and many differences, we thought it too complicated to add all the significant differences on the graph, but sometimes just mentioned these in the text with a p value, or didn’t mention at all if the difference was obvious, or not meaningful (for example, we weren’t interested in comparing a WT male on a Ctr diet with a RELMalpha KO female on a HFD for the purpose of our hypothesis). We have now ensured clarity in the text and in the figures, and addressed the specific point-by-point comments from the reviewer. We have also now carefully re-evaluated the text to ensure that any significant differences we discuss are shown in the figure.

      2) It is unfortunate that eosinophils could not be identified in the single-cell analysis since this population of cells was shown to be important in rescuing the RELMα-deficiency in HFD-fed females. The authors should note in the discussion how future scRNA-Seq experiments could overcome this limitation (i.e., enriching immune cells prior to scRNA-Seq).

      We were indeed disappointed that we were not able to obtain eosinophil single cell sequencing data, but realize that this is a reported issue in the field. We have expanded our discussion of this and cited a paper that performs eosinophil single cell sequencing (published at the time our manuscript was being submitted): “At the same time as our ongoing analysis, the first publication of eosinophil single cell RNA-seq was published, using a flow cytometry based approach rather than 10x, including RNAse inhibitor in the sorting buffer, and performing prior eosinophil enrichment (PMID: 36509106). Based on guidance from 10x, we employed targeted approaches to identify eosinophil clusters according to eosinophil markers (e.g. Siglecf, Prg2, Ccr3, Il5r), and relaxed the scRNA-Seq cutoff analysis to include more cells and intronic content, but still could not find eosinophils. We conclude that eosinophils may be absent due to the enzyme digestion required for SVF isolation and processing for single cell sequencing, which could lead to specific eosinophil population loss due to low RNA content, RNases or cell viability issues. Future experiments would be needed to optimize eosinophil single cell sequencing, based on the recent publication of eosinophil single cell sequencing.”

      3a) There are several issues with the scRNA-Seq analysis and interpretation. More details on the steps taken in the single-cell analyses should be included in the methods section.

      We agree with the reviewer that more details on the steps taken in the single cell data processing and bioinformatics need to be included in the methods section. We have added more information on the methodology used for these approaches, separated into subsections within the data processing section of the Materials and Methods, and have provided the code for our data processing in a public Github repository: https://github.com/rrugg002/Sexual-dimorphism-in-obesity-is-governed-by-RELM-regulation-of-adipose-macrophages-and-eosinophils.

      b) With regards to the 'pseudobulk' analyses presented in Figs. 5-6, several of the differentially expressed genes identified in Fig. 6 are hemoglobin genes (i.e., Hba, Hbb genes). It is not uncommon to filter these genes out of single-cell analysis since their presence usually indicates red blood cell (RBC) contamination (PMID: 31942070, PMID: 35672358). We would recommend assessing RBC contamination as well as removing Fig. 6 from the manuscript and focusing on cell-type-specific analyses. Re-analysis will likely have an impact on the overall conclusions of the study.

      Prior to our first submission, we consulted with 10x support scientists and the UCR bioinformatics core director to ensure that our analysis included the appropriate filtering. We have now added details in the Methods. The PMIDs provided above are from studies that looked at hippocampus development (where they didn’t perfuse so there may be blood contamination) or whole blood (where there would be significant red blood cell contamination). In contrast, we perfused our mice and treated the single cell suspension with RBC lysis buffer, as detailed in Methods. Also, we have now extended our scSeq analysis to compare hemoglobin RNA to red blood cell specific markers including Gypa/CD235a. While hemoglobin is distributed throughout the myeloid population in the female KO mice, Gypa/CD235a, whose expression would suggest RBC contamination, is not expressed at all (see new Fig 7B). Additionally, we provide hemoglobin protein ELISA and IF staining to support our finding that macrophages from KO mice express hemoglobin protein. Last, two publications support hemoglobin expression by nonerythroid sources, including macrophages (PMID: 10359765; PMID: 25431740). While we are confident, based on the above, that our data are not due to RBC contamination, we cannot exclude the possibility, although unlikely, that macrophages are phagocytosing RBCs and specifically preserving hemoglobin RNA and protein. We therefore discuss this possibility in the text. In conclusion, based on the justification above and the new data, we are confident that our findings and overall conclusions are robust.

      To assess potential RBC contamination, in addition to Gypa, we also looked at the top genes expressed by murine erythrocytes (PMID: 24637361). Please see the feature plots below, which show little to no expression and a very different distribution than the hemoglobin genes (see new Fig 7a):

      Also, we had a small cluster of potential RBCs (only 75 cells) that we filtered out of downstream DEG analysis, which revealed the same data as in the first submission.

      4) Within the text, there are several instances where the authors claim that a pathway is upregulated based on their Gene Ontology (GO) over-representation analysis (ORA). To come to this conclusion, the authors identify genes that are upregulated in one condition and then perform GO-ORA on these genes. However, the authors do not consider negative regulators, whose upregulation would actually decrease the pathway. Authors should either replace their GO-ORA analysis with one that considers the magnitude and direction of differentially expressed genes and provides an activation z-score (i.e., Ingenuity Pathway Analysis) or replace instances of 'upregulated' or 'downregulated' pathways with 'over-represented' pathways.

      Unfortunately, we did not have access to IPA for this project, therefore we have changed our analysis to over and under-represented pathways as suggested.
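
      To illustrate why over-representation analysis supports the wording "over-represented" rather than "upregulated", here is a minimal sketch of the hypergeometric test that typically underlies GO-ORA. The function name and the toy numbers are hypothetical and are not taken from the authors' pipeline. Note that the test only counts how many selected genes fall in a pathway; it never sees the sign or magnitude of their expression changes.

```python
from math import comb


def ora_p_value(universe_size, pathway_size, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    universe_size: number of genes tested (background)
    pathway_size:  number of background genes annotated to the pathway
    n_selected:    number of differentially expressed genes submitted
    n_overlap:     number of submitted genes that are in the pathway
    """
    max_overlap = min(pathway_size, n_selected)
    total = comb(universe_size, n_selected)
    tail = sum(
        comb(pathway_size, k)
        * comb(universe_size - pathway_size, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy example: 40 of 200 submitted genes fall in a 500-gene pathway
# drawn from a 20,000-gene background.
p = ora_p_value(20000, 500, 200, 40)
print(f"ORA p-value: {p:.3e}")
```

      Because a negative regulator contributes to the overlap count exactly as a positive regulator does, a small p-value here indicates enrichment of pathway members among the selected genes, not the direction of pathway activity.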

      5) For Fig.7A, a representative tSNE plot for each group (WT Female, KO Female, WT Male, KO Male) should be shown to ensure there is proper integration of the clusters across groups. There are some instances where the scRNA-Seq data do not appear to be integrated properly (i.e., Supplemental Figure 2C). The authors should explore integration techniques (i.e., Seurat; PMID: 29608179) to correct for potential batch effects within the analysis.

      We thank the reviewer for the suggestion of proper integration of the clusters across groups. We performed integration using the Cell Ranger aggregation (aggr) pipeline (see updated materials and methods section). In addition, many technical controls were performed to prevent batch effects between our samples. For sequencing, we used the 10x Genomics recommended library sequencing depth and run parameters for both gene expression and multiplexing libraries. For all 3’ gene expression library sequencing, we sequenced at a depth of 20,000 read pairs per cell and for all cell multiplexing library sequencing we sequenced at a depth of 5,000 read pairs per cell. All libraries were paired-end dual indexed libraries and were pooled on one flow cell lane using a 4:1 ratio (3’ Gene expression: Multiplexing ratio) in the NovaSeq, as recommended by 10x Genomics, in order to maintain nucleotide diversity and prevent batch effects during the sequencing process. When performing integration/aggregation of all sample gene expression libraries using the Cell Ranger aggregation (aggr) pipeline, we performed sequencing depth normalization between all samples. Cell Ranger does this by equalizing the average read depth per cell between groups before merging all sample libraries and counts together. This is a default setting in the Cell Ranger aggr pipeline, and this approach avoids artifacts that may be introduced due to differences in sequencing depth. Thus, we are confident that changes we observed in gene expression and cell type populations are due to biological differences and not technical variability. Below we have provided a tSNE plot showing clustering of all 12 samples after we performed integration:

      We updated old Fig.7 (now Fig. 6) and included a representative tSNE plot for each group. We also updated the tSNE plot for Figure 5-figure supplement 2C (previously S2C) showing overall clustering amongst all groups. The largest population differences occurred in the fibroblast population and these population differences were largely due to sex differences. Because we are confident that integration was performed appropriately and that batch effects were controlled for, we believe these sex differences are a biological effect.
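
      The depth normalization described in this response (equalizing the average read depth per cell across libraries before merging) can be illustrated with the following arithmetic sketch. It is not Cell Ranger's implementation: the function name and the sample depths are hypothetical, and the real pipeline operates on reads rather than on summary values.

```python
def subsample_fractions(mean_reads_per_cell):
    """Return, for each library, the fraction of reads to keep so that
    every library ends up at the depth of the shallowest one.

    mean_reads_per_cell: dict mapping library name -> mean reads per cell
    """
    target = min(mean_reads_per_cell.values())
    return {name: target / depth for name, depth in mean_reads_per_cell.items()}


# Hypothetical depths for three libraries.
depths = {"WT_F": 20000, "KO_F": 25000, "WT_M": 40000}
for name, fraction in subsample_fractions(depths).items():
    print(f"{name}: keep {fraction:.2f} of reads")
```

      Equalizing depth in this way addresses differences in sequencing depth only; it does not by itself correct for other sources of batch variation.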

      6) LncRNA Gm47283 is identified as a gene that is differentially expressed by genotype in HFD females (Fig. 7G); however, according to Ensembl this gene is encoded on the Y-chromosome (https://uswest.ensembl.org/Mus_musculus/Gene/Summary?g=ENSMUSG00000096768;r=Y:90796007-90827734). The authors should use the RELMα genotype and sex chromosomally-encoded genes to confirm that their multiplexing was appropriate.

      We agree with the reviewer that it is crucial to confirm that multiplexing and all subsequent analyses are performed correctly. Comparison between males and females contains internal controls that increase confidence, such as the Xist gene, which is expressed only in females, and Ddx3y, which is located on the Y chromosome. The LncRNA Gm47283 is located in the syntenic region of the Y chromosome and is also present in females, annotated as Gm21887 in the syntenic region of the X chromosome. It also has 100% alignment with Gm55594 on the X chromosome. Additionally, it is also referred to as erythroid differentiation regulator 1 (Erdr1), x or y depending on the chromosome, although the NCBI database specifies partial assembly and incomplete annotation. This explains why we see expression of this gene in females. We have discussed this in the text and revised it to refer to this LncRNA as Gm47283/Gm21887 to prevent further confusion. The RELMalpha genotype (absence in the KO) was also confirmed. Last, the PC analysis (see Fig 5) supports clustering by group.

      7) For Fig. 8, samples should be co-clustered and integrated across groups before performing trajectory analysis to allow for direct comparisons between groups.

      We appreciate the valuable feedback and suggestions, which have helped us clarify the trajectory analysis, as follows:

      Regarding the co-clustering and integration of our samples across groups, here is the explanation of our trajectory analysis approach. We have co-clustered all of our samples using the align_cds function from the Monocle3 package. We have included the code for Figure 8 in our Github repository at https://github.com/rrugg002/Sexual-dimorphism-in-obesity-is-governed-by-RELM-regulation-of-adipose-macrophages-and-eosinophils/blob/main/Figure8.R. Specifically, lines 138, 166, 196 and 225 of the code indicate that the align_cds function was used to cluster our samples by "Sample.ID".

      The align_cds function in Monocle3 aligns cells from different groups within a cell_data_set (CDS) object so that all samples in a single-cell RNA-seq experiment can be co-clustered. Given an alignment group (here "Sample.ID"), it removes group-specific batch effects in the reduced-dimension space using mutual-nearest-neighbor alignment, so that downstream clustering and trajectory inference reflect shared cell states rather than the sample of origin. More details about align_cds can be found here https://rdrr.io/github/cole-trapnell-lab/monocle3/man/align_cds.html .

      We hope that this additional information alleviates the reviewer’s concerns.

      8) Since the experiments presented in this report were from young mice using a single diet intervention, the authors should comment on how age and other obesogenic diets may impact the results found here. Also, the authors should expand their discussion as to what upstream regulators (i.e., hormones or genetics) may be driving the sex differences in RELMα expression in response to HFD.

      We thank the reviewer for the suggestion. We included several sentences to address this comment. However, since reviewers commented that some of the text needs to be trimmed down, an extensive discussion of the many possible reasons for sex differences is outside the scope of this manuscript. For example, sex differences can arise from any or all of these:

      1. Sex steroid hormones (estrogen and testosterone) are an obvious possibility for sex differences and this discussion has been included below and in the text.

      2. Sex differences we observe may stem from a variety of factors besides ovarian estrogen, including extraovarian estrogen, primarily estrogen produced in adipose tissues (32119876).

      3. Sex differences exist in fat deposition, which may or may not be estrogen dependent (25578600, 21834845).

      4. Sex differences were found in metabolic rate and oxidative phosphorylation, which may also be independent of estrogen (28650095, and reviewed in 26339468).

      5. Sex differences exist in the immune system, some of which are estrogen independent, but dependent on sex chromosomes (32193609).

      6. Sex differences exist particularly in the myeloid lineage, and these may also be estrogen independent (25869128).

      7. Sex differences were determined in adipokine levels, including leptin and adiponectin, which influence immune cells in adipose tissues (33268480).

      The role of estrogen is not clear either, and thus extensive discussion is not possible. Numerous studies demonstrated that estrogen is protective from inflammation, thus it is possible that estrogen drives some of the sex differences observed herein. However, several studies determined that estrogen can be pro-inflammatory (20554954, 15879140, 18523261). Previous publications by us (30254630, 33268480) and others (25869128) demonstrated intrinsic sex differences in the immune system that may be dependent on sex chromosome complement and/or Xist expression (34103397, 30671059).

      Studies are more consistent that estrogen is protective from weight gain: postmenopausal women with diminished estrogen, and ovariectomized animal models gain weight. The effects of ovariectomy on weight gain and its additive effects with high fat diet were reported in Rhesus monkeys (for example PMID: 2663699; and PMID: 16421340); and in rodents (PMID: 7349433).

      The reviewer is correct that the effects of aging or estrogen on RELMa levels would be of significant interest, and could be a future direction of our studies. Aging-mediated increase in inflammation (including of adipose tissue, recently reviewed in 36875140), that may be dependent on estrogen, can exacerbate obesity-mediated inflammation. We have added this discussion.

      For these reasons we limited our discussion regarding possible differences and stated this in the discussion: “Several studies demonstrated the protective role of estrogen in obesity-mediated inflammation and in weight gain, as discussed above. Whether estrogen protection occurs via estrogen regulation of RELMa levels is a focus of our future studies. Alternatively, intrinsic sex differences in immune system have been demonstrated as well (30254630, 33268480, 25869128) that are dependent on sex chromosome complement and/or Xist expression (34103397, 30671059), and RELMa may be regulated by these as well. Additionally, ageing-mediated increase in inflammation (including of adipose tissue, recently reviewed in 36875140), may also occur via changes in RELMa levels. Our studies used young but developmentally mature mice (4-6 weeks old when placed on diet, 18 weeks old at sacrifice), and future work on aged mice would be needed to investigate aging-mediated inflammation. Furthermore, there are sex differences in fat deposition, metabolic rates and oxidative phosphorylation (reviewed in 26339468), and adipokine expression (Coss) that regulate cytokine and chemokines levels, and therefore may regulate levels of RELMa as well. These possibilities will be addressed in future studies.”

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Goering et al. investigate subcellular RNA localization across different cell types focusing on epithelial cells (mouse C2bbe1 and human HCA-7 enterocyte monolayers, canine MDCK epithelial cells) as well as neuronal cultures (mouse CAD cells). They use their recently established Halo-seq method to investigate transcriptome-wide RNA localization biases in C2bbe1 enterocyte monolayers and find that 5'TOP-motif containing mRNAs, which encode ribosomal proteins (RPs), are enriched on the basal side of these cells. These results are supported by smFISH against endogenous RP-encoding mRNAs (RPL7 and RPS28) as well as Firefly luciferase reporter transcripts with and without mutated 5'TOP sequences. Furthermore, they find that 5'TOP-motifs are not only driving localization to the basal side of epithelial cells but also to neuronal processes. To investigate the molecular mechanism behind the observed RNA localization biases, they reduce expression of several Larp proteins and find that RNA localization is consistently Larp1-dependent. Additionally, the localization depends on the placement of the TOP sequence in the 5'UTR and not the 3'UTR. To confirm that similar RNA localization biases can be conserved across cell types for other classes of transcripts, they perform similar experiments with a GA-rich element containing Net1 3'UTR transcript, which has previously been shown to exhibit a strong localization bias in several cell types. In order to determine if motor proteins contribute to these RNA distributions, they use motor protein inhibitors to confirm that the localization of individual members of both classes of transcripts, 5'TOP and GA-rich, is kinesin-dependent and that RNA localization to specific subcellular regions is likely to coincide with RNA localization to microtubule plus ends that concentrate in the basal side of epithelial cells as well as in neuronal processes.

      In summary, Goering et al. present an interesting study that contributes to our understanding of RNA localization. While RNA localization has predominantly been studied in a single cell type or experimental system, this work looks for commonalities to explain general principles. I believe that this is an important advance, but there are several points that should be addressed.

      Comments:

      1) The Mili lab has previously characterized the localization of ribosomal proteins and NET1 to protrusions (Wang et al, 2017, Moissoglu et al 2019, Crisafis et al., 2020) and the role of kinesins in this localization (Pichon et al, 2021). These papers should be cited and their work discussed. I do not believe this reduces the novelty of this study; rather, it supports the generality of the RNA localization patterns to additional cellular locations in other cell types.

      This was an unintentional oversight on our part, and we apologize. We have added citations for the mentioned publications and discussed our work in the context of theirs.

      2) The 5'TOP motif begins with an invariant C nucleotide and mutation of this first nucleotide next to the cap has been shown to reduce translation regulation during mTOR inhibition (Avni et al, 1994 and Biberman et al 1997) and also Larp1 binding (Lahr et al, 2017). Consequently, it is not clear to me if RPS28 initiates transcription with an A as indicated in Figure 3B. There also seems to be some differences in published CAGE datasets, but this point needs to be clarified. Additionally, it is not clear to me how the 5'TOP Firefly luciferase reporters were generated and if the transcription start site and exact 5'-ends of these constructs were determined. This is again essential to determine if it is a pyrimidine sequence in the 5'UTR that is important for localization or the 5'TOP motif and if Larp1 is directly regulating the localization by binding to the 5'TOP motif or if the effect they observe is indirect (e.g. is Larp1 also basally localized?). It should also be noted that Larp1 has been suggested to bind pyrimidine-rich sequences in the 5'UTR that are not next to the cap, but the details of this interaction are less clear (Al-Ashtal et al, 2021).

      We did not fully appreciate the subtleties related to TOP motif location when we submitted this manuscript, so we thank the reviewer for pointing them out.

      We also analyzed public CAGE datasets (Andersson et al, 2014 Nat Comm) and found that the start sites for both RPL7 and RPS28 were quite variable within a window of several nucleotides (as is the case for the vast majority of genes), suggesting that a substantial fraction of both do not begin with pyrimidines (Reviewer Figure 1). Yet, by smFISH, endogenous RPL7 and RPS28 are clearly basally/neurite localized (see new figure 3C).

      Reviewer Figure 1. Analysis of transcription start sites for RPL7 (A) and RPS28 (B) using CAGE data (Andersson et al, 2014 Nat Comm). Both genes show a window of transcription start sites upstream of current gene models (blue bars at bottom).

      A more detailed analysis of our PRRE-containing reporter transcripts led us to find that in these reporters, the pyrimidine-rich element was approximately 90 nucleotides into the body of the 5’ UTR. Yet these reporters are also basally/neurite localized. The organization of the PRRE-containing reporters is now more clearly shown in an updated figure 3D.

      From these results, it would seem that the pyrimidine-rich element need not be next to the 5’ cap in order to regulate RNA localization. To generalize this result, we first used previously identified 5’ UTR pyrimidine-rich elements that had been found to regulate translation in an mTOR-dependent manner (Hsieh et al 2012). We found that, as a class, RNAs containing these motifs were similarly basally/neurite localized as RP mRNAs. These results are presented in figures 3A and 3I.

      We then asked if the position of the pyrimidine-rich element within the 5’ UTR of these RNAs was related to their localization. We found no relationship between element position and transcript localization as elements within the bodies of 5’ UTRs were seemingly just as able to promote basal/neurite localization as elements immediately next to the 5’ cap. These results are presented in figures 3B and 3J.

      To further confirm that pyrimidine-rich elements need not be immediately next to the 5’ cap, we redesigned our RPL7-derived reporter transcripts such that the pyrimidine-rich motif was immediately adjacent to the 5’ cap. This was possible because the reporter uses a CMV promoter that reliably starts transcription at a known nucleotide. We then compared the localization of this reporter (called “RPL7 True TOP”) to our previous reporter in which the pyrimidine-rich element was ~90 nt into the 5’ UTR (called “RPL7 PRRE”) (Reviewer Figure 2). As with the PRRE reporter, the True TOP reporter drove RNA localization in both epithelial and neuronal cells while purine-containing mutant versions of the True TOP reporter did not (Reviewer Figure 2A-D). In the epithelial cells, the True TOP was modestly but significantly better at driving basal RNA localization than the PRRE (Reviewer Figure 2E), while in neuronal cells the True TOP was modestly better but the difference was not statistically significant (Reviewer Figure 2F). Again, this suggests that pyrimidine-rich motifs need not be immediately cap-adjacent in order to regulate RNA localization.

      Reviewer Figure 2. Experimental confirmation that pyrimidine-rich motif location within 5’ UTRs is not critical for RNA localization. (A) RPL7 True TOP smFISH in epithelial cells. (B) RPL7 True TOP smFISH in neuronal cells. (C) Quantification of epithelial cell smFISH in A. (D) Quantification of neuronal cell smFISH in B. (E) Comparison of the location in epithelial cells of endogenous RPL7 transcripts, RPL7 PRRE reporter transcripts, and RPL7 True TOP reporter transcripts. (F) Comparison of the neurite-enrichment of RPL7 PRRE reporters and RPL7 True TOP reporters. In C-F, the number of cells included in each analysis is shown.

      In response to the point about whether the localization results are direct effects of LARP1, we did not assay the binding of LARP1 to our PRRE-containing reporters, so we cannot say for sure. However, given that PRRE-dependent localization required LARP1 and there is much evidence about LARP1 binding pyrimidine-rich elements (including those that are not cap-proximal as the reviewer notes), we believe this to be the most likely explanation.

      It should also be noted here that while pyrimidine-rich motif position within the 5’ UTR may not matter, its location within the transcript does. PRREs located within 3’ UTRs were unable to direct RNA localization (Figure 5).

      3) In figure 1A, they indicate that mRNA stability can contribute to RNA localization, but this point is never discussed. This may be important to their work since Larp1 has also been found to impact mRNA half-lives (Aoki et al, 2013 and Mattijssen et al 2020, Al-Ashtal et al 2021). Is it possible the effect they see when Larp1 is depleted comes from decreased stability?

      We found that PRRE-containing reporter transcripts were generally less abundant than their mutant counterparts in C2bbe1, HCA7, and MDCK cells (figure 3 – figure supplements 5, 6, and 8) although the effect was not consistent in mouse neuronal cells (figure 3 – figure supplement 13).

      However, we don’t think it is likely that the changes in localization are due to stability changes. This abundance effect did not seem to be LARP1-dependent as both PRRE-containing and PRRE-mutant reporters were generally more expressed in LARP1-rescue epithelial cells than in LARP1 KO cells (figure 4 – figure supplement 9).

      It should be noted here that we are not ever actually measuring transcript stability but rather steady state abundances. It cannot therefore be ruled out that LARP1 is regulating the stability of our PRRE reporters. Given, though, that their localization was dependent on kinesin activity (figures 7F, 7G), we believe the most likely explanation for the localization effects is active transport.

      4) Also Moor et al, 2017 saw that feeding cycles changed the localization of 5'TOP mRNAs. Similarly, does mTOR inhibition or activation or simply active translation alter the localization patterns they observe? Further evidence for dynamic regulation of RNA localization would strengthen this paper

      We are very interested in this and have begun exploring it. We have data suggesting that PRREs also mediate the feeding cycle-dependent relocalization of RP mRNAs. As the reviewer says, we think this leads to a very attractive model involving mTOR, and we are currently working to test this model. However, we don’t have the room to include those results in this manuscript and would instead prefer to include them in a later manuscript that focuses on nutrient-induced dynamic relocalization.

      5) For smFISH quantification, is every mRNA treated as an independent measurement so that the statistics are calculated on hundreds of mRNAs? Large sample sizes can give significant p-values even when the differences are very small, as observed for Firefly vs. OSBPL3 localization. Since determining the biological interpretation of effect size is not always clear, I would suggest plotting RNA position per cell or only treating biological replicates as independent measurements to determine statistical significance. This should also be done for other smFISH comparisons.

      This is a good suggestion, and we agree that using individual puncta as independent observations will artificially inflate the statistical power in the experiment. To remedy this in the epithelial cell images, we first reanalyzed the smFISH images using each of the following as a unique observation: the mean location of all smFISH puncta in one cell, the mean location of all puncta in a field of view, and the mean location of all puncta in one coverslip. With each metric, the results we observed were very similar (Reviewer Figure 3) while the statistical power of course decreased. We therefore chose to go with the reviewer-suggested metric of mean transcript position per cell.

      Reviewer Figure 3. C2bbe1 monolayer smFISH spot position analysis. RNA localization across the apicobasal axis is measured by smFISH spot position in the Z axis. This can be plotted for each spot, where thousands of spots over-power the statistics. Spot position can be averaged per cell as outlined manually within the FISH-quant software. This reduces sample size and allows for more accurate statistical analysis. When spot position is averaged per field of view, sample size further decreases, statistics are less powered but the localization trends are still robust. Finally, we can average spot position per coverslip, which represents biological replicates. We lose almost all statistical power as sample size is limited to 3 coverslips. Despite this, the localization trends are still recognizable.

      When we use this metric, all results remain the same with the exception of the smFISH validation of endogenous OSBPL3 localization. That result loses its statistical significance and has now been omitted from the manuscript. All epithelial smFISH panels have been updated to use this new metric, and the number of cells associated with each observation is indicated for each sample.
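
      The per-cell metric described in this response can be sketched as follows: spot positions are first collapsed to one mean value per cell, and only those per-cell means are then treated as observations in the statistical comparison. This is a minimal illustration, not the authors' analysis code; the input layout (pairs of cell identifier and Z position) and all names are assumptions.

```python
from collections import defaultdict
from statistics import mean


def mean_position_per_cell(spots):
    """Collapse smFISH spots to one observation per cell.

    spots: iterable of (cell_id, z_position) pairs, one per detected spot
    Returns a dict mapping cell_id -> mean Z position of its spots.
    """
    positions_by_cell = defaultdict(list)
    for cell_id, z_position in spots:
        positions_by_cell[cell_id].append(z_position)
    return {cell: mean(zs) for cell, zs in positions_by_cell.items()}


# Hypothetical spots: five spots from only two cells.
spots = [
    ("cell_1", 1.0), ("cell_1", 2.0), ("cell_1", 3.0),
    ("cell_2", 6.0), ("cell_2", 8.0),
]
per_cell = mean_position_per_cell(spots)
print(per_cell)  # two observations instead of five
```

      The same grouping applies unchanged to fields of view or coverslips by swapping the identifier used as the grouping key, which is how the sample size shrinks from thousands of spots to a handful of biological replicates.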

      For the neuronal images, these were already quantified at the per-cell level as we compare soma and neurite transcript counts from the same cell. In lieu of more imaging of these samples, we chose to perform subcellular fractionation into soma and neurite samples followed by RT-qPCR as an orthogonal technique (figure 3K, figure 3 supplement 14). This technique profiles the population average of approximately 3 million cells.

      6) F: How was the segmentation of soma vs. neurites performed? It would be good to have a larger image as a supplemental figure so that it is clear whether proximal or distal neurite segments are being compared.

      All neurite vs. soma segmentations were done manually. An example of this segmentation is included as Reviewer Figure 4. This means that often only proximal neurite segments are included in the analysis, as it is often difficult to find an entire soma and an entire neurite in one field of view. However, in our experience, inclusion of more distal neurite segments would likely only strengthen the smFISH results as we often observe many molecules of localized transcripts in the distal tips of these neurites.

      Reviewer Figure 4. Manual segmentation of differentiated CAD soma and neurite in FISH-quant software. Neurites that do not overlap adjacent neurites are selected for imaging. Often neurites extend beyond the field of view, limiting this assay to RNA localization in proximal neurites.

      Also, it should be noted that the neuronal smFISH results are now supplemented by experiments involving subcellular fractionation and RT-qPCR (figure 3 supplement 14). These subcellular fractionation experiments collect the whole neurite, both the proximal and distal portions.

      Text has been added to the methods under the header “smFISH computational analysis” to clarify how the segmentation was done.

    1. Author Response

      Reviewer #3 (Public Review):

      Main results:

      1) TCR convergence is different from publicity: The authors look at CDR3 sequence features of convergent TCRs in the large Emerson CMV cohort. Amino acid usage does not perfectly correlate with codon degeneracy, for example, arginine (which has 6 codons) is less common in convergent TCRs, whereas leucine and serine are elevated. It's argued that there's more to convergence than just recombination biases, which makes sense. (I wonder if the trends for charged amino acids could be explained by the enrichment of convergent TCRs in CD8 T cells, which tend to have more acidic CDR3 loops). There's also a claim that the overlap between convergent and public TCRs is lower in tumors with a high mutational burden (TMB), but this part is sketchy: the definition of public TCRs is murky and hard to interpret, and the correlation between TMB and convergence-publicity overlap is modest (two cohorts with low TMB have higher overlap, and the other three have lower, but there is no association over those three, if anything the trend is in the other direction). It's also not clear why the overlap between COVID19 cohort convergent TCRs and public TCRs defined by the pre-2019 Emerson cohort should be high. A confounder here is the potential association between convergence and clonal expansion since expanded clonotypes can spawn apparently convergent TCRs due to sequencing errors. The paper "TCR Convergence in Individuals Treated With Immune Checkpoint Inhibition for Cancer" (Ref#5 here) gives evidence that sequencing errors may be inflating convergence in this specific dataset.

      We really appreciate the reviewer’s feedback. We respond to each of the reviewer’s points below:

      (1) Amino acid preference of convergent TCRs might be caused by CD8+ T cell enrichment. To test this hypothesis, we performed the same analysis using only CD8+ T cells (using the Cader 2019 lymphoma cohort). The results are shown below. We do not observe significant changes after excluding CD4+ T cells, indicating that this enrichment might be caused by factors other than CD4/CD8 differences.

      (2) Definition of public TCRs. We have changed the definition of public TCRs. Instead of mixing the Emerson cohort into each group and using the mixed cohort to define the public TCRs, we used only the 666 samples of the Emerson cohort to define a single set of public TCRs and applied it to each cohort. Both the dataset and the approach used in this manuscript are consistent with a previous study on the same topic (Madi et al., 2014, eLife).

      (3) Convergence-publicity overlap: We agree with the reviewer that some high TMB tumors did not show further decrease of convergence-publicity overlap. One potential explanation is that the correlation between the two is not linear. By adding additional cohorts in this revision (healthy and recovered COVID-19 patients), we confirmed the previously observed overall trend between TMB and the overlap, which supported our conclusions (see figure below). On the other hand, we believe that the high overlap of convergent TCRs among healthy cohorts might result from exposure to common antigens. In the cancer patients, while still exposed, private antigens derived from tumor cells are expected to compete for resources, thus reducing the proportion of these public TCRs in the blood repertoire. The above discussion has been added to the revised manuscript:

      “Healthy individuals are expected to be exposed to common pathogens, which might induce public T cell responses. On the other hand, cancer patients have more neoantigens due to the accumulative mutation, which drives their antigen-specific T cells to recognize these 'private' antigens. This reduces the proportion of public TCRs in antigen-specific TCRs. Furthermore, a higher tumor mutation burden (TMB) would indicate a higher abundance of neoantigens, resulting in a lower ratio of public TCRs.”

      2) Convergent TCRs are more likely to be antigen-specific: This is nicely shown on two datasets: the large dextramer dataset from 10x genomics, and the COVID19 datasets from Adaptive biotech. But given previous work on TCR convergence, for example, the Pogorelyy ALICE paper, and many others, this is also not super-surprising.

      We thank the reviewer for bringing up this related work. In the Pogorelyy ALICE paper, the authors defined TCR neighbors based on one nucleotide difference from a given CDR3, which included both synonymous and non-synonymous changes. In other words, ALICE treats both convergent and mismatched (Hamming distance 1) sequences as neighbors. Although highly relevant, our approach differs in focusing only on convergence, as mismatch has been extensively investigated in previous studies. We have now added this paper as Ref 27 and discussed the difference between ALICE and our method in the revised manuscript.
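      The distinction described above can be illustrated with a minimal Python sketch. It assumes each clonotype is given as a pair of CDR3 nucleotide and amino acid sequences; the function names and the toy sequences are hypothetical, and this is neither the code used in the manuscript nor the ALICE implementation.

```python
from collections import defaultdict


def convergent_groups(clonotypes):
    """Return amino acid CDR3s encoded by more than one distinct nucleotide sequence."""
    by_aa = defaultdict(set)
    for nt, aa in clonotypes:
        by_aa[aa].add(nt)
    return {aa: nts for aa, nts in by_aa.items() if len(nts) > 1}


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def one_mismatch_neighbors(query_nt, clonotypes):
    """Clonotypes whose nucleotide CDR3 differs from the query at exactly one position."""
    return [
        (nt, aa)
        for nt, aa in clonotypes
        if len(nt) == len(query_nt) and hamming(nt, query_nt) == 1
    ]


# Hypothetical toy repertoire: (nucleotide CDR3, amino acid CDR3).
toy = [
    ("TGTGCCAGC", "CAS"),  # reference sequence
    ("TGTGCTAGC", "CAS"),  # synonymous change (GCC -> GCT), convergent with the reference
    ("TGTGCCAGG", "CAR"),  # non-synonymous change (AGC -> AGG)
]

convergent = convergent_groups(toy)
neighbors = one_mismatch_neighbors("TGTGCCAGC", toy)
```

      In this toy case the one-mismatch neighborhood contains both the synonymous and the non-synonymous variant, whereas the convergence grouping retains only the two sequences that encode the same amino acid CDR3.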

      3) Convergent T cells exhibit a CD8+ cytotoxic gene signature: This is based on a nice analysis of mouse and human single-cell datasets. One striking finding is that convergent TCRs are WAY more common in CD8+ T cells than in CD4+ T cells. It would be interesting to know how much of this could be explained by greater clonal expansion of CD8+ T cells, together with sequencing errors. A subtle point here is that some of the P values are probably inflated by the presence of expanded clonotypes: a group of cells belonging to the same expanded clonotype will tend to have similar gene expression (and therefore similar cluster membership), and will necessarily all be either convergent or not convergent collectively since they share the same TCR. So it's probably not quite right to treat them as independent for the purposes of assessing associations between gene expression clusters and convergence (or any other TCR-defined feature). You can see evidence for clonal expansion in Figure 3C, where TRAV genes are among the most enriched, suggesting that Cluster 04 may contain expanded clones.

      (1) We agree with the reviewer that a possible explanation of the CD8/CD4 difference is the larger clonal expansion of CD8+ T cells. We tested this hypothesis by counting the number of T cell clones instead of the number of cells, which removes the effect of CD8+ T cell expansion. We first investigated the bulk TCR repertoire sequencing samples, shown in Figure 3—figure supplement 2C-2D (see figure below). We observed higher convergence levels for CD8+ T cell clones than for CD4+ T cell clones. An additional description of this topic was added to the last paragraph of the Results section “Convergent T cells exhibit a CD8+ cytotoxic gene signature” as follows:

      “The results may be explained by larger cell expansions of CD8+ T cells than CD4+ T cells. Therefore, we calculated the number of convergent clones within CD8+ T cells and CD4+ T cells from the above datasets to exclude the effects of cell expansion. As a result, in the scRNA-seq mouse data, while only 1.54% of the CD4+ clones were convergent, 3.76% of the CD8+ clones showed convergence. Likewise, 0.17% of convergent CD4+ T cell clones and 1.03% of convergent CD8+ T cell clones were found in human scRNA-seq data. In the bulk TCR-seq lymphoma data, similar results were also observed, where the gap between the convergent levels of CD4+ and CD8+ T cells narrowed but remained significant (Figure 3—figure supplement 2C-2D). In conclusion, these results suggest that CD8+ T cells show higher levels of convergence than CD4+ T cells, which substantiated our hypothesis that convergent T cells are more likely antigen-experienced. This observation has been tested using multiple datasets with diverse sequencing platforms and sequencing depth to minimize the impact of batch or other technical artifacts.”

      (2) We next investigated the effect of cell expansion in the single-cell analysis. We agree with the reviewer that some highly expanded convergent clones could inflate the p-value. Therefore, we revised the calculation of TCR convergence to use T cell clones instead of individual cells. We observed that the clusters of interest mentioned in the paper (for both mouse and human data) remain among the most convergent of all clusters (see table below), with p-values estimated using the binomial exact test. These results support our hypothesis that TCR convergence is enriched in T cell clusters that are more likely antigen-experienced.
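      The clone-level calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used for the manuscript: each cell is a hypothetical `(clone_id, cluster, is_convergent)` record, a clone is assigned to the cluster of its first observed cell, the null proportion is the convergent fraction among all clones, and the test is one-sided for enrichment.

```python
from math import comb


def collapse_to_clones(cells):
    """Keep one (cluster, is_convergent) record per clone, taken from its first cell."""
    clones = {}
    for clone_id, cluster, is_convergent in cells:
        clones.setdefault(clone_id, (cluster, is_convergent))
    return clones


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def cluster_enrichment(cells, cluster):
    """Return (convergent clones, total clones, p-value) for one cluster."""
    clones = collapse_to_clones(cells)
    # Assumed null: the convergent fraction across all clones in the dataset.
    background = sum(conv for _, conv in clones.values()) / len(clones)
    in_cluster = [conv for cl, conv in clones.values() if cl == cluster]
    k, n = sum(in_cluster), len(in_cluster)
    return k, n, binomial_upper_tail(k, n, background)


# Hypothetical toy data: clone c1 is expanded to three cells but counts once.
toy_cells = [
    ("c1", "A", True), ("c1", "A", True), ("c1", "A", True),
    ("c2", "A", False),
    ("c3", "B", False),
    ("c4", "B", False),
]
k, n, p_value = cluster_enrichment(toy_cells, "A")
```

      Collapsing cells to clones before testing means that an expanded clonotype contributes a single observation, which addresses the non-independence concern raised by the reviewer.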

      4) TCR convergence is associated with the clinical outcome of ICB treatment: The associations for the first analysis are described as significant in the text, and they are, but just barely (0.045 and 0.047, but you have to check the figure to see that).

      As suggested by the reviewer, we have added the p-values to the text so that they are easier to see. In this revision, we adopted another definition of convergence level, changing from the ratio of convergent TCRs to the actual number of convergent T cell clones within each sample. The p-values were more significant using this new indicator (0.02 and 0.00038). To avoid the effect of other variables that might be correlated with convergence levels, especially sequencing depth, a multivariate Cox model was used for both datasets tested in the paper, correcting for TCR clonality, TCR diversity and sequencing depth (and for different treatment methods in the melanoma data). As a result, convergence remains significantly prognostic after adjusting for the additional variables.

      5) Introduction/Discussion: Overall, the authors could do a better job citing previous work on convergence, for example, papers from Venturi on convergent recombination and the work from Mora and Walczak (ALICE, another recombination modeling). They also present the use of convergence as an ICB biomarker as a novel finding, but Ref 5 introduces this concept and validates it in another cohort. Ref 5 also has a careful analysis of the link between sequencing errors and convergence, which could have been more carefully considered here.

      We thank the reviewer for this excellent suggestion. We have added the citation of Venturi on convergent recombination as Ref 43 and cited it in the last paragraph of the Results section:

      “Convergent recombination was claimed to be the mechanistic basis for public TCR response in many previous studies(Quigley et al., 2010; Venturi et al., 2006).”

      We also included work from Mora and Walczak in the fourth paragraph of the introduction and the third paragraph of the discussion as Ref 27 to introduce this TCR similarity-based clustering method as well as its application in predicting ICB response:

      “This idea has led several TCR similarity-based clustering algorithms, such as ALICE (Pogorelyy et al., 2019), TCRdist (Dash et al., 2017), GLIPH2 (Huang et al., 2020), iSMART (Zhang et al., 2020), and GIANA (Zhang et al., 2021), to be developed for studying antigen-driven T cell expansion during viral infection or tumorigenesis.”

      “In addition, the potential prognostic value of TCR convergence and TCR similarity-based clustering was testified in other studies(Looney et al., 2019; Pogorelyy et al., 2019).”

      Ref 5 was cited again when discussing the effect of sequencing errors on TCR convergence in the fourth paragraph of the discussion:

      “Improper handling of sequencing errors may result in the overestimation of TCR convergence (Looney et al., 2019).”

    1. Author Response

      Reviewer #1 (Public Review):

      Kazrin appears to be implicated in many diverse cellular functions, and accordingly, localizes to many subcellular sites. Exactly what it does is unclear. The authors perform a fairly detailed analysis of Kazrin in-cell function, and find that it is important for the perinuclear localization of TfN, and that it binds to members of the AP-1 complex (e.g., gamma-adaptin). The authors note that the C-terminus of Kazrin (which is predicted to be intrinsically disordered) forms punctate structures in the cytoplasm that colocalize with components of the endosomal machinery. Finally, the authors employ co-immunoprecipitation assays to show that both N and C-termini of Kazrin interacts with dynactin, and the dynein light-intermediate chain.

      Much of the data presented in the manuscript are of fairly high quality and describe a potentially novel function for Kazrin C. However, I had a few issues with some of the language used throughout, the manner of data presentation, and some of their interpretations. Most notably, I think in its current form, the manuscript does not strongly support the authors' main conclusion: that Kazrin is a dynein-dynactin adaptor, as stated in their title. Without more direct support for this function, the authors need to soften their language. Specific points are listed below.

      Major comments:

      1) I agree with the authors that the data provided in the manuscript suggest that Kazrin may indeed be an endosomal adaptor for dynein-dynactin. However, without more direct evidence to support this notion, the authors need to soften their language stating as much. For example, the title as stated would need to be changed, as would much of the language in the first paragraph of the discussion. Alternatively, the manuscript could be significantly strengthened if the authors performed a more direct assay to test this idea. For example, the authors could use methods employed previously (e.g., McKenney et al., Science 2014) to this end. In brief, the authors can simply use their recombinant Kazrin C (with a GFP) to pull out dynein-dynactin from cell extracts and perform single molecule assays as previously described.

      While this is certainly an excellent suggestion, in vitro dynein/dynactin motility assays are not straightforward experiments for laboratories that do not use them as a routine protocol. That is why we asked Dr. Thomas Surrey (Centre for Genomic Regulation, Barcelona), an expert in the biochemistry and biophysics of microtubule dynamics, to help us with this kind of analysis. In their setting, TIRF microscopy is used to follow EGFP-dynein/dynactin motility along microtubules immobilized on cover slides (Jha et al., 2017). As shown in figure R1, more binding of EGFP-dynein to the microtubules is observed when purified kazrin is added to the assay (from 20 to 400 nM), but there is no increase in the number or processivity of the EGFP-dynein motility events. These results are hard to interpret at this point. Kazrin might still be an activating adaptor but a component is missing in the assay (i.e., an activating posttranslational modification or a particular subunit of the dynein or dynactin complexes), or it could increase the processivity of dynein-dynactin in complex with another bona fide activating adaptor, as has been demonstrated for LIS1 (Baumbach et al., 2017; Gutierrez et al., 2017). Alternatively, kazrin could transport dynactin and/or dynein to the microtubule plus ends in a kinesin 1-dependent manner, in order to load the peripheral endosomes with the minus end-directed motor (Yamada et al., 2008).

      Figure R1. Kazrin C purified from E. coli increases binding of dynein to microtubules but does not increase the number or processivity of EGFP-dynein motility events. A. TIRF (Total Internal Reflection Fluorescence) micrographs of microtubule-coated cover slides incubated in the presence of 10 nM EGFP-dynein and 20 nM dynactin in the presence or absence of 20 nM kazrin C, expressed and purified from E. coli. B. Kymographs of TIRF movies of microtubule-coated cover slides incubated in the presence of purified 10 nM EGFP-dynein, 20 nM dynactin and either 400 nM of the activating adaptor BICD2 (1:2:40 ratio) (left panel) or kazrin C (right panel). Red squares indicate processive dynein motility events induced by BICD2.

      Investigating the molecular activity of kazrin on dynein/dynactin motility is a whole project in itself, which we feel is outside the scope of the present manuscript. Therefore, as suggested by the BRE, we have chosen to soften the conclusions and classify kazrin as a putative “candidate” dynein/dynactin adaptor based on its interactome, domain organization and subcellular localization, as well as on the defects in endosome motility observed in vivo upon its depletion. We also discuss other possibilities, such as those outlined above.

      2) I'm not sure I agree with the use of the term 'condensates' used throughout the manuscript to describe the cytoplasmic Kazrin foci. 'Condensates' is a very specific term that is used to describe membraneless organelles. Given the presumed association of Kazrin with membrane-bound compartments, I think it's more reasonable to assume these foci are quite distinct from condensates.

      We actually used condensates to avoid implying that the kazrin IDR generates membraneless compartments or induces liquid-liquid phase separation, which is certainly not a conclusion of the manuscript. However, since all reviewers agreed that the word was misleading, we have replaced the term condensates with foci throughout the manuscript.

      3) The authors note the localization of Tfn as perinuclear. Although I agree the localization pattern in the kazKO cells is indeed distinct, it does not appear perinuclear to me. It might be useful to stain for a centrosomal marker (such as pericentrin, used in Figure 5B) to assess Tfn/EEA1 with respect to MT minus ends.

      We have now replaced the term perinuclear, which implies that endosomes surround the nucleus, with the term juxtanuclear, which more accurately defines what we wanted to indicate (close to). We thank the reviewer for pointing out this lack of accuracy. We also describe more clearly in the text that in fibroblasts, the Golgi apparatus and the Recycling Endosomes (REs) gather around the pericentriolar region (Granger et al., 2014, and references therein), which is usually close to the nucleus (Tang and Marshall, 2012, and references therein). Nevertheless, as suggested by the reviewer, we have included pictures of the TxR-Tfn and EEA1-labelled endosomes accumulating around pericentrin in wild type mouse embryonic fibroblasts (MEFs) (Figure 1–supplement figure 3) to illustrate these points.

      4) "Treatment with the microtubule depolymerizing drug nocodazole disrupted the perinuclear localization of GFP-kazrin C, as well as the concomitant perinuclear accumulation of EE (Fig. 5C & D), indicating that EEs and GFP-kazrin C localization at the pericentrosomal region required minus end-directed microtubule-dependent transport, mostly affected by the dynactin/dynein complex (Flores-Rodriguez et al., 2011)."

      • I don't agree that the nocodazole experiment indicates that minus end-directed motility is required for this perinuclear localization. In the absence of other experiments, it simply indicates that microtubules are required. It might, however, "suggest" the involvement of dynein. The same is true for the subsequent sentence ("Our observations indicated that kazrin C can be transported in and out of the pericentriolar region along microtubule tracks...").

      We agree with the reviewer. To reinforce the point that GFP-kazrin C localization and the pericentriolar accumulation of EEA1 rely on dynein-dependent transport, we have now added an experiment in figure 5E and F, where we use ciliobrevin to inhibit dynein in cells expressing GFP-kazrin C. In the treated cells, we see that the GFP-kazrin C staining in the pericentrin foci is lost and that EEs have a more dispersed distribution, similar to kazKO MEFs. We have also completed and rearranged the in vivo fluorescence microscopy data to show more clearly that small GFP-kazrin C foci can be observed moving towards the cell centre (Figure 5-S1 and movies 6 and 7). Taking all these data together, we think we can now suggest that kazrin might travel into the pericentriolar region, possibly along microtubules and powered by dynein.

      5) Although I see a few examples of directed motion of Tfn foci in the supplemental movies, it would be more useful to see the kymographs used for quantitation (and noted by the authors on line 272). Also related to this analysis, by "centripetal trajectories", I assume the authors are referring to those moving in a retrograde manner. If so, it would be more consistent with common vernacular (and thus more clear to readers) to use 'retrograde' transport.

      We have now included more examples of the time projections used in the analysis in figure 6-S1 and 2, where we have coloured in blue the fairly straight, longer trajectories, as opposed to the more confined movements that appeared as round dots in the time projections (coloured in red). We have also added more videos illustrating the differences observed in cells expressing endogenous or GFP-kazrin C versus kazKO cells or kazKO cells expressing GFP or GFP-kazrin C-Nt. Movies 8 and 11 show the endosome motility in representative WT and kazKO cells (movie 8) and kazKO cells expressing GFP, GFP-kazrin C or GFP-kazrin C-Nt (movie 11). Movies 9 and 10 show endosome motility in four magnified fields of different WT and kazKO cells, where longer and faster motility events can be observed when endogenous kazrin is expressed. Movies 12 to 14 show endosome motility in four magnified fields of different kazKO cells expressing GFP-kazrin C (movie 12), GFP (movie 13) and GFP-kazrin C-Nt (movie 14). Longer and faster movements can be observed in the different insets of movie 12, as compared with movies 13 and 14. Finally, as suggested by the reviewer, we have re-worded centripetal movement to retrograde movement throughout the manuscript.

      6) The error bars on most of the plots appear to be extremely small, especially in light of the accompanying data used for quantitation. The authors state that they used SEM instead of SD, but their reasoning is not stated. All the former does is lead to an artificial reduction in the real deviation (by dividing SD by the square root of whatever they define as 'n', which isn't clear to me) of the data which I find to be misleading and very nonrepresentative of biological data. For example, the error bars for cell migration speed in Figure 2B suggest that the speeds for WT cells ranged from ~1.7-1.9 µm/sec, which I'm assuming is largely underrepresenting the range of values. Although I'm not a statistician, as someone that studies biochemical and biological processes, I strongly urge the authors to use plots and error bars that more accurately describe the data to your readers (e.g., scatter plots with standard deviation are the most transparent way to display data).

      We have now changed all plots to scatter plots with standard deviations, as suggested.
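      The relationship between the two error measures discussed in this point can be shown with a short Python sketch. The measurements are hypothetical values chosen for illustration, not data from the manuscript, and the helper name is an assumption.

```python
import math
import statistics


def sd_and_sem(values):
    """Return the sample standard deviation and the standard error of the mean."""
    sd = statistics.stdev(values)        # spread of the individual measurements
    sem = sd / math.sqrt(len(values))    # uncertainty of the estimated mean
    return sd, sem


# Hypothetical measurements (arbitrary units), n = 4.
toy_values = [1.0, 1.5, 2.0, 2.5]
sd, sem = sd_and_sem(toy_values)
```

      Because the SEM is the SD divided by the square root of n, it shrinks as more measurements are pooled even when the spread of the individual values is unchanged, which is why SD error bars on scatter plots describe the data range more transparently.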

    1. Author Response

      Reviewer #2 (Public Review):

      Wang et al. elegantly exploit single-cell RNA-seq datasets to question the putative involvement of lncRNAs in human germ cell development. In the first part of the study, the authors use computational approaches to identify and characterize, from existing data, lncRNAs expressed in the germline. Of note, the scRNA-seq data used were generated from polyA+ RNAs, and thus non-polyadenylated lncRNAs could not be retrieved. Most of the lncRNAs identified in the germ cells and in the somatic cells of the gonads were previously unannotated. While this increases the catalog of lncRNA genes in the human genome, further characterization is needed to determine which fraction of these newly identified lncRNAs represent bona fide transcripts or transcriptional noise.

      Differential expression analysis between developmental stages, sexes, or cell types led to several observations: (i) whatever the stage of development, the number of expressed lncRNAs is higher in fetal germ cells compared to gonadal somatic cells; (ii) there is a continuous increase in the number of expressed lncRNA during the development of the germline; of note, a similar, although the more subtle trend is observed for protein-coding genes; (iii) the developmental stage at which there is the highest number of lncRNA expressed differs between male and female germ cells. While convincing, the significance of these observations is difficult to assess. However, the authors remain prudent with their conclusion and are not over-interpreting their findings.

      We appreciate Reviewer #2's precise summary of our analysis and thank them for highlighting the significance of these datasets for other researchers and future studies.

      Interestingly, integrating lncRNA expression to classify cell types led to the identification of a novel population of cells in the female germline that had not been revealed by protein-coding gene only-based classification. The biological relevance of this population, which cluster with mitotic populations, remains to be demonstrated. Finally, by examining lncRNA biotype, the authors could demonstrate an enrichment, in the germ cells, of the antisense head-to-head organization (in relation to the nearby protein-coding gene) compared to other biotypes. Whether this is different from the general distribution of lncRNA should be discussed.

      We analyzed the lncRNAs in the NONCODEv5 database (human genome). The XH type accounted for 21.73% of the intragenic lncRNA-mRNA pairs in NONCODEv5, which is lower than the 26.58% observed in fGC and the 26.23% observed in mGC (Response Figure 1).

      Response Figure 1. Genomic distribution and biotypes of the lncRNAs in NONCODEv5 database and lncRNAs expressed in human gonad.

      In the second part of the manuscript, Wang et al focus on one pair of divergent lncRNA-protein coding genes (LNC1845-LHX8). To document the choice of this particular pair, it would be informative to have its correlation score indicated in Figure 3C. The existence of this transcript was validated using female fetal ovaries, and its function was addressed in late primordial germ cells like cells (PGCLC) derived from human embryonic stem cells (hESCs). The authors have used an admirable set of orthogonal approaches that led them to conclude as to a role for LNC1845 in regulating in cis the nearby gene LHX8. They further went on to identify the underlying mechanisms, which involve modification of the chromatin landscape through direct interaction of LNC1845 with a histone modifier. Among the different strategies used (KO, stop transcription, overexpression), the shRNA-mediated knock-down is the only one to specifically address the function of the transcript itself, as opposed to the active transcription. The result of this experiment led the authors to conclude that the LNC1845 RNA is functional, a conclusion that is reinforced by the demonstration of physical interaction between the LNC1845 RNA and WDR5, a component of MLL methyltransferase complexes. The result of the KD experiment is however puzzling as RNAi has been shown not to be the method of choice for targeting nuclear lncRNAs (Lennox et al. NAR 2016).

      We thank Reviewer #2 for the suggestion to add the correlation score of the LNC1845-LHX8 pair; the Pearson correlation of this pair is 0.3268. We have added the number to Figure 4C because that is where the expression correlation of LNC1845 and LHX8 is first mentioned. Regarding the knockdown, we compared many similar studies and found that shRNA knockdown has been widely used to target nuclear lncRNAs (Guttman et al. Nature 2011; Luo et al. Cell Stem Cell 2016; Subhash et al. Nucleic Acids Res. 2018; Li et al. Genome Res 2021), with knockdown efficiencies that appear feasible and acceptable. The knockdown results are consistent with the deletion mutation and stop transcription approaches; all three showed that LNC1845 transcriptional expression is required for proper LHX8 expression in late PGCLCs.

      Overall, the functional investigation is convincing and strengthened by the inclusion of multiple clones for each approach, and by the convergence in the outcome of each individual approach. The depth of characterization is also remarkable. The analyses of the mechanisms at stake are somehow less solid, as there is less evidence demonstrating the involvement of the LNC1845 RNA and its interaction with WDR5.

      We have added more experimental evidence to strengthen the model, especially the interaction of LNC1845 and WDR5. Apart from the RIP-qPCR results of WDR5 demonstrating the enrichment of LNC1845 by WDR5 pulldown (Figure S8D), we performed a chromatin isolation by RNA purification (ChIRP) assay using antisense oligos along the entire LNC1845 transcript sequence. ChIRP results confirmed that WDR5 protein was enriched when anti-LNC1845 oligo probes were used to isolate the complex, but not in the controls without the probes or without overexpression of the LNC1845 transcript (Response Figure 2). Taken together, the findings of both approaches support the model that LNC1845 directly interacts with WDR5 to modulate the H3K4me3 modification for LHX8 transcriptional activation. (Related to supplementary figure 8D and 8E.)

      Response Figure 2. LNC1845 binding to WDR5 was verified by ChIRP-western blot.

      Altogether, this study provides a convincing demonstration of the role of a lncRNA on the regulation of a nearby gene in the context of the germline. However, to have a better understanding of the functionality of lncRNA genes in general, it would be interesting to know whether other pairs of lncRNA-PC genes have been functionally investigated in this context, where no function for the lncRNA gene could be demonstrated. Negative results are highly informative and if so, these could be included in the manuscript.

      We appreciate Reviewer #2's suggestion to add results for other lncRNA-PC gene pairs. In fact, we have analyzed and presented the results of two other pairs in figure 7D. The lncRNAs LNC3346 and LNC15266 were also transcriptionally regulated by FOXP3, and they may regulate their neighboring genes TMCO1 and MPP5, as shown in figure 7D. Our analysis showed that other lncRNA-PC gene pairs may be subject to transcriptional regulation similar to that of LNC1845-LHX8 during germ cell development.

    1. Author Response

      Reviewer #2 (Public Review):

      Charme is a long non-coding RNA reported by the authors in their previous studies. Their previous work, mainly using skeletal muscles as a model, showed the functional relevance of Charme, and presented data demonstrating its nuclear role, primarily via modulating the sub-nuclear localization of Matrin 3 (MATR3). Their data from skeletal muscles suggested that loss of the intronic region of Charme affects the local 3D genome organization, affecting MATR3 occupancy and this gene expression. Loss of Charme in vivo leads to cardiac defects. In this manuscript, they characterize the cardiac developmental defects and present molecular data supporting how the loss of Charme affects the cardiac transcriptome repertoire. Specifically, by performing whole transcriptome analysis in E12.5 hearts, they identify gene expression changes affected in developing hearts due to loss of Charme. Based on their previous study in skeletal muscles, they assume that Charme regulates cardiac gene expression primarily via MATR3 also in developing cardiomyocytes. They provide CLIP-seq data for MATR3 (transcriptome-wide foot printing of MATR3) in wild-type E15.5 hearts and connect the binding of MATR3 to gene expression changes observed in Charme knockout hearts. I credit the authors for providing CLIP seq data from in vivo embryonic samples, which is technically demanding.

      Major strengths:

      Although, as previously indicated by the authors in Charme knockout mice, the major strength is the effect of Charme on cardiac development. While the phenotype might be subtle, the functional data indicate that the role of Charme is essential for cardiac development and function. The combinatorial analysis of MATR3 CLIP-seq and transcriptional changes in the absence of Charme suggests a role of Charme that could be dependent on MATR3.

      We thank this reviewer for appreciating our methodological efforts and the importance of the MATR3 CLIP-seq data from in vivo embryonic samples.

      Weakness:

      (i) Nuclear lncRNAs often affect local gene expression by influencing the local chromatin.

      Charme locus is in close proximity to MYBPC2, which is essential for cardiac function, sarcomerogenesis, and sarcomere maintenance. It is important to rule out that the cardiac-specific developmental defects due to Charme loss are not due to (a) the influence of Charme on MYBPC2 or, of that matter, other neighboring genes, (b) local chromatin changes or enhancer-promoter contacts of MYBPC2 and other immediate neighbors (both aspects in the developmental time window when Charme expression is prominent in the heart, ideally from E11 to E15.5)

      Although cis-activity represents a mechanism of action for several lncRNAs, our previous work does not reveal this kind of activity for pCharme. To add stronger evidence, we have now analysed the expression of pCharme neighbouring genes in cardiac muscle. Genes were selected by focusing the analysis not only on the genes in “linear” proximity but also on possible chromatin contacts, which may point to candidates for in cis regulation. For this purpose, we made use of the analyses that were in progress in the meantime (to answer point iv) on available Hi-C datasets (Rosa-Garrido et al. 2017). Starting from a 1 Mb region around the Charme locus, we found that most of the interactions with Charme occur in a region spanning from 240 kb upstream to 115 kb downstream of Charme, for a total of 370 kb (Rev#2_Capture Fig. 1A). This region includes 39 genes, 9 of them expressed in the neonatal heart but none showing significant deregulation (see Table S2). Of note, this genomic region also includes the MYBPC2 locus, for which we did not find decreased expression in the heart in our RNA-seq data (Revised Figure 2-figure supplement 1C and Table S2). This trend was confirmed through RT-qPCR analyses of several genes from E15.5 extracts, which revealed no significant difference in their abundance upon Charme ablation (Rev#2_Capture Fig. 1B).

      Fig. 1. A) Contact map depicting Hi-C data of left ventricular mouse heart retrieved from GEO accession ID GSM2544836. Data for the 1 Mb region around the Charme locus were visualized using the Juicebox Web App (https://aidenlab.org/juicebox/). B) RT-qPCR quantification of Charme and its neighbouring genes in CharmeWT vs CharmeKO E15.5 hearts. Data were normalized to GAPDH mRNA and represent means ± SEM of WT and KO (n=3) pools. Data information: *p < 0.05; **p < 0.01; ***p < 0.001, unpaired Student’s t test.

      For a better understanding, we also checked possible “local” Charme activities in skeletal muscle cells, from previous datasets (Ballarino et al., 2018). We found that in murine C2C12 cells treated with two different gapmers against Charme, three of its neighbouring genes were expressed (Josd2, Emc10 and Pold1), but none showed significant alterations in their expression levels in response to Charme knock-down (Rev#2_Capture Fig. 2).

      Taken together, these results would exclude the possibility that in cis activity of Charme is responsible for the phenotype.

      Fig. 2: Average expression from RNA-seq (FPKM) quantification of Charme neighbouring genes in C2C12 differentiated myotubes treated with Gap-scr vs Gap-Charme. Values for Gap-Charme represent the average values of gene expression after treatment with two different gapmers (GAP-2 and GAP-2/3).

      (ii) The authors provide data indicating cardiac developmental defects in Charme knockouts. Detailed developmental phenotyping is missing, which is necessary to pinpoint the exact developmental milestones affected by Charme. This is critical when reporting the cell type/ organ-specific developmental function of a newly identified regulator.

      We did our best to answer this concern.

      Let us first emphasise that, since their generation, we have never observed any particular tissue alteration, morphological or physiological, when dissecting the CharmeKO animals other than the muscular ones. The high specificity of pCharme expression, as also shown here by ISH (Figure 1C-D, Figure 1-figure supplement 1A-B, Figure 3A), together with the minimal alteration applied to the locus for CRISPR-Cas-mediated KO (PolyA insertion), strongly excludes the presence of an alteration in other tissues and their involvement in the development of the phenotype.

      Nevertheless, we now add more developmental details to the cardiac phenotype (see also Essential revision point 2).

      1- First of all, gene expression analyses performed at E12.5, E15.5, E18.5 and neonatal (PN2) stages allowed us to identify, at the molecular level, the developmental time point at which CharmeKO effects on the cardiac muscle can be found. Our new results clearly indicate that the pCharme-mediated regulation of morphogenic and cardiac differentiation genes is detectable from the E15.5 fetal stage onward (Rev#2_Capture Fig. 3/Revised Figure 2E). Together with the analysis of pCharme targets, and consistent with the altered cardiac maturation and performance, this evidence is also supported by the analysis of the Myh6/Myh7 myosin ratio, whose decrease in CharmeKO hearts starts at E15.5 and reaches 69% of control levels at PN stages (Revised Figure 2F).

      2- Hematoxylin-eosin staining of dorso-ventral cryosections from CharmeWT and CharmeKO hearts confirmed the fetal malformation at the E15.5 stage (Revised Figure 2G). Moreover, the hypotrabeculation phenotype of CharmeKO hearts, which was initially examined by immunofluorescence, is now confirmed by the analysis of key trabecular markers (Irx3 and Sema3a), whose expression significantly decreases upon pCharme ablation (Rev#1_Capture Fig. 3B/Revised Figure 2-figure supplement 1G).

      3- Finally, the gene expression analysis of Ki-67, Birc5 and Ccna2 (Revised Figure 2-figure supplement 1E) definitively rules out an influence of pCharme ablation on cell-cycle genes and cardiomyocyte proliferation, thus allowing a more careful interpretation of the embryonic phenotype. Note that, consistent with the involvement of the lncRNA at later stages of development, the expression of important cardiac regulators, such as Gata4, Nkx2-5 and Tbx5, is not altered by its ablation at any of the tested time points (Rev#2_Capture Fig. 3), while the absence of pCharme mainly affects genes that are expressed downstream of these factors.

      These new results have been included in the revised version of the manuscript and better discussed.

      Fig. 3: RT-qPCR quantification of Gata4, Nkx2-5 and Tbx5 in CharmeWT and CharmeKO cardiac extracts at E12.5, E15.5 and E18.5 of embryonic development. Data were normalized to GAPDH mRNA and represent means ± SEM of WT and KO (n=3) pools.

      (iii) Along the same line, at the molecular level, the authors provide evidence indicating a change in the expression of genes involved in cardiogenesis and cardiac function. Based on changes in mRNA levels of the genes affected due to loss of Charme and based on immunofluorescence analysis of a handful of markers, they propose a role of Charme in cell cycle and maturation. Such claims could be toned down or warrant detailed experimental validation.

      See above, response to Reviewer #2 (Public Review) weakness (ii).

      (iv) The authors extrapolate the mechanistic finding they reported for Charme in skeletal muscle to the developing heart. While the data support this hypothesis, it falls short of extending the mechanistic understanding of Charme beyond the papers previously published by the authors. CLIP-seq data is a step in the right direction. MATR3 is a relatively abundant RBP, binding transcriptome-wide, mainly in intronic regions, based on currently available CLIP-seq data, as well as shown by the authors' own CLIP-seq in cardiomyocytes. It is also shown to regulate pre-mRNA splicing/alternative splicing along with PTB (PMID: 25599992) and 3D genome organization (PMID: 34716321). In addition, the authors propose a MATR3-dependent molecular function for Charme, primarily dependent on the intronic region of Charme and due to the binding of MATR3. Answering the following questions would enable a better mechanistic understanding of how Charme controls cardiac development.

      (i) What are the proximal genomic regions in 3D space to the Charme locus in embryonic cardiomyocytes? The authors can re-analyse published Hi-C data sets from embryonic cardiomyocytes or perform a 4-C experiment using the Charme locus for this purpose.

      See above, response to Reviewer #2 (Public Review) weakness (i).

      (ii) does the loss of Charme affect the splicing landscape of MATR3 bound pre-mRNAs in E12.5 ventricles in general and those arising from the NCTC region specifically?

      This is an intriguing issue, as also highlighted by new evidence showing that the reactivation of fetal-specific RNA-binding proteins, including MATR3, in the injured heart drives transcriptome-wide switches through the regulation of early steps of RNA transcription and processing (D'Antonio et al., 2022).

      Using the rMATS software on our neonatal RNA-Seq datasets, we then investigated the effect of pCharme depletion on splicing, with a focus on NCTC. As shown in Rev#2_Capture Fig. 4A, all classical splicing alterations were investigated, such as exon skipping, alternative 5’ splice site, alternative 3’ splice site, mutually exclusive exons and intron retention. Intriguingly, we did observe a slight alteration in the splicing patterns, in particular for exon skipping events (62%, corresponding to 381 events). Among them, the majority corresponded to exon exclusion events (237 events = 209 genes), while a smaller fraction corresponded to exon inclusion (144 events = 133 genes). Moreover, by intersecting these genes with the MATR3-bound RNAs, we found a marginally significant enrichment (p = 0.038) for exon inclusion (Rev#2_Capture Fig. 4B).

      Regarding the NCTC locus, we demonstrate that in the heart pCharme acts through different target genes. Indeed, none of the NCTC-arising transcripts are bound by MATR3 (see Table S4) or are substrates for alternative splicing regulation.

      While these results are very interesting for deepening the investigation of the pCharme/MATR3 interplay, their biological significance needs to be further investigated through one-by-one analysis of specific transcripts. As a continuation of the project, Nanopore sequencing of these samples on a MinION platform is currently under way in the lab to obtain a better characterization of alternative splicing events in response to the lncRNA ablation during development.

      Fig. 4: A) Left and middle panels: pie charts depicting the proportion of significantly altered (FDR < 0.05) splicing events detected by rMATS comparing neonatal CharmeWT and CharmeKO RNA-seq samples. All classical splicing alterations were investigated, such as exon skipping, alternative 3’ splice site (A3SS), intron retention, alternative 5’ splice site (A5SS) and mutually exclusive exons (MXE). Right panel: volcano plot depicting significant exon skipping events in CharmeKO (FDR < 0.05, PSI<0 for excluded and included exons, FDR >= 0.05 for invariant exons). The x-axis represents the exon-inclusion ratio or Percentage Spliced In (PSI), while the y-axis represents -log10 of the p-value. B) Pie charts representing the fraction of transcripts with at least one significantly excluded (left panel), invariant (middle panel) or included (right panel) exon that are bound by MATR3. P-values of MATR3 target enrichment for each comparison are shown below. Statistical significance was assessed with Fisher's exact test.
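
      The enrichment test named in the caption can be sketched as follows. This is a minimal illustration of a one-sided Fisher's exact test on a 2x2 table, using only the Python standard library. The function name and the example counts are hypothetical: the numbers of MATR3-bound and unbound transcripts are not given in this response, and it is not stated whether the authors used a one-sided or two-sided test.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided (enrichment) Fisher's exact test for the table [[a, b], [c, d]].

    Hypothetical layout, chosen for illustration only:
      a = transcripts in the exon set bound by MATR3
      b = transcripts in the exon set not bound
      c = background transcripts bound by MATR3
      d = background transcripts not bound
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # size of the tested set
    col1 = a + c          # total number of bound transcripts
    n = a + b + c + d     # all transcripts in the table
    denom = comb(n, row1)
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


# Toy counts (not data from the study): 3 of 4 set members are bound,
# against 1 of 4 in the background.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(round(p_value, 4))
```

      With real data, the four cells would be filled from the intersection of the rMATS gene lists with the MATR3 CLIP-seq targets; a statistics library would normally be used instead of a hand-written tail sum.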

      (iii) MATR3 binds DNA, as also shown by authors in previous studies. Is the MATR3 genomic binding altered by Charme loss in cardiomyocytes globally, as well as on the loci differentially expressed in Charme knockout heart? Overlapping MATR3 genomic binding changes and transcriptome binding changes to differentially expressed genes in the absence of Charme would better clarify the MATR3-centric mechanisms proposed here. Further connecting that to 3D genome changes due to Charme loss could provide needed clarity to the mechanistic model proposed here.

      Previous experience from our lab (Desideri et al., 2020) and others (Zeitz et al., 2009, J Cell Biochem) indicates that chromatin IP is not the most suitable approach for identifying MATR3-specific targets because of the broad distribution of MATR3 over the genome. Given the number of animals that would need to be sacrificed, we instead strengthened our MATR3 CLIP evidence by adding i) the CharmeKO MATR3 CLIP-seq control and ii) the combinatorial analysis of MATR3 CLIP-seq with the RNA-seq data.

      We have better explained the reasoning within the text, which now reads “The known ability of MATR3 to interact with both DNA and RNA and the high retention of pCharme on the chromatin may predict the presence of chromatin and/or specific transcripts within these MATR3-enriched condensates. In skeletal muscle cells, we have previously observed on a genome-wide scale, a global reduction of MATR3 chromatin binding in the absence of pCharme (Desideri et al., 2020). Nevertheless, the broad distribution of the protein over the genome made the identification of specific targets through MATR3-ChIP challenging.” (lines 274-279).

      Indeed, we found that MATR3 binding was significantly decreased on numerous peaks (434/626), while its increase was observed on a smaller fraction of regions (192/626) (Revised Figure 5C). As a control, we performed MATR3 motif enrichment analysis on the differentially bound regions, revealing its proximity to the peak summit (+/- 50 nt) (Revised Figure 5-figure supplement 1D), close to the strongest enrichment of MATR3, further confirming a direct and highly specific binding of the protein to these sites. To better characterise the relationship between MATR3 and pCharme, we then intersected the newly identified regions with the MATR3-bound transcripts whose expression was altered by Charme depletion. While gain peaks were equally distributed across DEGs, loss peaks were significantly enriched in a subset of pCharme down-regulated DEGs (Revised Figure 5D), suggesting a crosstalk between the lncRNA and the protein in regulating the expression of this specific group of genes. Interestingly, these RNAs mainly distribute across the same GO categories as pCharme downregulated DEGs and include genes, such as Cacna1c, Notch3, Myo18B and Rbm20, involved in embryo development and validated as pCharme/Matr3 targets in primary cardiac cells (Revised Figure 5D, lower panel and 5E).

    1. Author Response

      Reviewer 2 (Public Review):

      1) The authors developed a novel C. elegans model for studying extracellular amyloid beta aggregation, which is therefore likely to be taken up broadly by the field. However, the new model should be fully characterized. Throughout the manuscript, the only method to detect amyloid deposition was the GFP fluorescence intensity and morphology, while direct characterization of amyloid aggregates is lacking.

      We thank the reviewer for the feedback and the foresight that this model might be taken up by the field. To strengthen our model, as the reviewer suggested, we confirmed that the GFP fluorescence indeed reflects amyloid aggregates. Please see point 3 above and the new Supporting Figure 1.1.

      2) A targeted RNA interference (RNAi) screen was used to identify the key regulators of Aβ aggregation and clearance, which is one of the strengths of the study. There should be evidence that RNAi works to knockdown the specific genes. Similarly, there should be evidence indicating that ADM-2 is indeed expressed in the overexpression experiments.

      We aimed to verify our main hits (cri-2 and adm-2) with mutations in these genes, as RNAi can have off-target effects. The adm-2(ok3178) allele is a 989 bp deletion that causes a splice-acceptor change, probably resulting in a truncated and out-of-frame protein.

      Author response image 1.

      The cri-2(gk314) allele is a 1213 bp deletion covering the whole cri-2 locus, suggesting that it is a null allele.

      Author response image 2.

      For the overexpression, there is no ADM-2 antibody available. We tried to generate an ADM-2 antibody, unfortunately without success. Thus, we can only infer ADM-2 overexpression from the induction and higher red fluorescence of ADM-2::mScarlet (Supporting Figure 6.1).

      3) It remains unknown whether ADM-2 directly degrades Aβ or facilitates the clearance of Aβ by remodeling the ECM. The effect of ADM-2 on ECM remodeling should be examined.

      We addressed this in point 1 above and also in our discussion section.

    1. Author Response

      Reviewer #2 (Public Review):

      The time-dependency of the model simulations was not analyzed, and the nature of the observed biphasic time-dependent APAP response remains elusive. It would be interesting to see how the model can explain the time course of the APAP stimulation experiment.

      In its current state, the alternative model can only describe steady-state conditions. We understand that the reviewer is interested in the dynamic behavior of the model; however, our approach provides a proof of principle that the alternative model can phenomenologically explain the changes in YAP localization in response to APAP treatment. Modeling the Hippo pathway in a time-dependent manner in response to APAP treatment is very challenging and would require further investigation and, most notably, further development of the PDE simulation algorithms and the SME software. Such a technical update of the software algorithms is beyond the scope of this manuscript.

      Nevertheless, we decided to share our first and preliminary analyses on dynamic processes caused by APAP with the reviewer. For this, we simulated the steady state model in an arbitrary manner, where APAP initiates (early time-point) and slows down (late time-points) YAP phosphorylation in the nucleus (see Figure below).

      The simulated alternative model shows that an increase in YAP phosphorylation of about 50% leads to the cytoplasmic localization of YAP (Rebuttal Figure R5A/B). However, this shuttling is not detectable in our protein fractionation and live-cell imaging experiments (see also Rebuttal Figure R7C/D). At late time points, decreasing YAP phosphorylation (about 60%) led to a clear nuclear enrichment, and dephosphorylation of YAP was observed in our experiments. Thus, our mathematical model nicely describes the cellular events of Hippo pathway dynamics observed at later stages after APAP treatment (nuclear enrichment). However, early events cannot be completely explained (the suggested nuclear YAP exclusion is not detectable).

      We suggest two explanations for this observation. First, other molecular mechanisms (not yet identified and therefore not part of the model topology) oppose the nuclear exclusion of YAP that is expected at early time points. Second, the detection methods used in this study (Western blotting and live-cell imaging) cannot capture minimal changes and cellular heterogeneity in the chosen experimental setup. We clarify this aspect/limitation of our study in the discussion chapter of the manuscript (page 12, lines 436-440).

      Time-dependency of YAP (orange) localization based on the simulated APAP treatment. (A): Simulated control (ctrl) and APAP treatment for 2 and 48h. The treatment was simulated by changing the phosphorylation coefficient of YAP in the nucleus. (B): Simulated pYAP/YAP ratio during control and APAP treatment for 2 and 48 hours at the steady state of the model. (C): Simulated NCR of the total YAP during control and APAP treatment for 2 and 48 hours at the steady state.

    1. Author Response

      Reviewer #1 (Public Review):

      Because of the importance of brain and cognitive traits in human evolution, brain morphology and neural phenotypes have been the subject of considerable attention. However, work on the molecular basis of brain evolution has tended to focus on only a handful of species (i.e., human, chimp, rhesus macaque, mouse), whereas work that adopts a phylogenetic comparative approach (e.g., to identify the ecological correlates of brain evolution) has not been concerned with molecular mechanism. In this study, Kliesmete, Wange, and colleagues attempt to bridge this gap by studying protein and cis-regulatory element evolution for the gene TRNP1, across up to 45 mammals. They provide evidence that TRNP1 protein evolution rates and its ability to drive neural stem cell proliferation are correlated with brain size and/or cortical folding in mammals, and that activity of one TRNP1 cis-regulatory element may also predict cortical folding.

      There is a lot to like about this manuscript. Its broad evolutionary scope represents an important advance over the narrower comparisons that dominate the literature on the genetics of primate brain evolution. The integration of molecular evolution with experimental tests for function is also a strength. For example, showing that TRNP1 from five different mammals drives differences in neural stem cell proliferation, which in turn correlate with brain size and cortical folding, is a very nice result. At the same time, the paper is a good reminder of the difficulty of conclusively linking macroevolutionary patterns of trait evolution to molecular function. While TRNP1 is a moderate outlier in the correlation between rate of protein evolution and brain morphology compared to 125 other genes, this result is likely sensitive to how the comparison set is chosen; additionally, it's not clear that a correlation with evolutionary rate is what should be expected. Further, while the authors show that changes in TRNP1 sequence have functional consequences, they cannot show that these changes are directly responsible for size or folding differences, or that positive selection on TRNP1 is because of selection on brain morphology (high bars to clear). Nevertheless, their findings contribute strong evidence that TRNP1 is an interesting candidate gene for studying brain evolution. They also provide a model for how functional follow-up can enrich sequence-based comparative analysis.

      We thank the reviewer for the positive assessment. With respect to our set of control genes and the interpretation of the correlation between the evolution of the TRNP1 protein sequence and the evolution of brain size and gyrification, we would like to mention the following: we agree that the set is small, but we took all similarly sized genes with one coding exon that we could find in all 30 species. Furthermore, the control genes are well comparable to TRNP1 with respect to alignment quality and average omega (Figure 1-figure supplement 3). Hence, we think that the selection procedure and the actual omega distribution make them a valid, unbiased set to which TRNP1’s co-evolution with brain phenotypes can be compared. Moreover, we want to point out that by using Coevol, we correlate evolutionary rates, that is, the rate of protein evolution of TRNP1 as measured with omega and the rate of brain size evolution, which is modeled in Coevol as a Brownian motion process. We think that this was unclear in the previous version of our manuscript, and appreciate that the reviewer saw some merit in our analyses in spite of it.

      Finding conclusive evidence to link molecular evolution to concrete phenotypes is indeed difficult and necessarily inferential. This said, we still believe that correlating rates of evolution of phenotype and sequence across a phylogeny is one of the most convincing pieces of evidence available.

      Reviewer #2 (Public Review):

      In this paper, Kliesmete et al. analyze the protein and regulatory evolution of TRNP1, linking it to the evolution of brain size in mammals. We feel that this is very interesting and the conclusions are generally supported, with one concern.

      The comparison of dN/dS (omega) values to 125 control proteins is helpful, but an important factor was not controlled. The fraction of a protein in an intrinsically disordered region (IDR) is potentially even more important in affecting dN/dS than the protein length or number of exons. We suggest comparing dN/dS of TRNP1 to another control set, preferably at least ~500 proteins, which have similar % IDR.

      Thank you for this interesting suggestion. As mentioned in the public response to Reviewer #1, we are sorry that we did not explain the rationale of the approach very well in the previous version of the manuscript. As also argued above, we think that our control proteins are an unbiased set as they have a comparable alignment quality and an average omega (dN/dS) similar to TRNP1 (Figure 1-figure supplement 3). While IDR domains tend to have a higher omega than their respective non-IDR counterparts, we do not think that the IDR content should be more relevant than omega itself, as we do not interpret this estimate on its own, but its covariance with the rate of phenotypic change. Indeed, the proteins of our control set that have a higher IDR content (D2P2, Oates et al. 2013) do not show stronger evidence of coevolving with the brain phenotypes (IDR content vs. absolute brain size-omega partial correlation: Kendall's tau = 0.048, p-value = 0.45; IDR content vs. absolute GI-omega partial correlation: Kendall’s tau = -0.025, p-value = 0.68; 88 proteins (71%) contain >0% IDRs; 8 proteins contain >62% (TRNP1 content) IDRs).
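
      As an illustration of the rank statistic reported above, the following minimal sketch computes Kendall's tau from concordant and discordant pairs. It implements the simple tau-a variant, which ignores ties; the response does not state which variant or software was used, and a tie-corrected variant would be preferable when many proteins share an IDR content of 0%. The function name and the toy values are hypothetical and are not data from the study.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    n_pairs = n * (n - 1) / 2
    return (concordant - discordant) / n_pairs


# Toy example: hypothetical IDR fractions and hypothetical absolute
# partial correlations for four control proteins.
idr_fraction = [0.10, 0.20, 0.30, 0.40]
abs_partial_corr = [0.05, 0.15, 0.10, 0.30]
tau = kendall_tau_a(idr_fraction, abs_partial_corr)
print(round(tau, 3))
```

      A tau close to zero, as in the values quoted above, means that ranking proteins by IDR content says little about how strongly their omega covaries with the brain phenotype.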

      Reviewer #3 (Public Review):

      In this work, Z. Kliesmete, L. Wange and colleagues investigate TRNP1 as a gene of potential interest for the evolution of the mammalian cortex. Previous evidence suggests that TRNP1 is involved in self-renewal, proliferation and expansion in cortical cells in mouse and ferret, making this gene a good candidate for evolutionary investigation. The authors designed an experimental scheme to test two non-exclusive hypotheses: first, that evolution of the TRNP1 protein is involved in the apparition of larger and more convoluted brains; and second, that regulation of the TRNP1 gene also plays a role in this process alongside protein evolution.

      The authors report that the rate of TRNP1 protein evolution is strongly correlated to brain size and gyrification, with species with larger and more convoluted brains having more divergent sequences at this gene locus. The correlation with body mass was not as strong, suggesting a functional link between TRNP1 and brain evolution. The authors directly tested the effects of sequence changes by transfecting the TRNP1 sequences from 5 different species in mouse neural stem cells and quantifying cell proliferation. They show that both human and dolphin sequences induce higher proliferation, consistent with larger brain sizes and gyrifications in these two species. Then, the authors identified six potential cis-regulatory elements around the TRNP1 gene that are active in human fetal brain, and that may be involved in its regulation. To investigate whether sequence evolution at these sites results in changes in TRNP1 expression, the authors performed a massively parallel reporter assay using sequences from 75 mammals at these six loci. The authors report that one of the cis-regulatory elements drives reporter expression levels that are somewhat correlated to gyrification in catarrhine monkeys. Consistent with the activity of this cis-regulatory sequence in the fetal brain, the authors report that this element contains binding sites for TFs active in brain development, and contains stronger binding sites for CTCF in catarrhine monkeys than in other species. However, the specificity or functional relevance of this signal is unclear.

      Altogether, this is an interesting study that combines evolutionary analysis and molecular validation in cell cultures using a variety of well-designed assays. The main conclusions - that TRNP1 is likely involved in brain evolution in mammals - are mostly well supported, although the involvement of gene regulation in this process remains inconclusive.

      Strengths:

      • The authors have done a good deal of resequencing and data polishing to ensure that they obtained high-quality sequences for the TRNP1 gene in each species, which enabled a higher confidence investigation of this locus.

      • The statistical design is generally well done and appears robust.

      • The combination of evolutionary analysis and in vivo validation in neural precursor cells is interesting and powerful, and goes beyond the majority of studies in the field. I also appreciated that the authors investigated both protein and regulatory evolution at this locus in significant detail, including performing a MPRA assay across species, which is an interesting strategy in this context.

      Weaknesses:

      • The authors report that TRNP1 evolves under positive selection, however this seems to be the case for many of the control proteins as well, which suggests that the signal is non-specific and possibly due to misspecifications in the model.

      • The evidence for a higher regulatory activity of the intronic cis-regulatory element highlighted by the authors is fairly weak: correlation across species is only 0.07, consistent with the rapid evolution of enhancers in mammals, and the correlation in catarrhine monkeys seems driven by a couple of outlier datapoints across the 10 species. It is unclear whether false discovery rates were controlled for in this analysis.

      • The analysis of the regulatory content in this putative enhancer provides some tangential evidence but no reliable conclusions regarding the involvement of regulatory changes at this locus in brain evolution.

      We thank the reviewer for the detailed comments. Indeed, TRNP1 overall has a rather average omega value across the tree and hence also the proportion of sites under selection is not hugely increased compared to the control proteins. This is good because we want to have comparable power to detect a correlation between the rate of protein evolution (omega) and the rate of brain size or GI evolution for TRNP1 and the control proteins. Indeed, what makes TRNP1 special is the rather strong correlation between the rate of brain size change and omega, which was only stronger in 4% of our control proteins. Hence, we do not agree with the weakness of model misspecification for TRNP1 protein evolution.

      We agree that the correlation of the activity induced by the intronic cis-regulatory element (CRE) with gyrification is weak, but we dispute that the correlation is due to outliers (see residual plot below) or violations of model assumptions (see new permutation analysis in the Results section). There are many reasons why we would expect such a correlation to be weak, including that an MPRA takes the CRE out of its natural genomic context. Our conclusions do not solely rest on those statistics, but also on independent corroborating evidence: Reilly et al. (2015) found a difference in the activity of the TRNP1 intron between human and macaque samples during brain development. Furthermore, we used their and other public data to show that the intron CRE is indeed active in humans and bound by CTCF (new Figure 4 - figure supplement 2).
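
      The idea behind a permutation analysis of a correlation can be sketched as follows. This is a generic, minimal example that shuffles one variable to build a null distribution for the Pearson correlation; it is not the authors' analysis, which is described in their Results section and may, for example, account for phylogeny. The function names, the number of permutations and the toy values are assumptions made for illustration.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y.

    y is shuffled n_perm times; the p-value is the fraction of shuffles
    whose |r| is at least the observed |r| (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy values for 10 hypothetical species: reporter activity and a
# gyrification index that is perfectly linear in it.
activity = [float(i) for i in range(10)]
gyrification = [2.0 * a + 1.0 for a in activity]
p_perm = permutation_p_value(activity, gyrification)
print(p_perm)
```

      Because the null distribution is built from the data themselves, this kind of test does not rely on normally distributed residuals, which is why it is a common check against violations of model assumptions.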

      We believe that the combined evidence suggests a likely role for the intron CRE in the co-evolution of TRNP1 with gyrification.

    1. Author Response

      Reviewer #1 (Public Review):

      Trudel and colleagues aimed to uncover the neural mechanisms of estimating the reliability of the information from social agents and non-social objects. By combining functional MRI with a behavioural experiment and computational modelling, they demonstrated that learning from social sources is more accurate and robust compared with that from non-social sources. Furthermore, dmPFC and pTPJ were found to track the estimated reliability of the social agents (as opposed to the non-social objects). The strength of this study is to devise a task consisting of the two experimental conditions that were matched in their statistical properties and only differed in their framing (social vs. non-social). The novel experimental task allows researchers to directly compare the learning from social and non-social sources, which is a prominent contribution of the present study to social decision neuroscience.

      Thank you so much for your positive feedback about our work. We are delighted that you found that our manuscript provided a prominent contribution to social decision neuroscience. We really appreciate your time to review our work and your valuable comments that have significantly helped us to improve our manuscript further.

      One of the major weaknesses is the lack of a clear description about the conceptual novelty. Learning about the reliability/expertise of social and non-social agents has been of considerable concern in social neuroscience (e.g., Boorman et al., Neuron 2013; and Wittmann et al., Neuron 2016). The authors could do a better job in clarifying the novelty of the study beyond the previous literature.

      We understand the reviewer’s comment and have made changes to the manuscript that, first, highlight more strongly the novelty of the current study. Crucially, second, we have also supplemented the data analyses with a new model-based analysis of the differences in behaviour in the social and non-social conditions which we hope makes clearer, at a theoretical level, why participants behave differently in the two conditions.

      There has long been interest in investigating whether ‘social’ cognitive processes are special or unique compared to ‘non-social’ cognitive processes and, if they are, what makes them so. Differences between conditions could arise during the input stage (e.g. the type of visual input that is processed by social and non-social systems), at the algorithm stage (e.g. the type of computational principles that underpin social versus non-social processes) or, even if identical algorithms are used, social and non-social processes might depend on distinct anatomical brain areas or neurons within brain areas. Here, we conducted multiple analyses (in Figures 2, 3, and 4 in the revised manuscript and in Figure 2 – figure supplement 1, Figure 3 – figure supplement 1, Figure 4 – figure supplement 3, Figure 4 – figure supplement 4) that not only demonstrated basic similarities in mechanism that generalised across social and non-social contexts, but also demonstrated important quantitative differences that were linked to activity in specific brain regions associated with the social condition. The additional analyses (Figure 4 – figure supplement 3, Figure 4 – figure supplement 4) show that differences are not simply a consequence of differences in the visual stimuli that are inputs to the two systems, nor does the type of algorithm differ between conditions. Instead, our results suggest that the precise manner in which an algorithm is implemented differs when learning about social or non-social information and that this is linked to differences in neuroanatomical substrates.

      The previous studies mentioned by the reviewer are, indeed, relevant ones and were, of course, part of the inspiration for the current study. However, there are crucial differences between them and the current study. In the case of the previous studies by Wittmann, the aim was a very different one: to understand how one’s own beliefs, for example about one’s performance, and beliefs about others, for example about their performance levels, are combined. Here, instead, we were interested in the similarities and differences between social and non-social learning. It is true that the question resembles the one addressed by Boorman and colleagues in 2013, who looked at how people learned about the advice offered by people or computer algorithms, but the difference in the framing of that study perhaps contributed to the authors’ finding of little difference in learning. By contrast, in the present study we found evidence that people were predisposed to perceive stability in social performance and to be uncertain about non-social performance. By accumulating evidence across multiple analyses, we show that there are quantitative differences in how we learn about social versus non-social information, and that these differences can be linked to the way in which learning algorithms are implemented neurally. We therefore contend that our findings extend our previous understanding of how, in relation to other learning processes, ‘social’ learning has both shared and special features.

      We would like to emphasize the way in which we have extended several of the analyses throughout the revision. The theoretical Bayesian framework has made it possible to simulate key differences in behaviour between the social and non-social conditions. We explain in our point-by-point reply below how we have integrated a substantial number of new analyses. We have also more carefully related our findings to previous studies in the Introduction and Discussion.

      Introduction, page 4:

      [...] Therefore, by comparing information sampling from social versus non-social sources, we address a long-standing question in cognitive neuroscience, the degree to which any neural process is specialized for, or particularly linked to, social as opposed to non-social cognition 2–9. Given their similarities, it is expected that both types of learning will depend on common neural mechanisms. However, given the importance and ubiquity of social learning, it may also be that the neural mechanisms that support learning from social advice are at least partially specialized and distinct from those concerned with learning that is guided by nonsocial sources. However, it is less clear on which level information is processed differently when it has a social or non-social origin. It has recently been argued that differences between social and non-social learning can be investigated on different levels of Marr’s information processing theory: differences could emerge at an input level (in terms of the stimuli that might drive social and non-social learning), at an algorithmic level or at a neural implementation level 7. It might be that, at the algorithmic level, associative learning mechanisms are similar across social and non-social learning 1. Other theories have argued that differences might emerge because goal-directed actions are attributed to social agents which allows for very different inferences to be made about hidden traits or beliefs 10. Such inferences might fundamentally alter learning about social agents compared to non-social cues.

      Discussion, page 15:

      […] One potential explanation for the assumption of stable performance for social but not non-social predictors might be that participants attribute intentions and motivations to social agents. Even if the social and non-social evidence are the same, the belief that a social actor might have a goal may affect the inferences made from the same piece of information 10. Social advisors first learnt about the target’s distribution and accordingly gave advice on where to find the target. If the social agents are credited with goal-directed behaviour then it might be assumed that the goals remain relatively constant; this might lead participants to assume stability in the performances of social advisors. However, such goal-directed intentions might not be attributed to non-social cues, thereby making judgments inherently more uncertain and changeable across time. Such an account, focussing on differences in attribution in social settings, aligns with a recent suggestion that any attempt to identify similarities or differences between social and non-social processes can occur at any one of a number of the levels in Marr’s information processing theory 7. Here we found that the same algorithm was able to explain social and non-social learning (a qualitatively similar computational model could explain both). However, the extent to which the algorithm was recruited when learning about social compared to non-social information differed. We observed a greater impact of uncertainty on judgments about social compared to non-social information. We have shown evidence for a degree of specialization when assessing social advisors as opposed to non-social cues. At the neural level we focused on two brain areas, dmPFC and pTPJ, that have not only been shown to carry signals associated with belief inferences about others but, in addition, recent combined fMRI-TMS studies have demonstrated the causal importance of these activity patterns for the inference process […]

      Another weakness is the lack of justifications of the behavioural data analyses. It is difficult for me to understand why 'performance matching' is suitable for an index of learning accuracy. I understand the optimal participant would adjust the interval size with respect to the estimated reliability of the advisor (i.e., angular error); however, I am wondering if the optimal strategy for participants is to exactly match the interval size with the angular error. Furthermore, the definitions of 'confidence adjustment across trials' and 'learning index' look arbitrary.

      First, having read the reviewer’s comments, we realise that our choice of the term ‘performance matching’ may not have been ideal as it indeed might not be the case that the participant intended to directly match their interval sizes with their estimates of advisor/predictor error. Like the reviewer, our assumption is simply that the interval sizes should change as the estimated reliability of the advisor changes and, therefore, that the intervals that the participants set should provide information about the estimates that they hold and the manner in which they evolve. On re-reading the manuscript we realised that we had not used the term ‘performance matching’ consistently or in many places in the manuscript. In the revised manuscript we have simply removed it altogether and referred to the participants’ ‘interval setting’.

      Most of the initial analyses in Figure 2a-c aim to better understand the raw behaviour before applying any computational model to the data. We were interested in how participants make confidence judgments (decision-making per se), but also how they adapt their decisions with additional information (changes or learning in decision making). In the revised manuscript we have made clear that these are used as simple behavioural measures and that they will be complemented later by more analyses derived from more formal computational models.

      In what we now refer to as the ‘interval setting’ analysis (Figure 2a), we tested whether participants select their interval settings differently in the social compared to non-social condition. We observe that participants set their intervals closer to the true angular error of the advisor/predictor in the social compared to the non-social condition. This observation could arise in two ways. First, it could be due to quantitative differences in learning despite general, qualitative similarity: mechanisms are similar but participants differ quantitatively in the way that they learn about non-social information and social information. Second, it could, however, reflect fundamentally different strategies. We tested basic performance differences by comparing the mean reward between conditions. There was no difference in reward between conditions (mean reward: paired t-test social vs. non-social, t(23)= 0.8, p=0.4, 95% CI= [-0.007 0.016]), suggesting that interval setting differences might not simply reflect better or worse performance in social or non-social contexts but instead might reflect quantitative differences in the processes guiding interval setting in the two cases.

      In the next set of analyses, in which we compared raw data, applied a computational model, and provided a theoretical account for the differences between conditions, we suggest that there are simple quantitative differences in how information is processed in social and nonsocial conditions but that these have the important impact of making long-term representations – representations built up over a longer series of trials – more important in the social condition. This, in turn, has implications for the neural activity patterns associated with social and non-social learning. We, therefore, agree with the reviewer, that one manner of interval setting is indeed not more optimal than another. However, the differences that do exist in behaviour are important because they reveal something about the social and non-social learning and its neural substrates. We have adjusted the wording and interpretation in the revised manuscript.

      Next, we analysed interval setting with two additional, related analyses: interval setting adjustment across trials and derivation of a learning index. We tested the degree to which participants adjusted their interval setting across trials and according to the prediction error (learning index, Figure 2f); the latter analysis is very similar to a trial-wise learning rate calculated in previous studies 11. In contrast to many other studies, the intervals set by participants provide information about the estimates that they hold in a simple and direct way and enable calculation of a trial-wise learning index. We therefore decided to call it a ‘learning index’ instead of a ‘learning rate’, as it is not estimated via a model applied to the data but is instead calculated directly from the data. Arguably the directness of the approach, and its lack of dependence on a specific computational model, is a strength of the analysis.
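      To make the idea of a model-free, trial-wise learning index concrete, the following minimal sketch computes one plausible version of it. The exact formula is not given in this response, so the definition used here is an assumption borrowed from the delta rule: the change in interval from one encounter to the next, divided by the prediction error (observed angular error minus the interval that was set). The function name `learning_index` and the toy numbers are hypothetical and serve only as illustration.

```python
def learning_index(intervals, angular_errors):
    """Trial-wise learning index under an ASSUMED delta-rule style definition.

    index_t = (interval[t+1] - interval[t]) / (angular_error[t] - interval[t])

    This is an illustrative definition, not necessarily the one used in the
    manuscript. Trials with zero prediction error are returned as None
    because the ratio is undefined there.
    """
    indices = []
    for t in range(len(intervals) - 1):
        prediction_error = angular_errors[t] - intervals[t]
        if prediction_error == 0:
            indices.append(None)
            continue
        update = intervals[t + 1] - intervals[t]
        indices.append(update / prediction_error)
    return indices


# Toy values (degrees), purely illustrative.
intervals = [10.0, 14.0, 13.0, 13.0]
angular_errors = [18.0, 12.0, 13.0, 15.0]
li = learning_index(intervals, angular_errors)
print(li)
```

      Because the index is a ratio of two observed quantities, it needs no fitted parameters, which is the property the response highlights.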

      Subsequently in the manuscript, a new analysis (illustrated in new Figure 3) employs Bayesian models that can simulate the differences in the social and non-social conditions and demonstrate that a number of behavioural observations can arise simply as a result of differences in noise in each trial-wise Bayesian update (Figure 3 and specifically 3d; Figure 3 – figure supplement 1b-c). In summary, the descriptive analyses in Figure 2a-c aid an intuitive understanding of the differences in behaviour in the social and non-social conditions. We have then repeated these analyses with Bayesian models incorporating different noise levels and showed that in such a way, the differences in behaviour between social and non-social conditions can be mimicked (please see next section and manuscript for details).

      We adjusted the wording in a number of sections in the revised manuscript such as in the legend of Figure 2 (figures and legend), Figure 4 (figures and legend).

      Main text, page 5:

      The confidence interval could be changed continuously to make it wider or narrower, by pressing buttons repeatedly (one button press resulted in a change of one step in the confidence interval). In this way participants provided what we refer to as an ’interval setting’.

      We also adjusted the following section in Main text, page 6:

      Confidence in the performance of social and non-social advisors

      We compared trial-by-trial interval setting in relation to the social and non-social advisors/predictors. When setting the interval, the participant’s aim was to minimize it while ensuring it still encompassed the final target position; points were won when it encompassed the target position but were greater when it was narrower. A given participant’s interval setting should, therefore, change in proportion to the participant’s expectations about the predictor’s angular error and their uncertainty about those expectations. Even though, on average, social and non-social sources did not differ in the precision with which they predicted the target (Figure 2 – figure supplement 1), participants gave interval settings that differed in their relationships to the true performances of the social advisors compared to the non-social predictors. The interval setting was closer to the angular error in the social compared to the non-social sessions (Figure 2a, paired t-test: social vs. non-social, t(23)= -2.57, p= 0.017, 95% confidence interval (CI)= [-0.36 -0.4]). Differences in interval setting might be due to generally lower performance in the nonsocial compared to social condition, or potentially due to fundamentally different learning processes utilised in either condition. We compared the mean reward amounts obtained by participants in the social and non-social conditions to determine whether there were overall performance differences. There was, however, no difference in the reward received by participants in the two conditions (mean reward: paired t-test social vs. non-social, t(23)= 0.8, p=0.4, 95% CI= [-0.007 0.016]), suggesting that interval setting differences might not simply reflect better or worse performance

      Discussion, page 14:

      Here, participants did not match their confidence to the likely accuracy of their own performance, but instead to the performance of another social or non-social advisor. Participants used different strategies when setting intervals to express their confidence in the performances of social advisors as opposed to non-social advisors. A possible explanation might be that participants have a better insight into the abilities of social cues – typically other agents – than non-social cues – typically inanimate objects.

      As the authors assumed simple Bayesian learning for the estimation of reliability in this study, the degree/speed of the learning should be examined with reference to the distance between the posterior and prior belief in the optimal Bayesian inference.

      We thank the reviewer for this suggestion. We agree with the reviewer that further analyses that aim to disentangle the underlying mechanisms that might differ between both social and non-social conditions might provide additional theoretical contributions. We show additional model simulations and analyses that aim to disentangle the differences in more detail. These new results allowed clearer interpretations to be made.

      In the current study, we showed that judgments made about non-social predictors were changed more strongly as a function of the subjective uncertainty: participants set a larger interval, indicating lower confidence, when they were more uncertain about the non-social cue’s accuracy to predict the target. In response to the reviewer’s comments, the new analyses were aimed at understanding under which conditions such a negative uncertainty effect might emerge.

      Prior expectations of performance

      First, we compared whether participants had different prior expectations in the social condition compared to the non-social condition. One way to compare prior expectations is by comparing the first interval set for each advisor/predictor. This is a direct readout of the initial prior expectation with which participants approach our two conditions. In such a way, we test whether the prior beliefs before observing any social or non-social information differ between conditions. Even though this does not test the impact of prior expectations on subsequent belief updates, it does test whether participants have generally different expectations about the performance of social advisors or non-social predictors. There was no difference in this measure between social or non-social cues (Figure below; paired t-test social vs. non-social, t(23)= 0.01, p=0.98, 95% CI= [-0.067 0.68]).

      Figure. Confidence interval for the first encounter of each predictor in social and non-social conditions. There was no initial bias in predicting the performance of social or non-social predictors.

      Learning across time

      We have now seen that participants do not have an initial bias when predicting performances in social or non-social conditions. This suggests that differences between conditions might emerge across time when encountering predictors multiple times. We tested whether inherent differences in how beliefs are updated according to new observations might result in different impacts of uncertainty on interval setting between social and non-social conditions. More specifically, we tested whether the integration of new evidence differed between social and non-social conditions; for example, recent observations might be weighted more strongly for non-social cues while past observations might be weighted more strongly for social cues. This approach was inspired by the reviewer’s comments about potential differences in the speed of learning as well as the reduction of uncertainty with increasing predictor encounters. Similar ideas were tested in previous studies, when comparing the learning rate (i.e. the speed of learning) in environments of different volatilities 12,13. In these studies, a smaller learning rate was prevalent in stable environments during which reward rates change more slowly over time, while higher learning rates often reflect learning in volatile environments so that recent observations have a stronger impact on behaviour. Even though most studies derived these learning rates with reinforcement learning models, similar ideas can be translated into a Bayesian model. For example, an established way of changing the speed of learning in a Bayesian model is to introduce noise during the update process 14. This noise is equivalent to adding in some of the initial prior distribution and this will make the Bayesian updates more flexible to adapt to changing environments. It will widen the belief distribution and thereby make it more uncertain. Recent information has more weight on the belief update within a Bayesian model when beliefs are uncertain. This increases the speed of learning. In other words, a wide distribution (after adding noise) allows for quick integration of new information. On the contrary, a narrow distribution does not integrate new observations as strongly and instead relies more heavily on previous information; this corresponds to a small learning rate. So, we would expect a steep decline of uncertainty to be related to a smaller learning index while a slower decline of uncertainty is related to a larger learning index. We hypothesized that participants reduce their uncertainty more quickly when observing social information, thereby anchoring more strongly on previous beliefs instead of integrating new observations flexibly. Vice versa, we hypothesized a less steep decline of uncertainty when observing non-social information, indicating that new information can be flexibly integrated during the belief update (new Figure 3a).
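      The link between belief width and speed of learning can be illustrated with a textbook conjugate Gaussian update, in which the effective learning rate is the prior variance divided by the sum of prior and observation variance. This is a deliberately simplified stand-in, not the grid-based model used in the manuscript; the function name and all numbers below are illustrative assumptions.

```python
def gaussian_update(prior_mean, prior_var, obs, obs_var):
    """One conjugate Gaussian belief update (simplified illustration).

    Returns the posterior mean, posterior variance and the effective
    learning rate, i.e. the weight given to the new observation.
    """
    learning_rate = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + learning_rate * (obs - prior_mean)
    post_var = (1.0 - learning_rate) * prior_var
    return post_mean, post_var, learning_rate


# Same observation, same observation noise, different prior widths.
_, _, lr_wide = gaussian_update(prior_mean=20.0, prior_var=100.0, obs=30.0, obs_var=25.0)
_, _, lr_narrow = gaussian_update(prior_mean=20.0, prior_var=4.0, obs=30.0, obs_var=25.0)

print(f"wide prior:   learning rate = {lr_wide:.3f}")
print(f"narrow prior: learning rate = {lr_narrow:.3f}")
```

      A wide (uncertain) belief moves most of the way towards the new observation, whereas a narrow belief barely moves, which is the intuition behind relating a steep decline of uncertainty to a smaller learning index.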

      We modified the original Bayesian model (Figure 2d, Figure 2 – figure supplement 2) by adding a uniform distribution (equivalent to our prior distribution) to each belief update – we refer to this as noise addition to the Bayesian model 14,21. We varied the amount of noise over the range δ = [0,1], where δ = 0 corresponds to the original Bayesian model and δ = 1 to a very noisy Bayesian model. The uniform distribution was selected to match the first prior belief before any observation was made (equation 2). This δ range resulted in a continuous increase of subjective uncertainty around the belief about the angular error (Figure 3b-c). The modified posterior distribution denoted as 𝑝′(σ|x) was derived at each trial as follows:

      We applied each noisy Bayesian model to participants’ choices within the social and nonsocial condition.

      The addition of a uniform distribution changed two key features of the belief distribution: first, the width of the distribution remains larger with additional observations, thereby making it possible to integrate new observations more flexibly. To show this more clearly, we extracted the model-derived uncertainty estimate across multiple encounters of the same predictor for the original model and the fully noisy Bayesian model (Figure 3 – figure supplement 1). The model-derived ‘uncertainty estimate’ of a noisy Bayesian model decays more slowly compared to the ‘uncertainty estimate’ of the original Bayesian model (upper panel). Second, the model-derived ‘accuracy estimate’ reflects more recent observations in a noisy Bayesian model compared to the ‘accuracy estimate’ derived from the original Bayesian model, which integrates past observations more strongly (lower panel). Hence, as mentioned beforehand, a rapid decay of uncertainty implies a small learning index; or in other words, stronger integration of past compared to recent observations.
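      The effect of adding a uniform distribution to each belief update can be sketched on a discrete grid. The code below is an illustration under stated assumptions rather than the manuscript's implementation: the likelihood is taken to be a zero-mean Gaussian over angular errors with standard deviation σ, the noisy posterior is assumed to be the normalised sum of the ordinary posterior and δ times a uniform distribution, and the observations are toy values. The names `noisy_update`, `belief_sd` and `run` are hypothetical.

```python
import math


def noisy_update(belief, sigmas, x, delta):
    """One grid-based Bayesian update followed by ASSUMED noise addition.

    belief : current probability for each candidate sigma
    x      : observed angular error (degrees)
    delta  : weight of the uniform distribution (0 = original model)
    """
    # Likelihood of x if the predictor's error SD were sigma (assumed Gaussian).
    likelihood = [math.exp(-0.5 * (x / s) ** 2) / s for s in sigmas]
    posterior = [b * l for b, l in zip(belief, likelihood)]
    total = sum(posterior)
    posterior = [p / total for p in posterior]
    # Noise addition: mix in a uniform distribution, then renormalise.
    uniform = 1.0 / len(sigmas)
    mixed = [p + delta * uniform for p in posterior]
    total = sum(mixed)
    return [m / total for m in mixed]


def belief_sd(belief, sigmas):
    """Width of the belief distribution, used here as the uncertainty estimate."""
    mean = sum(b * s for b, s in zip(belief, sigmas))
    var = sum(b * (s - mean) ** 2 for b, s in zip(belief, sigmas))
    return math.sqrt(var)


sigmas = [1.0 + i for i in range(90)]  # candidate SDs: 1..90 degrees
observations = [20.0, -15.0, 25.0, -30.0, 10.0, 18.0, -22.0, 12.0]  # toy data


def run(delta):
    belief = [1.0 / len(sigmas)] * len(sigmas)  # flat initial prior
    for x in observations:
        belief = noisy_update(belief, sigmas, x, delta)
    return belief_sd(belief, sigmas)


sd_original = run(0.0)
sd_noisy = run(1.0)
print(f"uncertainty after 8 observations: original={sd_original:.2f}, noisy={sd_noisy:.2f}")
```

      With these assumptions the noisy model retains a wider belief distribution after the same observations, consistent with the slower decay of the uncertainty estimate described above.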

      In the following analyses, we tested whether an increasingly noisy Bayesian model mimics behaviour that is observed in the non-social compared to social condition. For example, we tested whether an increasingly noisy Bayesian model also exhibits a strongly negative ‘predictor uncertainty’ effect on interval setting (Figure 2e). In such a way, we can test whether differences in noise in the updating process of a Bayesian model might reproduce important qualitative differences in learning-related behaviour seen in the social and nonsocial conditions.

      We used these modified Bayesian models to simulate trial-wise interval setting for each participant according to the observations they made when selecting a particular advisor or non-social cue. We simulated interval setting at each trial and examined whether an increase in noise produced model behaviours that resembled participant behaviour patterns observed in the non-social condition as opposed to the social condition. At each trial, we used the accuracy estimate (Methods, equation 6), which represents a subjective belief about a single angular error, to derive an interval setting for the selected predictor. To do so, we first derived the point-estimate of the belief distribution at each trial (Methods, equation 6) and multiplied it by the size of one interval step on the circle. The step size was derived by dividing the circle size by the maximum number of possible steps. Here is an example of transforming an accuracy estimate into an interval: let’s assume the belief about the angular error at the current trial is 50 degrees (Methods, equation 6). Now, we are trying to transform this number into an interval for the current predictor on a given trial. To obtain the size of one interval step, the circle size (360 degrees) is divided by the maximum number of interval steps (40 steps; note, 20 steps on each side), which results in nine degrees that represents the size of one interval step. Next, the accuracy estimate in radians (0.87) is multiplied by the step size in radians (0.1571), resulting in an interval of 0.137 radians or 7.85 degrees. The final interval size would be 7.85 degrees.
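      The worked example above can be reproduced in a few lines. The function name `accuracy_to_interval_deg` and its parameter names are hypothetical; the arithmetic follows the steps described in the text (step size = circle size divided by the maximum number of steps, both quantities converted to radians, then multiplied).

```python
import math


def accuracy_to_interval_deg(accuracy_deg, circle_deg=360.0, max_steps=40):
    """Convert an accuracy estimate (degrees) into an interval size (degrees)."""
    step_deg = circle_deg / max_steps           # 360 / 40 = 9 degrees per step
    accuracy_rad = math.radians(accuracy_deg)   # 50 degrees is about 0.87 rad
    step_rad = math.radians(step_deg)           # 9 degrees is about 0.1571 rad
    interval_rad = accuracy_rad * step_rad      # about 0.137 rad
    return math.degrees(interval_rad)


interval = accuracy_to_interval_deg(50.0)
print(round(interval, 2))  # 7.85
```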

      Simulating Bayesian choices in that way, we repeated the behavioural analyses (Figure 2b,e,f) to test whether intervals derived from more noisy Bayesian models mimic intervals set by participants in the non-social condition: greater changes in interval setting across trials (Figure 3 – figure supplement 1b), a negative ‘predictor uncertainty' effect on interval setting (Figure 3 – figure supplement 1c), and a higher learning index (Figure 3d).

      First, we repeated the most crucial analysis, the linear regression analysis (Figure 2e), and hypothesized that intervals that were simulated from noisy Bayesian models would also show a greater negative ‘predictor uncertainty’ effect on interval setting. This was indeed the case: irrespective of social or non-social conditions, the addition of noise (increased weighting of the uniform distribution in each belief update) led to an increasingly negative ‘predictor uncertainty’ effect on confidence judgment (new Figure 3d). In Figure 3d, we show the regression weights (y-axis) for the ‘predictor uncertainty’ on confidence judgment with increasing noise (x-axis). This result is highly consistent with the idea that in the non-social condition the manner in which task estimates are updated is more uncertain and more noisy. By contrast, social estimates appear relatively more stable, also according to this new Bayesian simulation analysis.

      This new finding extends the results and suggests a formal computational account of the behavioural differences between social and non-social conditions. Increasing the noise of the belief update mimics behaviour that is observed in the non-social condition: an increasingly negative effect of ‘predictor uncertainty’ on confidence judgment. Notably, there was no difference in the impact that the noise had in the social and non-social conditions. This was expected because the Bayesian simulations are blind to the framing of the conditions. However, it means that the observed effects do not depend on the precise sequence of choices that participants made in these conditions. It therefore suggests that an increase in the Bayesian noise leads to an increasingly negative impact of ‘predictor uncertainty’ on confidence judgments irrespective of the condition. Hence, we can conclude that different degrees of uncertainty within the belief update are a reasonable explanation for the differences observed between social and non-social conditions.

      Next, we used these simulated confidence intervals and repeated the descriptive behavioural analyses to test whether interval settings that were derived from more noisy Bayesian models mimic behavioural patterns observed in non-social compared to social conditions. For example, more noise in the belief update should lead to more flexible integration of new information and hence should potentially lead to a greater change of confidence judgments across predictor encounters (Figure 2b). Further, a greater reliance on recent information should lead prediction errors to be reflected more strongly in the next confidence judgment; hence, it should result in a higher learning index in the non-social condition, which we hypothesize to be perceived as more uncertain (Figure 2f). We used the simulated confidence intervals from Bayesian models on a continuum of noise integration (i.e. different weighting of the uniform distribution in the belief update) and again derived both absolute confidence change and learning indices (Figure 3 – figure supplement 1b-c).

      ‘Absolute confidence change’ and ‘learning index’ increase with increasing noise weight, thereby mimicking the difference between social and non-social conditions. Further, these analyses demonstrate the tight relationship between descriptive analyses and model-based analyses. They show that noise in the Bayesian updating process is a conceptual explanation that can account for both the differences in learning and the difference in uncertainty processing that exist between social and non-social conditions. The key insight conveyed by the Bayesian simulations is that a wider, more uncertain belief distribution changes more quickly. Correspondingly, in the non-social condition, participants express more uncertainty in their confidence estimate when they set the interval, and they also change their beliefs more quickly as expressed in a higher learning index. Therefore, noisy Bayesian updating can account for key differences between social and non-social conditions.

      We thank the reviewer for making this point, as we believe that these additional analyses allow theoretical inferences to be made in a more direct manner; we think that it has significantly contributed towards a deeper understanding of the mechanisms involved in the social and non-social conditions. Further, it provides a novel account of how we make judgments when being presented with social and non-social information.

      We made substantial changes to the main text, figures and supplementary material to include these changes:

      Main text, page 10-11 new section:

      The impact of noise in belief updating in social and non-social conditions

      So far, we have shown that, in comparison to non-social predictors, participants changed their interval settings about social advisors less drastically across time, relied on observations made further in the past, and were less impacted by their subjective uncertainty when they did so (Figure 2). Using Bayesian simulation analyses, we investigated whether a common mechanism might underlie these behavioural differences. We tested whether the integration of new evidence differed between social and non-social conditions; for example, recent observations might be weighted more strongly for non-social cues while past observations might be weighted more strongly for social cues. Similar ideas were tested in previous studies, when comparing the learning rate (i.e. the speed of learning) in environments of different volatilities 12,13. We tested these ideas using established ways of changing the speed of learning during Bayesian updates 14,21. We hypothesized that participants reduce their uncertainty more quickly when observing social information. Vice versa, we hypothesized a less steep decline of uncertainty when observing non-social information, indicating that new information can be flexibly integrated during the belief update (Figure 5a).

      We manipulated the amount of uncertainty in the Bayesian model by adding a uniform distribution to each belief update (Figure 3b-c) (equation 10,11). Consequently, the distribution’s width increases and is more strongly impacted by recent observations (see example in Figure 3 – figure supplement 1). We used these modified Bayesian models to simulate trial-wise interval setting for each participant according to the observations they made by selecting a particular advisor in the social condition or other predictor in the nonsocial condition. We simulated confidence intervals at each trial. We then used these to examine whether an increase in noise led to simulation behaviour that resembled behavioural patterns observed in non-social conditions that were different to behavioural patterns observed in the social condition.

      First, we repeated the linear regression analysis and hypothesized that interval settings that were simulated from noisy Bayesian models would also show a greater negative ‘predictor uncertainty’ effect on interval setting resembling the effect we had observed in the nonsocial condition (Figure 2e). This was indeed the case when using the noisy Bayesian model: irrespective of social or non-social condition, the addition of noise (increasing weight of the uniform distribution to each belief update) led to an increasingly negative ‘predictor uncertainty’ effect on confidence judgment (new Figure 3d). The absence of a difference between the social and non-social conditions in the simulations suggests that an increase in the Bayesian noise is sufficient to induce a negative impact of ‘predictor uncertainty’ on interval setting. Hence, we can conclude that different degrees of noise in the updating process are sufficient to cause differences observed between social and non-social conditions. Next, we used these simulated interval settings and repeated the descriptive behavioural analyses (Figure 2b,f). An increase in noise led to greater changes of confidence across time and a higher learning index (Figure 3 – figure supplement 1b-c). In summary, the Bayesian simulations offer a conceptual explanation that can account for both the differences in learning and the difference in uncertainty processing that exist between social and non-social conditions. The key insight conveyed by the Bayesian simulations is that a wider, more uncertain belief distribution changes more quickly. Correspondingly, in the non-social condition, participants express more uncertainty in their confidence estimate when they set the interval, and they also change their beliefs more quickly. Therefore, noisy Bayesian updating can account for key differences between social and non-social conditions.

      Methods, page 23 new section:

      Extension of Bayesian model with varying amounts of noise

      We modified the original Bayesian model (Figure 2d, Figure 2 – figure supplement 2) to test whether the integration of new evidence differed between social and non-social conditions; for example, recent observations might be weighted more strongly for non-social cues while past observations might be weighted more strongly for social cues. [...] To obtain the size of one interval step, the circle size (360 degrees) is divided by the maximum number of interval steps (40 steps; note, 20 steps on each side), which results in nine degrees that represents the size of one interval step. Next, the accuracy estimate in radians (0.87) is multiplied by the step size in radians (0.1571), resulting in an interval of 0.137 radians or 7.85 degrees. The final interval size would be 7.85.
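
      The interval-size arithmetic in this passage can be checked with a short worked example. The variable names are illustrative; only the numbers (360 degrees, 40 steps, an accuracy estimate of 0.87) come from the text. Carrying full precision gives roughly 7.83 degrees; the quoted 7.85 degrees corresponds to converting the already rounded value of 0.137 radians.

```python
import math

CIRCLE_DEG = 360.0       # full circle, from the text
MAX_STEPS = 40           # maximum number of interval steps (20 on each side)
accuracy_estimate = 0.87 # example accuracy estimate quoted in the text

# Size of one interval step, in degrees and in radians.
step_deg = CIRCLE_DEG / MAX_STEPS
step_rad = math.radians(step_deg)

# Interval size: accuracy estimate scaled by the step size.
interval_rad = accuracy_estimate * step_rad
interval_deg = math.degrees(interval_rad)

print(f"step: {step_deg:.1f} deg = {step_rad:.4f} rad")
print(f"interval: {interval_rad:.3f} rad = {interval_deg:.2f} deg")
```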

      We repeated behavioural analyses (Figure 2b,e,f) to test whether confidence intervals derived from more noisy Bayesian models mimic behavioural patterns observed in the non-social condition: greater changes of confidence across trials (Figure 3 – figure supplement 1b), a greater negative ‘predictor uncertainty’ effect on confidence judgment (Figure 3 – figure supplement 1c) and a greater learning index (Figure 3d).

      Discussion, page 14: […] It may be because we make just such assumptions that past observations are used to predict performance levels that people are likely to exhibit next 15,16. An alternative explanation might be that participants experience a steeper decline of subjective uncertainty in their beliefs about the accuracy of social advice, resulting in a narrower prior distribution, during the next encounter with the same advisor. We used a series of simulations to investigate how uncertainty about beliefs changed from trial to trial and showed that belief updates about non-social cues were consistent with a noisier update process that diminished the impact of experiences over the longer term. From a Bayesian perspective, greater certainty about the value of advice means that contradictory evidence will need to be stronger to alter one’s beliefs. In the absence of such evidence, a Bayesian agent is more likely to repeat previous judgments. Just as in a confirmation bias 17, such a perspective suggests that once we are more certain about others’ features, for example, their character traits, we are less likely to change our opinions about them.

      Reviewer #2 (Public Review):

      Humans learn about the world both directly, by interacting with it, and indirectly, by gathering information from others. There has been a longstanding debate about the extent to which social learning relies on specialized mechanisms that are distinct from those that support learning through direct interaction with the environment. In this work, the authors approach this question using an elegant within-subjects design that enables direct comparisons between how participants use information from social and non-social sources. Although the information presented in both conditions had the same underlying structure, participants tracked the performance of the social cue more accurately and changed their estimates less as a function of prediction error. Further, univariate activity in two regions (dmPFC and pTPJ) tracked participants' confidence judgments more closely in the social than in the non-social condition, and multivariate patterns of activation in these regions contained information about the identity of the social cues.

      Overall, the experimental approach and model used in this paper are very promising. However, after reading the paper, I found myself wanting additional insight into what these condition differences mean, and how to place this work in the context of prior literature on this debate. In addition, some additional analyses would be useful to support the key claims of the paper.

      We thank the reviewer for their very supportive comments. We have addressed their points below and have highlighted changes in our manuscript that we made in response to the reviewer’s comments.

      (1) The framing should be reworked to place this work in the context of prior computational work on social learning. Some potentially relevant examples:

      • Shafto, Goodman & Frank (2012) provide a computational account of the domain-specific inductive biases that support social learning. In brief, what makes social learning special is that we have an intuitive theory of how other people's unobservable mental states lead to their observable actions, and we use this intuitive theory to actively interpret social information. (There is also a wealth of behavioral evidence in children to support this account; for a review, see Gweon, 2021).

      • Heyes (2012) provides a leaner account, arguing that social and non-social learning are supported by a common associative learning mechanism, and what distinguishes social from non-social learning is the input mechanism. Social learning becomes distinctively "social" to the extent that organisms are biased or attuned to social information.

      I highlight these papers because they go a step beyond asking whether there is any difference between mechanisms that support social and non-social learning; they also provide concrete proposals about what that difference might be, and what might be shared. I would like to see this work move in a similar direction.

      References
      (In the interest of transparency: I am not an author on these papers.)

      Gweon, H. (2021). Inferential social learning: how humans learn from others and help others learn. PsyArXiv. https://doi.org/10.31234/osf.io/8n34t

      Heyes, C. (2012). What's social about social learning?. Journal of Comparative Psychology, 126(2), 193.

      Shafto, P., Goodman, N. D., & Frank, M. C. (2012). Learning from others: The consequences of psychological reasoning for human learning. Perspectives on Psychological Science, 7(4), 341-351.

      Thank you for this suggestion to expand our framing. We have now made substantial changes to the Discussion and Introduction to include additional background literature, including the relevant references suggested by the reviewer, that addresses the differences between social and non-social learning. We further related our findings to other discussions in the literature that argue that differences between social and non-social learning might occur at the level of algorithms (the computations involved in social and non-social learning) and/or implementation (the neural mechanisms). Here, we describe behaviour with the same algorithm (Bayesian model), but the weighting of uncertainty on decision-making differs between social and non-social contexts. This might be explained by similar ideas put forward by Shafto and colleagues (2012), who suggest that differences between social and non-social learning might be due to the attribution of goal-directed intention to social agents, but not non-social cues. Such an attribution might lead participants to assume that advisor performances will be relatively stable under the assumption that they should have relatively stable goal-directed intentions. We also show differences at the implementational level in social and non-social learning in TPJ and dmPFC.

      Below we list the changes we have made to the Introduction and Discussion. Further, we would also like to emphasize the substantial extension of the Bayesian modelling which we think clarifies the theoretical framework used to explain the mechanisms involved in social and non-social learning (see our answer to the next comments below).

      Introduction, page 4:

      [...] Therefore, by comparing information sampling from social versus non-social sources, we address a long-standing question in cognitive neuroscience, the degree to which any neural process is specialized for, or particularly linked to, social as opposed to non-social cognition 2–9. Given their similarities, it is expected that both types of learning will depend on common neural mechanisms. However, given the importance and ubiquity of social learning, it may also be that the neural mechanisms that support learning from social advice are at least partially specialized and distinct from those concerned with learning that is guided by non-social sources.

      However, it is less clear on which level information is processed differently when it has a social or non-social origin. It has recently been argued that differences between social and non-social learning can be investigated on different levels of Marr’s information processing theory: differences could emerge at an input level (in terms of the stimuli that might drive social and non-social learning), at an algorithmic level or at a neural implementation level 7. It might be that, at the algorithmic level, associative learning mechanisms are similar across social and non-social learning 1. Other theories have argued that differences might emerge because goal-directed actions are attributed to social agents which allows for very different inferences to be made about hidden traits or beliefs 10. Such inferences might fundamentally alter learning about social agents compared to non-social cues.

      Discussion, page 15:

      […] One potential explanation for the assumption of stable performance for social but not non-social predictors might be that participants attribute intentions and motivations to social agents. Even if the social and non-social evidence are the same, the belief that a social actor might have a goal may affect the inferences made from the same piece of information 10. Social advisors first learnt about the target’s distribution and accordingly gave advice on where to find the target. If the social agents are credited with goal-directed behaviour then it might be assumed that the goals remain relatively constant; this might lead participants to assume stability in the performances of social advisors. However, such goal-directed intentions might not be attributed to non-social cues, thereby making judgments inherently more uncertain and changeable across time. Such an account, focussing on differences in attribution in social settings aligns with a recent suggestion that any attempt to identify similarities or differences between social and non-social processes can occur at any one of a number of the levels in Marr’s information theory 7. Here we found that the same algorithm was able to explain social and non-social learning (a qualitatively similar computational model could explain both). However, the extent to which the algorithm was recruited when learning about social compared to non-social information differed. We observed a greater impact of uncertainty on judgments about social compared to non-social information. We have shown evidence for a degree of specialization when assessing social advisors as opposed to non-social cues. At the neural level we focused on two brain areas, dmPFC and pTPJ, that have not only been shown to carry signals associated with belief inferences about others but, in addition, recent combined fMRI-TMS studies have demonstrated the causal importance of these activity patterns for the inference process […]

      (2) The results imply that dmPFC and pTPJ differentiate between learning from social and non-social sources. However, more work needs to be done to rule out simpler, deflationary accounts. In particular, the condition differences observed in dmPFC and pTPJ might reflect low-level differences between the two conditions. For example, the social task could simply have been more engaging to participants, or the social predictors may have been more visually distinct from one another than the fruits.

      We understand the reviewer’s concern regarding low-level distinctions between the social and non-social conditions that could confound the differences in neural activation observed between conditions in pTPJ and dmPFC. From the reviewer’s comments, we understand that there might be two potential confounders. First, stimuli within one condition might be more visually distinct from each other than stimuli within the other condition, so that greater visual distinctiveness alone might lead to learning differences between conditions. Second, stimuli in one condition might be more engaging and potentially lead to attentional differences between conditions. We used a combination of univariate and multivariate analyses to address both concerns.

      Analysis 1: Univariate analysis to inspect potential unaccounted variance between social and non-social condition

      First, we used the existing univariate analysis (exploratory MRI whole-brain analysis, see Methods) to test for neural activation that covaried with attentional differences – or any other unaccounted neural difference -- between conditions. If there were neural differences between conditions that we are currently not accounting for with the parametric regressors that are included in the fMRI-GLM, then these differences should be captured in the constant of the GLM model. For example, if there are attentional differences between conditions, then we could expect to see neural differences between conditions in areas such as inferior parietal lobe (or other related areas that are commonly engaged during attentional processes).

      Importantly, inspection of the constant of the GLM model should capture any unaccounted differences, whether they are due to attention or alternative processes that might differ between conditions. When inspecting cluster-corrected differences in the constant of the fMRI-GLM model during the setting of the confidence judgment, there was no cluster-significant activation that differed between social and non-social conditions (Figure 4 – figure supplement 4a; results were familywise-error cluster-corrected at p<0.05 using a cluster-defining threshold of z>2.3). For transparency, we show the sub-threshold activation map across the whole brain (z > 2) for the ‘constant’ contrasted between social and non-social condition (i.e. constant, contrast: social – non-social).

      For transparency, we additionally used an ROI approach to test differences in activation patterns that correlated with the constant during the confidence phase; that is, we used the same ROI approach as in the paper to avoid any biased test selection. We compared activation patterns between social and non-social conditions in the same ROIs as used before: dmPFC (MNI-coordinate [x/y/z: 2,44,36] 16) and bilateral pTPJ (70% probability anatomical mask; for reference see manuscript, page 23). We additionally compared activation patterns between conditions in bilateral IPLD (50% probability anatomical mask, 20). We did not find significantly different activation patterns between social and non-social conditions in any of these areas: dmPFC (confidence constant; paired t-test social vs non-social: t(23) = 0.06, p=0.96, [-36.7, 38.75]), bilateral TPJ (confidence constant; paired t-test social vs non-social: t(23) = -0.06, p=0.95, [-31, 29]), bilateral IPLD (confidence constant; paired t-test social vs non-social: t(23) = -0.58, p=0.57, [-30.3, 17.1]).
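
      For readers unfamiliar with the test reported here, a paired t-test compares per-participant differences between two conditions. The sketch below computes only the t statistic and degrees of freedom using the standard library; the function name `paired_t` and the four-value toy data are hypothetical and are not the study's ROI estimates.

```python
import math
import statistics

def paired_t(condition_a, condition_b):
    """Return (t, degrees of freedom) for a paired-samples t-test.

    Each position in the two lists must belong to the same participant.
    """
    if len(condition_a) != len(condition_b):
        raise ValueError("paired samples must have the same length")
    differences = [a - b for a, b in zip(condition_a, condition_b)]
    n = len(differences)
    mean_diff = statistics.mean(differences)
    # Sample standard deviation (n - 1 in the denominator).
    sd_diff = statistics.stdev(differences)
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1

# Toy values standing in for per-participant ROI estimates (assumed data).
social = [1.0, 2.0, 3.0, 4.0]
non_social = [1.0, 2.0, 3.0, 5.0]

t_value, dof = paired_t(social, non_social)
print(f"t({dof}) = {t_value:.2f}")
```

      With 24 participants, as in the reported t(23) values, the same function would return 23 degrees of freedom; a p-value would additionally require the t distribution, which is omitted here.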

      There were no meaningful activation patterns that differed between conditions in either areas commonly linked to attention (eg IPL) or in brain areas that were the focus of the study (dmPFC and pTPJ). Activation in dmPFC and pTPJ covaried with parametric effects such as the confidence that was set at the current and previous trial, and did not correlate with low-level differences such as attention. Hence, these results suggest that activation between conditions was captured better by parametric regressors such as the trial-wise interval setting, i.e. confidence, and are unlikely to be confounded by low-level processes that can be captured with univariate neural analyses.

      Analysis 2: RSA to test visual distinctiveness between social and non-social conditions

      We addressed the reviewer’s other comment directly by testing whether potential differences between conditions might arise due to a varying degree of visual distinctiveness in one stimulus set compared to the other stimulus set. We used RSA analysis to inspect potential differences in early visual processes that should be impacted by greater stimulus similarity within one condition. In other words, we tested whether the visual distinctiveness of one stimulus set was different to the visual distinctiveness of the other stimulus set. We used RSA analysis to compare the Exemplar Discriminability Index (EDI) between conditions in early visual areas. We compared the dissimilarity of neural activation related to the presentation of an identical stimulus across trials (diagonal in RSA matrix) with the dissimilarity in neural activation between different stimuli across trials (off-diagonal in RSA matrix). If stimuli within one stimulus set are very similar, then the difference between the diagonal and off-diagonal should be very small and less likely to be significant (i.e. similar diagonal and off-diagonal values). In contrast, if stimuli within one set are very distinct from each other, then the difference between the diagonal and off-diagonal should be large and likely to result in a significant EDI (i.e. different diagonal and off-diagonal values) (see Figure 4g for schematic illustration). Hence, if there is a difference in the visual distinctiveness between social and non-social conditions, then this difference should result in different EDI values for both conditions; visual distinctiveness between the stimulus sets can therefore be tested by comparing the EDI values between conditions within early visual processing. We used a Harvard-cortical ROI mask based on bilateral V1. Negative EDI values indicate that the same exemplars are represented more similarly in the neural V1 pattern than different exemplars. This analysis showed that there was no significant difference in EDI between conditions (Figure 4 – figure supplement 4b; EDI paired sample t-test: t(23) = -0.16, p=0.87, 95% CI [-6.7, 5.7]).
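
      The diagonal versus off-diagonal comparison described above can be sketched as follows. This is a simplified illustration, not the toolbox computation used in the study: it assumes a square cross-trial dissimilarity matrix and defines the index as mean diagonal minus mean off-diagonal dissimilarity, which matches the sign convention stated in the text (negative values mean that the same exemplars are represented more similarly). The function name and the example matrices are hypothetical.

```python
def exemplar_discriminability_index(rdm):
    """Mean diagonal minus mean off-diagonal dissimilarity.

    rdm[i][j] is the dissimilarity between the pattern for exemplar i in one
    set of trials and the pattern for exemplar j in another set of trials.
    """
    n = len(rdm)
    if n < 2 or any(len(row) != n for row in rdm):
        raise ValueError("expected a square matrix with at least two exemplars")
    diagonal = [rdm[i][i] for i in range(n)]
    off_diagonal = [rdm[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(diagonal) / len(diagonal) - sum(off_diagonal) / len(off_diagonal)

# Visually distinct stimulus set: same-exemplar dissimilarities are low.
distinct_set = [
    [0.1, 0.9, 0.8],
    [0.9, 0.2, 0.7],
    [0.8, 0.7, 0.3],
]

# Indistinguishable stimulus set: diagonal and off-diagonal values match.
similar_set = [
    [0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5],
]

print(exemplar_discriminability_index(distinct_set))
print(exemplar_discriminability_index(similar_set))
```

      Under this convention, a set whose exemplars cannot be told apart gives an index near zero, and a more distinct set gives a more negative index; comparing the per-participant indices between conditions is then a paired test.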

      We have further replicated results in V1 with a whole-brain searchlight analysis, averaging across both social and non-social conditions.

      In summary, by using a combination of univariate and multivariate analyses, we could test whether neural activation might be different when participants were presented with facial or fruit stimuli and whether these differences might confound observed learning differences between conditions. We did not find meaningful neural differences that were not accounted for with the regressors included in the GLM. Further, we did not find differences in the visual distinctiveness between the stimulus sets. Hence, these control analyses suggest that differences between social and non-social conditions might not arise because of differences in low-level processes but are instead more likely to develop when learning about social or non-social information.

      Moreover, we also examined behaviourally whether participants differed in the way they approached the social and non-social conditions. We tested whether there were initial biases prior to learning, i.e. before actually receiving information from either social or non-social information sources. Therefore, we tested whether participants have different prior expectations about the performance of social compared to non-social predictors. We compared the confidence judgments at the first trial of each predictor. We found that participants set confidence intervals very similarly in social and non-social conditions (Figure below). Hence, it did not seem to be the case that differences between conditions arose due to low-level differences in stimulus sets or prior differences in expectations about performances of social compared to non-social predictors. However, we can show that differences between conditions are apparent when updating one’s belief about social advisors or non-social cues and, as a consequence, in the way that confidence judgments are set across time.

      Figure. Confidence interval for the first encounter of each predictor in social and non-social conditions. There was no initial bias in predicting the performance of social or non-social predictors.

      Main text page 13:

      […] Additional control analyses show that neural differences between social and non-social conditions were not due to the visually different set of stimuli used in the experiment but instead represent fundamental differences in processing social compared to non-social information (Figure 4 – figure supplement 4). These results are shown in ROI-based RSA analysis and in whole-brain searchlight analysis. In summary, the univariate and multivariate analyses together demonstrate that dmPFC and pTPJ represent beliefs about social advisors that develop over a longer timescale and encode the identities of the social advisors.

      References

      1. Heyes, C. (2012). What’s social about social learning? Journal of Comparative Psychology 126, 193–202. 10.1037/a0025180.
      2. Chang, S.W.C., and Dal Monte, O. (2018). Shining Light on Social Learning Circuits. Trends in Cognitive Sciences 22, 673–675. 10.1016/j.tics.2018.05.002.
      3. Diaconescu, A.O., Mathys, C., Weber, L.A.E., Kasper, L., Mauer, J., and Stephan, K.E. (2017). Hierarchical prediction errors in midbrain and septum during social learning. Soc Cogn Affect Neurosci 12, 618–634. 10.1093/scan/nsw171.
      4. Frith, C., and Frith, U. (2010). Learning from Others: Introduction to the Special Review Series on Social Neuroscience. Neuron 65, 739–743. 10.1016/j.neuron.2010.03.015.
      5. Frith, C.D., and Frith, U. (2012). Mechanisms of Social Cognition. Annu. Rev. Psychol. 63, 287–313. 10.1146/annurev-psych-120710-100449.
      6. Grabenhorst, F., and Schultz, W. (2021). Functions of primate amygdala neurons in economic decisions and social decision simulation. Behavioural Brain Research 409, 113318. 10.1016/j.bbr.2021.113318.
      7. Lockwood, P.L., Apps, M.A.J., and Chang, S.W.C. (2020). Is There a ‘Social’ Brain? Implementations and Algorithms. Trends in Cognitive Sciences, S1364661320301686. 10.1016/j.tics.2020.06.011.
      8. Soutschek, A., Ruff, C.C., Strombach, T., Kalenscher, T., and Tobler, P.N. (2016). Brain stimulation reveals crucial role of overcoming self-centeredness in self-control. Sci. Adv. 2, e1600992. 10.1126/sciadv.1600992.
      9. Wittmann, M.K., Lockwood, P.L., and Rushworth, M.F.S. (2018). Neural Mechanisms of Social Cognition in Primates. Annu. Rev. Neurosci. 41, 99–118. 10.1146/annurev-neuro-080317-061450.
      10. Shafto, P., Goodman, N.D., and Frank, M.C. (2012). Learning From Others: The Consequences of Psychological Reasoning for Human Learning. Perspect Psychol Sci 7, 341–351. 10.1177/1745691612448481.
      11. McGuire, J.T., Nassar, M.R., Gold, J.I., and Kable, J.W. (2014). Functionally Dissociable Influences on Learning Rate in a Dynamic Environment. Neuron 84, 870–881. 10.1016/j.neuron.2014.10.013.
      12. Behrens, T.E.J., Woolrich, M.W., Walton, M.E., and Rushworth, M.F.S. (2007). Learning the value of information in an uncertain world. Nature Neuroscience 10, 1214–1221. 10.1038/nn1954.
      13. Meder, D., Kolling, N., Verhagen, L., Wittmann, M.K., Scholl, J., Madsen, K.H., Hulme, O.J., Behrens, T.E.J., and Rushworth, M.F.S. (2017). Simultaneous representation of a spectrum of dynamically changing value estimates during decision making. Nat Commun 8, 1942. 10.1038/s41467-017-02169-w.
      14. Allenmark, F., Müller, H.J., and Shi, Z. (2018). Inter-trial effects in visual pop-out search: Factorial comparison of Bayesian updating models. PLoS Comput Biol 14, e1006328. 10.1371/journal.pcbi.1006328.
      15. Wittmann, M., Trudel, N., Trier, H.A., Klein-Flügge, M., Sel, A., Verhagen, L., and Rushworth, M.F.S. (2021). Causal manipulation of self-other mergence in the dorsomedial prefrontal cortex. Neuron.
      16. Wittmann, M.K., Kolling, N., Faber, N.S., Scholl, J., Nelissen, N., and Rushworth, M.F.S. (2016). Self-Other Mergence in the Frontal Cortex during Cooperation and Competition. Neuron 91, 482–493. 10.1016/j.neuron.2016.06.022.
      17. Kappes, A., Harvey, A.H., Lohrenz, T., Montague, P.R., and Sharot, T. (2020). Confirmation bias in the utilization of others’ opinion strength. Nat Neurosci 23, 130–137. 10.1038/s41593-019-0549-2.
      18. Trudel, N., Scholl, J., Klein-Flügge, M.C., Fouragnan, E., Tankelevitch, L., Wittmann, M.K., and Rushworth, M.F.S. (2021). Polarity of uncertainty representation during exploration and exploitation in ventromedial prefrontal cortex. Nat Hum Behav. 10.1038/s41562-020-0929-3.
      19. Yu, Z., Guindani, M., Grieco, S.F., Chen, L., Holmes, T.C., and Xu, X. (2022). Beyond t test and ANOVA: applications of mixed-effects models for more rigorous statistical analysis in neuroscience research. Neuron 110, 21–35. 10.1016/j.neuron.2021.10.030.
      20. Mars, R.B., Jbabdi, S., Sallet, J., O’Reilly, J.X., Croxson, P.L., Olivier, E., Noonan, M.P., Bergmann, C., Mitchell, A.S., Baxter, M.G., et al. (2011). Diffusion-Weighted Imaging Tractography-Based Parcellation of the Human Parietal Cortex and Comparison with Human and Macaque Resting-State Functional Connectivity. Journal of Neuroscience 31, 4087–4100. 10.1523/JNEUROSCI.5102-10.2011.
      21. Yu, A.J., and Cohen, J.D. Sequential effects: Superstition or rational behavior? 8.
      22. Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., and Kriegeskorte, N. (2014). A Toolbox for Representational Similarity Analysis. PLoS Comput Biol 10, e1003553. 10.1371/journal.pcbi.1003553.
      23. Lockwood, P.L., Wittmann, M.K., Nili, H., Matsumoto-Ryan, M., Abdurahman, A., Cutler, J., Husain, M., and Apps, M.A.J. (2022). Distinct neural representations for prosocial and self-benefiting effort. Current Biology 32, 4172-4185.e7. 10.1016/j.cub.2022.08.010.
    1. Author Response

      Reviewer #2 (Public Review):

      1) Although the images and videos were of great quality, the results derived from them provided little new knowledge and few conceptual insights into male reproductive tract biology and basically confirmed what has been published using traditional methods. For example, the high intensity of the vascular network in the initial segment was previously reported by Abe in 1984 and Suzuki in 1982; the pattern of the major lymphatic vessel and drainage was beautifully depicted by Perez-Clavier, 1982.

      We thank the reviewer for his/her appreciative comments regarding the quality of the images/videos we provide in this study. We do not fully agree with his/her assessment of the lack of novelty. Our work confirms earlier reports that are now dated (1980s), which in itself is worth mentioning for the interested community, especially when the confirmation uses the most advanced technologies available today. We have never said that nothing was done in the past, and we have acknowledged all past contributors (including those mentioned by the reviewer) by pointing out the limitations of the technical tools that were available at the time. In addition, our current work provides a more comprehensive and global view by extending our approach to the entire mouse epididymis, whereas previous work was much more limited.

      2) The authors were very cautious when interpreting the results of marker immunostaining; however, these markers were not specific for a definite cell type. For example, as the authors stated, VEGFR3 marks both lymphatic vessels and fenestrated blood vessels. How could the authors claim the VEGFR3+ network was lymphatic? The authors claimed that they used three markers for the lymphatic vessel. But staining results of the networks were very different. How could the authors make conclusions about the network of lymphatic vessels in the epididymis?

      We broadly agree with the reviewer and have made it clear that one cannot be 100% sure that all the VEGFR3+ structures we present are lymphatic. However, in total, we used 4 documented lymphatic markers (not 3 as mentioned by the reviewer): VEGFR3, LYVE1, PROX1 and PDPN. Three of them give very similar profiles, while only PDPN shows some differences. We are currently studying in more detail the expression of PDPN in the mouse epididymis because we speculate that this marker may target a population of pluripotent cells in this tissue. Therefore, with the 3 similar profiles and with the subtraction of PLVAP+ structures, we are pretty confident that what we show corresponds to the different lymphatic structures.

      3) To understand the vascular network development in the epididymis, would the authors please look at the fetal stage when the vascular network is established in the first place? Wolffian duct tissues are much smaller and thinner and would be amenable for 3D imaging probably even without clearing.

      We generally agree with the reviewer that this could be an interesting addition. However, it represents a significant amount of additional work. Organ clearing will certainly be required because it is unlikely that Wolffian duct will be sufficiently transparent to allow lightsheet microscopy. In the literature, the study of Wolffian duct relies primarily on whole mounts, inclusions, and cryosections. Besides the fact that this represents a lot of extra work, we are not totally convinced that this would be of much use. A key reason is that the epididymis is an organ that differentiates completely after birth (Robaire and Hinton, 2015). It is reported that differentiation of mouse caput segment 1 occurs around 19DPN (Xu et al., 2016) and is intimately related to the development of the vasculature (Lebarr et al., 1986). Regarding the lymphatic network, Swingen et al, (2012) reports that lymphangiogenesis in the mouse testis and epididymis is initiated late in gestation after 15DPC. Videos showing the external lymphatic vessels of the testis and epididymis at 17.5DPC can be seen at https://doi.org/10.1371/journal.pone.0052620.s002. The authors indicate that lymphangiogenesis occurs via sprouting from the adjacent mesonephros. We hypothesize that the more internal lymphatics evolve between birth and 10DPN, which corresponds to the time when we observed LEPC Lyve1pos cells.

      4) Immunofluorescence staining of VEGF factors was not convincing. As a secreted factor, VEGF will be secreted out of the cells, would it be detected more in the interstitium? I am always skeptical about the results of immunostaining secreted growth factors. Would it be possible to perform in situ or RNAscope to confirm the spatial expression pattern of VEGFs?

      Active VEGF factors result from alternative mRNA splicing events and post-translational proteolytic cleavage. Therefore, in our opinion, the study of VEGF mRNA by in situ hybridization or RNAscope analysis will not be very informative about the actual presence of active forms of VEGF in the epididymis. If necessary, we can provide as supplementary material immunohistochemistry data showing the presence of VEGF-A in the epididymal principal cells. Our major objective with these data was to show that VEGF factors and their respective receptors were present in the epididymis. Nevertheless, in an attempt to convince the reviewer, we provide as accompanying data to this rebuttal letter new sets of figures (Figures VEGF-A-response editor & VEGF-C/VEGF-D-response editor) that we believe can improve the perception of our data. If the editorial office feels it is necessary, these figures could be added to the supplementary figure set (as Figure 6-figure supplement 1 and Figure 6-figure supplement 2). For VEGF-A the data exists already in the literature as we have indicated (Korpelainen, 1998). In fine, our goal was not to show which cell types of the epididymis epithelium produce VEGFs but rather that VEGF factors and their receptors were there in order to support angiogenic or lymphangiogenic activity in the tissue. In addition, we hypothesize that because septa have been reported to constitute barriers between segments restricting passive diffusion of molecules (Turner et al., 2003; Stammler et al., 2015), the VEGF factors are expected to be produced locally.

      Figure VEGF-A - response editor: Immunofluorescence of the angiogenic ligand VEGF-A in the epididymis. Figure 6 shows that this ligand is mainly found in the caput and more precisely in S1. It is very strongly expressed in the peritubular microvascularization of S1, which expresses the VEGFR3:YFP transgene, whereas it is less expressed by intertubular blood vessels (asterisk). This seems to indicate that the peritubular vessels are chiefly responsible for the angiogenic activity measured in our study. Furthermore, it is expressed by the epithelium in secretory vesicles (IS, S3, and enlargement), which is in agreement with the in situ hybridization work of Korpelainen et al. (J. Cell Biol., 1998). The enlargement shown in S3_Z shows the sagittal plane of the tubule, where one can see that cells strongly expressing VEGFR:YFP are also VEGF-A positive, indicating that the same cells of the epithelium express both the receptor and the ligand. Here the transgene is detected directly, without the use of an anti-GFP antibody to enhance the signal.

      Figure VEGF-C / VEGF-D - response editor: Immunofluorescence of VEGF-C and VEGF-D lymphangiogenic ligands in the epididymis. This figure shows that these ligands are mainly found in the interstitial tissue throughout the organ, with a higher proportion in the caudal part. This expression may be largely driven by fibroblasts, which are widely represented in the interstitium, or by endothelial cells, since these two ligands are expressed by these cell types. However, as shown in the figures and in the enlargement of panel A, VEGF-C is also produced by epithelial cells within what appear to be secretory vesicles. In contrast, for VEGF-D, we observe only a few weakly positive epithelial cells (panel B). These ligands are also detected in the lumen of epididymal tubules (visible for VEGF-C in panel A, S2). This presence may be explained by lumicrine transfer from the testis, in addition to secretion from epithelial cells. Here the transgene is detected directly, without the use of an anti-GFP antibody to enhance the signal.

      5) The study is descriptive and does not provide functional and mechanistic insights. Maybe, the combination of 3D imaging with lineage tracing of endothelium cells or ligation study (removal/ligation of the certain vessel) would help better understand how the vascular network is established and their functional significance.

      The technical approaches suggested by the reviewer could certainly improve our understanding of the rather complex epididymal vascular network. Taken together, they represent the body of a comprehensive follow-up study that is worth undertaking.

      6) Immune response is among many physiological processes in which vascular networks play significant roles. Discussion would be needed in other physiological processes, such as tissue metabolism and stem/progenitor cell niche microenvironment.

      We agree with the reviewer that the mammalian vasculature is involved in other physiological processes beyond immune/inflammatory responses. We have deliberately chosen to focus our discussion on the inflammatory and immune context of the epididymis, as we believe this is the most relevant aspect. It is also in full agreement with the research that our team has been conducting for 15 years to try to understand the complex orchestration of tolerance versus immune surveillance in this territory. This is a finely tuned process that, if properly understood, can help to understand and appropriately treat clinical situations of infertility and/or urological problems. As our discussion section is already quite long, we feel that it was not justified to extend it further on other aspects. However, in response to the reviewer's suggestion, we now mention at the end of the first paragraph of the discussion that the epididymal vascular network is likely to serve different processes in this tissue (page 9, lines 299 to 303).

      7) How could the author determine the Cd-A labeled vessel in Fig 1 was an artery, not a vein? This leads to another critical question. Would it be possible to stain with artery and vein markers to help illustrate the blood flow directions of the vessel?

      The reviewer is right on the fact that we arbitrarily called the Cd-A vessel in Figure 1 an artery. Cd-A is not an acronym we use anymore. What we have done is to use the acronym SEA (superior epididymal artery) to indicate what we firmly believe to be an artery, as also suggested by previous literature (e.g., Suzuki, 1982; Abe et al, 1982) in which this same structure has been consistently referred to as an artery. For other blood vessels, we now have used the acronym "Cd-BV" because we do not know whether we are dealing with a vein or an artery as rightfully pointed out by the reviewer. This is clearly stated in the legend of Figure 1.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors ask an interesting question as to whether working memory contains more than one conjunctive representation of multiple task features required for a future response with one of these representations being more likely to become relevant at the time of the response. With RSA the authors use a multivariate approach that seems to become the standard in modern EEG research.

      We appreciate the reviewer’s helpful comments on the manuscript and their encouraging comments regarding its potential impact.

      I have three major concerns that are currently limiting the meaningfulness of the manuscript: For one, the paradigm uses stimuli with properties that could potentially influence involuntary attention and interfere in a Stroop-like manner with the required responses (i.e., 2 out of 3 cues involve the terms "horizontal" or "vertical" while the stimuli contain horizontal and vertical bars). It is not clear to me whether these potential interactions might bring about what is identified as conjunctive representations or whether they cause these representations to be quite weak.

      We agree it is important to rule out any effects of involuntary attention that might have been elicited by our stimulus choices. To address the Reviewer’s concern, we conducted control analyses to test if there was any influence of Stroop-like interference on our measures of behavior or the conjunctive representation. To summarize these analyses (detailed in our responses below and in the supplemental materials), we found no evidence of the effect of compatibility on behavior or on the decoding of conjunctions during either the maintenance or test periods. Furthermore, we found that the decoding of the bar orientation was at chance level during the interval when we observe evidence of the conjunctive representations. Thus, we conclude that the compatibility of the stimuli and the rule did not contribute to the decoding of conjunctive representations or to behavior.

      Second, the relatively weak conjunctive representations are making it difficult to interpret null effects such as the absence of certain correlations.

      The reviewer is correct that we cannot draw strong conclusions from null findings. We have revised the main text accordingly. In certain cases, we have also included additional analyses. These revisions are described in detail in response to the reviewer’s comments below.

      Third, if the conjunctive representations truly are reflections of working memory activity, then it would help to include a control condition where memory load is reduced so as to demonstrate that representational strength varies as a function of load. Depending on whether these concerns or some of them can be addressed or ruled out this manuscript has the potential of becoming influential in the field.

      This is a clever suggestion for further experimentation. We agree that observing the adverse effect of memory load is one of the robust ways to assess the contributions of working memory system for future studies. However, given that decoding is noisy during the maintenance period (particularly for the low-priority conjunctive representation) even with a relatively low set-size, we expect that in order to further manipulate load, we would need to alter the research design substantially. Thus, as the main goal of the current study is to study prioritization and post-encoding selection of action-related information, we focused on the minimum set-size required for this question (i.e., load 2). However, we now note this load manipulation as a direction for future research in the discussion (pg. 18).

      Reviewer #2 (Public Review):

      Kikumoto and colleagues investigate the way visual-motor representations are stored in working memory and selected for action based on a retro-cue. They make use of a combination of decoding and RSA to assess at which stages of processing sensory, motor, and conjunctive information (consisting of sensory and motor representations linked via an S-R mapping) are represented in working memory and how these mental representations are related to behavioral performance.

      Strengths

      This is an elaborate and carefully designed experiment. The authors are able to shed further light on the type of mental representations in working memory that serve as the basis for the selection of relevant information in support of goal-directed actions. This is highly relevant for a better understanding of the role of selective attention and prospective motor representations in working memory. The methods used could provide a good basis for further research in this regard.

      We appreciate these helpful comments and the Reviewer’s positive comments on the impact of the work.

      Weaknesses

      There are important points requiring further clarification, especially regarding the statistical approach and interpretation of results.

      • Why is there a conjunction RSA model vector (b4) required, when all information for a response can be achieved by combining the individual stimulus, response, and rule vectors? In Figure 3 it becomes obvious that the conjunction RSA scores do not simply reflect the overlap of the other three vectors. I think it would help the interpretation of results to clearly state why this is not the case.

      Thank you for the suggestion. We have now added the theoretical background that motivates us to include the RSA model of conjunctive representation (pg. 4 and 5). In particular, several theories of cognitive control have proposed that over the course of action planning, the system assembles an event (task) file which binds all task features at all levels – including the rule (i.e., context), stimulus, and response – into an integrated, conjunctive representation that is essential for an action to be executed (Hommel 2019; Frings et al. 2020). Similarly, neural evidence from non-human primates suggests that cognitive tasks that require context-dependency (e.g., flexible remapping of inputs to different outputs based on the context) recruit nonlinear conjunctive representations (Rigotti et al. 2013; Parthasarathy et al. 2019; Bernardi et al. 2020; Panichello and Buschman, 2021). Supporting these views, we previously observed that conjunctive representations emerge in the human brain during action selection, which uniquely explained behavior such as the costs in transition of actions (Kikumoto & Mayr, 2020; see also Rangel & Hazeltine & Wessel, 2022) or the successful cancelation of actions (Kikumoto & Mayr, 2022). In the current study, by using the same set of RSA models, we attempted to extend the role of conjunctive representations to the planning and prioritization of future actions. As in the previous studies (and as noted by the reviewer), the conjunction model makes a unique prediction of the similarity (or dissimilarity) pattern of the decoder outputs: a specific instance of action that is distinct from other actions. This contrasts with the other RSA models of low-level features, which predict similar patterns of activity for instances that share the same feature (e.g., S-R mappings 1 to 4 share the diagonal rule context). Here, we generally replicate the previous studies showing the unique trajectories of conjunctive representations (Figure 3) and their unique contribution to behavior (Figure 5).

      • One of the key findings of this study is the reliable representation of the conjunction information during the preparation phase while there is no comparable effect evident for response representations. This might suggest that two potentially independent conjunctive representations can be activated in working memory and thereby function as the basis for later response selection during the test phase. However, the assumption of the independence of the high and low priority conjunction representations relies only on the observation that there was no statistically reliable correlation between the high and low priority conjunctions in the preparation and test phases. This assumption is not valid because non-significant correlations do not allow any conclusion about the independence of the two processes. A comparable problem appeared regarding the non-significant difference between high and low-priority representations. These results show that it was not possible to prove a difference between these representations prior to the test phase based on the current approach, but they do not unequivocally "suggest that neither action plan was selectively prioritized".

      We appreciate this important point. We have taken care in the revision to state that we find evidence of an interference effect for the high-priority action and do not find evidence for such an effect from the low-priority action. Thus, we do not intend to conclude that no such effect could exist. Further, although it is not our intention to draw a strong conclusion from the null effect (i.e., no correlations), we performed an exploratory analysis where we tested the correlation in trials where we observed strong evidence of both conjunctions. Specifically, we split trials into halves within each time point and individual subject and performed the multi-level model analysis using trials where both high and low priority conjunctions were above their medians. Thus, we selected trials in such a way that they are independent of the effect we are testing. The figure below shows the coefficient associated with the low-priority conjunction predicting the high-priority conjunction (uncorrected). Even when we focus on trials where both conjunctions are detected (i.e., a high signal-to-noise ratio), we observed no tradeoff. Again, we cannot draw strong conclusions based on the null result of this exploratory analysis. Yet, we can rule out some causes of the absence of correlation between high and low priority conjunctions, such as the poor signal-to-noise ratio of the low priority conjunctions. We have further clarified this point in the results (pg. 14).

      Fig. 1. Trial-to-trial variability between high and low priority conjunctions, using above-median trials. The coefficients of the multilevel regression model predicting the trial-to-trial variability in the high-priority conjunction from the low-priority conjunction.
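
      The trial-selection step described above (a median split within each subject and time point, keeping only trials in which both conjunctions exceed their medians) can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the field names `subject`, `time`, `high`, and `low` are hypothetical, and the subsequent multilevel regression is not reproduced here.

```python
from collections import defaultdict
from statistics import median


def select_above_median_trials(trials):
    """Keep trials whose high- AND low-priority conjunction scores both exceed
    the median of their own (subject, time point) group.

    `trials` is a list of dicts with hypothetical keys:
    "subject", "time", "high", "low".
    """
    # Group trials by subject and time point so medians are computed within group.
    groups = defaultdict(list)
    for trial in trials:
        groups[(trial["subject"], trial["time"])].append(trial)

    selected = []
    for group in groups.values():
        high_median = median(t["high"] for t in group)
        low_median = median(t["low"] for t in group)
        selected.extend(
            t for t in group
            if t["high"] > high_median and t["low"] > low_median
        )
    return selected


# Toy data for one subject and one time point (values are made up).
toy_trials = [
    {"subject": 1, "time": 0, "high": 1.0, "low": 1.0},
    {"subject": 1, "time": 0, "high": 2.0, "low": 2.0},
    {"subject": 1, "time": 0, "high": 3.0, "low": 4.0},
    {"subject": 1, "time": 0, "high": 4.0, "low": 3.0},
]
kept = select_above_median_trials(toy_trials)
```

      Because the split uses each variable's own median, the selection does not by itself favor a positive or a negative relationship between the two scores in the retained trials.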

      • The experimental design used does not allow for a clear statement about whether pure motor representations in working memory only emerge with the definition of the response to be executed (test phase). It is not evident from Figure 3 that the increase in the RSA scores strictly follows the onset of the Go stimulus. It is also conceivable that the emergence of a pure motor representation requires a longer processing time. This could only be investigated through temporally varying preparation phases.

      We agree with the reviewer. Although we detected no evidence of response representations of both high and low priority action plans during the preparation phase, t(1,23) = -.514, beta = .002, 95% CI [-.010 .006] for high priority; t(1,23) = -1.57, beta = -.008, 95% CI [-.017 .002] for low priority, this may be limited by the relatively short duration of the delay period (750 ms) in this study. However, in our previous studies using a similar paradigm without a delay period (Kikumoto & Mayr, 2020; Kikumoto & Mayr, 2022), response representations were detected less than 300ms after the response was specified, which corresponds to the onset of delay period in this study. Further, participants in the current study were encouraged to prepare responses as early as possible, using adaptive response deadlines and performance-based incentives. Thus, we know of no reason why responses would take longer to prepare in the present study. But we agree that we can’t rule this out. We have added the caveat noted above, as well as this additional context in the discussion (pg. 16-17).

      • Inconsistency of statistical approaches: In the methods section, the authors state that they used a cluster-forming threshold and a cluster-significance threshold of p < 0.05. In the results section (Figure 4) a cluster p-value of 0.01 is introduced. Although this concerns different analyses, varying threshold values appear as if they were chosen in favor of significant results. The authors should either proceed consistently here or give very good reasons for varying thresholds.

      We thank the reviewer for noting this oversight. All reported significant clusters with cluster P-value were identified using a cluster-forming threshold, p < .05. We fixed the description accordingly.

      • Interpretation of results: The significant time window for the high vs. low priority by test-type interaction appeared quite late for the conjunction representation. First, it does not seem reasonable that such an effect appears in a time window overlapping with the motor responses. But more importantly, why should it appear after the respective interaction for the response representation? When keeping in mind that these results are based on a combination of time-frequency analysis, decoding, and RSA (quite many processing steps), I find it hard to really see a consistent pattern in these results that allows for a conclusion about how higher-level conjunctive and motor representations are selected in working memory.

      Thank you for raising this important point. First, we fixed the reported methodological inconsistencies (such as the cluster P-value and cluster-forming threshold). Further, we fully agree that the difference in the time course for the response and conjunctive representations in the low priority, tested condition is unexpected and would complicate the perspective that the conjunctive representation contributes to efficient response selection. However, additional analysis indicates that this apparent pattern in the stimulus-locked result is misleading and there is a more parsimonious explanation. First, we wish to caution that the data are relatively noisy and are likely influenced by different frequency bands for different features. Thus, fine-grained temporal differences should be interpreted with caution in the absence of positive statistical evidence of an interaction over time. Indeed, though Figure 4 in the original submission shows a quantitative difference in timing of the interaction effect (priority by test type) across conjunctive representation and response representation, the direct test of this four-way interaction [priority x test type x representation type (conjunction vs. response) x time interval (1500 ms to 1850 ms vs. 1850 to 2100 ms)] is not significant, t(1,23) = 1.65, beta = .058, 95% CI [-.012 .015]. The same analysis using response-aligned data is also not significant, t(1,23) = -1.24, beta = -.046, 95% CI [-.128 .028]. These observations were not dependent on the choice of time interval, as other time intervals were also not significant. Therefore, we do not have strong evidence that this is a true timing difference between these conditions and believe this is likely driven by noise.

      Further, we believe the apparent late emergence of difference in two conjunctions when the low priority action is tested is more likely due to a slow decline in the strength of the untested high priority conjunction rather than a late emergence of the low priority conjunction. This pattern is clearer when the traces are aligned to the response. The tested low priority conjunction emerges early and is sustained when it is the tested action and declines when it is untested (-226 ms to 86 ms relative to the response onset, cluster-forming threshold, p < .05). These changes eventually resulted in a significant difference in strength between the tested versus untested low priority conjunctions just prior to the commission of the response (Figure 4 - figure supplement 1, the panel on right column of the middle row, the black bars at the top of panel). Importantly, the high priority conjunction also remains active in its untested condition and declines later than the untested low priority conjunction does. Indeed, the untested high priority conjunction does not decline significantly relative to trials when it is tested until after the response is emitted (Figure 4 - figure supplement 1, the panel on right column of the middle row, the red bars at the top of panel). This results in a late emerging interaction effect of the priority and test type, but this is not due to a late emerging low priority conjunctive representation.

      In summary, we do not have statistical evidence of a time by effect interaction that allows us to draw strong inferences about timing. Nonetheless, even the patterns we observe are inconsistent with a late emerging low priority conjunctive representation. If anything, they support a late decline in the untested high priority conjunctive representation. This pattern of the high priority conjunction being sustained until late, even when it is untested, is also notable in light of our observation that the strength of the high priority conjunctive representation interferes with behavior when the low priority item is tested, but not vice versa. We now address this point about the timing directly in the results (pg. 15-16) and the discussion (pg. 21), and we include the response-locked results in the main text along with the stimulus-locked results, including the exploratory analyses reported here.

      Reviewer #3 (Public Review):

      This study aims to address the important question of whether working memory can hold multiple conjunctive task representations. The authors combined a retro-cue working memory paradigm with their previous task design that cleverly constructed multiple conjunctive tasks with the same set of stimuli, rules, and responses. They used advanced EEG analytical skills to provide the temporal dynamics of concurrent working memory representation of multiple task representations and task features (e.g., stimulus and responses) and how their representation strength changes as a function of priority and task relevance. The results generally support the authors' conclusion that multiple task representations can be simultaneously manipulated in working memory.

      We appreciate these helpful comments, and were pleased that the reviewer shares our view that these results may be broadly impactful.

    1. Author Response

      Reviewer #2 (Public Review):

      Reviewer #2 was critical of every aspect of our manuscript and we were disappointed that they failed to appreciate the significance of our findings. However, we have responded to each point as described below:

      1) The experiment displayed in Figure 5 is deeply flawed for multiple reasons and should be removed from the manuscript entirely. A Michaelis-Menten plot compares the initial rate of a reaction versus substrate concentration. Instead, the authors plotted the fraction of SsrB that is phosphorylated after 10 minutes at various substrate concentrations. Such a plot must reach saturation because the enzyme is limiting, whereas it is not always possible to achieve saturation in a genuine Michaelis-Menten plot. Because no reaction rates were measured, it is not possible to derive kcat values from the data.

      Mea culpa. We now plot our phosphorylation data and describe the mid-point as a k0.5 and have removed Fig. 1g. When we directly compare the H12 mutant to wt at neutral pH, its phosphorylation level is less compared to the wt (see new Fig. 4a). The wt phosphorylation is reduced at acid pH, (Fig 4b), but with His12Q, there was no difference in phosphorylation between neutral and acid pH (Fig 4c). It is important to include this data, because in RcsB, a close homolog of SsrB, an H12A mutant was not phosphorylated by acetyl phosphate and it was incapable of binding to DNA, unlike what we show here with SsrB.
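
      The distinction raised in point 1 between an endpoint measurement and an initial rate can be illustrated with a deliberately simple model. The sketch below assumes irreversible second-order phosphorylation with the donor in large excess and no dephosphorylation; the rate constant `K2` is a hypothetical value chosen for illustration and none of the numbers are taken from the manuscript.

```python
import math

# Hypothetical values chosen only for illustration; none come from the manuscript.
K2 = 0.01      # assumed second-order rate constant, mM^-1 min^-1
T_END = 10.0   # endpoint time in minutes (the 10 min mentioned in point 1)


def initial_rate(substrate_mM: float, k2: float = K2) -> float:
    """Initial rate (fraction of protein per minute); linear in donor concentration."""
    return k2 * substrate_mM


def endpoint_fraction(substrate_mM: float, t_min: float = T_END, k2: float = K2) -> float:
    """Fraction phosphorylated at time t, assuming excess donor and no dephosphorylation."""
    return 1.0 - math.exp(-k2 * substrate_mM * t_min)


def endpoint_midpoint(t_min: float = T_END, k2: float = K2) -> float:
    """Donor concentration that gives 50% phosphorylation at the chosen endpoint."""
    return math.log(2.0) / (k2 * t_min)


if __name__ == "__main__":
    for s in (1, 5, 10, 50, 100, 500):
        print(f"{s:>4} mM  rate={initial_rate(s):.3f}/min  "
              f"fraction@{T_END:.0f}min={endpoint_fraction(s):.3f}")
    print(f"midpoint at 10 min: {endpoint_midpoint(10.0):.2f} mM")
    print(f"midpoint at 20 min: {endpoint_midpoint(20.0):.2f} mM")
```

      In this toy model the initial rate never saturates, yet the endpoint fraction approaches 1, and the midpoint of the endpoint curve halves when the incubation time doubles. Under these assumptions the midpoint is therefore a descriptive quantity rather than a kinetic constant, which is in keeping with reporting it as a k0.5.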

      (i) Increasing the concentration of the phosphoramidate substrate increased ionic strength. Response regulator active sites contain many charged moieties and autophosphorylation of at least one response regulator (CheY) is inhibited by increasing ionic strength (PMID 10471801).

      The reviewer raises some interesting points and they are based on CheY phosphorylation by small molecules. We have a long history of studying OmpR and SsrB as well as other RRs and we know that they can all behave very differently from “canonical signaling”. We examined the effect of ionic strength on SsrB phosphorylation and it was relatively insensitive to changes in ionic strength (our original buffer was 267-430 mOsm and in each case, we have 90% phosphorylation). However, we repeated all of the phosphorylation experiments and kept ionic strength constant. These data are now presented in the revised manuscript.

      (ii) Autophosphorylation with phosphoramidate is pH dependent because the nitrogen on the donor must be protonated to form a good leaving group (PMID 9398221). The pKa of phosphoramidate is ~8. Therefore, the fraction of phosphoramidate that is reactive (i.e., protonated) will be very different at pH 6.1 and 7.4.

      We are aware of those findings, but we are comparing the H12 mutant with the wt protein in each case. There is no reason to believe that the presence of the mutant should alter the phosphoramidate substrate, so we are comparing how the wt phosphorylation compares with the mutant (Fig 4b, c).
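
      The protonation argument in point (ii) can be made concrete with the Henderson-Hasselbalch relation. The sketch below treats the donor nitrogen as a single titratable group with a pKa of exactly 8.0, which is an assumption based only on the "~8" quoted above; the function name is illustrative.

```python
def fraction_protonated(pH: float, pKa: float) -> float:
    """Fraction of a single titratable group that is protonated at a given pH
    (Henderson-Hasselbalch): 1 / (1 + 10**(pH - pKa))."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


PKA_DONOR = 8.0  # assumed value, taken from the "~8" stated in point (ii)

frac_acid = fraction_protonated(6.1, PKA_DONOR)     # acid condition used in the study
frac_neutral = fraction_protonated(7.4, PKA_DONOR)  # neutral condition used in the study

if __name__ == "__main__":
    print(f"protonated fraction at pH 6.1: {frac_acid:.3f}")
    print(f"protonated fraction at pH 7.4: {frac_neutral:.3f}")
```

      Under this single-pKa assumption the protonated fraction is roughly 0.99 at pH 6.1 and roughly 0.80 at pH 7.4. The estimate shifts if the true pKa differs from 8.0.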

      (iii) Response regulator autophosphorylation absolutely depends on the presence of a divalent metal ion (usually Mg2+) in the active site (PMID 2201404). There is no guarantee that the 20 mM Mg2+ included in the reaction is sufficient to saturate SsrB. Furthermore, as the authors themselves note, the amino acid at SsrB position 12 is likely to affect the affinity of Mg2+ binding. Therefore, the fraction of SsrB that is reactive (i.e. has Mg2+ bound) may differ between wildtype and the H12Q mutant, and/or between wildtype at different pHs (because the protonation state of His12 changes).

      This is exactly the point that we are making, and it is why we varied the magnesium concentration (increasing it to 50-100 mM). There was a slight increase in phosphorylation at 50 mM MgCl2 compared to 20 mM, and only a slight increase between 50 and 100 mM at pH 6.1. The revised phosphorylation experiments all contain 100 mM MgCl2.

      2) The data in Figures 1abcd and 3de are clearly sigmoidal rather than hyperbolic, indicating cooperativity. However, there are insufficient data points between the upper and lower bounds to accurately calculate the Hill coefficient or KD values. This limitation of the data means that comparisons of apparent Hill coefficient or KD values under different conditions cannot be the basis of credible conclusions.

      We respectfully disagree. In every curve that we provide, there is at least one data point in the transition between low and high binding. With the mutant H12Q, we did manage to get two data points in the transition and the KD was the same as the wildtype (Fig. 2). We provide an analysis of the binding curve which demonstrates the range of KD values based on the lowest and highest error in the point (132-168 nM), and this does not significantly change the value (this is now shown in Fig. 1-figure supplement 1). The very high affinity we observed at pH 6.1 (KD ~5 nM) places the range of possibilities between 4 and 8 nM (i.e. still VERY high affinity). This range of affinities at neutral and acid pH is very reminiscent of the affinities we measured for OmpR and OmpR~P at the porin promoters, suggesting that acid pH puts SsrB in an activated state even in the absence of phosphorylation. A similar argument holds for the Hill coefficient (see Figure).
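
      The exchange in point 2 turns on how many titration points fall within the transition of a cooperative binding curve. The sketch below uses the standard Hill equation to show how the width of that transition depends on the Hill coefficient. The function names are illustrative, and the KD and Hill coefficient in the usage line are hypothetical values, not fitted parameters from the manuscript.

```python
def hill_fraction_bound(conc: float, kd: float, n: float) -> float:
    """Hill equation: fractional saturation at ligand concentration `conc`,
    with apparent dissociation constant `kd` and Hill coefficient `n`."""
    return conc ** n / (kd ** n + conc ** n)


def transition_fold_change(n: float, lo: float = 0.1, hi: float = 0.9) -> float:
    """Fold-change in concentration needed to go from `lo` to `hi` saturation.

    Inverting the Hill equation gives conc(f) = kd * (f / (1 - f)) ** (1 / n),
    so the ratio conc(hi) / conc(lo) does not depend on kd.
    """
    odds_ratio = (hi / (1.0 - hi)) / (lo / (1.0 - lo))
    return odds_ratio ** (1.0 / n)


# Hypothetical parameters, for illustration only.
half_saturation = hill_fraction_bound(conc=100.0, kd=100.0, n=3.0)

if __name__ == "__main__":
    for n in (1.0, 2.0, 4.0):
        print(f"n = {n:.0f}: 10%-90% transition spans a "
              f"{transition_fold_change(n):.1f}-fold concentration range")
```

      For a non-cooperative curve (n = 1) the 10% to 90% transition spans an 81-fold concentration range, whereas for n = 4 it spans only 3-fold. Steeper curves therefore leave fewer points of a fixed dilution series inside the transition.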

      3) There are hundreds of receiver domain structures in PDB. There is some variation, but to a first approximation receiver domain structures, all exhibit an (alpha/beta)5 fold. The structure of SsrB predicted by i-TASSER breaks the standard beta-2 strand into two parts, which throws off the numbering for subsequent beta strands. Given the highly conserved receiver domain fold, I am skeptical that the predicted i-TASSER structure is correct or adds any value to the manuscript. If the authors wish to retain the structure of the manuscript, then they should point out the unusual feature and the consequence of strand numbering.

      We now include a new model based on the RcsB/DNA crystal structure that eliminates this problem (see new Fig. 2-figure supplement 2). We have replaced this model with an AlphaFold prediction that was energy minimized to align with the RcsB dimer crystal structure (Fig. 5-figure supplement 2). This model retains the original (beta/alpha)5 fold, so the classical numbering is retained.

      4) The detailed predictions of active site structure in Supplementary Figure 5 are not physiologically relevant because Mg2+ was not included in the simulation. The presence of a divalent cation binding to Asp10 and Asp11 is likely to substantially alter interactions between Asp 10, Asp11, His12, and Lys109.

      See the response to point 1(iii) above and new Fig. 5-figure supplement 2. Author response image 1 is a zoomed-in snapshot of supplementary Figure 8c that has been modelled using the RcsB dimer bound to BeF3 and Mg2+ (6ZIX). Both the i-TASSER and AlphaFold model receiver domains align well with this structure, and the polar contacts and pi-cation interactions made by His12 are maintained.

      Author response image 1.

      5) The authors present an AlphaFold model of an SsrB dimer, and note that His12 is at the dimer interface. However, the authors also believe that a higher-order oligomer of SsrB binds to DNA in a pH-dependent manner. Do the authors have any suggestions or informed speculation about how His12 might affect higher-order oligomerization than dimerization?

      As mentioned in our response to point 3 above, we now include a new model of an SsrB dimer bound to DNA based on our NMR structure of the CTD and the RcsB/DNA structure. In the RcsB paper, the authors also have evidence for a higher-order oligomer in the crystal structure of unphosphorylated (and BeF3-) RcsB, which showed an asymmetric unit containing 6 molecules of RcsB, which form 3 dimers arranged in a hexameric structure that resembles a cylinder. This configuration involves a crossed conformation with the REC of one molecule interacting with the DBD of another and, interestingly, His12 is interacting with the DBD of another molecule. We modelled an SsrB oligomer structure using the RcsB hexamer as a template and have included it as a new figure (see Fig. 5-figure supplement 3) and in the revised discussion (lines 432-448).

    1. Author Response

      Reviewer #1 (Public Review):

      1) One nagging concern is that the category structure in the CNN reflects the category structure baked into color space. Several groups (e.g. Regier, Zaslavsky, et al) have argued that color category structure emerges and evolves from the structure of the color space itself. Other groups have argued that the color category structure recovered with, say, the Munsell space may partially be attributed to variation in saturation across the space (Witzel). How can one show that these properties of the space are not the root cause of the structure recovered by the CNN, independent of the role of the CNN in object recognition?

      We agree that there is overlap with the previous studies on color structure. In our revision, we show that color categories are directly linked to the CNN being trained on the object-recognition task and not to the CNN per se. We repeated our analysis on a scene-trained network (using the same input set) and find that here the color representation in the final layer deviates considerably from the one created for object classification. Given that the input set is the same, this strongly suggests that any reflection of the structure of the input space is to the benefit of recognizing objects (see the bottom of “Border Invariance” section; Page 7). Furthermore, the new experiments with random hue shifts to the input images show that in this case stable borders do not arise, as might be expected if the border invariance was a consequence of the chosen color space only.

      A crucial distinction from previous results is that in our analysis, by specifically replacing the final layer, we look at the representation that the network has built to perform the object classification task. As such, the current finding goes beyond the notion that the color category structure is already reflected in the color space.

      2) In Figure 1, it could be useful to illustrate the central observation by showing a single example, as in Figure 1 B, C, where the trained color is not in the center of the color category. In other words, if the category structure is immune to the training set, then it should be possible to set up a very unlikely set of training stimuli (ones that are as far away from the center of the color category while still being categorized most of the time as the color category). This is related to what is in E, but is distinctive for two reasons: first, it is a post hoc test of the hypothesis recovered in the data-driven way by E; and second, it would provide an illustration of the key observation, that the category boundaries do not correspond to the median distance between training colors. Figure 5 begins to show something of this sort of a test, but it is bound up with the other control related to shape.

      We have now added a post-hoc test where we shift the training bands from likely to unlikely positions using the original paradigm: Retraining output layers whilst shifting training bands from the left to the right category-edge (in 9 steps) we can see the invariance to the category bounds specifically (see Supp. Inf.: Figure S11). The most extreme cases (top and bottom row) have the training bands right at the edge of the border, which are the interesting cases the reviewer refers to. We also added 7 steps in between to show how the borders shift with the bands.

      Similarly, if the claim is that there are six (or seven?) color categories, regardless of the number of colors used to train the data, it would be helpful to show the result of one iteration of the training that uses say 4 colors for training and another iteration of the training that uses say 9 colors for training.

      We have now included the figure presented in 1E, but for all the color iterations used (see SI: Figure S10). We are also happy to include a single iteration, but believe this gives the most complete view of what the reviewer is asking.

      The text asserts that Figure 2 reflects training on a range of color categories (from 4 to 9) but doesn’t break them out. This is an issue because the average across these iterations could simply be heavily biased by training on one specific number of categories (e.g. the number used in Figure 1). These considerations also prompt the query: how did you pick 4 and 9 as the limits for the tests? Why not 2 and 20? (the largest range of basic color categories that could plausibly be recovered in the set of all languages)?

      The number of output nodes was inspired by the number of basic color categories that English speakers observe in the hue spectrum (in which a number of the basic categories are not represented). We understand that this is not a strong reason; unfortunately, the lack of studies on color categories in CNNs forced us to approach this in an explorative manner. We have adapted the text to better reflect this shortcoming (Bottom page 4). Naturally, had the data indicated that these numbers weren’t a good fit, we would have adapted the range (if there were more categories, we would have expected more noise and would have increased the number of training bands to test this). As indicated above, we have now also included the classification plots for all the different counts, so the reader can review this as well (SI: Section 9).

      3) Regarding the transition points in Figure 2A, indicated by red dots: how strong (transition count) and reliable (consistent across iterations) are these points? The one between red and orange seems especially willfully placed.

      To answer the question on the consistency we have now included a repetition of the ResNet18, with the ResNet34, ResNet50 and ResNet101 in the SI (section 1). We have also introduced a novel section presenting the result of alternate CNNs to the SI (section S8). Despite small idiosyncrasies the general pattern of results recurs.

      Concerning the red-orange border, it was not willfully placed, but we very much understand that in isolation it looks like it could simply be the result of noise. Nevertheless, the recurrence of this border in several analyses made us confident that it does reflect a meaningful invariance. Notably:

      • We find a more robust peak between red and orange in the luminance control (SI section 3).

      • The evolutionary algorithm with 7 borders also places a border in this position.

      • We find the peak recurs in the ResNet-18 replication as well as several of the deeper ResNets and several of the other CNNs (SI section 1).

      • We also find that the peak is present throughout the different layers of the ResNet-18.

      4) Figure 2E and Figure 5B are useful tests of the extent to which the categorical structure recovered by the CNNs shifts with the colors used to train the classifier, and it certainly looks like there is some invariance in category boundaries with respect to the specific colors used to train the classifier, an important and interesting result. But these analyses do not actually address the claim implied by the analyses: that the performance of the CNN matches human performance. The color categories recovered with the CNN are not perfectly invariant, as the authors point out. The analyses presented in the paper (e.g. Figure 2E) test whether there is as much shift in the boundaries as there is stasis, but that’s not quite the test if the goal is to link the categorical behavior of the CNN with human behavior. To evaluate the results, it would be helpful to know what would be expected based on human performance.

      We understand the lack of human data was a considerable shortcoming of the previous version of the manuscript. We have now collected human data in a match-to-sample task modeled on our CNN experiment. As with the CNN we find that the degree of border invariance does fluctuate considerably. While categorical borders are not exact matches, we do broadly find the same category prototypes and also see that categories in the red-to-yellow range are quite narrow in both humans and CNNs. Please, see the new “Human Psychophysics” (page 8) addition in the manuscript for more details.

      5) The paper takes up a test of color categorization invariant to luminance. There are arguments in the literature that hue and luminance cannot be decoupled: that luminance is essential to how color is encoded and to color categorization. Some discussion of this might help the reader who has followed this literature.

      We have added some discussion of the interaction between luminance and color categories (e.g., Lindsay & Brown, 2009) at the bottom of page 6/ top of page 7. The current analysis mainly aimed at excluding that the borders are solely based on luminance.

      Related, the argument that “neighboring colors in HSV will be neighboring colors in the RGB space” is not persuasive. Surely this is true of any color space?

      We removed the argument about “neighboring colors”. Our procedure requires the use of a hue spectrum that wraps around the color space while including many of the highly saturated colors that are typical prototypes for human color categories. We have elected to use the hue spectrum from the HSV color space at full saturation and brightness, which is represented by the edges of the RGB color cube. As this is the space in which our network was trained, it does not introduce any deformations into the color space. Other potential choices of color space either include strong non-linear transformations that stretch and compress certain parts of the RGB cube, or exclude a large portion of the RGB gamut (yellow in particular).

      We have adapted the text to better reflect our reasoning (page 6, top of paragraph 2).
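The geometric claim in the response above, that the HSV hue circle at full saturation and brightness runs along edges of the RGB cube, can be checked numerically. The following is a minimal sketch using Python's standard `colorsys` module; the helper names `hue_to_rgb` and `on_cube_edge` and the 5% hue sampling step are illustrative assumptions and are not taken from the authors' pipeline.

```python
import colorsys


def hue_to_rgb(hue_percent):
    """Map a hue given in percent (0-100) to RGB at full saturation and value."""
    return colorsys.hsv_to_rgb((hue_percent % 100) / 100.0, 1.0, 1.0)


def on_cube_edge(rgb, tol=1e-9):
    """A point on one of the six hue-circle edges of the RGB cube has
    one channel at 0 and another at 1; the third channel varies freely."""
    return abs(min(rgb)) <= tol and abs(max(rgb) - 1.0) <= tol


# Sample the hue circle in 5% steps (the step size is an arbitrary choice).
samples = [hue_to_rgb(h) for h in range(0, 100, 5)]
all_on_edges = all(on_cube_edge(rgb) for rgb in samples)
```

If `all_on_edges` is true, every sampled hue has one channel at 0 and one at 1, which is what it means for the hue spectrum to lie on cube edges rather than in the interior of the gamut.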

      6) The paper would benefit from an analysis and discussion of the images used to originally train the CNN. Presumably, there are a large number of images that depict manmade artificially coloured objects. To what extent do the present results reflect statistical patterns in the way the images were created, and/or the colors of the things depicted? How do results on color categorization that derive from images (e.g. trained with neural networks, as in Rosenthal et al and presently) differ (or not) from results that derive from natural scenes (as in Yendrikhovskij?).

      We initially hoped to analyze differences between colors in objects and background, as in Rosenthal. Unfortunately, in ImageNet we did not find clear differences between pixels in the bounding boxes of objects provided with ImageNet and pixels outside these boxes (most likely because the rectangular bounding boxes still contain many background pixels). However, if we look at the results from the K-means analysis presented in Figure S6 of the supplemental materials and the color categorization throughout the layers in the object-trained network (end of the first experiment on page 7) as well as the color categorization in humans (Human Psychophysics starting on page 8), we see very similar border positions arise.

      7) It could be quite instructive to analyze what's going on in the errors in the output of the classifiers, as e.g. in Figure 1E. There are some interesting effects at the crossover points, where the two green categories seem to split and swap, the cyan band (hue % 20) emerges between orange and green, and the pink/purple boundary seems to have a large number of green/blue results. What is happening here?

      One issue with training the network on the color task is that we can never fully guarantee that the network is using color to resolve the task, and we suspected that in some cases the network may rely on other factors as well, such as luminance. When we look at the same type of plots for the luminance-controlled task (see below left) presented in the supplemental materials we do not see these transgressions. Also, when we look at versions of the original training, but using more bands, luminance will be less reliable and we also don’t see these transgressions (see right plot below).

      8) The second experiment using an evolutionary algorithm to test the location of the color boundaries is potentially valuable, but it is weakened because it pre-determines the number of categories. It would be more powerful if the experiment could recover both the number and location of the categories based on the "categorization principle" (colors within a category are harder to tell apart than colors across a color category boundary). This should be possible by a sensible sampling of the parameter space, even in a very large parameter space.

      The main point of the genetic algorithm was to see whether the border locations would be corroborated by an algorithm using the principle of categorical perception. Unfortunately, an exact approach to determining the number of borders is difficult, because some border invariances are clearly stronger than others. Running the algorithm with the number of borders as a free parameter just leads to a minimal number of borders, as 100% correct is always obtained when there is only one category left. In general, as the network can simply combine categories into a class at no cost (actually, having less borders will reduce noise) it is to be expected that less classes will lead to better performance. As such, in estimating what the optimal category count would be, we would need to introduce some subjective trade-off between accuracy and class count.
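The argument that merging categories comes "at no cost" can be illustrated with a toy confusion matrix. The sketch below is not the authors' genetic algorithm; the counts are made up, and `accuracy` and `merge_adjacent` are hypothetical helpers. It only shows why accuracy cannot decrease when two neighbouring categories are pooled, and why a single remaining category is trivially 100% correct.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal (rows = true class, columns = predicted)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def merge_adjacent(confusion, i):
    """Return a new confusion matrix with categories i and i + 1 merged."""
    groups = [[k] for k in range(len(confusion))]
    groups[i:i + 2] = [[i, i + 1]]
    return [
        [sum(confusion[r][c] for r in rows for c in cols) for cols in groups]
        for rows in groups
    ]


# Hypothetical counts for three neighbouring hue categories (illustration only).
confusion = [
    [80, 15, 5],
    [10, 70, 20],
    [5, 25, 70],
]

# Repeatedly merge the last two categories and record the accuracy each time.
accuracies = [accuracy(confusion)]
while len(confusion) > 1:
    confusion = merge_adjacent(confusion, len(confusion) - 2)
    accuracies.append(accuracy(confusion))
```

Each merge moves the confusions between the two pooled categories onto the diagonal, so `accuracies` is non-decreasing and ends at 1.0. Any criterion for choosing the number of borders therefore needs an explicit penalty on low category counts, which is the subjective trade-off mentioned above.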

      9) Finally, the paper sets itself up as taking "a different approach by evaluating whether color categorization could be a side effect of learning object recognition", as distinct from the approach of studying "communicative concepts". But these approaches are intimately related. The central observation in Gibson et al. is not the discovery of warm-vs-cool categories (these as the most basic color categories have been known for centuries), but rather the relationship of these categories to the color statistics of objects: those parts of the scene that we care about enough to label. This idea, that color categories reflect the uses to which we put our color-vision system, is extended in Rosenthal et al., where the structure of color space itself is understood in terms of categorizing objects versus backgrounds (u') and the most basic object categorization distinction, animate versus inanimate (v'). The introduction argues, rightly in our view, that "A link between color categories and objects would be able to bridge the discrepancy between models that rely on communicative concepts to incorporate the varying usefulness of color, on the one hand, and the experimental findings laid out in this paragraph on the other". This is precisely the link forged by the observation that the warm-cool category distinction in color naming correlates with object-color statistics (Gibson, 2017; see also Rosenthal et al., 2018). The argument in Gibson and Rosenthal is that color categorization structure emerges because of the color statistics of the world, specifically the color statistics of the parts of the world that we label as objects, which is the same approach adopted by the present work. The use of CNNs is a clever and powerful test of the success of this approach.

      We are sorry we did not properly highlight the enormous importance of these two earlier papers in our previous version of the manuscript. We have now elaborated our description of Gibson’s work to better reflect the important relation between the usefulness of colors and color categories (Page 2, middle and Page 19 par. above methods). We think our work nicely extends the earlier work by showing that their approach works even at a more general level with more color categories,

    1. Author Response

      Reviewer #1 (Public Review):

      Using health insurance claims data (from 8M subjects), a retrospective propensity score matched cohort study was performed (450K in both groups) to quantify associations between bisphosphonate (BP) use and COVID-19 related outcomes (COVID-19 diagnosis, testing and COVID-19 hospitalization). The observation periods were 1-1-2019 till 2-29-2020 for BP use and from 3-1-2020 to 6-30-2020 for the COVID endpoints. In primary and sensitivity analyses BP use was consistently associated with lower odds for COVID-19, testing and COVID-19 hospitalization.

      The major strength of this study is the size of the study population, allowing a propensity-based matched-cohort study with 450K in both groups, with a sizeable number of COVID-19 related endpoints. Health insurance claims data were used with the intrinsic risk of some misclassification for exposure. In addition there probably is misclassification of endpoints as testing for COVID-19 was limited during the study period. Furthermore, the retrospective nature of the study includes the risk of residual confounding, which has been addressed - to some extent - by sensitivity analyses.

      In all analyses there is a consistent finding that BP exposure is associated with reduced odds for COVID-19 related outcomes. The effect size is large, with high precision.

      The authors extensively discuss the (many) potential limitations inherent to the study design and conclude that these findings warrant confirmation, preferably in intervention studies. If confirmed BP use could be a powerful adjunct in the prevention of infection and hospitalization due to COVID-19.

      We thank the reviewer for this overall very positive feedback. We appreciate the reviewer's comments regarding the potential risks associated with misclassification of exposure and other potential limitations, which we have sought to address in a number of sensitivity analyses and are also addressing in the discussion of our paper. In addition, as noted by the reviewer, the observed effect size of BP use on COVID-19 related outcomes is large, with high precision, which we feel is a strong argument to explore this class of drugs in further prospective studies.

      Reviewer #2 (Public Review):

      The authors performed a retrospective cohort study using claims data to assess the causal relationship between bisphosphonate (BP) use and COVID-19 outcomes. They used propensity score matching to adjust for measured confounders. This is an interesting study and the authors performed several sensitivity analyses to assess the robustness of their findings. The authors are properly cautious in the interpretation of their results and justly call for randomized controlled trials to confirm a causal relationship. However, there are some methodological limitations that are not properly addressed yet.

      Strengths of the paper include:

      (A) Availability of a large dataset.

      (B) Using propensity score matching to adjust for confounding.

      (C) Sensitivity analyses to challenge key assumptions (although not all of them add value in my opinion, see specific comments)

      (D) Cautious interpretation of results, the authors are aware of the limitations of the study design.

      Limitations of the paper are:

      (A) This is an observational study using register data. Therefore, the study is prone to residual confounding and information bias. The authors are well aware of that.

      (B) The authors adjusted for Charlson comorbidity index whereas they had individual comorbidity data available and a dataset large enough to adjust for each comorbidity separately.

      (C) The primary analysis violates the positivity assumption (a substantial part of the population had no indication for bisphosphonates; see specific comments). I feel that one of the sensitivity analyses 1 or 2 would be more suited for a primary analysis.

      (D) Some of the other sensitivity analyses have underlying assumptions that are not discussed and do not necessarily hold (see specific comments).

      In its current form the limitations hinder a good interpretation of the results and, therefore, in my opinion do not support the conclusion of the paper.

      The finding of a substantial risk reduction of (severe) COVID-19 in bisphosphonate users compared to non-users in this observational study may be of interest to other researchers considering setting up randomized controlled trials to evaluate repurposed drugs for prevention of (severe) COVID-19.

      We thank the reviewer for the insightful comments and questions related to our manuscript. Our response to the concerns regarding limitations of our study is as follows:

      (A) We agree that there is likely residual confounding and information bias due to use of US health insurance claims datasets which do not include information on certain potentially relevant variables. Nonetheless, given the large effect size and precision of our analysis, we feel that our findings support our main conclusion that additional prospective trials appear warranted to further explore whether BPs might confer a measure of protection against severe respiratory infections, including COVID-19. We have added a sentence on the second page of our Discussion (line 859-860) to emphasize this point: "Specifically, there is the potential that key patient characteristics impacting outcomes could not be derived from claims data."

      (B) The progression of this study mirrors the real-world performance of the analysis where we initially used the CCI in matching to control for comorbidity burden on a broader scale. This was our a priori approach. After observing large effect sizes, we performed more stringent matching for sensitivity analyses 1 and 2. Irrespective of the matching strategy chosen, effect sizes remained similar for all outcome parameters. Therefore, we elected to include both the primary analysis and the sensitivity analyses with more stringent matching in order to more transparently show what was done in entirety during our analyses, as we feel it displays all of the efforts taken to identify sources of unmeasured confounding which could have impacted our results.

      (C) We agree that the positivity assumption is a key factor to consider when building comparable treatment cohorts. We also agree that it is important to perform the analysis separately for all patients with an indication for use of BPs and for other anti-osteoporosis medications, as we have done in our analysis of the Osteo-Dx-Rx cohort and Bone-Rx cohort, respectively. However, we did not have sufficient data, a priori, to determine whether BP users would be more similar in their risk of COVID-19 outcomes to non-users or to other users of anti-resorptive medications. In addition, we believe that this specific limitation does not negate our findings in the primary analysis for the following reasons: (1) ‘Type of Outcome’: the outcomes in this study are related to infectious disease and are not direct clinical outcomes of any known treatment benefits of BPs. The clinical benefits being assessed - impact of BP use on COVID-19-related outcomes - were essentially unknown at the time of the study data; this fact mitigates the impact of any violation of the positivity assumption; and (2) ‘Clinical Population’: after propensity score matching, both the BP user and the BP non-user group in the primary analysis mainly consisted of older females (90.1% female, 97.2% age>50), which is the main population with clinical indications for BP use. According to NCHS Data Brief No. 93 (April 2012) released by the CDC, ~75% and 95% of US women between 60-69 and 70-79 suffer from either low bone mass or osteoporosis, respectively, and essentially all women (and 70% of men) above age 80 suffer from these conditions, which often go undiagnosed (https://www.cdc.gov/nchs/data/databriefs/db93.pdf). Women aged 60 and older make up ~75% of our study population (Table 1). Although bone density measurements are not available for non-BP users in the matched primary cohort, there is a high probability that the incidence of osteoporosis and/or low bone mass in these patients was similar to the national average. This justifies the assumption that BP therapy was indicated for most non-BP users in the matched primary cohort. Arguably, for these patients the positivity assumption was not violated.

      (D) We will discuss in detail below the specific issues raised by the reviewer regarding our sensitivity analyses. In general we acknowledge that individual analytical and/or matching approaches may each have their own limitations, but the analyses performed herein were done to test in a systematic fashion the different critical threats to the validity of our initial results in the primary cohort analysis, which were based on a priori-defined methods and yielded a large and robust effect size. Thus, the individual sensitivity analyses should be considered in the greater context of the entire project.

      Specific comments (in order of manuscript):

      Methods:

      Line 158: it is unclear how the authors dealt with patients who died during the follow-up period. The wording suggests they were excluded which would be inappropriate.

      When this study was executed, we were unable to link the patient-level US insurance claims data with patient-level mortality data due to HIPAA concerns. Therefore, line 158 (now 177) defines continuous insurance coverage during the observation period as a verifiable eligibility criterion we used for patient inclusion. It was necessary to disqualify individuals who discontinued insurance coverage for a variety of reasons, e.g. due to loss or change of coverage, relocation etc., but our approach also eliminated patients who died. Appendix 3 (line 2449ff) describes methods we employed post hoc to assess how censoring due to death could have impacted our analyses. We discuss our conclusions from this post hoc analysis in the main text (lines 1053-1058) as follows: "An additional limitation is potential censoring of patients who died during the observation period, resulting in truncated insurance eligibility and exclusion based on the continuous insurance eligibility requirement. However, modelling the impact of censoring by using death rates observed in BP users and non-users in the first six months of 2020 and attributing all deaths as COVID-19-related did not significantly alter the decreased odds of COVID-19 diagnosis in BP users (see Appendix 3)."

      Why did the authors use CCI for propensity matching rather than the individual comorbid conditions? I presume using separate variables will improve the comparability of the cohorts. The authors discuss imbalances in comorbidities as a limitation but should rather have avoided this.

      CCI was the a priori approach defined at the study outset and was chosen due to the widespread use and understanding of this score. The general CCI score was originally planned for matching in order to have the largest possible study population since we did not know how many patients would meet all criteria as well as have an event of interest. After realizing we had adequate sample size to power matching using stricter criteria, we proceeded to perform subsequent sensitivity analyses on more stringently matched cohorts (sensitivity analysis 2).

      Line 301-10: it seems unnecessary to me to adjust for the given covariates while these were already used for propensity score matching (except comorbidities, but see previous comment). The manuscript doesn't give a rationale for why the authors chose this 'double correction'.

      The following language was added to the methods section (lines 325-327): “Demographic characteristics used in the matching procedure were also included in the final outcome regressions to control for the impact of those characteristics on outcomes modelled.”

      The following language was added to the Discussion section regarding the potential limitations of our study (lines 1078-1085): “Another limitation in the current study is related to a potential ‘double correction’ of patient characteristics that were included in both the propensity score matching procedure as well as the outcome regression modelling, which could lead to overfitting of the regression models and an overestimation of the measured treatment effect. Covariates were included in the regression models since these characteristics could have differential impacts on the outcomes themselves, and our results show that the adjusted ORs were in fact larger (showing a decreased effect size) when compared to the unadjusted ORs, which show the difference in effect sizes of the matched populations alone.”

      In causal research a very important assumption is the 'positivity assumption', which means that none of the individuals has a probability of zero or one to be exposed. Including everyone would therefore not be appropriate. My suggestion is to include either all patients with an indication (based on diagnosis) or all that use an anti-osteoporosis (AOP) drug (or one as the primary and the other as the sensitivity analysis) instead of using these cohorts as sensitivity analyses. The choice should in my opinion be based on two aspects: whether it is likely that other AOP drugs have an effect on the COVID-19 outcomes and whether BP users are deemed to be more similar (in their risk of COVID-19 outcomes) to non-users or to other AOP drug users. Or alternatively, the authors might have discussed the positivity assumption and argue why this is not applicable to their primary analysis.

      The following text has been added to the Discussion section addressing potential limitations of our study (lines 987-1009): "Another potential limitation of this study relates to the positivity assumption, which when building comparable treatment cohorts is violated when the comparator population does not have an indication for the exposure being modelled 56. This limitation is present in the primary cohort comparisons between BP users and BP non-users, as well as in the sensitivity analyses involving other preventive medications. This limitation, however, is mitigated by the fact that the outcomes in this study are related to infectious disease and are not direct clinical outcomes of known treatment benefits of BPs. The fact that the clinical benefits being assessed – the impact of BPs on COVID-related outcomes – were essentially unknown clinically at the time of the study data minimizes the impact of violation of the positivity assumption. Furthermore, our sensitivity analyses involving the “Bone-Rx” and “Osteo-Dx-Rx” cohorts did not suffer this potential violation, and the results from those analyses support those from the primary analysis cohort comparisons. Moreover, we note that the propensity score matched BP users and BP non-users in the primary analysis cohort mainly consisted of older females. According to the CDC, ~75% and 95% of US women between 60-69 and 70-79 suffer from either low bone mass or osteoporosis, respectively (https://www.cdc.gov/nchs/data/databriefs/db93.pdf). Essentially all women (and 70% of men) above age 80 suffer from these conditions, which often go undiagnosed. Women aged 60 and older represent ~75% of our study population (Table 1). Although bone density measurements are not available for non-BP users in the matched primary cohort, there is a high probability that the incidence of osteoporosis and/or low bone mass in these patients was similar to the national average. Thus, BP therapy would have been indicated for most non-BP users in the matched primary cohort, and arguably, for these patients the positivity assumption was not violated."
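As the reviewer states it, the positivity assumption requires that no individual has an exposure probability of zero or one. A crude empirical check is to look for covariate strata in which everyone, or no one, is exposed. The sketch below assumes hypothetical stratum labels and toy records; `positivity_violations` is an illustrative helper, not part of the study's analysis code, and an empty stratum in a finite sample does not by itself prove a structural violation.

```python
from collections import defaultdict


def positivity_violations(records):
    """Return strata whose empirical exposure probability is exactly 0 or 1.

    `records` is an iterable of (stratum, exposed) pairs, where `exposed`
    is True for treated individuals (e.g. BP users) and False otherwise.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_unexposed, n_exposed]
    for stratum, exposed in records:
        counts[stratum][1 if exposed else 0] += 1
    return sorted(s for s, (n0, n1) in counts.items() if n0 == 0 or n1 == 0)


# Hypothetical toy data: the stratum names and memberships are made up.
records = [
    ("female_60plus", True),
    ("female_60plus", False),
    ("male_under_40", False),
    ("male_under_40", False),
]
violations = positivity_violations(records)
```

Here only the stratum with no exposed individuals is flagged, mirroring the concern that part of an unrestricted comparator population may have no indication for the drug.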

      Sensitivity Analysis 3: Association of BP-use with Exploratory Negative Control Outcomes: what is the implicit assumption in this analysis? I think the assumption here is that any residual confounding would be of the same magnitude for these outcomes. But that depends on the strength of the association between the confounder and the outcome which need not be the same. Here, risk avoiding behavior (social distancing) is the most obvious unmeasured confounder, which may not have a strong effect on other health outcomes. Also it is unclear to me why acute cholecystitis and acute pancreatitis-related inpatient/emergency-room were selected as negative controls. Do the authors have convincing evidence that BPs have no effect on these outcomes? Yet, if the authors believe that this is indeed a valid approach to measure residual confounding, I think the authors might have taken a step further and present ORs for BP → COVID-19 outcomes that are corrected for the unmeasured confounding (e.g. if OR BP → COVID-19 is ~ 0.2 and OR BP → acute cholecystitis is ~ 0.5, then 'corrected' OR of BP → COVID-19 would be ~ 0.4).

      We appreciate the reviewer’s thoughtful comments regarding the differential strength of the association between unmeasured confounders and outcome. We had initially selected acute cholecystitis and pancreatitis-related inpatient and emergency room visits as negative controls because we deemed them to be emergent clinical scenarios that should not be impacted by risk-avoiding behavior. However, upon further search, we identified several publications that suggest a potential impact of osteoporosis and/or BPs on gallbladder diseases (https://doi.org/10.1186/s12876-014-0192-z; http://dx.doi.org/10.1136/annrheumdis-2017-eular.3900), thus calling the validity of our strategy into question. We therefore agree that the designation of negative control outcomes is problematic and adds relatively little to the overall story. Accordingly, we have removed these analyses from the revised manuscript.

      Sensitivity Analysis 4: Association of BP-use with Exploratory Positive Control Outcomes: this doesn't help me be convinced of the lack of bias. If previous researchers suffered from residual confounding, the same type of mechanisms apply here. (It might still be valuable to replicate the previous findings, but not as a sensitivity analysis of the current study).

      We agree that the same residual confounding in previous research papers could be present in our study. Nonetheless, it was important to assess whether our analysis would be potentially subject to additional (or different) confounding due to the nature of insurance claims data as compared to the previous electronic record-based studies. Therefore, it was relevant to see if previous findings of an association between BP use and upper respiratory infections are observable in our cohort.

      The second goal of sensitivity analysis #4 (now #3) was to see whether associations could be found on different sets of respiratory infection-based conditions, both during the time of the pandemic/study period as well as during the pre-pandemic time, i.e. before medical care in the US was significantly impacted by the pandemic. In light of these considerations, we feel that sensitivity analysis 4 adds value by showing consistency in our core findings.

      Sensitivity Analysis 5: Association of Other Preventive Drugs with COVID-19-Related Outcomes: Same here as for sensitivity analysis 3: the assumption that the association of unmeasured confounders with other drugs is equally strong as for BPs. Authors should explicitly state the assumptions of the sensitivity analyses and argue why they are reasonable.

      The following sentence was added to the Discussion section (lines 1019-1020): "These analyses were based on the assumption that the association of unmeasured confounders with other drugs is comparable in magnitude and quality as for BPs."

      Results: The data are clearly presented. The C-statistic / ROC-AUC of the propensity model is missing.

      Unfortunately, a significant amount of time has passed since execution of our original analysis of the Komodo dataset by our co-authors at Cerner Enviza. Our ability to perform follow-up studies with the Komodo dataset (which is exclusively housed on Komodo's secure servers) has since become limited because business arrangements between these companies have been terminated, and the pertinent statistical software is no longer active. This issue prevents us from obtaining the original C-statistic and ROC-AUC information. However, we were able to extract the actual propensity scores themselves for the base cohort matching (BP-users versus non-users). The table below illustrates that the distribution of propensity scores for the base cohort match ranged from <0.01 to a max of 0.49, with 81.4% of patients having a propensity score of 10-49%, and 52.9% of patients having a propensity score of 20-49%. This distribution is unlikely to reflect patients who had a propensity score of either all 0 or all 1.
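
      For reference, the C-statistic requested by the reviewer (equivalent to the ROC-AUC) is the probability that a randomly chosen exposed patient receives a higher propensity score than a randomly chosen unexposed patient. The sketch below shows one way it could be computed if per-patient scores and exposure labels were available. The function name `c_statistic` and the six toy values are hypothetical; they are not drawn from the Komodo dataset and do not estimate the statistic for this study.

```python
def c_statistic(scores, exposed):
    """Fraction of (exposed, unexposed) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, e in zip(scores, exposed) if e]
    neg = [s for s, e in zip(scores, exposed) if not e]
    if not pos or not neg:
        raise ValueError("need at least one exposed and one unexposed patient")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Toy data only (hypothetical scores in the 0-0.49 range, 1 = BP user).
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.49]
toy_exposed = [0, 0, 1, 0, 1, 1]
print(c_statistic(toy_scores, toy_exposed))
```

      A value near 0.5 would indicate that the model barely separates users from non-users, while a value near 1 would indicate near-complete separation and little overlap for matching.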

      Discussion:

      When discussing other studies the authors reduce these results to 'did' or 'did not find an association'. Although commonly practiced, this doesn't do justice to the statistical uncertainty of both positive and negative findings. Instead I encourage the authors to include effect estimates and confidence intervals. This is particularly relevant for studies that are inconclusive (i.e. lower bound of confidence interval not excluding a clinically relevant reduction while upper bound not excluding a null effect).

      We appreciate the reviewer’s suggestion and have added this information on p.21/22 in the Discussion.

      Line 1145 "These retrospective findings strongly suggest that BPs should be considered for prophylactic and/or therapeutic use in individuals at risk of SARS-CoV-2 infection." I agree for prophylactic use but do not see how the study results suggest anything for therapeutic use.

      We have removed “and/or therapeutic use” from this sentence (lines 1088-1090).

      The authors should discuss the acceptability of using BPs as preventive treatment (long-term use in persons without osteoporosis or other indication for BPs). This is not my expertise but I reckon there will be little experience with long-term inhibiting osteoblasts in people with healthy bones. The authors should also discuss what prospective study design would be suitable and what sample size would be needed to demonstrate a reasonable reduction. (Say 50% accounting for some residual confounding being present in the current study.)

      Although BPs are also used in pediatric populations and in patients without osteoporosis (for example, patients with malignancy), we do recognize the lack of long-term safety data in use of BPs as preventative treatments. We tried to partially address this concern in our sub-stratified analysis of COVID-19 related outcomes and time of exposure to BP. Reassuringly, we observed that patients newly prescribed alendronic acid in February 2020 also had decreased odds of COVID-19 related outcomes (Figure 3B), suggesting that the duration of BP treatment may not need to be long-term. This was further discussed in the last paragraph of our Discussion where we state that "BP use at the time of infection may not be necessary for protection against COVID-19. Rather, our results suggest that prophylactic BP therapy may be sufficient to achieve a potentially rapid and sustained immune modulation resulting in profound mitigation of the incidence and/or severity of infections by SARS-CoV-2."

      We agree that a future prospective study on the effect of BPs on COVID-19 related outcomes will require careful consideration of the study design, sample size, statistical power etc. However, we feel that a detailed discussion of these considerations is beyond the scope of the present study.

      The authors should discuss the fact that confounders were based on registry data which is prone to misclassification. This can result in residual confounding.

      Some potential sources of misclassification have been discussed on lines 932-948. In addition, the following language was added (lines 970-985): "Additionally, limitations may be present due to misclassification bias of study outcomes due to the specific procedure/diagnostic codes used as well as the potential for residual confounding occurring for patient characteristics related to study outcomes that are unable to be operationalized in claims data, which would impact all cohort comparisons. For SARS-CoV-2 testing, procedure codes were limited to those testing for active infection, and therefore observations could be missed if they were captured via antibody testing (CPT 86318, 86328). These codes were excluded a priori due to the focus on the symptomatic COVID-19 population. Furthermore, for the COVID-19 diagnosis and hospitalization outcomes, all events were identified using the ICD-10 code for lab-confirmed COVID-19 (U07.1), and therefore events with an associated diagnosis code for suspected COVID-19 (U07.2) were not included. This was done to have a more stringent algorithm when identifying COVID-19-related events, and any impact of events identified using U07.2 is considered minimal, as previous studies of the early COVID-19 outbreak have found that U07.1 alone has a positive predictive value of 94%55, and for this study U07.1 captured 99.2%, 99.0%, and 97.5% of all COVID-19 patient-diagnoses for the primary, “Bone-Rx”, and “Osteo-Dx-Rx” cohorts, respectively."

    1. Author Response:

      Evaluation Summary:

      The study provides evidence that specific transcriptional responses may underpin the observation that metabolic rates often scale inversely with body mass. The conclusions are supported by direct measurement of metabolic fluxes in mouse and rat livers, although generalizations to other settings remain to be rigorously tested. The study has broad implications for researching and studying animal metabolism and physiology.

      We thank the reviewers and editors for this summary. We are pleased that they agree that the conclusions “are supported by direct measurements of metabolic fluxes in mouse and rat livers,” and that “the study has broad implications for researching and studying animal metabolism and physiology.” While we fully agree that “generalizations to other settings remain to be rigorously tested,” we have now added a comment comparing our measured liver fluxes in rodents to those recently measured in people:

      “While we did not have the capacity to measure liver fluxes in larger mammals in the current study, endogenous glucose production, VPC, and VCS previously measured using PINTA were 50-60% lower in overnight fasted humans than in rats (Petersen et al., 2019), assuming a liver size of 1,500 g in humans.”

      Reviewer #1 (Public Review):

      It is well established that the energy expenditure and metabolic rate of metazoan organisms scale inversely to body mass, based on the measurement of oxygen consumption and caloric intake. However, the underlying regulatory mechanisms for this observation are poorly defined. To investigate whether metabolic scaling is associated with reduced levels of transcription of metabolic genes in larger animals, the authors reviewed existing transcriptional datasets from liver tissues of five animals (mice, rats, monkeys, humans and cattle) with a 30,000-fold range in average adult body weights. They identified a number of metabolic genes in different pathways of central carbon metabolism whose expression inversely scaled with body size, a majority of which required oxygen, NAD/H or ATP/ADP. Metabolic flux studies on intact liver sections, as well as in live animals also revealed decreased liver metabolic fluxes in rats compared to mice. Interestingly, these differences were not observed in primary hepatocyte cultures, indicating that metabolic scaling is primarily regulated by cell-extrinsic factors and tissue context. These are interesting findings and highlight the importance of measuring metabolic processes in vivo. The measurement of cellular metabolic fluxes in different contexts (cultured, ex vivo tissue sections and live animals) is a major strength of this study. The lack of direct evidence that enzyme levels correlate with mRNA, and the absence of both transcriptional and enzyme activity measurements in cultured cells are potential weaknesses.

      We are delighted, and thank Reviewer #1 for stating that “These are interesting findings and highlight the importance of measuring metabolic processes in vivo” and that “The measurement of cellular metabolic fluxes in different contexts (cultured, ex vivo tissue sections and live animals) is a major strength of this study.” In addition, we sincerely thank the reviewer for raising important weaknesses related to the importance of proteomics, transcriptional and enzyme activity measurements in cultured cells, and are pleased to have had the opportunity to add data to address each of these points.

      Reviewer #2 (Public Review):

      Akingbesote et al. aim to determine the molecular basis of metabolic scaling - the phenomenon that metabolic rates scale inversely with (0.75) body mass. More specifically, they test the hypothesis that expression of genes involved in the regulation of oxygen consumption and substrate metabolism as well as respective fluxes provide a molecular basis for metabolic scaling across five species: mice, rats, monkeys, humans, and cattle. To this end, Akingbesote et al. use publicly available transcriptomics data and identify genes that show decreasing (normalized) expression with increasing mass of organisms. This descriptive analysis is followed by discussing a few relevant examples and (KEGG) pathway enrichment analysis. The authors then used their published PINTA approach with data from their experiments with mice and rats to provide estimates of selected cytosolic and mitochondrial fluxes in vitro, ex vivo, and in vivo; these estimates are then employed in determining if metabolic fluxes scale. The conclusion drawn from these analyses is that estimates of selected fluxes do not differ in vitro between plated hepatocytes of mice and rats, but that differences can be detected using metabolic flux analysis in vivo. As a result, in vivo flux profiling is more relevant to assessing metabolic scaling.
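
      To give a sense of the magnitude involved, the sketch below works through the classical allometric relationship, assuming that whole-body metabolic rate is proportional to body mass raised to the power 0.75, so that the rate per unit mass falls as mass raised to the power −0.25. The function name is hypothetical, the 30,000-fold input is the body-weight range cited by Reviewer #1, and the output is a prediction of the assumed power law, not a measurement from the study.

```python
def mass_specific_rate_ratio(mass_fold_difference: float, exponent: float = 0.75) -> float:
    """Predicted fold difference in metabolic rate per unit mass.

    Whole-body rate ~ M**exponent, so rate per unit mass ~ M**(exponent - 1).
    The smaller animal is therefore higher by mass_fold_difference**(1 - exponent).
    """
    if mass_fold_difference <= 0:
        raise ValueError("mass_fold_difference must be positive")
    return mass_fold_difference ** (1.0 - exponent)


# 30,000-fold range in adult body weight across the five species.
fold = mass_specific_rate_ratio(30_000)
print(f"{fold:.1f}-fold higher rate per unit mass in the smallest species")
```

      Under these assumptions a 30,000-fold difference in body mass corresponds to roughly a 13-fold difference in mass-specific metabolic rate, which is the scale of effect that any transcriptional or flux-level explanation would need to account for.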

      The conclusions are only in part supported by the data and clarifications are needed both with respect to the analysis of transcriptomics data as well as flux estimates:

      1. In looking for scaling in gene expression, the authors rely on the assumption that mRNA expression correlates well with protein abundance (citing Schwanhäusser et al., 2011); however, transcripts explain about 40% of variance in protein abundance (this observation holds across multiple species). Hence, the identified patterns based on the transcript data may have little implications for protein abundance or flux.

      We agree that, despite the data in the cited publication, gene expression should not be assumed to directly correlate with protein expression, and the two certainly cannot be assumed – without data – to equate to metabolic flux. We have removed the citation, and replaced it with proteomics data. Half of the genes available in the proteomics analysis which were found to correlate negatively with body size in our liver transcriptomics analysis also correlated negatively with body size at the level of liver protein expression:

      Author Response Figure 1

      Additionally, we analyzed available proteomics assessment of left ventricular expression of the three proteins observed to correlate negatively with body mass in the liver proteomics analysis. One of the three genes observed to correlate negatively with body mass in the proteomics analysis of liver, GLUL, was also shown to correlate negatively with body mass when its expression was assessed in the heart:

      Author Response Figure 2

      However, as discussed in our response to the editor’s point 1, we are limited by the available data, and fully acknowledge that without the capacity to statistically compare groups, we cannot make conclusive statements regarding the proteomics data.

      Additionally, we have substantially softened the description of the implications of the transcriptomics data in the Abstract, Introduction, and Discussion, including:

      - Editing “Together, these data reveal that metabolic scaling extends beyond oxygen consumption to numerous other metabolic pathways, and is likely regulated at the level of gene and protein expression, enzyme activity, and substrate supply” to add the parameters in red.
      - Removing “Considering that mRNA expression correlates well with protein expression under basal conditions, especially for metabolic genes (Schwanhäusser et al., 2011), we used mRNA expression as a proxy for the relative abundance of metabolic enzymes.”
      - Added “Further analysis of liver proteomics revealed that approximately half of the genes in liver that scaled at the transcriptional level also scaled at the level of protein expression,” now linking gene expression to protein expression to metabolic flux.
      - Editing “Numerous metabolic genes…followed the pattern of metabolic scaling, and informed our isotope tracer based in vitro and in vivo metabolic flux studies” to “Numerous metabolic genes…followed the pattern of metabolic scaling. Further analysis of liver proteomics revealed that approximately half of the genes in liver that scaled at the transcriptional level also scaled at the level of protein expression. To determine if gene and protein expression would correlate with scaling at the level of metabolic flux, we performed a comprehensive assessment of liver metabolism in vivo and in vitro using modified Positional Isotopomer NMR Tracer Analysis (PINTA)…”
      - Edited “Taken together, this study demonstrates systems regulation of metabolic scaling: gene expression in livers showed that scaling occurs to regulate oxygen consumption and substrate supply, isotope-based tracer studies in mice and rats demonstrated the mechanistic function of these enzymes in vivo which was only apparent in the living organism rather than plated cells” to “Taken together, this study demonstrates systems regulation of the ordering of metabolic fluxes according to body size, and provides unique insight into the regulation of metabolic flux across species.”
      - Removed “Interestingly, the scaling of GPT and ADIPOR1 further suggest that there is dependence on extra-hepatic organs in the scaling of in vivo gluconeogenesis and fatty acid oxidation: that is, skeletal muscle supply of alanine for the liver mediated glucose-alanine cycle and adipose tissue-derived adiponectin signaling. These findings also suggests that the scaling of mitochondrial mass (Porter and Brand, 1995) or mitochondrial proton leak (Porter and Brand, 1993) cannot fully explain metabolic scaling.”
      - Added “However, it should be noted that metabolic scaling cannot fully be explained at the transcriptional level, because many rate-limiting enzymes in the metabolic processes measured in vivo did not scale at the transcriptional level, and only approximately half of genes that scaled at the level of mRNA scaled at the level of protein. Thus, it is likely that both transcriptional and other mechanisms – such as enzyme activity – are responsible for variations in metabolic flux per unit mass, inversely proportionally to body size. Additionally, the currently available data do not allow us to assess whether expression of certain isoforms of key metabolic enzymes scale differentially across species.”

      2. While the procedure used to identify transcripts whose expression scales is clearly described, focusing the enrichment on KEGG pathways can only identify metabolic genes that scale. It would be informative and instructive to investigate if and to what extent genes involved in non-metabolic processes, that affect metabolic rates, also scale.

      We acknowledge that focusing the enrichment on KEGG pathways does enrich for the identification of metabolic processes that scale. However, we would respectfully submit that because this manuscript focuses on metabolic scaling, this seems to be the most appropriate setting in which to conduct the analysis. New data added in this revision demonstrate that three metabolic enzymes that scaled in the transcriptomics analysis also scale relative to β-actin, further suggesting that the inverse correlation of gene expression with body weight is primarily confined to metabolic processes:

      Author Response Figure 3

      In addition, we measured the expression of two structural proteins (collagenase 3 [Mmp3] and Larp6) outside of metabolic pathways, relative to β-actin (Actb), and found that neither was differentially expressed relative to actin in mice versus rats:

      Author Response Figure 4

      We recognize that these data may be confounded by the fact that Actb expression could potentially be different in mice versus rats; however, the fact that metabolic genes scale relative to β-actin (Actb) expression shows that global mRNA scaling is unlikely to be the sole cause of the metabolic scaling phenotype.

      3. The result on flux ratios and absolute fluxes, based on the equations in Table S1, rely on certain assumptions (e.g. metabolic and isotopic steady state, among the others listed in PINTA); the current presentation does not ensure that all assumptions of PINTA are met in the present setting, so the estimates may be biased, leading to alternative explanations for the observed differences in vivo or the lack thereof in vitro.

      We fully agree with the reviewer that it is critical to ensure that key assumptions are met when presenting tracer data, and thank them for raising this important point. Thus, we have now added data demonstrating that plasma m+1, m+2, and m+7 glucose are in steady state at 100 min of the 120 min in vivo tracer infusion:

      Author Response Figure 5

      Additionally, we now show that blood glucose and plasma lactate concentrations have reached steady state as well:

      Author Response Figure 6

      With these data, we validate that the mice and rats are at metabolic and isotopic steady state by the end of the 120 min tracer infusion. We recognize that we have not validated that liver m+1 and m+2 glucose are at steady state, as that would require two additional groups of mice and rats (to sacrifice at 100 and 110 min, compared to the animals euthanized after 120 min of infusion) and introduce additional variability. Additionally, plasma m+1 and m+2 glucose come from endogenous glucose production from 13C tracer, so if m+1 and m+2 glucose are in steady state in plasma, they must be in steady state in liver.

      An additional assumption is that liver glycogen is effectively depleted after the overnight fast utilized in these studies. We have now verified this assumption by comparing fed and overnight fasted liver glycogen concentrations, and detect negligible glycogen after the fast in both rats and mice:

      Author Response Figure 7

      Additionally, we validated isotopic steady state in our hepatocytes incubated in 3-13C lactate. As expected in plated cell studies, cells reached steady state in both [13C] lactate enrichment and m+1 and m+2 glucose enrichment within 60 min. Because net glucose production is measured using the accumulation of glucose, we do not expect – and did not measure – glucose concentration at steady state, but we did confirm that the accumulation of glucose is linear throughout the 6 hr incubation (thus confirming that 6 hr is a reasonable endpoint):

      Author Response Figure 8

      We very respectfully submit that after 8 prior publications using PINTA called as such (PMID 28986525, 29307489, 29483297, 31545298, 31578240, 32610084, 32132708, 32179679), in addition to several prior publications that utilized PINTA without the acronym, it would not be the most responsible use of animals to try to prove in this manuscript that PINTA is a legitimate means of assessing substrate fluxes. However, we thank the reviewer for raising the important point regarding assumptions of the method, thereby allowing us to insert data verifying that the key assumptions are met.

      4. The findings regarding the flux estimates seem to be fully determined by observed differences in gluconeogenesis (as demonstrated in Fig. 4). Usage of more involved approaches for metabolic flux analysis may provide wider-reaching conclusions beyond selected fluxes that appear fully coupled.

      Fluxes are back-calculated from total glucose production so that methodologically they are “coupled”, but this does not mean that glucose production will always mirror other fluxes. For example, in our 2015 manuscript using PINTA – although we had not yet named the method “PINTA” – we measured decreased endogenous glucose production (EGP) simultaneously with increased citrate synthase flux (mitochondrial oxidation, VTCA, which we have subsequently begun to call VCS in recognition of the fact that different reactions in the TCA cycle can proceed at different rates, but the calculation is the same) (Perry et al. Science 2015).

      Similarly, another study demonstrated that the same mitochondrial uncoupler (CRMP) increased VCS while EGP decreased in nonhuman primates (Goedeke et al. Sci. Transl. Med. 2019).

      These data demonstrate that, while fluxes are back-calculated from EGP with PINTA, the method is fully capable of detecting differences in oxidative fluxes without, or in the opposite direction of, changes in EGP. We very respectfully submit that we are not aware of what a more “involved” approach for metabolic flux analysis would entail, and that after the 8 prior publications listed in response to the previous point, we are not trying to validate PINTA in the current manuscript.

      Reviewer #3 (Public Review):

      This manuscript addresses a fundamental aspect of mammalian biology referred to as scaling, in which metabolic processes calibrate to the size of the organism. Longstanding observations related to scaling have been established based on rates of oxygen consumption. This manuscript extends these observations to gene expression and metabolic fluxes in order to discover the metabolic pathways that scale with body mass. The analyses are focused on the liver, which is the metabolic hub of the organism. Gene expression levels gleaned from available databases for organisms of varied sizes are analyzed and queried for scaling based on body mass. This analysis reveals that scaling is mainly a characteristic of metabolic genes. These data inform metabolic flux studies in cultured cells, liver slices and whole organisms. These studies demonstrate that scaling of metabolic fluxes occurs, but not out of the context of the whole organism or intact liver (in the form of liver slices). Scaling of metabolic fluxes is not observed in cultured hepatocytes. Overall, this is an interesting line of inquiry. The data are largely correlative in nature but add important texture to traditional characterization of oxygen consumption rates. The application of flux studies is a particular strength because these reflect the true metabolic processes. Enthusiasm was tempered by certain claims that extend beyond data (e.g., the title that suggests that metabolic scaling applies to tissues other than the liver, which was studied), as well as low numbers of biological replicates in some experiments, studies conducted in a single gender and a writing style that includes excessive technical jargon.

      We thank the reviewer for their time spent evaluating the paper, and for their very helpful comments. We agree that “the application of flux studies is a particular strength because these reflect the true metabolic processes.” We agree that the study was focused on liver, although the previous iteration did include a small amount of white adipose tissue flux data, and have edited the manuscript to make clear that this is a liver-focused manuscript. We have now added specific numbers to each figure legend, and have also added in vivo flux measurements in female rats and mice. Additionally, the manuscript has been edited extensively. We have further detailed these modifications in our point-by-point responses to the reviewer.

    1. Author Response

      Reviewer #1 (Public Review):

      Bornstein and colleagues address an important question regarding the molecular makeup of the different cellular compartments contributing to the muscle spindle. While work focusing on single components of the spindle in isolation - proprioceptors, gamma-motor neurons, and intrafusal muscle fibres - has been recently published, a comprehensive analysis of the transcriptome and proteome of the spindle was missing. This study fills an important gap, considering how local translation and protein synthesis can affect the development and function of such a specialised organ.

      The authors combine bulk transcriptome and proteome analysis and identify new markers for neuronal, intrafusal, and capsule compartments that are validated in vivo and are shown to be useful for studying aspects of spindle differentiation during development. The methodology is sound and the conclusions in line with the results.

      We thank the reviewer for highlighting the importance of our study.

      I feel a bit more analysis regarding the specificity and developmental expression profiles of the identified markers would be a great addition. In particular:

      • Are any of the proprioceptive sensory neurons markers specific for fibres innervating the muscle spindles or also found in Golgi tendon organs?

      We thank the reviewer for the important question, following which we performed two additional analyses. First, in order to study the specificity of the spindle afferent genes we identified, we examined the overlap between our list of 260 potential proprioceptive neuron genes and markers for the three proprioceptive neuron subtypes (Ia, II and Ib) identified by Wu and colleagues (Wu et al. 2021). As shown in the newly added Figure 1 – figure supplement 2F, while we found many genes that are common to all subtypes, 69 genes exclusively overlapped with subtype markers (22 genes with type Ia neurons, 45 genes with type II neurons and 2 genes with both; lists are shown in Supplementary File 4). These results suggest that the 69 genes are expressed by muscle spindle afferents and not by GTO afferents.
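
      The overlap logic described above can be sketched with plain set operations. This is an illustrative sketch only, not necessarily the procedure used for the figure: the function name `spindle_specific` and the placeholder gene symbols are hypothetical, and the partition simply assumes that a spindle-specific candidate is one that matches type Ia and/or type II markers but no type Ib (GTO afferent) marker.

```python
def spindle_specific(candidates, type_ia, type_ii, type_ib):
    """Partition candidates that match Ia and/or II markers but no Ib marker."""
    spindle_only = (candidates & (type_ia | type_ii)) - type_ib
    return {
        "Ia only": spindle_only & (type_ia - type_ii),
        "II only": spindle_only & (type_ii - type_ia),
        "Ia and II": spindle_only & type_ia & type_ii,
    }


# Placeholder symbols; real inputs would be the 260 candidates and the
# published subtype marker lists.
candidates = {"geneA", "geneB", "geneC", "geneD", "geneE"}
type_ia = {"geneA", "geneC", "geneE"}
type_ii = {"geneB", "geneC", "geneE"}
type_ib = {"geneE", "geneF"}

groups = spindle_specific(candidates, type_ia, type_ii, type_ib)
for label, genes in groups.items():
    print(label, sorted(genes))
```

      With the real lists, the sizes of the three groups would correspond to the 22, 45 and 2 genes reported above, which together give the 69 spindle-afferent candidates.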

      Second, to study the specificity of our validated markers, we examined the expression of ATP1a3, VCAN and GLUT1, marking proprioceptive neurons, extracellular matrix and outer capsule, respectively, in GTOs. Results showed that all three markers were also detected in the different tissues composing the GTOs (newly added Figure 3 – figure supplement 3, below). As ATP1a3 is not in the list of 69 unique markers, this analysis verified that it is expressed by all proprioceptive neurons. The expression of both VCAN and GLUT1 in GTO capsules highlights the similarity between the capsules of the two proprioceptors.

      • Along the same lines, are any of the gamma motor neuron markers also found in alpha motor neurons?

      We thank the reviewer for raising this issue. Following the reviewer’s question, we conducted a detailed analysis of the expression of potential γ motor neuron genes. To this end, we first generated a list of α motor neuron genes in our data by performing ranked GSEA using published expression profiles of these neurons (Blum et al., 2021). Then, we compared the three lists of neuronal genes, i.e. γ motor neurons, α motor neurons and proprioceptive neurons (newly added Figure 1 – figure supplement 2G), and found an overlap between the three lists. Nonetheless, we also identified 40 spindle genes that are specific to γ motor neurons (Figure 1 – figure supplement 2G and Supplementary File 4) and, therefore, are potential markers for these neurons.

      • How early is expression of ATP1A3 found in neurons at the spindle or in fibres starting to innervate the muscle? A couple of late embryonic timepoints would be great.

      We thank the reviewer for this suggestion. We performed late embryonic (E15.5-E17.5) staining for ATP1a3, which showed its expression as early as E15.5 (new Figure 4 – figure supplement 1).

      • Given that the approach used allows one to gain insight into whether local translation plays a major role in the differentiation of the spindle, it would be interesting to assess whether the proprioceptor and gamma motor neuron markers identified are also found in the cell body or exclusively at the spindle.

      The reviewer raises an interesting question about local translation of the neuronal genes. Going through the literature, several lines of evidence indicate that the genes expressed at the neuronal end are also expressed in the neuron soma. In a study on the retinal ganglion cell translatome, Holt and colleagues found that the axonal translatome is a subset of the significantly larger somal translatome (Shigeoka et al., Cell, 2016). Similarly, a study by Schuman and colleagues that compared the translatome of neuronal cell bodies, dendrites, and axons of rat hippocampal neurons showed that many common genes are translated, albeit at different levels (Glock et al., PNAS, 2021). Finally, following the reviewer’s suggestion, we studied the expression of ATP1a3 in the DRG, and found it to be expressed there as well (Figure L1). Thus, we predict that the markers we found in the neuron endings are likely also expressed in the soma. While this issue is very interesting, we believe that further validation of our assumption exceeds the scope of this study.

      Figure L1. ATP1a3 expression in the DRG. Confocal images of DRG sections from adult PValb-Cre;tdTomato mice stained for ATP1a3 (magenta). Scale bars represent 50 μm.

      Altogether, this is a novel and important work that will benefit scientists studying the neuromuscular and musculoskeletal systems by pushing the field toward a holistic understanding of the muscle spindle. These datasets in combination with the previous ones can be used to develop new genetic and viral strategies to study muscle spindle development and function in healthy and pathological states by analysing the roles and relative contributions of different components of this fascinating and still mysterious organ.

      We again thank the reviewer for highlighting the importance of our study.

      Reviewer #2 (Public Review):

      The data presented are of high quality. Through complementary experiments involving the isolation of masseter muscle spindles, the authors perform RNA-seq and proteomic analysis, and identify genes and proteins that are differentially expressed in the muscle spindle versus the adjacent muscle fiber, and proteins that accumulate specifically in capsule cells and nerve endings. These data, while essentially descriptive, provide important information about the developmental framework of the sensory apparatus present in each muscle that accounts for its tension/contraction state. The data presented thus allow for a better characterization of muscle spindles and provide the community with a set of new markers for better identification of these structures. Analysis of the expression pattern of the Tomato reporter in transgenic animals under the control of Piezo2-CRE, Gli1-CRE and Thy1-YFP reporter reinforces the findings and the specificity of the expression pattern of the specific genes and proteins identified by the multi-omics approach and further validated by immunohistochemistry.

      We thank the reviewer for the positive and encouraging feedback.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Marmor and colleagues reanalyze a previously published dataset of chronic widefield Ca2+ imaging from the dorsal cortex of mice as they learn a go/no-go somatosensory discrimination task. Comparing hit trials that have a distinct history (i.e. are preceded by distinct trial types), the authors find that hit trials preceded by correct rejections of the nontarget stimulus are associated with larger subsequent neural responses than trials preceded by other hits, across the cortex. The authors analyze the time course over which this effect emerges in the barrel cortex (BC) and the rostrolateral visual area (RL), and find that its magnitude increases as the animals become expert task performers. Although the findings are potentially interesting, I, unfortunately, believe that there are important methodological concerns that could put them into question. I also disagree with the rationale that singles out BC and RL as being especially important for the emergence of trial history effects on neural responses during decision-making. I detail these points below.

      1) The authors did not perform correction for hemodynamic contamination of GCaMP fluorescence. In widefield imaging, blood vessels divisively decrease neural signals because they absorb green-wavelength photons, which could lead to crucial confounds in the interpretation of the main results because of neurovascular coupling, which lags neural activity by seconds. For example, if a reward response from the previous trial is associated with a lagged hemodynamic contamination that artificially decreases the signal in the following trial, one could get artificially higher activity in trials that were not preceded by a reward (i.e. CR), which is what the authors observed. Ideally, the experiments would be repeated with proper hemodynamic correction, but at the very least the authors should try to address this with control analyses.

      Done. We essentially repeated the experiment with proper hemodynamic correction, and the trial history results were maintained. Please see point 1 above for more details (Figures S4 and S5). In addition to hemodynamic controls, we also present novel two-photon single cell data with similar results in Figure S6. We also added a dedicated section for this in the Methods section (pg. 12).

      For example, what is the time course of reward-related responses in BC and elsewhere?

      In general, and specifically in BC, reward related responses return to baseline up to 5 seconds after the start of the reward period and at least 5 seconds before the stimulus presentation of the next trial. In the novel experiments we even extended the baseline period by an additional 2 seconds just in case. Trial history information was still present with an extended inter-trial interval.

      The text now reads (pg. 4): "We further report that responses during the reward period in cortex and specifically in BC went back to baseline 4-5 seconds after the start of the reward period and 6-8 seconds before the presentation of the next stimulus (total inter-trial interval ranged between 10-12 seconds)."

      Do hemodynamics artifacts have a trial-by-trial correlation with the subsequent trial history effect?

      We have now done the proper hemodynamic control (Figure 2) and we did not find a strong effect of hemodynamic responses on trial history information.

      What is the learning time course of reward responses?

      Responses during the reward period as a function of learning were not significantly modulated. We further show the whole learning profile for BC response during the reward period in Author response image 1.

      Author response image 1.

      Response in BC averaged during the reward period (2-4 sec after texture stop) as a function of learning for each mouse separately.

      The text now reads (pg. 4): "In addition, responses in BC during the reward period were not consistently modulated as a function of learning (p>0.05; Wilcoxon signed-rank test between naïve and expert, BC response averaged during the reward period, 2-4 seconds after stimulus onset; n=7 mice). Taken together, we find that direct responses from the reward period do not affect history-related responses during the next trial."

      Note that I don't believe the FA-Hit condition analysis that the authors have already presented provides adequate control, as punishment responses are also pervasive in the cortex and therefore suffer from the same interpretational caveat. Unfortunately, I believe this is a serious methodological issue given the above. However, I will proceed to take the reported results at face value.

      We hope that our additional hemodynamic control analyses are satisfactory.

      2) The statistics used to assess the effect of trial history over learning are inadequate (e.g., Fig 2b). The existence of a significant effect in one condition (e.g., CR-Hit vs. Hit-Hit in expert) but not in another (e.g., same comparison in naive) does not imply that these two conditions are different. This needs to be tested directly. Moreover, the present analysis does not account for the fact that measures across learning stages are taken from the same animals. Thus, the appropriate analysis for these cases would be to first use a two-way ANOVA with repeated measures with factors of trial history and learning stage (or equivalent non-parametric test) and then derive conclusions based on post hoc pairwise tests, corrected for multiple comparisons.

      Done. We performed a two-way ANOVA with repeated measures as suggested and found significant history and learning effects, along with a significant interaction effect, for BC.

      The text now reads (pg. 4): "This difference was significant during the stim period in learning and expert phases across mice (Fig. 2b; 2-way ANOVA with repeated measures; DF (1-6) F=51 p<0.001, DF (2-12) F=18 p<0.001, DF(2-12) F=5 p<0.05 for trial history, learning and the interaction between trial history and learning; Post hoc Tukey analysis p<0.05 for trial history in learning and expert phases; p>0.05 in the naïve phase)."

      3) I am not convinced that BC and RL are especially important for trial-history-dependent effects. Figures 4 and 5 suggest that this modulation is present across the cortex, and in fact, the difference between CR-Hit and Hit-Hit in some learning stages appears stronger in other areas. BC and RL do have the highest absolute activity during the epochs in Figs 4 and 5, but I would argue that this is likely due to other aspects of the task (e.g., touch) and therefore is not necessarily relevant to the issue of trial history.

      Done. First, we would like to point out that RL during the pre period displays the largest difference between the CR-Hit and Hit-Hit conditions (Fig. 5c bottom). Second, we now show difference maps (i.e., activity in CR-Hit minus Hit-Hit) which clearly show a positive activity patch in BC during the stim period for 5 out of the 7 mice (Fig. S10a). Example maps also highlight RL during the pre period (Fig. S10b). We note that activity patches spread somewhat to other areas and also vary slightly across mice. This is why the grand average may partly average out trial history information. Taken together, we strongly feel that during the pre period, trial history information emerges in RL (and adjacent posterior association areas) and shifts towards BC during the stim period.

      Nevertheless, we agree with the reviewer that other areas (that do not necessarily display high activity) may encode trial history information and we now clearly report this in the text (pg. 5): "We note that other areas, e.g., different association areas, also encoded history-dependent information especially during learning and expert phases. In addition, we present activity difference maps between CR-Hit and Hit-Hit conditions during the stim period (Fig. S10a). These maps clearly show the highest trial history information (i.e., difference in activity) in BC. Taken together, these results indicate that BC encodes history-dependent information that emerges during the stim period and just after learning."

      And also in (pg. 6): "In addition, we present activity difference maps between CR-Hit and Hit-Hit conditions during the pre period (Fig. S10b). These maps localize trial history information to RL which also spreads to other adjacent association areas. Moreover, activity patches slightly vary across the different mice which may affect the grand average (averaged across mice) of each area."

      4) Because of similar arguments to the above, and because this was not directly assessed, I do not believe the conclusion that history information emerges in RL and is transferred to BC is warranted. For instance, there is no direct comparison between areas, but inspection of the ROC plots in Fig 6b suggests that history information emerges concomitantly across cortical areas. I suggest directly comparing the time course between these and other areas.

      Done. We now add example history AUC maps and quantify history AUC for all 25 areas during the pre and stim periods. During the pre period (Fig. 6), AUC values are concentrated around RL (and other PPC areas), whereas during the stim period AUC values shift to BC. Again, due to the inter-mouse variability, these differences are slightly averaged out, which also makes it difficult to obtain a strong statistical test (with only 7 mice).

      The text now reads (pg. 7): "We next calculated the history AUC for each pixel during either the pre or stim period. The history AUC maps during the pre period display AUC values around the RL areas (Fig. 6f). In contrast, the history AUC maps during the stim period display AUC values mostly in BC (Fig. 6g). Quantified across 25 areas and averaged across mice, RL displays the highest history AUC during the pre period, whereas BC displays the highest history AUC values during the stim period (Fig. 6h). We note that other cortical areas such as other association areas also display high history AUC values. Taken together, we find that trial history emerges in RL before the texture arrives and then shifts to BC during stimulus presentation. "
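
      For readers unfamiliar with this kind of measure, a history AUC of the sort described above can be sketched as the probability that a randomly chosen CR-Hit response exceeds a randomly chosen Hit-Hit response. This is only an illustration under stated assumptions: the response does not specify how the AUC was computed, and the function name `history_auc` and the example values are hypothetical.

```python
def history_auc(cr_hit, hit_hit):
    """Area under the ROC curve for separating CR-Hit from Hit-Hit responses.

    Computed as the fraction of (CR-Hit, Hit-Hit) trial pairs in which the
    CR-Hit response is larger, counting ties as one half. A value of 0.5
    means no discriminability; 1.0 means perfect separation.
    """
    if not cr_hit or not hit_hit:
        raise ValueError("both conditions need at least one trial")
    wins = 0.0
    for a in cr_hit:
        for b in hit_hit:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(cr_hit) * len(hit_hit))


# Hypothetical single-area responses (arbitrary units), for illustration only.
cr_hit_trials = [0.9, 1.1, 1.4, 0.8]
hit_hit_trials = [0.5, 0.7, 1.0, 0.6]
auc = history_auc(cr_hit_trials, hit_hit_trials)
print(f"history AUC = {auc:.3f}")
```

      Applying such a function per pixel or per area, separately for the pre and stim periods, would give maps and per-area summaries analogous in spirit to those described for Fig. 6f-h.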

      5) How much is task performance itself modulated by trial history? How does this change over the course of learning? These behavioral analyses would greatly help interpret the neural findings and how this trial history might be used behaviorally.

      Done, we have now calculated the dprime for Hit-Hit and CR-Hit trials separately. We find no significant differences between conditions both within and across mice (see Fig. S2 below).

      The text now reads (pg. 3): "We note that learning curves that are calculated separately for each pair (i.e., either a preceding Hit or CR trial) were not significantly different (Fig. S2)."
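
      As an illustration of what a history-conditioned d-prime involves, the sketch below computes d' = z(hit rate) - z(false-alarm rate) using only trials that follow a given outcome. It is a minimal example under assumptions, not the authors' pipeline: the outcome labels, the function names and the log-linear correction (adding 0.5 to each count to avoid infinite z-scores) are all choices made here for illustration.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (0.5 added to each count) is an assumption made
    here so that rates of exactly 0 or 1 do not give infinite z-scores.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)


def d_prime_by_history(outcomes, previous):
    """d' computed only from trials whose preceding trial had outcome `previous`.

    `outcomes` is a chronological list of labels: "Hit", "Miss", "FA" or "CR".
    """
    counts = {"Hit": 0, "Miss": 0, "FA": 0, "CR": 0}
    for prev, cur in zip(outcomes, outcomes[1:]):
        if prev == previous:
            counts[cur] += 1
    return d_prime(counts["Hit"], counts["Miss"], counts["FA"], counts["CR"])


# Hypothetical session, for illustration only.
session = ["Hit", "Hit", "CR", "Hit", "CR", "CR", "Hit", "Miss"]
after_hit = d_prime_by_history(session, "Hit")
after_cr = d_prime_by_history(session, "CR")
print(f"d' after Hit = {after_hit:.2f}, d' after CR = {after_cr:.2f}")
```

      Comparing the two values across sessions would give history-specific learning curves of the kind referred to in Fig. S2.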

      Reviewer #2 (Public Review):

      Marmor et al. mine a previously published dataset to examine whether recent reward/stimulus history influences responses in sensory (and other) cortices. Bulk L2/3 calcium activity is imaged across all of the dorsal cortex in transgenic mice trained to discriminate between two textures in a go/no-go behavior. The authors primarily focus on comparing responses to a specific stimulus given that the preceding trial was or was not rewarded. There are clear differences in activity during stimulus presentation in the barrel cortex along with other areas, as well as differences even before the second stimulus is presented. These differences only emerge after task learning. The data are of high quality and the paper is clear and easy to follow. My only major criticism is that I am not completely convinced that the observed difference in response is not due to differences in movement by the animal on the two trial types. That said, the demonstration of differences in sensory cortices is relatively novel, as most of the existing literature on trial history effect demonstrates such differences only in higher-order areas.

      Major:

      1a) The claim that body movements do not account for the results is in my view the greatest weakness of the paper - if the difference in response simply reflects a difference in movement, perhaps due to "excitement" in anticipation of reward after not receiving one on CR-H vs. HH trials, then this should show up in movement analysis. The authors do a little bit of this, but to me, more is needed.

      Done. We have now extensively and carefully analyzed body and whisker movements for CR-Hit and Hit-Hit conditions. First, in the figure below we decomposed body movements into 22 different body parts using DeepLabCut. In short, we find no significant difference between CR-Hit and Hit-Hit conditions in each body part separately (Fig. S7 below). This was true for the naïve, learning and expert phases. Please see additional analyses in the points below.

      This is now reported in the text (pg. 4): “In addition, we performed a more detailed body and whisker analysis, e.g., decomposing the movement to different body parts and obtaining single whisker dynamics. These analyses did not find significant differences in movement parameters between CR-Hit and Hit-Hit conditions (Fig. S7 and S8).”

      First, given the small sample size and use of non-parametric tests, you will only get p<.05 if at least 6 of the 7 mice perform in the same way. So getting p>.05 is not surprising even if there is an underlying effect. This makes it especially important to do analyses that are likely to reveal any differences; using whisker angle and overall body movement, which is poorly explained, is in my opinion insufficient. An alternative approach would be to compare movements within animals; small as the dataset is, it is feasible to do an animal-by-animal analysis, and then one could leverage the large trial count to get much greater statistical power, foregoing summary analyses that pool over only n=7.
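
      The sample-size point above can be checked by enumeration. The sketch below builds the exact null distribution of the Wilcoxon signed-rank statistic for n = 7 paired observations, assuming no ties and no zero differences; the function names are hypothetical. It shows that a two-sided p<0.05 requires the rank sum of the minority-direction mice to be at most 2, which is only possible if at most one mouse (with one of the two smallest absolute differences) goes against the other six.

```python
from itertools import product


def signed_rank_null(n):
    """Exact null distribution of W-, the rank sum of negative differences.

    Under the null hypothesis each of the n ranks (1..n, no ties assumed) is
    attached to a negative difference with probability 1/2, so all 2**n sign
    patterns are equally likely.
    """
    counts = {}
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, neg in zip(range(1, n + 1), signs) if neg)
        counts[w] = counts.get(w, 0) + 1
    return counts, 2 ** n


def two_sided_p(w_minus, n):
    """Exact two-sided p-value for an observed W- with n untied pairs."""
    counts, total = signed_rank_null(n)
    max_w = n * (n + 1) // 2
    w = min(w_minus, max_w - w_minus)  # fold onto the smaller tail
    tail = sum(c for value, c in counts.items() if value <= w)
    return min(1.0, 2 * tail / total)


for w in range(4):
    print(f"n=7, W-={w}: two-sided p = {two_sided_p(w, 7):.4f}")
```

      With two mice going against the trend the smallest possible W- is 1 + 2 = 3, which already gives p above 0.05, so pooled tests over seven animals have very little room to detect an effect.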

      We agree with this point and have now substantially improved our statistical analysis.

      1) We now perform within mouse statistics for responses in BC during naïve, learning and expert (see Fig. S4 below). In short, we find statistical significance for 7 out of 7 mice during the expert phase, 6 out of 7 mice in the learning phase and 0 out of 7 in the naive phase. For RL during the pre period we find significant difference in 5 out of 7 expert mice.

      This is now reported in the text (pg. 4): "In addition, a statistical comparison between CR-Hit and Hit-Hit responses within each mouse separately maintained significance for expert (7/7 mice Mann-Whitney U-test p<0.05) and learning (6/7 mice) but not for naïve (0/7 mice. Fig. S3)."

      And also in (pg. 5): "In addition, a statistical comparison between CR-Hit and Hit-Hit responses in RL within each mouse separately maintained significance for expert (5/7 mice; Mann-Whitney U-test p<0.05)."

      2) We would like to point out that we have now added 3 additional mice (with hemodynamics control) and performed within mouse statistics in BC and RL (Fig. S5), adding to our initial observations.

      3) In terms of body movements, we now performed within mice statistics and compared body movements between CR-Hit and Hit-Hit conditions. In general, most mice did not show a significant difference in body movements or whisker envelope.

      This is now reported in the text (pg. 4): "A within mouse statistical comparison between body or whisker parameters in CR-Hit and Hit-Hit maintained a non-significant difference in expert (1/7 mice displayed a significant difference; Mann-Whitney U-test p>0.05), learning (2/7 mice) and naïve (0/7 mice)."

      And also in (pg. 4): "Body movements and whisker parameters did not significantly differ between CR-Hit and Hit-Hit conditions during the pre-period (Similar to the stim period. Across and within mice. P>0.05; Mann-Whitney U-test)."

      In summary, we have now substantially improved our statistical analysis and further decomposed the body movements, maintaining the trial history results.

      The authors only consider a simple parametrization of movement (correlation across successive frames), and given the high variability in movement across animals, it is likely that different mice adopt different movements during the task, perhaps altering movement in specific ways. Aggregating movement across different body parts after an analysis where body parts are treated separately seems like an odd choice - perhaps it is fine, but again, supporting evidence for this is needed. As it stands, it is not clear if real differences were averaged out by combining all body parts, or what averaging actually entails.

      Please see the above point where we decomposed body movements (Fig. S7 and Methods section in Pg. 14).

      If at all possible, I would recommend examining curvature and not just the whisker angle, since the angle being the same is not too surprising given that the stimulus is in the same place. If the animal is pressing more vigorously on CR-H trials, this should result in larger curvature changes.

      Done. We now decompose whisker dynamics (i.e., curvature) using DeepLabCut (Fig. S8 see below). In general, we find no significant differences in whisker parameters between Hit-Hit and CR-Hit conditions.

      This is now reported in the text (pg. 4): "In addition, we performed a more detailed body and whisker analysis, e.g., decomposing the movement to different body parts. This analysis did not find significant differences between CR-Hit and Hit-Hit conditions (Fig. S7 and S8)."

      Finally, the authors presumably have access to lick data. Are reaction times shorter on CR-H trials? Is lick count or lick frequency shorter?

      Done. We have now calculated lick reaction time and lick rate, and find a significant difference in lick reaction time but not in lick rate. We show a figure below for the reviewer and report this in the text.

      The text now reads (pg. 3): "In addition, the lick reaction time (but not the lick rate) differed significantly between Hit-Hit and CR-Hit conditions (p<0.05; Wilcoxon signed-rank test), possibly indicating a more considered response after a previous stop signal."

      If movement differs across trial types, it is entirely plausible that at least barrel cortex activity differences reflect differences in sensory input due to differences in whisker position/posture/etc. This would mitigate the novelty of the present results.

      As detailed above, we have now meticulously analyzed the whisker parameter differences between both conditions and did not find any significant differences.

      1b) Given the importance of this control to the story, both whisker and body movement tracking frames should be explicitly shown either in the primary paper or as a supplement. Moreover, in the methods, please elaborate on how both whisker and body tracking were performed.

      Done. Please see Figs. S7 and S8 for tracking frames. This is now detailed in the above points and also in the revised relevant Methods section.

      2) Did streak length impact the response? For instance, in Fig. 1f "Learning", there is a 6-trial "no-go" streak; if the data are there, it would be useful to plot CR-H responses as a function of preceding unrewarded trials.

      Done. We have now calculated response in CR-Hit as a function of the number of preceding CRs. In general, we obtain inconsistent results across mice that may be due to the small number of trials that have more than one preceding CR. Nevertheless, some mice have a trend, sometimes significant, in which CR-Hit responses are higher for longer CR preceding streaks. This is especially true during the learning phase. We have decided not to include this in the manuscript and present this figure only to the reviewer.

    1. Author response:

      Reviewer #1 (Public Review):

      This is an important and very well conducted study providing novel evidence on the role of zinc homeostasis for the control of infection with the intracellular bacterium S. typhimurium also disentangling the underlying mechanisms and providing clear evidence on the importance of spatio-temporal distribution of (free) zinc within the cell.

      We thank the reviewer for the positive comments.

      1) It would be important to provide more information on the genotype of mice.

      As suggested by the reviewer, we have added the detailed genotype of Slc30a1flagEGFP/+ and Slc30a1fl/flLysMCre mice to the revised supplementary Figure supplement 10.

      2) It is rather unlikely that C57Bl6 mice survive up to two weeks after i.p. injection of 1x10E5 bacteria.

      In response to the reviewer's comment, we tested the survival rate using a group of our experimental animals and C57BL/6 wild-type mice.

      The Salmonella strain is a gift from our friend, Professor Ge Bao-xue. We sent this strain for genetic characterisation, which showed 100% identity to Salmonella enterica Typhimurium, with many of the matching strains originating from poultry. One of them is Salmonella enterica subsp. enterica serovar Typhimurium strain MeganVac1 (Accession: CP112994.1), a live attenuated strain. We hope that this explains the relationship between the high infectious dose and mouse survival.

      Author response image 1.

      (A) Survival rate of Slc30a1fl/fl and Slc30a1fl/flLysMCre (n = 14-15/group) and (B) Survival rate of C57BL/6 wild type (n = 8) after Salmonella infection for two weeks. (C) A full-length sequence (1,478 bases) of the 16S rDNA gene of the Salmonella strain and (D) the sequencing electropherogram.

      3) To be sure that macrophages from Slc30A1 fl/fl LysMcre mice really have an impaired clearance of bacteria, it would be important to rule out an effect of Slc30A1 deletion on bacterial phagocytosis and containment (e.g., evaluation of bacterial numbers after 30 min of infection).

      As the reviewer advised, we have repeated the experiment and measured the bacterial numbers after 30 min of infection (dashed line in A). The results show that there is no statistical difference in the bacterial numbers after 30 min between Slc30a1fl/flLysMCre and Slc30a1fl/fl BMDMs. Therefore, the reduction of bacterial numbers after 24 hours occurs due to the impairment of intracellular pathogen-killing capacity as the reviewer pointed out.

      Author response image 2.

      (A) Time course of the intracellular pathogen-killing capacity of Salmonella-infected Slc30a1fl/flLysMCre and Slc30a1fl/fl BMDMs measured in colony-forming units per ml (n = 5). (B) Fold change in Salmonella survival (CFU/mL) at different time points from A. (C) Representative images of Salmonella colonies on solid agar medium at 24 hours. Data are represented as mean ± SEM. P values were determined using 2-tailed unpaired Student’s t-test. *P<0.05, **P<0.01, and ns, not significant.

      4) Does the addition of zinc to macrophages negatively affect iNOS transcription as previously observed for the divalent metal iron and is a similar mechanism also employed (CEBPß/NF-IL6 modulation) (Dlaska M et al. J Immunol 1999)?

      The reviewer has raised an important point here, since free zinc also plays a role at multiple levels of cellular signaling components (Kambe et al., 2015). Dlaska and Weiss reported that NF-IL6, a protein responsible for iNOS transcription, is negatively regulated by iron perturbation under IFNg/LPS stimulation in macrophages (Dlaska and Weiss, 1999). As the reviewer suggested, our results showed that zinc supplementation decreases iNOS expression in macrophages after Salmonella infection, suggesting that free zinc might play a role in iNOS regulation.

      However, in Slc30a1fl/flLysMCre macrophages, despite increased intracellular free zinc, the lack of Slc30a1 also induces Mt1, a zinc reservoir, which might negatively affect NO production (Schwarz et al., 1995) or alternatively inhibit iNOS through the NF-kB pathway (Cong et al., 2016), as reported by previous studies. Therefore, we cannot rule out the possibility that defects in Salmonella clearance due to iNOS/NO inhibition may be caused by a complex combination of excess free zinc and overexpression of the zinc reservoir. To test this hypothesis, further studies using a specific model, for example Mtfl/fliNOSfl/flLysMCre mice, might be needed to investigate the precise mechanism.

      Author response image 3.

      RT-qPCR analysis of mRNA encoding Nos2 in BMDMs after infection with Salmonella or Salmonella plus ZnSO4 (20 μM) for 4 h.

      Reference:

      Dlaska M, Weiss G. 1999. Central role of transcription factor NF-IL6 for cytokine and iron-mediated regulation of murine inducible nitric oxide synthase expression. The Journal of Immunology. 162:6171-6177, PMID: 10229861

      Kambe T, Tsuji T, Hashimoto A, Itsumura N. 2015. The physiological, biochemical, and molecular roles of zinc transporters in zinc homeostasis and metabolism. Physiological Reviews. 95:749-784. https://doi.org/10.1152/physrev.00035.2014, PMID: 26084690

      Schwarz MA, Lazo JS, Yalowich JC, Allen WP, Whitmore M, Bergonia HA, Tzeng E, Billiar TR, Robbins PD, Lancaster JR Jr, et al. 1995. Metallothionein protects against the cytotoxic and DNA-damaging effects of nitric oxide. Proceedings of the National Academy of Sciences of the United States of America. 92:4452-4456. https://doi.org/10.1073/pnas.92.10.4452, PMID: 7538671

      Cong W, Niu C, Lv L, Ni M, Ruan D, Chi L, Wang Y, Yu Q, Zhan K, Xuan Y, Wang Y, Tan Y, Wei T, Cai L, Jin L. 2016. Metallothionein prevents age-associated cardiomyopathy via inhibiting NF-κB pathway activation and associated nitrative damage to 2-OGD. Antioxidants & Redox Signaling. 25:936-952. https://doi.org/10.1089/ars.2016.6648, PMID: 27477335

      5) How does Zinc or TPEN supplementation to bacteria in LB medium affect the log growth of Salmonella?

      We found that zinc supplementation at both low (20 µM) and high (640 µM) concentrations negatively affects Salmonella growth in broth culture, especially during log phase and stationary phase, whereas TPEN (20 µM) supplementation does not. This indicates that high-zinc conditions occurring at the cellular level, such as within phagosomes (Botella et al., 2011), can limit bacterial growth.

      Author response image 4.

      Growth curve (optical density, OD 600 nm) of Salmonella in LB medium at different concentrations of ZnSO4 and/or TPEN. Bar graph indicating Salmonella growth at specific time points. Each value is expressed as the mean of triplicates, and P values were determined using 2-tailed unpaired Student’s t-test. *P<0.05, **P<0.01, ***P<0.001 and ns, not significant.

      Reference:

      Botella H, Peyron P, Levillain F, Poincloux R, Poquet Y, Brandli I, Wang C, Tailleux L, Tilleul S, Charrière GM, Waddell SJ, Foti M, Lugo-Villarino G, Gao Q, Maridonneau-Parini I, Butcher PD, Castagnoli PR, Gicquel B, de Chastellier C, Neyrolles O. 2011. Mycobacterial p(1)-type ATPases mediate resistance to zinc poisoning in human macrophages. Cell Host Microbe. 10:248-59. https://doi.org/10.1016/j.chom.2011.08.006, PMID: 21925112

      Reviewer #2 (Public Review):

      This paper explores the importance of zinc metabolism in host defense against the intracellular pathogen Salmonella Typhimurium. Using conditional mice with a deletion of the Slc30a1 zinc exporter, the authors show a critical role for zinc homeostasis in the pathogenesis of Salmonella. Specifically, mice deficient in the Slc30a1 gene in LysM+ myeloid cells are hypersusceptible to Salmonella infection, and their macrophages show altered phenotypes in response to Salmonella. The study adds important new information on the role metal homeostasis plays in microbe-host interactions. Despite the strengths, the manuscript has some weaknesses. The authors conclude that lack of slc30a1 in macrophages impairs nos2-dependent anti-Salmonella activity. However, this idea is not tested experimentally. In addition, the research presented on Mt1 is preliminary. The text related to Figure 7 could be deleted without affecting the overall impact of the findings.

      We thank the reviewer for his/her positive comments and constructive suggestions.

      Reviewer #3 (Public Review):

      Na-Phatthalung et al observed that transcripts of the zinc transporter Slc30a1 were upregulated in Salmonella-infected murine macrophages and in human primary macrophages; therefore, they sought to determine if, and how, Slc30a1 could contribute to the control of bacterial pathogens. Using a reporter mouse the authors show that Slc30a1 expression increases in a subset of peritoneal and splenic macrophages of Salmonella-infected animals. Specific deletion of Slc30a1 in LysM+ cells resulted in a significantly higher susceptibility of mice to Salmonella infection which, counter to the authors' conclusions, is not explained by the small differences in the bacterial burden observed in vivo and in vitro. Although loss of Slc30a1 resulted in reduced iNOS levels in activated macrophages, the study lacks experiments that mechanistically link loss of NO-mediated bactericidal activity to Salmonella survival in Slc30a1 deficient cells. The additional deletion of Mt1, another zinc binding protein, resulted in even lower nitrite levels of activated macrophages but only modest effects on Salmonella survival. By combining genetic approaches with molecular techniques that measure variables in macrophage activation and the labile zinc pool, Na-Phatthalung et al successfully demonstrate that Slc30a1 and metallothionein 1 regulate zinc homeostasis in order to modulate effective immune responses to Salmonella infection. The authors have done a lot of work and the information that Slc30a1 expression in macrophages contributes to control of Salmonella infection in mice is a new finding that will be of interest to the field. Whether the mechanism by which SLC30A1 controls bacterial replication and/or lethality of infection involves nitric oxide production by macrophages remains to be shown.

      We very much appreciate the reviewer’s detailed evaluation and suggestions. The manuscript has been revised thoroughly according to the reviewer’s advice.

    1. Author Response

      Reviewer #1 (Public Review):

      This work focuses on the mechanisms that underlie a previous observation by the authors that the type VI secretion system (T6SS) of a Pseudomonas chlororaphis (Pchl) strain can induce sporulation in Bacillus subtilis (Bsub). The authors bioinformatically characterize the T6SS system in Pchl and identify all the core components of the T6SS, as well as 8 putative effectors and their domain structures. They then show that the Pchl T6SS, and in particular its effector Tse1, is necessary to induce sporulation in Bsub. They demonstrate that Tse1 has peptidoglycan hydrolase activity and causes cell wall and cell membrane defects in Bsub. Finally, the authors also study the signaling pathway in Bsub that leads to the induction of sporulation, and their data suggest that cell wall damage may lead to the degradation of the anti-sigma factor RsiW, leading to activation of the extracellular sigma factor σW that causes increased levels of ppGpp. Sensing of high ppGpp levels by the kinases KinA and KinB may lead to phosphorylation of Spo0F, and induction of the sporulation cascade.

      The findings add to the field's understanding of how competitive bacterial interactions work mechanistically and provide a detailed example of how bacteria may antagonize their neighbors, how this antagonism may be sensed, and the resulting defensive measures initiated.

      While several of the conclusions of this paper are supported by the data, additional controls would bolster some aspects of the data, and some of the final interpretations are not substantiated by the current data.

      • The Bsub signaling pathway that is proposed is intricate and extensive as shown in Fig 5A. However, the data supporting that is very sparse:

      a) The authors show no data showing that the proteases PrsW and/or RasP, or the extracellular sigma factor σW are necessary, or that the cleavage of RsiW is needed, for induction of sporulation - this could presumably be tested using mutants of those genes.

      It has been previously demonstrated that the proteases PrsW and/or RasP cleave RsiW under certain conditions such as alkaline shock (Heinrich et al., 2009). First, PrsW cleaves RsiW, and the resulting cleaved RsiW serves as a substrate for RasP. In the previous version of the manuscript, we already demonstrated that treatment with Tse1 causes damage to PG and delocalization of RsiW; however, as the reviewer comments, we did not show the participation of any of these proteases in the proposed signaling pathway. We have now generated single mutants in rsiW and prsW and treated them with Tse1. We observed no variation in the levels of sporulation compared to untreated strains (Figure 1), a finding consistent with their suggested involvement in the sporulation signaling pathway activated by Tse1. Positive controls, that is, the single mutants grown at 37ºC, were still able to sporulate. These data have been added to Figure 6B in the new version of the manuscript.

      As suggested by other reviewers, we have generated a sister plot of this figure showing the raw CFUs in each case. These data are included in Supplementary file 3. This experiment and the related figure have been incorporated into the new version of the manuscript.

      Figure 1. A) Quantification of the percentage of sporulated Bsub, rsiW and prsW cells after treatment with purified Tse1 showing that rsiW and prsW single mutants are blind to the presence of Tse1. B) Cell density (CFUs/mL) of total (blue bars) and sporulated population (brown bars) of different Bacillus strains (Bsub, ∆rsiW and ∆prsW) untreated and treated with Tse1. Sporulation at 37ºC is shown as positive control in each strain. Statistical significance was assessed via t-tests. p value < 0.1, p value < 0.001, **p value < 0.0001.

      b) Similarly, they don't demonstrate that the levels of ppGpp increase in the cell upon exposure to Pchl.

      We have not been able to measure the levels of ppGpp. However, given that the levels of different nucleotides are altered in the same proposed sporulation cascade (Kriel et al., 2013, Tojo et al., 2013, López and Kolter, 2010), we have instead analyzed the levels of ATP using an ATP Determination Kit (Thermo, A22066). We found that ATP levels increased 3-fold in Bsub cells treated with Tse1 compared to untreated control cells. Consistently, no increase in ATP levels was observed in rsiW or prsW mutants treated with Tse1. We have incorporated all the raw luminescence data obtained for each sample and treatment in Figure 6-source data 1. This experiment, the corresponding figure (Figure 6A in the new version of the manuscript) and its description in “Materials and Methods” have been added to the new version of the manuscript.

      c) There is some data showing that kinA and kinB mutants don't induce sporulation (Fig supplement 7A), but that is lacking the 'no attacker' control that would demonstrate an induction.

      We have included in the new version of the manuscript the ‘no attacker’ control sporulation (%). The figure shows that the presence of Pchl strains induces the sporulation of all kinase mutants. This new data has been incorporated in Figure 6-figure supplement 1A in the new version of the manuscript.

      d) There is some data showing that RsiW may be cleaved (Fig 5C, D), but that data would benefit from a positive control showing that the lack of YFP foci is seen in a condition where RsiW is known to be cleaved, as well as from a time-course showing that the foci are present prior to the addition of Tse1, and then disappear. As it is shown now, it is possible that the addition of Tse1 just blocks the production of RsiW or its insertion into the membrane (especially given the membrane damage seen). Further, there is no data that the disappearance of the YFP loci requires the proteases PrsW and /or RasP - such data would also support the idea that the disappearance is due to cleavage of RsiW.

      Thank you for your useful suggestion. It is important to consider that, in our transcriptomics analysis, we have not seen repression of the genes that encode either of the two proteases in cells treated with Tse1. However, we agree that additional experiments would enhance the significance of our findings. We have repeated the whole experiment including a positive control to demonstrate that YFP foci disappear in a condition in which RsiW is known to be degraded by PrsW and RasP. Bacillus cells were incubated in medium at pH 10, which provokes an alkaline shock that triggers RsiW cleavage (Asai, 2017; Heinrich et al., 2009). As shown in Figure 6D, under this condition we also observed disappearance of YFP foci. We have also provided extra images with quantification of the average signal from YFP foci in Figure 6-figure supplement 2.

      • The entire manuscript suggests that T6SS is solely responsible for the induction of sporulation. While T6SS does appear to play a major part in explaining the sporulation induction seen, in the absence of 'no attacker' controls for Fig. 2A, it is impossible to see this. From the data shown in Fig. 2C, and figure supplement 2A, the 'no attacker' sporulation rate seems to be ~20%, while the rate is ~40% with Pchl strains lacking T6SS, suggesting that an additional factor may be playing a role.

      This must be a misunderstanding of the message of this manuscript. The conceptual foundation of this study was laid in our previous manuscript (Molina-Santiago et al., 2019). We demonstrated that B. subtilis sporulated in the presence of P. chlororaphis. Interestingly, the overgrowth of P. chlororaphis over the B. subtilis colony did not eliminate cells of B. subtilis, given that most of them had sporulated. The data we obtained strongly suggested that a functional T6SS was involved in the cellular response of Bacillus during close cell-to-cell contact. In this new manuscript, we have explored this idea and found that the T6SS of P. chlororaphis indeed mobilizes at least one effector, Tse1, which is able to trigger sporulation in Bacillus. Thus we did not conclude then, nor do we conclude in this new study, that T6SS is the only factor expressed by P. chlororaphis responsible for sporulation activation in Bacillus. We have accordingly rephrased some sentences of the manuscript to clarify the proposed involvement of T6SS in B. subtilis sporulation.

      In addition, as mentioned above, we have included data of sporulation percentages in the absence of an attacker to better compare the induction of sporulation observed in the presence of the different Pchl strains and in the presence of Tse1.

      Reviewer #2 (Public Review):

      In a previous study, the authors showed that cell-cell contact with Pseudomonas chlororaphis induces sporulation in Bacillus subtilis. Here, the authors build on this finding and elucidate the mechanism behind this observation. They describe the enzymatic activity of a protein (Tse1) secreted by the type VI secretion system (T6SS) of P. chlororaphis (Pch), which partially degrades the peptidoglycan (PG) of targeted B. subtilis cells and triggers a signal cascade culminating in sporulation.

      Most of the key conclusions of this paper (Tse1 being secreted by the T6SS and inducing sporulation in targeted cells) are well supported by the data. One conclusion (sporulation response being an anti-T6SS "defense" strategy) is not well supported by the data and should be removed or rephrased.

      The authors elucidate the enzymatic activity of Tse1, a T6SS effector protein, in a genus (Pseudomonas) of great interest to microbiologists, and to researchers studying the T6SS specifically. They also carefully dissect the cellular response (signal cascade and sporulation) of an important model organism (B. subtilis; Bsub) specifically to exposure to Tse1. The results describing this cellular response contribute substantially to our understanding of how T6SS effector proteins interact with cells of Gram-positive species.

      My only major concerns regard the interpretation of these results as sporulation being an adaptive and/or specific response to attacks by the T6SS. I outline my reasoning below.

      • Interpretation of sporulation as a "defense" mechanism/strategy against the T6SS. In order for a phenotype X to be regarded as a "defense against Y" mechanism, it has to be shown that phenotype X (sporulation in response to Tse1) evolved - at least in part - for the purposes of increasing survival in the presence of Y (T6SS attacker). There are no experiments in this study comparing e.g. a sporulating Bsub with a non-sporulating Bsub, that would allow testing if sporulation increases survival. The experiments carefully describe the cellular response to Tse1, but no inference can be made with regards to this being adaptive for Bsub, or if it helps the cells survive against T6SS attacks, etc. A more parsimonious explanation would be that Tse1 happens to target the PG and causes envelope stress, triggering sporulation. So, it would be a general stress response that also happens to be triggered by T6SS. Now, some general (cell envelope) stress responses are known to be very effective at protecting against the T6SS. But in those instances, a beneficial effect for survival in the face of T6SS attacks has been shown in dedicated experiments. Purely observing a response to a T6SS effector, as this study does (very well), is not evidence that the response has evolved for the purpose of surviving T6SS attacks. Tucked away in the supplement (and briefly mentioned in the main text) is data on Bsub and Bacillus cereus, showing that i) cell densities of the sporulating Bsub and a sporulating B. cereus strain are not affected by an active T6SS, and ii) cell densities of an asporogenic B. cereus are slightly reduced by an active T6SS. However, the effect sizes of density reduction by the T6SS in the asporogenic B. cereus are minute (20x10^6 vs. ~50x10^6). In typical killing assays against e.g. gram-negative strains, a typical effect size for T6SS killing would be a several order of magnitude reduction in survival of the target strain when exposed to a T6SS attacker. 
      Based on this dataset alone (Figure Suppl. 8), I would say that all three Bacillus strains are not experiencing any "fitness-relevant" killing by the T6SS, which is in line with the T6SS often being useless against gram-positives when it comes to killing. Hence, no claims about fitness benefits of sporulation in response to a T6SS attack, or this being a "defense mechanism/strategy" should be made in the manuscript.

      Thanks for these interesting introductory and specific comments. We agree with the reviewer: sporulation is not an adaptive or specific response of Bacillus to T6SS; as stated by reviewer 2, sporulation is a general stress response. The way the manuscript was written may, at some points, have given the wrong impression, and we have therefore rephrased some sentences. In addition, we made a mistake when generating Figure supplement 8 (Figure 6-figure supplement 3 in the new version of the manuscript). We have repeated this experiment and generated a new, corrected chart that shows a three orders of magnitude reduction in survival of the asporogenic B. cereus strain in competition with the Pchl WT strain compared to Pchl mutant strains. These new findings show that the absence of sporulation ability leads to a severe reduction in survival of the Bacillus cereus DSM 2302 population in competition with Pchl with an active T6SS compared to the survival in competition with the Pchl hcp mutant. This figure also shows that the Bacillus population decreased in competition with the tse1 mutant relative to the hcp mutant, and that the difference in Bacillus survival between competition with the hcp and tse1 mutants is statistically significant. The lower survival of Bacillus in the interaction with the tse1 strain compared to the Bacillus-hcp competition suggests that this strain is able to deliver additional T6SS-dependent toxins, while the further reduction seen with the WT strain indicates that Tse1 contributes to killing Bacillus. This observation is in accordance with the data presented in Fig. 2B, which indicated that the tse1 mutant has an active T6SS able to kill E. coli.

      • Data supporting baseline "no competitor" sporulation rates being no different from those triggered by T6SS mutants is not convincing. For the data shown in Fig. 2A, a key comparison here would be to show baseline Bsub sporulation rates in absence of a competitor. This measurement is shown in Fig supplement 2A, and the value shown there (roughly 22% on average) appears to be much lower than the average T6SS mutant shown in Fig. 2A. The main text states that sporulation rates induced by the different T6SS mutants are "statistically" similar to the no-competitor baseline (L206/207). I am not convinced by this, since i) overall sporulation rates (incl of WT Pch) appear to have been lower in the experiment shown in supplement 2A, so a direct comparison between the no-competitor baseline and the data shown in Fig. 2A is not possible; and ii) hcp and tse1 mutants were tested in different experiments throughout the study, and sporulation rates appear to consistently hover around 30-40%, which is higher than the roughly 22% for "no competitor" depicted in Supplement Fig2A. I am focussing on this, because for the interpretation of the results, and the main narrative of the paper, knowing if "simply interacting with a T6SS-negative P. chlororaphis" induces some sporulation would make a big difference. One sentence in the discussion adds to my confusion about this: L464/465, "... a strain lacking paar (Δpaar) had an active T6SS that triggered sporulation comparably to Δhcp, ΔtssA, and Δtse1 strains", suggesting that the authors claim that even strains lacking active T6SS trigger increased sporulation (which I would agree with, based on the data).

      We understand the reviewer's comment that a direct comparison between the two figures is not correct due to fluctuations of the baseline sporulation rates between experiments. To solve this issue, we have added the baseline "no competitor" sporulation percentages in the experiments represented in Figure 2B in the new version of the manuscript.

      Regarding the sporulation provoked by a T6SS-negative P. chlororaphis, the reviewer is right. Bacillus sporulation occurs in response to many external factors (abiotic and biotic stresses), so the presence of P. chlororaphis in the competition already has an effect on the sporulation percentage of B. subtilis. Accordingly, we have removed the statement that the sporulation rates induced by the different T6SS mutants are "statistically" similar to the no-competitor baseline. However, our previous data (Molina-Santiago, Nat Comm 2019) and current findings convincingly demonstrate the relevance of the T6SS and, specifically, the Tse1 toxin in the induction of sporulation, at least during close cell-to-cell contact.

      • Claim regarding "bacteriolytic activity" when tse1 is heterologously expressed in E. coli. The data supporting this claim (Fig2-supplement 2C) only shows a lower net population growth rate after induction of tse1 (truncated vs. non-truncated) expression. This could be caused by: slower growth (but no death), equal growth (with some death), or a combination of the two. The claim of "bacteriolytic" activity in E. coli is therefore not supported by this dataset.

      We agree with the reviewer and we have decided to remove this figure and the experiment of “bacteriolytic activity” given that it does not contribute conceptually to the message of the manuscript.

      I cannot comment in more detail on the validity of the biochemistry/enzymatic activity assays as these are not my area of expertise.

      Reviewer #3 (Public Review):

      The authors identify tse1, a gene located in the type 6 secretion system (T6SS) locus of the bacterium Pseudomonas chlororaphis, as necessary and sufficient for induction of Bacillus subtilis sporulation. The authors demonstrate that Tse1 is a hydrolase that targets peptidoglycan in the bacterial cell wall, triggering activation of the regulatory sigma factor sigma-w. The sporulation-inducing effects of sigma-w are dependent on the downstream presence of the sensor histidine kinases KinA and KinB. Overall, this is a well-structured paper that uses a combination of methods including bacterial genetics, HPLC, microscopy, and immunohistochemistry to elucidate the mechanism of action of Tse1 against B. subtilis peptidoglycan. There are some concerns regarding a few experimental controls that were not included/discussed and (in a few figures) the visual representation of the data could be improved. The structure of the manuscript and experiments is such that key questions are addressed in a logical flow that demonstrates the mechanisms described by the authors.

      To begin, we have concerns regarding the sporulation assays and their results. The data should be presented as "Percent sporulation" or "Sporulation (%)" - not as a "sporulation rate": there is no kinetic element to any of these measurements, so no rate is being measured (be careful of this in the text as well, for instance near line 204). More importantly, there is no data provided to indicate that changes in percent spores are not instead just the death of non-sporulated cells. For example, imagine that within a population of B. subtilis cells, 85% of the cells are vegetative and 15% are spores. If, upon exposure to tse1, a large proportion of the vegetative cells are killed (say, 80% of them), this could lead to an apparent increase in sporulation: from 15% for the untreated population to ~50% of the treated, but the difference would be entirely due to a change in the vegetative population, not due to a change in sporulation. The authors need to clearly describe how they conducted their sporulation assays (currently there is no information about this in the methods) as well as provide the raw data of the counts of vegetative cells for their assays to eliminate this concern.
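
      The reviewer's arithmetic can be checked with a short sketch. The function name and the 100-cell starting population below are illustrative assumptions, not taken from the manuscript; the sketch also assumes that spores are unaffected by the treatment and that no new spores form.

```python
def apparent_sporulation_pct(vegetative, spores, vegetative_kill_fraction=0.0):
    """Percent spores among surviving cells after killing a fraction of
    vegetative cells. Assumes spores are unaffected and none are newly formed."""
    surviving_vegetative = vegetative * (1.0 - vegetative_kill_fraction)
    total = surviving_vegetative + spores
    if total == 0:
        raise ValueError("no surviving cells")
    return 100.0 * spores / total


# Hypothetical population of 100 cells: 85 vegetative, 15 spores.
untreated = apparent_sporulation_pct(85, 15)
treated = apparent_sporulation_pct(85, 15, vegetative_kill_fraction=0.8)
print(f"untreated: {untreated:.1f}%  treated: {treated:.1f}%")
```

      Under these assumptions the spore count never changes, yet the spore percentage rises from 15% to roughly 47%, which is why raw counts of total and heat-resistant CFUs are needed to separate killing from induced sporulation.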

      Thanks for the suggestion. We have replaced all the titles and data presented as “sporulation rate” with “sporulation (%)” or “sporulation percentage”. As also suggested by reviewer 2, we have included the raw CFU counts of the total population and of sporulated cells to show that there is no substantial change in cell death. We have also added a section to Materials and Methods specifying how the sporulation assays were done. Quote text:

      “Sporulation assays

      Spots of bacteria were resuspended in 1 mL sterile distilled water. Then, serial dilutions were made and cultured in LB solid media for vegetative cells CFU counts. The same serial dilutions were further heated at 80ºC for 10 minutes to kill vegetative cells and immediately cultured again in LB solid media. Plates were grown overnight at 28 ºC and the resulting colonies were counted to calculate the percentage of Bsub sporulation (%). A list of raw CFUs (total and spore population) from all figures with sporulation percentage is shown in Supplementary file 3.”
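
      The calculation implied by this protocol can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names, colony counts, dilution factors, and plated volume are all hypothetical, and it assumes that unheated plates give the total population (vegetative cells plus spores) while heated plates give spores only.

```python
def cfu_per_ml(colonies, dilution_factor, plated_volume_ml):
    """CFU/mL of the undiluted suspension, from one countable plate."""
    if dilution_factor <= 0 or plated_volume_ml <= 0:
        raise ValueError("dilution factor and plated volume must be positive")
    return colonies * dilution_factor / plated_volume_ml


def sporulation_pct(total_cfu_per_ml, spore_cfu_per_ml):
    """Heat-resistant CFUs as a percentage of total CFUs."""
    if total_cfu_per_ml <= 0:
        raise ValueError("total CFU/mL must be positive")
    return 100.0 * spore_cfu_per_ml / total_cfu_per_ml


# Hypothetical plate counts (not data from the manuscript).
total = cfu_per_ml(colonies=120, dilution_factor=1e5, plated_volume_ml=0.1)
spores = cfu_per_ml(colonies=48, dilution_factor=1e4, plated_volume_ml=0.1)
print(f"total: {total:.2e} CFU/mL, spores: {spores:.2e} CFU/mL, "
      f"sporulation: {sporulation_pct(total, spores):.1f}%")
```

      Reporting both CFU/mL values alongside the percentage, as in Supplementary file 3, lets a reader see whether a higher percentage reflects more spores or fewer total cells.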

      A related concern is regarding the analysis of the kinases and the effects of their deletions on the impact of Tse1. Previous literature shows that the basal levels of sporulation in a B. subtilis kinA or a kinB mutant are severely defective relative to a wild-type strain; these mutants sporulate poorly on their own. Therefore, the data presented on Lines 394+ and the associated Supplemental Figure regarding the sporulation defects of these two mutants are not compelling for showing that these kinases are required for this effector to act. It is likely that simply missing these kinases would severely impact the ability of these strains to sporulate at all, irrespective of the presence of Tse1, and this confounding concern is not discussed.

      Previous literature shows that mutation of kinases affects sporulation of B. subtilis. The histidine kinases KinA and KinB are the first kinases responsible for initiating the sporulation cascade through phosphorylation of Spo0F. However, as shown in Figure 6-figure supplement 1A, single mutants in these kinases (ΔkinA, ΔkinB) still sporulate, given that the phosphorylation cascade is controlled by numerous intermediaries and other histidine kinases that form a multicomponent phosphorelay (KinA-E). In this context, the sporulation of B. subtilis can also be triggered by KinC or KinD in the absence of KinA or KinB, as KinC/KinD can act directly on the master regulator of sporulation Spo0A (Burbulys et al., 1991; Wang et al., 2017).

      In addition, as suggested by reviewer 1, we have added to Figure 6-figure supplement 1A of the new version of the manuscript, the sporulation percentage 'no competitor' control of each kinase mutant and B. subtilis WT. The results show that, as commented by the reviewer and also supported by literature, these mutants sporulate poorly on their own in the absence of an attacker (none). However, as shown in the figure, all kinase mutants increase the sporulation percentage in the presence of a competitor.

      Another concern is regarding the statistical tests used in Figure 2. For statistical tests in A, B, and D, it should be stated whether a post-test was used to correct for multiple comparisons, and, if so, which post-test was used. For C, we suggest the inclusion of a mock control in addition to the two conditions already included (i.e., an extraction from an E. coli strain expressing the empty vector) to provide a stronger control comparison.

      We have clarified the statistical tests used in Figure 2. Briefly, for the statistical analysis of the sporulation percentage in Figure 2A, B and D, we used one-way ANOVA followed by the Dunnett test, with Bsub in competition with Pchl as the control group. In relation to Figure 2C, it is not possible to add a mock control with a strain carrying the empty vector, because this is a suicide plasmid (pDEST17) unable to replicate in E. coli without chromosome integration.

      An additional concern regarding controls is that there is an absence of loading controls for the immunoblot assays. In Figure 5D and all immunoblot assays, there is no mention of a loading control, which is a critical control that should be included.

      In the previous version of the manuscript, we already included a loading control for Figure 5D in Figure supplement 7B, both for cell and for supernatant fractions. In the new version of the manuscript, the loading control of Figure 6E (in the previous version of the manuscript Figure 5D) is shown in Figure 6-figure supplement 2C. We have also included the original unedited gels and blot (Figure 6-figure supplement 2- source data 1 and Figure 6-figure supplement 2-source data 2).

      Some of the visualizations could be improved to help the reader understand and appropriately interpret the data presented. For instance, in Figures 3 and 4 the scale bars are different across each of the Figure's imaging panels. These should be scaled consistently for better comparison. Additionally, the red false colorization makes the printed images difficult to see. Black-and-white would be easier to see and would not subtract from the images.

      The reviewer is right. Scale bars equal 2 in Figure 3A, but the length of the bars was not the same. We have edited the images so that they have the same magnification for better comparison.

      In relation to Figure 4, we have changed the magnifications and now all the figures have the same scale bars and magnifications. In addition, we have added more images of broader fields in Figure 4-figure supplement 1 which were used to measure the percentage of permeabilized cells and to obtain the fluorescence intensity measures shown in Figure 4.

      An additional weakness of the paper is that the RNA-seq data is not fully investigated, and there is an absence of methods included regarding the RNA-seq differential abundance analysis (it is mentioned on L379-380 but no information is provided in the methods). As stated by the authors, 58% of differentially regulated genes belonged to the σW regulon, but the other 42% of genes are not discussed, and will hopefully be a target of future investigations.

      The methods section has been modified for a better explanation of the RNA-seq differential abundance analysis. Quote text: “The raw reads were pre-processed with SeqTrimNext (Falgueras et al., 2010) using the specific NGS technology configuration parameters. This pre-processing removes low-quality, ambiguous and low-complexity stretches, linkers, adapters, vector fragments, and contaminated sequences while keeping the longest informative parts of the reads. SeqTrimNext also discarded sequences below 25 bp. Subsequently, clean reads were aligned and annotated using the Bsub reference genome with Bowtie2 (Langmead and Salzberg, 2012) in BAM files, which were then sorted and indexed using SAMtools v1.484 (Li et al., 2009). Uniquely localized reads were used to calculate the read number value for each gene via Sam2counts (https://github.com/vsbuffalo/sam2counts). Differentially expressed genes (DEGs) were analyzed via DEgenes Hunter, which provides a combined p value calculated (based on Fisher’s method) using the nominal p values provided by edgeR (Robinson et al., 2010) and DESeq2. This combined p value was adjusted using the Benjamini-Hochberg (BH) procedure (false discovery rate approach) and used to rank all the obtained DEGs. For each gene, combined p value < 0.05 and log2-fold change > 1 or < −1 were considered as the significance threshold.”
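
      The final filtering step described in the quoted methods (Fisher's method, BH adjustment, thresholds) can be sketched in plain Python. This is an illustration under stated assumptions, not the DEgenes Hunter implementation: the function names and the example genes are hypothetical, the 0.05 cutoff is assumed to apply to the BH-adjusted combined p value, and Fisher's method formally assumes independent p values, which may not hold for two tools run on the same counts.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df
    under the null. For even df the survival function has a closed form."""
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p values must lie in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return min(1.0, math.exp(-half_x) * total)


def benjamini_hochberg(p_values):
    """BH-adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest p value
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def call_degs(genes, p_cutoff=0.05, lfc_cutoff=1.0):
    """genes: list of (name, p_edger, p_deseq2, log2_fold_change)."""
    combined = [fisher_combined_p([pe, pd]) for _, pe, pd, _ in genes]
    adjusted = benjamini_hochberg(combined)
    return [name for (name, _, _, lfc), q in zip(genes, adjusted)
            if q < p_cutoff and abs(lfc) > lfc_cutoff]


# Hypothetical genes and values, for illustration only.
example = [
    ("geneA", 0.001, 0.002, 2.3),
    ("geneB", 0.2, 0.4, 1.5),
    ("geneC", 0.0005, 0.001, 0.4),
    ("geneD", 0.01, 0.003, -1.8),
]
print(call_degs(example))
```

      In this toy input, geneB fails the p value cutoff and geneC fails the fold-change cutoff, so only geneA and geneD are retained.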

      Regarding the RNA-seq analysis, we are aware of the amount of information that can be extracted. Before filtering the information shown in the manuscript, we carried out a bioinformatic analysis to look for connections with the cellular response, that is, the increase in sporulation. Beyond this, we made some observations with no direct connection to sporulation, which would be interesting to pursue in future studies but were left out for the clarity of this story (Figure 2 below). In any case, we are including the whole picture of the transcriptomic changes occurring in Bsub after treatment with Tse1. KEGG pathway analyses of differentially expressed genes showed induction of flagellar assembly, aminobenzoate degradation, and nitrogen and amino acid metabolism. Interestingly, fatty acid degradation and CAMP resistance pathways were also induced, probably in relation to the changes the cell wall undergoes after the action of the Tse1 toxin. On the other hand, the synthesis and degradation of ketone bodies pathway was mostly repressed.

      Figure 2. KEGG pathway analyses of genes differentially expressed occurring in Bsub after treatment with Tse1.

      Another methodological concern in this paper is the limited details provided for the calculation of the permeabilization rate (Figure 4, L359, L662-664). It is not clear how, or if, cell density was controlled for in these experiments.

      We agree with the reviewer and have explained in more detail how the permeabilization rate was calculated. Quote text: “N=3 for Bsub treated with Tse1 and N=3 for untreated Bsub. N refers to the number of CLSM fields analyzed to calculate the number of permeabilized cells of the total of cells in the field”

      Finally, one weakness of the paper is the broad conclusions that they draw. The authors claim that the mechanism of sporulation activation is conserved across Bacilli when the authors only test one B. subtilis and one B. cereus strain. They further argue (lines 469+) that Tse1 requires a PAAR repeat for its targeting, but do not provide direct evidence for this possibility.

      We have toned down the final conclusion to specify that the activation of sporulation is a mechanism that can be found in different Bacillus species such as Bsub and Bcer. Regarding the second point, we have included a further explanation for this argument. Quote text: “As shown in Figure 2B, a paar mutant has an active T6SS able to kill E. coli. However, as shown in Figure 2A, we noticed that a paar mutant (which encodes tse1) is not able to trigger B. subtilis sporulation to a similar level than Pchl WT strain. Given that paar deletion apparently abolishes Tse1 secretion, we suggest that Tse1 is a PAAR-associated effector that requires a PAAR repeat domain protein to be targeted for secretion, thereby increasing Bacillus sporulation during contact with Pseudomonas cells (Cianfanelli et al., 2016; Hachani et al., 2014; Whitney et al., 2014)”.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, Elkind et al. use a deep learning segmentation algorithm trained on detecting putative cell nuclei in mouse brains to count cells in the Allen Mouse Brain Connectivity Atlas. The Allen Mouse Brain Connectivity Atlas is a dataset comprising hundreds of mouse brains. The authors use this increased statistical power for detecting differences in volume, cell count, and cell density between strains (C57BL/6J and FVB.CD1) as well as sex differences.

      Volume, cell count, and cell density are all regularly used in neuroanatomy to normalize or benchmark results, so having a large available dataset against which others can compare their data would be a useful resource. The trained segmentation algorithm might also find utility in assays where investigators for one reason or another can't dedicate an entire labeled channel to count cell nuclei.

      Nevertheless, because of technical reasons, I find the current work problematic.

      We thank the Reviewer for acknowledging the potential usefulness of our work, and for the insightful, helpful comments. We believe addressing them has made our revised manuscript much stronger compared to the initial submission. We hope our revised version will also clear the Reviewer’s remaining doubts.

      Major:

      The authors make use of the "red" channel from the Allen Mouse Brain Connectivity Project (AMBCP). The AMBCP was acquired using two-photon tomography with the TissueCyte 1000 system (http://help.brain-map.org/download/attachments/2818171/Connectivity_Overview.pdf?version=2&modificationDate=1489022310670&api=v2). The sample is illuminated at 925 nm wavelength and the channel the authors describe as autofluorescence is collected through a 593/40 nm bandpass filter. The authors go on to describe their rationale for using this channel for quantifying cell nuclei:

      "We noticed that the red (background) channel of STPT images, taken for the purpose of atlas alignment, typically features dark, round-like objects resembling cell nuclei. We had observed this phenomenon in our own imaging of mouse brains but found little more than anecdotal mentions of it in the literature8,9,10,11".

      The authors here cite a Scientific Reports paper from 2021 with 11 citations, a Journal of Clinical Pathology paper from 2005 with 87 citations, and lastly a paper in Laboratory Investigation from 2016 with 41 citations. The authors completely fail to cite the work from Watt Webb's group (co-inventor of 2p microscopy) in PNAS from 2003 that fully described the phenomenon of native fluorescence by multiphoton excitation (https://www.pnas.org/doi/10.1073/pnas.0832308100), 1959 citations so far. This is either indicative of poor scholarship or an attempt to describe something as novel. Either way, the native fluorescence and second harmonic generation from multiphoton illumination are perfectly characterized by Webb and colleagues and they clearly show the differential effect on nucleosides, retinol, indoleamines, and collagen. This is also where the authors should have paid more attention to discrepancies in their own data when correlated to well-established cell nuclei markers (Murakami et al). The authors will note "black large spots" in the data at specific anatomical regions and structures, like the fornix and stria medullaris: https://connectivity.brain-map.org/projection/experiment/siv/263780729?imageId=263780960&imageType=TWO_PHOTON,SEGMENTATION&initImage=TWO_PHOTON&x=15702&y=18833&z=5

      which is not reproduced in for example the Allen Reference Atlas H&E staining: http://atlas.brain-map.org/atlas?atlas=1&plate=100960284#atlas=1&plate=100960284&resolution=4.19&x=5507.4000244140625&y=5903.39990234375&zoom=-2

      In connection here notice the poor signal in the 2p "autofluorescence" within the paraventricular nucleus: https://connectivity.brain-map.org/projection/experiment/siv/263780729?imageId=263780960&imageType=TWO_PHOTON,SEGMENTATION&initImage=TWO_PHOTON&x=15702&y=17833&z=6

      and then compare it to the H&E staining: http://atlas.brain-map.org/atlas?atlas=1&plate=100960280#atlas=1&plate=100960276&resolution=1.50&x=5342.476283482143&y=5368.023856026786&zoom=0

      These multiphoton-specific signals are especially pronounced in the pons and medulla which makes quantification especially dubious, which is even apparent simply from looking at Figure 1c in the manuscript.

      We thank the Reviewer for the comments and sincerely apologize for missing the seminal work of Webb’s group. We included the former references for their specific mention or illustration of non-autofluorescent nuclei. We indeed failed to address the underlying chemistry that Webb’s group beautifully characterized. We have added the following sentence in the Results section “Autofluorescence of STPT images displays cell nuclei” (red font for new sentence; Reference #15 corresponds to Zipfel et al.):

      “We noticed that the red (background) channel of STPT images, taken for the purpose of atlas alignment, typically features dark, round-like objects resembling cell nuclei. This phenomenon was described in previous literature11,12,13,14. In particular, Zipfel et al. characterized the use of multiphoton-excited native fluorescence and second harmonic generation for the purpose of staining-free tissue imaging15.”

      And mentioned the dependency of our method on the presence of intrinsically fluorescent molecules in the Discussion:

      “The study has several limitations. First, the model is sensitive to the contrast between dark nuclei and autofluorescent surroundings, which can be limited by image quality and tissue composition. In particular, the staining-free approach depends on the presence of intrinsic molecular indicators such as NADH, retinol or collagen15, which may vary between cell or tissue components, even within the brain.”

      We understand that more generally, the Reviewer’s major concern above was regarding the technical validity of our approach; that the segmentation based on small objects lacking autofluorescence, as evident in the STPT dataset, in fact corresponds to cells/nuclei.

      In our initial Supplemental Figure 1 (in current version Figure 1—figure supplement 1) we provide technical validation of the method, by showing nuclear staining, and autofluorescence side-by-side, using epifluorescence microscopy. In our revision we now report appropriate statistical measures for this analysis (true positives, false positives, false negatives).
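
      As an illustration of how such detection statistics are commonly summarized, the sketch below derives precision, recall and F1 from true-positive, false-positive and false-negative counts. The function name and the example counts are hypothetical and are not taken from the manuscript; this is a minimal sketch, not the authors' analysis code.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Summarize object-detection performance from raw counts.

    tp: detected objects that match a stained nucleus
    fp: detected objects with no matching nucleus
    fn: stained nuclei that were not detected
    """
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the calculation.
    print(detection_metrics(tp=90, fp=10, fn=30))
```

      Reporting precision and recall separately, rather than a single accuracy figure, makes it easier to see whether a staining-free method mainly misses nuclei or mainly reports spurious ones.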

      In addition, we performed the following two sets of validations –

      (i) Technical validation of our staining-free quantification approach, by nuclear staining. We performed nuclear staining (Hoechst 33342) followed by STPT imaging of 9 female brains and trained a new deep neural network (DNN) to segment the resulting images (STPT was performed by TissueVision). Unfortunately, in STPT it is not technically possible to analyze nuclear staining and autofluorescence in the very same tissue. Therefore, we compared per-region density, cell count and volume of the nuclei-stained validation brains to our original DNN-based analysis of AMBCA brains. We show a correlation coefficient >0.99 for per-region cell count in AMBCA autofluorescence and our nuclear staining (and a similar correlation coefficient for volume). However, the number of cells in nuclear staining over the whole brain is 56% larger than in autofluorescence. Although we currently have no technically feasible way to prove this, one likely explanation for this discrepancy is the nature of the two signals the imaging detects: the presence of a positive signal (Hoechst fluorophore) versus the absence of autofluorescence. Further, discrepancies between the two methods were notably higher in glial-rich tissues (e.g., CTX L1, midbrain, brainstem) – leading to the speculation that low-autofluorescent object-counts may be biased to detect neurons, rather than glia.

      (ii) Independent validation of the biological findings – discussed further below. Regarding the specific concern of “black large spots” in the fornix and stria medullaris – we would like to emphasize that our DNN does not identify and segment dark regions like ventricles and tracts. We provide in the Author Response Image 1 three examples featuring “black large spots” of different shapes and size, with examples of the segmentation results as shown in Figures 1 and 2 of the manuscript. Note that colored circles, that appear as dots depending on magnification, are the objects that were detected and segmented by the DNN. In the Figure we demonstrate that (1) fiber tracts (incl. fornix, stria medullaris) are not segmented; (2) striatal patches (that are smaller still than the fiber tracts in question) are not segmented; and (3) putative blood vessels, appearing as elongated, black structures, are ignored by our DNN.

      Author Response Image 1. How does the DNN deal with large black spots? Examples for fiber tracts, striatal patches, and blood vessels; adapted from Figures 1 and 2 in the manuscript. Note that dots/outlines represent segmented putative “nuclei” as detected by the model, colored by assigned region according to Allen Mouse Brain hierarchy. Example (1): fiber tracts (incl. fornix, stria medullaris) are not segmented. Example (2): Striosomes (patches in the striatum, that are smaller still than the fiber tracts in question) are not segmented, and the much smaller objects that are detected as putative nuclei are indicated by arrows. Example (3): putative blood vessels, appearing as elongated, black structures, are ignored by our DNN. Examples of the segmentation images were adapted from the manuscript’s Figure 1 to correspond to the STPT image featuring fiber tracts (and striosomes/patches) that was pointed out by the Reviewer.

      Retrieved from: https://connectivity.brain-map.org/projection/experiment/siv/263780729?imageId=263780960&imageType=TWO_PHOTON,SEGMENTATION&initImage=TWO_PHOTON&x=15702&y=18833&z=5.

      Regarding the claim of problematic counting in brain stem regions, we agree, and had addressed this limitation in the manuscript’s Discussion (see below). We believe that our counting is valuable even if in some regions there is a significant systematic error: Most of the analyses in this study compare brain regions across individuals and thus systematic error is less impactful. In the revision, we nevertheless took care to validate and quantify the size of this effect. Briefly, we compared counting based on nuclear staining (Hoechst) from 9 STPT imaged brains, to our quantifications of non-autofluorescent objects. As expected, the ratio between these counts depends on the brain region, and accuracy is better in regions with high brightness, which are not on the border of the section (Figure 2—figure supplement 2). As for pons and medulla, the densities in our Hoechst quantifications are 43% and 60% higher than in our AMBCA analysis, respectively, yet rank order is kept in both.

      We have revised the relevant sentences in the Discussion:

      Original sentences: The study has several limitations. … In the hindbrain (pons, medulla), contrast was exceedingly weak, and we expect our quantifications in this region to strongly underestimate real cell densities, to an extent we cannot quantify.

      Revised sentences: The study has several limitations. … In the hindbrain (pons, medulla), contrast was exceedingly weak, and we expect our quantifications in this region to be 66% of the value estimated by nuclear staining (Figure 2—figure supplement 2).

      The authors here use the correlation on log-log coordinates between their data and that of Murakami et al to argue that the method has validity. However, the variance explained here is R^2 = 0.74 which is very poor given the log-log coordinates. A more valid approach would use linear coordinates, compute the ICC, and interpret it according to established guidelines (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4913118/).

      As mentioned by the Reviewer, Figure 2D compares Murakami et al. cell counts and ours, across all brain regions. The value “r=0.869” represents the correlation coefficient between the two vectors in log scale and not the R^2. We also now display the correlation coefficient for the linear scale, in which case r=0.98. As suggested by the Reviewer, we added ICC values between the two vectors in linear scale. Using 6 different forms (ICC – 1-1;1-k;C-1;C-k;A-1;A-k), the ICC values were 0.98-0.99, thus corresponding to an excellent agreement (ICC values are mentioned in legend of Figure 2).
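
      To make the distinction between the ICC forms concrete, the following is a minimal sketch of the two-way single-measure coefficients, ICC(C,1) for consistency and ICC(A,1) for absolute agreement, computed from ANOVA mean squares using the standard McGraw and Wong definitions. It is not the code used for Figure 2; the function name and the toy data are hypothetical, with rows standing in for brain regions and columns for the two counting methods.

```python
def icc_two_way(data):
    """Return (ICC(C,1), ICC(A,1)) for an n-subjects x k-raters table.

    data: list of n rows, each a list of k measurements (no missing values).
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA decomposition.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Consistency ignores a constant offset between raters ...
    icc_c1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    # ... whereas absolute agreement penalizes it through the column term.
    icc_a1 = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + (k / n) * (ms_cols - ms_err)
    )
    return icc_c1, icc_a1


if __name__ == "__main__":
    # Toy example: the second method always reads exactly 1 unit higher.
    print(icc_two_way([[1, 2], [2, 3], [3, 4]]))
```

      In the toy example the two methods rank the regions identically but differ by a constant offset, so the consistency form is at its maximum while the absolute-agreement form is lower. This is why reporting several ICC forms, as done above, is informative when one method may count systematically more cells than the other.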

      Author Response Image 2 displays the revised Figure 2D (left), and the log value of the ratio between the AMBCA-based cell count and the Murakami-based value (right), as a function of region volume. The mean value across regions is zero, corresponding to similar cell counts in both methods. Indeed, there exist outlier regions, that may be attributed to either registration errors, different experimental protocols or may stem from the fact that the Murakami values are based on 3 brains, compared to hundreds of AMBCA brains.

      Author Response Image 2. Correlation with cell counts in Murakami et al. Left, revised Figure 2D; Right, ratio between AMBCA-based cell counts and Murakami et al. counts, as a function of region volume

      In addition to the above concern, the authors argue that the large sample size of the AMBCP is what would enable them to find statistically significant small effect sizes that might have gone undetected in the literature. However, this argument falls flat once we examine some of the main findings the authors report. Although the authors do not directly report measures of dispersion we can estimate it from the figures and then arrive at the sample size needed to find the reported effect size. For example, the effect that describes ORBvl2/3 volume is larger in female mice compared to males would only require n=13 mice at the desired power of 0.8. Likewise, the sample size needed to detect the increased BST volume in male mice looks to be roughly n=16 mice at the desired power of 0.8. Both of these estimates are well within what is a reasonable sample size to expect in an ordinary study. This begs the question: why did the authors simply not verify some of their main findings in an independent sample obtained through traditional ways to quantify volume and cell density since it is well within reach? Such validation would strengthen the arguments of the paper.
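
      The kind of estimate described in this comment can be sketched with the textbook normal approximation for a two-sided, two-sample comparison of means, n per group ≈ 2((z_{1-α/2} + z_{power}) / d)^2, where d is the standardized effect size (Cohen's d). The effect sizes used below are placeholders, not values read from the manuscript's figures, and the function name is hypothetical; an exact t-based calculation gives slightly larger numbers for small samples.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-sided two-sample test.

    Uses the normal approximation; effect_size is Cohen's d (mean difference
    divided by the common standard deviation).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Placeholder effect sizes, only to show how n shrinks as d grows.
    for d in (0.5, 1.0):
        print(d, n_per_group(d))
```

      Because the required n scales with 1/d^2, halving the effect size roughly quadruples the sample needed, which is the crux of the disagreement about which findings genuinely depend on a very large cohort.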

      We thank the reviewer for this comment and apologize. In the revised version we do report dispersion.

      We would like to emphasize that due to our restricted time and resources, we decided to focus our experimental validation on the technical comparison of nuclear staining vs. autofluorescence-based segmentation, outlined above.

      We then verified the biological findings from the initial cohort using C57BL/6J volume data from an additional 663 males vs. 166 females on AMBCA. This independent cohort showed similar sexual dimorphism in the volume of MEA, BST and ORBvl2/3, as depicted in the following figure (panels A-D and also as new Figure 4—figure supplement 1).

      We fully acknowledge the interesting issue raised on sample sizes required to detect our reported effect sizes. Therefore, we here also present the average p-value for sexual dimorphism in volumes of MEA, BST and ORBvl2/3, as a function of the sample size (panel E in Figure 4—figure supplement 1 of the revised manuscript). The Reviewer will note that the regions with largest effect size (MEA, BST) can be detected within more ordinary sample sizes, and indeed, MEA and BST dimorphism is evident in the literature. ORB dimorphism required a much greater sample size; and our analysis (Figure 4) systematically detected many more dimorphic regions, in volume, density and count.
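
      One way to obtain an average p-value as a function of sample size is to repeatedly draw random subsamples of each sex and test each draw. The sketch below illustrates that idea under stated assumptions: it uses a Welch-type statistic with a normal approximation for the p-value (to stay within the standard library), synthetic Gaussian data, and hypothetical function names. It is not the procedure or the data behind panel E.

```python
import random
from statistics import NormalDist, mean, variance


def welch_z_pvalue(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def mean_p_by_sample_size(group_a, group_b, sizes, n_draws=200, seed=0):
    """Average p-value over random subsamples of n animals per group."""
    rng = random.Random(seed)
    result = {}
    for n in sizes:
        p_values = [
            welch_z_pvalue(rng.sample(group_a, n), rng.sample(group_b, n))
            for _ in range(n_draws)
        ]
        result[n] = mean(p_values)
    return result


if __name__ == "__main__":
    # Synthetic "volumes" with a one-standard-deviation group difference.
    rng = random.Random(1)
    females = [rng.gauss(1.0, 1.0) for _ in range(300)]
    males = [rng.gauss(0.0, 1.0) for _ in range(300)]
    print(mean_p_by_sample_size(females, males, sizes=[5, 10, 20, 50]))
```

      Plotting the returned averages against n gives a curve of the kind described above: large effects cross conventional significance thresholds at small n, while subtle effects only do so in cohorts of hundreds of animals.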

      Reviewer #2 (Public Review):

      This report describes a large-scale analysis of cell counts in mouse brains. The authors found that the Allen Mouse Connectivity project has a rich dataset for cell counting that is yet to be analyzed, and they developed methods to quantify cells in different nuclei. They go on to compare males vs females and two different strains. From this analysis, they found specific differences between male versus female brains, left versus right hemispheres, and C57BL/6 versus FVB.CD1 mice, especially with regard to cell counts and density.

      Overall, the methodology is sound and the quality of the data seems high. In fact, this study uses >100 brains for the statistics, and this is one of the major strengths of this study. For researchers who are interested in interrogating the differences at the macroscopic level in brain structures, this study will be a great resource. For example, the manuscript contains an interesting finding that for most brain areas, females have larger volumes but fewer cell numbers.

      We thank the Reviewer for these comments. We would like to mention that the revised version of the manuscript does not include a statement regarding BL6 female volume. We found a batch effect in the AMBCA experiments, mostly affecting the volume in their first batch (Figure 2—figure supplement 1B). That batch included mostly males, and had, for some reason, lower volume compared to all later experiments, which caused the volume differences. We emphasize that (1) the total number of cells did not show any batch effect (Figure 2—figure supplement 1C); (2) we normalized the volume and repeated the analysis. Aside from the finding that females did not in fact have larger volumes, other main findings remained unchanged.

      Reviewer #3 (Public Review):

      Elkind et al. have devised a strategy to detect cells in whole brain samples of the large, publicly accessible Allen Mouse Brain Connectivity database. They put together an analysis pipeline to quantify cell numbers and -density as well as volumes for all annotated brain areas in these samples. This allowed them to make several important discoveries such as (1) strain-, sex- and hemisphere-specific differences in cell densities, (2) a large interindividual variability in cell numbers, and (3) an absence of linear scaling of cell count with volume, among others. The key strength of this work lies in its comprehensive analysis, the large sample size that the authors have drawn from (making their conclusions particularly robust), and the fact that they have made their analysis tools accessible. A weakness of the current manuscript is the dense layout and overplotting of several of the figures, and the lack of necessary information to understand them more easily. Another, conceptual weakness of using the autofluorescence channel for cell detection is that the identity (neuronal vs non-neuronal) of the underlying cells remains unresolved. Overall, however, I believe that this study has the potential to serve as a valuable reference point, and I would expect this work to have a lasting impact on quantitative studies of mouse brain cytoarchitecture.

      We thank the Reviewer for these valuable comments. We have tried to minimize overplotting of figures and hopefully added all necessary information. For example, the revised manuscript presents more pared-down figures, with data labels omitted if they crowded the graphic. Instead, we provide the full data in Supplemental tables, and our online accessible GUI. We hope the reader will feel encouraged both to zoom into the presented data and to explore the additional tables and our online tool more deeply.

      Regarding the question of cell types, we were unfortunately not able to provide a definitive answer, but our validation experiments provided some potential clues. For example, nuclear staining (Hoechst) uniformly detected 65% more cells than AMBCA autofluorescence quantification. And, in neuron-rich regions, the correspondence between nuclear staining and AMBCA autofluorescence was notably better than in glia-rich regions (e.g., CTX L1, midbrain, medulla). These discrepancies between the techniques may therefore point to an underlying difference in cell-type composition – such that counting low-autofluorescent nuclei is biased to neurons.

      In addition, however, the methods differ in their native physical properties; in that one detects presence of a fluorescent signal (e.g., the nuclear stain is detected beyond its focal plane), compared to the detection of the absence of a signal (which, in turn, is dependent on the presence of surrounding intrinsic fluorescent molecules). It is technically non-trivial to assess the extent to which these factors apply. We have added a clarification along these lines in the Discussion (below). We would further like to emphasize the nature of our study as a comparative, systematic analysis within this interesting cohort, rather than providing definitive cell counts – that we found to be greatly variable across the population.

      “We further attempted to estimate the region-specific accuracy of our cell counting by comparing autofluorescence STPT with brain-wide imaging of nuclear-stained STPT. However, this comparison is technically nontrivial because of the native physical properties of direct staining vs. autofluorescence. For example, stained nuclei located off the focal plane may appear in the image, yet remain undetected by autofluorescence. In addition, tissue composition (e.g., cell types, extracellular matrix) may affect the imaged region. Indeed, in regions rich with non-neuronal cells the error of autofluorescent-based counting was larger compared to nuclear staining. Hence, one may speculate that autofluorescent-based detection is biased for neurons”

    1. Author Response:

      Reviewer #1 (Public Review):

      Chakrabarti et al study inner hair cell synapses using electron tomography of tissue rapidly frozen after optogenetic stimulation. Surprisingly, they find a nearly complete absence of docked vesicles at rest and after stimulation, but upon stimulation vesicles rapidly associate with the ribbon. Interestingly, no changes in vesicle size were found along or near the ribbon. This would have indicated a process of compound fusion prior to plasma membrane fusion, as proposed for retinal bipolar cell ribbons. This lack of compound fusion is used to argue against MVR at the IHC synapse. However, that is only one form of MVR. Another form, coordinated and rapid fusion of multiple docked vesicles at the bottom of the ribbon, is not ruled out. Therefore, I agree that the data set provides good evidence for rapid replenishment of the ribbon-associated vesicles, but I do not find the evidence against MVR convincing. The work provides fundamental insight into the mechanisms of sensory synapses.

      We thank the reviewer for the appreciation of our work and the constructive comments. As pointed out below, we now included this discussion (from line 679 onwards).

      We wrote:

      “This might reflect spontaneous univesicular release (UVR) via a dynamic fusion pore (i.e. ‘kiss and run’; Ceccarelli et al., 1979), which was suggested previously for IHC ribbon synapses (Chapochnikov et al., 2014; Grabner and Moser, 2018; Huang and Moser, 2018; Takago et al., 2019) and/or rapid undocking of vesicles (e.g. Dinkelacker et al., 2000; He et al., 2017; Nagy et al., 2004; Smith et al., 1998). In the UVR framework, stimulation by ensuing Ca2+ influx triggers the statistically independent release of several SVs. Coordinated multivesicular release (MVR) has been indicated to occur at hair cell synapses (Glowatzki and Fuchs, 2002; Goutman and Glowatzki, 2007; Li et al., 2009) and retinal ribbon synapses (Hays et al., 2020; Mehta et al., 2013; Singer et al., 2004) during both spontaneous and evoked release. We could not observe structures which might hint towards compound or cumulative fusion, neither at the ribbon nor at the AZ membrane under our experimental conditions. Upon short and long stimulation, RA-SVs as well as docked SVs even showed a slightly reduced size compared to controls. However, since some AZs harbored more than one docked SV per AZ in stimulated conditions, we cannot fully exclude the possibility of coordinated release of few SVs upon depolarization.”

      Reviewer #2 (Public Review):

      Chakrabarti et al. aimed to investigate exocytosis from ribbon synapses of cochlear inner hair cells with high-resolution electron microscopy with tomography. Current methods to capture the ultrastructure of the dynamics of synaptic vesicle release in IHCs rely on the application of potassium for stimulation, which constrains temporal resolution to minutes rather than the millisecond resolution required to analyse synaptic transmission. Here the authors implemented a high-pressure freezing method relying on optogenetics for stimulation (Opto-HPF), granting them both high spatial and temporal resolutions. They provide an extremely well-detailed and rigorously controlled description of the method, falling in line with previous use of such "Opto-HPF" studies. They successfully applied Opto-HPF to IHCs and had several findings at this highly specialised ribbon synapse. They observed a stimulation-dependent accumulation of docked synaptic vesicles at IHC active-zones, and a stimulation-dependent reduction in the distance of non-docked vesicles to the active zone membrane; while the total number of ribbon-associated vesicles remained unchanged. Finally, they did not observe increases in diameter of synaptic vesicles proximal to the active zone, or other potential correlates to compound fusion - a potential mode of multivesicular release. The conclusions of the paper are mostly well supported by data, but some aspects of their findings and pitfalls of the methods should be better discussed.

      We thank the reviewer for the appreciation of our work and the constructive comments.

      Strengths:

      While now a few different groups have used "Opto-HPF" methods (also referred to as "Flash and Freeze") in different ways and synapses, the current study implemented the method with rigorous controls in a novel way to specifically apply to cochlear IHCs - a different sample preparation than neuronal cultures, brain slices or C. elegans, the sample preparations used so far. The analysis of exocytosis dynamics of IHCs with electron microscopy with stimulation has been limited to being done with the application of potassium, which is not physiological. While much has been learned from these methods, they lacked time resolution. With Opto-HPF the authors were successfully able to investigate synaptic transmission with millisecond precision, with electron tomography analysis of active zones. I have no overall questions regarding the methodology as they were very thoroughly described. The authors also employed electrophysiology with optogenetics to characterise the optical stimulation parameters and provided a well described analysis of the results with different pulse durations and irradiance - which is crucial for Opto-HPF.

      Thank you very much.

      Further, the authors did a superb job in providing several tables with data and information across all mouse lines used, experimental conditions, and statistical tests, including source code for the diverse analysis performed. The figures are overall clear and the manuscript was well written. Such a clear representation of data makes it easier to review the manuscript.

      Thank you very much.

      Weaknesses:

      There are two main points that I think need to be better discussed by the authors.

      The first refers to the pitfalls of using optogenetics to analyse synaptic transmission. While ChR2 provides better time resolution than potassium application, one cannot discard the possibility that calcium influx through ChR2 alters neurotransmitter release. This important limitation of the technique should be properly acknowledged by the authors and the consequences discussed, specifically in the context in which they applied it: a single sustained pulse of light of ~20ms (ShortStim) and of ~50ms (LongStim). While longer, sustained stimulation is characteristic for IHCs, these are quite long pulses as far as optogenetics and potential consequences to intrinsic or synaptic properties.

      We thank the reviewer for pointing this out. We would like to mention that upon 15 min high potassium depolarization, the number of docked SVs only slightly increased as shown in Chakrabarti et al., 2018, EMBO rep and Kroll et al. 2020 JCS, but it was not statistically significant. In the current study, we report a similar phenomenon, but here light induced depolarization resulted in a more robust increase in the number of docked SVs.

      To compare the data from the previous studies with the current study, we included an additional table 3 (line 676) now in the discussion with all total counts (and average per AZ) of docked SVs.

      Furthermore, in response to the reviewers’ concern, we now discuss the Ca2+ permeability of ChR2 in addition to the above comparison to our previous studies that demonstrated very few docked SVs in the absence of K+ channel blockers and ChR2 expression in IHCs. We are not entirely certain whether the reviewer refers to potential dark currents of ChR2 (e.g. as an explanation for a depletion of docked vesicles under non-stimulated conditions) or to photocurrents, the influx of Ca2+ through ChR2 itself, and their contribution to Ca2+ concentration at the active zone.

      Regardless, we consider it unlikely that a potential contribution of Ca2+ influx via ChR2 evokes SV fusion at the hair cell active zone.

      First of all, we note that the Ca2+ affinity of IHC exocytosis is very low. As first shown in Beutner et al., 2001 and confirmed thereafter (e.g. Pangrsic et al., 2010), there is little if any IHC exocytosis for Ca2+ concentrations at the release sites below 10 µM. Two studies using CatCh (a ChR2 mutant with higher Ca2+ permeability than wildtype ChR2; Kleinlogel et al., 2011; Mager et al., 2017) estimated a max intracellular Ca2+ increase below 10 µM, even at very negative potentials that promote Ca2+ influx along the electrochemical gradient or at high extracellular Ca2+ concentrations of 90 mM. In our experiments, IHCs were depolarized, instead, to values for which extrapolation of the data of Mager et al., 2017 indicates a submicromolar Ca2+ concentration. In addition, we and others have demonstrated powerful Ca2+ buffering and extrusion in hair cells (e.g. Tucker and Fettiplace, 1995; Issa and Hudspeth, 1996; Frank et al., 2009; Pangrsic et al., 2015). As a result, the hair cells efficiently clear even massive synaptic Ca2+ influx and establish a low bulk cytosolic Ca2+ concentration (Beutner and Moser, 2001; Frank et al., 2009). We reason that these clearance mechanisms efficiently counter any Ca2+ influx through ChR2. This will likely limit potential effects of ChR2 mediated Ca2+ influx on Ca2+ dependent replenishment of synaptic vesicles during ongoing stimulation.

      We have now added the following in the discussion (starting in line 620):

      “We note that ChR2, in addition to monovalent cations, also permeates Ca2+ ions and poses the question whether optogenetic stimulation of IHCs could trigger release due to direct Ca2+ influx via the ChR2. We do not consider such Ca2+ influx to trigger exocytosis of synaptic vesicles in IHCs. Optogenetic stimulation of HEK293 cells overexpressing ChR2 (wildtype version) only raises the intracellular Ca2+ concentration up to 90 nM even with an extracellular Ca2+ concentration of 90 mM (Kleinlogel et al., 2011). IHC exocytosis shows a low Ca2+ affinity (~70 µM, Beutner et al., 2001) and there is little if any IHC exocytosis for Ca2+ concentrations below 10 µM, which is far beyond what could be achieved even by the highly Ca2+ permeable ChR2 mutant (CatCh: Ca2+ translocating channelrhodopsin, Mager et al., 2017). In addition, we reason that the powerful Ca2+ buffering and extrusion by hair cells (e.g., Frank et al., 2009; Issa and Hudspeth, 1996; Pangršič et al., 2015; Tucker and Fettiplace, 1995) will efficiently counter Ca2+ influx through ChR2 and, thereby, limit potential effects on Ca2+ dependent replenishment of synaptic vesicles during ongoing stimulation.”

      The second refers to the finding that the authors did not observe evidence of compound fusion (or homotypic fusion) in their data. This is an interesting finding in the context of multivesicular release in general, as well as specifically for IHCs. While the authors discussed the potential for "kiss-and-run" and/or "kiss-and-stay", it would be valuable if they could discuss their findings further in the context of the field for multivesicular release. For example, the evidence in support of the potential of multiple independent release events. Further, as far as such function-structure optical-quick-freezing methods, it is not unusual to not capture fusion events (so-called omega-shapes or vesicles with fusion pores); this is largely because these are very fast events (less than 10 ms), and not easily captured with optical stimulation.

      We agree with the reviewer that the discussion on MVR and UVR should be extended. We have now added the following paragraph to the discussion (from line 679 onwards):

      “This might reflect spontaneous univesicular release (UVR) via a dynamic fusion pore (i.e. ‘kiss and run’; Ceccarelli et al., 1979), which was suggested previously for IHC ribbon synapses (Chapochnikov et al., 2014; Grabner and Moser, 2018; Huang and Moser, 2018; Takago et al., 2019) and/or rapid undocking of vesicles (e.g. Dinkelacker et al., 2000; He et al., 2017; Nagy et al., 2004; Smith et al., 1998). In the UVR framework, stimulation by ensuing Ca2+ influx triggers the statistically independent release of several SVs. Coordinated multivesicular release (MVR) has been indicated to occur at hair cell synapses (Glowatzki and Fuchs, 2002; Goutman and Glowatzki, 2007; Li et al., 2009) and retinal ribbon synapses (Hays et al., 2020; Mehta et al., 2013; Singer et al., 2004) during both spontaneous and evoked release. We could not observe structures which might hint towards compound or cumulative fusion, neither at the ribbon nor at the AZ membrane under our experimental conditions. Upon short and long stimulation, RA-SVs as well as docked SVs even showed a slightly reduced size compared to controls. However, since some AZs harbored more than one docked SV per AZ in stimulated conditions, we cannot fully exclude the possibility of coordinated release of few SVs upon depolarization.”

      Reviewer #3 (Public Review):

      Precise methods were developed to validate the expression of channelrhodopsin in inner hair cells of the Organ of Corti, to quantify the relationship between blue light irradiance and auditory nerve fiber depolarization, to control light stimulation within the chamber of a high-pressure freezing device, and to measure with good precision the delay between stimulation and freezing of the specimen. These methods represent a clear advance over previous experimental designs used to study this synaptic system and are an initial application of rapid high-pressure freezing with freeze substitution, followed by high-resolution electron tomography (ET), to sensory cells that operate via graded potentials.

      Short-duration stimuli were used to assess the redistribution of vesicles among pools at hair cell ribbon synapses. The number of vesicles linked to the synaptic ribbon did not change, but vesicles redistributed within the membrane-proximal pool to docked locations. No evidence was found for vesicle-to-vesicle fusion prior to vesicle fusion to the membrane, which is an important, ongoing question for this synapse type. The data for quantifying numbers of vesicles in membrane-tethered, non-tethered, and docked vesicle pools are compelling and important.

      We thank the reviewer for the appreciation of our work and the constructive comments.

      These quantifications would benefit from additional presentation of raw images so that the reader can better assess their generality and variability across synaptic sites.

      The images shown for each of the two control and two experimental (stimulated) preparation classes should be more representative. Variation in synaptic cleft dimensions and numbers of ribbon-associated and membrane-proximal vesicles do not track the averaged data. Since the preparation has novel stimulus features, additional images (as the authors employed in previous publications) exhibiting tethered vesicles, non-tethered vesicles, docked vesicles, several sections through individual ribbons, and the segmentation of these structures, will provide greater confidence that the data reflect the images.

      Thank you very much for pointing this out. We now included more details in supplemental figures and in the text.

      Specifically, we added:

      • More details about the morphological sub-pools (analysis and images):

        -We now show a sequence of images with different tethering states of membrane-proximal SVs, together with examples of docked and non-tethered SVs, for each condition, as we did in Chakrabarti et al., 2018 (Fig. 6-figure supplement 2, line 438). In addition, we selected one further tomogram per condition and depict two additional virtual sections (Fig. 6-figure supplement 2).

        -Moreover, we present a more detailed quantification of the different morphological sub-pools. For the MP-SV pool, we analyzed the SV diameters and the distances to the AZ membrane and PD of the different SV sub-pools separately; this information is now included in Fig. 7. For the RA-SVs, we additionally analyzed the morphological sub-pools and the SV diameters in the distal and proximal parts of the ribbon, as done in Chakrabarti et al., 2018. We added a new supplementary figure (Fig. 7-figure supplement 2, line 558) and Supplementary file 2.

      • We replaced the virtual section in panel 6D. In the old version, it appeared that the ribbon was contacting the membrane, and we realized that this virtual section was not representative: the ribbon was in fact not directly contacting the AZ membrane, and a presynaptic density was still visible adjacent to the docked SVs. To avoid potential confusion, we selected a different virtual section of the same tomogram and now also indicate the presynaptic density as a graphical aid in Fig. 6.

      The introduction raises questions about the length of membrane tethers in relation to vesicle movement toward the active zone, but this topic was not addressed in the manuscript.

      We apologize for not stating this sufficiently clearly and have rephrased the sentence. It now reads:

      “…and seem to be organized in sub-pools based on the number of tethers and to which structure these tethers are connected.”

      Seemingly quantification of this metric, and the number of tethers especially for vesicles near the membrane, is straightforward. The topic of EPSC amplitude as representing unitary events due to variation in vesicle volume, size of the fusion pore, or vesicle-vesicle fusion was partially addressed. Membrane fusion events were not evident in the few images shown, but these presumably occurred and could be quantified. Likewise, sites of membrane retrieval could also be marked. These analyses will broaden the scope of the presentation, but also contribute to a more complete story.

      Regarding the presence/absence of membrane fusion events we agree with the reviewer that this should be clearly addressed in the MS. We would like to point out that we

      (i) did not observe any omega shapes at the AZ membrane, which we also mention in the MS. We can also report that we could not see them in data sets from previous publications (Vogl et al., 2015, JCS; Jung et al., 2015, PNAS).

      (ii) To be clear on our observations on potential SV-SV fusion events we now point out in the discussion from line 688ff:

      “We could not observe structures which might hint towards compound or cumulative fusion, neither at the ribbon nor at the AZ membrane under our experimental conditions. Upon short and long stimulation, RA-SVs as well as docked SVs even showed a slightly reduced size compared to controls. However, since some AZs harbored more than one docked SV per AZ in stimulated conditions, we cannot fully exclude the possibility of coordinated release of few SVs upon depolarization.”

      Furthermore, we agree with the reviewer that a complete presentation of endo-exocytosis structural correlates is very important. However, we focused our study on exocytosis events and therefore mainly analyzed membrane proximal SVs at active zones.

      Nonetheless, in response to the reviewer’s comment, we now included a quantification of clathrin-coated (CC) structures. We determined the appearance of CC vesicles (CCVs) and CC invaginations within 0-500 nm of the PD. We measured the diameter of the CCVs and their distance to the membrane and the PD. We found only very few CC structures in our tomograms (now added as a table to the results section; Supplementary file 1). Sites of endocytic membrane retrieval are likely located in the peri-active zone area or even beyond. We did not observe obvious bulk endocytosis events connected to the AZ membrane. However, we do observe large endosomal-like vesicles, which we did not quantify in this study. More details were presented in two of our previous studies (Kroll et al., 2019 and 2020), albeit under different stimulation conditions.

      Overall, the methodology forms the basis for future studies by this group and others to investigate rapid changes in synaptic vesicle distribution at this synapse.

      Reviewer #4 (Public Review):

      This manuscript investigates the process of neurotransmitter release from hair cell synapses using electron microscopy of tissue rapidly frozen after optogenetic stimulation. The primary finding is that in the absence of a stimulus very few vesicles appear docked at the membrane, but upon stimulation vesicles rapidly associate with the membrane. In contrast, the number of vesicles associated with the ribbon and within 50 nm of the membrane remains unchanged. Additionally, the authors find no changes in vesicle size that might be predicted if vesicles fuse to one-another prior to fusing with the membrane. The paper claims that these findings argue for rapid replenishment and against a mechanism of multi-vesicular release, but neither argument is that convincing. Nonetheless, the work is of high quality, the results are intriguing, and will be of interest to the field.

      We thank the reviewer for the appreciation of our work and the constructive comments.

      1) The abstract states that their results "argue against synchronized multiquantal release". While I might agree that the lack of larger structures is suggestive that homotypic fusion may not be common, this is far from an argument against any mechanisms of multi-quantal release. At least one definition of synchronized multiquantal release posits that multiple vesicles are fusing at the same time through some coordinated mechanism. Given that they do not report evidence of fusion itself, I fail to see how these results inform us one way or the other.

      We agree with the reviewer that the discussion on MVR and UVR should be extended. It is important to point out that we do not claim that the evoked release is mediated by one single SV. As discussed in the paper (line 672), we consider that our optogenetic stimulation of IHCs triggers the release of more than 10 SVs per AZ. This is in line with previous reports of several SVs fusing upon stimulation. This type of evoked MVR is probably mediated by the opening of Ca2+ channels in close proximity to each SV Ca2+ sensor. We indeed sometimes observed more than one docked SV per AZ upon long optogenetic stimulation, which could reflect that possibility. However, given the absence of large structures directly at the ribbon or the AZ membrane that could suggest the compound fusion of several SVs prior to or during fusion, we argue against compound MVR at IHCs. As mentioned above, we added a paragraph to the discussion (from line 679 onwards).

      We wrote:

      “This might reflect spontaneous univesicular release (UVR) via a dynamic fusion pore (i.e. ‘kiss and run’; Ceccarelli et al., 1979), which was suggested previously for IHC ribbon synapses (Chapochnikov et al., 2014; Grabner and Moser, 2018; Huang and Moser, 2018; Takago et al., 2019) and/or rapid undocking of vesicles (e.g. Dinkelacker et al., 2000; He et al., 2017; Nagy et al., 2004; Smith et al., 1998). In the UVR framework, stimulation by ensuing Ca2+ influx triggers the statistically independent release of several SVs. Coordinated multivesicular release (MVR) has been indicated to occur at hair cell synapses (Glowatzki and Fuchs, 2002; Goutman and Glowatzki, 2007; Li et al., 2009) and retinal ribbon synapses (Hays et al., 2020; Mehta et al., 2013; Singer et al., 2004) during both spontaneous and evoked release. We could not observe structures which might hint towards compound or cumulative fusion, neither at the ribbon nor at the AZ membrane under our experimental conditions. Upon short and long stimulation, RA-SVs as well as docked SVs even showed a slightly reduced size compared to controls. However, since some AZs harbored more than one docked SV per AZ in stimulated conditions, we cannot fully exclude the possibility of coordinated release of few SVs upon depolarization.”

      2) The complete lack of docked vesicles in the absence of a stimulus followed by their appearance with a stimulus is a fascinating result. However, since there are no docked vesicles prior to a stimulus, it is really unclear what these docked vesicles represent - clearly not the RRP. Are these vesicles that are fusing or recently fused or are they ones preparing to fuse? It is fine that it is unknown, but it complicates their interpretation that the vesicles are "rapidly replenished". How does one replenish a pool of docked vesicles that didn't exist prior to the stimulus?

      In response to the reviewer’s comment, we would like to note that we indeed reported very few docked SVs in wild type IHCs under resting conditions without K+ channel blockers in Chakrabarti et al. EMBO Rep 2018 and in Kroll et al., 2020, JCS. In both studies, a solution without TEA and Cs was used for the experiments (resting solution Chakrabarti: 5 mM KCl, 136.5 mM NaCl, 1 mM MgCl2, 1.3 mM CaCl2, 10 mM HEPES, pH 7.2, 290 mOsmol; control solution Kroll: 5.36 mM KCl, 139.7 mM NaCl, 2 mM CaCl2, 1 mM MgCl2, 0.5 mM MgSO4, 10 mM HEPES, 3.4 mM L-glutamine, and 6.9 mM D-glucose, pH 7.4). Similarly, our current study shows very few docked SVs in the resting condition even in the presence of TEA and Cs. Based on the results presented in ‘Response to reviewers Figure 1’, we assume that the scarcity of docked SVs under control conditions is not due to depolarization induced by a solution containing 20 mM TEA and 1 mM Cs but is rather representative of the physiological resting state of IHC ribbon synapses. Upon 15 min of high-potassium depolarization, the number of docked SVs increased only slightly, as shown in Chakrabarti et al., 2018 and Kroll et al. 2020, and the increase was not statistically significant. In the current study, we report a similar phenomenon, but here depolarization resulted in a more robust increase in the number of docked SVs.

      To compare the data from the previous studies with the current study, we now included an additional table in the discussion (Table 3, line 676) with the total counts (and averages per AZ) of docked SVs.

    1. Author Response

      eLife assessment:

      This study addresses whether the composition of the microbiota influences the intestinal colonization of encapsulated vs unencapsulated Bacteroides thetaiotaomicron, a resident micro-organism of the colon. This is an important question because factors determining the colonization of gut bacteria remain a critical barrier in translating microbiome research into new bacterial cell-based therapies. To answer the question, the authors develop an innovative method to quantify B. theta population bottlenecks during intestinal colonization in the setting of different microbiota. Their main finding that the colonization defect of an acapsular mutant is dependent on the composition of the microbiota is valuable and this observation suggests that interactions between gut bacteria explains why the mutant has a colonization defect. The evidence supporting this claim is currently insufficient. Additionally, some of the analyses and claims are compromised because the authors do not fully explain their data and the number of animals is sometimes very small.

      Thank you for this frank evaluation. Based on the Reviewers’ comments, the points raised have been addressed by improving the writing (apologies for insufficient clarity) and by adding data that to a large extent already existed or could be rapidly generated. In particular, the following data have been added:

      1. Increase to n>=7 for all fecal time-course experiments

      2. Microbiota composition analysis for all mouse lines used

      3. Data elucidating how the SPF microbiome and host immune mechanisms restrict acapsular B. theta

      4. Short- versus long-term recolonization of germ-free mice with a complete SPF microbiota and assessment of the effect on B. theta colonization probability.

      5. Challenge of B. theta monocolonized mice with avirulent Salmonella to disentangle effects of the host inflammatory response from other potential explanations of the observations.

      6. Details of all inocula used

      7. Resequencing of all barcoded strains

      Additionally, we have improved the clarity of the text, particularly the methods section describing mathematical modeling in the main text. Major changes in the text, particularly those responding to the reviewers’ comments, have been highlighted here and in the manuscript.

      Reviewer #1 (Public Review):

      The study addresses an important question - how the composition of the microbiota influences the intestinal colonization of encapsulated vs unencapsulated B. theta, an important commensal organism. To answer the question, the authors develop a refurbished WITS with extended mathematical modeling to quantify B. theta population bottlenecks during intestinal colonization in the setting of different microbiota. Interestingly, they show that the colonization defect of an acapsular mutant is dependent on the composition of the microbiota, suggesting (but not proving) that interactions between gut bacteria, rather than with host immune mechanisms, explains why the mutant has a colonization defect. However, it is fairly difficult to evaluate some of the claims because experimental details are not easy to find and the number of animals is very small. Furthermore, some of the analyses and claims are compromised because the authors do not fully explain their data; for example, leaving out the zero values in Fig. 3 and not integrating the effect of bottlenecks into the resulting model, undermines the claim that the acapsular mutant has a longer in vivo lag phase.

      We thank the reviewer for taking the time to give this detailed critique of our work, and we apologize that the experimental details were insufficiently explained. This criticism is well taken. Exact inoculum details for each experiment are now present in each figure (or as a supplement when multiple inocula are included). Exact microbiome composition analysis for OligoMM12, LCM and SPF microbiota is now included in Figure 2 – Figure supplement 1.

      Of course, the models could be expanded to include more factors, but we think this comment stems rather from our insufficiently clear explanation of the data. There are no “zero values missing” from Fig. 3 – this is visible in the submitted raw data table (excel file Source Data 1), but the points fully overlap in the graph shown and are therefore not easily discernible from one another. Time-points where no CFU were recovered were plotted at the detection limit (50 CFU/g) and are included in the curve-fitting. However, on re-examination we noticed that the curve fit was carried out on the raw data and not the log-normalized data, which resulted in over-weighting of the higher values. Re-fitting these data does not change the conclusions but provides a better fit. These experiments have now been repeated such that we have >=7 animals in each group. These new data are presented in Fig. 3C and D and Fig. 3 Supplement 2.
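
      The over-weighting effect described above can be illustrated with a minimal sketch. The numbers below are hypothetical and are not data from the study; the function names are illustrative, and the only value taken from the response is the 50 CFU/g detection limit. The sketch compares how much each time point contributes to the squared error when a model is off by the same factor everywhere, once in raw CFU and once in log10 CFU.

```python
import numpy as np

DETECTION_LIMIT = 50.0  # CFU/g; the plotting floor quoted in the response


def floor_at_detection_limit(cfu):
    """Replace counts below the detection limit (including zeros) with the limit."""
    return np.maximum(np.asarray(cfu, dtype=float), DETECTION_LIMIT)


def residual_share(observed, predicted, log_space):
    """Fraction of the total squared error contributed by each time point."""
    obs = floor_at_detection_limit(observed)
    pred = np.asarray(predicted, dtype=float)
    if log_space:
        resid = np.log10(obs) - np.log10(pred)
    else:
        resid = obs - pred
    squared = resid ** 2
    return squared / squared.sum()


# Hypothetical fecal time course (CFU/g); the two zeros are floored to 50.
observed = [0, 0, 2e3, 1e5, 4e6, 2e8]
# A hypothetical model that overestimates every point by the same factor of 2.
predicted = 2 * floor_at_detection_limit(observed)

raw_share = residual_share(observed, predicted, log_space=False)
log_share = residual_share(observed, predicted, log_space=True)

print("raw-space share of last point:", round(float(raw_share[-1]), 4))
print("log-space share of last point:", round(float(log_share[-1]), 4))
```

      In raw space nearly all of the squared error comes from the largest value, so a fit is driven almost entirely by the late, high-density time points. In log space each time point contributes equally for a constant fold error, which is why fitting log-transformed CFU is the more natural choice for growth curves spanning several orders of magnitude.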

      Limitations:

      1) The experiments do not allow clear separation of effects derived from the microbiota composition and those that occur secondary to host development without a microbiota or with a different microbiota. Furthermore, the measured bottlenecks are very similar in LCM and Oligo mice, even though these microbiotas differ in complexity. Oligo-MM12 was originally developed and described to confer resistance to Salmonella colonization, suggesting that it should tighten the bottleneck. Overall, an add-back experiment demonstrating that conventionalizing germ-free mice imparts a similar bottleneck to SPF would strengthen the conclusions.

      These are excellent suggestions and have been followed. Additional data are now presented in Figure 2 – figure supplement 8, showing short- versus long-term recolonization of germ-free mice with an SPF microbiota and recovering values of beta very similar to those of our standard SPF mouse colony. These data demonstrate a larger total niche size for B. theta at 2 days post-colonization, which normalizes by 2 weeks post-colonization. Independent of this, the colonization probability is already equivalent to that observed in our SPF colony at day 2 post-colonization. Therefore, the mechanisms causing early clonal loss are very rapidly established on colonization of a germ-free mouse with an SPF microbiota. We have additionally demonstrated that SPF mice do not have detectable intestinal antibody titers specific for acapsular B. theta (Figure 2 – figure supplement 7), such that this is unlikely to be part of the reason why acapsular B. theta struggles to colonize at all in the context of an SPF microbiota. Experiments were also carried out to detect bacteriophage capable of inducing lysis of B. theta and acapsular B. theta from SPF mouse cecal content (Figure 2 – figure supplement 7). No lytic phage plaques were observed. However, plaque assays are not sensitive for detection of weakly lytic phage, or phage that may require expression of surface structures that are not induced in vitro. We can therefore conclude that the restrictive activity of the SPF microbiota a) is reconstituted very fast in germ-free mice, b) is very likely not related to the activity of intestinal IgA and c) cannot be attributed to a high abundance of strongly lytic bacteriophage. The simplest explanation is that a large fraction of the restriction is due to metabolic competition with a complex microbiota, but we cannot formally exclude other factors such as antimicrobial peptides or changes in intestinal physiology.

      2) It is often difficult to evaluate results because important parameters are not always given. Dose is a critical variable in bottleneck experiments, but it is not clear if total dose changes in Figure 2 or just the WITS dose? Total dose as well as n0 should be depicted in all figures.

      We apologize for the lack of clarity in the figures. We have added panels depicting the exact inoculum to each figure (or a supplementary figure where many inocula were used). Additionally, the methods section describing how barcoded CFU were calculated has been rewritten and is hopefully now clearer.

      3) This is in part a methods paper but the method is not described clearly in the results, with important bits only found in a very difficult supplement. Is there a difference between colonization probability (beta) and inoculum size at which tags start to disappear? Can there be some culture-based validation of "colonization probability" as explained in the mathematics? Can the authors contrast the advantages/disadvantages of this system with other methods (e.g. sequencing-based approaches)? It seems like the numerator in the colonization probability equation has a very limited range (from 0.18-1.8), potentially limiting the sensitivity of this approach.

      We apologize for the lack of clarity in the methods. This criticism is well taken, and we have re-written large sections of the methods in the main text to include all relevant detail previously buried in the extensive supplement.

      On the question of the colonization probability and the inoculum size, we kept the inoculum size at 10^7 CFU/mouse in all experiments (except those in Fig. 4, where this is explicitly stated), changing only the fraction of spiked barcoded strains. We verified the accuracy of our barcode recovery rate by serial dilution over 5 logs (new figure added: Figure 1 – figure supplement 1). “The CFU of barcoded strains in the inoculum at which tags start to disappear” is by definition closely related to the colonization probability, as this value (n0) appears in the calculation. Note that this is not the total inoculum size, which is (unless otherwise stated in Fig. 4) kept constant at 10^7 CFU by diluting the barcoded B. theta with untagged B. theta. Again, this is now better explained in all figure legends and the main text.

      We have added an experiment using peak-to-trough ratios (PTR) in metagenomic sequencing to estimate the B. theta growth rate. This could be usefully employed for wildtype B. theta at a relatively early timepoint post-colonization, where growth was rapid. However, this is a metagenomics-based technique that requires the examined strain to be present at an abundance of over 0.1-1% for accurate quantification, such that we could not analyze the acapsular B. theta strain in cecum content at the same timepoint. These data have been added (Figure 3 – figure supplement 3). Note that the information gleaned from these techniques is different. PTR reveals relative growth rates at a specific time (if your strain is abundant enough), whereas neutral tagging reveals average population values over quite large time-windows. We believe that both approaches are valuable. A few sentences comparing the approaches have been added to the discussion.

      The actual numerator is the fraction of lost tags, which is obtained as the total number of tags lost across the experiment (summed over all mice) divided by the total number of tags used (number of mice times the number of tags per mouse). Very low tag recovery (less than one per mouse) starts to stray into very noisy data, while close to zero loss is also associated with a low information-to-noise ratio. Therefore, the size of this numerator is necessarily constrained by our setting up the experiments to have close to optimal information recovery from the WITS abundance. Robustness of these analyses is provided by the high “n” of between 10 and 17 mice per group.
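
      As a rough illustration of how tag loss relates to colonization probability, the sketch below assumes a simple Poisson establishment model: each of the n0 cells carrying a given tag establishes independently with probability beta, so that the chance of losing the tag is exp(-beta * n0) and beta can be estimated as -ln(fraction lost) / n0. This form is an assumption made here for illustration; the manuscript's exact estimator is not reproduced in this response. The function names and the tag counts are hypothetical.

```python
import math


def fraction_lost(tags_lost_per_mouse, tags_per_mouse):
    """Pooled fraction of tags lost: total lost over total used across all mice."""
    total_lost = sum(tags_lost_per_mouse)
    total_used = tags_per_mouse * len(tags_lost_per_mouse)
    return total_lost / total_used


def colonization_probability(f_lost, n0):
    """Estimate beta under the assumed Poisson model: P(tag lost) = exp(-beta * n0)."""
    if not 0.0 < f_lost < 1.0:
        # With no loss or complete loss the estimate is unbounded or undefined.
        raise ValueError("fraction lost must lie strictly between 0 and 1")
    return -math.log(f_lost) / n0


# Hypothetical experiment: 10 mice, 6 tags each, about 30 CFU of each tag inoculated.
lost = [3, 2, 4, 3, 2, 3, 4, 3, 2, 4]
f = fraction_lost(lost, tags_per_mouse=6)
beta = colonization_probability(f, n0=30)

print("fraction of tags lost:", f)
print("estimated beta:", round(beta, 4))
```

      Under this assumed form, the estimate is undefined when no tags or all tags are lost, which matches the point that informative experiments need an intermediate level of tag loss. A value of -ln(fraction lost) between 0.18 and 1.8 would correspond to losing roughly 83% down to 17% of tags.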

      4) Figure 3 and the associated model is confusing and does not support the idea that a longer lag-phase contributes to the fitness defect of acapsular B.theta in competitive colonization. Figure 3B clearly indicates that in competition acapsular B. theta experiences a restrictive bottleneck, i.e., in competition, less of the initial B. theta population is contributed by the acapsular inoculum. There is no need to appeal to lag-phase defects to explain the role of the capsule in vivo. The model in Figure 3D should depict the acapsular population with less cells after the bottleneck. In fact, the data in Figure 3E-F can be explained by the tighter bottleneck experienced by the acapsular mutant resulting in a smaller acapsular founding population. This idea can be seen in the data: the acapsular mutant shedding actually dips in the first 12-hours. This cannot be discerned in Figure 3E because mice with zero shedding were excluded from the analysis, leaving the data (and conclusion) of this experiment to be extrapolated from a single mouse.

      We of course completely agree that this would be a correct conclusion if only the competitive colonization data is taken into account. However, we are also trying to understand the mechanisms at play generating this bottleneck and have investigated a range of hypotheses to explain the results, taking into account all of our data.

      Hypothesis 1) Competition is due to increased killing prior to reaching the cecum and commencing growth: Note that the probability of colonization for single B. theta clones is very similar for OligoMM12 mouse single-colonization by the wildtype and acapsular strains. For this hypothesis to explain the outcompetition of the acapsular strain, the presence of wildtype would have to increase the killing of acapsular B. theta in the stomach or small intestine. The bacteria are at low density at this stage and stomach acid/small intestinal secretions should be similar in all animals. Therefore, this explanation seems highly unlikely.

      Hypothesis 2) Competition between wildtype and acapsular B. theta occurs at the point of niche competition before commencing growth in the cecum (similar to the proposal of the reviewer). It is possible that the wildtype strain has a competitive advantage in colonizing physical niches (for example proximity to bacteria producing colicins). On the basis of the data, we cannot exclude this hypothesis completely and it is challenging to measure directly. However, from our in vivo growth-curve data we observe a similar delay in CFU arrival in the feces for acapsular B. theta on single colonization as in competition, suggesting that the presence of wildtype (i.e., initial niche competition) is not the cause of this delay. Rather, it is an intrinsic property of the acapsular strain in vivo.

      Hypothesis 3) Competition between wildtype and acapsular B. theta is mainly attributable to differences in growth kinetics in the gut lumen. To investigate growth kinetics, we carried out time-courses of fecal collection from OligoMM12 mice single-colonized with wildtype or acapsular B. theta, i.e., in a situation where we observe identical colonization probabilities for the two strains. These data, now shown in Figure 3 C and D and Figure 3 – figure supplement 2, show that, even without competition, the CFU of acapsular B. theta appear later and with a lower net growth rate than the wildtype. As these single-colonizations do not show a measurable difference between the colonization probability for the two strains, it is not likely that the delayed appearance of acapsular B. theta in feces is due to increased killing (this would be clearly visible in the barcode loss for the single-colonizations). Rather, the simplest explanation for this observation is a bona fide lag phase before growth commences in the cecum. Interestingly, using only the lower net growth rate (assumed to be a similar growth rate but increased clearance rate) produces a good fit for our data on both competitive index and colonization probability in competition (Figure 3, figure supplement 5). This is slightly improved by adding in the observed lag-phase (Figure 3). It is very difficult to experimentally manipulate the lag phase in order to directly test how much of an effect this has on our hypothesis, and the contribution is therefore carefully described in the new text.

      Please note that all data were plotted and used in fitting in Fig 3E, but “zero-shedding” is plotted at the detection limit and overlaid, making it look like only one point was present when in fact several were used. This was clear in the submitted raw data tables. To shore up these observations we have repeated all time-courses and now have n>=7 mice per group.

      5) The conclusions from Figure 4 rely on assumptions not well-supported by the data. In the high fat diet experiment, a lower dose of WITS is required to conclude that the diet has no effect. Furthermore, the authors conclude that Salmonella restricts the B. theta population by causing inflammation, but do not demonstrate inflammation at their timepoint or disprove that the Salmonella population could cause the same effect in the absence of inflammation (through non-inflammatory direct or indirect interactions).

      We of course agree that we would expect to see some loss of B. theta in HFD. However, for these experiments the inoculum was ~10^9 CFUs/100 μL dose of untagged strain spiked with approximately 30 CFU of each tagged strain. Decreasing the number of each WITS below 30 CFU leads to very high variation in the starting inocula from mouse to mouse, which massively complicates the analysis. To clarify this point, we have added a detection-limit calculation showing that the neutral tagging technique is not very sensitive to population contractions of less than 10-fold, which is likely in line with what would be expected for high-fat diet feeding in monocolonized mice over a short time-span.
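
      The limited sensitivity to small contractions can be sketched with a back-of-the-envelope calculation. This is not the authors' detection-limit calculation; it assumes, purely for illustration, that each of the roughly 30 cells per tag survives a contraction independently, so that the chance of losing a tag entirely is approximately exp(-n0 * surviving fraction). The function name is hypothetical, and variation in the inoculum is ignored.

```python
import math


def expected_tag_loss(n0, surviving_fraction):
    """Approximate probability that a tag with n0 inoculated cells disappears entirely."""
    return math.exp(-n0 * surviving_fraction)


N0_PER_TAG = 30  # approximate CFU of each tagged strain, as stated in the response

for fold in (1, 2, 10, 100):
    loss = expected_tag_loss(N0_PER_TAG, 1.0 / fold)
    print(f"{fold:>3}-fold contraction: expected fraction of tags lost = {loss:.3f}")
```

      Under these assumptions, a contraction of up to 10-fold removes fewer than 5% of tags, which would be hard to distinguish from no loss with a handful of tags per mouse, whereas a 100-fold contraction removes most of them.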

      This is a very good observation regarding our Salmonella infection data. We have now added the fecal lipocalin 2 values, as well as a group infected with a ssaV/invG double mutant of S. Typhimurium that does not cause clinical grade inflammation (“avirulent”). This shows 1) that the attenuated S. Typhimurium is causing intestinal inflammation in B. theta colonized mice and 2) that a major fraction of the population bottleneck can be attributed to inflammation. Interestingly, we do observe a slight bottleneck in the group infected with avirulent Salmonella which could be attributable either to direct toxicity/competition of Salmonella with B. theta or to mildly increased intestinal inflammation caused by this strain. As we cannot distinguish these effects, this is carefully discussed in the manuscript.

      6) Several of the experiments rely on very few mice/groups.

      We have increased the n to over 5 per group in all experiments (most critically those shown in Fig 3, Supplement 5). See figure legends for specific number of mice per experiment.

      Reviewer #2 (Public Review):

      The goal of this study was to understand population bottlenecks during colonization in the context of different microbial communities. Capsular polysaccharide mutants, diet, and enteric infection were also used paired to short-term monitoring of overall colonization and the levels of specific strains. The major strength of this study is the innovative approach and the significance of the overall research area.

      The first major limitation is the lack of clear and novel insight into the biology of B. theta or other gut bacterial species. The title is provocative, but the experiments as is do not definitively show that the microbiota controls the relative fitness of acapsular and wild-type strains or provide any mechanistic insights into why that would be the case. The data on diet and infection seem preliminary. Furthermore, many of the experiments conflict with prior literature (i.e., lack of fitness difference between acapsular and wild-type strain and lack of impact of diet) but satisfying explanations are not provided for the lack of reproducibility.

      In line with suggestions from Reviewer 1, the paper has undergone quite extensive re-writing to better explain the data presented and its consequences. Additionally, we now explicitly comment on apparent discrepancies between our reported data and the literature – for example the colonization defect of acapsular B. theta is only published for competitive colonizations, where we also observe a fitness defect so there is no actual conflict. Additionally, we have calculated detection limits for the effect of high-fat diet and demonstrate that a 10-fold reduction in the effective population size would not be robustly detected with the neutral tagging technique such that we are probably just underpowered to detect small effects, and we believe it is important to point out the numerical limits of the technique we present here. Additionally for the Figure 4 experiments, we have added data on colonization/competition with an avirulent Salmonella challenge giving some mechanistic data on the role of inflammation in the B. theta bottleneck.

      Another major limitation is the lack of data on the various background gut microbiotas used. eLife is a journal for a broad readership. As such, describing what microbes are in LCM, OligoMM, or SPF groups is important. The authors seem to assume that the gut microbiota will reflect prior studies without measuring it themselves.

      All gnotobiotic lines are bred as gnotobiotic colonies in our isolator facility. This is now better explained in the methods section. Additionally, 16S sequencing of all microbiotas used in the paper has been added as Figure 2 – figure supplement 1.

      I also did not follow the logic of concluding that any differences between SPF and the two other groups are due to microbial diversity, which is presumably just one of many differences. For example, the authors acknowledge that host immunity may be distinct. It is essential to profile the gut microbiota by 16S rRNA amplicon sequencing in all these experiments and to design experiments that more explicitly test the diversity hypotheses vs. alternatives like differences in the membership of each community or other host phenotypes.

      This is an important point. We have carried out a number of experiments to potentially address some issues here.

      1) We carried out B. theta colonization experiments in germ-free mice that had been colonized by gavage of SPF feces either 1 day or 2 weeks prior to colonization. While the shorter pre-colonization allowed B. theta to colonize to a higher population density in the cecum, the colonization probability was already reduced to levels observed in our SPF colony in the short pre-colonization. Therefore, the factors limiting B. theta establishment in the cecum are already established 1-2 days post-colonization with an SPF microbiota (Figure 2 - figure supplement 8). 2) We checked for the presence of secretory IgA capable of binding to the surface of live B. theta, compared to a positive control of a mouse orally vaccinated against B. theta (Fig. 2, Supplement 7), and could find no evidence of specific IgA targeting B. theta in the intestinal lavages of our SPF mouse colony. 3) We isolated bacteriophage from the intestine of SPF mice and used this to infect lawns of B. theta wildtype and acapsular in vitro. We could not detect any plaque-forming phage coming from the intestine of SPF mice (Figure 2 – figure supplement 7).

      We can therefore exclude strongly lytic phage and host IgA as dominant driving mechanisms restricting B. theta colonization. It remains possible that rapidly upregulated host factors such as antimicrobial peptide secretion could play a role, but metabolic competition from the microbiota is also a very strong candidate hypothesis. The text regarding these experiments has been slightly rewritten to point out that colonization probability inversely correlates with microbiota complexity, and the mechanisms involved may involve both direct microbe-microbe interactions as well as host factors.

      Given the prior work on the importance of capsule for phage, I was surprised that no efforts are taken to monitor phage levels in these experiments. Could B. theta phage be present in SPF mice, explaining the results? Alternatively, is the mucus layer distinct? Both could be readily monitored using established molecular/imaging methods.

      See above: no plaque-forming phage could be recovered from the SPF mouse cecum content. The main replicative site that we have studied here, in mice, is the cecum which does not have true mucus layers in the same way as the distal colon and is upstream of the colon so is unlikely to be affected by colon geography. Rather mucus is well mixed with the cecum content and may behave as a dispersed nutrient source. There is for sure a higher availability of mucus in the gnotobiotic mice due to less competition for mucus degradation by other strains. However, this would be challenging to directly link to the B. theta colonization phenotype as Muc2-deficient mice develop intestinal inflammation.

      The conclusion that the acapsular strain loses out due to a difference of lag phase seems highly speculative. More work would be needed to ensure that there is no difference in the initial bottleneck; for example, by monitoring the level of this strain in the proximal gut immediately after oral gavage.

      This is an excellent suggestion and has been carried out. At 8h post-colonization with a high inoculum (allowing easy detection) there were identical low levels of B. theta in the upper and lower small intestine, but more B. theta wildtype than B. theta acapsular in the cecum and colon, consistent with commencement of growth for B. theta wildtype but not the acapsular strain at this timepoint. We have additionally repeated the single-colonization time-courses using our standard inoculum and can clearly see the delayed detection of acapsular B. theta in feces even in the single-colonization state when no increased bottleneck is observed. This can only be reasonably explained by a bona fide lag-phase extension for acapsular B. theta in vivo. These data also reveal a decreased net growth rate of acapsular B. theta. Interestingly, our model can be quite well-fitted to the data obtained both for competitive index and for colonization probability using only the difference in net growth rate. Adding the (clearly observed) extended lag-phase generates a model that is still consistent with our observations.

      Another major limitation of this paper is the reliance on short timepoints (2-3 days post colonization). Data for B. theta levels over 2 weeks or longer is essential to put these values in context. For example, I was surprised that B. theta could invade the gut microbiota of SPF mice at all and wonder if the early time points reflect transient colonization.

      It should be noted that “SPF” defines microbiota only on missing pathogens and not on absolute composition. Therefore, the rather efficient B. theta colonization in our SPF colony is likely due to a permissive composition, and this is unlikely to be reproducible between different SPF colonies (a major confounder in the reproducibility of mouse experiments between institutions; in contrast, the gnotobiotic colonies are highly reproducible). We do consistently see colonization of our SPF colony by wildtype B. theta out to at least 10 days post-inoculation (latest time-point tested) at similar loads to the ones observed in this work, indicating that this is not just transient “flow-through” colonization. Data included below:

      For this paper we were very specifically quantifying the early stages of colonization, also because the longer we run the experiments for, the more confounding features of our “neutrality” assumptions appear (e.g., host immunity selecting for evolved/phase-varied clones, within-host evolution of individual clones etc.). For this reason, we have used timepoints of a maximum of 2-3 days.

      Finally, the number of mice/group is very low, especially given the novelty of these types of studies and uncertainty about reproducibility. Key experiments should be replicated at least once, ideally with more than n=3/group.

      For all barcode quantification experiments we have between 10 and 17 mice per group. Experiments for the in vivo time-courses of colonization have been expanded to an “n” of at least 7 per group.

    1. Author Response

      Reviewer #1 (Public Review):

      This is a well-executed study using cutting-edge proteomics analysis to characterize muscle tissue from a genetically diverse mouse population. The use of only females in the study is a serious limitation that the authors acknowledge. The statistical methods, including protein quantification, QTL mapping, and trait correlation analysis are appropriate and include corrections for multiple testing. One concern is that missense variants, if they occur in peptides used to quantify proteins, could lead to false-positive signatures of low abundance (see lines 123-127). The experimental validation and deep dive into UFMylation provide some confidence in the reliability of other associations that can be mined from these data. The authors have provided a web-based tool for exploring the data.

      We thank the reviewer for these very positive comments and for reviewing the manuscript.

      We agree the quantification of peptides containing missense variants could confound quantification at the protein level. This is an important consideration when there are only a few peptides identified for a specific protein. However, in our data the average number of peptides used to quantify the 14 proteins containing missense-associated pQTLs was ~68 peptides/protein (lowest was 5 peptides for FGB and highest 703 peptides for NEB).

      In the case of EPHX1, we quantified 15 peptides (Figure R1A). We identified a peptide adjacent to R338 spanning amino acids 339-347. As such, mutation of R338C would prevent trypsin from cleavage resulting in the missense peptide not being identified and may lead to false-positive signatures of low abundance as suggested by the reviewer. To investigate this, we re-quantified EPHX1 relative protein abundance with or without the peptide spanning 339-347 for each genotype (Figure R1B). This made little difference to protein quantification and EPHX1 abundance was still significantly lower following mutation of R338C (AA genotype). In fact, quantification at the peptide-level revealed 12 out of the remaining 14 peptides were also significantly lower in AA genotype (data not shown).

      Although we agree this is a very important consideration, we are mindful of the length of the article and feel including these data would not significantly improve the manuscript. We therefore request to not include these data as it would detract from the main findings of the paper focused on phenotypic associations and validation of UFMylation as a regulator of muscle function.

      Figure R1. (A) Identified peptides from EPHX1 mapped onto primary amino acid sequence highlighting the missense mutation induced by SNP rs32746574 that was associated to EPHX1 protein levels by pQTL analysis. (B) Relative quantification of EPHX1 between the two genotypes of SNP rs32746574 with and without the peptide neighboring the missense mutation (amino acids 339-347) (**p<0.001, Student's t-test)

    1. Author Response

      Reviewer #1 (Public Review):

      Building upon the previous evidence of activation of auditory cortex VIP interneurons in response to non-classical stimuli like reward and punishment, Szadai et al., extended the investigation to multiple cortical regions. Use of three-dimensional acousto-optical two-photon microscopy along with the 3D chessboard scanning method allowed high-speed signal acquisition from numerous VIP interneurons in a large brain volume. Additionally, activity of VIP interneurons in deep cortical regions was obtained using fiber photometry. With the help of these two imaging methods authors were able to extract and analyze the VIP cell signal from different cortical regions. Study of VIP interneuron activity during an auditory go-no-go task revealed that more than half of recorded cortical VIP interneurons were responding to both reward and punishment with high reliability. Fiber photometry data revealed similar observations; however, the temporal dynamics of reinforcement stimuli-related response in mPFC was slower than in the auditory cortex. The authors performed detailed analysis of individual cell activity dynamics, which revealed five categories of VIP cells based on their temporal profiles. Further, animals with higher performance on the discrimination task showed stronger VIP responses to 'go trials' possibly suggesting the role of VIP interneurons in discrimination learning. Authors found that reinforcement related response of VIP interneurons in visual cortex was not correlated with their sensory tuning, unveiling an interesting idea that VIP interneurons take part in both local as well as global processing. These observations bring attention to the possible involvement of VIP interneurons in reinforcement stimuli-associated global signaling that would regulate local connectivity and information processing leading to learning.

      The state-of-the-art imaging technique allowed authors to succeed in imaging VIP interneurons from several cortical regions. Advanced analyses revealed the nuances, similarities and differences in the VIP activity trend in various regions. The conclusions about reinforcement stimuli related activity of VIP interneurons made by the authors are well supported by the results obtained, however some claims and interpretations require more attention and clarification.

      We thank Reviewer #1 for the positive general comments.

      Reviewer #2 (Public Review):

      In recent years the activity of cortical VIP+ interneurons in relation to learning and sensory processing has raised great interest and has been intensely investigated. The ability of VIP+ interneurons in the auditory cortex to respond to both reward and punishment was already reported a few years ago by some of the authors (Pi et al., 2013, Nature). However, this work importantly adds to their previous study demonstrating a largely similar and synchronous response of a large fraction of these interneurons across the neocortex to salient stimuli of different valence during the performance of an auditory discrimination task.

      An additional strength of this study is the analysis and identification of the general pattern of VIP+ interneuron responses associated to specific behaviors in the different layers of the neocortex depth.

      Interestingly, the authors also identified using cluster analysis 5 different classes of VIP+ interneurons, based on the dynamic of their responses, that were unequally distributed in distinct cortical areas.

      This is a well performed study that took advantage of a cutting-edge imaging approach with high recording speed and good signal-to-noise ratio. Experiments are well performed and the data are properly analyzed and nicely illustrated. However, one shortcoming of this paper, in my opinion, is the "case report" structure of the data. Essentially for each neocortical area the activity of VIP+ interneurons was analyzed only in one animal. This limits the assessment of the stability of the response/recruitment of these interneurons. I appreciate the high number of recorded VIP+ interneurons per area/animal and I do understand that it would be excessively laborious to perform 3D random-access two-photon microscopy in several mice for each cortical area. On the other hand, it would be important to have some knowledge of the general variability of the responses of these neurons among animals.

      In conclusion, despite the findings described in this manuscript being generally sound, additional experiments are recommended to further substantiate the conclusions.

      Thank you for pointing out this potential misunderstanding. Although we mentioned the number of animals the recordings were obtained from (n=22 total), we now repeat this information in several places to avoid confusion. The data recorded with the 2-photon microscope are from 16 animals, and fiber photometry was performed on a separate 6 animals. Each animal was recorded in one (14 mice) or two areas (8 mice, 2 AOD, 6 photometry). We aimed to acquire data from at least 3 recordings per area (4 in the primary somatosensory cortex, 6 in the primary and secondary motor cortices, 4 in the lateral and medial parietal cortices, 3 in the primary visual cortices, 6 in the auditory and medial prefrontal cortices). In the revised manuscript this information can be found at the beginning of the results section and in the figure legends:

      “To probe the behavioral function of VIP interneurons, we trained head-fixed mice (n=22 in total, n=16 for 2-photon microscopy and n=6 for fiber photometry) on a simple auditory discrimination task (Figure 1A).”

      “Among the 811 neurons imaged in 18 imaging sessions from 16 mice,”

      “Ca2+ responses of individual VIP interneurons recorded separately from 18 different cortical regions from 16 mice using fast 3D AO imaging were averaged for Hit (thick green), FA (thick red), Miss (dark blue), and CR (light blue). Fiber photometry data were recorded simultaneously from mPFC and ACx regions and are shown in gray boxes. Functional map (Kirkcaldie, 2012) used with the permission of the author. Speaker symbols represent the average time of tone onset, and gray triangles mark the reinforcement onset for Hit and FA. Averages of Miss and CR trials were aligned according to the expected reinforcement delivery calculated on the basis of the average reaction time. mPFC: medial prefrontal cortex (n=6 mice), ACx: auditory cortex (n=6), S1Hl/S1Tr/S1Bf/S1Sh: primary somatosensory cortex, hindlimb/trunk/barrel field/shoulder region (n=4), M1/M2: primary/secondary motor cortex (n=6), Mpta/Lpta: medial/lateral parietal cortex (n=4), V1: primary visual cortex (n=3).”

      “This approach allowed us to simultaneously measure bulk calcium-dependent signals from VIP interneurons located in the right medial prefrontal (mPFC) and left auditory cortices (ACx) by implanting two 400 µm optical fibers at these locations (n=6 sessions from n=6 mice, Figure 1–figure supplement 1C).”

      “Raster plot of the trial-to-trial activation of the responsive VIP neurons in Hit and FA trials during the two-photon imaging sessions (n=18 sessions, n=16 mice, n=746 cells).”

      Subregional labels, for example on Figure 2, should be considered as additional information to orient the readers, even if they were very precisely defined on the basis of the coordinates. All analyses considering regional differences were conducted on the level of the main functional areas of the dorsal cortex (motor, somatosensory, parietal, and visual). Despite some location-dependent heterogeneity in the late response phase (Figures 2G and H), even these main dorsal cortical regions were all similar from the perspective of responsiveness to reinforcers and auditory cues.

      Reviewer #3 (Public Review):

      In this study Szadai et al. show reliable, relatively synchronous activation of VIP neurons across different areas of dorsal cortex in response to reward and punishment of mice performing an auditory discrimination task. The authors use both a relatively fast 2 photon imaging, as well as fiber photometry for some deeper areas. They cluster neurons according to their temporal response profiles and show that these profiles differ across areas and cortical depths. Task performance, running behavior and arousal are all related to VIP response magnitude, as has been previously shown.

      Methodologically, this paper is strong: the described imaging technique allows for fairly fast sampling rates, they sample VIP cells from many different areas and the analyses are sophisticated and touch on the most relevant points. The figures are of high quality.

      However, as the manuscript is now, the presentation could be clearer, the methods more complete and it is not clear whether their conclusions are entirely supported by the data.

      The main issue is that reinforcement and arousal are hard to distinguish in this study. It is well known that VIP activity is correlated with arousal. And it is fairly clear that the reinforcement they use in this study - air puffs to the eye, as well as water rewards - cause arousal. It is possible that the reinforcer responses they observe in VIP neurons throughout all areas merely reflect the increases in arousal caused by these behaviorally salient events. They do discuss this caveat (albeit not fully convincingly) and in their abstract even state that the arousal state was not predictive of reinforcer responses. However their data clearly shows the tight relationship of the VIP reinforcer responses to both arousal (as measured by pupil diameter), as well as running speed of the animal. Both of these variables are well known to be tightly coupled to VIP activity.

      Although barely mentioned, the authors do appear to sometimes present uncued reward (Figure S2F). If responses were noticeably different from the same events in the task context (as actual reinforcers) this could at least hint towards the reinforcement signal being distinct from mere arousal. However, this data is only mentioned in one supplementary figure in a different context (comparison with PV cells) and neither directly compared to cued reward, nor is this discussed at all. Were uncued air puffs also presented? How do the responses compare to cued air puffs/punishment?

      Our original approach to distinguish between reinforcement- and arousal-related responses aimed:

      1) to show that VIP cells with both low and high correlation coefficients with arousal produce large signals upon reinforcement presentation (Figure 3B),

      2) to show that the large differences between low and high arousal changes were reflected only to a limited extent in the VIP activity (Figures 3C and D): as highlighted in Figure R1, where we also added bars to show ∆P/P in high and low pupil change conditions, the difference in ∆P/P is ~5-fold, while it is only ~1.5-fold for ∆F/F. This disproportionality suggests that a large part of the signal below the dashed blue line is independent of arousal. We have added these modifications to the new version of Figure 3 for clarity.

      Figure R1 = Figure 3C-D with modification. Comparison of pupil changes and corresponding calcium averages.

      We collected further evidence to support our claims. In Figure 3–figure supplement 2 we depicted Hit and FA trials in which the reinforcement didn’t elevate the arousal level any further. Many of these trials were associated with locomotion prior to the reinforcement, but it was also common that the animals remained still during the whole trial. Trials with increased locomotion upon reinforcement presentation were excluded. Reinforcement-related calcium signals were still present under these conditions, indicating that these signals are not simple reflections of arousal. Moreover, we estimate the distinct contributions of arousal, locomotion, and reinforcers in Figure 3–figure supplement 2D in a systematic way with a generalized linear model. This model also confirmed our view about the reinforcement-related coding.

      We now say in the results:

      “Finally, to assess the motor- and reinforcement-related contributions to VIP interneuronal activity, we built a generalized linear model using the behavior and imaging data of the SS and Mtr recordings (Figure 3–figure supplement 2D, n=3 mice). This model was able to explain 18.8 ± 11.1% of the variance of the VIP population calcium signal, and highlighted that arousal was the best predictor, followed by reward, punishment, locomotion velocity, and auditory cue (weights = 0.055, 0.031, 0.028, 0.020, 0.018 respectively; all predictors, except the auditory cue in the case of one animal, contributed significantly, p<0.001). These observations indicate that running and arousal changes alone cannot fully explain the recruitment of VIP interneurons by reinforcers.”
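      To make the kind of analysis described in this passage concrete, the following is a minimal sketch on synthetic data only. It is not the model fitted in the manuscript: it assumes an identity link with Gaussian noise (ordinary least squares, the simplest special case of a generalized linear model) and standardized predictors. The predictor names, the hypothetical weights and the noise level are all illustrative assumptions.

```python
import numpy as np

# Synthetic, standardized predictors; names mirror the text, values are made up.
rng = np.random.default_rng(0)
names = ["arousal", "reward", "punishment", "velocity", "cue"]
n_samples = 2000
X = rng.standard_normal((n_samples, len(names)))

# Hypothetical ground-truth weights, chosen only for this illustration.
true_w = np.array([0.5, 0.3, 0.3, 0.2, 0.1])
y = X @ true_w + rng.standard_normal(n_samples)  # unit-variance noise


def fit_linear_model(X: np.ndarray, y: np.ndarray):
    """Least-squares fit with an intercept.

    Returns the per-predictor weights and the fraction of variance explained.
    """
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    explained = 1.0 - residual.var() / y.var()
    return coef[1:], explained


weights, explained = fit_linear_model(X, y)
for name, w in zip(names, weights):
    print(f"{name:>10}: weight = {w:.3f}")
print(f"explained variance = {explained:.1%}")
```

      Because the predictors are standardized, the fitted weights can be ranked directly, which is the sense in which one predictor can be called the best. The explained variance stays well below 100% whenever the noise term is large relative to the modeled contributions.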

      We apologize for not describing the rationale and the results of the uncued reward experiments. Briefly, while recording reinforcement-related signals in auditory cortex in our task, we realized that the cue delivery, and the resulting purely sensory response, could alter the measurement of the reward-related responses. Hence, in order to disentangle the reward- and sensory-related responses, we presented the animals with simple, uncued reward and observed a similar and robust recruitment of VIP interneurons. Based on the same rationale, we made similar measurements for PV neurons.

      We now say in the results:

      “We did not further analyze the FA responses in auditory cortex as those responses also had a sensory component linked to the white noise-like sound created by the air puff delivery. Because the cue delivery could prove as a confound to measure reward-mediated responses from VIP interneurons in auditory cortex (see also methods), we delivered random reward in separate sessions. Water droplets delivery recruited VIP interneurons in both auditory and medial prefrontal cortex in a similar fashion as water delivery during the discrimination task (Figure 2–figure supplement 1G). Like our single cell results, PV-expressing neuronal population in ACx did not show any significant change in activity upon similar random reward delivery (Figure 2–figure supplement 1G).”

      Regarding the difference between cued and uncued responses, we definitely agree with the reviewer that it is an important point. The goal of this manuscript is however to study how reward and punishment are being represented by VIP interneurons in cortex.

      The imaging method appears well suited for their task, however the improvements listed in table S1 make the method appear far superior to existing methods in many aspects. Published or preprinted papers with 2 photon imaging of VIP populations (eg. from Scanziani lab (Keller et al.), Carandini lab (Dipoppa et al.), deVries lab (Millman et al.), Adesnik lab (Mossing et al.), which use the much more common resonant scanning, seem to be able to image 4-7 layers at 4-8Hz with a good enough SNR and potentially bigger neuronal yield of approximately 100-200 VIP cells, depending on the field of view. While not every single cell in a volume would be captured by these studies, the only main advantage of the here-used technique appears to be the superior temporal resolution.

      We thank the reviewer for the positive comment and we agree that interpretation must be improved. We agree that the imaging methods in the papers listed above have good SNR and were proper to address the scientific questions that had arisen. As the reviewer points out, 3D-AOD imaging allows fast 3D measurement that cannot be achieved otherwise. We used these advantages to address the critical question of layer specificity in the response of VIP interneurons to reinforcer presentation (Figure 2–figure supplement 1F, but see also the new Figure 1–figure supplement 1B). Regarding the comparison and quantification of the factual advantages of AOD microscopy over other imaging methods, the reviewer and readers can refer to the methods section (3D AO microscopy), Table S1 and Szalay et al., 2016. We agree with the reviewer that one of the main advantages is the superior temporal resolution. The second main advantage is the improved SNR. This originates from the fact that the entire measurement time is spent on regions of interest; measurement of unnecessary background areas is not required. More specifically, SNR is improved even in the case of 2D imaging by the factor of:

      ((area of the entire frame )/(area of the recorded VIP cells))^0.5

      which is about (100)^0.5 = 10, as VIP interneurons represent about 1% of the brain. We used this second advantage of AO scanning when we determined the activation ratio (e.g., see Figure 2D).
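      The arithmetic behind this factor can be written out as a short sketch. It assumes shot-noise-limited detection, so that SNR scales with the square root of the time (and hence photons) spent on each region of interest; the function name and the example areas are illustrative and not taken from the manuscript.

```python
import math


def roi_snr_gain(frame_area: float, roi_area: float) -> float:
    """SNR gain from scanning only the regions of interest.

    Assumption: SNR grows with the square root of dwell time, and the
    dwell time per ROI grows by frame_area / roi_area when background
    regions are skipped.
    """
    if frame_area <= 0 or roi_area <= 0:
        raise ValueError("areas must be positive")
    if roi_area > frame_area:
        raise ValueError("ROI area cannot exceed the frame area")
    return math.sqrt(frame_area / roi_area)


# ROIs covering 1% of the frame, in arbitrary area units.
gain = roi_snr_gain(frame_area=100.0, roi_area=1.0)
print(f"SNR gain: {gain:.1f}x")
```

      With regions of interest covering 1% of the frame, the area ratio is 100 and the gain is its square root.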

      As the resolution of single or a few action potentials is challenging in behaving mice labelled with the GCaMP6 sensor, any improvement in SNR will improve the detection threshold. The higher SNR achieved here improved the detection threshold, which also explains the relatively high activation ratio in our work.

      In the case of asynchronous activity patterns, there is negligible contribution of individual small neuropil structures to somatic activities because of the relatively high volume-ratio of a soma and a given small neuropil structure: this minimizes the error during ∆F/F calculation of somatic responses. However, reinforcement, arousal, and running can generate highly synchronous neuronal activities which can synchronize neuropil activity around a given soma and, therefore, effectively and systematically modulate the somatic ∆F/F responses. To avoid this error, we used a high-NA objective with proper neuropil resolution and combined it with motion correction. The use of the high NA also decreased the total scanning volume to about 689 µm × 639 µm × 580 µm and, therefore, it limited the maximum number of VIP cells which could be recorded. It is also possible to use a low-NA objective with a much higher FOV and scanning volume and record over 1000 VIP cells, but the extension of the PSF along the z dimension is inversely and quadratically proportional to the NA of the objective, therefore neuropil resolution will be at least partially lost. In summary, using the high-NA Olympus objective we maximized the 2P resolution which, in combination with off-line motion artifact elimination, allowed precise recording of somatic signals without any neuropil contamination: this provided correct activation ratio values.
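      The inverse-quadratic dependence of the axial PSF extent on NA can be illustrated with a minimal sketch. It uses only the proportionality stated above (axial extent ∝ 1/NA²) and ignores wavelength, refractive index and any prefactor; the function name and the NA values are hypothetical and do not describe the objectives used in the study.

```python
def relative_axial_psf(na_reference: float, na_other: float) -> float:
    """Axial PSF extent of `na_other` relative to `na_reference`.

    Uses only the scaling axial extent ∝ 1 / NA**2; all other optical
    parameters are assumed equal between the two objectives.
    """
    if na_reference <= 0 or na_other <= 0:
        raise ValueError("numerical apertures must be positive")
    return (na_reference / na_other) ** 2


# Hypothetical example: halving the NA stretches the PSF along z four-fold.
elongation = relative_axial_psf(na_reference=1.0, na_other=0.5)
print(f"axial PSF elongation: {elongation:.1f}x")
```

      Under this scaling, a modest reduction in NA already produces a disproportionate loss of axial resolution, which is why a larger field of view comes at the cost of neuropil separation.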

      Even though this is not mentioned at all, it certainly appears possible that the acousto-optical scanning emits audible noise. In this case it would be good to know the frequency range and level of this background noise, whether there are auditory responses to the scanning itself and if it interferes with the performance of the animals in the auditory task in any way. If this is not the case, this should probably simply be mentioned for non-experts.

      While the name of the acousto-optical deflectors seems to refer to “acoustic noise”, these devices are driven in the range of 55-120 MHz, which is 3 orders of magnitude higher frequency than the hearing threshold of animals: mice don’t hear them. Moreover, we developed water-cooled AODs ten years ago which means that ventilators are also not required, therefore AOD-based scanning can be used with zero noise emission. In contrast, galvo, resonant, and piezo scanning work in the kHz frequency range, which is in the middle of the hearing range of mice. Moreover, these technologies can’t be used in a vacuum and the scanner is just a few tens of centimeters away from the mice, which means that acoustic noise can’t be canceled but can only be partially suppressed with white noise. We thank the reviewer for the helpful comment and have added one sentence about the absence of acoustic noise during acousto-optical scanning:

      “The deflectors are driven in the 55-120 MHz frequency range, therefore the noise emitted does not interfere with the auditory cues, as mice can’t hear it. This, in combination with the water cooling of the deflectors, makes the AOD-based scanning the quietest technology for in-vivo imaging.”

      The authors show a strong correlation between task performance (hit rate) and the response to the auditory cue on hit trials. Were there any other significant correlations of VIP cells' responses to other trial types? Was reinforcer response correlated to behavioral variables at all?

      We have not found any remarkable correlations between VIP cell activity and behavioral variables except the one mentioned above.

      For example, we tested discrimination rate (hit rate/FA rate) correlation with ∆F/Ftone in Hit trials, but this was not significant (R² = 0.03, F = 0.49, p = 0.69), just like Hit rate vs. ∆F/Ftone in FA trials (R² = 0.19, F = 3.8, p = 0.07), and discrimination rate vs. ∆F/Ftone in FA trials (R² = 0.07, F = 1.1, p = 0.31).

    1. Author Response

      Reviewer #1 (Public Review):

      This study used GWAS and RNAseq data of TCGA to show a link between telomere length and lung cancer. Authors identified novel susceptibility loci that are associated with lung adenocarcinoma risk. They showed that longer telomeres were associated with being a female nonsmoker and early-stage cancer with a signature of cell proliferation, genome stability, and telomerase activity.

      Major comments:

      1) It is not clear how are the signatures captured by PC2 specific for lung adenocarcinoma compared to other lung subtypes. In other words, why is the association between long telomeres specific to lung adenocarcinoma?

      We thank the reviewer for raising this point (similarly mentioned by reviewer #2). Indeed, it is unclear why genetically predicted LTL appears more relevant to lung adenocarcinoma. We used a LASSO approach to select important features of PC2 in lung adenocarcinoma and inferred PC2 in lung squamous cell carcinoma tumours to better explore the differences between histological subtypes. The new results are presented in Figure 5 and described in the methods and results sections. In addition, we have expanded upon this point in the discussion with the following paragraph (page 11, lines 229-248):

      ‘An explanation for why long LTL was associated with increased risk of lung cancer might be that individuals with longer telomeres have lower rates of telomere attrition compared to individuals with shorter telomeres. Given a very large population of histologically normal cells, even a very small difference in telomere attrition would change the probability that a given cell is able to escape the telomere-mediated cell death pathways (24). Such inter-individual differences could suffice to explain the modest lung cancer risk observed in our MR analyses. However, it is not clear why longer TL would be more relevant to lung adenocarcinoma compared to other lung cancer subtypes. A suggestion may come from our observation that longer LTL is related to genomically stable lung tumours (such as lung adenocarcinomas in never smokers and tumours with lower proliferation rates) but not genomically unstable lung tumours (such as heavy-smoking-related, highly proliferating lung squamous carcinomas). One possible hypothesis is that histologically normal cells exposed to highly genotoxic compounds, such as tobacco smoke, might require an intrinsic activation of telomere length maintenance at early steps of carcinogenesis that would allow them to survive, and therefore genetic differences in telomere length are less relevant in these cells. By contrast, in more genomically stable lung tumours, where the TL attrition rate is more modest, the hypothesis related to differences in TL may be more relevant, potentially explaining the heterogeneity in genetic effects between lung tumours (Figure 2). Alternatively, we also note that the cell of origin may differ: lung adenocarcinoma is postulated to be mostly derived from alveolar type 2 cells, whereas squamous cell carcinoma is derived from bronchiolar epithelial cells (19), possibly suggesting that LTL might be more relevant to the former.’

      2) The manuscript is lacking specific comparisons of gene expression changes across lung cancer subtypes for identified genes such as telomerase etc since all the data is presented as associations embedded within PCs.

      The genes associated with telomere maintenance, such as TERT and TERC, are expressed at very low levels in these tumours (Barthel et al NG 2017). In this context, no sample has more than 5 normalised read counts by RNA-sequencing for TERT within TCGA lung cohorts (TCGA-LUSC, TCGA-LUAD). As such, we have not explored the difference by individual telomere-related genes. Nevertheless, we have explored an inferred telomerase activity gene signature, developed by Barthel et al, in the context of lung adenocarcinoma tumours. We have added a note in the results section to inform the reader why we did not directly test TERT/TERC expression (page 9, lines 184-187).

      3) It is not clear how novel are the findings given that most of these observations have been made previously i.e. the genetic component of the association between telomere length and cancer.

      Others, including ourselves, have studied TL and lung cancer. We have built on that work using the most up-to-date TL genetic instrument and the largest lung cancer study available. In addition, we provided insights into the possible mechanisms by which telomere length might affect lung adenocarcinoma development. Using colocalisation analyses, we reported novel shared genetic loci between telomere length and lung adenocarcinoma (MPHOSPH6, PRPF6, and POLI), genes/loci that have not previously been linked to lung adenocarcinoma susceptibility. For the MPHOSPH6 locus, we showed that the risk allele of rs2303262 (a missense variant annotated for the MPHOSPH6 gene) colocalized with increased lung adenocarcinoma risk, lower lung function (FEV1 and FVC), and increased MPHOSPH6 gene expression in lung, as highlighted in the discussion section of the revised manuscript.

      In addition, we have used a PRS analysis to identify a gene expression component associated with genetically predicted telomere length in lung adenocarcinoma but not in the squamous cell carcinoma subtype. The aspects of this gene expression component associated with longer telomere length are also associated with molecular characteristics related to genome stability (lower accumulation of DNA damage, copy number alterations, and lower proliferation rates), being female, early-stage tumours, and never smokers, which is an interesting but not completely understood lung cancer stratum. As far as we are aware, this is the first time an association has been reported between a PRS related to an etiological factor, such as telomere length, and a particular expression component in the tumour.

      We have adjusted the discussion section of the revised manuscript to further highlight the novel aspects.

      Reviewer #2 (Public Review):

      The manuscript of Penha et al performs genetic correlation, Mendelian randomization (MR), and colocalization studies to determine the role of genetically determined leukocyte telomere length (LTL) and susceptibility to lung cancer. They develop an instrument from the most recent published association of LTL (Codd et al), which here is based on n=144 genetic variants, and the largest association study of lung cancer (including ~29K cases and ~56K controls). They observed no significant genetic correlation between LTL and lung cancer, in MR they observed a strong association that persisted after accounting for smoking status. They performed colocalization to identify a subset of loci where LTL and lung cancer risk coincided, mainly around TERT but also other loci. They also utilized RNA-Seq data from TCGA lung cancer adenocarcinoma, noting that a particular gene expression profile (identified by a PC analysis) seemed to correlate with LTL. This expression component was associated with some additional patient characteristics, genome stability, and telomerase activity.

      In general, most of the MR analysis was performed reasonably (with some suggestions and comments below), it seems that most of this has been performed, and the major observations were made in previous work. That said, the instrument is better powered and some sub-analyses are performed, so adds further robustness to this observation. While perhaps beyond the scope here, the mechanism of why longer LTL is associated with (lung) cancer seems like one of the key observations and mechanistically interesting but nothing is added to the discussion on this point to clarify or refute previous speculations listed in the discussion mentioned here (or in other work they cite).

      Some broad comments:

      1) The observations that lung adenocarcinoma carries the lion's share of risk from LTL (relative to other cancer subtypes) could be interesting but is not particularly highlighted. This could potentially be explored or discussed in more detail. Are there specific aspects of the biology of the substrata that could explain this (or lead to testable hypotheses?)

      We thank the reviewer for these comments. A similar point was raised by reviewer #1. Please see our response above, as well as the additional analysis described in Figure 5 that considers the differences by histological subtype.

      2) Given that LTL is genetically correlated (and MR evidence suggests also possibly causal evidence in some cases) across a range of traits (e.g., adiposity) that may also associate with lung cancer, a larger genetic correlation analysis might be in order, followed by a larger set of multivariable MR (MVMR) beyond smoking as a risk factor. Basically, can the observed relationship be explained by another trait (beyond smoking)? For example, there is previous MR literature on adiposity measures, for example (BMI, WHR, or WHRadjBMI) and telomere length, plus literature on adiposity with lung cancer; furthermore, smoking with BMI. A bit more comprehensive set of MVMR analyses within this space would elevate the significance and interpretation compared to previous literature.

      Indeed, there are important effects related to BMI and lung cancer (Zhou et al., 2021. Doi:10.1002/ijc.33292; Mariosa et al., 2022. Doi: 10.1093/jnci/djac061). We have tested the potential for influence on our finding using MVMR, modelling LTL and BMI using a BMI genetic instrument of 755 SNPs obtained from UKBB (feature code: ukb-b-19953). This multivariable approach did not result in any meaningful changes in the associations between LTL and lung cancer risk.

      3) In the initial LTL paper, the authors constructed an IV for MR analyses, which appears different than what the authors selected here. For example, Codd et al. proposed an n=130 SNP instrument from their n=193 sentinel variants, after filtering for LD (n=193 >>> n=147) and then for multi-trait association (n=147 >> n=130). I don't think this will fundamentally change the author's result, but the authors may want to confirm robustness to slightly different instrument selection procedures or explain why they favor their approach over the previous one.

      We appreciate the reviewer’s suggestion. Our study is designed for a Mendelian Randomization framework, and we chose to be conservative in the construction of our instrumental variable (IV). We therefore applied more stringent filters to the LTL variants relative to Codd et al’s approach. We applied a wider LD window (10MB vs. 1MB) centered around the LTL variants that were significant at genome-wide level (p<5e-08) and we restricted our analyses to biallelic common SNPs (MAF>1% and r2<0.01 in European population from 1000 genomes). Nevertheless, the LTL genetic instrument based on our study (144 LTL variants) is highly correlated with the PRS based on the 130 variants described by Codd et al. (correlation estimate=0.78, p<2.2e-16). The MR analyses based on the 130 LTL instrument described by Codd et al showed similar results to our study.
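
      The filtering steps described above can be sketched as a greedy selection procedure. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the `Variant` class, the `select_instrument` function, the caller-supplied `r2` lookup, and the reading of the 10 MB window as a maximum distance between variants are all hypothetical choices made here. Only the thresholds (p<5e-08, MAF>1%, r2<0.01) come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Variant:
    rsid: str
    chrom: str
    pos: int          # base-pair position
    pval: float       # p-value for association with LTL
    maf: float        # minor allele frequency in the reference panel
    biallelic: bool


def select_instrument(
    variants: List[Variant],
    r2: Callable[[Variant, Variant], float],  # pairwise LD lookup (hypothetical)
    p_max: float = 5e-8,
    maf_min: float = 0.01,
    r2_max: float = 0.01,
    window_bp: int = 10_000_000,              # assumed: maximum distance in bp
) -> List[Variant]:
    # Step 1: keep genome-wide significant, common, biallelic SNPs.
    candidates = [
        v for v in variants
        if v.pval < p_max and v.maf > maf_min and v.biallelic
    ]
    # Step 2: greedy LD pruning, strongest association first.
    candidates.sort(key=lambda v: v.pval)
    kept: List[Variant] = []
    for v in candidates:
        in_ld = any(
            k.chrom == v.chrom
            and abs(k.pos - v.pos) <= window_bp
            and r2(k, v) >= r2_max
            for k in kept
        )
        if not in_ld:
            kept.append(v)
    return kept
```

      Sorting by p-value before pruning means that, within each LD block, the variant with the strongest LTL association is retained. A wider window and a lower r2 threshold both remove more variants, which is the sense in which the selection is more conservative.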

      4) Colocalization analysis suggests that a /subset/ of LTL signals map onto lung cancer signals. Does this mean that the MR relationships are driven entirely by this small subset, or is there evidence (polygenic) from other loci? Rather than do a "leave one out" the authors could stratify their instrument into "coloc +ve / coloc -ve" and redo the MR analyses.

      Mainly here, the goal is to interpret if the subset of signals at the top (looks like n=14, the bump of non-trivial PP4 > 0.6, say) which map predominantly to TERT, TERC, and OBFC1 explain the observed effect here. I.e., it is biology around these specific mechanisms or generally LTL (polygenicity) but exemplified by extreme examples (TERT, etc.). I appreciate that statistical power is a consideration to keep in mind with interpretation.

      We appreciate the reviewer’s comment and, indeed, we considered this idea. However, the analytical approach used the lung cancer GWAS to identify variants that colocalise. To validate the hypothesis that a subset of colocalised variants would be driving all the MR associations, we would need an independent lung cancer case-control study to act as an out-of-sample validation set. This is not available to us at this point. Nevertheless, we slightly re-worded the discussion to highlight that the colocalised loci tend to be near genes related to telomere length biology, and we are also exploring the colocalisation approach to select variants for PRS analysis elsewhere.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors examine the role of the K700E mutation in the Sf3B1 splicing factor in PDAC and report that this Sf3B1 mutation promotes PDAC by decreasing sensitivity to TGF-b resulting in decreased EMT and decreased apoptosis as a result. They propose that the Sf3b1 K700E mutant causes decreased expression of Map3K7, a known mediator of TGF-β signaling and also known to be alternately spliced in other systems by the Sf3b1 K700E mutation. The role of splicing defects in cancer is relatively understudied and could identify novel targets for therapeutic intervention so this work is of potential significance. However, the data is over-interpreted in many instances and it is not clear the authors can make the claims they do based on the data shown. In particular, the data showing that decreased Map3k7 underlies the effects of the Sf3b1K700E mutant is very weak. Does over-expression of Map3k7 promote the EMT signature and induce apoptosis? Do the Map3k7 expressing organoids form tumors more effectively when transplanted into mice? Also, the novelty of the work is a concern since aberrant Map3k7 splicing due to SF3B1 mutation was seen previously in other systems. The authors also do not address the apparent conundrum of Sf3b1 K700E mutation promoting tumorigenesis despite there being less EMT which is also required for progression to metastasis in PDAC.

      Major Concerns.

      1) The analysis of the effect of Sf3b1K700E expression on normal pancreas and on PanINs in KC mice and PDAC in KPC mice is superficial and could be enhanced by staining for amylase, cytokeratin-19 and insulin. In particular, the data quantified in figure 1L should be accompanied by staining for CK19, Mucin5AC or some other marker of ductal transformation. Also, are any effects seen at older ages in normal mice?

      We performed staining of normal and cancerous mouse pancreata using CK19, MUC5AC and b-amylase antibodies. In line with our hypothesis that Sf3b1K700E mainly plays a role in early stages of PDAC formation, we observed significant differences in CK19 (increase), MUC5AC (increase) and b-amylase (decrease) expression in early stage KPC-Sf3b1K700E vs. KPC tumors (Fig. 1G-J), but not in late stage tumors (see Figure 1-figure supplement 1F-I). In addition, no differences were observed in normal mice. We added these data to the revised manuscript (see Figure 1-figure supplement 1D, E).

      2) The invasion assays used are limited and should be complemented by more routine quantification of cell migration and invasion including such assays as a scratch assay, Boyden chamber assays and use of the IncuCyte system to quantify. As it stands the image in Figure 3B is difficult to interpret since it is very poorly described in the figure legend. Additional evidence is needed to make the claims made by the authors.

      During the revisions we performed wound healing/scratch assays using PANC-1 cells with inducible SF3B1 WT/K700E overexpression. We observed a significant difference in migratory capacity between SF3B1 WT- and SF3B1 K700E overexpressing cells stimulated with TGF-β. We added this data to the revised manuscript (Fig. 2I, J). We also describe the abovementioned figure 3B in more detail (revised manuscript Fig. 2G, H; line 759-767).

      3) The authors should show the actual CC3 staining quantified in Suppl. Figure 2G.

      We added a representative image of CC3 staining (see Figure 3-figure supplement 1A) for the quantified data (see Figure 3-figure supplement 1B in the revised manuscript).

      4) The graph in Figure 3L should show WT and Sf3b1K700E expressing organoids number both with and without TGF-b.

      Since without TGF-b supplementation organoids have to be split in a 1:3 ratio every 5 days, we could not follow the same passaging regimen as in experiments with TGF-b supplementation (split in a 1:2 ratio every 20 days, Fig. 3I). However, we assessed the number of organoids grown in control medium without TGF-b for 4 passages (20 days) in a 1:3 ratio, and observed no difference in organoid number between WT and Sf3b1K700E expressing organoids (Author response image 1). In the revised manuscript we show with a highly quantitative read-out (CellTiterGlo) that Sf3b1K700E expressing organoids do not grow faster than Sf3b1 WT expressing organoids in the absence of TGF-β (see Figure 3-figure supplement 1E). Taken together, we can exclude that Sf3b1K700E organoids outgrow Sf3b1 WT organoids in medium with TGF-β supplementation because they generally have a growth advantage.

      Author response image 1. WT and Sf3b1K700E expressing organoids were cultured without TGF-β supplementation. Organoids were split in a 1:3 ratio every 5 days. Data points show organoid number before splitting, assessed for 4 passages.

      Reviewer #2 (Public Review):

      The manuscript has several areas of strength; it functionally explores a mutant that is detected in a portion of pancreatic cancers; it conducts mechanistic investigation and it uses human cell lines to validate the findings based on mouse models. Some areas for improvement are described below.

      1) TGF-b is known to act as a tumor suppressor early in carcinogenesis, and as a tumor promoter later. The authors should extend their analysis of mouse models to determine whether the effect of SF3B1K700E is specific to promoting initiation (e.g. more, early acinar ductal metaplasia) or faster progression of PanINs following their formation. Another way to address this could be acinar cultures, to determine whether an increased propensity to ADM exists.

      To further disentangle the effect of Sf3b1K700E with respect to tumor progression, we analyzed our autochthonous model at an early and a late stage of tumor progression. Histological examination at 5 weeks revealed an increased propensity to ADM (see Figure 1-figure supplement 1J, K), increased PanIN formation (shown by MUC5AC and CK19 IF stainings, Fig. 1G, I, J) and a concomitant decrease of acinar cells (shown by b-amylase staining) in KPC-Sf3b1K700E vs. KPC tumors (Fig. 1G, H). Analysis of tumors at 9 weeks of age did not show differences in CK19 staining or fibrosis. We added these data to the revised manuscript (see Figure 1-figure supplement 1F-I).

      2) Given that the effect of SF3B1K700E expression is more prominent in KC mice, rather than in KPC mice, the authors should explain the rationale for using the latter for RNA sequencing.

      In KC mice, pre-invasive PanIN lesions only infrequently progress to PDAC (spontaneous progression, see Gabriel et al., Pancreatology, 2020). Therefore, it would have been difficult to collect enough material for cell sorting and downstream RNA sequencing of tumor cells. The KPC mouse model develops PDAC with 100% penetrance, allowing the collection of sufficient material.

      3) Given that this mutation is found in about 3% of human pancreatic cancer, it would be interesting to know whether these tumors have any unique feature, and specifically any characteristic that could be harnessed therapeutically.

      Unfortunately, the size of published datasets is too small for a meaningful differential gene expression analysis of SF3B1-WT vs. SF3B1-K700E PDAC tumors (due to the low occurrence of SF3B1-K700E PDAC). However, harnessing the K700E mutation therapeutically by increasing missplicing through splicing inhibitors has previously been suggested, and it was shown that SF3B1-K700E mutated cancer cells are more prone to apoptosis when splicing is chemically targeted than SF3B1-WT cells. We tested a similar approach in murine pre-cancerous organoids, demonstrating that Sf3b1-WT organoids show higher survival than Sf3b1K700E expressing organoids when treated with the splicing-inhibitor Pladienolide B (Author response image 2). However, since this concept is not novel and not within the topic of our manuscript, we would prefer to not integrate this data into our manuscript.

      Author response image 2. 33 nM of the splicing inhibitor Pladienolide B was added to the cell culture medium for 48 hours and the viability was assessed by normalizing organoid numbers to untreated control organoids. The line indicates WT and Sf3b1K700E organoids assessed in the same replicate.

      4) It would be interesting to know whether this mutation is mutually exclusive to other mutations affecting response to TGF-b. Further, while the data might not be widely available, it would be interesting to know whether in human patients the mutation occurs in precursor lesions (PanIN might be difficult to assess, but IPMN might be doable) or at later stages.

      We performed a mutual exclusivity analysis in PDAC samples available at www.cbioportal.org, but did not find mutual exclusivity of SF3B1-K700E to genes of the TGF-β-pathway. Of note, the value of the analysis is limited by the small sample size of SF3B1-K700E PDAC (n=7). Moreover, to our knowledge there is no public tissue biobank for PDAC which would allow us to assess the stage of SF3B1-K700E mutated PDAC tumors. Thus, unfortunately we cannot histologically assess if the mutations already occur in early stages of human tumor development.

      Author response table 1: Mutual exclusivity analysis of public PDAC databases (ICGC, CPTAC, QCMG, TCGA, UTSW), including 910 patients. Mutation frequency is 25% for SMAD4, 5% for TGF-βR2, 3% for SMAD2, 2.6% for TGF-βR1, 1.4% for SMAD3, 0.7% for SF3B1-K700E, 0.7% for TGF-βR3, 0.4% for SMAD1. Analysis was performed on cbioportal.org.
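
      The limitation imposed by the small number of SF3B1-K700E tumours can be illustrated with a short calculation. This is a sketch under stated assumptions: the counts are approximated from the frequencies quoted above (about 25% of 910 for SMAD4, n=7 for SF3B1-K700E), the function name `prob_no_overlap` is hypothetical, and a hypergeometric null model of the kind underlying a one-sided Fisher exact test is assumed, which may differ from the test run on cbioportal.org.

```python
from math import comb


def prob_no_overlap(n_total: int, n_a: int, n_b: int) -> float:
    """Probability that no sample carries both mutations by chance.

    Hypergeometric model: n_b mutated samples are drawn at random from
    n_total, of which n_a carry the other mutation.
    """
    return comb(n_total - n_a, n_b) / comb(n_total, n_b)


n_total = 910   # patients in the pooled cohorts
n_smad4 = 228   # approximately 25% of 910 (rounded, assumption)
n_sf3b1 = 7     # SF3B1-K700E tumours

# Number of co-mutated tumours expected if the two events are independent.
expected_overlap = n_smad4 * n_sf3b1 / n_total

# Chance of observing zero co-mutated tumours under independence.
p_zero = prob_no_overlap(n_total, n_smad4, n_sf3b1)
```

      With these approximate counts, fewer than two co-mutated tumours are expected by chance, and the probability of observing none at all is roughly 0.13. Even complete mutual exclusivity with the most frequently mutated gene in the table would therefore not reach conventional significance with only seven SF3B1-K700E cases.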

      Reviewer #3 (Public Review):

      Alternative splicing as a result of mutations in different components of the splicing machinery has been associated with a variety of cancer types, including hematological malignancies where this has been most extensively studied but also for solid tumors such as breast and pancreatic ductal adenocarcinoma (PDAC). Here the authors analyze genome sequencing data in human PDAC samples and identify a recurring mutation in the SF3B1 subunit that substitutes lysine for glutamate at residue 700 (SF3B1K700E) in PDACs. This mutation has been identified and its molecular role in disease progression in other diseases has been studied, but the mechanism for promoting disease progression in pancreatic cancer has not been as well characterized.

      To study how SF3B1K700E contributes to PDAC pathology, the authors generate a novel genetically modified mouse model of a pancreas specific SF3B1K700E mutation and explore its oncogenicity and tumor promoting potential. The authors find that SF3B1K700E is not oncogenic, but potentiates the oncogenic potential of Kras and p53 (KP) driver mutations commonly found in PDAC tumors. The authors then proceed to characterize the molecular mechanisms that might drive this phenotype. By transcriptomic analysis, the authors find KP-SF3B1K700E tumors have downregulation of epithelial-to-mesenchymal transition (EMT) genes compared to KP tumors. The cytokine TGFβ has previously been found to limit PDAC initiation and progression by causing lethal EMT in PDAC and PDAC precursor cells. Thus, the authors propose SF3B1K700E inhibition of EMT blocks the tumor suppressive activity of TGFβ and this underpins the tumor promoting role of SF3B1K700E mutation in PDAC. Consistent with this finding, SF3B1K700E mutation blocks TGFβ-induced toxicity in a variety of cell culture models of PDAC and PDAC precursor models.

      Lastly, the authors seek to identify how altered splicing reduces EMT activity in PDAC cells. The authors identify misspliced genes consistent in both KP and human SF3B1K700E mutant cancer samples and find Map3k7 as one of 11 consistently misspliced genes. MAP3K7 has previously been identified as a positive regulator of EMT. Thus, the authors speculated that Map3k7 missplicing would lead to reduced MAP3K7 activity and a reduction in EMT, and that this underpins the TGFβ resistance in SF3B1K700E mutant PDAC cells. Consistent with this, the authors find inhibition of MAP3K7 reduces TGFβ toxicity in SF3B1K700E WT cells and overexpression of MAP3K7 in SF3B1K700E mutant PDAC cells induces TGFβ toxicity. Altogether, this suggests activity of Map3k7 is responsible for altered EMT activity and TGFβ sensitivity in SF3B1K700E mutant PDAC.

      Altogether, the authors generate a valuable model to study the role of a recurring splicing mutation in PDAC and provide compelling evidence that this mutation accelerates disease. The authors then perform both: (1) an open-ended investigation of how this mutation alters PDAC cell biology where they identify altered EMT activity and (2) rigorous mechanistic studies showing suppressed EMT provides PDAC cells with resistance to TGFβ, which has previously been shown to be tumor suppressive in PDAC, suggesting a possible mechanism by which SF3B1K700E mutation is oncogenic in PDAC that future animal studies can confirm. This work generates valuable models and datasets to advance the understanding of how mutations in the splicing machinery can promote PDAC progression and suggests alternative splicing of MAP3K7 is one such possible mechanism by which altered splicing promotes PDAC progression in vivo.

      • One major concern about the manuscript is that the proposed mechanism by which SF3B1K700E mutation accelerates PDAC progression (MAP3K7 inhibition -> EMT inhibition -> reduced TGF-β toxicity) is only tested in ex vivo culture models and there is very limited and correlative data to suggest that this is the operative mechanism by which SF3B1K700E mutant tumors are accelerated. This is especially important because of recent findings that IFN-α signaling, which the authors also found to be high in SF3B1K700E mutant tumors, also promotes PDAC progression (https://www.biorxiv.org/content/10.1101/2022.06.29.497540v1). Thus, while thoroughly convinced by the rigorous ex vivo work that SF3B1K700E does lead to MAP3K7 inhibition -> EMT inhibition -> reduced TGF-β toxicity, further experiments to confirm this mechanism is critical in vivo would be needed to convince me that this mechanism is critical to tumor progression in vivo. For example, would forced expression of MAP3K7 slow orthotopic KP-SF3B1K700E tumor growth while leaving IFN-α signaling unperturbed?

      We thank the reviewer for raising these important points. To first test if the upregulation of IFN-α signaling, seen in our RNA-seq data of sorted KPC-Sf3b1K700E cells, was directly caused by the Sf3b1-K700E mutation, we assessed the 5 most deregulated genes of the IFN-α signature in in-vitro activated KPC and KPC-Sf3b1K700E organoids (analogous to the experiments on the EMT gene signature in Figure 2-figure supplement 1D). However, in contrast to EMT marker genes, IFN-α signature genes were not differentially expressed in KPC-Sf3b1K700E vs. KPC organoids (Author response image 3). Thus, increased IFN-α signaling in KPC-Sf3b1K700E tumors in mice is likely an indirect consequence of further progressed cancers rather than an effect directly caused by Sf3b1K700E-mediated missplicing.

      Author response image 3. Expression of the 5 most deregulated genes of the IFN-α gene set identified in sorted KPC-Sf3b1K700E cells in in-vitro activated KPC-Sf3b1K700E and KPC organoids. 4 biological replicates were performed. For analysis, Ct-values of the indicated genes were normalized to Actb and a two-tailed unpaired t-test was used to compute the indicated p-values.
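
      The normalization of Ct-values to a reference gene mentioned in this legend can be sketched as follows. This is an illustrative example only: the 2^-ΔCt transformation, the assumption of roughly 100% amplification efficiency, the function name, and the Ct values are choices made here and are not taken from the study.

```python
def relative_expression(ct_gene: float, ct_ref: float) -> float:
    """Expression of a target gene relative to a reference gene (e.g. Actb).

    Assumption: each PCR cycle doubles the product, so a difference of
    one cycle corresponds to a twofold difference in template.
    """
    delta_ct = ct_gene - ct_ref   # normalize the target Ct to the reference Ct
    return 2.0 ** (-delta_ct)


# Hypothetical Ct values: the target crosses threshold 5 cycles after Actb.
example = relative_expression(ct_gene=25.0, ct_ref=20.0)
```

      A target detected five cycles later than the reference gene corresponds to a relative expression of 1/32 under this model. Group comparisons, such as the unpaired t-test named in the legend, would then be run on the normalized values of the biological replicates.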

      To next examine the effect of Map3k7 on tumors in vivo, we established orthotopic transplantation models with KPC and KPC-Sf3b1K700E cells, with overexpression or knockdown of Map3k7 (Author response image 4). However, in contrast to the autochthonous mouse model, already orthotopically transplanted KPC vs. KPC-Sf3b1K700E cells did not show differences in tumor size (see Figure 1-figure supplement 1M, N). These data support our hypothesis that Sf3b1-K700E rather plays an important role during early stages of PDAC (KPC cells are isolated from fully developed PDAC tumors and orthotopic KPC transplantation thus represents a late-stage PDAC model).

      Unfortunately, these data also demonstrate that orthotopic transplantation of KPC cells is not a suitable model for studying the impact of Map3k7 in PDAC development, and as expected, neither Map3k7 overexpression in transplanted KPC-Sf3b1K700E cells nor shRNA mediated knockdown of Map3k7 (shMap3k7) in transplanted KPC cells led to differences in growth compared to their control groups (Author response image 4). In line with these results, the EMT genes that were found to be differentially expressed in our autochthonous mouse model (KPC vs. KPC-Sf3b1K700E) were expressed at similar levels upon Map3K7 downregulation or overexpression.

      Since establishing an autochthonous KPC PDAC mouse model with a knockdown of Map3k7 is out of scope for a revision, in the revised manuscript we discuss, as a limitation of our study, that the molecular link between Sf3b1K700E, Map3k7 and Tgfb resistance has only been studied in vitro in organoids and cell lines. We also adapted the abstract and the title of the manuscript accordingly (formerly “Mutant SF3B1 promotes PDAC malignancy through TGF-β resistance”, now “Mutant SF3B1 promotes malignancy in PDAC”).

      Author response image 4. (A) Relative gene expression of Map3k7 in KPC cells transduced with shRNA targeting Map3k7 (shMap3k7), normalized to KPC cells transduced with scrambled control shRNA (shCtrl). 3 biological replicates are shown. (B) Weight of tumors derived by orthotopic transplantation of shMap3k7 and shCtrl KPC cells. 5 biological replicates are shown. (C) Relative gene expression of EMT genes in tumors derived by orthotopic transplantation of shCtrl and shMap3k7 cells. 4 biological replicates are shown. (D) Relative gene expression of Map3k7 in KPC-Sf3b1K700E cells transduced with an overexpression vector of Map3k7 (OE Map3k7), normalized to control KPC cells without Map3k7 overexpression. 3 biological replicates are shown; a two-sided Student’s t-test was used to calculate significance. (E) Weight of tumors derived by orthotopic transplantation of Map3k7-overexpressing KPC-Sf3b1K700E cells (n=5) and control KPC-Sf3b1K700E cells (n=4). (F) Relative gene expression of EMT genes in tumors derived by orthotopic transplantation of KPC-Sf3b1K700E cells with and without overexpression of Map3k7. 4 biological replicates are shown. A two-sided Student’s t-test was used to calculate significance in Fig. 2A-F.

    1. Author Response

      Reviewer #1 (Public Review):

      The manuscript by Hekselman et al. presents analyses linking cell types to monogenic disorders using over-expression of monogenic disease genes as the signal. The manuscript analyses data from 6 tissues (bone marrow, lung, muscle, spleen, tongue and trachea) together with ~1,000 rare diseases from OMIM (with ~2,000 associated genes) to identify the cell types of interest for a specific disease of choice. The signal used by the approach is the expression of OMIM genes in a particular cell type relative to the expression of the gene in the tissue of interest, identifying cell type–disease pairs that are then investigated through literature review and recapitulated using mouse expression. A potentially interesting finding is that disease genes manifesting in multiple tissues seem to hit the same cell types. Overall, this important study combines multiple data analyses to quantify the connection between cell types and human disorders. However, whereas some of the analyses are compelling, the statistical analyses are incomplete as they don't provide full treatment of type I error.

      Statistical analyses were changed to include permutation testing and a different threshold (Results, page 6, 1st paragraph; Methods, page 21-22, ‘PrEDiCT score calculation and significance assessment’; Figure 1–figure supplement 2). Assessments of type I error were based on literature text-mining and expert curation, and showed that false-positive rates were low in both (0.01 and 0.07, respectively; Figure 1F and Figure 1–figure supplement 4A).

      Reviewer #2 (Public Review):

      This study identifies 110 disease-affected cell types for 714 Mendelian diseases, based on preferential expression of known disease-associated genes in single-cell data. It is likely that many or most of the results are real, and the results are biologically interesting and provide a valuable resource. However, updates to the method are needed to ensure that inference of statistical significance is appropriately stringent and rigorous.

      Strengths: a systematic evaluation of disease-affected cell types across Mendelian diseases is a valuable addition to the literature, complementing systematic evaluations of common disease and targeted analyses of individual Mendelian diseases. The validation via excess overlap with disease–cell type pairs from literature co-appearance provides compelling evidence that many or most of the results are real. In addition, many of the results are biologically interesting. In particular, it is interesting that diseases with multiple affected tissues tend to affect similar cell types in the respective tissues.

      Limitations: the main limitation of the study is that, although many or most of the results are likely to be real, the criterion for statistical significance is probably not stringent enough, and is not well-justified. For diseases with only 1 disease-associated gene, the threshold is a z-score>2 for preferential expression in the cell type, but this threshold is likely to be often exceeded by chance. (For diseases with many disease-associated genes, the threshold is a median (across genes) z-score>2 for preferential expression in the cell type, which is less likely to occur by chance but still an arbitrary threshold.) Thus, there is a good chance that a sizable proportion of the reported disease-affected cell types might be false positives. The best solution would be to assess statistical significance via empirical comparison with results for non-disease-associated control genes, and assess the statistical significance of the resulting P-values using FDR.

      We thank the reviewer for the valuable insights and suggestions. We revised the method to assess statistical significance by using empirical comparison followed by FDR correction, as suggested by the reviewer (Results, page 6, 1st paragraph; Methods, page 21-22, ‘PrEDiCT score calculation and significance assessment’; Figure 1–figure supplement 2).

      The re-analysis using mouse single-cell data adds an interesting additional dimension to the study, with the small caveat that mouse single-cell data does not provide statistically independent information across genes (for the same reason that adding data from independent human individuals would not provide statistically independent information across genes, given that human and mouse expression are partially correlated).

      We acknowledge this caveat in the text (Discussion, page 17, 2nd paragraph, lines 8-11).

      Reviewer #3 (Public Review):

      The authors describe the method, PrEDiCT, which helps identify disease affected cell types based on gene sets. As I understand it, the method is based on finding which "disease genes" (from an annotation) are relatively highly expressed. The idea is nice, however, I have concerns about how "significance" is assessed and the relative controls.

      Overall, I find the idea interesting, but the execution raises some concerns.

      1) From a causal perspective, there is an association of high expression of these genes within these cell types, but without also assessing individuals with those specific diseases, I do not think it is fair to say "disease affected" cell types. It is possible that these genes might behave completely fine but are highly expressed in those cell types while being affected in other cell types.

      We agree with the reviewer. We changed the terminology to "likely disease-affected cell types” and added this caveat to the Discussion, page 16, 2nd paragraph.

      2) It is unclear to me what the "null" comparison is in the method and if there is one. For example, by chance, would I expect this gene to be highly expressed because other genes are also highly expressed in this cell type? Some way to assess "significance" or "enrichment" beyond simply using ranks and thresholds would be helpful in deciding whether these associations are robust.

      We revised the procedure for assessing statistical significance to include permutation tests. Specifically, given a disease D with n disease-associated genes, the null hypothesis was that the PrEDiCT score of these genes is not significantly different from the PrEDiCT score of a random set of n genes. To test this, we randomly selected n genes expressed in any cell type, and computed the PrEDiCT score for this random gene set in each cell type of the disease-affected tissue (referred to as ‘random score’). We repeated this procedure 1,000 times, resulting in 1,000 random scores per disease and cell type. The p-value of the PrEDiCT score of disease D in cell type c was set to the fraction of random scores in c that were at least as high as the original PrEDiCT score of D in c. The acquired p-values were adjusted for multiple hypothesis testing per disease using the Benjamini-Hochberg procedure. To increase stringency, we treated only statistically significant disease–cell-type pairs with PrEDiCT score≥1 as 'likely affected'. The procedure is detailed in Results, page 6, 1st paragraph; Methods, page 21-22, ‘PrEDiCT score calculation and significance assessment’; Figure 1–figure supplement 2. Additionally, we estimated type I error by using literature text-mining or expert curation (Results, page 7, 2nd paragraph; Methods, page 22, ‘Text-mining of PubMed records’, and page 23, ‘Expert curation and assessment of disease-affected cell types’; Figure 1F and Figure 1–figure supplement 4A).
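
      The permutation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the `pref_expr[gene][cell_type]` lookup of precomputed preferential-expression values, the fixed random seed, and the significance level `alpha=0.05` are all hypothetical choices made for the example. Only the 1,000 repetitions, the "at least as high" p-value, the per-disease Benjamini-Hochberg adjustment and the score≥1 cut-off come from the text.

```python
import random
from statistics import median


def predict_score(genes, pref_expr, cell_type):
    """Median preferential expression of a gene set in one cell type."""
    return median(pref_expr[g][cell_type] for g in genes)


def empirical_p_value(disease_genes, background_genes, pref_expr, cell_type,
                      n_perm=1000, rng=None):
    """Fraction of random gene sets scoring at least as high as the disease set.

    background_genes must be a list (genes expressed in any cell type).
    """
    rng = rng or random.Random(0)  # fixed seed: an assumption, for reproducibility
    observed = predict_score(disease_genes, pref_expr, cell_type)
    n = len(disease_genes)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(background_genes, n)
        if predict_score(random_set, pref_expr, cell_type) >= observed:
            hits += 1
    return observed, hits / n_perm


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def likely_affected(scores, p_values, alpha=0.05, min_score=1.0):
    """Flag the cell types of one disease: significant after BH and score >= 1."""
    adjusted = benjamini_hochberg(p_values)
    return [q <= alpha and s >= min_score for s, q in zip(scores, adjusted)]
```

      In this sketch `likely_affected` would be called once per disease, with one score and one empirical p-value per cell type of the affected tissue, mirroring the per-disease correction described in the response.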

      3) Additionally, it is unclear to me, but I suspect that there are unequal cell numbers in the scores computed as well as between relevant tissues. This is related to point (2) above, but as a result, the estimates of the scores will inherently have different variances, thus making comparisons between them difficult/unreliable unless accounted for. If I understand correctly, the score is first the average expression within a tissue, then, the Z-score? If so, my comment applies.

      To clarify, the PrEDiCT score of a disease D in cell type c was set to the median preferential expression P of its disease genes (Equation 1 below). The preferential expression of each gene in c was computed as a Z-score, by comparing the average expression of the gene in c to its average expression in all cell types of the tissue, divided by the standard deviation (SD, Equation 2 below). Tissues indeed had unequal numbers of cell types; however, the distributions of PrEDiCT scores were similar between tissues (now in Supplementary File 13). We revised this part of Methods and added Equations 1 and 2 (Methods, page 21-22, ‘PrEDiCT score calculation and significance assessment’) and Supplementary File 13.
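
      The score definition given in words above can be illustrated with a short sketch. It is an assumption-laden example rather than the study's implementation: the function names and the `avg_expr[gene][cell_type]` layout are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is a guess, since the text does not say which SD estimator was used.

```python
from statistics import mean, median, stdev


def preferential_expression(avg_expr_by_cell_type):
    """Z-score of one gene's average expression in each cell type of a tissue.

    avg_expr_by_cell_type maps cell type -> average expression of the gene.
    A gene with identical expression in all cell types (SD of zero) is not
    handled here and would raise ZeroDivisionError.
    """
    values = list(avg_expr_by_cell_type.values())
    mu = mean(values)
    sd = stdev(values)  # sample SD: an assumption
    return {c: (x - mu) / sd for c, x in avg_expr_by_cell_type.items()}


def predict_scores(disease_genes, avg_expr):
    """Median preferential expression of the disease genes, per cell type."""
    z = {g: preferential_expression(avg_expr[g]) for g in disease_genes}
    cell_types = list(avg_expr[disease_genes[0]])
    return {c: median(z[g][c] for g in disease_genes) for c in cell_types}
```

      For a disease with a single gene the median reduces to that gene's Z-score, which is why the reviewers' concern about a fixed threshold applies most strongly to such diseases.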

      4) There is a large set of work done in gene enrichment sets which appears to not be mentioned (e.g. GSEA and other works by the Price group). It would be helpful for the authors to summarize these methods and how their method differs.

      We added work done in gene enrichment sets (including two relevant and recent studies from the Price group) and summarized these methods in the Introduction (page 2-3).

      5) Additionally, it should be noted that a caveat of this analysis is that the comparisons are all done only relative to the cell types sampled and the diseases which have Mendelian genes associated with them. I would expect these results to change, possibly drastically, if the sampled cell types and diseases were to be changed.

      We agree with the reviewer and now discuss the generalizability of our results, relating to the extent of the sampled cell types (Discussion, page 18, 1st paragraph).

      6) Finally, I would appreciate a more detailed explanation in the methods of how the score is computed. Some equations and the data they are calculated from would be helpful here.

      We now provide a detailed explanation of how the score and its statistical significance were computed and added Equations 1 and 2 (Methods, page 21-22, ‘PrEDiCT score calculation and significance assessment’).

      In summary, the general idea is an interesting one, but I do think the issues above should be addressed to make the results convincing.

      We thank the reviewer for the important feedback which helped us strengthen our analyses.

    1. Author Response

      Reviewer #2 (Public Review):

      I believe the authors succeeded in finding neural evidence of reactivation during REM sleep. This is their main claim, and I applaud them for that. I also applaud their efforts to explore their data beyond this claim, and I think they included appropriate controls in their experimental design. However, I found other aspects of the paper to be unclear or lacking in support. I include major and medium-level comments:

      Major comments, grouped by theme with specifics below:

      Theta.

      Overall assessment: the theta effects are either over-emphasized or unclear. Please either remove the high/low theta effects or provide a better justification for why they are insightful.

      Lines ~ 115-121: Please include the statistics for low-theta power trials. Also, without a significant difference between high- and low-theta power trials, it is unclear why this analysis is being featured. Does theta actually matter for classification accuracy?

      Lines 123-128: What ARE the important bands for classification? I understand the point about it overlapping in time with the classification window without being discriminative between the conditions, but it still is not clear why theta is being featured given the non-significant differences between high/low theta and the lack of its involvement in classification. REM sleep is high in theta, but other than that, I do not understand the focus given this lack of empirical support for its relevance.

      Line 232-233: "8). In our data, trials with higher theta power show greater evidence of memory reactivation." Please do not use this language without a difference between high and low theta trials. You can say there was significance using high theta power and not with low theta power, but without the contrast, you cannot say this.

      Thank you, we have taken this point onboard. We thought the differences observed between classification in high and low theta power trials were interesting, but we can see why the reviewer feels there is a need for a stronger hypothesis here before reporting them. We have therefore removed this approach from the manuscript, and no longer split trials into high and low theta power.

      Physiology / Figure 2.

      Overall assessment: It would be helpful to include more physiological data.

      It would be nice, either in Figure 2 or in the supplement, to see the raw EEG traces in these conditions. These would be especially instructive because, with NREM TMR, the ERPs seem to take a stereotypical pattern that begins with a clear influence of slow oscillations (e.g., in Cairney et al., 2018), and it would be helpful to show the contrast here in REM.

      We thank the reviewer for these comments. We have now performed ERP and time-frequency analyses following a similar approach to that of (Cairney et al., 2018). We have added a section in the results for these analyses as follows:

      “Elicited response pattern after TMR cues

      We looked at the TMR-elicited response in both time-frequency and ERP analyses using a method similar to the one used in (Cairney et al., 2018), see methods. As shown in Figure 2a, the EEG response showed a rapid increase in theta band followed by an increase in beta band starting about one second after TMR onset. REM sleep is dominated by theta activity, which is thought to support the consolidation process (Diekelmann & Born, 2010), and increased theta power has previously been shown to occur after successful cueing during sleep (Schreiner & Rasch, 2015). We therefore analysed the TMR-elicited theta in more detail. Focussing on the first second post-TMR-onset, we found that theta was significantly higher here than in the baseline period, prior to the cue [-300 -100] ms, for both adaptation (Wilcoxon signed rank test, n = 14, p < 0.001) and experimental nights (Wilcoxon signed rank test, n = 14, p < 0.001). The absence of any difference in theta power between experimental and adaptation conditions (Wilcoxon signed rank test, n = 14, p = 0.68), suggests that this response is related to processing of the sound cue itself, not to memory reactivation. Turning to the ERP analysis, we found a small increase in ERP amplitude immediately after TMR onset, followed by a decrease in amplitude 500ms after the cue. Comparison of ERPs from experimental and adaptation nights showed no significant difference (n = 14, p > 0.1). Similar to the time-frequency result, this suggests that the ERPs observed here relate to the processing of the sound cues rather than any associated memory.”

      And we have updated Figure 2.

      Also, please expand the classification window beyond 1 s for wake and 1.4 s for sleep. It seems the wake axis stops at 1 s and it would be instructive to know how long that lasts beyond 1 s. The sleep signal should also go longer. I suggest plotting it for at least 5 seconds, considering prior investigations (Cairney et al., 2018; Schreiner et al., 2018; Wang et al., 2019) found evidence of reactivation lasting beyond 1.4 s.

      Regarding the classification window, this is an interesting point. TMR cues in sleep were spaced 1.5 s apart, which is why we included only this window in our classification. Extending our window beyond 1.5 s would mean including the time at which the next TMR cue was presented. Similarly, in wake the duration of trials was 1.1 s; thus, at 1.1 s the next tone was presented.

      Following the reviewer’s comment, we have extended our window as requested even though this means encroaching on the next trial. We do this because it is possible that there is a transitional period between trials. When we extended the timing in wake and looked at reactivation in the range 0.5 s to 1.6 s, we found that the effect continued to ~1.2 s vs adaptation and chance, i.e. it continued 100 ms after the trial. Results are shown in the figures below.

      Temporal compression/dilation.

      Overall assessment: This could be cut from the paper. If the authors disagree, I am curious how they think it adds novel insight.

      Line 179 section: In my opinion, this does not show evidence for compression or dilation. If anything, it argues that reactivation unfolds on a similar scale, as the numbers are clustered around 1. I suggest the authors scrap this analysis, as I do not believe it supports any main point of their paper. If they do decide to keep it, they should expand the window of dilation beyond 1.4 in Figure 3B (why cut off the graph at a data point that is still significant?). And they should later emphasize that the main conclusion, if any, is that the scales are similar.

      Line 207 section on the temporal structure of reactivation, 1st paragraph: Once again, in my opinion, this whole concept is not worth mentioning here, as there is not really any relevant data in the paper that speaks to this concept.

      We thank the reviewer for these frank comments. On consideration, we have now removed the compression/dilation analysis.

      Behavioral effects.

      Overall assessment: Please provide additional analyses and discussion.

      Lines 171-178: Nice correlation! Was there any correlation between reactivation evidence and pre-sleep performance? If so, could the authors show those data, and also test whether this relationship holds while covarying out pre-sleep performance? The logic is that intact reactivation may rely on intact pre-sleep performance; conversely, there could be an inverse relationship if sleep reactivation is greater for initially weaker traces, as some have argued (e.g., Schapiro et al., 2018). This analysis will either strengthen their conclusion or change it -- either outcome is good.

      Thanks for these interesting points. We have now performed a new analysis to check if there was a correlation between classification performance and pre-sleep performance, but we found no significant correlation (n = 14, r = -0.39, p = 0.17). We have included this in the results section as follows:

      “Finally, we wanted to know whether the extent to which participants learned the sequence during training might predict the extent to which we could identify reactivation during subsequent sleep. We therefore checked for a correlation between classification performance and pre-sleep performance to determine whether the degree of pre-sleep learning predicted the extent of reactivation; this showed no significant correlation (n = 14, r = -0.39, p = 0.17).”

      Note that we calculated the behavioural improvement while subtracting pre-sleep performance and then normalising by it for both the cued and un-cued sequences as follows:

      [(random blocks after sleep - the best 4 blocks after sleep) – (random blocks pre-sleep – the best 4 blocks pre-sleep)] / (random blocks pre-sleep – the best 4 blocks pre-sleep).
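
      As a worked illustration of the formula above, the calculation can be written as a small function. The function and argument names are hypothetical, and the inputs are assumed to be already-summarised values (one number for the random blocks and one for the best 4 blocks, before and after sleep); how those summaries are obtained is not specified here.

```python
def normalised_improvement(random_post, best4_post, random_pre, best4_pre):
    """Overnight change in skill, normalised by pre-sleep skill.

    'Skill' in each session is the random-block value minus the value for
    the best 4 sequence blocks, following the formula in the text.
    """
    skill_post = random_post - best4_post
    skill_pre = random_pre - best4_pre
    return (skill_post - skill_pre) / skill_pre
```

      For example, with made-up values where the pre-sleep difference is 50 and the post-sleep difference is 100, the function returns (100 - 50) / 50 = 1.0, i.e. the difference doubled overnight.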

      Unlike Schönauer et al. (2017), they found a strong correspondence between REM reactivation and memory improvement across sleep; however, there was no benefit of TMR cues overall. These two results in tandem are puzzling. Could the authors discuss this more? What does it mean to have the correlation without the overall effect? Or else, is there anything else that may drive the individual differences they allude to in the Discussion?

      We have now added a discussion of this point as follows:

      “We are at a very early phase in understanding what TMR does in REM sleep, however we do know that the connection between hippocampus and neocortex is inhibited by the high levels of Acetylcholine that are present in REM (Hasselmo, 1999). This means that the reactivation which we observe in the cortex is unlikely to be linked to corresponding hippocampal reactivation, so any consolidation which occurs as a result of this is also unlikely to be linked to the hippocampus. The SRTT is a sequencing task which relies heavily on the hippocampus, and our primary behavioural measure (Sequence Specific Skill) specifically examines the sequencing element of the task. Our own neuroimaging work has shown that TMR in non-REM sleep leads to extensive plasticity in the medial temporal lobe (Cousins et al., 2016). However, if TMR in REM sleep has no impact on the hippocampus then it is quite possible that it elicits cortical reactivation and leads to cortical plasticity but provides no measurable benefit to Sequence Specific Skill. Alternatively, because we only measured behavioural improvement right after sleep it is possible that we may have missed behavioural improvements that would have emerged several days later, as we know can occur in this task (Rakowska et al., 2021).”

      Medium-level comments

      Lines 63-65: "We used two sequences and replayed only one of them in sleep. For control, we also included an adaptation night in which participants slept in the lab, and the same tones that would later be played during the experimental night were played."

      I believe the authors could make a stronger point here: their design allowed them to show that they are not simply decoding SOUNDS but actual memories. The null finding on the adaptation night is definitely helpful in ruling this possibility out.

      We agree and would like to thank the reviewer for this point. We have now included this in the text as follows: “This provided an important control, as a null finding from this adaptation night would ensure that we are decoding actual memories, not just sounds.”

      Lines 129-141: Does reactivation evidence go down (like in their prior study, Belal et al., 2018)? All they report is theta activity rather than classification evidence. Also, I am unclear why the Wilcoxon comparison was performed rather than a simple correlation in theta activity across TMR cues (though again, it makes more sense to me to investigate reactivation evidence across TMR cues instead).

      Thanks a lot for the interesting point. In our prior study (Belal et al., 2018), the classification model was trained on wake data and then tested on sleep data, which enabled us to examine its performance at different timepoints in sleep. However, in the current study the classifier was trained on sleep and tested on wake, so we can only test for differential replay at different times during the night by dividing the training data. We fear that dividing sleep trials into smaller blocks in this way will lead to weakly trained classifiers with inaccurate weight estimation due to the few training trials, and that these will not be generalisable to testing data. Nevertheless, following your comment, we tried this by dividing our sleep trials into two blocks, i.e. the first half of stimulation during the night and the second half of stimulation during the night. When we ran the analysis on these blocks separately, no clusters were found for either the first or second half of stimulation compared to adaptation, probably due to the reasons cited above. Hence the differences in design between the two studies mean that the current study does not lend itself to this analysis.

      Line 201: It seems unclear whether they should call this "wake-like activity" when the classifier involved training on sleep first and then showing it could decode wake rather than vice versa. I agree with the author's logic that wake signals that are specific to wake will be unhelpful during sleep, but I am not sure "wake-like" fits here. I'm not going to belabor this point, but I do encourage the authors to think deeply about whether this is truly the term that fits.

      We agree that better terminology is needed, and have now changed this: “In this paper we demonstrated that memory reactivation after TMR cues in human REM sleep can be decoded using EEG classifiers. Such reactivation appears to be most prominent about one second after the sound cue onset.”

      Reviewer #3 (Public Review):

      The authors investigated whether reactivation of wake EEG patterns associated with left- and right-hand motor responses occurs in response to sound cues presented during REM sleep.

      The question of whether reactivation occurs during REM is of substantial practical and theoretical importance. While some rodent studies have found reactivation during REM, it has generally been more difficult to observe reactivation during REM than during NREM sleep in humans (with a few notable exceptions, e.g., Schonauer et al., 2017), and the nature and function of memory reactivation in REM sleep is much less well understood than the nature and function of reactivation in NREM sleep. Finding a procedure that yields clear reactivation in REM in response to sound cues would give researchers a new tool to explore these crucial questions.

      The main strength of the paper is that the core reactivation finding appears to be sound. This is an important contribution to the literature, for the reasons noted above.

      The main weakness of the paper is that the ancillary claims (about the nature of reactivation) may not be supported by the data.

      The claim that reactivation was mediated by high theta activity requires a significant difference in reactivation between trials with high theta power and trials with low theta, but this is not what the authors found (rather, they have a "difference of significances", where results were significant for high theta but not low theta). So, at present, the claim that theta activity is relevant is not adequately supported by the data.

      The authors claim that sleep replay was sometimes temporally compressed and sometimes dilated compared to wakeful experience, but I am not sure that the data show compression and dilation. Part of the issue is that the methods are not clear. For the compression/dilation analysis, what are the features that are going into the analysis? Are the feature vectors patterns of power coefficients across electrodes (or within single electrodes?) at a single time point? or raw data from multiple electrodes at a single time point? If the feature vectors are patterns of activity at a single time point, then I don't think it's possible to conclude anything about compression/dilation in time (in this case, the observed results could simply reflect autocorrelation in the time-point-specific feature vectors - if you have a pattern that is relatively stationary in time, then compressing or dilating it in the time dimension won't change it much). If the feature vectors are spatiotemporal patterns (i.e., the patterns being fed into the classifier reflect samples from multiple frequencies/electrodes / AND time points) then it might in principle be possible to look at compression, but here I just could not figure out what is going on.

      Thank you. We have removed the analysis of temporal compression and dilation from the manuscript, but we would still like to answer these questions. In this analysis, raw data were smoothed and used as time-domain features. The data were organized as trials x channels x timepoints, and each trial was then segmented in time according to the compression factor being tested. For instance, to test whether sleep is 2x faster than wake, we took the wake trial length (1.1 s) and halved it (0.55 s). We then took windows of this length at different times in the sleep data, so that each sleep trial yielded multiple shorter segments of 0.55 s each; these segments were added as new trials and labelled with the respective trial label. Afterwards, we resized the segments temporally to match the length of the wake trials. We then reshaped the data from trials x channels x timepoints to trials x channels_timepoints, aggregating channels and timepoints into one dimension, and fed this to PCA to reduce the dimensionality of channels_timepoints to principal components. The resulting features were fed to an LDA classifier. This whole process was repeated for every scaling factor, within participant, in the same fashion as the main classification, and the error bars were standard errors. We compared the results from the experimental night to those of the adaptation night.
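
      The segmentation, temporal resizing and reshaping steps described above can be sketched as follows. This is only an illustration under several assumptions that the response does not specify: segments are taken as non-overlapping windows, resizing uses linear interpolation, the smoothing step is omitted, and all names are hypothetical. The subsequent PCA and LDA stages are not shown; the flattened vectors returned here correspond to the trials x channels_timepoints features that would be passed to them.

```python
def resample_linear(signal, n_out):
    """Linearly interpolate a 1-D list of samples to n_out samples."""
    n_in = len(signal)
    if n_out == 1 or n_in == 1:
        return [float(signal[0])] * n_out
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)  # position in input coordinates
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


def segment_and_rescale(sleep_trial, wake_len, factor):
    """Cut one sleep trial into segments and stretch each to the wake length.

    sleep_trial: list of channels, each a list of samples.
    wake_len:    number of samples in a wake trial.
    factor:      tested speed-up of sleep relative to wake (2 means 2x faster).
    Returns one flattened channels_timepoints feature vector per segment.
    """
    seg_len = round(wake_len / factor)
    n_samples = len(sleep_trial[0])
    features = []
    # Non-overlapping windows: an assumption made for this sketch.
    for start in range(0, n_samples - seg_len + 1, seg_len):
        vec = []
        for channel in sleep_trial:
            segment = channel[start:start + seg_len]
            vec.extend(resample_linear(segment, wake_len))
        features.append(vec)
    return features
```

      Each returned vector would inherit the label of the sleep trial it was cut from, so a single sleep trial contributes several training examples for a given scaling factor.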

      For the analyses relating to classification performance and behavior, the authors presently show that there is a significant correlation for the cued sequence but not for the other sequence. This is a "difference of significances" but not a significant difference. To justify the claim that the correlation is sequence-specific, the authors would have to run an analysis that directly compares the two sequences.

      Thanks a lot. We have now followed this suggestion by examining the sequence-specific improvement after removing the effect of the un-cued sequence from the cued sequence. This was done by subtracting the improvement of the un-cued sequence from the improvement for the cued sequence, and then normalising the result by the improvement of the un-cued sequence. The resulting values, which we term ‘cued sequence improvement’, showed a significant correlation with classification performance (n = 14, r = 0.56, p = 0.04). We have therefore amended this section of the manuscript as follows: “We therefore set out to determine whether there was a relationship between the extent to which we could classify reactivation and overnight improvement on the cued sequence. This revealed a positive correlation (n = 14, r = 0.56, p = 0.04), Figure 3b.”

    1. Author response:

      Reviewer #1 (Public Review):

      In this study, Girardello et al. use proteomics to reveal the membrane tension sensitive caveolin-1 interactome in migrating cells. The authors use EM and surface rendering to demonstrate that caveolae formed at the rear of migrating cells are complex membrane-linked multilobed structures, and they devise a robust strategy to identify caveolin-1 associated proteins using APEX2-mediated proximity biotinylation. This important dataset is further validated using proximity ligation assays to confirm key interactions, and follows up with an interrogation of a surprising relationship between caveolae and RhoGTPase signalling, where caveolin-1 recruits ROCK1 under high membrane tension conditions, and ROCK1 activity is required to reform caveolae upon reversion to isotonic solution. However, caveolin-1 recruits the RhoA inactivator ARHGAP29 when membrane tension is low and ARHGAP29 overexpression leads to disassembly of caveolae and reduced cell motility. This study builds on previous findings linking caveolae to positive feedback regulation of RhoA signalling, and provides further evidence that caveolae serve to drive rear retraction in migration but also possess an intrinsic brake to limit RhoA activation, leading the authors to suggest that cycles of caveolae assembly and disassembly could thereby be central to establish a stable cell rear for persistent cell migration

      A major strength of the manuscript is the robust proteomic dataset. The experimental set up is well defined and mostly well controlled, and there is good internal validation in that the high abundance of core caveolar proteins in low membrane tension (isotonic) conditions, and their absence under high membrane tension (brief hypo-osmotic shock) conditions, correlate very well with previous findings. The data could however be better presented to show where statistically robust changes occur, and supplementary information should include a table showing abundance. It's very good to see a link to PRIDE, providing a useful resource for the community.

      We thank the reviewer for the positive feedback. We have included the outputs from the search engine in Supplementary File 1.

      The authors detail several known interactions and their mechanosensitivity, but also report new interactors of caveolin-1. Several mechanosensitive interactions of caveolin-1 take place at the cell rear, but others are more diffuse across the cell looking at the PLA data (e.g. FLN1, CTTN, HSPB1; Figure 4A-F and Figure 4 supplement 1). It is interesting to speculate that those at the cell rear are involved in caveolae, whilst others are linked specifically to caveolin-1 (e.g. dolines). PLA or localisation analysis with Cavin1/PTRF may be able to resolve this and further specify caveolae versus non-caveolae mechanosensitive interactions.

      We thank the reviewer for this interesting idea. It is true that many if not most proteins we identified to be associated with Cav1 are not restricted to the cell rear. To analyse to what extent the identified proteins interact with Cav1 at the rear, we reanalysed our PLA data for some of the antibody combinations we looked at. This new analysis is now shown in Fig 5G. As expected, for Cav1/PTRF and Cav1/EHD2 most PLA dots (70-80%) were found at the rear. This rear bias is also evident from the representative images we show in Figure panels 5A and 5E. By contrast, far fewer PLA dots (~40%) were rear-localised for the Cav1/CTTN and Cav1/FLNA antibody combinations. This reflects the much broader cellular distribution of these proteins compared to the core caveolae proteins, and might suggest that there are generally few links between caveolae and cortical actin. However, it is also possible that such links/interactions are more difficult to detect using PLA (because of the extended distance between caveolae and the actin cortex, or because of steric constraints).

      The Cav1/ARHGAP29 influence on YAP signalling is interesting, but appears to be quite isolated from the rest of the manuscript. Does overexpression of ARHGAP29 influence YAP signalling and/or caveolar protein expression/Cav1pY14?

      Our data and published work originally prompted us to speculate that there is a potential functional link between Cav1, YAP, and ARHGAP29. In an attempt to address this we have performed several Western blots on cell lysates from cells overexpressing ARHGAP29. We did not see major changes in Cav1 Y14 phosphorylation levels in cells overexpressing ARHGAP29, and YAP and pYAP levels also remained unchanged (not shown). In addition, based on previous literature 1,2 we expected to see an effect on ARHGAP29 mRNA levels and YAP target gene transcripts in Cav1 siRNA transfected cells. To our surprise, the mRNA levels of three independent YAP target genes and ARHGAP29 were unchanged in Cav1 siRNA treated cells (this is now shown in Figure 6 Figure Supplement 1). Our data therefore suggest that in RPE1 cells, the connection between Cav1 and ARHGAP29 is independent of YAP signalling, and that the increase in ARHGAP29 protein levels observed in Cav1 siRNA cells is due to some unknown post-translational mechanism.

      ARHGAP29 and RhoA/ROCK1 related observations are very interesting and potentially really important. However, the link between ARHGAP29 and caveolae is not well established (other than in proteomic data). PLA or FRET could help establish this.

      We agree that the physical and functional link between caveolae (or Cav1) and ARHGAP29 was not well worked out in the original manuscript. In an attempt to address this we have performed PLA assays in GFP-ARHGAP29 transfected cells (as we did not find a suitable ARHGAP29 antibody that works reliably in IF) using anti-Cav1 and anti-GFP antibodies. The PLA signal we obtained for Cav1 and ARHGAP29 was not significantly different to control PLA experiments. There was very little PLA signal to start with. This is not surprising given that ARHGAP29 localisation is mostly diffuse in the cytoplasm, whilst Cav1 is concentrated at the rear. In addition, in cases where we do see ARHGAP29 localisation at the cell cortex, Cav1 tends to be absent (this is now shown in Figure 6 – Figure Supplement 2E). In other words, with the tools we have available, we see little colocalization between Cav1 and ARHGAP29 at steady state. Altogether we speculate that ARHGAP29, through its negative effect on RhoA, flattens caveolae at the membrane or interferes with caveolae assembly at these sites.

      This of course prompts the question why ARHGAP29 was identified in the Cav1 proteome with such specificity and reproducibility in the first place? This can be explained by the way APEX2 labeling works. Proximity biotinylation with APEX2 is extremely sensitive and restricted to a labelling radius of ~20 nm 3. The labeling reaction is conducted on live and intact cells at room temperature for 1 min. Although 1 min appears short, dynamic cellular processes occur at the time scale of seconds and are ongoing during the labelling reaction. It is conceivable that within this 1 min time frame, ARHGAP29 cycles on and off the rear membrane (kiss and run). This allows ARHGAP29 to be biotinylated by Cav1-APEX2, resulting in its identification by MS. We have included this in the discussion section.

      The relationship between ARHGAP29 and RhoA signalling is not well defined. Is GAP activity important in determining the effect on migration and caveolae formation? What is the effect on RhoA activity? Alternatively, the authors could investigate YAP dependent transcriptional regulation downstream of overexpression.

      We have addressed this point using overexpression and siRNA transfections. We overexpressed ARHGAP29 or ARHGAP29 lacking its GAP domain and performed WB analysis against pMLC (which is a commonly used and reliable readout for RhoA and myosin-II activity). Much to our surprise, overexpression of ARHGAP29 increased (rather than decreased) pMLC levels, partially in a GAP-dependent manner (see Author response image 1). This is puzzling, as ARHGAP29 is expected to reduce RhoA-GTP levels, which in turn is expected to reduce ROCK activity and hence pMLC levels. In addition, and also surprisingly, siRNA-mediated silencing of ARHGAP29 did not significantly change pMLC levels. By contrast, pMLC levels were strongly reduced in Cav1 siRNA treated cells (this is shown in Fig. 6A and 6B in the revised manuscript). These new data underscore the important role of caveolae in the control of myosin-II activity, but do not allow us to draw any firm conclusions about the role of ARHGAP29 at the cell rear.

      Author response image 1.

      Overexpression of ARHGAP29 increases, rather than reduces, pMLC in RPE1 cells.

      We are uncertain as to how to interpret the ARHGAP29 overexpression data presented in Author response image 1 and therefore decided not to include it in the manuscript. One possibility is that inactivation of RhoA below a certain critical threshold causes other mechanisms to compensate. For instance, the activity of alternative MLC kinases such as MLCK could be enhanced under these conditions. Another possibility is that ARHGAP29 controls MLC phosphorylation indirectly. For instance, it has been shown that ARHGAP29 promotes actin destabilization through inactivating LIMK/cofilin signalling 1. In agreement with this, we find that overexpression of ARHGAP29 reduces p-cofilin (serine 3) levels (see Author response image 2). Since cofilin and MLC crosstalk 4, it is possible that increased pMLC levels are the result of a feedback loop that compensates for the effect of actin depolymerisation. This is now discussed in the discussion section. Whichever the case, we hope the reviewers understand that deeper mechanistic insight into the intricate mechanisms of Rho signalling at the cell rear is beyond the scope of this manuscript.

      Author response image 2.

      Overexpression of ARHGAP29 reduces p-cofilin levels in RPE1.

      Reviewer #2 (Public Review):

      Girardello et al investigated the composition of the molecular machinery of caveolae governing their mechano-regulation in migrating cells. Using live cell imaging and RPE1 cells, the authors provide a spatio-temporal analysis of cavin-3 distribution during cell migration and reveal that caveolae are preferentially localized at the rear of the cell in a stable manner. They further characterize these structures using electron tomography and reveal an organization into clusters connected to the cell surface. By performing a proteomic approach, they address the interactome of caveolin-1 proteins upon mechanical stimulation by exposing RPE1 cells to hypo-osmotic shock (which aims to increase cell membrane tension) or not as a control condition. The authors identify over 300 proteins, notably proteins related to actin cytoskeleton and cell adhesion. These results were further validated in cellulo by interrogating protein-protein interactions using proximity ligation assays and hypo-osmotic shock. These experiments confirmed previous data showing that high membrane tension induces caveolae disassembly in a reversible manner. Eventually, based on literature and on the results collected by the proteomic analysis, authors investigated more deeply the molecular signaling pathway controlling caveolae assembly upon mechanical stimuli. First, they confirm the targeting of ROCK1 with Caveolin-1 and the implication of the kinase activity for caveolae formation (at the rear of the cell). Then, they show that RhoGAP ARHGAP29, a factor newly identified by the proteomic analysis, is also implicated in caveolae mechano-regulation likely through YAP protein and found that overexpression of RhoGAP ARHGAP29 affects cell motility. Overall, this paper interrogated the role of membrane tension in caveolae located at the rear of the cell and identified a new pathway controlling cell motility.

      Strengths:

      Using a proximity-based proteomic assay, the authors reveal the protein network interacting with caveolae upon mechanical stimuli. This approach is elegant and allows the identification of a substantial new set of factors involved in the mechano-regulation of caveolin-1, some of which have been verified directly in the cell by PLA. This study provides a compelling set of data on the interactions between caveolae and its cortical network, which was so far ill-characterized.

      We thank the reviewer for this positive feedback.

      Weaknesses:

      The methodology demonstrating an impact of membrane tension is not precise enough to directly assess a direct role on caveolae at a subcellular scale, that is between the front and the rear of the cell. First, a better characterization of the "front-rear" cellular model is encouraged.

      We agree with the reviewer that a quantitative analysis of the caveolae front-rear polarity would strengthen our conclusions. To address this, we have analysed the localisation of Cav1 and cavins in detail and in a large pool of cells, both in fixed and live cells. Our quantification clearly shows that Cav1 and cavins are enriched at the cell rear. This is now shown in Figure 1 and Figure 1 - Figure Supplement 1. To demonstrate that Cav1/cavins are truly rear-localised we analysed live migrating cells expressing tagged Cav1 or cavins. This analysis, which was performed on several individual time lapse movies, showed that caveolae rear localisation is remarkably stable (e.g. Figure 1C and 1D). We also present novel data panels and movies showing caveolae dynamics during rear retractions, in dividing cells, and in cells that polarise de novo. This new data is now described in the first paragraph of the results section.

      Secondly, the authors frequently present osmotic shock as a "high membrane tension" stimulus. While osmotic shock is widely used in the field, this study is focused only on caveolae localized at the rear of the cell, and it remains unclear how a global mechanical stimulus triggered by an osmotic shock could mimic a local stimulus.

      We agree with the reviewer that osmotic shock will cause a global increase in membrane tension and therefore is only of limited value to understand how membrane tension is regulated at the rear, and how caveolae respond to such a local stimulus. It was not our aim nor is it our expertise to address such questions. To answer them, sophisticated optogenetic approaches or localised membrane tension measurements (e.g. through the use of the Flipper-TR probe) are needed. It is beyond the scope of this manuscript to perform such experiments. However, given the strong enrichment of caveolae at the cell rear, we believe it is justified to propose that the changes we observe in the proteome do (mostly) reflect changes in caveolae at the rear. We have now included several quantifications on fixed cells, live cells, and PLA assays to support that caveolae are highly enriched at the rear. In addition, and importantly, a recent preprint by the Roux lab shows that membrane tension gradients indeed exist in many migrating and non-migrating cells 5. Using very similar hypotonic shock assays, the Caswell lab also showed that low membrane tension at the rear is required for caveolae formation 6. We have included a section in the discussion in which we elaborate on how membrane tension is controlled in migrating cells, and how it might regulate caveolae rear localisation.

      In the present case, it remains unknown the extent to which this mechanical stress is physiologically relevant to mimic mechanical forces applied at the rear of a migrating cell.

      This is true. Our study does not address the nature of mechanical forces at the cell rear. This is a complex subject that is technically challenging to address, and therefore is beyond the scope of this manuscript.

      Some images are not satisfying to fully support the conclusions of the article.

      We agree that some of the images, in particular the ones presented for the PLA assays, do not always show a clear rear localisation of caveolae. We have explained above why this is the case. We hope that our new quantitative measurements, movies and figure panels address the reviewer’s concern.

      At this stage, an unbiased quantitative spatio-temporal analysis of caveolae upon well-defined mechanical stimuli is also needed.

      These are all very good points that were previously addressed beautifully by the Caswell group 6. To address this in part in our RPE1 cell system, we imaged RPE1 cells exposed to the ROCK inhibitor Y27632 (see Author response image 3). The data shows that cell rear retraction is impeded in response to ROCK inhibition, which is in line with several previous reports. Cavin-1 remained mostly associated with the cell rear, although the distribution appeared more diffuse. We believe this data does not add much new insight into how caveolae function at the rear, and hence was not included in the manuscript.

      Author response image 3.

      Effect of ROCK inhibition on cavin1 rear localisation and rear retraction. Cells were imaged one hour after the addition of Y27632.

      Cells on images, in particular Figure 1, are difficult to see. Signal-to-noise ratio in different cell areas could generate a bias. Since there is inconsistency in caveolae density and localization between Figures, more solid illustrations are needed along with quantitative analysis.

      As mentioned above, we have carefully analysed the localisation of caveolae in fixed cells (using Cav1 and cavin1 antibodies as well as Cav1 and cavin fusion proteins) and in live cells transfected with various different caveolae proteins. The analysis clearly demonstrates an enrichment of caveolae at the rear (Figure 1 and Figure 1 – Figure Supplement 1). Our tomography and TEM data supports this as well (Figure 2).

      References:

      1. Qiao Y, Chen J, Lim YB, et al. YAP Regulates Actin Dynamics through ARHGAP29 and Promotes Metastasis. Cell reports. 2017;19(8):1495-1502.

      2. Rausch V, Bostrom JR, Park J, et al. The Hippo Pathway Regulates Caveolae Expression and Mediates Flow Response via Caveolae. Curr Biol. 2019;29(2):242-255 e246.

      3. Hung V, Udeshi ND, Lam SS, et al. Spatially resolved proteomic mapping in living cells with the engineered peroxidase APEX2. Nat Protoc. 2016;11(3):456-475.

      4. Wiggan O, Shaw AE, DeLuca JG, Bamburg JR. ADF/cofilin regulates actomyosin assembly through competitive inhibition of myosin II binding to F-actin. Dev Cell. 2012;22(3):530-543.

      5. Juan Manuel García-Arcos AM, Julissa Sánchez Velázquez, Pau Guillamat, Caterina Tomba, Laura Houzet, Laura Capolupo, Giovanni D’Angelo, Adai Colom, Elizabeth Hinde, Charlotte Aumeier, Aurélien Roux. Actin dynamics sustains spatial gradients of membrane tension in adherent cells. bioRxiv 2024.07.15.603517. 2024.

      6. Hetmanski JHR, de Belly H, Busnelli I, et al. Membrane Tension Orchestrates Rear Retraction in Matrix-Directed Cell Migration. Dev Cell. 2019;51(4):460-475 e410.

      7. Tsai TY, Collins SR, Chan CK, et al. Efficient Front-Rear Coupling in Neutrophil Chemotaxis by Dynamic Myosin II Localization. Dev Cell. 2019;49(2):189-205 e186.

      8. Mueller J, Szep G, Nemethova M, et al. Load Adaptation of Lamellipodial Actin Networks. Cell. 2017;171(1):188-200 e116.

      9. De Belly H, Yan S, Borja da Rocha H, et al. Cell protrusions and contractions generate long-range membrane tension propagation. Cell. 2023.

      10. Matthaeus C, Sochacki KA, Dickey AM, et al. The molecular organization of differentially curved caveolae indicates bendable structural units at the plasma membrane. Nat Commun. 2022;13(1):7234.

      11. Sinha B, Koster D, Ruez R, et al. Cells respond to mechanical stress by rapid disassembly of caveolae. Cell. 2011;144(3):402-413.

      12. Lieber AD, Schweitzer Y, Kozlov MM, Keren K. Front-to-rear membrane tension gradient in rapidly moving cells. Biophysical journal. 2015;108(7):1599-1603.

      13. Shi Z, Graber ZT, Baumgart T, Stone HA, Cohen AE. Cell Membranes Resist Flow. Cell. 2018;175(7):1769-1779 e1713.

      14. Grande-Garcia A, Echarri A, de Rooij J, et al. Caveolin-1 regulates cell polarization and directional migration through Src kinase and Rho GTPases. The Journal of cell biology. 2007;177(4):683-694.

      15. Grande-Garcia A, del Pozo MA. Caveolin-1 in cell polarization and directional migration. Eur J Cell Biol. 2008;87(8-9):641-647.

      16. Ludwig A, Howard G, Mendoza-Topaz C, et al. Molecular composition and ultrastructure of the caveolar coat complex. PLoS biology. 2013;11(8):e1001640.

    1. Author Response

      Reviewer #1 (Public Review):

      The study presented by AL Seufert et al. follows the trajectory of trained immunity research in the context of sterile inflammatory diseases such as gout, cardiovascular disease and obesity. Previous studies in mice have shown that a 4 week Western-type diet is sufficient to induce systemic trained immunity, with gross reorganization of the bone marrow to support a potentiated inflammatory response [PMID: 29328911]. The current study demonstrates that feeding mice a Western-type diet (WD) or the more extreme Ketogenic diet (KD; where carbohydrates are essentially eliminated from the diet) for 2 weeks results in a state of increased monocyte-driven immune responsiveness when compared to standard chow diets (SC). This increased immune responsiveness after high-fat diet resulted in a deadly hyper-inflammatory response in the mice to endotoxin (LPS) challenge in vivo.

      These initial findings, as displayed in Figure 1, are made difficult to interpret because the authors use a mix of male and female mice coupled with very small sample sizes (n = 5 - 9). Male and female mice are shown to have dimorphic responses to LPS exposure in vivo, with males having elevated cytokine levels (TNF, IL-6, IL1β, and also, interestingly, IL-10) and increased rates of severe outcomes to LPS challenge [PMID: 27631979]. As a reader, it is impossible to discern from their methodological description what the proportion of the sexes was in each group, and one therefore cannot determine whether their data are skewed or biased due to sexually dimorphic responses to LPS rather than diet. Additionally, due to the very small sample sizes, the authors can't perform a stratified analysis based on sex to determine whether the diets are having the greatest effects in accordance with LPS-induced inflammation.

      The Reviewer brings up an important point. All studies with endotoxemia in wild-type conventional mice were carried out in 6–8-week-old female BALB/c mice, as mentioned in the Methods section under the “Ethical approval of animal studies” and “Endotoxin-induced model of sepsis” sections. This is extremely important to state more clearly in the Results text because, as Reviewer 1 correctly notes, sexual dimorphism and age differences can have very large effects on LPS treatment outcome. Because this was not stated clearly enough in the Results, the age, sex, and background of the mice are now explicitly stated in each Results and Figure Legend section for each experiment.

      When comparing SC to the KD, the authors identify large changes in the distribution of fatty acids circulating in the blood. The majority of the altered fatty acids were shown to be saturated fatty acids (SFA). Although Lauric, Myristic, and Myristovaccenic acid were the most altered after KD, the authors focus their research on the more thoroughly studied palmitic acid (PA).

      We followed up on multiple saturated fatty acids (SFAs; Myristic, Lauric, and Behenic acid) that were identified in the lipidomic data, and found no robust or repeatable phenotypes in vitro using physiologically relevant concentrations. The inability to reproduce some of the findings with these SFAs may be due to the instability of some of these fats in solution, and we plan to troubleshoot these assays in order to understand the complexity of SFA-dependent control of inflammation in macrophages. Please see Fig. R1 in this document for data showing LPS-stimulated BMDMs pre-treated with Myristic (Fig R1 A-C), Lauric (Fig R1 D-F), or Behenic (Fig R1 G-I) fatty acids. The physiological concentrations used in these studies were referenced from Perreault et al., 2014.

      Figure R1. The effect of Myristic Acid, Lauric Acid, and Behenic Acid on the response to LPS in macrophages. Primary bone marrow-derived macrophages (BMDMs) were isolated from age-matched (6-8 wk) C57BL/6 female and male mice. BMDMs were plated at 1 × 10^6 cells/mL and treated with either ethanol (EtOH; media with 0.05% or 0.35% ethanol to match MA and LA solutions respectively), media (Ctrl), LPS (10 ng/mL) for 24 h, or myristic or lauric acid (MA, LA stock diluted in 0.05% or 0.35% EtOH; conjugated to 2% BSA) for 24 h, with and without a secondary challenge with LPS (10 ng/mL). After the indicated time points, RNA was isolated and expression of (A, B) tnf, (D, E) il-6, and (G, H) il-1β was measured via qRT-PCR. RAW 264.7 macrophages were thawed and cultured for 3-5 days, pelleted and resuspended in DMEM containing 5% FBS and 2% BSA, and treated identically to the BMDMs, with behenic acid (BA stock diluted in 1.7% EtOH) used as the primary stimulus. Expression of (C) tnf, (F) il-6, and (I) il-1β was measured via qRT-PCR. For all plates, all treatments were performed in triplicate. For all panels, a Student’s t-test was used for statistical significance. *p < 0.05; **p < 0.01; ***p < 0.001. Error bars show mean ± SD.

      PA was shown to increase gene expression and protein production of the inflammatory cytokines TNF, IL-6 and IL-1β in bone marrow-derived macrophages (BMDMs). The authors tie these effects to ceramide synthesis through a pharmacological blockade as well as the use of oleic acid, which allegedly sequesters ceramide synthesis. The authors' claim that oleic acid supplementation reverses the inflammatory signaling induced by PA is invalid, as oleic acid was shown to induce a high level of cytokines in their model. When PA was added along with oleic acid, the cytokine levels returned to the levels produced by BMDMs stimulated with PA alone (see Figure 4 panels D-F).

      This was an unfortunate oversight in our revisions of this manuscript: original Figure 5A-C was mislabeled (though correctly colored). The labels OA-12h → LPS-24h and PA-12h → LPS-24h should have been switched. These data were labeled correctly in the source file (Source_data_Fig5), and Figure 5 of the manuscript has since been updated with the correct labels. The corrected graphs have been split up in the resubmission in light of new data collected. Please see Fig 3K-M and Fig 5A-C.

      Finally, the authors test whether injection of PA into mice can recapitulate the systemic inflammatory response seen with WD and KD feeding followed by LPS exposure. They were able to demonstrate that injecting 1 mM of PA, waiting for 12h, and then exposing the mice to LPS for 24h could similarly result in a hyper-inflammatory state resulting in greater mortality. The reviewer is skeptical that 1 mM of PA truly represents post-prandial PA levels as one would expect to see after a single fatty meal, and questions whether this injection is generally well tolerated by mice. Looking into the paper by Eguchi et al. cited to inform their methods, the earlier study continuously infused an emulsified ethyl palmitate solution (which contained 600 mM) at a rate of 0.2 uL/min. As far as I can read in Eguchi et al., they only managed to reach a serum PA concentration of 0.5 mM. This is hardly the same thing as a single i.p. injection of 1 mM PA, which reflects a single bolus injection of double the serum concentration of PA achieved by Eguchi et al.

      The reviewer brings up an important point: Eguchi et al. did use infusions. From their data (Fig 1A), we calculated that after i.v. infusion of the 600 mM solution (total = 267 uL within 14 h; 0.2 uL/min) there was ~420 uM absolute PA within the blood. They were using C57BL/6 mice that were 23 g on average. Using these results, we extrapolated that one single 200 uL injection of a 750 mM PA solution in 6–8-week female BALB/c mice (~15-18 g) would equate to ~0.5-1 mM of PA within the blood. Considering that obese healthy and unhealthy humans vary widely in total PA concentrations in the blood (0.3-4.1 mM) (1, 2), we moved forward with these calculations. That said, we thank the reviewer for this advice, and we agree that we have not definitively shown we are increasing systemic levels of PA. Thus, we ran a lipidomic analysis of serum from SC-fed mice with Veh or PA for 12 h. We show that a 750 mM i.p. injection of ethyl palmitate enhances free PA levels in the serum to 173-425 μM at 2 h post-injection, which is within the reported range for humans on high-fat diets (0.3-4.1 mM). We have added this new data to Fig. S7A of the main manuscript.

      Importantly, the concentration in the PA-treated mice is greater than that of the Veh-treated mice; however, we believe the value shown is an underestimate of the maximum serum PA levels achieved by i.p. injection, because free PA is known to be packaged into chylomicrons within enterocytes and to travel through the circulation with a half-life of less than an hour (3, 4). Thus, serum concentrations of free PA are only transiently enhanced by i.p. injection, and free PA is quickly taken up by adipose tissue, skeletal muscle, heart, and liver tissue. These complex lipid transport processes make it difficult to determine maximum concentrations of free PA in the serum.

      While all of the details concerning PA circulation following an i.p. injection are unknown, we suggest that this method of “force-feeding” is similar to dietary intake in that uptake of PA into the circulation occurs within the peritoneal space prior to traveling to the blood via the thoracic duct and right lymphatic duct (5).

      PA is known to induce inflammation in monocytes and macrophages, therefore the findings certainly make sense in the context of previously published literature. However, the authors have made some poor methodological decisions in their mouse studies, namely haphazardly switching between groups of young and old mice (4-6 weeks, 8-9 weeks, and 14-23 weeks), using different LPS injection protocols (6, 10, and 50 mg/ml of LPS), and including multiple sexes of mice. All of these drastically alter the interpretation of the data and prevent solid conclusions from being drawn.

      We appreciate this review and suggest that:

      1) For the LPS models, mice were all female and age-matched between 6-8 weeks. We are aware of sex differences in the endotoxemia model, which is why we specifically use female mice in our studies (6, 7). This is mentioned twice in the methods under the sections “Endotoxin-induced model of sepsis” and “Ethical approval of animal studies”. We have added these specifics of our model to all Results and Figure Legend sections for clarification.

      2) For Germ-free models, it is notoriously difficult to breed C57BL/6 germ-free mice. It was inherently difficult to obtain enough mice within the same sex and age to carry out these experiments, however since we have published in this model before with mixed sex and age we were aware that our WD phenotype is robust enough in these backgrounds (7). Further, we believe that seeing our robust phenotype independent of age or sex within germ-free mice provides more evidence of the strength of this phenotype. It is important to note that we induce endotoxemia within Germ-free mice with 50mg/kg, instead of 6mg/kg which is used in conventional mice, because this is our reported LD50 for mixed sex Germ-free C57BL/6, as we have published previously in detail (7). This difference is due to the presence of the microbiota (8, 9) and also germ-free mice have an immature immune system that correlates with a hyporesponsiveness to microbial products (10-12). We agree with the reviewer that the ages of the C57BL/6 germ-free mice are significantly older than our conventional 6-8 week mice, thus we confirmed that WD- and KD-fed conventional C57BL/6 female mice aged 20 – 21 weeks old still show enhanced disease severity and mortality in an LPS-induced endotoxemia model, compared to mice fed SC (Fig. S1G-H).

      Figure R2. PA treatment enhances survival in both female and male RAG-/- mice. Age-matched (8-9 wk) RAG-/- mice were injected i.v. with ethyl palmitate (PA, 750mM) or vehicle (Veh) solutions 12 h before C. albicans infection. Survival was monitored for 40h post-infection.

      3) In our preliminary results, we stratified survival during C. albicans infection between male and female C57BL/6 and found no notable difference in survival at 40h post IP infection with Candida albicans (Fig R2 A-B). However, the data presented in the manuscript on CFU is female kidney burden and we do not have data on fungal burden within male mice. This is an important piece of data that we would like to collect for understanding sex differences in the PA-dependent enhanced resistance to systemic C. albicans. We are currently addressing this question within the lab as well as elucidating the cell type and mechanism of PA-dependent enhanced fungal resistance.

    1. Author Response

      Reviewer #1 (Public Review):

      Esmaily and colleagues report two experimental studies in which participants make simple perceptual decisions, either in isolation or in the context of a joint decision-making procedure. In this "social" condition, participants are paired with a partner (in fact, a computer), they learn the decision and confidence of the partner after making their own decision, and the joint decision is made on the basis of the most confident decision between the participant and the partner. The authors found that participants' confidence, response times, pupil dilation, and CPP (i.e. the increase of centro-parietal EEG over time during the decision process) are all affected by the overall confidence of the partner, which was manipulated across blocks in the experiments. They describe a computational model in which decisions result from a competition between two accumulators, and in which the confidence of the partner would be an input to the activity of both accumulators. This model qualitatively produced the variation in confidence and RTs across blocks.

      The major strength of this work is that it puts together many ingredients (behavioral data, pupil and EEG signals, computational analysis) to build a picture of how the confidence of a partner, in the context of joint decision-making, would influence our own decision process and confidence evaluations. Many of these effects are well described already in the literature, but putting them all together remains a challenge.

      We are grateful for this positive assessment.

      However, the construction is fragile in many places: the causal links between the different variables are not firmly established, and it is not clear how pupil and EEG signals mediate the effect of the partner's confidence on the participant's behavior.

      We have modified the language of the manuscript to avoid the implication of a causal link.

      Finally, one limitation of this setting is that the situation being studied is very specific, with a joint decision that is not the result of an agreement between partners, but the automatic selection of the most confident decisions. Thus, whether the phenomenon of confidence matching also occurs outside of this very specific setting is unclear.

      We have now acknowledged this caveat in the discussion (lines 485 to 504). The final paragraph of the discussion now reads as follows:

      “Finally, one limitation of our experimental setup is that the situation being studied is confined to the design choices made by the experimenters. These choices were made in order to operationalize the problem of social interaction within the psychophysics laboratory. For example, the joint decisions were not made through verbal agreement (Bahrami et al., 2010, 2012). Instead, following a number of previous works (Bang et al., 2017, 2020), joint decisions were automatically assigned to the most confident choice. In addition, the partner’s confidence and choice were random variables drawn from a distribution prespecified by the experimenter and therefore, by design, unresponsive to the participant’s behaviour. In this sense, one may argue that the interaction partner’s behaviour was not “natural” since they did not react to the participant's confidence communications (note, however, that the partner’s confidence and accuracy were not entirely random but matched carefully to the participant’s behavior prerecorded in the individual session). How much of the findings are specific to this experimental setting and whether the behavior observed here would transfer to real-life settings is an open question. For example, it is plausible that participants may show some behavioral reaction to a human partner’s response time variations since there is some evidence indicating that for binary choices such as those studied here, response times also systematically communicate uncertainty to others (Patel et al., 2012). Future studies could examine the degree to which the results might be paradigm-specific.”

      Reviewer #2 (Public Review):

      This study is impressive in several ways and will be of interest to behavioral and brain scientists working on diverse topics.

      First, from a theoretical point of view, it very convincingly integrates several lines of research (confidence, interpersonal alignment, psychophysical, and neural evidence accumulation) into a mechanistic computational framework that explains the existing data and makes novel predictions that can inspire further research. It is impressive to read that the corresponding model can account for rather non-intuitive findings, such as that information about high confidence by your collaborators means people are faster but not more accurate in their judgements.

      Second, from a methodical point of view, it combines several sophisticated approaches (psychophysical measurements, psychophysical and neural modelling, electrophysiological and pupil measurements) in a manner that draws on their complementary strengths and that is most compelling (but see further below for some open questions). The appeal of the study in that respect is that it combines these methods in creative ways that allow it to answer its specific questions in a much more convincing manner than if it had used just either of these approaches alone.

      Third, from a computational point of view, it proposes several interesting ways by which biologically realistic models of perceptual decision-making can incorporate socially communicated information about others' confidence, to explain and predict the effects of such interpersonal alignment on behavior, confidence, and neural measurements of the processes related to both. It is nice to see that explicit model comparison favors one of these ways (top-down driving inputs to the competing accumulators) over others that may a priori have seemed more plausible but mechanistically less interesting and impactful (e.g., effects on response boundaries, no-decision times, or evidence accumulation).

      Fourth, the manuscript is very well written and provides just the right amount of theoretical introduction and balanced discussion for the reader to understand the approach, the conclusions, and the strengths and limitations.

      Finally, the manuscript takes open science practices seriously and employed preregistration, a replication sample, and data sharing in line with good scientific practice.

      We are grateful to the reviewer for their positive assessment of our work.

      Having said all these positive things, there are some points where the manuscript is unclear or leaves some open questions. While the conclusions of the manuscript are not overstated, there are unclarities in the conceptual interpretation, the descriptions of the methods, some procedures of the methods themselves, and the interpretation of the results that make the reader wonder just how reliable and trustworthy some of the many findings are that together provide this integrated perspective.

      We hope that our modifications and revisions in response to the criticisms listed below will be satisfactory. To avoid redundancies, we have combined each numbered comment with the corresponding recommendation for the Authors.

      First, the study employs rather small sample sizes of N=12 and N=15 and some of the effects are rather weak (e.g., the non-significant CPP effects in study 1). This is somewhat ameliorated by the fact that a replication sample was used, but the robustness of the findings and their replicability in larger samples can be questioned.

      Our study brings together questions from two distinct fields of neuroscience: perceptual decision making and social neuroscience. Each of these two fields has its own traditions and practical common sense. Typically, studies in perceptual decision making employ a small number of extensively trained participants (approximately 6 to 10 individuals). Social neuroscience studies, on the other hand, recruit larger samples (often more than 20 participants) without extensive training protocols. We therefore needed to strike a balance in this trade-off between number of participants and number of data points (e.g. trials) obtained from each participant. Note, for example, that each of our participants underwent around 4000 training trials. Strikingly, our initial study (N=12) yielded robust results that showed the hypothesized effects nearly completely, supporting the adequacy of our power estimate. However, we decided to replicate the findings because, like the reviewer, we believe in the importance of adequate sampling. We increased our sample size to N=15 participants to enhance the reliability of our findings. However, we acknowledge the limitation of generalizing to larger samples, which we have now discussed in our revised manuscript, where we have also included a cautionary note regarding further generalizations.

      To complement our results and add a measure of their reliability, here we provide the results of a power analysis that we applied to the data from study 1 (i.e. the discovery phase). These results demonstrate that the sample size of study 2 (i.e. replication) was adequate when conditioned on the results from study 1 (see table and graph pasted below). The results showed that N=13 would be an adequate sample size for 80% power for behavioural and eye-tracking measurements. Power analysis for the EEG measurements indicated that we needed N=17. Combining these power analyses, our sample size of N=15 for Study 2 was therefore reasonably justified.

      We have now added a section to the discussion (Lines 790-805) that communicates these issues as follows:

      “Our study brings together questions from two distinct fields of neuroscience: perceptual decision making and social neuroscience. Each of these two fields has its own traditions and practical common sense. Typically, studies in perceptual decision making employ a small number of extensively trained participants (approximately 6 to 10 individuals). Social neuroscience studies, on the other hand, recruit larger samples (often more than 20 participants) without extensive training protocols. We therefore needed to strike a balance in this trade-off between number of participants and number of data points (e.g. trials) obtained from each participant. Note, for example, that each of our participants underwent around 4000 training trials. Importantly, our initial study (N=12) yielded robust results that showed the hypothesized effects nearly completely, supporting the adequacy of our power estimate. However, we decided to replicate the findings in a new sample with N=15 participants to enhance the reliability of our findings and examine our hypothesis in a stringent discovery-replication design. In Figure 4-figure supplement 5, we provide the results of a power analysis that we applied to the data from study 1 (i.e. the discovery phase). These results demonstrate that the sample size of study 2 (i.e. replication) was adequate when conditioned on the results from study 1.”

      We conducted Monte Carlo simulations to determine the sample size required to achieve sufficient statistical power (80%) (Szucs & Ioannidis, 2017). In these simulations, we utilized the data from study 1. For each sample size (N, x-axis), we randomly selected N participants from our 12 participants in study 1, sampling with replacement. Subsequently, we applied the same GLMM used in the main text to assess the dependency of EEG signal slopes on social conditions (HCA vs LCA). To obtain an accurate estimate, we repeated the random sampling process 1000 times for each given sample size (N). Consequently, for a given sample size, we performed 1000 statistical tests using these randomly generated datasets. The proportion of statistically significant tests among these 1000 tests represents the statistical power (y-axis). We gradually increased the sample size until achieving an 80% power threshold, as illustrated in the figure. The number indicated by the red circle on the x-axis of this graph represents the designated sample size.
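
      The resampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used for the response: the GLMM is swapped for a plain one-sample t-test on one effect value per participant, and the function names (`is_significant`, `bootstrap_power`, `required_n`) and any input data are hypothetical.

```python
import random
import statistics

# Two-sided 5% critical values of Student's t for df = 1..30.
T_CRIT = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262,
          2.228, 2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101,
          2.093, 2.086, 2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052,
          2.048, 2.045, 2.042]


def is_significant(effects):
    """One-sample t-test of the mean effect against zero (alpha = 0.05).

    Assumption: this stands in for the GLMM used in the manuscript.
    """
    n = len(effects)
    mean = statistics.mean(effects)
    sd = statistics.stdev(effects)
    if sd == 0:  # every resampled value identical
        return mean != 0
    t = mean / (sd / n ** 0.5)
    return abs(t) > T_CRIT[n - 2]  # df = n - 1, list is 0-indexed


def bootstrap_power(pilot_effects, n, n_sim=1000, rng=None):
    """Share of significant tests over n_sim resamples of size n."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(n_sim):
        sample = rng.choices(pilot_effects, k=n)  # with replacement
        hits += is_significant(sample)
    return hits / n_sim


def required_n(pilot_effects, target=0.8, n_max=30, n_sim=1000, seed=0):
    """Smallest N (3..n_max) whose estimated power reaches the target."""
    rng = random.Random(seed)
    for n in range(3, n_max + 1):
        if bootstrap_power(pilot_effects, n, n_sim, rng) >= target:
            return n
    return None
```

      Calling `required_n` on a list of per-participant effect estimates from a discovery sample returns the smallest resampled N at which at least 80% of the simulated tests are significant, which corresponds to the point marked on the power curve.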

      Second, the manuscript interprets the effects of low-confidence partners as an impact of the partner's communicated "beliefs about uncertainty". However, it appears that the experimental setup also leads to greater outcome uncertainty (because the trial outcome is determined by the joint performance of both partners, which is normally reduced for low-confidence partners) and response uncertainty (because subjects need to consider not only their own confidence but also how that will impact on the low-confidence partner). While none of these other possible effects is conceptually unrelated to communicated confidence and the basic conclusions of the manuscript are therefore valid, the reader would like to understand to what degree the reported effects relate to slightly different types of uncertainty that can be elicited by communicated low confidence in this setup.

      We appreciate the reviewer’s advice to remain cautious about the possible sources of uncertainty in our experiment. In the Discussion (lines 790-801) we have now added the following paragraph.

      “We have interpreted our findings to indicate that social information, i.e. partner’s confidence, impacts the participants' beliefs about uncertainty. It is important to underscore here that, similar to real life, there are other sources of uncertainty in our experimental setup that could affect the participants' belief. For example, under joint conditions, the group choice is determined through the comparison of the choices and confidences of the partners. As a result, the participant has a more complex task of matching their response not only with their perceptual experience but also coordinating it with the partner to achieve the best possible outcome. For the same reason, there is greater outcome uncertainty under joint vs individual conditions. Of course, these other sources of uncertainty are conceptually related to communicated confidence but our experimental design aimed to remove them, as much as possible, by comparing the impact of social information under high vs low confidence of the partner.”

      In addition to the above, we would like to clarify one point here with specific respect to the comment. Note that the computer-generated partner’s accuracy was identical under high and low confidence. In addition, our behavioral findings did not show any difference in accuracy under HCA and LCA conditions. As a consequence, the argument that “the trial outcome is determined by the joint performance of both partners, which is normally reduced for low-confidence partners” is not valid because the low-confidence partner’s performance is identical to that of the high-confidence partner. It is possible, of course, that we have misunderstood the reviewer’s point here and we would be happy to discuss this further if necessary.

      Third, the methods used for measurement, signal processing, and statistical inference in the pupil analysis are questionable. For a start, the methods do not give enough details as to how the stimuli were calibrated in terms of luminance etc so that the pupil signals are interpretable.

      Here we provide in Author response image 1 the calibration plot for our eye tracking setup, describing the relationship between pupil size and display luminance. Luminance of the random dot motion stimuli (i.e. white dots on black background) was Cd/m2 and, importantly, identical across the two critical social conditions. We hope that this additional detail satisfies the reviewer’s concern. For the purpose of brevity, we have decided against adding this part to the manuscript and supplementary material.

      Author response image 1.

      Calibration plot for the experimental setup. Average pupil size (arbitrary units from eyelink device) is plotted against display luminance. The plot is obtained by presenting the participant with uniform full screen displays with 10 different luminance levels covering the entire range of the monitor RGB values (0 to 255) whose luminance was separately measured with a photometer. Each display lasted 10 seconds. Error bars are standard deviation between sessions.

      Moreover, while the authors state that the traces were normalized to a value of 0 at the start of the ITI period, the data displayed in Figure 2 do not show this normalization but different non-zero values. Are these data not normalized, or was a different procedure used? Finally, the authors analyze the pupil signal averaged across a wide temporal ITI interval that may contain stimulus-locked responses (there is not enough information in the manuscript to clearly determine which temporal interval was chosen and averaged across, and how it was made sure that this signal was not contaminated by stimulus effects).

      We have now added the following details to the Methods section (lines 1106-1135).

      “In both studies, eye movements were recorded by an EyeLink 1000 (SR Research) device with a sampling rate of 1000 Hz, which was controlled by a dedicated host PC. The device was set in a desktop and pupil-corneal reflection mode while data from the left eye was recorded. At the beginning of each block, the system was recalibrated and then validated by a 9-point schema presented on the screen. For one subject, a 3-point schema was used due to repetitive calibration difficulty. Having reached a detection error of less than 0.5°, the participants proceeded to the main task. Acquired eye data for pupil size were used for further analysis. Data of one subject in the first study was removed from further analysis due to storage failure.

      Pupil data were divided into separate epochs and data from the Inter-Trial Interval (ITI) were selected for analysis. The ITI was defined as the time between the offset of the trial (t) feedback screen and the stimulus presentation of trial (t+1). Then, blinks and jitters were detected and removed using linear interpolation. Values of pupil size before and after the blink were used for this interpolation. Data was also band-pass filtered using a Butterworth filter (second order, [0.01, 6] Hz)[50]. The pupil data was z-scored and then baseline corrected by removing the average of the signal in the [-1000, 0] ms interval before ITI onset. For the statistical analysis (GLMM) in Figure 2, we used the average of the pupil signal in the ITI period. Therefore, no pupil value is contaminated by the upcoming stimuli. Importantly, trials with ITI>3s were excluded from analysis (365 out of 8800 for study 1 and 128 out of 6000 for study 2; see also table S7 and Selection criteria for data analysis in Supplementary Materials).”
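
      The preprocessing order described above (blink interpolation, z-scoring, baseline correction, ITI averaging) can be sketched in a few lines. This is a hedged illustration, not the authors' pipeline: the function names and the boolean blink mask are assumptions, blink detection itself is not shown, and the Butterworth band-pass step is omitted to keep the sketch dependency-free.

```python
import statistics


def interpolate_blinks(trace, blink_mask):
    """Linearly interpolate across runs of samples flagged as blinks,
    using the valid samples just before and just after each run."""
    out = list(trace)
    n = len(out)
    i = 0
    while i < n:
        if not blink_mask[i]:
            i += 1
            continue
        start = i
        while i < n and blink_mask[i]:
            i += 1
        left = out[start - 1] if start > 0 else None
        right = out[i] if i < n else None
        if left is None and right is None:
            raise ValueError("trace contains no valid samples")
        if left is None:   # blink at the very start: hold the next value
            left = right
        if right is None:  # blink at the very end: hold the last value
            right = left
        gap = i - start + 1
        for k in range(start, i):
            frac = (k - start + 1) / gap
            out[k] = left + frac * (right - left)
    return out


def zscore(trace):
    """Standardise a trace to zero mean and unit (population) SD."""
    mu = statistics.fmean(trace)
    sd = statistics.pstdev(trace)
    return [(x - mu) / sd for x in trace]


def baseline_correct(trace, iti_onset, baseline_len):
    """Subtract the mean of the baseline_len samples before iti_onset.
    At 1000 Hz, a 1000 ms baseline corresponds to baseline_len=1000."""
    base = statistics.fmean(trace[iti_onset - baseline_len:iti_onset])
    return [x - base for x in trace]


def iti_mean(trace, iti_onset, iti_offset):
    """Average pupil signal over the ITI window [iti_onset, iti_offset)."""
    return statistics.fmean(trace[iti_onset:iti_offset])
```

      The value returned by `iti_mean` is the per-trial quantity that would enter a trial-level model; because the window closes at stimulus onset of the next trial, it contains no samples recorded after that stimulus appeared.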

      Fourth, while the EEG analysis in general provides interesting data, the link to the well-established CPP signal is not entirely convincing. CPP signals are usually identified and analyzed in a response-locked fashion, to distinguish them from other types of stimulus-locked potentials. One crucial feature here is that the CPPs in the different conditions reach a similar level just prior to the response. This is either not the case here, or the data are not shown in a format that allows the reader to identify these crucial features of the CPP. It is therefore questionable whether the reported signals indeed fully correspond to this decision-linked signal.

      Fifth, the authors present some effective connectivity analysis to identify the neural mechanisms underlying the possible top-down drive due to communicated confidence. It is completely unclear how they select the "prefrontal cortex" signals here that are used for the transfer entropy estimations, and it is in fact even unclear whether the signals they employ originate in this brain structure. In the absence of clear methodical details about how these signals were identified and why the authors think they originate in the prefrontal cortex, these conclusions cannot be maintained based on the data that are presented.

      Sixth, the description of the model fitting procedures and the parameter settings are missing, leaving it unclear for the reader how the models were "calibrated" to the data. Moreover, for many parameters of the biophysical model, the authors seem to employ fixed parameter values that may have been picked based on any criteria. This leaves the impression that the authors may even have manually changed parameter values until they found a set of values that produced the desired effects. The model would be even more convincing if the authors could for every parameter give the procedures that were used for fitting it to the data, or the exact criteria that were used to fix the parameter to a specific value.

      Seventh, on a related note, the reader wonders about some of the decisions the authors took in the specification of their model. For example, why was it assumed that the parameters of interest in the three competing models could only be modulated by the partner's confidence in a linear fashion? A non-linear modulation appears highly plausible, so extreme values of confidence may have much more pronounced effects. Moreover, why were the confidence computations assumed to be finished at the end of the stimulus presentation, given that for trials with RTs longer than the stimulus presentation, the sensory information almost certainly reverberated in the brain network and continued to be accumulated (in line with the known timing lags in cortical areas relative to objective stimulus onset)? It would help if these model specification choices were better justified and possibly even backed up with robustness checks.

      Eighth, the fake interaction partners showed several properties that were highly unnatural (they did not react to the participant's confidence communications, and their response times were random and thus unrelated to confidence and accuracy). This questions how much the findings from this specific experimental setting would transfer to other real-life settings, and whether participants showed any behavioral reactions to the random response time variations as well (since several studies have shown that for binary choices like here, response times also systematically communicate uncertainty to others). Moreover, it is also unclear how the confidence convergence simulated in Figure 3d can conceptually apply to the data, given that the fake subjects did not react to the subject's communicated confidence as in the simulation.

    1. Author Response

      Reviewer #1 (Public Review):

      This work by Shen et al. demonstrates a single molecule imaging method that can track the motions of individual protein molecules in dilute and condensed phases of protein solutions in vitro. The authors applied the method to determine the precise locations of individual molecules in 2D condensates, which show heterogeneity inside condensates. Using the time-series data, they could obtain the displacement distributions in both phases, and by assuming a two-state model of trapped and mobile states for the condensed phase, they could extract diffusion behaviors of both states. This approach was then applied to 3D condensate systems, and it was shown that the estimates from the model (i.e., mobile fraction and diffusion coefficients) are useful to quantitatively compare the motions inside condensates. The data can also be used to reconstruct the FRAP curves, which experimentally quantify the mobility of the protein solution.

      This work introduces an experimental method to track single molecules in a protein solution and analyzes the data based on a simple model. The simplicity of the model helps a clear understanding of the situation in a test tube, and I think that the model is quite useful in analyzing the condensate behaviors and it will benefit the field greatly. However, the manuscript in its current form fails to situate the work in the right context; many previous works are omitted in this manuscript, exaggerating the novelty of the work. Also, the two-state model is simple and useful, but I am concerned about the limits of the model. They extract the parameters from the experimental data by assuming the model. It is also likely that the molecules have a continuum between fully trapped and fully mobile states, and that this continuum model can also explain the experimental data well.

      We thank the reviewer for the warm overview of our work and the insightful comments on the areas that need to be improved. We are very encouraged by the reviewer’s general positive assessment of our approach. We have addressed these comments in the revised manuscript.

      Reviewer #2 (Public Review):

      In this paper, Shen and co-workers report the results of experiments using single particle tracking and FRAP combined with modeling and simulation to study the diffusion of molecules in the dense and dilute phases of various kinds of condensates, including those with strong specific interactions as well as weak specific interactions (IDR-driven). Their central finding is that molecules in the dense phase of condensates with strong specific interactions tend to switch between a confined state with low diffusivity and a mobile state with a diffusivity that is comparable to that of molecules in the dilute phase. In doing so, the study provides experimental evidence for the effect of molecular percolation in biomolecular condensates.

      Overall, the experiments are remarkably sophisticated and carefully performed, and the work will certainly be a valuable contribution to the literature. The authors' inquiry into single particle diffusivity is useful for understanding the dynamics and exchange of molecules and how they change when the specific interaction is weak or strong. However, there are several concerns regarding the analysis and interpretation of the results that need to be addressed, and some control experiments that are needed for appropriate interpretation of the results, as detailed further below.

      We thank the reviewer for the warm support of our work (assessing that our work is “remarkably sophisticated and carefully performed” and “will certainly be a valuable contribution”) and for the constructive comments/critiques, which we have now addressed in the revised manuscript (please refer to our detailed responses below).

      (1) The central finding that the molecules tend to experience transiently confined states in the condensed phase is remarkable and important. This finding is reminiscent of transient "caging"/"trapping" dynamics observed in diverse other crowded and confined systems. Given this, it is very surprising to see the authors interpret the single-molecule motion as being 'normal' diffusion (within the context of a two-state diffusion model), instead of analyzing their data within the context of continuous time random walks or anomalous diffusion, which is generally known to arise from transient trapping in crowded/confined systems. It is not clear that interpreting the results within the context of simple diffusion is appropriate, given their general finding of the two confined and mobile states. Such a process of transient trapping/confinement is known to lead to transient subdiffusion at short times and then diffusive behavior at sufficiently long times. There is a hint of this in the inset of Fig 3, but these data need to be shown on log-log axes to be clearly interpreted. I encourage the authors to think more carefully and critically about the nature of the diffusive model to be used to interpret their results.

      We thank the reviewer for the insightful comments and suggestions, which have been very helpful for us to think deeper about the experimental data and the possible underlying mechanism of our findings. Indeed, the phase-separated systems studied here resemble crowded and confined systems with transient caging/trapping dynamics previously studied in the literature (see, for example, Akimoto et al., 2011; Bhattacharjee and Datta, 2019; Wong et al., 2004; these references have been added in the revised manuscript). In our PSD system in Figure 3, the caging/trapping of NR2B in the condensed phase is likely due to its binding to the percolated PSD network. Thus, NR2B molecules in the condensed phase should undergo subdiffusive motions. Indeed, from our single molecule tracking data, the motion of NR2B fits well with the continuous time random walk (CTRW) model, as surmised by this reviewer. We have now fitted the MSD curve of all tracks of NR2B in the condensed phase with an anomalous diffusion model: $\mathrm{MSD}(t) = 4Dt^{\alpha}$ (see Response Figure 1 below). The fitted α is 0.74±0.03, indicating that NR2B molecules in the condensed phase indeed undergo sub-diffusive motions. The fitted diffusion coefficient D is 0.014±0.001 μm²/s. We have now replaced the Brownian motion fitting in Figure 3E in the original manuscript with this sub-diffusive model fitting in the revised manuscript to highlight the complexity of the NR2B diffusion we observed in the PSD condensed phase.

      Response Figure 1: Fit of the MSD curve (mean value as red dot with standard error as error bar) in the condensed phase with an anomalous diffusion model (blue curve, $\mathrm{MSD}(t) = 4Dt^{\alpha}$). The fitting gives D=0.014±0.001 μm²/s and α=0.74±0.03.
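
      For readers who want to reproduce this kind of fit, the power law MSD(t) = 4Dt^α becomes a straight line after taking logarithms, log MSD = log(4D) + α log t, so α is the slope and D follows from the intercept. The sketch below uses ordinary least squares in log-log space; this is an assumption made for illustration (the response does not state which fitting routine was used), and the function name and lag times are hypothetical.

```python
import math


def fit_anomalous_diffusion(lags, msd):
    """Fit MSD(t) = 4 * D * t**alpha by least squares on log-log axes.

    lags: positive lag times (s); msd: positive MSD values (um^2).
    Returns (D, alpha). Unweighted fit: error bars are ignored here.
    """
    xs = [math.log(t) for t in lags]
    ys = [math.log(m) for m in msd]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                    # slope of the log-log line
    intercept = mean_y - alpha * mean_x  # equals log(4 * D)
    d_coeff = math.exp(intercept) / 4.0
    return d_coeff, alpha
```

      A slope below 1 on the log-log plot is the signature of subdiffusion, which is why the reviewer asked for the data on log-log axes; a weighted or nonlinear fit would be preferable when the standard errors differ strongly across lag times.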

      We find it useful to interpret the apparent diffusion coefficient (D=0.014±0.001 μm2/s) derived from this particular anomalous diffusion model as containing information of NR2B motions in a broadly construed mobile state (i.e., corresponding to the network unbound form) as well as in a broadly construed confined state (i.e., corresponding to NR2B molecules bound to percolated PSD networks). The global fitting using the sub-diffusive model does not pin down motion properties of NR2B in these different motion states. This is why we used, at least as a first approximation, the two-state motion switch model (HMM model) to analyse our data (please refer also to our detailed response to the comment #7 from reviewer 1 and corresponding additional analyses made during the revision as highlighted in Response Figure 4).

      As described in our response to the comment points #4 and #7 from reviewer 1, the two- state model is most likely a simplification of NR2B motions in the condensed phase. Both the mobile state and the confined state in our simplified interpretative framework likely represent ensemble averages of their respective motion states. However, the tracking data available currently do not allow us to further distinguish the substates, but further analysis using more refined model in the future may provide more physical insight, as we now emphasize in the revised “Discussion” section: “With this in mind, the two motion states in our simple two-state model for condensed-phase dynamics should be understood to be consisting of multiple sub-states. For instance, one might envision that the percolated molecular network in the condensed phase is not uniform (e.g., existence of locally denser or looser local networks) and dynamic (i.e., local network breaking and forming). Therefore, individual proteins binding to different sub-regions of the network will have different motion properties/states. … In light of this basic understanding, the “confined state” and “mobile state” as well as the derived diffusion coefficients in this work should be understood as reflections of ensemble-averaged properties arising from such an underlying continuum of mobilities. Further development of experimental techniques in conjunction with more refined models of anomalous diffusion (Joo et al., 2020; Kuhn et al., 2021; Muñoz-Gil et al., 2021) will be necessary to characterize these more subtle dynamic properties and to ascertain their physical origins” (p.23 of the revised manuscript).

      A practical reason for using the two-state motion switch HMM model to analyse our tracking data in the condensed phase is that the lifetime of the putative mobile state (when the per-frame molecular displacements are relatively large) is very short, and such relatively faster short trajectories are interspersed with long confined states (see Response Figure 4C for an example). Statistically, ascertaining a particular anomalous diffusion model by fitting to such short tracks is unlikely to be reliable. Therefore, here we opted for a semi-quantitative interpretative framework by using fitted diffusion coefficients in a two-state HMM as well as the new correlation-based approach for demarcating a low-mobility state and a high-mobility state (see our detailed response to reviewer 1’s point #7) in the present manuscript (which is quite an extensive study already), while leaving refinements of our computational modelling to future efforts.
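      The two-state picture described above can be sketched as a small decoding exercise. The following is a minimal illustration, not the fitting procedure used in the manuscript: it holds the two diffusion coefficients at the reported values (0.17 and 0.013 μm²/s), takes the 0.03 s frame interval, and assumes a hypothetical per-frame probability of staying in the same state (`p_stay = 0.9`) and a 2D free-diffusion step-length density for each state. The function names and the example step lengths are invented for illustration.

```python
import math


def log_step_pdf(r, D, dt):
    """Log density of a 2D step length r (> 0) for free diffusion with coefficient D."""
    return math.log(r) - math.log(2.0 * D * dt) - r * r / (4.0 * D * dt)


def viterbi_two_state(steps, D_by_state, dt, p_stay=0.9):
    """Most likely state sequence for a track, given fixed D per state.

    p_stay is an assumed (hypothetical) per-frame probability of not switching.
    """
    states = list(D_by_state)
    log_stay, log_switch = math.log(p_stay), math.log(1.0 - p_stay)
    # Equal prior on both states for the first step.
    score = {s: math.log(0.5) + log_step_pdf(steps[0], D_by_state[s], dt)
             for s in states}
    back = []
    for r in steps[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(
                states,
                key=lambda p: score[p] + (log_stay if p == s else log_switch),
            )
            trans = log_stay if best_prev == s else log_switch
            new_score[s] = score[best_prev] + trans + log_step_pdf(r, D_by_state[s], dt)
            pointers[s] = best_prev
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the most likely final state.
    path = [max(states, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Diffusion coefficients in um^2/s (values quoted in the response), frame time in s.
D_by_state = {"mobile": 0.17, "confined": 0.013}
# Illustrative step lengths in um: short steps with a brief burst of long ones.
steps = [0.02, 0.03, 0.04, 0.02, 0.20, 0.18, 0.03, 0.02]
path = viterbi_two_state(steps, D_by_state, dt=0.03)
print(path)
```

      With these made-up step lengths, the two long steps are assigned to the mobile state and the rest to the confined state, which mirrors the short mobile bursts interspersed with long confined stretches described above. A real analysis would estimate the diffusion coefficients and transition probabilities from the data rather than fixing them.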

      Even in the context of the 'normal' two-state diffusion model they present, if they wish to stick with that (although it seems inappropriate to do so), can the authors provide some physical intuition for what exactly sets the diffusivities they extract from their data (0.17 and 0.013 microns squared per second for the mobile and confined states)? Can these be understood using, e.g., the Stokes-Einstein or Ogston models somehow?

      As stated above, we are in general agreement with this reviewer that the motion of NR2B in the condensed phase is more complex than the simple two-state picture we adopted as a semi-quantitative interpretation that is adequate for our present purposes. Within the multi-pronged analysis we have performed thus far, NR2B molecules clearly undergo anomalous diffusion in solution containing dense, percolated, and NR2B-binding molecular networks. As a first approximation, our simple two-state HMM analysis yielded two simple diffusion coefficients (0.17 μm²/s for the mobile state and 0.013 μm²/s for the confined state). We regard the diffusion coefficient in the mobile state as providing a time scale for relatively faster diffusive motions (which may be further classified into various motion substates in the future) of molecules that are not bound or only weakly associated with the percolated network of strong interactions in the PSD condensed phase. For the confined or low-mobility state in our present formulation, the molecules are likely bound relatively tightly to the percolated networks, thus the diffusion coefficient should be much smaller than that of the unbound form (i.e., the mobile state) according to the Stokes-Einstein model. However, due to the detection limit of the super-resolution imaging method (resolution of ~20 nm), we could not definitively determine the actual diffusivity beyond the resolution limit. So the diffusion coefficient in the confined state can also be interpreted as arising from a Gaussian-distributed microscope detection error, f(x) = (1/(σ√(2π))) exp(-x²/(2σ²)), i.e., x ~ N(0, σ²), where σ is the standard deviation of the Gaussian distribution viewed as the resolution of localization-based microscopy and x is the detection error between the recorded localization and the molecule’s actual position. The track length in the confined state is the distance between localizations in consecutive frames, which can be calculated by subtraction of two independent Gaussian-distributed variables, so the distribution of this track length (r) will be r ~ N(0, 2σ²). To link the detection error with the fitted diffusion coefficient, we calculated the log likelihood function of the Gaussian-distributed localization error (whose r-dependent term is -r²/(4σ²), where σ is the standard deviation of the Gaussian distribution) for the maximum likelihood estimation process to fit the HMM model. The random walk shares a similar log likelihood term (-r²/(4Dt)) in performing maximum likelihood estimation.

      These two log likelihood functions will produce the same fitting results with 4σ² equivalent to 4Dt (i.e., σ² = Dt) according to the likelihood function. In this way, the diffusion coefficient yielded by our HMM analyses for the confined state (0.0127 μm²/s) can be interpreted in terms of the standard deviation of the localization detection error (or microscope resolution limit), which is σ = √(Dt) = 19.5 nm. We have included this consideration as an alternate interpretation of the confined-state or low-mobility motions with the results now provided in the “Materials and Methods” section in the sentence, viz., “… the L-component distribution may be reasonably fitted (albeit with some deviations, see below) to a simple-diffusion functional form with a parameter s = 13.6 ± 3.7 nm, where s may be interpreted as a microscope detection error due to imaging limits or alternately expressed as s = √(D_L t) with D_L = 0.006149 μm²/s being the fitted confined-state diffusion coefficient and t = 0.03 s being the time interval between experimental frames. (The HMM-estimated confined-state Dc = 0.0127 μm²/s corresponds to s = 19.5 nm)” (p.32 of the revised manuscript).
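      The conversion between a fitted confined-state diffusion coefficient and an equivalent localization error can be checked with a few lines of arithmetic. This is a minimal sketch: the helper name `sigma_nm` is invented here, and the only inputs are the diffusion coefficients and the 0.03 s frame interval quoted in the response. The quoted values of 19.5 nm and 13.6 nm are reproduced by σ = √(D·t).

```python
import math


def sigma_nm(D_um2_per_s, dt_s):
    """Equivalent localization error sigma = sqrt(D * t), converted from um to nm."""
    return math.sqrt(D_um2_per_s * dt_s) * 1000.0


frame_interval_s = 0.03  # time between experimental frames, as quoted in the response

sigma_hmm = sigma_nm(0.0127, frame_interval_s)    # HMM-estimated confined-state D
sigma_fit = sigma_nm(0.006149, frame_interval_s)  # fitted L-component D

print(round(sigma_hmm, 1))  # 19.5
print(round(sigma_fit, 1))  # 13.6
```

      Both values sit at or below the stated ~20 nm resolution, which is why the confined-state coefficient cannot be cleanly separated from localization error.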

      (2) Equation 1 (and hence equation 2) is concerning. Consider a limit when P_m=1, that is, in the condensed phase, there are no confined particles, then the model becomes a diffusion equation with spatially dependent diffusivity, \partial c /\partial t = \nabla * (D(x) \nabla c). The molecules' diffusivity D(x) is D_d in the dilute phase and D_m in the condensed phase. No matter what values D_d and D_m are, at equilibrium the concentration should always be uniform everywhere. According to Equation 1, the concentration ratio will be D_d/D_m, so if D_d/D_m \neq 1, a concentration gradient is generated spontaneously, which violates the second law of thermodynamics. Can the authors please justify the use of this equation?

      Indeed, the derivation of Equation 1 appears to be concerning. The flux J is proportional to D * dc/dx (not kDc as in the manuscript). At equilibrium dc/dx = 0 on both sides and c is constant everywhere. Can the authors please comment?

      So then another question is, why does the Monte Carlo simulation result agree with Equation 1? I suspect this has to do with the behavior of particles crossing the boundary. Consider another limit where D_m = 0, that is, particles freeze in the condensed phase. If once a particle enters the condensed phase, it cannot escape, then eventually all particles will end up in the condensed phase and EF=infty. The authors likely used this scheme. But as mentioned above this appears to violate the second law.
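      The reviewer's argument in point (2) can be illustrated with a small deterministic lattice calculation. This is a sketch under stated assumptions, not a reproduction of the manuscript's Monte Carlo scheme: the lattice size, the diffusivities (`D_d = 1.0`, `D_m = 0.25`), and the two update rules are chosen here for illustration. With a Fickian flux (driven by the concentration difference), a closed system relaxes to a uniform concentration whatever the diffusivities are. With a rule in which each site simply emits particles at a rate proportional to its own D times its own concentration, the steady state instead has a concentration ratio of D_d/D_m. Which rule, if either, corresponds to the authors' boundary handling is not stated in the text.

```python
def relax(D, rule, c0, dt=0.2, steps=5000):
    """Explicit time-stepping on a 1D lattice with closed (no-flux) ends."""
    c = list(c0)
    for _ in range(steps):
        new = c[:]
        for i in range(len(c) - 1):
            if rule == "fick":
                # Fick's law: flux driven by the concentration difference,
                # with a harmonic-mean diffusivity at the interface.
                d_face = 2.0 * D[i] * D[i + 1] / (D[i] + D[i + 1])
                flux = d_face * (c[i] - c[i + 1])
            else:
                # "Departure-rate" rule: each site emits particles at a rate
                # proportional to its own D times its own concentration.
                flux = D[i] * c[i] - D[i + 1] * c[i + 1]
            new[i] -= dt * flux
            new[i + 1] += dt * flux
        c = new
    return c


D_d, D_m = 1.0, 0.25              # illustrative dilute / condensed diffusivities
D = [D_d] * 5 + [D_m] * 5         # left half "dilute", right half "condensed"
c0 = [2.0] * 5 + [0.0] * 5        # start with everything in the dilute half

fick = relax(D, "fick", c0)
hop = relax(D, "hop", c0)

print([round(x, 3) for x in fick])   # uniform profile
print(round(hop[-1] / hop[0], 3))    # condensed/dilute ratio equals D_d / D_m
```

      Both rules conserve the total amount of material; they differ only in the steady state they select, which is the crux of the reviewer's concern about the form of the flux.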

      Thanks for the incisive comment. After much in-depth consideration, we agree with the reviewer that Eq.1 should not be presented as a relation that is generally applicable to diffusive motions of molecules in all phase-separated systems. There are cases in which this relation can lead to unphysical outcomes, as correctly pointed out by the reviewer.

      Nonetheless, based on our theoretical/computational modeling, it is also clear, empirically, that Eq.1 holds approximately for the NR2B/PSD system we studied, and as such it is a useful approximate relation in our analysis. We have therefore provided a plausible physical perspective for Eq.1’s applicability as an approximate relation based upon a schematic consideration of diffusion on an underlying rugged (free) energy landscape (Zhang and Chan, 2012) of a phase-separated system (See Figure 3G in the revised manuscript), while leaving further studies of such energy landscape models to future investigations.

      This additional perspective is now included in the following added passage under a new subheading in the revised manuscript:

      "Physical picture and a two-state, two-phase diffusion model for equilibrium and dynamic properties of PSD condensates"

      (3) Despite the above two major concerns described in (1) and (2), the enrichment due to the presence of a "confined state", is reasonable. The equilibrium between "confined" and "mobile" states is determined by its interaction with the other proteins and their ratio at equilibrium corresponds to the equilibrium constant. Therefore EF=1/Pm is reasonable and comes solely from thermodynamics. In fact, the equilibrium partition between the dilute and dense phases should solely be a thermodynamic property, and therefore one may expect that it should not have anything to do with diffusivity. Can the authors please comment on this alternative interpretation?

      Thanks for this thought-provoking comment. We agree with the reviewer that the relative molecular densities in the condensed versus dilute phases are governed by thermodynamics unless there is energy input into the system. However, in our formulation, the mobile ratio should not be the only parameter determining the enrichment fold in a phase-separated system. In fact, the approximate relation (Eq.1) is EF ≈ Dd/(Pm·Dm), and thus EF ≈ 1/Pm only when Dd ≈ Dm. But the speed of mobile-state diffusion in the condensed phase is found to be appreciably smaller than that of diffusion in the dilute phase (Dd > Dm). In general, a hallmark of a phase separation system is to enrich the molecules involved in the condensed phase, regardless of whether the molecule is a driver (or scaffold) or a client of the system. Such enrichment is expected to result from the net free energy gain due to increased molecular interactions in the condensed phase (as envisioned in Response Figure 9). For example, in the phase separation systems containing PrLD-SAMME (Figure 4 of the manuscript), Pm is close to 1, but the enrichment of PrLD-SAMME in the condensed phase is much greater than 1 (estimated to be ~77, based on the fluorescence intensity of the protein in the dilute and condensed phase; Figure 5-figure supplement 1). As far as Eq.1 is concerned, this is mathematically consistent because the diffusion coefficient of PrLD-SAMME in the condensed phase (D ~0.2 μm²/s) is much smaller than the diffusion coefficient of a monomeric molecule with a similar molecular mass in dilute solution (D ~100 μm²/s, measured by FRAP-based assay; the mobility of the molecules in the dilute solution in 3D is too fast to be tracked). Physically, it is most likely that the slower molecular motion in the condensed phase is caused by favorable intermolecular interactions, and that the same favorable interactions underpinning the dynamic effects also lead to a larger equilibrium Boltzmann population.
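      The approximate relation quoted above can be evaluated directly for the PrLD-SAMME numbers given in the response. This is a minimal arithmetic sketch: the function name `enrichment_fold` is invented here, `P_m` is set to exactly 1 as a stand-in for "close to 1", and the diffusion coefficients are the order-of-magnitude values quoted in the text.

```python
def enrichment_fold(D_dilute, D_mobile, P_mobile):
    """Approximate enrichment fold, EF ~ D_d / (P_m * D_m), as in Eq.1 of the response."""
    return D_dilute / (P_mobile * D_mobile)


# PrLD-SAMME values quoted in the response (um^2/s); P_m = 1.0 is an assumed stand-in.
ef_prld = enrichment_fold(D_dilute=100.0, D_mobile=0.2, P_mobile=1.0)

# Special case D_d = D_m: the relation reduces to EF = 1 / P_m.
ef_equal_d = enrichment_fold(D_dilute=0.2, D_mobile=0.2, P_mobile=0.5)

print(ef_prld, ef_equal_d)
```

      With these inputs the relation gives an enrichment of about 500, compared with the fluorescence-based estimate of ~77 quoted above. The relation therefore captures the strong enrichment at Pm ≈ 1 only in an approximate, order-of-magnitude sense, in keeping with its description as an approximate relation.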

    1. Author Response

      Reviewer #1 (Public Review):

      The authors set out to extend modeling of bispecific engager pharmacology through explicit modelling of the search of T cells for tumour cells, the formation of an immunological synapse and the dissociation of the immunological synapse to enable serial killing. These features have not been included in prior models and their incorporation may improve the predictive value of the model.

      Thank you for the positive feedback.

      The model provides a number of predictions that are of potential interest- that loss of CD19, the target antigen, to 1/20th of its initial expression will lead to escape and that the bone marrow is a site where the tumour cells may have the best opportunity to develop loss variants due to the limited pressure from T cells.

      Thank you for the positive feedback.

      A limitation of the model is that adhesion is only treated as a 2D implementation of the blinatumomab mediated bridge between T cell and B cells- there is no distinct parameter related to the distinct adhesion systems that are critical for immunological synapse formation. For example, CD58 loss from tumours is correlated with escape, but it is not related to the target, CD19. While they begin to consider the immunological synapse, they don't incorporate adhesion as distinct from the engager, which is almost certainly important.

      We agree that adhesion molecules play critical roles in cell-cell interaction. In our model, we assumed that the levels of these adhesion molecules are constant (i.e., do not differ across cell populations). This assumption allowed us to focus on the BiTE-mediated interactions.

      Revision: To clarify this point, we added a couple of sentences in the manuscript.

      “Adhesion molecules such as CD2-CD58, integrins and selectins, are critical for cell-cell interaction. The model did not consider specific roles played by these adhesion molecules, which were assumed constant across cell populations. The model performed well under this simplifying assumption”.

      In addition, we acknowledged the fact that “synapse formation is a set of precisely orchestrated molecular and cellular interactions. Our model merely investigated the components relevant to BiTE pharmacologic action and can only serve as a simplified representation of this process”.

      While the random search is a good first approximation, T cell behaviour is actually guided by stroma and extracellular matrix, which are non-isotropic. In a lymphoid tissue the stroma is optimised for a search that can be approximated as Brownian, or more accurately, a correlated random walk, but in other tissues, particularly tumours, the Brownian search is not a good approximation and other models have been applied. It would be interesting to look at observations from bone marrow or other sites to determine the best approximation for the search related to BiTE targets.

      We agree that the tissue stromal factors greatly influence the patterns of T cell searching strategy. Our current model considered Brownian motion as a good first approximation for two reasons: 1) we define tissues as homogeneous compartments to attain unbiased evaluations of factors that influence BiTE-mediated cell-cell interaction, such as T cell infiltration, T: B ratio, and target expression. The stromal factors were not considered in the model, as they require spatially resolved tissue compartments to represent the gradients of stromal factors; 2) our model was primarily calibrated against in vitro data obtained from a “well-mixed” system that does not recapitulate specific considerations of tissue stromal factors. We did not obtain tissue-specific data to support the prediction of T cell movement. This is under current investigation in our lab. Therefore, we are cautious about assuming different patterns of T cell movement in the model when translating into in vivo settings. We acknowledged the limitation of our model for not considering the more physiologically relevant T-cell searching strategies.

      Revision: In the Discussion, we added a limitation of our model: “We assumed Brownian motion in the model as a good first approximation of T cell movement. However, T cells often take other more physiologically relevant searching strategies closely associated with many stromal factors. Because of these stromal factors, the cell-cell encounter probabilities would differ across anatomical sites.”
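      The difference between an uncorrelated (Brownian-like) search and a correlated random walk, raised by the reviewer above, can be made concrete with a toy simulation. This is an illustrative sketch only and is not part of the authors' model: walkers take unit steps in 2D, and the turning-angle spread `turn_sd = 0.3` rad, the walker count, and the step count are arbitrary assumptions.

```python
import math
import random


def mean_squared_displacement(n_walkers, n_steps, turn_sd, rng):
    """MSD of 2D walkers with unit steps; turn_sd=None gives an uncorrelated walk."""
    total = 0.0
    for _ in range(n_walkers):
        x = y = 0.0
        heading = rng.uniform(0.0, 2.0 * math.pi)
        for _ in range(n_steps):
            if turn_sd is None:
                # Uncorrelated walk: a fresh random direction at every step.
                heading = rng.uniform(0.0, 2.0 * math.pi)
            else:
                # Correlated walk: the direction drifts by a small random turn.
                heading += rng.gauss(0.0, turn_sd)
            x += math.cos(heading)
            y += math.sin(heading)
        total += x * x + y * y
    return total / n_walkers


rng = random.Random(0)  # fixed seed so the comparison is repeatable
msd_uncorrelated = mean_squared_displacement(2000, 100, None, rng)
msd_persistent = mean_squared_displacement(2000, 100, 0.3, rng)

print(round(msd_uncorrelated), round(msd_persistent))
```

      For the same number of steps, the persistent walkers cover a much larger area than the uncorrelated ones, so the two search strategies would be expected to give different cell-cell encounter rates. This is the sense in which encounter probabilities could differ across anatomical sites.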

      Reviewer #3 (Public Review):

      Liu et al. combined mechanistic modeling with in vitro experiments and data from a clinical trial to develop an in silico model to describe the response of T cells against tumor cells when bi-specific T cell engager (BiTE) antigens, a standard immunotherapeutic drug, are introduced into the system. The model predicted responses of T cell and target cell populations in vitro and in vivo in the presence of BiTEs, where the model linked molecular-level interactions between BiTE molecules, CD3 receptors, and CD19 receptors to the population kinetics of the tumor and the T cells. Furthermore, the model predicted tumor killing kinetics in patients and offered suggestions for optimal dosing strategies in patients undergoing BiTE immunotherapy. The conclusions drawn from this combined approach are interesting and are supported by experiments and modeling reasonably well. However, the conclusions can be tightened further by making some moderate to minor changes in their approach. In addition, there are several limitations in the model which deserve some discussion.

      Strengths

      A major strength of this work is the ability of the model to integrate processes from the molecular scales to the populations of T cells, target cells, and the BiTE antibodies across different organs. A model of this scope has to contain many approximations and thus the model should be validated with experiments. The authors did an excellent job in comparing the basic and the in vitro aspects of their approach with in vitro data, where they compared the numbers of engaged target cells with T cells as the numbers of the BiTE molecules, the ratio of effector and target cells, and the expressions of the CD3 and CD19 receptors were varied. The agreement of the model with the data was excellent in most cases, which led to several mechanistic conclusions. In particular, the study found that target cells with lower CD19 expression escape T cell killing.

      The in vivo extension of the model showed reasonable agreements with the kinetics of B cell populations in patients where the data were obtained from a published clinical trial. The model explained differences in B cell population kinetics between responders and non-responders and found that the differences were driven by the differences in the T cell numbers between the groups. The ability of the model to describe the in vivo kinetics is promising. In addition, the model leads to some interesting conclusions, e.g., the model shows that the bone marrow harbors tumor growth during the BiTE treatment. The authors then used the model to propose an alternate dosage scheme for BiTEs that needed a smaller dose of the drug.

      Thank you for the positive comments.

      Weaknesses

      There are several weaknesses in the development of the model. Multiscale models of this nature contain parameters that need to be estimated by fitting the model with data. Some of these parameters are associated with model approximations or are not measured in experiments. Thus, a common practice is to estimate parameters with some 'training data' and then test model predictions using 'test data'. Though Supplementary file 1 provides values for some of the parameters that appeared to be estimated, it was not clear which datasets were used for training and which for testing. The confidence intervals of the estimated parameters and the sensitivity of the proposed in vivo dosage schemes to parameter variations were unclear.

      We agree with the reviewer on the model validation.

      Revision: To ensure reproducibility, we summarized model assumptions and parameter values/sources in Supplementary file 1. To mimic tumor heterogeneity and the evolution process, we applied stochastic agent-based models, which are challenging to optimize globally against the data. The majority of key parameters were obtained or derived from the literature. Details have been provided in the response to Reviewer 3 - Question 1. In our modeling process, we manually optimized the sensitive coefficient (β) for the base model using pilot in-vitro data and the sensitive coefficient (β) for the in-vivo model by re-calibrating against the in-vitro data at a low BiTE concentration. BiTE concentrations in patients (mostly < 2 ng/ml) are only relevant to the lower bound of the concentration range we investigated in vitro (0.65-2000 ng/ml). We have added some clarification/limitation of this approach in the text (details are provided in the following question). We understand the concerns, but the nature of agent-based modeling prevents us from performing global optimization.

      The model appears to show a few unreasonable behaviors and does not agree with experiments in several cases, which could point to missing mechanisms in the model. Here are some examples. The model shows a surprising decrease in the T cell-target cell synapse formation when the affinity of the BiTEs to CD3 was increased; the opposite should have been more intuitive. The authors suggest degradation of CD3 could be a reason for this behavior. However, this probably could be easily tested by removing CD3 degradation in the model. Another example is that the increase in the % of engaged effector cells in the model with increasing CD3 expression does not agree well with experiments (Fig. 3d); however, a similar fold increase in the % of engaged effector cells in the model agrees better with experiments for increasing CD19 expression (Fig. 3e). It is unclear how this can be explained given that CD3 and CD19 appear to be present in similar copy numbers per cell (~10^4 molecules/cell), and both receptors bind the BiTE with high affinities (e.g., koff < 10^-4 s^-1).

      Thank you for pointing this out. The bidirectional effect of CD3 affinity on IS formation is counterintuitive. In a hypothetical situation when there is no CD3 downregulation, the bidirectional effect disappears (as shown below), consistent with our view that CD3 downregulation accounts for the counterintuitive behavior. We have included the simulation to support our point. From a conceptual standpoint, the inclusion of CD3 degradation means the way to maximize synapse formation is for the BiTE to first bind tumor antigen, after which the tumor-BiTE complex “recruits” a T cell through the CD3 arm.

      We agree that the model did not adequately capture the effect of CD3 expression at the highest BiTE concentration (100 ng/ml), while the effects at other BiTE concentrations were well captured (as shown below, left). The model predicted a much more moderate effect of CD3 expression on IS formation at the highest concentration. This is partly because the model assumed rapid CD3 downregulation upon antibody engagement. We did a similar simulation as above, with moderate CD3 downregulation (as shown below, right). This increases the effect of CD3 expression at the highest BiTE concentration, consistent with experiments. Interestingly, a rapid CD3 downregulation rate, as we concluded, is required to capture data profiles at all other conditions. Considering that a BiTE concentration of 100 ng/ml is much higher than the therapeutically relevant level in circulation (< 2 ng/ml), we did not investigate the mechanism underlying this inconsistent model prediction, but we acknowledged the fact that the model under-predicted IS formation in Figure 3d. Notably, this discrepancy may rarely appear in our clinical predictions, as CD3 expression is at a low level and the blood BiTE concentration is very low (< 2 ng/ml).

      Revision: We have made text adjustments to increase clarity on these points. In addition, we added: “The base model underpredicted the effect of CD3 expression on IS formation at 100 ng/ml BiTE concentration, which is partially because of the rapid CD3 downregulation upon BiTE engagement and assay variation across experimental conditions.”

      The model does not include signaling and activation of T cells as they form the immunological synapse (IS) with target cells. The formation of the IS leads to aggregation of different receptors, adhesion molecules, and kinases which modulate signaling and activation. Thus, it is likely that variations in the copy numbers of CD3 and the CD19-BiTE-CD3 complex will lead to variations in the cytotoxic responses and presumably to CD3 degradation as well. Perhaps some of these missing processes are responsible for the disagreements between the model and the data shown in Fig. 3. In addition, the in vivo model does not contain any development of the T cells as they are stimulated by the BiTEs. The differences in development of T cells, such as generation of dysfunctional/exhausted T cells, could lead to the differences in responses to BiTEs in patients. In particular, the in vivo model does not agree with the kinetics of B cells after day 29 in non-responders (Fig. 6d); could the kinetics of T cell development play a role in this?

      We agree that intracellular signaling is critical to T cell activation and cytotoxic effects. IS formation, T cell activation, and cytotoxicity are a cascade of events with highly coordinated molecular and cellular interactions. Compared to the events of T cell activation and cytotoxicity, IS formation occurs at a relatively earlier time. As shown in our study, IS formation can occur at 2-5 min, while the other events often need hours to be observed. We found that IS formation is primarily driven by two intercellular processes: cell-cell encounter and cell-cell adhesion. The intracellular signaling would be initiated in the process of cell-cell adhesion or at the late stage of IS formation. We think these intracellular events are relevant but may not be the reason why our model did not adequately capture the profiles in Figure 3d at the highest BiTE concentrations. Therefore, we did not include intracellular signaling in the models. Another reason was that we simulated our models at an agent level to mimic the process of tumor evolution, which is computationally demanding. Intracellular events for each cell may make it more challenging computationally.

      T cell activation and exhaustion throughout the BiTE treatment is very complicated, time-variant and impacted by multiple factors like T cell status, tumor burden, BiTE concentration, immune checkpoints, and tumor environment. T cell proliferation and death rates are challenging to estimate, as the quantitative relationship with those factors is unknown. Therefore, T cell abundance (expansion) was considered as an independent variable in our model. T cell counts are measured in BiTE clinical trials. We included these data in our model to reveal expanded T cell population. Patients with high T cell expansion are often those with better clinical response. Notably, the T cell decline due to rapid redistribution after administration was excluded in the model. T cell abundance was included in the simulations in Figure 6 but not proof of concept simulations in Figure 7.

      In Figure 6d, the kinetics of T cell abundance had been included in the simulations for responders and non-responders in the MT103-211 study. Thus, the kinetics of T cell development cannot be used to explain the disagreement between model prediction and observation after day 29 in non-responders. The observed data are actually median values of B-cell kinetics in non-responders (N = 27) with very large inter-subject variation (baseline from 10-10000/μL), which makes them very challenging for the model to capture perfectly. Many non-responders with severe progression dropped out of the treatment at the end of cycle 1, which resulted in an apparently “more potent” efficacy in the 2nd cycle. This might be the main reason for the disagreement.

      Variation in cytotoxic response was not included in our models. Tumor cells were assumed to be eradicated after engagement with effector cells; no killing rate or killing probability was implemented. This assumption reduced the model complexity and aligned well with our in-vitro and clinical data. Cytotoxic response in vivo is impacted by multiple factors such as the copy number of CD3, cytokine/chemokine release, the tumor microenvironment, and T cell activation/exhaustion. For example, the cytotoxic response and killing rate mediated by the 1:1 synapse (ET) and other variants (ETE, TET, ETEE, etc.) are expected to be different as well. Our model did not differentiate the killing rates of these synapse variants, but it has quantified these synapse variants, providing a framework for us to address these questions in the future. We agree that differentiating the cytotoxic responses under different scenarios may improve model prediction, and more exploration needs to be done in the future.

      Revision: We added a discussion of the limitations which we believe is informative to future studies.

      “Our models did not include intracellular signaling processes, which are critical for T activation and cytotoxicity. However, our data suggests that encounter and adhesion are more relevant to initial IS formation. To make more clinically relevant predictions, the models should consider these intracellular signaling events that drive T cell activation and cytotoxic effects. Of note, we did consider the T cell expansion dynamics in organs as independent variable during treatment for the simulations in Figure 6. T cell expansion in our model is case-specific and time-varying.”

    1. Author Response

      Reviewer #1 (Public Review):

      This study examines the factors underlying the assembly of MreB, an actin family member involved in mediating longitudinal cell wall synthesis in rod-shaped bacteria. Required for maintaining rod shape and essential for growth in model bacteria, single molecule work indicates that MreB forms treadmilling polymers that guide the synthesis of new peptidoglycan along the longitudinal cell wall. MreB has proven difficult to work with and the field is littered with artifacts. In vitro analysis of MreB assembly dynamics has not fared much better as helpfully detailed in the introduction to this study. In contrast to its distant relative actin, MreB is difficult to purify and requires very specific conditions to polymerize that differ between groups of bacteria. Currently, in vitro analysis of MreB and related proteins has been mostly limited to MreBs from Gram-negative bacteria which have different properties and behaviors from related proteins in Gram-positive organisms.

      Here, Mao and colleagues use a range of techniques to purify MreB from the Gram-positive organism Geobacillus stearothermophilus, identify factors required for its assembly, and analyze the structure of MreB polymers. Notably, they identify two short hydrophobic sequences-located near one another on the 3-D structure-which are required to mediate membrane anchoring.

      With regard to assembly dynamics, the authors find that Geobacillus MreB assembly requires both interactions with membrane lipids and nucleotide binding. Nucleotide hydrolysis is required for interaction with the membrane and interaction with lipids triggers polymerization. These experiments appear to be conducted in a rigorous manner, although the salt concentration of the buffer (500mM KCl) is quite high relative to that used for in vitro analysis of MreBs from other organisms. The authors should elaborate on their decision to use such a high salt buffer, and ideally, provide insight into how it might impact their findings relative to previous work.

      Response 1.1. MreB proteins are notoriously difficult to maintain in a soluble form. Some labs deleted the N-terminal amphipathic or hydrophobic sequences to increase solubility, while other labs used full-length protein but high KCl concentration (300 mM KCl) (Harne et al, 2020; Pande et al., 2022; Popp et al, 2010; Szatmari et al, 2020). Early in the project, we tested many conditions and noticed that high KCl helped maintain slightly better solubility of full-length MreBGs, without the need to delete part of the protein. In addition, concentrations of salt > 100 mM would better mimic the conditions met by the protein in vivo. While 50-100 mM KCl is traditionally used in actin polymerization assays, physiological salt concentrations are around 100-150 mM KCl in invertebrates and vertebrates (Schmidt-Nielsen, 1975), around 50-250 mM in fungal and plant cells (Rodriguez-Navarro, 2000) and 200-300 mM in the budding yeast (Arino et al, 2010). However, cytoplasmic K+ concentration varies greatly (up to 800 mM) depending on the osmolality of the medium in both E. coli (Cayley et al, 1991; Epstein & Schultz, 1965; Rhoads et al, 1976) and B. subtilis, in which the basal intracellular concentration of KCl was estimated to be ~ 350 mM (Eisenstadt, 1972; Whatmore et al, 1990). 500 mM KCl can therefore be considered as physiological as 100 mM KCl for bacterial cells. Since we observed plenty of pairs of protofilaments at 500 mM KCl and this condition helped to avoid aggregation, we kept this high concentration as a standard for most of our experiments. Nonetheless, we had also performed TEM polymerization assays at 100 mM in line with most of the MreB and F-actin in vitro literature, and found no difference in the polymerization (or absence of polymerization) conditions. This was indicated in the initial submission (e.g. M&M section L540 and footnote of Table S2) but since two reviewers bring it up as a main point, it is evident we failed to communicate it clearly, for which we apologize. This has been clarified in the revised version of the manuscript. We have also almost systematically added the 100 mM KCl condition, as per reviewer #2's request and to reconcile our salt conditions with those used for some in vitro analysis of MreBs from other organisms (see also response to reviewer #2 comments 1A and 1B = Responses 2.1A, 2.1B below). We then decided to refer to the 100 mM KCl concentration as our “standard condition” in the revised version of the manuscript, but we compile and compare the results obtained at 500 mM too, as both concentrations are within the physiological range in Bacillus.

      Additionally, this study, like many others on MreB, makes much of MreB's relationship to actin. This leads to confusion and the use of unhelpful comparisons. For example, MreB filaments are not actin-like (line 58) any more than any polymer is "actin-like." As evidenced by the very beautiful images in this manuscript, MreB forms straight protofilaments that assemble into parallel arrays, not the paired-twisted polymers that are characteristic of F-actin. Generally, I would argue that work on MreB has been hindered by rather than benefitted from its relationship to actin (e.g., early FP fusion data interpreted as evidence for an MreB endoskeleton supporting cell shape or depletion experiments implicating MreB in chromosome segregation) and thus such comparisons should be avoided unless absolutely necessary.

      Response 1.2. We completely agree with reviewer #1 regarding unhelpful comparisons of actin and MreB, and that work on MreB has been traditionally hindered by its relationship to eukaryotic actin. MreB is nonetheless a structural homolog of actin, with a close structural fold and common properties (polymerization into pairs of protofilaments, ATPase activity…). It still makes sense to refer to a widely studied protein with common features and common ancestry, as long as we do not confine our thinking to a single conceptual framework. This said, actin and MreB diverged very early in evolution, which may account for differences in their biochemical properties and cellular functions. Current data on MreB filaments confirm that they display F-actin-like and F-actin-unlike properties. We thank the reviewer for this insightful comment. We have revised the text to remove any inaccurate or unhelpful comparison to actin (in particular the ‘actin-like filaments’ statement, previously used once).

      Reviewer #2 (Public Review):

      The paper "Polymerization cycle of actin homolog MreB from a Gram-positive bacterium" by Mao et al. provides the second biochemical study of a Gram-positive MreB, but, importantly, it is the first study to examine how Gram-positive MreB filaments bind to membranes. They also show the first crystal structure of a MreB from a Gram-positive bacterium - in two nucleotide-bound forms, finally solving structures that have been missing for too long. They also elucidate what residues in Geobacillus MreB are required for membrane associations. Also, the QCM-D approach to monitoring MreB membrane associations is a direct and elegant assay.

      While the above findings are novel and important, this paper also makes a series of conclusions that run counter to multiple in vitro studies of MreBs from different organisms and other polymers with the actin fold. Overall, they propose that Geobacillus MreB contains biochemical properties that are quite different than not only the other MreBs examined so far but also eukaryotic actin and every actin homolog that has been characterized in vitro. As the conclusions proposed here would place the biochemical properties of Geobacillus MreB as the sole exception to all other actin fold polymers, further supporting experiments are needed to bolster these contrasting conclusions and their overall model.

      Response 2.0. We are grateful to reviewer #2 for stressing the novelty and importance of our results. Most of our conclusions were in line with previous in vitro studies of MreBs (formation of pairs of straight filaments on a lipid layer, both ATP and GTP binding and hydrolysis, distortion of liposomes…), with the exception of the claimed requirement of NTP hydrolysis for membrane binding prior to polymerization, which was based on the absence of pairs of filaments in free solution or in the presence of AMP-PNP in our experimental conditions (which we agree was not sufficient to make such a bold claim, see below). Thanks to the reviewer’s comments, we have performed many controls and additional experiments that led us to refine our results and largely reconcile them with the literature. Please see the answer to the global review comments - our conclusions have been revised on the basis of our new data.

      1. (Difference 1) - The predominant concern about the in vitro studies that makes it difficult to evaluate many of their results (much less compare them to other MreBs and actin homologs) is the use of a highly unconventional polymerization buffer containing 500(!) mM KCl. As has been demonstrated with actin and other polymers, the high KCl concentration used here (500mM) is certain to affect the polymerization equilibria, as increasing salt increases the hydrophobic effect and inhibits salt bridges, and therefore will affect the affinity between monomers and filaments. For example, past work has shown that high salt greatly changes actin polymerization, causing: a decreased critical concentration, increased bundling, and a greatly increased filament stiffness (Kang et al., 2013, 2012). Similarly, with AlfA, increased salt concentrations have been shown to increase the critical concentration, decrease the polymerization kinetics, and inhibit the bundling of AlfA filaments (Polka et al., 2009).

      A more closely related example comes from the previous observation that increasing salt concentrations increasingly slow the polymerization kinetics of B. subtilis MreB (Mayer and Amann, 2009). Lastly, these high salt concentrations might also change the interactions of MreB(Gs) with the membrane by screening charges and/or increasing the hydrophobic effect. Given that 500mM KCl was used throughout this paper, many (if not all) of the key experiments should be repeated at a more standard salt concentration (~100mM), similar to those used in most previous in vitro studies of polymers.

      Response 2.1A. As per reviewer #2's request, we have repeated at 100 mM KCl most of the experiments (TEM, cryo-EM, QCM-D and ATPase assays) initially performed at 500 mM KCl only. The KCl concentration affects both membrane binding and filament stiffness as anticipated by the reviewer, but the main conclusions are the same. The revised version of the manuscript compiles and compares the results obtained at both high and low [KCl], both concentrations being within the physiological range in Bacillus. Please see point 1 of the response to the global review comments and the first response to reviewer 1 (Response 1.1) for further elaboration.

      Please note that in Mayer & Amann, 2009 (B. subtilis MreB), light scattering in free solution was inversely proportional to the KCl concentration, with the highest light scattering signal at 0 mM KCl (!), a > 2-fold reduction below 30 mM KCl and no scatter at all at 250 mM, suggesting a “salting in” phenomenon (see also the “Other Points to address” answers 1A and 2, below) (Mayer & Amann, 2009). Since no effective polymer formation (e.g. polymers shown by EM) was demonstrated in these experiments, it cannot be excluded that KCl was simply preventing aggregation of B. subtilis MreB in solution, as we observe. For all their other light scattering experiments, the ‘standard polymerization condition’ used by Mayer & Amann was 0.2 mM ATP, 5 mM MgCl2, 1 mM EGTA and 10 mM imidazole pH 7.0, to which MreB (in 5 mM Tris pH 8.0) was added. No KCl was present in their ‘standard’ polymerization conditions.

      This would test if the many divergent properties of MreB(Gs) reported here arise from some difference in MreB(Gs) relative to other MreBs (and actin homologs), or if they arise from the 400mM difference in salt concentration between the studies. Critically, it would also allow direct comparisons to be made relative to previous studies of MreB (and other actin homologs) that used much lower salt, thereby allowing them to definitively demonstrate whether MreB(Gs) is indeed an outlier relative to other MreB and actin homologs. I would suggest using 100mM KCl, as historically, all polymerization assays of actin and numerous actin homologs have used 50-100mM KCl: 50mM KCl (for actin in F buffer) or 100mM KCl for multiple prokaryotic actin homologs and MreB (Deng et al., 2016; Ent et al., 2014; Esue et al., 2006, 2005; Garner et al., 2004; Polka et al., 2009; Rivera et al., 2011; Salje et al., 2011). Likewise, similar salt concentrations are standard for tubulin (80 mM K-Pipes) and FtsZ (100 mM KCl or 100mM KAc in HMK100 buffer).

      Response 2.1B. We appreciate the reviewer’s feedback on this point. Please note that, although actin polymerization assays are historically performed at 50-100 mM KCl and thus 100 mM KCl was used for other bacterial actin homologs (MamK, ParM and AlfA), MreB polymerization assays have previously been reported at 300 mM KCl too (Harne et al., 2020; Pande et al., 2022; Popp et al., 2010; Szatmari et al., 2020), which is closer to the physiological salt concentration in bacterial cells (see Response 1.1), but also in the absence of KCl (see above). As a matter of fact, we originally wanted to use a “standard polymerization condition” based on the literature on MreB, before realizing there was none: only half used KCl (the other half used NaCl, or no monovalent salt at all) and among these, KCl concentrations varied (out of 8 publications, 2 used 20 mM KCl, 2 used 50 mM KCl and 4 used 300 mM KCl).

      1. (Difference 2) - One of the most important differences claimed in this paper is that MreB(Gs) filaments are straight, a result that runs counter to the curved T. maritima and C. crescentus filaments detailed by the Löwe group (Ent et al., 2014; Salje et al., 2011). Importantly, this difference could also arise from the difference in salt concentrations used in each study (500mM here vs. 100mM in the Löwe studies), and thus one cannot currently draw any direct comparisons between the two studies.

      One example of how high salt could be causing differences in filament geometry: high salts are known to greatly increase the bending stiffness of actin filaments, making them more rigid (Kang et al., 2013). Likewise, increasing salt is known to change the rigidity of membranes. As the ability of filaments to A) bend the membrane or B) deform to the membrane depends on the stiffness of filaments relative to the stiffness of the membrane, the observed difference in the "straight vs. curved" conformation of MreB filaments might simply arise from different salt concentrations. Thus, in order to draw several direct comparisons between their findings and those of other MreB orthologs (as done here), the studies of MreB(Gs) conformations on lipids should be repeated at the same buffer conditions as used in the Löwe papers, then allowing them to be directly compared.

      Response 2.2. We fully agree with reviewer #2 that the salt could be affecting the assay, and we also performed cryo-EM experiments in the presence of 100 mM KCl as requested. The results unambiguously showed countless curved liposomes on the contact areas with MreB (Fig. 2F-G and Fig. 2-S5), very similar to what was reported for Thermotoga and Caulobacter MreBs by the Löwe group. Our results therefore confirm the previous findings that MreBs can bend lipids, and suggest that, indeed, high salt may increase filament stiffness as has been shown for actin filaments. We are very grateful to reviewer #2 for this suggestion and for drawing our attention to the work of Kang et al, 2013. The different bending observed when varying the salt concentration raises relevant questions regarding the in vivo behavior of MreB, since the intracellular KCl concentration was shown to vary greatly depending on the medium composition. The manuscript has been updated accordingly in the Results (from L243) and Discussion sections (L585-595).

      1. (Difference 3) - The next important difference between MreB(Gs) and other MreBs is the claim that MreB polymers do not form in the absence of membranes.

      A) This is surprising relative to other MreBs, as MreBs from 1) T. maritima (multiple studies), E. coli (Nurse and Marians, 2013), and C. crescentus (Ent et al., 2014) have been shown to form polymers in solution (without lipids) with electron microscopy, light scattering, and time-resolved multi-angle light scattering. Notably, the Esue work was able to observe the first phase of polymer formation and a subsequent phase of polymer bundling (Esue et al., 2006) of MreB in solution. 2) Similarly, (Mayer and Amann, 2009) demonstrated B. subtilis MreB forms polymers in the absence of membranes using light scattering.

      Response 2.3A. The literature does convincingly show that Thermotoga MreB forms polymers in solution, without lipids (note that for Caulobacter MreB, filaments were only reported in the presence of lipids (van den Ent et al, 2014)). Assemblies reported in solution are bundles or sheets (including at the earlier time points in the time-resolved EM experiments reported by Esue et al. 2006 mentioned by the reviewer – ‘2 minutes after adding ATP, EM revealed that MreB formed short filamentous bundles’) (Esue et al, 2006). However, and as discussed above (Response 2.1A), the light scattering experiments in Mayer and Amann, 2009 do not conclusively demonstrate the presence of polymers of B. subtilis MreB in solution (Mayer & Amann, 2009). We performed many light scattering experiments of B. subtilis MreB in solution in the past (before finding out that filaments were only forming in the presence of lipids), and got similar scattering curves (see two examples of DLS experiments in Author response image 1) in conditions in which NO polymers could ever be observed by EM while plenty of aggregates were present.

      Author response image 1.

      We did not consider these results publishable in the absence of true polymers observed by TEM. As pointed out in the interesting study from Nurse et al. (on E. coli MreB) (Nurse & Marians, 2013), one cannot rely on light scattering alone because non-specific aggregates would show patterns similar to those of polymers. Over the last two decades, about 15 publications showed polymers of MreB from several Gram-negative species, while none (despite the efforts of many) showed a single convincing MreB polymer from a Gram-positive bacterium by EM. A simple hypothesis is that a critical parameter was missing, and we present convincing evidence that lipids are critical for Geobacillus MreB to form pairs of filaments in the conditions tested. However, in solution too we do occasionally see pairs of filaments (Fig 2-S2), and also sheet-like structures among aggregates when the concentration of MreB is increased (Fig. 2-S2 and Fig. 3-S2). Thus, we agree with the reviewer that it cannot be claimed that Geobacillus MreB is unable to polymerize in the absence of lipids, but rather that lipids strongly stimulate its polymerization, depending on the conditions.

      B) The results shown in figure 5A also go against this conclusion, as there is only a 2-fold increase in the phosphate release from MreB(Gs) in the presence of membranes relative to the absence of membranes. Thus, if their model is correct, and MreB(Gs) polymers form only on membranes, this would require the unpolymerized MreB monomers to hydrolyze ATP at 1/2 the rate of MreB in filaments. This high relative rate of hydrolysis of monomers compared to filaments is unprecedented. For all polymers examined so far, the rate of monomer hydrolysis is several orders of magnitude less than that of the filament. For example, actin monomers are known to hydrolyze ATP 430,000X slower than the monomers inside filaments (Blanchoin and Pollard, 2002; Rould et al., 2006).

      Response 2.3B. We agree with the reviewer. We have now found conditions where sheets of MreB form in solution (at high MreB concentration) in the presence of ADP and AMP-PNP. However, we have now added several controls that exclude efficient formation of polymers in solution in the presence of ATP at low concentrations of MreBGs (≤ 1.5 µM), the condition used for the malachite green assays. At these MreB concentrations, pairs of filaments are observed in the presence of lipids, but very infrequently in solution, and sheets are not observed in solution either (Fig. 2-S2A, B). Yet, albeit puzzling, in these conditions Pi release is reproducibly observed in solution, reduced only ~ 2 to 3-fold relative to Pi release in the presence of lipids (Fig. 5A and Fig. 5-S1). A reinforcing observation comes from the ATPase assays performed at 100 mM KCl (Fig. 5A). In this condition MreB binding to lipids is increased relative to 500 mM KCl (Fig. 4-S4C), and the stimulation of the ATPase activity by the presence of lipids is also stronger than at 500 mM (Fig. 5-S1A). Further work is needed to characterize in detail the ATPase activity of MreB proteins, for which data in the literature is very scarce. We can’t exclude that MreB could nucleate in solution or form very unstable filaments that cannot be seen in our EM assay but consume ATP in the process. At the moment, the significance of the Pi released in solution is unknown and will require further investigation.

      C) Thus, there is a strong possibility that MreB(Gs) polymers are indeed forming in solution in addition to those on the membrane, and these "solution polymers" may not be captured by their electron microscopy assay. For example, high salt could be interfering with the adsorption of filaments to glow-discharged grids lacking lipids.

      Response 2.3C. We appreciate the reviewer’s insight about this critical point. Polymers presented in the original Fig. 2A were obtained at 500 mM KCl but we had tested the polymerization of MreB at 100 mM KCl as well, without noticing differences. We have nonetheless redone this quantitatively and used these data for the revised Fig. 2A, as we are now using 100 mM KCl as our standard polymerization condition throughout the revised manuscript. We also followed the other suggestion of the reviewer and tested glow-discharged grids (a more classic preparation for soluble proteins) vs non-glow-discharged EM grids, as well as a higher concentration of MreB. Grids are generally glow-discharged to make them hydrophilic in order to adsorb soluble proteins, but the properties of MreB (soluble but obviously presenting hydrophobic domains) made it difficult to predict what support putative soluble polymers would preferentially interact with. Septins for example bind much better to hydrophobic grids despite their soluble properties (I. Adriaans, personal communication). Virtually no double filaments were observed in solution at either low or high [MreB]. The fact that in some conditions (high [MreB], other nucleotides) we were able to detect sheet-like structures excluded a technical issue that would prevent the detection of existing but “invisible” polymers here. We have added these new data in Fig. 2-S2.

      As indicated above, the reviewer’s comments made us realize that we could not state or imply that MreB cannot polymerize in the absence of lipids. As a matter of fact, we always saw some random filaments in the EM fields, both in solution and in the presence of non-hydrolysable analogues, at very low frequency (Fig. 2A). We also now see sheets at high MreB concentration (Fig. 2-S2B). We could simply be missing the optimal conditions for polymerisation in solution, while our phrasing gave the impression that no polymers could ever form in the absence of ATP or lipids. Therefore, we have:

      1) analyzed all TEM data to present it as semi-quantitative TEM, using our methodology originally implemented for the analysis of the mutants

      2) reworked the text to remove any problematic statements and to indicate that MreBGs was only found to bind to a lipid monolayer as a double protofilament in the presence of ATP/GTP but that this does not exclude that filaments may also form in other conditions.

      In order to definitively prove that MreB(Gs) does not have polymers in solution, the authors should:

      i) conduct orthogonal experiments to test for polymers in solution. The simplest test of polymerization might be conducting pelleting assays of MreB(Gs) with and without lipids, sweeping through the concentration range as done in 2B and 5A.

      Response 2.3Ci. Following reviewer #2's suggestion, we conducted a series of sedimentation assays in the presence and in the absence of lipids, at low (100 mM) and high (500 mM) salt, for both the wild-type protein and the three membrane-anchoring mutants (all at 1.3 µM). Sedimentation experiments in salt conditions preventing aggregation in solution (500 mM KCl) fitted with our TEM results: MreB wild-type pelleting increased in the presence of both ATP and lipids (Fig. R1). The sedimentation was further increased at 100 mM KCl, which would fit our other results indicating an increased interaction of MreB with the membrane. However, in addition to being poorly reproducible (in our hands), the approach does not discriminate between polymers and aggregates (or monomers bound to liposomes) and since MreB has a strong tendency to aggregate, we believe that the technique is ill-suited to reliably address MreB polymerization and prefer not to include sedimentation data in our manuscript. The recent work from Pande et al. (2022) illustrates this issue well, since no sedimentation of MreB (at 2 µM) was observed in solution in conditions supporting polymerization (at 300 mM KCl): ‘the protein does not pellet on its own in the absence of liposome, irrespective of its polymerization state’, implying that sedimentation does not allow detection of MreB5 filaments in solution (Pande et al., 2022).

      ii) They also could examine if they see MreB filaments in the absence of lipids at 100mM salt (as was seen in both Löwe studies), as the high salt used here might block the charges on glow discharged grids, making it difficult for the polymer to adhere.

      See above, Response 2.3C

      iii) Likewise, the claim that MreB lacking the amino-terminus and the α2β7 hydrophobic loop "is required for polymerization" is questionable as if deleting these residues blocks membrane binding, the lack of polymers on the membrane on the grid is not unexpected, as these filaments that cannot bind the membrane would not be observable. Given these mutants cannot bind the membrane, mutant polymers could still indeed exist in solution, and thus pelleting assays should be used to test if non-membrane associated filaments composed of these mutants do or do not exist.

      Response 2.3Ciii. This is a fair point, we thank the reviewer for this remark. We did not mean to state or imply that the hydrophobic loop was required for polymerization per se, but that polymerization into double filaments only efficiently occurs upon membrane binding, which is mediated by the two hydrophobic sequences. We tested all three mutants by sedimentation as suggested by reviewer #2. In the salt condition that limits aggregation (500 mM KCl) the mutants did not pellet while the wild-type protein did (in the presence of lipids) (Fig. R2 below), in agreement with our EM data. We tested the absence of lipids on the mutant bearing the 2 deletions and observed that the (partial) sedimentation observed at low KCl concentration was ATP and lipid dependent (Fig. R3).

      Given our concerns about MreB sedimentation assays (see above, Response 2.3Ci), we prefer not to include these sedimentation data in our manuscript. Instead, we tested by TEM the possible polymerization of the mutants in solution (we only tested them in the presence of lipids in the initial submission). No filaments were detected in solution for any of the mutants (Fig. 4-S3A).

      A final note, the results shown in "Figure 1 - figure supplement 2, panel C" appear to directly refute the claim that MreB(Gs) requires lipids to polymerize. As currently written, it appears they can observe MreB(Gs) filaments on EM grids without lipids. If these experiments were done in the presence of lipids, the figure legend should be updated to indicate that. If these experiments were done in the absence of lipids, the claim that membrane association is required for MreB polymerizations should be revised.

      The TEM experiments shown were indeed performed in the presence of lipids. We apologize that this was not clearly stated in the legend. To prevent all confusion, we have nevertheless removed these images from this figure since the polymerization conditions and lipid requirement are not yet presented when this figure is referred to in the text. We have instead added a panel with the calibration curve for the size exclusion profiles as per request of reviewer #3. The main point of this figure is to show the tendency of MreBGs to aggregate: analytical size-exclusion chromatography shows a single peak corresponding to the monomeric MreBGs, molecular weight ~ 37 kDa, in our purification conditions, but it can readily shift to a peak corresponding to high MW aggregates, depending on the protein concentration and/or storage conditions.

      1. (Difference 4) - The next difference between this study and previous studies of MreB and actin homologs is the conclusion that MreB(Gs) must hydrolyze ATP in order to polymerize. This conclusion is surprising, given the fact that both T. maritima (Salje 2011, Bean 2008) and B. subtilis MreB (Mayer 2009) have been shown to polymerize in the presence of ATP as well as AMP-PNP.

      Likewise, MreB polymerization has been shown to lag ATP hydrolysis in not only T. maritima MreB (Esue 2005), eukaryotic actin, and all other prokaryotic actin homologs whose polymerization and phosphate release have been directly compared: MamK (Deng et al., 2016), AlfA (Polka et al., 2009), and two divergent ParM homologs (Garner et al., 2004; Rivera et al., 2011). Currently, the only piece of evidence supporting the idea that MreB(Gs) must hydrolyze ATP in order to polymerize comes from 2 observations: 1) using electron microscopy, they cannot see filaments of MreB(Gs) on membranes in the presence of AMP-PNP or ApCpp, and 2) no appreciable signal increase appears testing AMPPNP-MreB(Gs) using QCM-D. This evidence is by no means conclusive enough to support this bold claim: While their competition experiment does indicate AMPPNP binds to MreB(Gs), it is possible that MreB(Gs) cannot polymerize when bound to AMPPNP.

      For example, it has been shown that different actin homologs respond differently to different non-hydrolysable analogs: Some, like actin, can hydrolyze one ATP analog but not the other, while others are able to bind to many different ATP analogs but only polymerize with some of them.

      Response 2.4. We agree with the reviewer: it is uncertain which analogs bind, because they are quite different from ATP and some proteins simply do not accept them; they can also change conditions such that filaments stop forming, and thus be (theoretically) misleading. This is why we had tested ApCpp in addition to AMP-PNP as a non-hydrolysable analog (Fig. 3A). As indicated above, our new complementary experiments (Fig. 3-S1B-D) now show that some rare (i.e. infrequent and in limited amount) dual polymers are detected in the presence of ApCpp (Fig. 3A) and at high MreB concentration only in the presence of AMP-PNP (Fig. 3-S1B-D), suggesting different critical concentrations in the presence of alternative nucleotides. We have dampened our conclusions, in the light of our new data, and modified the discussion accordingly.

      Thus, to further verify their "hydrolysis is needed for polymerization" conclusion, they should:

      A. Test if a hydrolysis deficient MreB(Gs) mutant (such as D158A) is also unable to polymerize by EM.

      Response 2.4A. We thank the reviewer for this suggestion. As this conclusion has been reviewed on the basis of our new data (see previous response), testing putative ATPase deficient mutants is no longer required here. The study of ATPase mutants is planned for future studies (see Response 3.10 to reviewer #3).

      B. They also should conduct an orthogonal assay of MreB polymerization aside from EM (pelleting assays might be the easiest). They should test if polymers of ATP, AMP-PNP, and MreB(Gs)(D158A) form in solution (without membranes) by conducting pelleting assays. These could also be conducted with and without lipids, thereby also addressing the points noted above in point 3.

      Response 2.4B. Please see Response 2.3Ci above.

      C. Polymers may indeed form with ATP-gamma-S, and this non-hydrolysable ATP analog should be tested.

      Response 2.4C. It is quite possible that ATP-γ-S supports polymerization since it is known to be partially hydrolysable by actin, giving a mild phenotype (Mannherz et al, 1975). This molecule can even be a bona fide substrate for some ATPases (e.g. Peck & Herschlag, 2003). Thus, we decided to exclude this “non-hydrolysable” analog and tested instead AMP-PNP and ApCpp. We know that ATP-γ-S has been and still is frequently used, but we preferred to avoid it for the moment for the above-indicated reasons. We chose AMPPNP and AMPPCP instead because (1) they were shown to be completely non-hydrolysable by actin, in contrast to ATP-γ-S; (2) they are widely used (the most commonly used for structural studies; Lacabanne et al, 2020); (3) AMPPNP was previously used in several publications on MreB (Bean & Amann, 2008; Nurse & Marians, 2013; Pande et al., 2022; Popp et al., 2010; Salje et al, 2011; van den Ent et al., 2014) and thus would allow direct comparison. AMPPCP was added to confirm the finding with AMP-PNP. There are many other analogs that we are planning to explore in future studies (see next Response, 2.4D).

      D. They could also test how the ADP-Phosphate bound MreB(Gs) polymerizes in bulk and on membranes, using beryllium phosphate to trap MreB in the ADP-Pi state. This might allow them to further refine their model.

      Response 2.4D. We plan to address the question of the transition state in depth in follow-up work, using a series of analogs and mutants presumably affected in ATPase activity, both predicted and identified in a genetic screen. As indicated above, it is uncertain which analogs bind, because they are quite different from ATP, and some may bind but prevent filament formation. Thus, we anticipate that trying just one may not be sufficient, as they can change conditions and be (theoretically) misleading, and a thorough analysis is therefore needed to address this question. Since our model and conclusions have been revised on the basis of our new data, we believe that these experiments are beyond the scope of the current manuscript.

      E. Importantly, the Mayer study of B. subtilis MreB found the same results in regard to nucleotides, "In polymerization buffer, MreB produced phosphate in the presence of ATP and GTP, but not in ADP, AMP, GDP or AMP-PNP, or without the readdition of any nucleotide". Thus this paper should be referenced and discussed.

      Response 2.4E. We agree that Pi release was detected previously. We have added the reference (L121).

      1. (Difference 5) - The introduction states (lines 128-130) "However, the need for nucleotide binding and hydrolysis in polymerization remains unclear due to conflicting results, in vivo and in vitro, including the ability of MreB to polymerize or not in the presence of ADP or the non-hydrolysable ATP analog AMP-PNP."

      A) While this is a great way to introduce the problem, the statement is a bit vague and should be clarified, detailing the conflicting results and appropriate references. For example, what conflicting in vivo results are they referring to? Regarding "MreB polymerization in AMP-PNP", multiple groups have shown the polymerization of MreB(Tm) in the presence of AMP-PNP, but it is not clear what papers found opposing results.

      Response 2.5A. Thanks for the comment. We originally did not detail these ‘conflicting results’ in the Introduction because we were doing it later in the text, with the appropriate references, in particular in the Discussion (former L433-442). We have now removed this from the Discussion section and added a sentence in the introduction too (L123-130) quickly detailing the discrepancies and giving the references.

      • For more clarity, we have removed the “in vivo” (which referred to the distinct results reported for the presumed ATPase mutants by the Garner and Graumann groups) and focus on the in vitro discrepancies only.

      • These discrepancies are the following: while some studies did indeed show polymerization (as assessed by EM) of MreBTm in the presence of AMP-PNP, the studies from Popp et al and Esue et al on T. maritima MreB, and of Nurse et al on E. coli MreB, reported aggregation in the presence of AMP-PNP (Esue et al., 2006; Popp et al., 2010) or ADP (Nurse & Marians, 2013), or no assembly in the presence of ADP (Esue et al., 2006). As for the studies reporting polymerization in the presence of AMP-PNP by light scattering only (Bean & Amann, 2008; Gaballah et al, 2011; Mayer & Amann, 2009; Nurse & Marians, 2013), they could not differentiate between aggregates and true polymers and thus cannot be considered conclusive.

      B) The statement "However, the need for nucleotide binding and hydrolysis in polymerization remains unclear due to conflicting results, in vivo and in vitro, including the ability of MreB to polymerize or not in the presence of ADP or the non-hydrolyzable ATP analog AMP-PNP" is technically incorrect and should be rephrased or further tested.

      i. For all actin (or tubulin) family proteins, it is not that a given filament "cannot polymerize" in the presence of ADP but rather that the ADP-bound form has a higher critical concentration for polymer formation relative to the ATP-bound form. This means that the ADP-bound form can indeed polymerize, but only when the total protein exceeds the ADP critical concentration. For example, many actin-family proteins do indeed polymerize in ADP: ADP actin has a 10-fold higher critical concentration than ATP actin (Pollard, 1984), and the ADP critical concentrations of AlfA and ParM are 5-fold and 50-fold higher (respectively) than those of their ATP-bound forms (Garner et al., 2004; Polka et al., 2009).

      Response 2.5Bi. Absolutely correct. We apologize for the lack of accuracy of our phrasing and have corrected it (L123).
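
      The critical-concentration argument in point 2.5Bi can be illustrated with a minimal sketch. It assumes the simplest steady-state picture, in which no polymer exists below the critical concentration and all protein in excess of it is polymeric. The function name and all numerical values (a 0.5 µM critical concentration for the ATP-bound form and a 10-fold higher one for the ADP-bound form) are hypothetical and are not measurements from the manuscript.

```python
def polymer_concentration(total_um: float, critical_um: float) -> float:
    """Steady-state polymer mass (in µM) in the simplest critical-concentration model.

    Below the critical concentration all protein stays monomeric; above it,
    the monomer pool is clamped at the critical concentration and the
    excess protein is found in polymer.
    """
    return max(0.0, total_um - critical_um)


# Hypothetical values, for illustration only (not data from the manuscript).
CC_ATP_UM = 0.5             # assumed critical concentration, ATP-bound form
CC_ADP_UM = 10 * CC_ATP_UM  # assumed 10-fold higher value, ADP-bound form

for total in (1.0, 4.0, 8.0):
    atp = polymer_concentration(total, CC_ATP_UM)
    adp = polymer_concentration(total, CC_ADP_UM)
    print(f"total={total:4.1f} µM  ATP polymer={atp:4.1f} µM  ADP polymer={adp:4.1f} µM")
```

      Under these assumed numbers, a sample at 4 µM total protein would show polymer with ATP but none with ADP, even though the ADP-bound form polymerizes at 8 µM. This is why an absence of polymer at one concentration cannot by itself show that a nucleotide state is unable to polymerize.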

      ii. Likewise, (Mayer and Amann, 2009) have already demonstrated that B. subtilis MreB can polymerize in the presence of ADP, with a slightly higher critical concentration relative to the ATP-bound form.

      Response 2.5Bii. In Mayer and Amann, 2009, the same light scattering signal (interpreted as polymerization) occurred regardless of the nucleotide, and also in the absence of nucleotide (their Fig. 10), and ATP-, ADP- and AMP-PNP-MreB ‘displayed nearly indistinguishable critical concentrations’. They concluded that MreB polymerization is nucleotide-independent. Please see below (responses to ’Other points to address’) our extensive answer to the recurring point of reviewer #2 about the Mayer & Amann study.

      Thus, to prove that MreB(Gs) polymers do not form in the presence of ADP would require one to test a large concentration range of ADP-bound MreB(Gs). They should test if ADP-MreB(Gs) polymerizes at the highest MreB(Gs) concentrations that can be assayed. Even if this fails, it may be that MreB(Gs)-ADP polymerizes at higher concentrations than is possible with their protein preps (13 µM). An even simpler fix would be to state that MreB(Gs)-ADP filaments do not form beneath a given MreB(Gs) concentration.

      We agree with the reviewer. Our wording was overstating our conclusions. Based on our new quantifications (Fig. 3-S1B, D), we have rephrased the results section and now indicate that pairs of filaments are occasionally observed in the presence of ADP in our conditions across the range of MreB concentrations that could be tested, suggesting a higher critical concentration for MreB-ADP (L310-312). Only at the highest MreB concentration were sheet- and ribbon-like structures observed in the presence of ADP (Fig. 3-S2B).

      Other Points to address:

      1) There are several points in this paper where the work by Mayer and Amann is ignored, not cited, or readily dismissed as "hampered by aggregation" without any explanation or supporting evidence of that fact.

      We have cited the Mayer study where appropriate. However, we cannot cite it as proof of polymerization in a given condition, since their approach does not show that polymers were obtained in their conditions. Again, they based all their conclusions solely on light scattering experiments, which cannot differentiate between polymers and aggregates.

      A) Lines 100-101 - While the irregular 3-D formations seen formed by MreB in the Dersch 2020 paper could be interpreted as aggregates, stating that the results from specifically the Gaballah and Mayer papers (and not others) were "hampered by aggregation" is currently an arbitrary statement, with no evidence or backing provided. Overall, these lines (and others in the paper) dismiss these two works without giving any evidence to that point. Thus, they should provide evidence for why they believe all these papers are aggregation, or remove these (and other) dismissive statements.

      We apologize if our statements about these reports seemed dismissive or disrespectful; it was definitely not our intention. Light scattering shows an increase in the size of particles over time, but there is no way to tell if the scattering is due to organized (polymers) or disorganized (aggregation) assemblies. Thus, it cannot be considered conclusive evidence of polymerization without proof that true filaments are formed by the protein in the conditions tested, as confirmed by EM for example. MreB is known to aggregate easily (see our size exclusion chromatography profiles and those from Dersch et al, 2020, and note that no chromatography profiles were shown in the Mayer report) and, as indicated above, we had similar light scattering results for MreB for years, while only aggregates could be observed by TEM (see above Response 2.3A). Several observations also suggest that aggregation instead of polymerization might be at play in the Mayer study, for example ‘polymerization’ occurring in salt-less buffer but being ‘inhibited’ by as little as 100 mM KCl, which would rather point to “salting in” (see below). We did not intend to be dismissive, but it seemed wrong to report their conclusions as conclusive evidence. We thought that we had cited these papers where appropriate and then explained that they show no conclusive proof of polymerization and why, but it is evident that we failed to communicate this clearly. We have reworked the text to remove any dismissive or arbitrary statement about our concerns regarding these reports (e.g. L93 & L126).

      One important note - There are 2 points indicating that dismissing the Mayer and Amann work as aggregation is incorrect:

      1) the Mayer work on B. subtilis MreB shows both an ATP and a slightly higher ADP critical concentration. As the emergence of a critical concentration is a steady-state phenomenon arising from the association/dissociation of monomers (and a kinetically limiting nucleation barrier), an emergent critical concentration cannot arise from protein aggregation; critical concentrations only arise from a dynamic equilibrium between monomer and polymer.

      • Critical concentrations for ATP, ADP or AMP-PNP were described in Mayer & Amann (Mayer & Amann, 2009) as “nearly indistinguishable” (see Response 2.5Bii).
      • Protein aggregation depends on the solution (pH and ions), protein concentration and temperature. Above a certain concentration, proteins can become unstable, and thus a critical concentration for aggregation can emerge.

      2) Furthermore, Mayer observed that increased salt slowed and reduced B. subtilis MreB light scattering, the opposite of what one would expect if their "polymerization signal" was only protein aggregation, as higher salts should increase the rate of aggregation by increasing the hydrophobic effect.

      It is true that at high salt concentration proteins can precipitate, a phenomenon described as “salting out”. However, it is also true that salts help to solubilize proteins (“salting in”), and that proteins tend to precipitate in the absence of salt. Considering that the starting point of the Mayer and Amann experiment (Mayer & Amann, 2009) is the absence of salt (where they observed the highest scattering) and that they gradually reduce this scattering by increasing KCl (the scattering is almost abolished at only 100 mM!), it is plausible that a salting-in phenomenon might be at play, due to increased solubility of MreB by salt. In any case, this cannot be taken as proof that polymerization rather than aggregation occurred.

      B) Lines 113-137 - The authors reference many different studies of MreB, including both MreB on membranes and MreB polymerized in solution (which formed bundles). However, they again neglect to mention or reference the findings of Mayer and Amann (Mayer and Amann, 2009), as it was dismissed as "aggregation". As B. subtilis is also a gram-positive organism, the Mayer results should be discussed.

      We did cite the Mayer and Amann paper but, as explained above, we cannot cite this study as an example of proven polymerization. We avoided polemics in the text as much as possible and cited this paper when possible. Again, we have reworked the text to avoid any dismissive statement. Also, we had forgotten to mention this study at L121 as an example of reported ATPase activity, and this has now been corrected.

      2) Lines 387-391 state the rates of phosphate release relative to past MreB findings: "These rates of Pi release upon ATP hydrolysis (~ 1 Pi/MreB in 6 min at 53°C) are comparable to those observed for MreBTm and MreB(Ec) in vitro". While Pi release AND ATP hydrolysis have indeed both been measured for actin, this statement does not apply to MreB and should be corrected: All MreB papers thus far have only measured Pi release alone, not ATP hydrolysis at the same time. Thus, it is inaccurate to state "rates of Pi release upon ATP hydrolysis" for any MreB study, as to accurately determine the rate of Pi release, one must measure 1) the rate of polymer formation over time, 2) the rate of ATP hydrolysis, and 3) the rate of phosphate release. For MreB, no one has, so far, even measured the rates of ATP hydrolysis and phosphate release with the same sample.

      We completely agree with the reviewer, we apologize if our formulation was inaccurate. We have corrected the sentence (L479). Thank you for pointing out this mistake.

      3) The interpretation of the interactions between monomers in the MreB crystal should be more carefully stated to avoid confusion. While likely not their intention, the discussions of the crystal packing contacts of MreB can appear to assume that the monomer-monomer contacts they see in crystals represent the contacts within actual protofilaments. One cannot automatically assume the observations of monomer-monomer contacts within a crystal reflect those that arise in the actual filament (or protofilament).

      We agree, we thank the reviewer for his comments. We have revamped the corresponding paragraph.

      A) They state, "the apo form of MreBGs forms less stable protofilaments than its G- homologs." Given that filaments of the apo form of MreB(Gs) or B. subtilis MreB have never been observed in solution, this statement is not accurate: while the contacts in the crystal may change with and without nucleotide, if the protein does not form polymers in solution in the apo state, then there are no "real" apo protofilaments, and any statements about their stability become moot. Thus this statement should be rephrased or appropriately qualified.

      See above.

      B) Another example: in the apo MreB crystal, they see that the loop of domain IB makes a single salt bridge with IIA and none with IIB. This contrasts with every actin, MreB, and actin homolog studied so far, where domain IB interacts with IIB. This might reflect the real contacts of MreB(Gs) in solution, or it may simply be a crystal-packing artifact. Thus, the authors should be careful in their claims, making it clear to the reader that the contacts in the crystal may not necessarily be present in polymerized filaments.

      Again, we agree with the reviewer, we cannot draw general conclusions about the interactions between monomers from the apo form. We have rephrased this paragraph.

      4) lines 201-202 - "Polymers were only observed at a concentration of MreB above 0.55 μM (0.02 mg/mL)". Given this concentration dependence of filament formation, which appears the same throughout the paper, the authors could state that 0.55 μM is the critical concentration of MreB on membranes under their buffer conditions. Given the lack of critical concentration measurement in most of the MreB literature, this could be an important point to make in the field.

      Following the suggestion of reviewer #2, we have now estimated the critical concentration (Cc = 0.4485 µM) and reported it in the text (L218).
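
      For readers unfamiliar with how a critical concentration can be extracted from a titration, one common approach is sketched below. The response does not state which procedure was used to obtain the value above, so this should be read as an assumption-laden illustration: a polymer readout is taken to rise linearly with total protein above the critical concentration, and the x-intercept of a least-squares line is taken as the estimate. The function name and the data points are synthetic.

```python
def estimate_critical_concentration(totals_um, signals):
    """Estimate Cc as the x-intercept of a least-squares line.

    Assumes the polymer signal rises linearly with total protein
    concentration above the critical concentration.
    """
    n = len(totals_um)
    mean_x = sum(totals_um) / n
    mean_y = sum(signals) / n
    sxx = sum((x - mean_x) ** 2 for x in totals_um)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(totals_um, signals))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The fitted line crosses zero signal where total protein equals Cc.
    return -intercept / slope


# Synthetic, noise-free points generated from signal = 2.0 * (total - 0.45).
# These are NOT measurements from the manuscript.
totals = [0.6, 0.8, 1.0, 1.3]
signals = [2.0 * (t - 0.45) for t in totals]

cc = estimate_critical_concentration(totals, signals)
print(f"estimated Cc = {cc:.2f} µM")
```

      Only points above the apparent threshold should enter such a fit, since points below it carry no polymer signal and would bias the slope.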

      5) Both mg/ml and uM are used in the text and figures to refer to protein concentration. They should stick to one convention, preferably uM, as is standard in the polymer field.

      Sorry for the confusion. We have homogenized MreB concentrations to µM throughout the text and figures.
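
      The conversion between the two conventions can be sketched as follows. The molar mass used here is not stated in this response; it is inferred from the rounded pair quoted elsewhere in it (0.05 mg/mL corresponding to 1.3 µM), so it is an approximation. The function names are illustrative.

```python
# Molar mass implied by the rounded pair 0.05 mg/mL ~ 1.3 µM.
# This is an inference from the response, not a value given by the authors.
IMPLIED_MOLAR_MASS_G_PER_MOL = 0.05 / 1.3e-6  # roughly 38,500 g/mol


def mg_per_ml_to_micromolar(conc_mg_per_ml, molar_mass=IMPLIED_MOLAR_MASS_G_PER_MOL):
    """Convert mg/mL to µM.

    mg/mL is numerically equal to g/L; dividing by g/mol gives mol/L,
    and multiplying by 1e6 gives µM.
    """
    return conc_mg_per_ml / molar_mass * 1e6


def micromolar_to_mg_per_ml(conc_um, molar_mass=IMPLIED_MOLAR_MASS_G_PER_MOL):
    """Convert µM back to mg/mL (inverse of the function above)."""
    return conc_um * 1e-6 * molar_mass


for mg_ml in (0.05, 0.25):
    print(f"{mg_ml} mg/mL = {mg_per_ml_to_micromolar(mg_ml):.1f} µM")
```

      With this inferred molar mass, 0.25 mg/mL converts to 6.5 µM, which agrees with the value given in Response 3.1.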

      6) Lines 77-78 - (Teeffelen et al., 2011) should be referenced as well in regard to cell wall synthesis driving MreB motion.

      This has been corrected, sorry for omitting this reference.

      7) Line 90 - "Do they exhibit turnover (treadmill) like actin filaments?". This phrase should be modified, as turnover and treadmilling are two very different things. Turnover is the lifetime of monomers in filaments, while treadmilling entails monomer addition at one end and loss at the other. While treadmilling filaments cause turnover, there are also numerous examples of non-treadmilling filaments undergoing turnover: microtubules, intermediate filaments, and ParM. Likewise, an antiparallel filament cannot directionally treadmill, as there is no difference between the two filament ends to confer directional polarity.

      This is absolutely true, we apologize for our mistake. The sentence has been corrected (L82).

      8) Throughout the paper, the term aggregation is used occasionally to describe the polymerization shown in many previous MreB studies, almost all of which very clearly showed "bundled" filaments, very distinct entities from aggregates, as a bundle of polymers cannot form without the filaments first polymerizing on their own. As evidence for this point, polymerization has been shown to precede the bundling of MreB(Tm) by (Esue et al., 2005).

      We agree with reviewer #2 about polymers preceding bundles and “sheets”. However, we respectfully disagree that we used the word aggregation “throughout the paper” to describe structures that clearly showed polymers or sheets of filaments. A search (Ctrl-F: “aggreg”) reveals only 6 matches, 3 describing our own observations (L152, 163/5, and 1023/28), one referring to (Salje et al., 2011) (L107) but citing her claim that they observed aggregation (due to the N-terminus), and the last two (L100, L440) refer (again) to the Gaballah/Mayer/Dersch publications to say that aggregation could not be excluded in these reports as discussed above (Dersch et al., 2020; Gaballah et al., 2011; Mayer & Amann, 2009).

      9) lines 106-108 mention that "The N-terminal amphipathic helix of E. coli MreB (MreBEc) was found to be necessary for membrane binding. " This is not accurate, as Salje observed that one single helix could not cause MreB to bind to the membrane, but rather, multiple amphipathic helices were required for membrane association (Salje et al., 2011).

      Salje et al showed that in vivo the deletion of the helix abolishes the association of MreB with the membrane. This publication also shows that in vitro, addition of the helix to GFP (not to MreB) prompts binding to lipid vesicles, and that this was increased if there are 2 copies of the helix, but they could not test this directly in vitro with MreB (which is insoluble when expressed with its N-terminus). This prompted them to speculate that multiple MreBs could bind better to the membrane than monomers. However, this remained to be demonstrated. Additional hydrophobic regions in MreB such as the hydrophobic loop could participate in membrane anchoring but are absent in their in vitro assays with GFP.

      The Salje results imply that dimers (or further assemblies) of MreB drive membrane association, a point that should be discussed in regard to the question "What prompts the assembly of MreB on the inner leaflet of the cytoplasmic membrane?" posed on lines 86-87.

      We agree that this is an interesting point. As it is consistent with our results, we have incorporated it into our model (Fig. 6) and we address it in the Discussion (L573-575).

      10) On lines 414-415, it is stated, "The requirement of the membrane for polymerization is consistent with the observation that MreB polymeric assemblies in vivo are membrane-associated only." While I agree with this hypothesis, it must be noted that the presence or absence of MreB polymers in the cytoplasm has not been directly tested, as short filaments in the cytoplasm would diffuse very quickly, requiring very short exposures (<5ms) to resolve them relative to their rate of diffusion. Thus, cytoplasmic polymers might still exist but have not been tested.

      This is also an interesting point. Indeed if a nucleated form, or very short (unbundled) polymers exist in the cytoplasm, they have not been tested by fluorescence microscopy. However, the polymers that localize at the membrane (~ 200 nm), if soluble, would have been detected in the cytoplasm by the work of reviewer #2, us or others.

      11) lines 429-431 state, "but polymerization in the presence of ADP was in most cases concluded from light scattering experiments alone, so the possibility that aggregation rather than ordered polymerization occurred in the process cannot be excluded."

      A) If an increased light scattering signal is initiated by the addition of ADP (or any nucleotide), that signal must come from polymerization or multimerization. What the authors imply is that there must be some ADP-dependent "aggregation" of MreB, which has not been seen thus far for any polymer. Furthermore, why would the addition of ADP initiate aggregation?

      We did not mean that ADP itself would prompt aggregation, but that the protein would aggregate in the buffer regardless of the presence of ADP or other nucleotides. The Mayer & Amann study claims that MreB “polymerization” is nucleotide-independent, as they got identical curves with ATP, ADP, AMP-PNP and even with no nucleotides at all (Fig. 10 in their paper) (Mayer & Amann, 2009).

      Their experiments with KCl are also remarkable: when they lowered the salt they got faster and faster “polymerization”, with the strongest light scattering signal in the absence of any salt. At 75 mM KCl they obtained almost no “polymers”, and ‘polymerization was almost entirely inhibited at 100 mM’ (their Fig. 7). Yet the intracellular level of KCl in bacteria is estimated to be ~300 mM (see Response 1.1).

      B) Likewise, the statement "Differences in the purity of the nucleotide stocks used in these studies could also explain some of the discrepancies" is unexplained and confusing. How could an impurity in a nucleotide stock affect the past MreB results, and what is the precedent for this claim?

      We meant that the presence of ATP in the ADP stocks might have affected the outcome of some assays, generating the conflicting results existing in the literature. We agree this sentence was confusing, we have removed it.

      12) lines 467-469 state, "Thus, for both MreB and actin, despite hydrolyzing ATP before and after polymerization, respectively, the ADP-Pi-MreB intermediate would be the long-lived intermediate state within the filaments."

      A) For MreB, this statement is extremely speculative and unsubstantiated, as no one has measured 1) polymerization, 2) ATP hydrolysis, and 3) phosphate release. For example, it could be that ATP hydrolysis is slow, while phosphate release is fast, as is seen in the actin from Saccharomyces cerevisiae.

      We agree that this was too speculative. This has been removed from the (extensively) modified Discussion section. Thanks for the comment.

      B) For actin, the statement that ATP hydrolysis by monomers occurs "before polymerization" is functionally irrelevant, as the rate of ATP hydrolysis of actin monomers is 430,000 times slower than that of actin monomers inside filaments (Blanchoin and Pollard, 2002; Rould et al., 2006).

      We agree that the difference of hydrolysis rate between G-actin and F-actin implies that ATP hydrolysis occurs after polymerization. We are afraid that we do not follow the reviewer’s point here, we did not say or imply that ATP hydrolysis by actin monomers was functionally relevant.

      13) Lines 442-444. "On the basis of our data and the existing literature, we propose that the requirement for ATP (or GTP) hydrolysis for polymerization may be conserved for most MreBs." Again, this statement, both here and in the prior text, is an extremely bold claim, one that runs contrary to a large amount of past work on not just MreB, but also eukaryotic actin and every actin homolog studied so far. They come to this model based on 1) one piece of suggestive data (the behavior of MreB(Gs) bound to 2 non-hydrolysable ATP analogs in 500 mM KCl), and 2) the dismissal (throughout the paper) of many peer-reviewed MreB papers that run counter to their model as "aggregation" or "contaminated ATP stocks." If they want to make this bold claim that their finding invalidates the work of many labs, they must back it up with further validating experiments.

      We respectfully disagree that our model was based on “one piece of suggestive data” and backed-up by dismissing most past work in the field. We only wanted to raise awareness about the conflicting data between some reports (listed in response 2.5a), and that the claims made by some publications are to be taken with caution because they only rely on light scattering or, when TEM was performed, showed only disorganized structures.

      This said, we clearly failed in proposing our model and we are sorry to see that we really annoyed the reviewer with our suspicion that the work by Mayer & Amann reports aggregation. As indicated above, we have amended our manuscript relative to this point. We also agree that our suggestion to generalize our findings to most MreBs was unsupported and overstated, considering how confusing some results in the literature are. We have refined our model and reworked the text to take on board the reviewer’s remarks as well as the new data generated during the revision process.

      We would like to thank reviewer #2 for his in-depth review of our manuscript.  

      Reviewer #3 (Public Review):

      The major claim of the paper is that the polymerization of MreB from a Gram-positive, thermophilic bacterium depends on two factors: 1) nucleotide hydrolysis, which drives the polymerization, and 2) the lipid bilayer, which acts as a facilitator/scaffold required for hydrolysis-dependent polymerization. These two conclusions contrast with what has been known until now for the MreB proteins that have been characterized in vitro. The experiments performed in the paper do not completely justify these claims, as elaborated below.

      We understand the reviewer’s concerns in view of the existing literature on actin and Gram-negative MreBs. We may just be missing the optimal conditions for polymerisation in solution, while our phrasing gave the impression that polymers could never form in the absence of ATP or lipids. Our new data actually show that MreBGs at higher concentration can assemble into bundle- and sheet-like structures in solution and in the presence of ADP/AMP-PNP. Pairs of filaments are however only observed in the presence of lipids for all conditions tested. As indicated in the answers to the global review comments, we have included our new data in the manuscript, revised our conclusions and claims about the lipid requirement and expanded on these points in the Discussion.

      Major comments:

      1) The lack of filaments in the absence of a lipid monolayer could also be due to a higher critical concentration of polymerization for MreBGs in that condition. All the negative staining without a lipid monolayer appears to have been performed at a concentration of 0.05 mg/mL. It is important to check for polymerization of MreBGs at higher concentration ranges as well, in order to conclusively state the requirement of lipids for polymerization.

      Response 3.1. 0.05 mg/mL (1.3 µM) is our standard condition, and our leeway was limited by the rapid aggregation observed at higher MreB concentrations, as indicated in the text. We have now also tested 0.25 mg/mL (6.5 µM, the maximum concentration possible before major aggregation occurs in our experimental conditions). At this higher concentration, we see some sheet-like structures in solution, confirming a requirement of a higher concentration of MreB for polymerization in these conditions (see the answers to the global review comments for more details).

      We thank the reviewer for pushing us to address this point. We have revised our conclusions accordingly.

      2) The absence of filaments for the non-hydrolysable conditions in the lipid layer could also be because the filaments that might have formed are not binding to the planar lipid layer, and not necessarily because of their inability to polymerize.

      Response 3.2. This is a fair point. To test the possibility that polymers would form but would not bind to the lipid layer we have now added additional semi-quantitative EM controls (for both the non-hydrolysable ATP analogs and the three ‘membrane binding’ deletion mutants) testing polymerization in solution (without lipids) and also using plasma-treated grids. These showed that in our standard polymerization conditions, virtually no polymers form in solution (Fig. 3-S1B and Fig. 4-S4A). Albeit at very low frequency, some dual protofilaments were however detected in the presence of ADP or AMP-PNP at the high MreB concentration (Fig. 3-S1D). At this high MreB concentration, the sheet-like structures occasionally observed in solution in the presence of ATP were frequent in the presence of ADP and very frequent in the presence of AMP-PNP (Fig. 3-S2B). We have revised our conclusions on the basis of these new data: MreBGs can form polymeric assemblies in solution and in the absence of ATP hydrolysis at a higher critical concentration than in the presence of ATP and lipids.

      See the answers to the global review comments (point 2) and Response 2.3C to reviewer #2 for more details.

      3) Given the ATPase activity measurements, it is not very convincing that ATP rather than ADP will be present in the structure. The ATP should have been hydrolysed to ADP within the structure. The structure is now suggestive that MreB is not capable of hydrolysis, which is contradictory to the ATP hydrolysis data.

      Response 3.3. We thank the reviewer for her insightful remarks about the MreB-ATP crystal structure. The electron density map clearly demonstrates the presence of 3 phosphates. However, as suggested by the reviewer, the density that was attributed to a Mg2+ ion should instead be interpreted as a water molecule. The absence of Mg2+ in the crystal could thus explain why the ATP had not been hydrolyzed.

      References

      Arino J, Ramos J, Sychrova H (2010) Alkali metal cation transport and homeostasis in yeasts. Microbiology and molecular biology reviews 74: 95-120

      Bean GJ, Amann KJ (2008) Polymerization properties of the Thermotoga maritima actin MreB: roles of temperature, nucleotides, and ions. Biochemistry 47: 826-835

      Cayley S, Lewis BA, Guttman HJ, Record MT, Jr. (1991) Characterization of the cytoplasm of Escherichia coli K-12 as a function of external osmolarity. Implications for protein-DNA interactions in vivo. Journal of molecular biology 222: 281-300

      Dersch S, Reimold C, Stoll J, Breddermann H, Heimerl T, Defeu Soufo HJ, Graumann PL (2020) Polymerization of Bacillus subtilis MreB on a lipid membrane reveals lateral co-polymerization of MreB paralogs and strong effects of cations on filament formation. BMC Mol Cell Biol 21: 76

      Eisenstadt E (1972) Potassium content during growth and sporulation in Bacillus subtilis. Journal of bacteriology 112: 264-267

      Epstein W, Schultz SG (1965) Cation Transport in Escherichia coli: V. Regulation of cation content. J Gen Physiol 49: 221-234

      Esue O, Wirtz D, Tseng Y (2006) GTPase activity, structure, and mechanical properties of filaments assembled from bacterial cytoskeleton protein MreB. Journal of bacteriology 188: 968-976

      Gaballah A, Kloeckner A, Otten C, Sahl HG, Henrichfreise B (2011) Functional analysis of the cytoskeleton protein MreB from Chlamydophila pneumoniae. PloS one 6: e25129

      Harne S, Duret S, Pande V, Bapat M, Beven L, Gayathri P (2020) MreB5 Is a Determinant of Rod-to-Helical Transition in the Cell-Wall-less Bacterium Spiroplasma. Curr Biol 30: 4753-4762 e4757

      Kang H, Bradley MJ, McCullough BR, Pierre A, Grintsevich EE, Reisler E, De La Cruz EM (2012) Identification of cation-binding sites on actin that drive polymerization and modulate bending stiffness. Proceedings of the National Academy of Sciences of the United States of America 109: 16923-16927

      Lacabanne D, Wiegand T, Wili N, Kozlova MI, Cadalbert R, Klose D, Mulkidjanian AY, Meier BH, Bockmann A (2020) ATP Analogues for Structural Investigations: Case Studies of a DnaB Helicase and an ABC Transporter. Molecules 25

      Mannherz HG, Brehme H, Lamp U (1975) Depolymerisation of F-actin to G-actin and its repolymerisation in the presence of analogs of adenosine triphosphate. Eur J Biochem 60: 109-116

      Mayer JA, Amann KJ (2009) Assembly properties of the Bacillus subtilis actin, MreB. Cell motility and the cytoskeleton 66: 109-118

      Nurse P, Marians KJ (2013) Purification and characterization of Escherichia coli MreB protein. The Journal of biological chemistry 288: 3469-3475

      Pande V, Mitra N, Bagde SR, Srinivasan R, Gayathri P (2022) Filament organization of the bacterial actin MreB is dependent on the nucleotide state. The Journal of cell biology 221

      Peck ML, Herschlag D (2003) Adenosine 5 '-O-(3-thio)triphosphate (ATP-gamma S) is a substrate for the nucleotide hydrolysis and RNA unwinding activities of eukaryotic translation initiation factor eIF4A. Rna 9: 1180-1187

      Popp D, Narita A, Maeda K, Fujisawa T, Ghoshdastider U, Iwasa M, Maeda Y, Robinson RC (2010) Filament structure, organization, and dynamics in MreB sheets. The Journal of biological chemistry 285: 15858-15865

      Rhoads DB, Waters FB, Epstein W (1976) Cation transport in Escherichia coli. VIII. Potassium transport mutants. J Gen Physiol 67: 325-341

      Rodriguez-Navarro A (2000) Potassium transport in fungi and plants. Biochimica et biophysica acta 1469: 1-30

      Salje J, van den Ent F, de Boer P, Lowe J (2011) Direct membrane binding by bacterial actin MreB. Molecular cell 43: 478-487

      Schmidt-Nielsen B (1975) Comparative physiology of cellular ion and volume regulation. J Exp Zool 194: 207-219

      Szatmari D, Sarkany P, Kocsis B, Nagy T, Miseta A, Barko S, Longauer B, Robinson RC, Nyitrai M (2020) Intracellular ion concentrations and cation-dependent remodelling of bacterial MreB assemblies. Sci Rep-Uk 10

      van den Ent F, Izore T, Bharat TA, Johnson CM, Lowe J (2014) Bacterial actin MreB forms antiparallel double filaments. eLife 3: e02634

      Whatmore AM, Chudek JA, Reed RH (1990) The Effects of Osmotic Upshock on the Intracellular Solute Pools of Bacillus subtilis. Journal of general microbiology 136: 2527-2535

    1. Author Response

      Reviewer #1 (Public Review):

      The manuscript investigates how humans store temporal sequences of tones in working memory. The authors mainly focus on a theory named "Language of thought" (LoT). Here the structure of a stimulus sequence can be stored in a tree structure that integrates the dependencies of a stimulus stored in working memory. To investigate the LoT hypothesis, participants listened to multiple stimulus sequences that varied in complexity (e.g., alternating tones vs. nearly random sequence). Simultaneously, the authors collected fMRI or MEG data to investigate the neuronal correlates of LoT complexity in working memory. Critical analysis was based on a deviant tone that violated the stored sequence structure. Deviant detection behavior and a bracketing task allowed a behavioral analysis.

      Results showed accurate bracketing and fast/correct responses when LoT complexity is low. fMRI data showed that LoT complexity correlated with the activation of 14 clusters. MEG data showed that LoT complexity correlated mainly with activation from 100-200 ms after stimulus onset. These and other analyses presented in the manuscript lead the authors to conclude that such tone sequences are represented in human memory using LoT in contrast to alternative representations that rely on distinct memory slot representations.

      Strengths

      The study provides a concise and easily accessible introduction. The task and stimuli are well described and allow a good understanding of what participants experience while their brain activation is recorded. Results are extensive as they include multiple behavioral investigations and brain activation data from two different measurement modalities. The presentation of the behavioral results is intuitive. The analysis provided a direct comparison of the LoT with an alternative model based on estimating a transition-probability measure of surprise.

      For the fMRI data, the whole brain analysis was accompanied by detailed region of interest analyses, including time course analysis, for the activation clusters correlated with LoT complexity. In addition, the activation clusters have been set in relation (overlap and region of interest analyses) to a math and a language localizer. For the MEG data, the authors investigated the LoT complexity effect based on linear regression, including an analysis that also included transitional probabilities and multivariate decoding analysis. The discussion of the results focused on comparing the activation patterns of the task with the localizer tasks. Overall, the authors have provided considerable new data in multiple modalities on a well-designed experiment investigating how humans represent sequences in auditory working memory.

      Weaknesses

      The primary issue of the manuscript is the missing formal description of the LoT model and its alternatives, inconsistencies in the model comparisons, and the lack of a clear argument that would allow the reader to understand the selection of the alternative model. Similar to a recent paper by similar authors (Planton et al., 2021 PLOS Computational Biology), an explicit model comparison analysis would allow a much stronger conclusion. These analyses would also provide a more extensive evidence base for the favored LoT model. What is needed is a clear argument for why the transitional probabilities were identified as the best alternative model for a critical test, together with a clear description of the models (e.g., how many free parameters) and of the simulation procedure (e.g., are they trained, etc.). It would also be strongly advised to provide the scripts that allow others to reproduce the simulations.

      We thank the reviewer for the requests and critiques. Although this paper follows upon our extensive prior behavioral work (Planton et al.), we agree that it should stand alone and that therefore the models need to be described more fully. We have now added a formal description of the LoT in the subsection The Language of Thought for binary sequences in the Results section and have added a formal and verbal description of the selected sequences in Figure 1-figure supplement 1. Furthermore, we added a model comparison similar to the one done in (Planton et al., 2021 PLOS Computational Biology). This analysis is now included in Figure 2 and in the Behavioral data subsection of the Results section. It replicates previous behavioral results obtained in Planton et al., 2021 PLOS Computational Biology, namely that complexity, as measured by minimal description length in the binary version of the “language of geometry” was the best predictor of participants’ behaviour.

      Interestingly, we found that the model that considered both complexity and surprise had an even lower AIC, suggesting that statistical learning occurs simultaneously in the brain (Brain signatures of a multiscale process of sequence learning in humans, M Maheu, S Dehaene, F Meyniel - eLife, 2019). In this respect, we do not consider surprise from transition probabilities as an alternative model but rather as a mechanism that operates in parallel with sequence compression. The main goal of this work was to determine how sequence processing was affected by sequence structure, as captured by the language of thought. Accordingly, we did not select the tested sequences in order to investigate statistical learning but, instead, chose them to have similar global statistical properties.
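A minimal sketch of the kind of AIC comparison described above, assuming ordinary least-squares fits with Gaussian errors. The data values and the names `ols_rss` and `aic` are hypothetical and chosen for illustration only; this is not the analysis code or the data of the study.

```python
import math

def ols_rss(columns, y):
    """Fit y ~ intercept + columns by least squares; return the residual sum of squares."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def aic(columns, y):
    """Gaussian AIC up to an additive constant: n*ln(RSS/n) + 2k."""
    n = len(y)
    k = len(columns) + 2  # slopes + intercept + noise variance
    return n * math.log(ols_rss(columns, y) / n) + 2 * k

# Hypothetical per-sequence predictors and behavioural scores (toy values).
complexity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
surprise = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8, 0.3, 0.6, 0.5, 1.0]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 0.2, -0.15, 0.1, -0.05]
score = [0.5 * c + 2.0 * s + e for c, s, e in zip(complexity, surprise, noise)]

aic_complexity = aic([complexity], score)
aic_surprise = aic([surprise], score)
aic_both = aic([complexity, surprise], score)
print(aic_complexity, aic_surprise, aic_both)
```

In this toy data set the scores depend on both predictors, so the joint model obtains the lowest AIC despite its extra parameter.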

      The MEG experiment provided us with the opportunity to separate temporally the contributions of statistical mechanisms from the ones of sequence compression according to the language of thought. Indeed, contrary to the fMRI experiment, we could model at the item level the statistical properties of individual sounds. We report the results when accounting jointly for statistical processing and LoT-complexity in Supplementary materials.
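To illustrate what an item-level surprise regressor derived from transition probabilities can look like, here is a minimal sketch. It assumes that surprise is defined as -log2 p(item | previous item) and that probabilities are estimated from the transitions heard so far with an add-one prior; both the estimator and the function name `transition_surprise` are assumptions made for illustration, not the model used in the manuscript.

```python
import math

def transition_surprise(sequence):
    """Return -log2 p(item | previous item) for every item after the first.

    Probabilities are estimated from transitions seen so far,
    starting from an add-one prior (an assumption of this sketch).
    """
    symbols = sorted(set(sequence))
    counts = {(a, b): 1 for a in symbols for b in symbols}
    surprises = []
    for prev, cur in zip(sequence, sequence[1:]):
        total = sum(counts[(prev, s)] for s in symbols)
        surprises.append(-math.log2(counts[(prev, cur)] / total))
        counts[(prev, cur)] += 1  # update only after scoring the item
    return surprises

# Hypothetical 16-item binary sequence: strict alternation, then a final repeat.
sequence = "ABABABABABABABAA"
surprises = transition_surprise(sequence)
print([round(s, 2) for s in surprises])
```

The final item breaks the alternation that the earlier transitions established, so it receives the largest surprise in the sequence.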

      The different models considered in previous work didn’t need to be trained. The sequence complexity they provided could be analytically computed based on sequence minimal description length.

      Furthermore, the manuscript needs a clear motivation for the type of sequences and some methodological decisions. Central here is the quadratic trend selectively used for the fMRI analysis but not for the other datasets.

      To design the MEG experiment, we had to decrease the number of sequences from 10 to 7. We selected them based on their LoT complexity and the type of sequence information they spanned. As a consequence, the predictors for linear and quadratic complexity are highly correlated (82%). Unfortunately, given the low SNR, this does not allow us to robustly estimate the contribution of quadratic complexity to the MEG-recorded brain signals. Still, in response to the referee, we performed a linear regression as a function of quadratic complexity on the residuals of the regression as a function of statistics and complexity, which we report here. No significant clusters were found for habituation and standard trials, but two (corresponding to the same topography) were found for deviant trials at late time points.

      Author response image 1 shows the regression coefficients for the quadratic complexity regressor, fitted on the residuals of the regression on surprise from transition probabilities and complexity. As shown in Author response image 2, two significant clusters were found for the deviant sounds.

      We also averaged the decoding scores from Figure 7A over the time window obtained from the temporal cluster-based permutation test (see Author response image 2). The choice of complexity values did not allow any clear assessment of the contribution of the quadratic complexity term.
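The two-stage logic described above (remove the first-stage effects, then regress the residuals on the quadratic term) can be sketched as follows. This is a simplified illustration: it uses hypothetical values for 7 sequences, takes the quadratic term to be the square of complexity, and removes only a linear complexity effect in the first stage. All names and numbers are assumptions, not the study's regressors.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares slope and intercept of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Hypothetical values for 7 sequences (not the study's data).
complexity = [2.0, 4.0, 5.0, 6.0, 9.0, 12.0, 15.0]
quadratic = [c ** 2 for c in complexity]        # assumed form of the quadratic term
signal = [1.1, 1.9, 2.2, 2.4, 3.0, 3.1, 2.9]    # hypothetical brain response

# With few complexity levels, the two predictors are strongly correlated.
predictor_r = pearson(complexity, quadratic)

# Stage 1: remove the linear complexity effect.
slope1, intercept1 = fit_line(complexity, signal)
residuals = [s - (intercept1 + slope1 * c) for c, s in zip(complexity, signal)]

# Stage 2: regress the residuals on the quadratic term.
slope2, intercept2 = fit_line(quadratic, residuals)
print(predictor_r, slope2)
```

Because the quadratic regressor is not itself residualized, variance shared by the two predictors is attributed to the first stage, which makes this kind of test conservative for the quadratic term.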

      In summary, in the current design, we do not think that the number of tested sequences allows us to clearly conclude that no quadratic effect can be found for Habituation and Standard trials. We would need to re-design an experiment to test specifically the quadratic complexity contribution to brain signals in MEG.

      Author response image 1.

      Author response image 2.

      Also, the description of the linear mixed models is missing (e.g., the random effect structure, e.g., see Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious mixed models. arXiv preprint arXiv:1506.04967.). Moreover, sample sizes have not been justified by a power analysis.

      The linear mixed model considered in this work is very simple: it uses only Subject as a random variable. This is now stated clearly in the corresponding part of the Experimental procedures section:

      To test whether subject performance correlated with LoT complexity, we performed linear regressions on group-averaged data, as well as linear mixed models including participant as the (only) random factor. The random effect structure of the mixed models was kept minimal, and did not include any random slopes, to avoid the convergence issues often encountered when attempting to fit more complex models.
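One way to write the random-intercept structure described above is given below. The notation is introduced here for illustration and is an assumption, not taken from the manuscript: $y_{ij}$ is the performance of participant $i$ on sequence $j$, $c_j$ is the LoT complexity of sequence $j$, and $u_i$ is the participant-specific intercept.

```latex
y_{ij} = \beta_0 + \beta_1\, c_j + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

There is no participant-specific slope term (such as $b_i\, c_j$) in this form, which corresponds to the absence of random slopes stated in the quoted passage.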

    1. Author Response

      Reviewer #1 (Public Review):

      The actual description of the methods does not allow the reader to evaluate the precision of two important processing steps. First, rCBF measures are supposed to be restricted to the cortex, but given the pCASL image spatial resolution, partial volume effects with white matter probably exist, especially in younger infants. Furthermore, segmenting tissues on the basis of anatomical images (especially T1-weighted) is complicated in the first postnatal year. As rCBF measurements are very different between grey and white matter, the performed procedure might impact the measures at each age, or even lead to a systematic bias on age-dependent changes. Second, the methodology and accuracy of the brain registration across infants are little detailed whereas it is a challenging aspect given the intense brain growth and folding, the changing contrast in T1w images at these ages, and the importance of this step to perform reliable voxelwise comparison across ages.

      We thank the reviewer for this comment. We have added more description to the Methods to address it. Briefly, individual rCBF maps were generated in individual space and calibrated by phase contrast MRI to minimize individual variations in processing parameters such as the T1 of arterial blood (Aslan et al., 2010). Cortical segmentation was also conducted in individual space. The different image types, including the rCBF map and the gray matter segmentation probability map in individual space, were then normalized into the template space. An averaged gray matter probability map was generated after inter-subject normalization. After carefully testing multiple thresholds on the averaged gray matter probability map, we used a 40% probability threshold, which minimized contamination from white matter and CSF while preserving the continuity of the cortical gray matter mask across the cerebral cortex, to generate the binary gray matter mask shown in the left panel of Figure R1 below. Despite the poor contrast and poor cortical segmentation of T1-weighted images in younger infants, rightly pointed out by the reviewer, this was compensated for by using the averaged cortical mask and measuring rCBF in the template space. As demonstrated in the three right panels of Figure R1, the rCBF measurement within the cortical mask in the template space is consistent across ages, supporting accurate and reliable voxelwise comparison across ages.

      Figure R1. The gray matter mask and segmented cortical mask overlaid on rCBF map of three representative infants aged 3, 6, and 20 months in the template space. The gray matter mask in the left panel was created to minimize the contamination of white matter and CSF while keeping the continuity of the cortical gray matter mask across the cerebral cortex. The contour of the gray matter mask is highlighted with a blue line.
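The masking steps above (average the normalized probability maps, threshold at 40%, then measure rCBF inside the resulting mask) can be sketched on toy data. Images are represented as flat lists of voxel values, and every number below is hypothetical. Whether the cutoff is inclusive is not stated in the response, so the `>=` comparison is an assumption.

```python
def average_maps(maps):
    """Voxelwise mean of several equally sized maps (flat lists of voxels)."""
    n = len(maps)
    return [sum(voxel_values) / n for voxel_values in zip(*maps)]

def binary_mask(prob_map, threshold=0.4):
    """True where the averaged gray matter probability reaches the threshold."""
    return [p >= threshold for p in prob_map]

def masked_mean(values, mask):
    """Mean of the voxel values that fall inside the mask."""
    selected = [v for v, keep in zip(values, mask) if keep]
    return sum(selected) / len(selected)

# Hypothetical gray matter probability maps of three infants in template space.
gm_prob = [
    [0.9, 0.8, 0.6, 0.3, 0.1, 0.0],
    [0.8, 0.7, 0.5, 0.2, 0.2, 0.1],
    [0.7, 0.9, 0.4, 0.1, 0.0, 0.0],
]
mask = binary_mask(average_maps(gm_prob), threshold=0.4)

# Hypothetical rCBF map of one infant in the same space.
rcbf = [60.0, 58.0, 55.0, 25.0, 22.0, 5.0]
cortical_rcbf = masked_mean(rcbf, mask)
print(mask, cortical_rcbf)
```

Using one group-level mask means that every infant contributes the same set of voxels, which is what allows the voxelwise comparison across ages.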

      The authors achieved their aim in showing that the rCBF increase differs across brain regions (the DMN showing intense changes compared to the visual and sensorimotor networks). Nevertheless, an analysis of covariance (instead of an ANOVA) including the infants' age as covariate (in addition to the brain region) would have allowed them to evaluate the interaction between age and region (i.e. different slopes of age-related changes across regions) in a more rigorous manner. Regarding the evaluation of the coupling between physiological (rCBF) and functional connectivity measures, the results only partly support the authors' conclusion. Actually, both measures strongly depend on the infants' age, as the authors highlight in the first parts of the study. Thus, considering this common age dependency would be required to show that the physiological and connectivity measurements are specifically related and that there is indeed a coupling.

      We thank the reviewer for this comment. Following the reviewer’s suggestion, we conducted an analysis of covariance (ANCOVA) with age as a covariate and found a significant interaction between region and age (F(6, 322) = 2.45, p < 0.05). This ANCOVA result is consistent with Figure 3c, which shows differential rCBF increase rates across brain regions. The ANCOVA result was added to the last paragraph of the Results section “Faster rCBF increases in the DMN hub regions during infant brain development”.

      Regarding the evaluation of the coupling between physiological (rCBF) and functional connectivity (FC) measures, Figure 5 and Figure 5–figure supplements 1 and 2 were generated precisely to test that the FC-rCBF coupling, which is specifically localized in the DMN, is not due to mutual age dependency. Briefly, Figure 5B demonstrates that significant correlations cluster only in the DMN regions, using the correlation method illustrated in Figure 5-figure supplement 1. Furthermore, nonparametric permutation tests with 10,000 permutations were conducted. Such permutation tests are sensitive and effective, and Figure 5c reveals significant coupling only in the DMN regions. If the coupling were driven by mutual age dependency, Figure 5c would show significant coupling in the Vis and SM network regions too.
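An ANCOVA with a region-by-age interaction can be written in the following generic form. The notation is an assumption introduced for illustration: $\mathrm{rCBF}_{ir}$ is the measurement of infant $i$ in region $r$, $\alpha_r$ is the region effect, $\beta$ is the common age slope, and $\gamma_r$ is the region-specific deviation from that slope.

```latex
\mathrm{rCBF}_{ir} = \mu + \alpha_r + \beta\,\mathrm{age}_i + \gamma_r\,\mathrm{age}_i + \varepsilon_{ir},
\qquad H_0:\ \gamma_r = 0 \ \text{for all } r
```

Rejecting $H_0$ indicates that the slope of rCBF against age differs between regions, which is the interaction reported above.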
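For readers unfamiliar with nonparametric permutation testing of a correlation, a generic sketch follows. It shuffles one variable across subjects to build a null distribution for the correlation coefficient. The data are hypothetical, and the procedure is a textbook version rather than the exact voxelwise method of the manuscript, which is described in Figure 5-figure supplement 1.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_pvalue(x, y, n_perm=10000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # break the pairing between x and y
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed statistic as one permutation.
    return (hits + 1) / (n_perm + 1)

# Hypothetical FC and rCBF values for 12 infants in one region.
fc = [0.10, 0.15, 0.22, 0.25, 0.31, 0.38, 0.40, 0.47, 0.52, 0.58, 0.61, 0.70]
rcbf = [21.0, 24.5, 26.0, 30.2, 31.5, 36.0, 38.4, 41.0, 45.3, 47.9, 50.1, 55.0]
p_value = permutation_pvalue(fc, rcbf)
print(p_value)
```

The smallest p-value this procedure can return is 1 / (n_perm + 1), so the number of permutations sets the resolution of the test.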

    1. Author Response

      Reviewer #1 (Public Review):

      Briggs et al use a combination of mathematical modelling and experimental validation to tease apart the contributions of metabolic and electronic coupling to the pancreatic beta cell functional network. A number of recent studies have shown the existence of functional beta cell subpopulations, some of which are difficult to fully reconcile with established electrophysiological theory. More generally, the contribution of beta cell heterogeneity (metabolism, differentiation, proliferation, activity) to islet function cannot be explained by existing combined metabolic/electrical oscillator models. The present studies are thus timely in modelling the islet electrical (structural) and functional networks. Importantly, the authors show that metabolic coupling primarily drives the islet functional network, giving rise to beta cell subpopulations. The studies, however, do not diminish the critical role of electrical coupling in dictating glucose responsiveness, network extent as well as longer-range synchronization. As such, the studies show that islet structural and functional networks both act to drive islet activity, and that conclusions on the islet structural network should not be made using measures of the functional network (and vice versa).

      Strengths:

      • State-of-the-art multi-parameter modelling encompassing electrical and metabolic components.

      • Experimental validation using advanced FRAP imaging techniques, as well as Ca2+ data from relevant gap junction KO animals.

      • Well-balanced arguments that frame metabolic and electrical coupling as essential contributors to islet function.

      • Likely to change how the field models functional connectivity and beta cell heterogeneity.

      Weaknesses:

      • Limitations of FRAP and electrophysiological gap junction measures not considered.

      • Limitations of Cx36 (gap junction) KO animals not considered.

      • Accuracy of citations should be improved in a few cases.

      We thank reviewer 1 for their positive comments, including the many strengths in the approaches, arguments and impact. We do note the weaknesses raised by the reviewer and have addressed them following the comments below.

      We would also like to note that when we refer to metabolic activity driving the functional network, we are not referring to metabolic coupling between beta cells. Rather, we mean that two cells that show either high levels of metabolic activity (glycolytic flux) or similar levels of metabolic activity will show increased synchronization, and thus a functional network edge, as compared to cells with elevated gap junction conductance. Increased metabolic activity would likely generate increased depolarizing currents that will provide an increased coupling current to drive synchronization, whereas similar metabolic activity would mean that a given coupling current could more readily drive synchronized activity. We have substantially rewritten the manuscript to clarify this point.

      Reviewer #2 (Public Review):

      In their present work, Briggs et al. combine biophysical simulations and experimental recordings of beta cell activity with analyses of functional network parameters to determine the role played by gap-junctional coupling, metabolism, and KATP conductance in defining the functional roles that the cells play in the functional networks, assess the structure-function relationship, and to resolve an important current open question in the field on the role of so-called hub cells in islets of Langerhans.

      Combining differential equation-based simulations on 1000 coupled cells with demanding calcium, NADPH, and FRAP imaging, as well as with advanced network analyses, and then comparing the network metrics with simulated and experimentally determined properties is an achievement in its own right and a major methodological strength. The findings have the potential to help resolve the issue of the importance of hub cells in beta cell networks, and the methodological pipeline and data may prove invaluable for other researchers in the community.

      However, methodologically, functional networks may be based on different types of calcium oscillations present in beta cells, i.e., fast oscillations produced by bursts of electrical activity, slow oscillations produced by metabolic/glycolytic oscillations, or a mixture of both. At present, the authors base the network analyses on fast oscillations only in the case of simulated traces and on a mixture of fast and slow oscillations in the case of experimental traces. Since different networks may depend on the studied beta cell properties to a different extent (e.g., fast oscillation-based networks may depend more strongly on electrical properties and slow oscillation-based networks may depend more strongly on metabolic properties), it is important that in drawing the conclusions the authors separately address the influence of a cell's electrical and metabolic properties on its functional role in the network based on fast oscillations, slow oscillations, or a mixture of both.

      We thank reviewer 2 for their positive comments, including addressing the importance of this study as it pertains to islet biology and acknowledging methodological complexities of this study. We also thank the reviewer for their careful reading and providing useful comments. We have integrated each comment into the manuscript. Most importantly, we have now extended our analysis to both fast and slow oscillations by incorporating an additional mathematical model of coupled slow oscillations and performing additional experimental analysis of fast, slow, and mixed oscillations.

      Reviewer #3 (Public Review):

      Over the past decade, novel approaches to understanding beta cell connectivity and how that contributes to the overall function of the pancreatic islet have emerged. The application of network theory to beta cell connectivity has been an extremely useful tool to understand functional hierarchies amongst beta cells within an islet. This helps to provide functional relevance to observations from structural and gene expression data that beta cells are not all identical.

      There are a number of "controversies" in this field that have arisen from the mathematical and subsequent experimental identification of beta "hub" cells. These are small populations of beta cells that are very highly connected to other beta cells, as assessed by applying correlation statistics to individual beta cell calcium traces across the islet.

      In this paper Briggs et al set out to answer the following areas of debate:

      They use computational datasets, based on established models of beta cells acting in concert (electrically coupled) within an islet-like structure, to show that it is similarities in metabolic parameters rather than "structural" connections (ie proximity which subserves gap junction coupling) that drive functional network behaviour. Whilst the computational models are quite relevant, the fact that the parameters (eg connectivity coefficients) are quite different to what is measured experimentally confirms the limitations of this model. Therefore it was important for the authors to back up this finding by performing both calcium and metabolic imaging of islet beta cells. These experimental data are reported to confirm that metabolic coupling was more strongly related to functional connectivity than gap junction coupling. However, a limitation here is that the metabolic imaging data confirmed a strong link between disconnected beta cells and low metabolic coupling but did not robustly show the opposite. Similarly, I was not convinced that the FRAP studies, which indirectly measured GJ ("structural") connections, were powered well enough to be related to measures of beta cell connectivity.

      The group goes on to provide further analytical and experimental data with a model of increasing loss of GJ connectivity (by calcium imaging islets from WT, heterozygous (50% GJ loss), and homozygous (100% loss) animals). Given the former conclusion that it was metabolic not GJ connectivity that drives small world network behaviour, it was surprising to see such a great effect on the loss of hubs in the homs. That said, the analytical approaches in this model did help the authors confirm that the loss of gap junctions does not alter the preferential existence of beta cell connectivity and confirms the important contribution of metabolic "coupling". One perhaps can therefore conclude that there are two types of network behaviour in an islet (maybe more) and the field should move towards an understanding of overlapping network communities as has been done in brain networks.

      Overall this is an extremely well-written paper which was a pleasure to read. This group has neatly and expertly provided both computational and experimental data to support the notion that it is metabolic but not "structural" ie GJ coupling that drives our observations of hubs and functional connectivity. However, there is still much work to do to understand whether this metabolic coupling is just a random epiphenomenon or somehow fated, the extent to which other elements of "structural" coupling - ie the presence of other endocrine cell types, the spatial distribution of paracrine hormone receptors, blood vessels and nerve terminals are also important.

      We thank reviewer 3 for their positive comments, including the methodology, writing style, and the importance of this paper to the broader islet community. We thank the reviewer for their very in-depth and helpful comments. We have addressed each comment below and made significant changes to the manuscript accordingly. We conducted more FRAP experiments and separated results into slow, fast, and mixed oscillations. We included analysis of an additional computational model that simulates slow calcium oscillations. Additionally, we substantially rewrote the paper to clarify that we are not referring to metabolic coupling and to speak to the broader implications of network theory and our findings.

      Reviewer #4 (Public Review):

      This manuscript describes a complex, highly ambitious set of modeling and experimental studies that appear designed to compare the structural and functional properties of beta cell subpopulations within the islet network in terms of their influence on network synchronization. The authors conclude that the most functionally coupled cell subpopulations in the islet network are not those that are most structurally coupled via gap junctions but those that are most metabolically active.

      Strengths of the paper include (1) its use of an interdisciplinary collection of methods including computer simulations, FRAP to monitor functional coupling by gap junctions, the monitoring of Ca2+ oscillations in single beta cells embedded in the network, and the use of sophisticated approaches from probability theory. Most of these methods have been used and validated previously. Unfortunately, however, it was not clear what the underlying premise of the paper actually is, despite many stated intentions, nor what about it is new compared to previous studies, an additional weakness.

      Although the authors state that they are trying to answer 3 critical questions, it was not clear how important these questions are in terms of significance for the field. For example, they state that a major controversy in the field is whether network structure or network function mediates functional synchronization of beta cells within the islet. However, this question is not much debated. As an example, while it is known that there can be long-range functional coupling in islets, no workers in the field believe there is a physical structure within islets that mediates this, unlike the case for CNS neurons that are known to have long projections onto other neurons. Beta cells within the islets are locally coupled via gap junctions, as stated repeatedly by the authors but these mediate short-range coupling. Thus, there are clearly functional correlations over long ranges but no structures, only correlated activity. This weakness raises questions about the overall significance of the work, especially as it seems to reiterate ideas presented previously.

      We thank reviewer 4 for their positive comments, including our multidisciplinary use of mathematical models and experimental imaging techniques. We have now included an additional model of slow oscillations (the Integrated Oscillator Model) to improve our conclusions. We also thank reviewer 4 for the insightful comments. We have carefully reviewed each comment and made significant changes to the manuscript accordingly. In particular, we have significantly rewritten the introduction and discussion to clarify what is new in our manuscript and what has been shown previously. Additionally, we agree with the reviewer's sentiment that there is little debate over whether, for example, there are physical structures within the islet that mediate long-range functional connections. However, there is current debate over whether functional beta-cell subpopulations can dictate islet dynamics (see [11]–[13]). This debate can be framed by asking whether these functional subpopulations emerge from the islet due to physical connections (structural network) or something more nuanced (such as intrinsic dynamics). We have reframed the introduction and discussion to clarify this debate as well as to state the premise of the paper more clearly.

      Specific Comments

      1). The authors state it is well accepted that the disruption of gap junctional coupling is a pathophysiological characteristic of diabetes, but this is not an opinion widely accepted by the field, although it has been proposed. The authors should scale back on such generalizations, or provide more compelling evidence to support such a claim.

      Thank you for pointing this out. We have provided more specific citations and changed the wording from “well accepted” to “has been documented”. See Discussion page 13 lines 415-416.

      2) The paper relies heavily on simulations performed using a version of the model of Cha et al (2011). While this is a reasonable model of fast bursting (e.g. oscillations having periods <1 min.), the Ca2+ oscillations that were recorded by the authors and shown in Fig. 2b of the manuscript are slow oscillations with periods of 5 min and not <1 min, which is a weakness of the model in the current context. Furthermore, the model outputs that are shown lack the well-known characteristics seen in real islets, such as fast-spiking occurring on prolonged plateaus, again as can be seen by comparing the simulated oscillations shown in Fig. 1d with those in Fig. 2b. It is recommended that the simulations be repeated using a more appropriate model of slow oscillations or at least using the model of Cha et al but employed to simulate slower bursting.

      The reviewer raises an important point and caveat associated with our simulated model and experimental data. This point was also made by other reviewers, and a similar response to this comment can be found elsewhere in response to reviewer 2 point 6. To address this comment, we have performed several additional experiments and analyses:

      1) We collected additional Ca2+ data (to identify the functional network and hubs) and FRAP data (to assess gap junction permeability) in islets that show either pure slow, pure fast, or mixed oscillations. We generated networks based on each time scale to compare with the FRAP gap junction permeability data. We found the conclusions of our first draft to be consistent across all oscillation types: there was no relationship between gap junction conductance, as approximated using FRAP, and normalized degree for slow (Figure 3j), fast (Figure 3 Supp 1d,e), or mixed (Figure 3 Supp 1g,h) oscillations. We also discuss these conclusions; see Results page 7 lines 184-186 and lines 188-191, Discussion page 12 lines 357-360.

      2) We also performed additional simulations with a coupled ‘Integrated Oscillator Model’ (IOM), which shows slow oscillations because of metabolic oscillations (Figure 2). We compared connectivity with gap junction coupling and underlying cell parameters. In this case, there is an association between functional and structural networks, with highly-connected hub cells showing higher gap junction conductance (Figure 2f) but also low KATP channel conductance (gKATP) (Figure 2e). However, there are some caveats to these findings: given the nature of the IOM, we were limited to simulating smaller islets (260 cells), and less heterogeneity in the calcium traces was observed. Additional analysis suggests that the greater association between functional and structural networks in this model was a result of the smaller islets, and that the association was dependent on threshold and therefore less robust (unlike in the Cha-Noma fast oscillator model). These limitations and results are discussed further (Discussion page 11 lines 344-354).

      Additionally, in the IOM, the underlying cell dynamics of highly-connected hub cells are differentiated by KATP channel conductance (gKATP), whereas in the fast oscillator model they are differentiated by metabolism (kglyc). However, this difference between models can be linked to differences in the way duty cycle is influenced by gKATP and kglyc (Figure 1h, Figure 2g). In each model there was a similar association between duty cycle and highly-connected hub cells. We also discuss these findings (Discussion page 11 lines 334-343).

      Overall these results and discussion with respect to the coupled IOM oscillator model can be found in Figure 2, Results page 6 lines 128-156 and Discussion page 11 lines 332-354.
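
      As an illustration of how a correlation-based functional network and a normalized degree might be computed from calcium traces, the following is a minimal sketch. The use of Pearson correlation, the threshold value, the normalization by the maximum degree, and the toy traces are all assumptions made for illustration; they are not taken from the manuscript.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length traces (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def normalized_degree(traces, threshold):
    """Connect cell pairs whose correlation is >= threshold; return degree / max degree."""
    n = len(traces)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if pearson(traces[i], traces[j]) >= threshold:
                degree[i] += 1
                degree[j] += 1
    max_degree = max(degree)
    if max_degree == 0:
        return [0.0] * n
    return [d / max_degree for d in degree]


# Hypothetical calcium traces: cells 0-2 oscillate in phase, cell 3 in anti-phase.
toy_traces = [
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.9, 0.1, 1.0, 0.0, 0.8],
    [0.1, 1.0, 0.0, 0.9, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
]
toy_degrees = normalized_degree(toy_traces, threshold=0.9)
```

      Because the edge set depends directly on `threshold`, rerunning such a sketch across a range of thresholds is one simple way to probe the kind of threshold dependence mentioned above.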

      3) Much of the data analyzed whether obtained via simulation or through experiment seems to produce very small differences in the actual numbers obtained, as can be seen in the bar graphs shown in Figs. 1e,g for example (obtained from simulations), or Fig. 2j (obtained from experimental measurements). The authors should comment as to why such small differences are often seen as a result of their analyses throughout the manuscript and why also in many cases the observed variance is high. Related to the data shown, very few dots are shown in Figs. 1eg or Fig 4e and 4h even though these points were derived from simulations where 100s of runs could be carried out and many more points obtained for plotting. These are weaknesses unless specific and convincing explanations are provided.

      We thank the reviewer for these comments, which are similar to those of reviewer 2 (point 4) and reviewer 3 (point 6). Indeed there is some variability between cells in both simulations and experiments related to the metabolic activity in hubs and non-hubs. The variability points to potentially other factors being involved in determining hubs beyond simply kglyc, including a minor role for gap junction coupling structural network and potentially cell position and other intrinsic factors. We now discuss this point – see Discussion page 12 lines 364-266.

      The differences between hubs and non-hubs appear small because the value of kglyc is very small. For Figure 1e, the average kglyc for non-hubs was 1.26 × 10⁻⁴ s⁻¹ (which is the average of the distribution because most cells are non-hubs), while the average kglyc for hubs was 1.4 × 10⁻⁴ s⁻¹, which is about half of a standard deviation higher. The paired t-test controls for the small value of average kglyc.

      For simulation data, each of the 5 dots corresponds to a simulated islet averaged over 1000 cells (or 260 cells for the coupled IOM). Generating such data is computationally expensive, so it is not feasible to conduct 100s of runs. Again, we note that the comparisons between hubs and non-hubs are paired, and we find statistically significant differences for kglyc in Figure 1 using only 5 paired data points. That we find these differences indicates the substantial difference between hubs and non-hubs. This is further supported by all effect sizes being greater than 0.8 for all significantly different findings (Cha-Noma: kglyc: 2.85, gcoup: 0.82; IOM: gKATP: 1.27, gcoup: 2.94). We have included these effect sizes in the captions of Figures 1 and 2 (pages 34, 36).
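
      To make the paired comparison concrete, here is a minimal sketch of a paired t statistic and a paired effect size computed from five per-islet averages. The numbers are hypothetical placeholders, and the effect-size definition used here (mean paired difference divided by the standard deviation of the differences) is an assumption, since the response does not state which definition was used.

```python
import math
import statistics


def paired_t_and_effect_size(group_a, group_b):
    """Return (t statistic, degrees of freedom, paired effect size d)."""
    if len(group_a) != len(group_b) or len(group_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(group_a, group_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    effect_size = mean_diff / sd_diff  # assumed definition, see lead-in
    return t_stat, n - 1, effect_size


# Hypothetical per-islet averages of kglyc (s^-1), one pair per simulated islet.
hub_means = [1.41e-4, 1.39e-4, 1.42e-4, 1.38e-4, 1.40e-4]
nonhub_means = [1.26e-4, 1.25e-4, 1.27e-4, 1.26e-4, 1.26e-4]
t_stat, dof, effect_size = paired_t_and_effect_size(hub_means, nonhub_means)
```

      Because both quantities are built from the within-islet differences, the absolute magnitude of kglyc cancels out: rescaling all values by a constant leaves the t statistic and the effect size unchanged.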

      To consider all of the available data rather than the average across an entire islet, we created a kernel density estimate of kglyc for hubs and non-hubs by concatenating every single cell in each of the five islets. A Kolmogorov-Smirnov test (kstest) shows a highly significant difference (P<0.0001) between these two distributions.
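
      For readers unfamiliar with the test, the statistic behind a two-sample Kolmogorov-Smirnov comparison is the largest gap between the two empirical cumulative distributions. The sketch below computes only that statistic, using the standard library; it does not compute a p-value (a library routine such as `scipy.stats.ks_2samp` would be needed for that), and the function name is hypothetical.

```python
import bisect


def ks_two_sample_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d_max = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at a data point.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max
```

      A statistic of 0 means the two pooled samples have identical empirical distributions, and a statistic of 1 means they do not overlap at all.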

      Author response image 1.

      4) The data shown in Fig. 4i,j are intended to compare long-range synchronization at different distances along a string of coupled cells but the difference between the synchronized and unsynchronized cells for gcoup and Kglyc was subtle, very much so.

      Thank you for pointing out these subtle differences. The y-axis scale for i and j is broad to allow us to represent all distances on a single plot. After correction for multiple comparisons, the differences were still statistically significant. As the reviewer mentioned in point 3, each plot contains only five data points, each of which represents the average of a single simulated islet; therefore, we are not concerned about statistical significance arising from too large a sample size. We also checked the differences between synchronized and non-synchronized cell pairs in figure 4 panels e and h (now figure 5 e, h). These are the same data as i and j but normalized such that all of the distances could be averaged together. We again found statistically significant differences between synchronized and non-synchronized cell pairs. As can be seen in Author response image 2, the difference between synchronized and non-synchronized cell pairs is greater than the variability between simulated islets. Thus, in this case the variability is not substantial.

      Author response image 2.

      5) The data shown in Fig. 5 for Cx36 knockout islets are used to assess the influence of gap junctional coupling, which is reasonable, but it would be reassuring to know that loss of this gene has no effects on the expression of other genes in the beta cell, especially genes involved with glucose metabolism.

      This is an important point. Previous studies found no significant change in NAD(P)H in Cx36-deficient islets – see Benninger et al J.Physiol 2011 [14]. Islet architecture is also retained. Further, the insulin secretory response of dissociated Cx36 knockout beta cells is the same as that of dissociated wildtype beta cells, indicating no significant defect in the intrinsic ability of the beta cell to release insulin – see Benninger et al J.Physiol 2011 [14]. We now mention these findings in the discussion. See Discussion page 14 lines 459-464.

      6) In many places throughout the paper, it is difficult to ascertain whether what is being shown is new vs. what has been shown previously in other studies. The paper would thus benefit strongly from added text highlighting the novelty here and not just restating what is known, for instance, that islets can exhibit small-world network properties. This detracts from the strengths of the paper and further makes it difficult to wade through. Even the finding here that metabolic characteristics of the beta cells can infer profound and influential functional coupling is not new, as the authors proposed as much many years ago. Again, this makes it difficult to distill what is new compared to what is mainly just being confirmed here, albeit using different methods.

      Thank you for the suggestion. We have made significant modifications throughout the Introduction, Discussion and Results to be clearer about what is known from previous work and what is newly found in this manuscript.

      Reviewer #5 (Public Review):

      The authors use state-of-the-art computation, experiment, and current network analysis to try and disaggregate the impact of cellular metabolism driving cellular excitability and structural electrical connections through gap junctions on islet synchronization. They perform interesting simulations with a sophisticated mathematical model and compare them with closely associated experiments. This close association is impressive and is an excellent example of using mathematics to inform experiments and experimental results. The current conclusions, however, appear beyond the results presented. The use of functional connectivity is based on correlated calcium traces but is largely without an understood biophysical mechanism. This work aims to clarify such a mechanism between metabolism and structural connection and comes out on the side of metabolism driving the functional connectivity, but both are required and more nuanced conclusions should be drawn.

      We thank reviewer 5 for their positive comments, including on our multifaceted experimental and computational techniques. We also found the reviewer's careful reading and thoughtful comments very helpful, and we have worked to integrate each comment into our manuscript. It is evident from the reviewer's comments that we did not clearly explain what was meant by our conclusions concerning the functional network reflecting metabolism rather than gap junctions. We have rewritten the manuscript substantially to show that we are not concluding that communication (metabolic or electric) occurs through conduits other than gap junctions. Rather, our data suggest that the functional network (which reflects calcium synchronization) reflects the intrinsic dynamics of the cells, which include metabolic rates, more than individual gap junction connections.

      References referred to in this response to reviewers document:

      [1] A. Stožer et al., “Functional connectivity in islets of Langerhans from mouse pancreas tissue slices,” PLoS Comput Biol, vol. 9, no. 2, p. e1002923, 2013.

      [2] N. L. Farnsworth, A. Hemmati, M. Pozzoli, and R. K. Benninger, “Fluorescence recovery after photobleaching reveals regulation and distribution of connexin36 gap junction coupling within mouse islets of Langerhans,” The Journal of physiology, vol. 592, no. 20, pp. 4431–4446, 2014.

      [3] C.-L. Lei, J. A. Kellard, M. Hara, J. D. Johnson, B. Rodriguez, and L. J. Briant, “Beta-cell hubs maintain Ca2+ oscillations in human and mouse islet simulations,” Islets, vol. 10, no. 4, pp. 151–167, 2018.

      [4] N. R. Johnston et al., “Beta cell hubs dictate pancreatic islet responses to glucose,” Cell metabolism, vol. 24, no. 3, pp. 389–401, 2016.

      [5] V. Kravets et al., “Functional architecture of pancreatic islets identifies a population of first responder cells that drive the first-phase calcium response,” PLoS Biology, vol. 20, no. 9, p. e3001761, 2022.

      [6] H. Ren et al., “Pancreatic α and β cells are globally phase-locked,” Nature Communications, vol. 13, no. 1, p. 3721, 2022.

      [7] A. Stožer et al., “From Isles of Königsberg to Islets of Langerhans: Examining the function of the endocrine pancreas through network science,” Frontiers in Endocrinology, vol. 13, p. 922640, 2022.

      [8] J. Zmazek et al., “Assessing different temporal scales of calcium dynamics in networks of beta cell populations,” Frontiers in physiology, vol. 12, p. 337, 2021.

      [9] M. E. Corezola do Amaral et al., “Caloric restriction recovers impaired β-cell-β-cell gap junction coupling, calcium oscillation coordination, and insulin secretion in prediabetic mice,” American Journal of Physiology-Endocrinology and Metabolism, vol. 319, no. 4, pp. E709–E720, 2020.

      [10] J. M. Dwulet, J. K. Briggs, and R. K. P. Benninger, “Small subpopulations of beta-cells do not drive islet oscillatory [Ca2+] dynamics via gap junction communication,” PLOS Computational Biology, vol. 17, no. 5, p. e1008948, May 2021, doi: 10.1371/journal.pcbi.1008948.

      [11] B. E. Peercy and A. S. Sherman, “Do oscillations in pancreatic islets require pacemaker cells?,” Journal of Biosciences, vol. 47, no. 1, pp. 1–11, 2022.

      [12] G. A. Rutter, N. Ninov, V. Salem, and D. J. Hodson, “Comment on Satin et al.‘Take me to your leader’: an electrophysiological appraisal of the role of hub cells in pancreatic islets. Diabetes 2020; 69: 830–836,” Diabetes, vol. 69, no. 9, pp. e10–e11, 2020.

      [13] L. S. Satin and P. Rorsman, “Response to comment on satin et al.‘Take me to your leader’: An electrophysiological appraisal of the role of hub cells in pancreatic islets. Diabetes 2020; 69: 830–836,” Diabetes, vol. 69, no. 9, pp. e12–e13, 2020.

      [14] R. K. Benninger, W. S. Head, M. Zhang, L. S. Satin, and D. W. Piston, “Gap junctions and other mechanisms of cell–cell communication regulate basal insulin secretion in the pancreatic islet,” The Journal of physiology, vol. 589, no. 22, pp. 5453–5466, 2011.

      [15] R. Fried, Erectile dysfunction as a cardiovascular impairment. Academic Press, 2014.

      [16] T. Pipatpolkai, S. Usher, P. J. Stansfeld, and F. M. Ashcroft, “New insights into KATP channel gene mutations and neonatal diabetes mellitus,” Nature Reviews Endocrinology, vol. 16, no. 7, pp. 378–393, 2020.

      [17] A. M. Notary, M. J. Westacott, T. H. Hraha, M. Pozzoli, and R. K. P. Benninger, “Decreases in Gap Junction Coupling Recovers Ca2+ and Insulin Secretion in Neonatal Diabetes Mellitus, Dependent on Beta Cell Heterogeneity and Noise,” PLOS Computational Biology, vol. 12, no. 9, p. e1005116, Sep. 2016, doi: 10.1371/journal.pcbi.1005116.

      [18] J. V. Rocheleau, G. M. Walker, W. S. Head, O. P. McGuinness, and D. W. Piston, “Microfluidic glucose stimulation reveals limited coordination of intracellular Ca2+ activity oscillations in pancreatic islets,” Proceedings of the National Academy of Sciences, vol. 101, no. 35, pp. 12899–12903, 2004.

      [19] R. K. Benninger, M. Zhang, W. S. Head, L. S. Satin, and D. W. Piston, “Gap junction coupling and calcium waves in the pancreatic islet,” Biophysical journal, vol. 95, no. 11, pp. 5048–5061, 2008.

    1. Author Response

      Reviewer #1 (Public Review):

      This paper presents an interesting data set from historic Western Eurasia and North Africa. Overall, I commend the authors for presenting a comprehensive paper that focuses the data analysis of a large project on the major points, and that is easy to follow and well-written. Thus, I have no major comments on how the data was generated, or is presented. Paradoxically, historical periods are undersampled for ancient DNA, and so I think this data will be useful. The presentation is clever in that it focuses on a few interesting cases that highlight the breadth of the data.

      The analysis is likewise innovative, with a focus on detecting "outliers" that are atypical for the genetic context where they were found. This is mainly achieved by using PCA and qpAdm, established tools, in a novel way. Here I do have some concerns about technical aspects, where I think some additional work could greatly strengthen the major claims made, and lay out if and how the analysis framework presented here could be applied in other work.

      clustering analysis

      I have trouble following what exactly is going on here (particularly since the cited Fernandes et al. paper is also very ambiguous about what exactly is done, and doesn't provide a validation of this method). My understanding is the following: the goal is to test whether a pair of individuals (let's call them I1 and I2) are indistinguishable from each other when we compare them to a set of reference populations. Formally, this is done by testing whether all statistics of the form F4(Ref_i, Ref_j; I1, I2) = 0, i.e. the difference between I1 and I2 is orthogonal to the space of reference populations, or that you test whether I1 and I2 project to the same point in the space of reference populations (which should be a subset of the PCA-space). Is this true? If so, I think it could be very helpful if you added a technical description of what precisely is done, and some validation on how well this framework works.

      We agree that the previous description of our workflow was lacking, and have substantially improved the description of the entire pipeline (Methods, section “Modeling ancestry and identifying outliers using qpAdm”), making it clearer and more descriptive. To further improve clarity, we have also unified our use of methodology and replaced all mentions of “qpWave” with “qpAdm”. In the reworked Methods section mentioned above, we added a discussion on how these tests are equivalent in certain settings, and describe which test we are exactly doing for our pairwise individual comparisons, as well as for all other qpAdm tests downstream of cluster discovery. In addition, we now include an additional appendix document (Appendix 4) which, for each region, shows the results from our individual-based qpAdm analysis and clustering in the form of heatmaps, in addition to showing the clusters projected into PC space.
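
      To illustrate the kind of statistic the reviewer's formulation refers to, the sketch below computes point estimates of F4(Ref_i, Ref_j; I1, I2) across all pairs of reference populations from per-SNP allele frequencies. This is only a toy illustration under stated assumptions: the allele frequencies and population names are hypothetical, and the sketch omits the standard errors and the rank test that qpWave/qpAdm would use in a real analysis.

```python
from itertools import combinations


def f4(p_a, p_b, p_c, p_d):
    """Point estimate of F4(A, B; C, D): mean over SNPs of (a - b) * (c - d)."""
    n_snps = len(p_a)
    return sum((a - b) * (c - d) for a, b, c, d in zip(p_a, p_b, p_c, p_d)) / n_snps


def f4_profile(references, ind1, ind2):
    """F4(Ref_i, Ref_j; ind1, ind2) for every unordered pair of reference populations."""
    return {
        (i, j): f4(references[i], references[j], ind1, ind2)
        for i, j in combinations(sorted(references), 2)
    }


# Hypothetical allele frequencies at four SNPs.
toy_references = {
    "RefA": [0.1, 0.8, 0.5, 0.3],
    "RefB": [0.2, 0.6, 0.4, 0.9],
    "RefC": [0.7, 0.1, 0.5, 0.5],
}
toy_ind1 = [0.0, 1.0, 0.5, 0.5]
toy_ind2 = [0.5, 0.5, 1.0, 0.0]
profile_same = f4_profile(toy_references, toy_ind1, toy_ind1)
profile_diff = f4_profile(toy_references, toy_ind1, toy_ind2)
```

      When the two individuals are identical, every entry of the profile is zero; entries that depart from zero indicate that the references can tell the two individuals apart.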

      An independent concern is the transformation from p-values to distances. I am in particular worried about i) biases due to potentially different numbers of SNPs in different samples and ii) whether the resulting matrix is actually a sensible distance matrix (e.g. additive and satisfies the triangle inequality). To me, a summary that doesn't depend on data quality, like the F2-distance in the reference space (i.e. the sum of all F4-statistics, or an orthogonalized version thereof) would be easier to interpret. At the very least, it would be nice to show some intermediate results of this clustering step on at least a subset of the data, so that the reader can verify that the qpWave-statistics and their resulting p-values make sense.

      We agree that calling the matrix generated from p-values a “distance matrix” is a misnomer, as it does not satisfy the triangle inequality, for example. We still believe that our clustering generates sensible results, as UPGMA simply allows us to project a positive, symmetric matrix to a tree, which we can then use, given some cut-off, to define clusters. To make this distinction clear, we now refer to the resulting matrix as a “dissimilarity matrix” instead. As mentioned above, we now also include a supplementary figure for each region visualizing the clustering results.
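
      As a rough illustration of clustering a symmetric dissimilarity matrix with UPGMA and a cut-off, consider the sketch below. The conversion from p-values to dissimilarities (here simply 1 - p), the cut-off value, and the toy p-values are assumptions made for illustration only and are not the settings of the actual pipeline.

```python
def upgma_clusters(dissimilarity, cutoff):
    """Average-linkage (UPGMA) clustering; merging stops once the closest pair exceeds cutoff."""
    clusters = [[i] for i in range(len(dissimilarity))]

    def average_distance(c1, c2):
        # UPGMA distance: unweighted mean over all cross-cluster pairs.
        total = sum(dissimilarity[i][j] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cutoff:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Hypothetical pairwise p-values for four individuals (symmetric, diagonal = 1).
toy_p_values = [
    [1.0, 0.8, 0.001, 0.001],
    [0.8, 1.0, 0.001, 0.001],
    [0.001, 0.001, 1.0, 0.6],
    [0.001, 0.001, 0.6, 1.0],
]
# Assumed transformation: dissimilarity = 1 - p.
toy_dissimilarity = [[1.0 - p for p in row] for row in toy_p_values]
toy_clusters = upgma_clusters(toy_dissimilarity, cutoff=0.5)
```

      Nothing in this procedure requires the triangle inequality; it only needs a non-negative, symmetric matrix, which is why the term "dissimilarity matrix" is the more accurate description.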

      Regarding the concerns about p-values conflating both signal and power, we employ a stringent minimum SNP coverage filter for these analyses to avoid extremely-low coverage samples being separated out (min. SNPs covered: 100,000). In addition, we now show that cluster size and downstream outlier status do not depend on SNP coverage (Figure 2 - Suppl. 3).

      The methodological concerns lead me to some questions about the data analysis. For example, in Fig2, Supp 2, very commonly outliers lie right on top of a projected cluster. To my understanding, apart from using a different reference set, the approach using qpWave is equivalent to using a PCA-based clustering and so I would expect very high concordance between the approaches. One possibility could be that the differences are only visible on higher PCs, but since that data is not displayed, the reader is left wondering. I think it would be very helpful to present a more detailed analysis for some of these "surprising" clustering where the PCA disagrees with the clustering so that suspicions that e.g. low-coverage samples might be separated out more often could be laid to rest.

      To reduce the risk of artifactual clusters resulting from our pipeline, we devised a set of QC metrics (described in detail below) on the individuals and clusters we identified as outliers. Driven by these metrics, we implemented some changes to our outlier detection pipeline that we now describe in substantially more detail in the Methods (see comment above). Since the pipeline involves running many thousands of qpAdm analyses, it is difficult to manually check every step for all samples – instead, we focused our QC efforts on the outliers identified at the end of the pipeline. To assess outlier quality we used the following metrics, in addition to manual inspection:

      First, for an individual identified as an outlier at the end of the pipeline, we check its fraction of non-rejected hypotheses across all comparisons within a region. The rationale here is that by definition, an outlier shouldn’t cluster with many other samples within its region, so a majority of hypotheses should be rejected (corresponding to gray and yellow regions in the heatmaps, Appendix 4). Through our improvements to the pipeline, the fraction of non-rejected hypotheses was reduced from an average of 5.3% (median 1.1%) to an average of 3.8% (median 0.6%), while going from 107 to 111 outliers across all regions.

      Second, we wanted to make sure that outlier status was not affected by the inclusion of pre-historic individuals in our clustering step within regions. To represent majority ancestries that might have been present in a region in the past, we included Bronze and Copper Age individuals in the clustering analysis. We found that including these individuals in the pairwise analysis and clustering improved the clusters overall. However, to ensure that their inclusion did not bias the downstream identification of outliers, we also recalculated the clustering without these individuals. We inspected whether an individual identified as an outlier would be part of a majority cluster in the absence of Bronze and Copper Age individuals, which was not the case (see also the updated Methods section for more details on how we handle time periods within regions).

      In response to the “surprising” outliers based on the PCA visualizations in Figure 2, Supplement 2: with our updated outlier pipeline, some of these have disappeared, for example in Western and Northern Europe. However, in some regions the phenomenon remains. We are confident this isn’t a coverage effect, as we’ve compared the coverage between outliers and non-outliers across all clusters (see previous comment, Figure 2 - Suppl. 3), as well as specifically for “surprising” outliers compared to contemporary non-outliers – none of which showed any differences in the coverage distributions of “surprising” outliers (Author response images 1 and 2). In addition, we believe that the quality metrics we outline above were helpful in minimizing artifactual associations of samples with clusters, which could influence their downstream outlier status. As such, we think it is likely that the qpAdm analysis does detect a real difference between these sets of samples, even though they project close to each other in PCA space. This could be the result of an actual biological difference hidden from PCA by the differences in reference space (see also the reply to the following comment). Still, we cannot fully rule out the possibility of latent technical biases that we were not able to account for, so we do not claim the outlier pipeline is fully devoid of false positives. Nevertheless, we believe our pipeline is helpful in uncovering true, recent, long-range dispersers in a high-throughput and automated manner, which is necessary to glean this type of insight from hundreds of samples across a dozen different regions.

      Author response image 1.

      SNP coverage comparison between outliers and non-outliers in region-period pairings with “surprising” outliers (t-test p-value: 0.242).

      Author response image 2.

      PCA projection (left) and SNP coverage comparison (right) for “surprising” outliers and surrounding non-outliers in Italy_IRLA.

      One way the presentation could be improved would be to be more consistent in what a suitable reference data set is. The PCAs (Fig2, S1 and S2, and Fig6) argue that it makes most sense to present ancient data relative to present-day genetic variation, but the qpWave and qpAdm analysis compare the historic data to that of older populations. Granted, this is a common issue with ancient DNA papers, but the advantage of using a consistent reference data set is that the analyses become directly comparable, and the reader wouldn't have to wonder whether any discrepancies in the two ways of presenting the data are just due to the reference set.

      While it is true that some of the discrepancies are difficult to interpret, we believe that both views of the data are valuable and provide complementary insights. We considered three aspects in our decision to use both reference spaces: (1) conventions in the field (including making the results accessible to others), (2) interpretability, and (3) technical rigor.

      Projecting historical genomes into the present-day PCA space allows for a convenient visualization that is common in the field of ancient DNA and exhibits an established connection to geographic space that is easy to interpret. This is true especially for more recent ancient and historical genomes, as spatial population structure approaches that of present day. However, there are two challenges: (1) a two-dimensional representation of a fairly high-dimensional ancestry space necessarily incurs some amount of information loss and (2) we know that some axes of genetic variation are not well-represented by the present-day PCA space. This is evident, for example, by projecting our qpAdm reference populations into the present-day PCA, where some ancestries which we know to be quite differentiated project closely together (Author response image 3). Despite this limitation, we continue to use the PCA representation as it is well resolved for visualization and maximizes geographical correspondence across Eurasia.

      On the other hand, the qpAdm reference space (used in clustering and outlier detection) has higher resolution to distinguish ancestries by more comprehensively capturing the fairly high-dimensional space of different ancestries. This includes many ancestries that are not well resolved in the present-day PCA space, yet are relevant to our sample set, for example distinguishing Iranian Neolithic ancestry against ancestries from further into central and east Asia, as well as distinguishing between North African and Middle Eastern ancestries (Author response image 3).

      To investigate the differences between these two reference spaces, we chose pairwise outgroup-f3 statistics (to Mbuti) as a pairwise similarity metric representing the reference space of f-statistics and qpAdm in a way that is minimally affected by population-specific drift. We related this similarity measure to the Euclidean distance on the first two PCs between the same set of populations (Author response image 4). This analysis shows that while there is an almost linear correspondence between these pairwise measures for some populations, other comparisons fall off the diagonal in a manner consistent with the PCA projection (Author response image 3), where samples are close together in PCA but not very similar according to outgroup-f3. Taken together, these analyses highlight the non-equivalence of the two reference spaces.
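
      The two pairwise measures being compared can be sketched as follows. This is a simplified illustration: the outgroup-f3 shown is the plain point estimate from allele frequencies without the finite-sample bias correction or standard errors used in practice, and the function names and inputs are hypothetical.

```python
import math


def outgroup_f3(p_out, p_a, p_b):
    """Point estimate of f3(Outgroup; A, B): mean over SNPs of (o - a) * (o - b)."""
    n_snps = len(p_out)
    return sum((o - a) * (o - b) for o, a, b in zip(p_out, p_a, p_b)) / n_snps


def pc_distance(pcs_a, pcs_b):
    """Euclidean distance between two samples on the first two principal components."""
    return math.dist(pcs_a[:2], pcs_b[:2])
```

      A larger outgroup-f3 value indicates more drift shared by A and B since their split from the outgroup, so pairs with a small PC distance but a low outgroup-f3 value are the ones that fall off the diagonal.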

      In addition, we chose to base our analysis pipeline on the f-statistics framework to (1) afford us a more principled framework to disentangle ancestries among samples and clusters within and across regions (using 1-component vs. 2-component models of admixture), while (2) keeping a consistent, representative reference set for all analyses that were part of the primary pipeline. Meanwhile, we still use the present-day PCA space for interpretable visualization.

      Author response image 3.

      Projection of qpAdm reference population individuals into present-day PCA.

      Author response image 4.

      Comparison of pairwise PCA projection distance to outgroup-f3 similarity across all qpAdm reference population individuals. PCA projection distance was calculated as the euclidean distance on the first two principal components. Outgroup-f3 statistics were calculated relative to Mbuti, which is itself also a qpAdm reference population. Both panels show the same data, but each point is colored by either of the two reference populations involved in the pairwise comparison.

      PCA over time

      It is a very interesting observation that the Fst-vs distance curve does not appear to change after the bronze age. However, I wonder if the comparison of the PCA to the projection could be solidified. In particular, it is not obvious to me how to compare Fig 6 B and C, since the data in C is projected onto that in Fig B, and so we are viewing the historic samples in the context of the present-day ones. Thus, to me, this suggests that ancient samples are most closely related to the folks that contribute to present-day people that roughly live in the same geographic location, at least for the middle east, north Africa and the Baltics, the three regions where the projections are well resolved. Ideally, it would be nice to have independent PCAs (something F-stats based, or using probabilistic PCA or some other framework that allows for missingness). Alternatively, it could be helpful to quantify the similarity and projection error.

      The fact that historical period individuals are “most closely related to the folks that contribute to present-day people that roughly live in the same geographic location” is exactly the point we were hoping to make with Figures 6 B and C. We do realize, however, that the fact that one set of samples is projected into the PC space established by the other may suggest that this is an obvious result. To make it more clear that it is not, we added an additional panel to Figure 6, which shows pre-historical samples projected into the present-day PC space. This figure shows that pre-historical individuals project all across the PCA space and often outside of present-day diversity, with degraded correlation of geographic location and projection location (see also Author response image 5). This illustrates the contrast we were hoping to communicate, where projection locations of historical individuals start to “settle” close to present-day individuals from similar geographic locations, especially in contrast with pre-historic individuals.

      Author response image 5.

      Comparing geographic distance to PCA distance between pairs of historical and pre-historical individuals matched by geographic space. For each historical period individual we selected the closest pre-historical individual by geographic distance in an effort to match the distributions of pairwise geographic distance across the two time periods (left). For these distributions of individuals matched by geographic distance, we then queried the euclidean distance between their projection locations in the first two principal components (right).
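
      The matching step described in this caption could be sketched as below. This is a minimal illustration under stated assumptions: great-circle (haversine) distance is used as the geographic distance, matching is done with replacement, and the record layout (`id`, `lat`, `lon`, `pc`) and the example individuals are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def match_nearest(historical, prehistorical):
    """Pair each historical individual with its geographically closest pre-historical one."""
    pairs = []
    for h in historical:
        nearest = min(
            prehistorical,
            key=lambda p: haversine_km(h["lat"], h["lon"], p["lat"], p["lon"]),
        )
        geo_km = haversine_km(h["lat"], h["lon"], nearest["lat"], nearest["lon"])
        pc_dist = math.dist(h["pc"], nearest["pc"])  # distance on PC1 and PC2
        pairs.append((h["id"], nearest["id"], geo_km, pc_dist))
    return pairs


# Hypothetical individuals with coordinates and PC1/PC2 projection locations.
toy_historical = [{"id": "H1", "lat": 41.9, "lon": 12.5, "pc": (0.01, 0.02)}]
toy_prehistorical = [
    {"id": "P1", "lat": 42.0, "lon": 12.6, "pc": (0.04, 0.06)},
    {"id": "P2", "lat": 55.0, "lon": 25.0, "pc": (0.00, 0.00)},
]
toy_pairs = match_nearest(toy_historical, toy_prehistorical)
```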

    1. Author Response

      Reviewer #3 (Public Review):

      The authors explore the use of SRT as a host-directed therapy for use in combination with other first-line TB antibiotics. This manuscript is of substantial importance since TB is a major world health concern, and there is growing interest in the development of host-directed therapies to augment existing therapies for TB. Demonstrating the effectiveness of adding an FDA-approved drug to existing cocktails of anti-TB drugs has potentially exciting implications.

      The manuscript is bolstered by their use of multiple in vitro and in vivo models of infection, as well as a clinically relevant strain of TB. While their findings generally support the use of SRT as an effective HDT/treatment, the mechanistic details underlying the effectiveness of SRT remain somewhat obscure, and as presented, the in vitro experiments support more limited conclusions.

      Major concerns:

      In vitro studies (i.e. bacterial culture) were only performed with SRT up to 6 uM while the cultured cell experiments used a range up to 20 uM. 5 uM had almost no effect on the viability/growth of Mtb in macrophages. The authors should use the same concentrations in vitro as their macrophage studies to test whether SRT directly impacts Mtb viability to be able to rule in/out that SRT does not impact Mtb viability when cultured.

      We did not see an appreciable decrease in the growth of Mtb at up to 20 µM SRT in in vitro experiments; restriction reached only about 30-40% after 8 days of culture. In combination with HR we used a lower dose of 6 µM to offset this minimal inhibitory effect of SRT itself, so that only the effect of SRT on the combination is assessed.

      The mechanism of action of SRT during TB infection and the conclusions drawn by the authors are not supported by the limited experimentation. SRT is presented as an antagonist of polyI:C-induced type I IFNs, but during TB infection, cytosolic DNA sensing via the cGAS/STING axis constitutes the major pathway through which type I IFNs are induced in macrophages.

      To offer more support that SRT inhibits type I IFN, the authors should consider measuring the actual amount of type I IFN using an IFNb ELISA. Additionally, the authors should use human/mouse primary macrophages (not just THP1 reporter cells) and measure transcript levels (at key time points post infection) and protein levels of type I IFN and other proinflammatory mediators (e.g. TNFa, IL-1, IL-6) +/- SRT to determine if SRT is specific to the type I IFN response. If this is indeed the case, other NFkB genes/cytokines should not be impacted.

      Moreover, to draw the conclusion that "augmentation property of SRT is due to its ability to inhibit IFN signalling" a set of experiments using an IFN blocking antibody would enhance Figure 2, as both cGAS and STING KO macs have significant differences in basal gene expression and their ability to respond to innate immune stimuli.

      Because the first half of the paper focuses on type I IFNs during macrophage infection to explain the mechanism of action for SRT, additional analysis of the mouse infections to examine levels of type I IFNs, as well as IL-1B and IFN-g (in serum/tissues?), is important for connecting the two halves of the manuscript. The in vivo data would also be strengthened by quantitative analysis of histological changes by, for example, blinded pathology scoring. This type of quantitation would also permit statistical analyses of this important pathology readout.

      We analysed tissue cytokine levels and did not see stark differences between HRZE and HRZES at two time points, 4 and 8 weeks post treatment (Figure below). We feel that such studies would need a more comprehensive analysis of the immunological response induced in the host by the treatment at multiple time points. Such studies will be part of a more focussed plan in future proposals and manuscripts. We have also conducted manual scoring of the lesions between the groups and have recorded these data in the manuscript (Fig. 4-figure supplement 1).

      The authors conclude that SRT functions through an inflammasome-related function, but this conclusion requires further support of actual inflammasome activation, such as IL-1B secretion by ELISA or IL-1B processing by western blot analysis, rather than Il1b gene expression alone. Additional functional readouts of inflammasome activation like cell death assays would also strengthen this conclusion.

      We thank the reviewer for these suggestions. These studies are currently underway and will be part of a future manuscript detailing the mechanism of the SRT-mediated increase in antibiotic efficacy.

      What strain of TB was used in these studies? The results and methods do not indicate the strain used, which is critical to know since different strains have varying pathogenesis phenotypes.

      We used Mtb Erdman for the routine drug-sensitive studies and N73 for the drug-tolerant studies. This has been added to the text.

      Minor concerns:

      It might be worth consistently using the more common INH and RIF abbreviations to increase the clarity/readability of the MS and figures.

      We have used the conventional clinical abbreviations for INH and rifampicin.

      What is the physiological concentration of SRT when taken for depression and how does that compare to the concentrations used in vitro? Are the in vitro concentrations feasible to achieve in patients?

      In Figure 3B, why is there a spike in TNF-a in the HRS treated cells only at 42h?

      The authors wish to thank the reviewer for this query. We have reanalysed the data and present the modified figures in the current version of the text. The spike in TNF at 42 h was an oversight caused by an erroneous representation of the values in the figure.

      Was statistical analysis performed on the data in Figure 3B and D?

      Yes, we have incorporated this information in the modified figure.

      A description/discussion of the different mouse strains use in infection - what benefits each has as a model and why several were used - would help convey the impact of the in vivo studies.

      A discussion of the mouse strains and their immunopathology in infection has been incorporated in the text.

      Since antibiotics and SRT were administered ad libitum, how did the authors ensure that mice took enough of the antibiotics and especially SRT? Is it known whether these drugs affect the water taste enough to affect a mouse's willingness to drink them?

      We preferred ad libitum delivery of TB drugs in drinking water, as used in previous studies (Vilchèze et al., 2018, Antimicrob Agents Chemother 23;62(3):e02165-17). To avoid non-drinking, we used 5% glucose in the water of all animals, including the non-antibiotic-treated groups. We also monitored water uptake during the treatment and found comparable levels of consumption between the groups.

      Was statistical analysis performed on time-to-death experiments?

      Because of the inherent differences in susceptibility and response between male and female C3HeB/FeJ mice, we did not perform statistical analyses between the groups.

      Were CFUs measured in mice from Figure 4 to determine empirically how effective the antibiotic treatments were? And if SRT impacted their effectiveness?

      We have not tested the effect of SRT on bacterial burdens in animals treated with HR alone, as these studies were aimed at deciphering chronic pathology. We have tested the effect on bacterial loads in the C3HeB/FeJ model with the four-drug therapy and in the C57BL/6 and BALB/c models of infection.

      The H&E images could use some additional labels to more easily discern what groups they belong to.

      These have been incorporated in the figure.

    1. Author Response

      Reviewer #1 (Public Review):

      This is a carefully-conducted fMRI study looking at how neural representations in the hippocampus, entorhinal cortex, and ventromedial prefrontal cortex change as a function of local and global spatial learning. Collectively, the results from the study provide valuable additional constraints on our understanding of representational change in the medial temporal lobes and spatial learning. The most notable finding is that representational similarity in the hippocampus post-local-learning (but prior to any global navigation trials) predicts the efficiency of subsequent global navigation.

      Strengths:

      The paper has several strengths. It uses a clever two-phase paradigm that makes it possible to track how participants learn local structure as well as how they piece together global structure based on exposure to local environments. Using this paradigm, the authors show that - after local learning - hippocampal representations of landmarks that appeared within the same local environment show differentiation (i.e., neural similarity is higher for more distant landmarks) but landmarks that appeared in different local environments show the opposite pattern of results (i.e., neural similarity is lower for more distant landmarks); after participants have the opportunity to navigate globally, the latter finding goes away (i.e., neural similarity for landmarks that occurred in different local environments is no longer influenced by the distance between landmarks). Lastly, the authors show that the degree of hippocampal sensitivity to global distance after local-only learning (but before participants have the opportunity to navigate globally) negatively predicts subsequent global navigation efficiency. Taken together, these results meaningfully extend the space of data that can be used to constrain theories of MTL contributions to spatial learning.

      We appreciate Dr. Norman’s generous feedback here along with his other insightful comments. Please see below for a point-by-point response. We note that responses to a number of Dr. Norman’s points were surfaced by the Editor as Essential revisions; as such, in a number of instances in the point-by-point below we direct Dr. Norman to our responses above under the Essential revisions section.

      Weaknesses:

      General comment 1: The study has an exploratory feel, in the sense that - for the most part - the authors do not set forth specific predictions or hypotheses regarding the results they expected to obtain. When hypotheses are listed, they are phrased in a general way (e.g., "We hypothesized that we would find evidence for both integration and differentiation emerging at the same time points across learning, as participants build local and global representations of the virtual environment", and "We hypothesized that there would be a change in EC and hippocampal pattern similarity for items located on the same track vs. items located on different tracks" - this does not specify what the change will be and whether the change is expected to be different for EC vs. hippocampus). I should emphasize that this is not, unto itself, a weakness of the study, and it appears that the authors have corrected for multiple comparisons (encompassing the range of outcomes explored) throughout the paper. However, at times it was unclear what "denominator" was being used for the multiple comparisons corrections (i.e., what was the full space of analysis options that was being corrected for) - it would be helpful if the authors could specify this more concretely, throughout the paper.

      We appreciate this guidance and the importance of these points. We have taken a number of steps to clarify our hypotheses, we now distinguish a priori predictions from exploratory analyses, and we now explicitly indicate throughout the manuscript how we corrected for multiple comparisons. For full details, please see above for our response to Essential Revisions General comment #1.

      General comment 2: Some of the analyses featured prominently in the paper (e.g., interactions between context and scan in EC) did not pass multiple comparisons correction. I think it's fine to include these results in the paper, but it should be made clear whenever they are mentioned that the results were not significant after multiple comparisons correction (e.g., in the discussion, the authors say "learning restructures representations in the hippocampus and in the EC", but in that sentence, they don't mention that the EC results fail to pass multiple comparisons correction).

      Thank you for encouraging greater clarity here. As noted directly above, we now explicitly indicate our a priori predictions, we state explicitly which results survive multiple comparisons correction, and we added necessary caveats for effects that should be interpreted with caution.

      General comment 3: The authors describe the "flat" pattern across the distance 2, 3, and 4 conditions in Figure 4c (post-global navigation) and in Figure 5b (in the "more efficient" group) as indicating integration. However, this flat pattern across 2, 3, and 4 (unto itself) could simply indicate that the region is insensitive to location - is there some other evidence that the authors could bring to bear on the claim that this truly reflects integration? Relatedly, in the discussion, the authors say "the data suggest that, prior to Global Navigation, LEs had integrated only the nearest landmarks located on different tracks (link distance 2)" - what is the basis for this claim? Considered on its own, the fact that similarity was high for link distance 2 does not indicate that integration took place. If the authors cannot get more direct evidence for integration, it might be useful for them to hedge a bit more in how they interpret the results (the finding is still very interesting, regardless of its cause).

      Based on the outcomes of additional behavioral and neural analyses that were helpfully suggested by reviewers, we revised discussion of this aspect of the data. Please see our response above under Essential Revisions General comment #4 for full details of the changes made to the manuscript.

      Reviewer #2 (Public Review):

      This paper presents evidence of neural pattern differentiation (using representational similarity analysis) following extensive experience navigating in virtual reality, building up from individual tracks to an overall environment. The question of how neural patterns are reorganized following novel experiences and learning to integrate across them is a timely and interesting one. The task is carefully designed and the analytic setup is well-motivated. The experimental approach provides a characterization of the development of neural representations with learning across time. The behavioral analyses provide helpful insight into the participants' learning. However, there were some aspects of the conceptual setup and the analyses that I found somewhat difficult to follow. It would also be helpful to provide clearer links between specific predictions and theories of hippocampal function.

      We appreciate the Reviewer’s careful read of our manuscript and their thoughtful guidance for improvement, which we believe strengthened the revised product. We note that responses to a number of the Reviewer’s points were surfaced by the Editor as Essential revisions; as such, in a number of instances in the point-by-point below we direct the Reviewer to our responses above under the Essential revisions section.

      General comment 1: The motivation in the Introduction builds on the assumption that global representations are dependent on local ones. However, I was not completely sure about the specific predictions or assumptions regarding integration vs. differentiation and their time course in the present experimental design. What would pattern similarity consistent with 'early evidence of global map learning' (p. 7) look like? Fig. 1D was somewhat difficult to understand. The 'state space' representation is only shown in Figure 1 while all subsequent analyses are averaged pairwise correlations. It would be helpful to spell out predictions as they relate to the similarity between same-route vs. different-route neural patterns.

      We appreciate this feedback. An increase in pattern similarity across features that span tracks would indicate the linking of those features together. ‘Early evidence’ here describes the point in experience where participants had traversed local (within-track) paths but had yet to traverse across tracks.

      Figure 1D seeks to communicate the high-level conceptual point about how similarity (abstractly represented as state-space distance) may change in one of two directions as a function of experience.

      General comment 2: The shared landmarks could be used by the participants to infer how the three tracks connected even before they were able to cross between them. It is possible that the more efficient navigators used an explicit encoding strategy to help them build a global map of the world. While I understand the authors' reasoning for excluding the shared landmarks (p. 13), it seems like it could be useful to run an analysis including them as well - one possibility is that they act as 'anchors' and drive the similarity between different tracks early on; another is that they act as 'boundaries' and repel the representations across routes. Assuming that participants crossed over at these landmarks, these seem like particularly salient aspects of the environment.

      We agree that these shared landmarks play an important role in learning the global environment and guiding participants’ navigation. However, they also add confounding elements to the analyses; mainly, shared landmarks are located near multiple goal locations and associated with multiple tracks, and transition probabilities differ at shared landmarks because they have an increased number of neighboring landmarks and fractals. In the initial submission, shared landmarks were included in all analyses except (a) global distance models and (b) context models (which compare items located on the same vs different tracks).

      With respect to (a) the global distance models, we ran these models while including shared landmarks and the results did not differ (see figure below and compare to Fig. 5 in the revised manuscript):

      Distance representations in the Global Environment, with shared landmarks included. These data can be compared to Figure 5 of the revised manuscript, which does not include shared landmarks (see page 5 of this response letter).

      We continue to report the results from models excluding shared landmarks due to the confounding factors described above, with the following addition to the Results section:

      “We excluded shared landmarks from this model as they are common to multiple tracks; however, the results do not differ if these landmarks are included in the analysis.”

      With respect to (b) the context analyses (which compare items located on the same vs different tracks), we cannot include shared landmarks in these analyses because they are common amongst multiple tracks and thus confound the analyses. Finally, we are unable to conduct additional analyses investigating shared landmarks specifically (for example, examining how similarity between shared landmarks evolves across learning) due to very low trial counts. We share the Reviewer’s perspective that the role of shared landmarks during the building of map representations promises to provide additional insights and believe this is a promising question for future investigation.

      General comment 3: What were the predictions regarding the fractals vs. landmarks (p. 13)? It makes sense to compare like-to-like, but since both were included in the models it would be helpful to provide predictions regarding their similarity patterns.

      We are grateful for the feedback on how to improve the consistency of results reporting. In the revision, we updated the relevant sections of the manuscript to include results from fractals. Please see our above response to Essential Revisions General comment #5 for additions made to the text.

      General comment 4: The median split into less-efficient and more-efficient groups does not seem to be anticipated in the Introduction and results in a small-N group comparison. Instead, as the authors have a wealth of within-individual data, it might be helpful to model single-trial navigation data in relation to pairwise similarity values for each given pair of landmarks in a mixed-effects model. While there won't be a simple one-to-one mapping and fMRI data are noisy, this approach would afford higher statistical power due to more within-individual observations and would avoid splitting the sample into small subgroups.

      We appreciate this very helpful suggestion. Following this guidance, we removed the median-split analysis and ran a mixed-effects model relating trial-wise navigation data (at the beginning of the Global Navigation Task) to pairwise similarity values for each given pair of landmarks and fractals (Post Local Navigation). We also altered our approach to the across-participant analysis examining brain-behavior relationships. Please see our above response to Essential Revisions General comment #3 for additions to the revised manuscript.

      General comment 5: If I understood correctly, comparing Fig. 4B and Fig. 5B suggests that the relationship between higher link distance and lower representational similarity was driven by less efficient navigators. The performance on average improved over time to more or less the same level as within-track (Fig. 2). Were less efficient navigators particularly inefficient on trials with longer distances? In the context of models of hippocampal function, this suggests that good navigators represented all locations as equidistant while poorer navigators showed representations more consistent with a map - locations that were further apart were more distant in their representational patterns. Perhaps more fine-grained analyses linking neural patterns to behavior would be helpful here.

      Following the above guidance, we removed the median-split analyses when exploring across-participant brain-behavior relationships (see Essential Revisions General comment #3), replacing them with a mixed-effects model analysis, and we revised our discussion of the across-track link distance effects (see Essential Revisions General comment #4). For this reason, we were hesitant and ultimately decided against conducting the proposed fine-grained analyses on the median-split data.

      General comment 6: I'm not completely sure how to interpret the functional connectivity analysis between the vmPFC and the hippocampus vs. visual cortex (Fig. 6). The analysis shows that the hippocampus and visual cortex are generally more connected than the vmPFC and visual cortex - but this relationship does not show an experience-dependent relationship and is consistent with resting-state data where the hippocampus tends to cluster into the posterior DMN network.

      We expected to see an experience-dependent relationship between vmPFC and hippocampal pattern similarity, and agree that these findings are difficult to interpret. Based on comments from several reviewers, we removed the second-order similarity analysis from the manuscript in favor of an analysis which models the relationship between vmPFC pattern similarity and hippocampal pattern similarity. Moreover, given the exploratory nature of the vmPFC analyses, and following guidance from Reviewer 1 about the visual cortex control analyses, both were moved to the Appendix. Please see our above response to Essential Revisions General comment #7 for further details of the changes made to the manuscript.

      Reviewer #3 (Public Review):

      Fernandez et al. report results from a multi-day fMRI experiment in which participants learned to locate fractal stimuli along three oval-shaped tracks. The results suggest the concurrent emergence of a local, differentiated within-track representation and a global, integrated cross-track representation. More specifically, the authors report decreases in pattern similarity for stimuli encountered on the same track in the entorhinal cortex and hippocampus relative to a pre-task baseline scan. Intriguingly, following navigation on the individual tracks, but prior to global navigation requiring track-switching, pattern similarity in the hippocampus correlated with link distances between landmark stimuli. This effect was only observed in participants who navigated less efficiently in the global navigation task and was absent after global navigation.

      Overall, the study is of high quality in my view and addresses relevant questions regarding the differentiation and integration of memories and the formation of so-called cognitive maps. The results reported by the authors are interesting and are based upon a well-designed experiment and thorough data analysis using appropriate techniques. A more detailed assessment of strengths and weaknesses can be found below.

      Strengths

      1) The authors address an interesting question at the intersection of memory differentiation and integration. The study is further relevant for researchers interested in the question of how we form cognitive maps of space.

      2) The study is well-designed. In particular, the pre-learning baseline scan and the random-order presentation of stimuli during MR scanning allow the authors to track the emergence of representations in a well-controlled fashion. Further, the authors include an adequate control region and report direct comparisons of their effects against the patterns observed in this control region.

      3) The manuscript is well-written. The introduction provides a good overview of the research field and the discussion does a good job of summarizing the findings of the present study and positioning them in the literature.

      We thank Dr. Bellmund for his positive evaluation of the manuscript. We greatly appreciate the insightful feedback, which we believe strengthened the manuscript’s clarity and potential impact. We note that responses to a number of Dr. Bellmund’s points were surfaced by the Editor as Essential revisions; as such, in a number of instances in the point-by-point below we direct the Reviewer to our responses above under the Essential revisions section.

      Weaknesses

      General comment 1: Despite these distinct strengths, the present study also has some weaknesses. On the behavioral level, I am wondering about the use of path inefficiency as a metric for global navigation performance. Because it is quantified based on the local response, it conflates the contributions of local and global errors.

      We appreciate this point with respect to path inefficiency during global navigation. As noted below, following Dr. Bellmund’s further insightful guidance, we now complement the path inefficiency analyses with additional metrics of across-track (global) navigation performance, which effectively separate local from global errors (please see below response to Author recommendation #1).

      General comment 2: For the distance-based analysis in the hippocampus, the authors choose to only analyze landmark images and do not include fractal stimuli. There seems to be little reason to expect that distances between the fractal stimuli, on which the memory task was based, would be represented differently relative to distances between the landmarks.

      We are grateful for the feedback on how to improve the consistency of results reporting. In the revision, we updated the relevant sections of the manuscript to include results from fractals. Please see our above response to Essential Revisions General comment #5 for full details.

      General comment 3: Related to the aforementioned analysis, I am wondering why the authors chose the link distance between landmarks as their distance metric for the analysis and why they limit their analysis to pairs of stimuli with distance 1 or 2 and do not include pairs separated by the highest possible distance (3).

      We appreciate the request for clarification here. Beginning with the latter question, we note that the highest possible distance varies between within-track vs. across-track paths. If participants navigate in the Local Navigation Task using the shortest or most efficient path, the highest possible within-track link distance between two stimuli is 2. For this reason, the Local Navigation/within-track analysis includes link distances of 1 and 2. For the Global Navigation analysis, we also include pairs of stimuli with link distances of 3 and 4 when examining across-track landmarks.

      Regarding the use of link distance as the distance metric, we note that the path distance (a.u.) varies only slightly between pairs of stimuli with the same link distance. As such, categorical treatment of link distance accounts for the vast majority of the variance in path distance and thus is a suitable approach. Please note that in the new trial-level brain-behavior analysis included in the revised manuscript (which replaces the median-split analysis), we used the length of the optimal path.
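
      To make the notion of link distance concrete, it can be computed as the number of links on the shortest path between two stimuli in a graph of the environment. The sketch below uses breadth-first search on a hypothetical four-node chain; the graph, node names, and function name are illustrative assumptions and do not reproduce the actual track layout.

```python
from collections import deque


def link_distance(graph, start, goal):
    """Return the number of links on the shortest path from start to goal,
    or None if the two nodes are not connected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical layout: four landmarks joined in a chain A - B - C - D.
toy_graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}
d_adjacent = link_distance(toy_graph, "A", "B")
d_far = link_distance(toy_graph, "A", "D")
```

      Because link distance counts links rather than metric path length, pairs with the same link distance fall into the same category even when their path distances differ slightly.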

      General comment 4: Surprisingly, the authors report that across-track distances can be observed in the hippocampus after local navigation, but that this effect cannot be detected after global, cross-track navigation. Relatedly, the cross-track distance effect was detected only in the half of participants that performed relatively badly in the cross-track navigation task. In the results and discussion, the authors suggest that the effect of cross-track distances cannot be detected because participants formed a "more fully integrated global map". I do not find this a convincing explanation for why the effect the authors are testing would be absent after global navigation and for why the effect was only present in those participants who navigated less efficiently.

      We appreciate Dr. Bellmund’s input here, which was shared by other reviewers. We revised and clarified the Discussion based on reviewer comments. Please see our above response to Essential Revisions General comment #4 for full details.

      General comment 5: The authors report differences in the hippocampal representational similarity between participants who navigated along inefficient vs. efficient paths. These are based on a median split of the sample, resulting in a comparison of groups including 11 and 10 individuals, respectively. The median split (see e.g. MacCallum et al., Psychological Methods, 2002) and the low sample size mandate cautionary interpretation of the resulting findings about interindividual differences.

      We appreciate the feedback we received from multiple reviewers with respect to the median-split brain-behavior analysis. We replaced the median-split analysis with the following: 1) a mixed-effects model predicting neural pattern similarity Post Local Navigation, with a continuous metric of task performance (each participant’s median path inefficiency for across-track trials in the first four test runs of Global Navigation) and link distance as predictors; and 2) a mixed-effects model relating trial-wise navigation data to pairwise similarity values for each given pair of landmarks and fractals (as suggested by Reviewer 2). Please see our above response to Essential Revisions General comment #3 for additions to the revised manuscript.
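
      For readers who prefer a formula, the first of these models could be written roughly as below. This is a sketch only: the interaction term, the participant-level random intercept, and all symbol names are assumptions made for illustration and are not specified in this response.

```latex
% s_{p,ij}: pattern similarity Post Local Navigation for stimulus pair (i,j) in participant p
% d_{ij}:   link distance between stimuli i and j
% e_p:      participant p's median path inefficiency (across-track trials, first four Global Navigation test runs)
% u_p:      assumed random intercept for participant p
\[
  s_{p,ij} = \beta_0 + \beta_1\, d_{ij} + \beta_2\, e_p + \beta_3\, d_{ij}\, e_p + u_p + \varepsilon_{p,ij},
  \qquad u_p \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{p,ij} \sim \mathcal{N}(0, \sigma^2)
\]
```

      Under this reading, treating path inefficiency as a continuous predictor replaces the group comparison of the median split, and a non-zero \(\beta_3\) would indicate that the relationship between link distance and similarity varies with navigation performance.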

    1. Author Response

      Reviewer #2 (Public Review):

      Silberberg et al. present a series of cryo-EM structures of the ATP dependent bacterial potassium importer KdpFABC, a protein that is inhibited by phosphorylation under high environmental K+ conditions. The aim of the study was to sample the protein's conformational landscape under active, non-phosphorylated and inhibited, phosphorylated (Ser162) conditions.

      Overall, the study presents 5 structures of phosphorylated wildtype protein (S162-P), 3 structures of phosphorylated 'dead' mutant (D307N, S162-P), and 2 structures of constitutively active, non-phosphorylatable protein (S162A).

      The true novelty and strength of this work is that 8 of the presented structures were obtained either under "turnover" or at least 'native' conditions without ATP, ie in the absence of any non-physiological substrate analogues or stabilising inhibitors. The remaining 2 were obtained in the presence of orthovanadate.

      Comparing the presented structures with previously published KdpFABC structures, there are 5 structural states that have not been reported before, namely an E1-P·ADP state, an E1-P tight state captured in the autoinhibited WT protein (with and without vanadate), two different nucleotide-free 'apo' states, and an E1·ATP early state.

      Of these new states, the 'tight' states are of particular interest, because they appear to be 'off-cycle', dead end states. A novelty lies in the finding that this tight conformation can exist both in nucleotide-free E1 (as seen in the published first KdpFABC crystal structure), and also in the phosphorylated E1-P intermediate.

      By EPR spectroscopy, the authors show that the nucleotide free 'tight' state readily converts into an active E1·ATP conformation when provided with nucleotide, leading to the conclusion that the E1-P·ADP state must be the true inhibitory species. This claim is supported by structural analysis supporting the hypothesis that the phosphorylation at Ser162 could stall the KdpB subunit in an E1P state unable to convert into E2P. This is further supported by the fact that the phosphorylated sample does not readily convert into an E2P state when exposed to vanadate, as would otherwise be expected.

      The structures are of medium resolution (3.1 - 7.4 Å), but the key sites of nucleotide binding and/or phosphorylation are reasonably well supported by the EM maps, with one exception: in the 'E1·ATP early' state determined under turnover conditions, I find the map for the gamma phosphate of ATP not overly convincing, leaving the question whether this could instead be a product-inhibited, Mg-ADP bound E1 state resulting from an accumulation of MgADP under the turnover conditions used. Overall, the manuscript is well written and carefully phrased, and it presents interesting novel findings, which expand our knowledge about the conformational landscape and regulatory mechanisms of the P-type ATPase family.

      We thank the reviewer for their comments and helpful insights. We have addressed the points as follows:

      However in my opinion there are the following weaknesses in the current version of the manuscript:

      1) A lack of quantification. The heart of this study is the comparison of the newly determined KdpFABC structures with previously published ones (of which there are already 10). Yet, there are no RMSD calculations to illustrate the magnitude of any structural deviations. Instead, the authors use phrases like 'similar but not identical to', 'has some similarities', 'virtually identical', 'significant differences'. This makes it very hard to appreciate the true level of novelty/deviation from known structures.

      This is a very valid point and we thank the reviewers for bringing it up. To provide a better overview and appreciation of conformational similarities and significant differences, we have calculated RMSDs between all available structures of KdpFABC. They are summarised in the new Table 1 – Table Supplement 2. We have included individual RMSD values, whenever applicable and relevant, in the respective sections in the text and figures. We note that the RMSDs were calculated only between the cytosolic domains (KdpB N, A, and P domains) after superimposition of the full-length protein on KdpA, which is rigid across all conformations of KdpFABC (see the description in Materials and Methods, lines 1184-1191, or the caption to Table 1 – Table Supplement 2). We opted not to indicate the RMSD calculated between the full-length proteins, as the largest part of the complex does not undergo large structural changes (see Figure 1 – Figure Supplement 1; the transmembrane region of KdpB as well as KdpA, KdpC and KdpF show relatively small to no rearrangements compared to the cytosolic domains), and including it would otherwise obscure the relevant RMSD differences discussed here.
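      The comparison scheme described in this response (superimpose on a rigid anchor, then measure deviations only over the mobile domains) can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: it assumes NumPy is available, uses the standard Kabsch superposition, and the function names `kabsch` and `domain_rmsd`, as well as the synthetic coordinates, are hypothetical.

```python
import numpy as np

def kabsch(mobile, reference):
    """Return rotation R and translation t that best superimpose mobile onto reference."""
    mob_c = mobile.mean(axis=0)
    ref_c = reference.mean(axis=0)
    # Covariance between the centred coordinate sets (rows are atoms).
    H = (mobile - mob_c).T @ (reference - ref_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = ref_c - R @ mob_c
    return R, t

def domain_rmsd(anchor_a, anchor_b, domain_a, domain_b):
    """Superimpose structure B on A using anchor atoms only, then take the RMSD over domain atoms."""
    R, t = kabsch(anchor_b, anchor_a)
    moved = domain_b @ R.T + t
    return float(np.sqrt(((moved - domain_a) ** 2).sum(axis=1).mean()))

# Synthetic example (hypothetical coordinates, in Å): structure B is a rigidly
# moved copy of A in which the "cytosolic" atoms are additionally shifted by 3 Å.
rng = np.random.default_rng(0)
anchor_a = rng.normal(size=(50, 3)) * 10.0                 # stands in for the rigid anchor (e.g. KdpA)
domain_a = rng.normal(size=(30, 3)) * 10.0 + [0, 0, 40.0]  # stands in for the mobile domains

angle = 0.7
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle),  np.cos(angle), 0.0],
                [0.0,            0.0,           1.0]])
shift = np.array([5.0, -3.0, 2.0])

anchor_b = anchor_a @ rot.T + shift
domain_b = (domain_a + [3.0, 0.0, 0.0]) @ rot.T + shift

rmsd_value = domain_rmsd(anchor_a, anchor_b, domain_a, domain_b)
print(f"domain RMSD after anchor superposition: {rmsd_value:.3f} Å")
```

      Because the fit uses only the anchor atoms, the reported value reflects displacement of the mobile domains relative to the anchor and is not diluted by the large, nearly invariant part of the complex.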

      Also the decrease in EPR peak height of the E1 apo tight state between phosphorylated and non-phosphorylated sample - a key piece of supporting data - is not quantified.

      EPR distance distributions have been quantified by fitting and integrating a Gaussian distribution curve, and the results have been added to the corresponding Results section (lines 523-542) and the Methods section (lines 1230-1232).
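      As an illustration of what "fitting and integrating a Gaussian" can look like in practice, the sketch below fits a single Gaussian peak and returns its integrated area. The fitting procedure actually used in the manuscript is not specified in this response; the log-quadratic fit shown here is an assumed, simplified choice that suits a clean, single, positive peak, and `fit_gaussian_peak` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_gaussian_peak(r, p):
    """Fit p(r) ~ A * exp(-(r - mu)^2 / (2 sigma^2)) via a quadratic fit to log p.

    Returns (mu, sigma, amplitude, area), where area = A * sigma * sqrt(2 pi)
    is the analytic integral of the fitted Gaussian.
    """
    r = np.asarray(r, dtype=float)
    p = np.asarray(p, dtype=float)
    keep = p > 0  # the logarithm is only defined for positive values
    c2, c1, c0 = np.polyfit(r[keep], np.log(p[keep]), 2)
    if c2 >= 0:
        raise ValueError("data do not describe a peak (quadratic term is not negative)")
    sigma = np.sqrt(-1.0 / (2.0 * c2))
    mu = -c1 / (2.0 * c2)
    amplitude = np.exp(c0 - c1 ** 2 / (4.0 * c2))
    area = amplitude * sigma * np.sqrt(2.0 * np.pi)
    return mu, sigma, amplitude, area

# Synthetic, noise-free distance distribution (hypothetical values, distances in nm).
r = np.linspace(2.0, 4.0, 41)
p = 1.5 * np.exp(-(r - 3.0) ** 2 / (2.0 * 0.2 ** 2))

mu, sigma, amplitude, area = fit_gaussian_peak(r, p)
print(f"mu={mu:.3f} nm, sigma={sigma:.3f} nm, area={area:.4f}")
```

      Comparing the fitted areas of the same peak between two samples gives a single number for how much that population changes. For noisy experimental data, a weighted or nonlinear least-squares fit would usually be preferable to the log transform.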

      2) Perhaps as a consequence of the above, there seems to be a slight tendency towards overstatements regarding the novelty of the findings in the context of previous structural studies. The E1-P·ATP tight structure is extremely similar to the previously published crystal structure (5MRW), but it took me three reads through the paper and a structural superposition (overall RMSD less than 2Å), to realise that. While I do see that the existing differences (the two helix shifts in the P and A domains) are important and do probably permit the usage of the term 'novel conformation' (I don't think there is a clear consensus on what level of change defines a novel conformation), it could have been made more clear that the 'tight' arrangement of domains has actually been reported before, only it was not termed 'tight'.

      As indicated above, we have now included an extensive RMSD table between all available KdpFABC structures. To ensure a meaningful comparison, the RMSDs are calculated only between the cytosolic domains after superimposition of the full-length protein on KdpA, as the transmembrane region of KdpFABC is largely rigid (see figure below panel B). However, we have to note that in the X-ray structure the transmembrane region of KdpB is displaced relative to the rest of the complex when compared to the arrangement found in any of the other 18 cryo-EM structures, which all align well in the TMD (see figure below panel C). These deviations make the crystal structure somewhat of an outlier and might be a consequence of the crystal packing (see figure below panel A). For completeness in our comparison with the X-ray structure, we have included an RMSD calculated when superimposed on KdpA and an additional RMSD that was calculated between structures when aligned on the TMD of KdpB (see figure below panel D,E). The RMSD of less than 2Å that the reviewer mentions was probably obtained when superimposing the entire complexes on each other (see figure below panel F). However, we do not believe that this is a reasonable comparison as the TMD of the complex is significantly displaced, which stands in strong contrast to all other RMSDs calculated between the rest of the structures where the TMD aligns well (see figure below panel B).

      From the resulting comparisons, we conclude that the E1P-tight and the X-ray structure do have a certain similarity but are not identical, in particular in the relative orientation of the cytosolic domains to the rest of the complex. We hope that including the RMSD in the text and separately highlighting the important features of the E1P tight state in the section “E1P tight is the consequence of an impaired E1P/E2P transition” makes the story now more conclusive.

      Likewise, the authors claim that they have covered the entire conformational cycle with their 10 structures, but this is actually not correct, as there is no representative of an E2 state or functional E1P state after ADP release.

      This is correct, and we have adjusted the phrasing to “close to the entire conformational cycle” or “the entire KdpFABC conformational cycle except the highly transient E1P state after ADP release and E2 state after dephosphorylation.”

      3) A key hypothesis this paper suggests is that KdpFABC cannot undergo the transition from E1P tight to E2P and hence gets stuck in this dead end 'off cycle' state. To test this, the authors analysed an S162-P sample supplied with the E2P inducing inhibitor orthovanadate and found about 11% of particles in an E2P conformation. This is rationalised as a residual fraction of unphosphorylated, non-inhibited, protein in the sample, but the sample is not actually tested for residual unphosphorylated fraction or residual activity. Instead, there is a reference to Sweet et al, 2020. So the claim that the 11% E2P particles in the vanadate sample are irrelevant, whereas the 14% E1P tight from the turnover dataset are of key importance, would strongly benefit from some additional validation.

      We have added an ATPase assay that shows the residual ATPase activity of WT KdpFABC compared to KdpFABS162AC, both purified from E. coli LB2003 cells, which is identical to the protein production and purification for the cryo-EM samples (see Figure 2-Suppl. Figure 5). The residual ATPase activity is ca. 14% of the uninhibited sample, which correlates with the E2-P fraction in the orthovanadate sample.

      Reviewer #3 (Public Review):

      The authors have determined a range of conformations of the high-affinity prokaryotic K+ uptake system KdpFABC, and demonstrate at least two novel states that shed further light on the structure and function of these elusive protein complexes.

      The manuscript is well-written and easy to follow. The introduction puts the work in a proper context and highlights gaps in the field. I am however missing an overview of the currently available structures/states of KdpFABC. This could also be implemented in Fig. 6 (highlighting new vs available data). This is also connected to one of my main remarks - the lack of comparisons and RMSD estimates to available structures. Similarity/resemblance to available structures is indicated several times throughout the manuscript, but this is not quantified or shown in detail, and hence it is difficult for the reader to grasp how unique or alike the structures are. Linked to this, I am somewhat surprised by the lack of considerable changes within the TM domain and the overlapping connectivity of the K indicated in Table 1 - Figure Supplement 1. According to Fig. 6 the uptake pathway should be open in early E1 states, but not in E2 states, in contrast to Table 1 - Figure Supplement 1, which shows connectivity in all structures? Furthermore, the release pathway (to the inside) should be open in the E2-P conformation, but no release pathway is shown as K ions in any of the structures in Table 1 - Figure Supplement 1. Overall, there seem to be only rather small shifts between the shown structures (are the structures changing from closed to inward-open?). Or is it only KdpA that is shown?

      We thank the reviewer for their positive response and constructive criticisms. We have addressed these comments as follows:

      1. The overview of the available structures has been implemented in Fig. 6, with the new structures from this study highlighted in bold.

      2. RMSD values have been added to all comparisons, with a focus on the deviations of the cytosolic domains, which are most relevant to our conformational assignments and discussions.

      3. To highlight the (comparatively small) changes in the TMD, we have expanded Table 1 - Figure Supplement 1 to include panels showing the outward-open half-channel in the E1 states with a constriction at the KdpA/KdpB interface and the inward-open half-channel in the E2 states. The largest observable rearrangements do however take place in the cytosolic domains. This is in full agreement with previous studies, which focused more on the transition occurring within the transmembrane region during the transport cycle (Stock et al., Nature Communications 2018; Silberberg et al., Nature Communications 2021; Sweet et al., PNAS 2021).

      4. The ions observed in the intersubunit tunnel are all before the point at which the tunnel closes, explaining why there is no difference in this region between E1 and E2 structures. Moreover, as we discussed in our last publication (Silberberg, Corey, Hielkema et al., 2021, Nat. Comms.), the assignment of non-protein densities along the entire length of the tunnel is contentious and can only be certain in the selectivity filter of KdpA and the CBS of KdpB.

      5. The release pathway from the CBS does not feature any defined K+ coordination sites, so ions are not expected to stay bound along this inward-open half-channel.

      My second key remark concerns the "E1-P tight is the consequence of an impaired E1-P/E2-P transition" section, and the associated discussion, which is very interesting. I am not convinced though that the nucleotide and phosphate mimic-stabilized states (such as E1-P:ADP) represent the high-energy E1P state, as I believe is indicated in the text. Supportive of this, in SERCA, the shifts from the E1:ATP to the E1P:ADP structures are modest, while the following high-energy Ca-bound E1P and E2P states remain elusive (see Fig. 1 in PMID: 32219166, from 3N8G to 3BA6). Or maybe this is not what the authors claim, or the situation is different for KdpFABC? Associated, while I agree with the statement in rows 234-237 (that the authors likely have caught an off-cycle state), I wonder if the tight E1-P configuration could relate to the elusive high-energy states (although initially counter-intuitive as it has been caught in the structure)? The claims on rows 358-360 and 420-422 are not in conflict with such an idea, and the authors touch on this subject on rows 436-450. Can it be excluded that it is the proper elusive E1P state? If the state is related to the E1P conformation it may well have bearing also on other P-type ATPases and this could be expanded upon.

      This is a good point, particularly since the E1P·ADP state is the most populated state in our sample, which is also counterintuitive for a “high-energy unstable state”. One possible explanation is that this state already has some of the E1-P strains (which we can see in the clash of D307-P with D518/D522), but the ADP and its associated Mg2+ in particular help to stabilize this. Once ADP dissociates and takes the Mg2+ with it, the full destabilization takes effect in the actual high-energy E1P state. Nonetheless, we consider it fair to compare the E1P tight with the E1P·ADP to look for electrostatic relaxation. We have clarified the sequence of events and the role we hypothesize for ADP/Mg2+ in stabilizing the E1P·ADP state that we can see (lines 609-619): “Moreover, a comparison of the E1P tight structure with the E1P·ADP structure, its most immediate precursor in the conformational cycle obtained, reveals a number of significant rearrangements within the P domain (Figure 5B,C). First, Helix 6 (KdpB538-545) is partially unwound and has moved away from helix 5 towards the A domain, alongside the tilting of helix 4 of the A domain (Figure 5B,C – arrow 2). Second, and of particular interest, are the additional local changes that occur in the immediate vicinity of the phosphorylated KdpBD307. In the E1P·ADP structure, the catalytic aspartyl phosphate, located in the D307KTG signature motif, points towards the negatively charged KdpBD518/D522. This strain is likely to become even more unfavorable once ADP dissociates in the E1P state, as the Mg2+ associated with the ADP partially shields these clashes. The ensuing repulsion might serve as a driving force for the system to relax into the E2 state in the catalytic cycle.”

      We believe it is highly unlikely that the reported E1-P tight state represents an on-cycle high-energy E1P intermediate. For one, we observe a relaxation of electrostatic strains in this structure, in particular when compared to the obtained E1P·ADP state. By contrast, the E1P should be the most energetically unfavourable state possible to ensure the rapid transition to the E2P state. As such, this state should be a transient state, making it less likely to be obtainable structurally as an accumulated state. Additionally, the association of the N domain with the A domain in the tight conformation, which would have to be reverted, would be a surprising intermediary step in the transition from E1P to E2P. Altogether, the E1P tight state reported here most likely represents an off-cycle state.

    1. Author Response

      Reviewer #1 (Public Review):

      Buglak et al. describe a role for the nuclear envelope protein Sun1 in endothelial mechanotransduction and vascular development. The study provides a full mechanistic investigation of how Sun1 is achieving its function, which supports the concept that nuclear anchoring is important for proper mechanosensing and junctional organization. The experiments have been well designed and were quantified based on independent experiments. The experiments are convincing and of high quality and include Sun1 depletion in endothelial cell cultures, zebrafish, and in endothelial-specific inducible knockouts in mice.

      We thank the reviewer for their enthusiastic comments and for noting our use of multiple model systems.

      Reviewer #2 (Public Review):

      Endothelial cells mediate the growth of the vascular system but they also need to prevent vascular leakage, which involves interactions with neighboring endothelial cells (ECs) through junctional protein complexes. Buglak et al. report that the EC nucleus controls the function of cell-cell junctions through the nuclear envelope-associated proteins SUN1 and Nesprin-1. They argue that SUN1 controls microtubule dynamics and junctional stability through the RhoA activator GEF-H1.

      In my view, this study is interesting and addresses an important but very little-studied question, namely the link between the EC nucleus and cell junctions in the periphery. The study has also made use of different model systems, i.e. genetically modified mice, zebrafish, and cultured endothelial cells, which confirms certain findings and utilizes the specific advantages of each model system. A weakness is that some important controls are missing. In addition, the evidence for the proposed molecular mechanism should be strengthened.

      We thank the reviewer for their interest in our work and for highlighting the relative lack of information regarding connections between the EC nucleus and cell periphery, and for noting our use of multiple model systems. We thank the reviewer for suggesting additional controls and mechanistic support, and we have made the revisions described below.

      Specific comments:

      1) Data showing the efficiency of Sun1 inactivation in the murine endothelial cells is lacking. It would be best to see what is happening on the protein level, but it would already help a great deal if the authors could show a reduction of the transcript in sorted ECs. The excision of a DNA fragment shown in the lung (Fig. 1-suppl. 1C) is not quantitative at all. In addition, the gel has been run way too short so it is impossible to even estimate the size of the DNA fragment.

      We agree that the DNA excision is not sufficient to demonstrate excision efficiency. We attempted examination of SUN1 protein levels in mutant retinas via immunofluorescence, but to date we have not found a SUN1 antibody that works in mouse retinal explants. We argue that mouse EC isolation protocols enrich but don’t give 100% purity, so that RNA analysis of lung tissue also has caveats. Finally, we contend that our demonstration of a consistent vascular phenotype in Sun1iECKO mutant retinas argues that excision has occurred. To test the efficiency of our excision protocol, we bred Cdh5CreERT2 mice with the ROSAmT/mG excision reporter (cells express tdTomato absent Cre activity and express GFP upon Cre-mediated excision; Muzumdar et al., 2007). Utilizing the same excision protocol as used for the Sun1iECKO mice, we see a significantly high level of excision in retinal vessels only in the presence of Cdh5CreERT2 (Reviewer Figure 1).

      Reviewer Figure 1: Cdh5CreERT2 efficiently excises in endothelial cells of the mouse postnatal retina. (A) Representative images of P7 mouse retinas with the indicated genotypes, stained for ERG (white, nucleus). tdTomato (magenta) is expressed in cells that have not undergone Cre-mediated excision, while GFP (green) is expressed in excised cells. Scale bar, 100 μm. (B) Quantification of tdTomato fluorescence relative to GFP fluorescence as shown in A. tdTomato and GFP fluorescence of endothelial cells was measured by creating a mask of the ERG channel. n=3 mice per genotype. ***, p<0.001 by Student’s two-tailed unpaired t-test.
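      The quantification described in panel B can be sketched as follows. This is an illustrative outline only, under stated assumptions: the images are represented as small nested lists, the ERG threshold is arbitrary, all numbers are made up, and `masked_ratio` and `student_t` are hypothetical helper names. The sketch computes the tdTomato-to-GFP ratio inside an ERG-derived mask and the unpaired Student's t statistic with pooled variance; converting the statistic to a two-tailed p-value would additionally require the t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance

def masked_ratio(erg, tdtomato, gfp, threshold):
    """Summed tdTomato over summed GFP, restricted to pixels where ERG exceeds threshold."""
    td_sum = 0.0
    gfp_sum = 0.0
    for erg_row, td_row, gfp_row in zip(erg, tdtomato, gfp):
        for e, t, g in zip(erg_row, td_row, gfp_row):
            if e > threshold:  # pixel belongs to the ERG (endothelial nucleus) mask
                td_sum += t
                gfp_sum += g
    if gfp_sum == 0:
        raise ValueError("no GFP signal inside the mask")
    return td_sum / gfp_sum

def student_t(a, b):
    """Unpaired Student's t statistic (equal variances assumed) and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1.0 / na + 1.0 / nb))
    return t, na + nb - 2

# Toy 2x2 "images" (made-up intensities): only the two bright ERG pixels are counted.
erg = [[0, 10], [10, 0]]
tdtomato = [[5, 2], [4, 9]]
gfp = [[1, 3], [3, 1]]
ratio = masked_ratio(erg, tdtomato, gfp, threshold=5)

# Made-up per-animal ratios for two genotypes, n=3 each.
t_stat, dof = student_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(ratio, t_stat, dof)
```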

      2) The authors show an increase in vessel density in the periphery of the growing Sun1 mutant retinal vasculature. It would be important to add staining with a marker labelling EC nuclei (e.g. Erg) because higher vessel density might reflect changes in cell size/shape or number, which has also implications for the appearance of cell-cell junctions. More ECs crowded within a small area are likely to have more complicated junctions. Furthermore, it would be useful and straightforward to assess EC proliferation, which is mentioned later in the experiments with cultured ECs but has not been addressed in the in vivo part.

      We concur that ERG staining is important to show any changes in nuclear shape or cell density in the post-natal retina. We now include this data in Figure1-figure supplement 1F-G. We do not see obvious changes in nuclear shape or number, though we do observe some crowding in Sun1iECKO retinas, consistent with increased density. However, when normalized to total vessel area, we do not observe a significant difference in the nuclear signal density in Sun1iECKO mutant retinas relative to controls.

      3) It appears that the loss of Sun1/sun1b in mice and zebrafish is compatible with major aspects of vascular growth and leads to changes in filopodia dynamics and vascular permeability (during development) without severe and lasting disruption of the EC network. It would be helpful to know whether the loss-of-function mutants can ultimately form a normal vascular network in the retina and trunk, respectively. It might be sufficient to mention this in the text.

      We thank the reviewer for pointing this out. It is true that developmental defects in the vasculature resulting from various genetic mutations are often resolved over time. We’ve made text changes to discuss viability of Sun1 global KO mice and lack of perduring effects in sun1 morphant fish, perhaps resulting from compensation by SUN2, which is partially functionally redundant with SUN1 in vivo (Lei et al., 2009; Zhang, et al., 2009) (p. 20).

      4) The only readout after the rescue of the SUN1 knockdown by GEF-H1 depletion is the appearance of VE-cadherin+ junctions (Fig. 6G and H). This is insufficient evidence for a relatively strong conclusion. The authors should at least look at microtubules. They might also want to consider the activation status of RhoA as a good biochemical readout. It is argued that RhoA activity goes up (see Fig. 7C) but there is no data supporting this conclusion. It is also not clear whether "diffuse" GEF-H1 localization translates into increased Rho A activity, as is suggested by the Rho kinase inhibition experiment. GEF-H1 levels in the Western blot in (Fig. 6- supplement 2C) have not been quantitated.

      We agree that analysis of RhoA activity and additional analysis of rescued junctions strengthens our conclusions, so we performed these experiments. New data (Figure 6I-J) shows that co-depletion of SUN1 and GEF-H1 rescues junction integrity as measured by biotin-matrix labeling. Interestingly, co-depletion of SUN1 and GEF-H1 does not rescue reduced microtubule density at the periphery (Figure 6-figure supplement 3B-C), placing GEF-H1 downstream of aberrant microtubule dynamics in SUN1 depleted cells. This is consistent with our model (Figure 8) describing how loss of SUN1 leads to increased microtubule depolymerization, resulting in release and activation of GEF-H1 that goes on to affect actomyosin contractility and junction integrity. In addition, we include images of the junctions in GEF-H1 single KD (Figure 6-figure supplement 3B-C) and quantify the western blot in Figure 6-figure supplement 3A.

      We performed RhoA activity assays and new data shows that SUN1 depletion results in increased RhoA activation, while co-depletion of SUN1 and GEF-H1 ameliorates this increase (Figure 6-figure supplement 2D). This is consistent with our model in which loss of SUN1 leads to increased RhoA activity via release of GEF-H1 from microtubules. In addition, we now cite a recent study describing that GEF-H1 is activated when unbound to microtubules, with this activation resulting in increased RhoA activity (Azoitei et al., 2019).

      5) The criticism raised for the GEF-H1 rescue also applies to the co-depletion of SUN1 and Nesprin-1. This mechanistic aspect is currently somewhat weak and should be strengthened. Again, Rho A activity might be a useful and quantitative biochemical readout.

      We respectfully point out that we showed that co-depletion of nesprin-1 and SUN1 rescues SUN1 knockdown effects via several readouts, including rescue of junction morphology, biotin labeling, microtubule localization at the periphery, and GEF-H1/microtubule localization. We’ve moved this data to the main figure (Figure 7B-C, E-F) to better highlight these mechanistic findings. These results are consistent with our model that nesprin-1 effects are upstream of GEF-H1 localization. We also added results showing that nesprin-1 knockdown alone does not affect junction integrity, microtubule density, or GEF-H1/microtubule localization (Figure 7-figure supplement 1B-G).

      Reviewer #3 (Public Review):

      Here, Buglak and coauthors describe the effect of Sun1 deficiency on endothelial junctions. Sun1 is a component of the LINC complex, connecting the inner nuclear membrane with the cytoskeleton. The authors show that in the absence of Sun1, the morphology of the endothelial adherens junction protein VE-cadherin is altered, indicative of increased internalization of VE-cadherin. The change in VE-cadherin dynamics correlates with decreased angiogenic sprouting as shown using in vivo and in vitro models. The study would benefit from a stricter presentation of the data and needs additional controls in certain analyses.

      We thank the reviewer for their insightful comments, and in response we have performed the revisions described below.

      1) The authors implicate the changes in VE-cadherin morphology to be of consequence for "barrier function" and mention barrier function frequently throughout the text, for example in the heading on page 12: "SUN1 stabilizes endothelial cell-cell junctions and regulates barrier function". The concept of "barrier" implies the ability of endothelial cells to restrict the passage of molecules and cells across the vessel wall. This is tested only marginally (Suppl Fig 1F) and these data are not quantified. Increased leakage of 10kDa dextran in a P6-7 Sun1-deficient retina as shown here probably reflects the increased immaturity of the Sun1-deficient retinal vasculature. From these data, the authors cannot state that Sun1 regulates the barrier or barrier function (unclear what exactly the authors refer to when they make a distinction between the barrier as such on the one hand and barrier function on the other). The authors can, if they do more experiments, state that loss of Sun1 leads to increased leakage in the early postnatal stages in the retina. However, if they wish to characterize the vascular barrier, there is a wide range of other tissue that should be tested, in the presence and absence of disease. Moreover, a regulatory role for Sun1 would imply that Sun1 normally, possibly through changes in its expression levels, would modulate the barrier properties to allow more or less leakage in different circumstances. However, no such data are shown. The authors would need to go through their paper and remove statements regarding the regulation of the barrier and barrier function since these are conclusions that lack foundation.

      We thank the reviewer for pointing out that the language used regarding the function and integrity of the junctions is confusing, although we suggest that the endothelial cell properties measured by our assays are typically equated with “barrier function” in the literature. However, we have edited our language to precisely describe our results as suggested by the reviewer.

      2) In Fig 6g, the authors show that "depletion of GEF-H1 in endothelial cells that were also depleted for SUN1 rescued the destabilized cell-cell junctions observed with SUN1 KD alone". However, it is quite clear that Sun1 depletion also affects cell shape and cell alignment and this is not rescued by GEF-H1 depletion (Fig 6g). This should be described and commented on. Moreover please show the effects of GEF-H1 alone.

      We thank the reviewer for pointing out the effects on cell shape. SUN1 depletion typically leads to shape changes consistent with elevated contractility, but this is considered to be downstream of the effects quantified here. We updated the panel in Figure 6G to a more representative image showing cell shape rescue by co-depletion of SUN1 and GEF-H1. We present new data panels showing that GEF-H1 depletion alone does not affect junction integrity (Figure 6I-J). We also present new data showing that co-depletion of GEF-H1 and SUN1 does not rescue microtubule density at the periphery (Figure 6-figure supplement 3B-C), consistent with our model that GEF-H1 activation is downstream of microtubule perturbations induced by SUN1 loss.

      3) In Fig. 6a, the authors show rescue of junction morphology in Sun1-depleted cells by deletion of Nesprin1. The effect of Nesprin1 KD alone is missing.

      We thank the reviewer for this comment, and we now include new panels (Figure 7-figure supplement 1B-G) demonstrating that Nesprin-1 depletion does not affect biotin-matrix labeling, peripheral microtubule density, or GEF-H1/microtubule localization absent co-depletion with SUN1. These findings are consistent with our model that Nesprin-1 loss does not affect cell junctions on its own because it is held in a non-functional complex with SUN1 that is not available in the absence of SUN1.

      References

      Azoitei, M. L., Noh, J., Marston, D. J., Roudot, P., Marshall, C. B., Daugird, T. A., Lisanza, S. L., Sandí, M., Ikura, M., Sondek, J., Rottapel, R., Hahn, K. M., & Danuser, G. (2019). Spatiotemporal dynamics of GEF-H1 activation controlled by microtubule- and Src-mediated pathways. Journal of Cell Biology, 218(9), 3077-3097. https://doi.org/10.1083/jcb.201812073

      Denis, K. B., Cabe, J. I., Danielsson, B. E., Tieu, K. V, Mayer, C. R., & Conway, D. E. (2021). The LINC complex is required for endothelial cell adhesion and adaptation to shear stress and cyclic stretch. Molecular Biology of the Cell, mbcE20110698. https://doi.org/10.1091/mbc.E20-11-0698

      King, S. J., Nowak, K., Suryavanshi, N., Holt, I., Shanahan, C. M., & Ridley, A. J. (2014). Nesprin-1 and nesprin-2 regulate endothelial cell shape and migration. Cytoskeleton (Hoboken, N.J.), 71(7), 423–434. https://doi.org/10.1002/cm.21182

      Lei, K., Zhang, X., Ding, X., Guo, X., Chen, M., Zhu, B., Xu, T., Zhuang, Y., Xu, R., & Han, M. (2009). SUN1 and SUN2 play critical but partially redundant roles in anchoring nuclei in skeletal muscle cells in mice. PNAS, 106(25), 10207–10212.

      Muzumdar, M. D., Tasic, B., Miyamichi, K., Li, L., & Luo, L. (2007). A global double-fluorescent Cre reporter mouse. Genesis, 45(9), 593-605. https://doi.org/10.1002/dvg.20335

      Ueda, N., Maekawa, M., Matsui, T. S., Deguchi, S., Takata, T., Katahira, J., Higashiyama, S., & Hieda, M. (2022). Inner Nuclear Membrane Protein, SUN1, is Required for Cytoskeletal Force Generation and Focal Adhesion Maturation. Frontiers in Cell and Developmental Biology, 10, 885859. https://doi.org/10.3389/fcell.2022.885859

      Zhang, X., Lei, K., Yuan, X., Wu, X., Zhuang, Y., Xu, T., Xu, R., & Han, M. (2009). SUN1/2 and Syne/Nesprin-1/2 complexes connect centrosome to the nucleus during neurogenesis and neuronal migration in mice. Neuron, 64(2), 173–187. https://doi.org/10.1016/j.neuron.2009.08.018

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study by Teplenin and coworkers assesses the combined effects of localized depolarization and excitatory electrical stimulation in myocardial monolayers. They study the electrophysiological behaviour of cultured neonatal rat ventricular cardiomyocytes expressing the light-gated cation channel Cheriff, allowing them to induce local depolarization of varying area and amplitude, the latter titrated by the applied light intensity. In addition, they used computational modeling to screen for critical parameters determining state transitions and to dissect the underlying mechanisms. Two stable states, thus bistability, could be induced upon local depolarization and electrical stimulation, one state characterized by a constant membrane voltage and a second, spontaneously firing, thus oscillatory state. The resulting 'state' of the monolayer was dependent on the duration and frequency of electrical stimuli, as well as the size of the illuminated area and the applied light intensity, determining the degree of depolarization as well as the steepness of the local voltage gradient. In addition to the induction of oscillatory behaviour, they also tested frequency-dependent termination of induced oscillations.

      Strengths:

      The data from optogenetic experiments and computational modelling provide quantitative insights into the parameter space determining the induction of spontaneous excitation in the monolayer. The most important findings can also be reproduced using a strongly reduced computational model, suggesting that the observed phenomena might be more generally applicable.

      Weaknesses:

      While the study is thoroughly performed and provides interesting mechanistic insights into scenarios of ventricular arrhythmogenesis in the presence of localized depolarized tissue areas, the translational perspective of the study remains relatively vague. In addition, the chosen theoretical approach and the way the data are presented might make it difficult for the wider community of cardiac researchers to understand the significance of the study.

      Reviewer #2 (Public review):

      In the presented manuscript, Teplenin and colleagues use both electrical pacing and optogenetic stimulation to create a reproducible, controllable source of ectopy in cardiomyocyte monolayers. To accomplish this, they use a careful calibration of electrical pacing characteristics (i.e., frequency, number of pulses) and illumination characteristics (i.e., light intensity, surface area) to show that there exists a "sweet spot" where oscillatory excitations can emerge proximal to the optogenetically depolarized region following electrical pacing cessation, akin to pacemaker cells. Furthermore, the authors demonstrate that a high-frequency electrical wave-train can be used to terminate these oscillatory excitations. The authors observed this oscillatory phenomenon both in vitro (using neonatal rat ventricular cardiomyocyte monolayers) and in silico (using a computational action potential model of the same cell type). These are surprising findings and provide a novel approach for studying triggered activity in cardiac tissue.

      The study is extremely thorough and one of the more memorable and grounded applications of cardiac optogenetics in the past decade. One of the benefits of the authors' "two-prong" approach of experimental preps and computational models is that they could probe the number of potential variable combinations much deeper than through in vitro experiments alone. The strong similarities between the real-life and computational findings suggest that these oscillatory excitations are consistent, reproducible, and controllable.

      Triggered activity, which can lead to ventricular arrhythmias and cardiac sudden death, has been largely attributed to sub-cellular phenomena, such as early or delayed afterdepolarizations, and thus to date has largely been studied in isolated single cardiomyocytes. However, these findings have been difficult to translate to tissue and organ-scale experiments, as well-coupled cardiac tissue has notably different electrical properties. This underscores the significance of the study's methodological advances: the use of a constant depolarizing current in a subset of (illuminated) cells to reliably result in triggered activity could facilitate the more consistent evaluation of triggered activity at various scales. The result is an experimental prep that is both repeatable and controllable (i.e., both initiated and terminated through the same means).

      The authors also substantially explored phase space and single-cell analyses to document how this "hidden" bi-stable phenomenon can be uncovered during emergent collective tissue behavior. Calibration and testing of different aspects (e.g., light intensity, illuminated surface area, electrical pulse frequency, electrical pulse count) and other deeper analyses, as illustrated in Appendix 2, Figures 3-8, are significant and commendable.

      Given that the study is computational, it is surprising that the authors did not replicate their findings using well-validated adult ventricular cardiomyocyte action potential models, such as ten Tusscher 2006 or O'Hara 2011. This may have felt out of scope, given the nice alignment of rat cardiomyocyte data between in vitro and in silico experiments. However, it would have been helpful peace-of-mind validation, given the significant ionic current differences between neonatal rat and adult ventricular tissue. It is not fully clear whether the pulse trains could have resulted in the same bi-stable oscillatory behavior, given the longer APD of humans relative to rats. The observed phenomenon certainly would be frequency-dependent and would have required tedious calibration for a new cell type, albeit partially mitigated by the relative ease of in silico experiments.

      For all its strengths, there are likely significant mechanistic differences between this optogenetically tied oscillatory behavior and triggered activity observed in other studies. This is because the constant light-elicited depolarizing current is disrupting the typical resting cardiomyocyte state, thereby altering the balance between depolarizing ionic currents (such as Na+ and Ca2+) and repolarizing ionic currents (such as K+ and Ca2+). The oscillatory excitations appear to later emerge at the border of the illuminated region and non-stimulated surrounding tissue, which is likely an area of high source-sink mismatch. The authors appear to acknowledge differences in this oscillatory behavior and previous sub-cellular triggered activity research in their discussion of ectopic pacemaker activity, which is canonically expected more so from genetic or pathological conditions. Regardless, it is exciting to see new ground being broken in this difficult-to-characterize experimental space, even if the method illustrated here may not necessarily be broadly applicable.

      We thank the reviewers for their thoughtful and constructive feedback, as well as for recognizing the conceptual and technical strengths of our work. We are especially pleased that our integrated use of optogenetics, electrical pacing, and computational modelling was seen as a rigorous and innovative approach to investigating spontaneous excitability in cardiac tissue.

      At the core of our study was the decision to focus exclusively on neonatal rat ventricular cardiomyocytes. This ensured a tightly controlled and consistent environment across experimental and computational settings, allowing for direct comparison and deeper mechanistic insight. While extending our findings to adult or human cardiomyocytes would enhance translational relevance, such efforts are complicated by the distinct ionic properties and action potential dynamics of these cells, as also noted by Reviewer #2. For this foundational study, we chose to prioritize depth and clarity over breadth.

      Our computational domain was designed to faithfully reflect the experimental system. The strong agreement between both domains is encouraging and supports the robustness of our framework. Some degree of theoretical abstraction was necessary, which at times makes the manuscript harder to read, but it reflects the intrinsic complexity of the collective behaviours we aimed to capture, such as emergent bi-stability. To make these ideas more accessible, we included simplified illustrations, a reduced model, and extensive supplementary material.

      A key insight from our work is the emergence of oscillatory behaviour through interaction of illuminated and non-illuminated regions. Rather than replicating classical sub-cellular triggered activity, this behaviour arises from systems-level dynamics shaped by the imposed depolarizing current and surrounding electrotonic environment. By tuning illumination and local pacing parameters, we could reproducibly induce and suppress these oscillations, thereby providing a controllable platform to study ectopy as a manifestation of spatial heterogeneity and collective dynamics.

      Altogether, our aim was to build a clear and versatile model system for investigating how spatial structure and pacing influence the conditions under which bistability becomes apparent in cardiac tissue. We believe this platform lays strong groundwork for future extensions into more physiologically and clinically relevant contexts.

      In revising the manuscript, we carefully addressed all points raised by the reviewers. We have also responded to each of their specific comments in detail, which are provided below.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      Please find my specific comments and suggestions below:

      (1) Line 64: When first introduced, the concept of 'emergent bi-stability' may not be clear to the reader.

      We concur that the full breadth of the concept of emergent bi-stability may not be immediately clear upon first mention. Nonetheless, its components have been introduced separately: “emergent” was linked to multicellular behaviour in line 63, while “bi-stability” was described in detail in lines 39–56. We therefore believe that readers can form an intuitive understanding of the combined term, which is further clarified as the manuscript develops. To further aid the reader's comprehension, we have added the following clarification to line 64:

      “Within this dynamic system of cardiomyocytes, we investigated emergent bi-stability (a concept that will be explained more thoroughly later on) in cell monolayers under the influence of spatial depolarization patterns.”

      (2) Lines 67-80: While the introduction until line 66 is extremely well written, the introduction of both cardiac arrhythmia and cardiac optogenetics could be improved. It is especially surprising that miniSOG is first mentioned as a tool for optogenetic depolarisation of cardiomyocytes, as the authors would probably agree that Channelrhodopsins are by far the most commonly applied tools for optogenetic depolarisation (please also refer to the literature by others in this respect). In addition, miniSOG has side effects other than depolarisation, and thus cannot be the tool of choice when not directly studying the effects of oxidative stress or damage.

      The reviewer is absolutely correct in noting that channelrhodopsins are the most commonly applied tools for optogenetic depolarisation. We introduced miniSOG primarily for historical context: the effects of specific depolarization patterns on collective pacemaker activity were first observed with this tool (Teplenin et al., 2018). In that paper, we also reported ultralong action potentials, occurring as a side effect of cumulative miniSOG-induced ROS damage. In the following paragraph (starting at line 81), we emphasize that membrane potential can be controlled much better using channelrhodopsins, which is why we employed them in the present study.

      (3) Line 78: I appreciate the concept of 'high curvature', but please always state which parameter(s) you are referring to (membrane voltage in space/time, etc?).

      We corrected our statement to include the specification of space curvature of the depolarised region:

      “In such a system, it was previously observed that spatiotemporal illumination can give rise to collective behaviour and ectopic waves (Teplenin et al. (2018)) originating from illuminated/depolarised regions (with high spatial curvature).”

      (4) Line 79: 'bi-stable state' - not yet properly introduced in this context.

      The bi-stability mentioned here refers back to single cell bistability introduced in Teplenin et al. (2018), which we cited again for clarity.

      “These waves resulted from the interplay between the diffusion current and the single cell bi-stable state (Teplenin et al. (2018)) that was induced in the illuminated region.”

      (5) Line 84-85: 'these ion channels allow the cells to respond' - please describe the channel used; and please correct: the channels respond to light, not the cells. Re-ordering this paragraph may help, because first you introduce channels for depolarization, then you go back to both de- and hyperpolarization. On the same note, which channels can be used for hyperpolarization of cardiomyocytes? I am not aware of any, even WiChR shows depolarizing effects in cardiomyocytes during prolonged activation (Vierock et al. 2022). Please delete: 'through a direct pathway' (Channelrhodopsins are directly light-gated channels, there are no pathways involved).

      We realised that the confusion arose from our use of incorrect terminology: we mistakenly wrote hyperpolarisation instead of repolarisation. In addition to channelrhodopsins such as WiChR, other tools can also induce a repolarising effect, including light-activatable chloride pumps (e.g., JAWS). However, to improve clarity, we recognize that repolarisation is not relevant to our manuscript and therefore decided to remove its mention (see below). Regarding the reported depolarising effects of WiChR in Vierock et al. (2022), we speculate that these may arise either from the specific phenotype of the cardiomyocytes used in the study, i.e. human induced pluripotent stem cell-derived atrial myocytes (aCMs), or from the particular ionic conditions applied during patch-clamp recordings (e.g., a bath solution containing 1 mM KCl). Notably, even after prolonged WiChR activation, the aCMs maintained a strongly negative maximum diastolic potential of approximately –55 mV.

      “Although effects of illuminating miniSOG with light might lead to formation of depolarised areas, it is difficult to control the process precisely since it depolarises cardiomyocytes indirectly. Therefore, in this manuscript, we used light-sensitive ion channels to obtain more refined control over cardiomyocyte depolarisation. These ion channels allow the cells to respond to specific wavelengths of light, facilitating direct depolarisation (Ördög et al. (2021, 2023)). By inducing cardiomyocyte depolarisation only in the illuminated areas, optogenetics enables precise spatiotemporal control of cardiac excitability, an attribute we exploit in this manuscript (Appendix 2 Figure 1).”

      (6) Figure 1: What would be the y-axis of the 'energy-like curves' in B? What exactly did you plot here?

      The graphs in Figure 1B are schematic representations intended to clarify the phenomenon for the reader. They do not depict actual data from any simulation or experiment. We clarified this misunderstanding by specifying that Figure 1B is a schematic representation of the effects at play in this paper.

      “(B) Schematic representation showing how light intensity influences collective behaviour of excitable systems, transitioning between a stationary state (STA) at low illumination intensities and an oscillatory state (OSC) at high illumination intensities. Bi-stability occurs at intermediate light intensities, where transitions between states are dependent on periodic wave train properties. TR. OSC, transient oscillations.”

      To expand slightly beyond the paper: our schematic representation was inspired by a common visualization in dynamical systems used to illustrate bi-stability (for an example, see Fig. 3 in Schleimer, J. H., Hesse, J., Contreras, S. A., & Schreiber, S. (2021). Firing statistics in the bistable regime of neurons with homoclinic spike generation. Physical Review E, 103(1), 012407). In this framework, the y-axis can indeed be interpreted as an energy landscape, which is related to a probability measure through the Boltzmann distribution: p ∝ exp(−E/k<sub>B</sub>T). Here, p denotes the probability of occupying a particular state (STA or OSC) and E the corresponding energy-like quantity. This probability can be estimated from the area (BCL × number of pulses) falling within each state, as shown in Fig. 4C. Since an attractor corresponds to a high-probability state, it naturally appears as a potential well in the landscape.
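
The link between state occupancy and an energy-like landscape can be sketched in a few lines. This is a minimal illustration only: it assumes the Boltzmann relation in the dimensionless form p ∝ exp(−E), so that E = −ln p up to an additive constant, and the area values below are hypothetical placeholders rather than measurements from Fig. 4C.

```python
import math


def energy_landscape(state_areas: dict[str, float]) -> dict[str, float]:
    """Map the parameter-space area occupied by each state to an energy-like value.

    Uses E = -ln(p) with p = area / total area (dimensionless Boltzmann form,
    an assumption of this sketch). States with zero area get infinite energy.
    """
    total = sum(state_areas.values())
    if total <= 0:
        raise ValueError("total area must be positive")
    return {
        state: (-math.log(area / total) if area > 0 else math.inf)
        for state, area in state_areas.items()
    }


# Hypothetical areas (BCL x number of pulses) for each collective state.
areas = {"STA": 30.0, "OSC": 10.0}
landscape = energy_landscape(areas)
for state, energy in landscape.items():
    print(f"{state}: E = {energy:.3f}")
```

The state covering the larger area receives the lower energy, which is why an attractor shows up as a well in such a schematic.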

      (7) Lines 92-93: 'this transition resulted for the interaction of an illuminated region with depolarized CM and an external wave train' - please consider rephrasing (it is not the region interacting with depolarized CM; and the external wave train could be explained more clearly).

      We rephrased our unclear sentence as follows:

      “This transition resulted from the interaction of depolarized cardiomyocytes in an illuminated region with an external wave train not originating from within the illuminated region.”

      (8) Figure 2 and elsewhere: When mentioning 'frequency', please state frequency values and not cycle lengths. Please also reconsider your distinction between high and low frequencies; 200 ms (5 Hz) is actually the normal heart rate for neonatal rats (300 bpm).

      In the revised version, we have clarified frequency values explicitly and included them alongside period values wherever frequency is mentioned, to avoid any ambiguity. We also emphasize that our use of "high" and "low" frequency is strictly a relative distinction within the context of our data, and not meant to imply a biological interpretation.
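
For reference, the conversion between pacing period, frequency, and beats per minute is simple arithmetic. The sketch below uses hypothetical helper names and applies the conversion to periods that appear in the reviews and responses.

```python
def period_ms_to_hz(period_ms: float) -> float:
    """Convert a pacing period (cycle length) in milliseconds to a frequency in Hz."""
    if period_ms <= 0:
        raise ValueError("period must be positive")
    return 1000.0 / period_ms


def period_ms_to_bpm(period_ms: float) -> float:
    """Convert a pacing period in milliseconds to beats per minute."""
    return 60.0 * period_ms_to_hz(period_ms)


# Periods mentioned in the reviews and responses (ms).
for period in (200, 250, 500, 600, 800):
    hz = period_ms_to_hz(period)
    bpm = period_ms_to_bpm(period)
    print(f"{period} ms -> {hz:.2f} Hz, {bpm:.0f} bpm")
```

A 200 ms period corresponds to 5 Hz, i.e. 300 bpm, matching the figure quoted by the reviewer.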

      (9) Lines 129-131: Why not record optical maps? Voltage dynamics in the transition zone between depolarised and non-depolarised regions might be especially interesting to look at?

      We would like to clarify that optical maps were recorded for every experiment, and all experimental traces of cardiac monolayer activity were derived from these maps. We agree with the reviewer that the voltage dynamics in the transition zone are particularly interesting. However, we selected the data representations that, in our view, best highlight the main mechanisms. When we analysed full voltage profiles, they did not add further insight into this main mechanism. As the other reviewer noted, the manuscript already presents a wide range of regimes, so we decided not to introduce further complexity.

      (10) Lines 156-157: Why was the model not adapted to match the biophysical properties (e.g., kinetics, ion selectivity, light sensitivity) of Cheriff?

      The model was not adapted to the biophysical properties of CheRiff, because this would entail a whole new study involving extensive patch-clamp experiments, fitting, and calibration to model the correct properties of the ion channel. Beyond considerations of time efficiency, incorporating more specific modelling parameters would not change the essence of our findings. While numeric parameter ranges might shift, the core results would remain unchanged. This is a result of our experimental design, in which we applied constant illumination of long duration (6 s or longer), thus making differences in the kinetic properties of an optogenetic tool irrelevant. In addition, we were able to observe qualitatively similar phenomena using many other depolarising optogenetic tools (e.g. ChR2, ReaChR, CatCh and more) in our in vitro experiments. We ended up with CheRiff as our optotool of choice for the practical reasons of good light sensitivity and a non-overlapping spectrum with our fluorescent dyes.

      Therefore, computationally using a more general depolarising ion channel hints at the more general applicability of the observed phenomena, supporting our claim of a universal mechanism (demonstrated experimentally with CheRiff and computationally with ChR2).

      (11) Line 158: 1.7124 mW/mm^2 - While I understand that this is the specific intensity used as input in the model, I am convinced that the model is not as accurate to predict behaviour at this specific intensity (4 digits after the comma), especially given that the model has not been adapted to Cheriff (probably more light sensitive than ChR2). Can this be rephrased?

      We did not aim for quantitative correspondence between the computational model and the biological experiments, but rather for qualitative agreement and mechanistic insight (see line 157). Qualitative agreement is obtained computationally over a whole range of intensities, as demonstrated in the 3D diagram of Fig. 4C. We wanted to demonstrate that at one fixed light intensity (chosen to be 1.7124 mW/mm^2 for the clearest effect), all three states (STA, OSC, TR. OSC) can occur, depending on the number of pulses and their period. The specific intensity used in the computational model is therefore correct and, for reproducibility, we have left it unchanged while clarifying that it refers specifically to the in silico model:

      “Simulating at a fixed constant illumination of 1.7124 𝑚𝑊∕𝑚𝑚<sup>2</sup> and a fixed number of 4 pulses, frequency dependency of collective bi-stability was reproduced in Figure 4A.”

      (12) Lines 160, 165, and elsewhere: 'Once again, Once more' - please delete or rephrase.

      We agree that these linking phrases could be improved and have reformulated the passage as follows:

      “Similar to the experimental observations, only intermediate electrical pacing frequencies (500-𝑚𝑠 period) caused transitions from collective stationary behaviour to collective oscillatory behaviour and ectopic pacemaker activity had periods (710 𝑚𝑠) that were different from the stimulation train period (500 𝑚𝑠). Figure 4B shows the accumulation of pulses necessary to invoke a transition from the collective stationary state to the collective oscillatory state at a fixed stimulation period (600 𝑚𝑠). Also in the in silico simulations, ectopic pacemaker activity had periods (750 𝑚𝑠) that were different from the stimulation train period (600 𝑚𝑠). Also for the transient oscillatory state, the simulations show frequency selectivity (Appendix 2 Figure 4B).”

      (13) Line 171: 'illumination strength': please refer to 'light intensity'.

      We have revised our formulation to now refer specifically to “light intensity”:

      “We previously identified three important parameters influencing such transitions: light intensity, number of pulses, and frequency of pulses.”

      (14) Lines 187-188: 'the illuminated region settles into this period of sending out pulses' - please rephrase, the meaning is not clear.

      We reformulated our sentence to make its content more clear to the reader:

      “For the conditions that resulted in stable oscillations, the green vertical lines in the middle and right slices represent the natural pacemaker frequency in the oscillatory state. After the transition from the stationary towards the oscillatory state, oscillatory pulses emerging from the illuminated region gradually dampen and stabilize at this period, corresponding to the natural pacemaker frequency.”

      (15) Figure 7: A)- please state in the legend which parameter is plotted on the y-axis (it is included in the main text, but should be provided here as well); C) The numbers provided in brackets are confusing. Why is (4) a high pulse number and (3) a low pulse number? Why not just state the number of pulses and add alpha, beta, gamma, and delta for the panels in brackets? I suggest providing the parameters (e.g., 800 ms cycle length, 2 pulses, etc) for all combinations, but not rate them with low, high, etc. (see also comment above).

      We appreciate the reviewer’s comments and have revised the caption for Figure 7, which now reads as follows:

      “Figure 7. Phase plane projections of pulse-dependent collective state transitions. (A) Phase space trajectories (displayed in the Voltage – x<sub>r</sub> plane) of the NRVM computational model show a limit cycle (OSC) that is not lying around a stable fixed point (STA). (B) Parameter space slice showing the relationship between stimulation period and number of pulses for a fixed illumination intensity (1.72 𝑚𝑊∕𝑚𝑚<sup>2</sup>) and size of the illuminated area (67 pixels edge length). Letters correspond to the graphs shown in C. (C) Phase space trajectories for different combinations of stimulus train period and number of pulses (α: 800 ms cycle length + 2 pulses, β: 800 ms cycle length + 4 pulses, γ: 250 ms cycle length + 3 pulses, δ: 250 ms cycle length + 8 pulses). α and δ do not result in a transition from the resting state to ectopic pacemaker activity, as under these circumstances the system moves towards the stationary stable fixed point from outside and inside the stable limit cycle, respectively. However, for β and γ, the stable limit cycle is approached from outside and inside, respectively, and ectopic pacemaker activity is induced.”

      (16) Line 258: 'other dimensions by the electrotonic current' - not clear, please rephrase and explain.

      We realized that our explanation was somewhat convoluted and have therefore changed the text as follows:

      “Rather than producing oscillations, the system returns to the stationary state along dimensions other than those shown in Figure 7C (Voltage and x<sub>r</sub>), as evidenced by the phase space trajectory crossing itself. This return is mediated by the electrotonic current.”

      (17) Line 263: ‘increased too much’ – please rephrase using scientific terminology.

      We rephrased our sentence to:

      “However, this is not a Hopf bifurcation, because in that case the system would not return to the stationary state when the number of pulses exceeds a critical threshold.”

      (18) Line 275: 'stronger diffusion/electrotonic influence from the non-illuminated region' - not sure diffusion is the correct term here. Please explain by taking into account the membrane potential. Please make sure to use proper terminology. The same applies to lines 281-282.

      We appreciate this comment, which prompted us to revisit our text. We realised that some sections could be worded more clearly, and we also identified an error in the legend of Supplementary Figure 7. The corresponding corrections are provided below:

      “However, repolarisation reserve does have an influence, prolonging the transition when it is reduced (Appendix 2 Figure 7). This effect can be observed either by moving further from the boundary of the illuminated region, where the electrotonic influence from the non-illuminated region is weaker, or by introducing ionic changes, such as a reduction in I<sub>Ks</sub> and/or I<sub>to</sub>. For example, because the electrotonic influence is weaker in the center of the illuminated region, the voltage there is not pulled down toward the resting membrane potential as quickly as in cells at the border of the illuminated zone.”

      “To add a multicellular component to our single cell model we introduced a current that replicates the effect of cell coupling and its associated electrotonic influence.”

      “Figure 7. The effect of ionic changes on the termination of pacemaker activity. The mechanism that moves the oscillating illuminated tissue back to the stationary state after high frequency pacing is dependent on the ionic properties of the tissue, i.e. lower repolarisation reserves (20% 𝐼<sub>𝐾𝑠</sub> + 50% 𝐼<sub>𝑡𝑜</sub>) are associated with longer transition times.”

      (19) Line 289: -58 mV (to be corrected), -20 mV, and +50 mV - please justify the selection of parameters chosen. This also applies elsewhere- the selection of parameters seems quite arbitrary, please make sure the selection process is more transparent to the reader.

      Our choice of parameters was guided by the dynamical properties of the illuminated cells as well as by illustrative purposes. The value of –58 mV corresponds to the stimulation threshold of the model. The values of 50 mV and –20 mV match those used for single-cell stimulation (Figure 8C2, right panel), producing excitable and bistable dynamics, respectively. We refer to this point in line 288 with the phrase “building on this result.” To maintain conciseness, we did not elaborate on the underlying reasoning within the manuscript and instead reported only the results.

      We also corrected the previously missed minus sign: -58 mV.

      (20) Figure 8 and corresponding text: I don't understand what stimulation with a voltage means. Is this an externally applied electric field? Or did you inject a current necessary to change the membrane voltage by this value? Please explain.

      Stimulation with a specific voltage is a standard computational technique and can be likened to performing a voltage-clamp experiment on each individual cell. In this approach, the voltage of every cell in the tissue is briefly forced to a defined value.
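
What forcing the voltage of every cell means in a simulation loop can be sketched as follows. This is a toy illustration under stated assumptions, not the NRVM model: the ionic currents are replaced by a hypothetical passive leak toward a resting potential, the tissue is a short 1D cable with no-flux boundaries, and all parameter values are placeholders.

```python
def simulate_voltage_stimulus(
    n_cells: int = 5,
    n_steps: int = 200,
    dt: float = 0.1,          # time step (ms)
    clamp_steps: int = 20,    # number of steps during which the voltage is forced
    v_clamp: float = 50.0,    # stimulus voltage (mV)
    v_rest: float = -70.0,    # resting potential of the toy leak (mV)
    tau: float = 10.0,        # leak time constant (ms), hypothetical
    coupling: float = 0.5,    # nearest-neighbour coupling strength (1/ms), hypothetical
) -> list[list[float]]:
    """Return the voltage of every cell at every time step."""
    v = [v_rest] * n_cells
    trace = []
    for step in range(n_steps):
        if step < clamp_steps:
            # Voltage stimulus: every cell is forced to the same value,
            # analogous to a brief tissue-wide voltage clamp.
            v = [v_clamp] * n_cells
        else:
            # After release, each cell evolves freely (explicit Euler step).
            new_v = []
            for i in range(n_cells):
                left = v[i - 1] if i > 0 else v[i]             # no-flux boundary
                right = v[i + 1] if i < n_cells - 1 else v[i]  # no-flux boundary
                electrotonic = coupling * (left - 2.0 * v[i] + right)
                leak = -(v[i] - v_rest) / tau
                new_v.append(v[i] + dt * (leak + electrotonic))
            v = new_v
        trace.append(list(v))
    return trace


trace = simulate_voltage_stimulus()
print("during stimulus:", trace[0][0], "mV; final:", round(trace[-1][0], 2), "mV")
```

In a full model the leak term would be replaced by the ionic currents, so the behaviour after release would depend on the cell's excitable or bistable dynamics rather than relaxing passively.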

      (21) Figure 8C- panel 2: Traces at -20 mV and + 50 mV are identical. Is this correct? Please explain.

      Yes, that is correct. The cell responds similarly to a voltage stimulus of -20 mV or one of 50 mV, because both values are well above the excitation threshold of a cardiomyocyte.

      (22) Line 344 and elsewhere: 'diffusion current' - This is probably not the correct terminology for gap-junction mediated currents. Please rephrase.

      Here, a diffusion current is the mathematical formulation of a gap-junction-mediated current, so, depending on the background of the reader, either term might be used, each emphasizing different aspects of the results. In a mathematical modelling context, one often refers to a diffusion current because cardiomyocyte monolayers and tissues can be modelled using a reaction-diffusion equation. In the context of fine-grained biological and biophysical details, one uses the term gap-junction-mediated current. Our choice is motivated by the main target audience we have in mind, namely interdisciplinary researchers with a core background in the mathematics/physics/computer science fields.

      However, to not exclude our secondary target audience of biological and medical readers we now clarified the terminology, drawing the parallel between the different fields of study at line 79:

      “These waves resulted from the interplay between the diffusion current (also known in biology/biophysics as the gap junction mediated current) and the bi-stable state that was induced in the illuminated region.”
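
The correspondence between the two terms can be written out explicitly. The following is the generic textbook form for uniformly coupled cells, given as a sketch and not necessarily the exact equations of the manuscript; the symbols g_gap (gap-junction conductance between neighbours), Δx (cell spacing), C_m (cell capacitance), and D (diffusion coefficient) are introduced here for illustration.

```latex
% Discrete (biophysical) view: cell i receives gap-junction current
% from each of its neighbours j
C_m \frac{dV_i}{dt} = -I_{\mathrm{ion},i}
  + \sum_{j \in \mathcal{N}(i)} g_{\mathrm{gap}} \, (V_j - V_i)

% Continuum (mathematical) view for uniform coupling and spacing \Delta x:
% the neighbour sum becomes a Laplacian, i.e. a diffusion term
\frac{\partial V}{\partial t} = D \, \nabla^2 V - \frac{I_{\mathrm{ion}}}{C_m},
\qquad D = \frac{g_{\mathrm{gap}} \, \Delta x^2}{C_m}
```

Read this way, the "diffusion current" of the reaction-diffusion description and the "gap-junction-mediated current" of the cellular description are the same quantity at two levels of resolution.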

      (23) Lines 357-58: 'Such ectopic sources are typically initiated by high frequency pacing' - While this might be true during clinical testing, how would you explain this when not externally imposed? What could be biological high-frequency triggers?

      Biological high-frequency triggers could include sudden increases in heart rates, such as those induced by physical activity or emotional stress. Another possibility is the occurrence of paroxysmal atrial or ventricular fibrillation, which could then give rise to an ectopic source.

      (24) Lines 419-420: 'large ionic cell currents and small repolarising coupling currents'. Are coupling currents actually small in comparison to cellular currents? Can you provide relative numbers (~ratio)?

      Coupling currents are indeed small compared to cellular currents. This can be inferred from the I-V curve shown in Figure 8C1, which dips below 0 and creates bi-stability only because of the small coupling current. If the coupling current were larger, the system would revert to a monostable regime. To make this more concrete, we have now provided the exact value of the coupling current used in Figure 8C1.

      “Otherwise, if the hills and dips of the N-shaped steady-state IV curve were large (Figure 8C-1), they would have similar magnitudes as the large currents of fast ion channels, preventing the subtle interaction between these strong ionic cell currents and the small repolarising coupling currents (-0.103649 ≈ 0.1 pA).”

      (25) Line 426: Please explain how ‘voltage shocks’ were modelled.

      We would like to refer the reviewer to our response to comment (20) regarding how we model voltage shocks. In the context of line 426, a typical voltage shock corresponds to a tissue-wide stimulus of 50 mV. Independent of our computational model, line 426 also cites other publications showing that, in clinical settings, high-voltage shocks are unable to terminate ectopic sustained activity, consistent with our findings.

      (26) Lines 429 ff: 0.2pA/pF would correspond to 20 pA for a small cardiomyocyte of 100 pF, this current should be measurable using patch-clamp recordings.

      In trying to be succinct, we may have caused some confusion. The difference between the dips (-0.07 pA/pF) and hills (≈0.11 pA/pF) is approximately 0.18 pA/pF. For a small cardiomyocyte (100 pF), this corresponds to deviations from zero of roughly ±10 pA. Considering that typical RMS noise levels in whole-cell patch-clamp recordings range from 2 to 10 pA, it is understandable that detecting these peaks and dips in an I-V curve (average current after holding a voltage for an extended period) is difficult. Achieving statistical significance would therefore require patching a large number of cells.
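
      The arithmetic behind these numbers can be sketched as follows. This is only an illustration: the 100 pF capacitance is the reviewer's example value, the current densities and noise range are those quoted above, and the variable names are our own.

```python
# Convert the quoted current densities into absolute currents for one cell
# and compare them with the quoted whole-cell patch-clamp RMS noise range.
CELL_CAPACITANCE_PF = 100.0        # small cardiomyocyte (reviewer's example)
DIP_DENSITY_PA_PER_PF = -0.07      # dip of the steady-state I-V curve
HILL_DENSITY_PA_PER_PF = 0.11      # hill of the steady-state I-V curve
RMS_NOISE_RANGE_PA = (2.0, 10.0)   # typical RMS noise range quoted above

dip_pa = DIP_DENSITY_PA_PER_PF * CELL_CAPACITANCE_PF     # about -7 pA
hill_pa = HILL_DENSITY_PA_PER_PF * CELL_CAPACITANCE_PF   # about +11 pA
span_density = HILL_DENSITY_PA_PER_PF - DIP_DENSITY_PA_PER_PF  # about 0.18 pA/pF

# Ratio of each deviation to the best-case and worst-case noise level.
for label, current in (("dip", dip_pa), ("hill", hill_pa)):
    best = abs(current) / RMS_NOISE_RANGE_PA[0]
    worst = abs(current) / RMS_NOISE_RANGE_PA[1]
    print(f"{label}: {current:+.1f} pA, signal/noise between {worst:.1f} and {best:.1f}")
```

      At the upper end of the noise range, both deviations are comparable to the noise itself, which is why a single recording would not resolve them reliably.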

      Given the already extensive scope of our manuscript in terms of techniques and concepts, we decided not to pursue these additional patch-clamp experiments.

      Reviewer #2 (Recommendations for the authors):

      Given the deluge of conditions to consider, there are several areas of improvement possible in communicating the authors' findings. I have the following suggestions to improve the manuscript.

      (1) Please change "pulse train" straight pink bar OR add stimulation marks (such as "*", or individual pulse icons) to provide better visual clarity that the applied stimuli are "short ON, long OFF" electrical pulses. I had significant initial difficulty understanding what the pulse bars represented in Figures 2, 3, 4A-B, etc. This may be partially because stimuli here could be either light (either continuous or pulsed) or electrical (likely pulsed only). To me, a solid & unbroken line intuitively denotes a continuous stimulation. I understand now that the pink bar represents the entire pulse-train duration, but I think readers would be better served with an improvement to this indicator in some fashion. For instance, the "phases" were much clearer in Figures 7C and 8D because of how colour was used on the Vm(t) traces. (How you implement this is up to you, though!)

      We have addressed the reviewer’s concern and updated the figures by marking each external pulse with a small vertical line (see below).

      (2) Please label the electrical stimulation location (akin to the labelled stimulation marker in circle 2 state in Figure 1A) in at least Figures 2 and 4A, and at most throughout the manuscript. It is unclear which "edge" or "pixel" the pulse-train is originating from, although I've assumed it's the left edge of the 2D tissue (both in vitro and in silico). This would help readers compare the relative timing of dark blue vs. orange optical signal tracings and to understand how the activation wavefront traverses the tissue.

      We indicated the pacing electrode in the optical voltage recordings with a grey asterisk. For the in silico simulations, the electrode was assumed to be far away, and the excitation was modelled as a parallel wave originating from the top boundary, indicated with a grey zone.

      (3) Given the prevalence of computational experiments in this study, I suggest considering making a straightforward video demonstrating basic examples of STA, OSC, and TR.OSC states. I believe that a video visualizing these states would be visually clarifying to and greatly appreciated by readers. Appendix 2 Figure 3 would be the no-motion visualization of the examples I'm thinking of (i.e., a corresponding stitched video could be generated for this). However, this video-generation comment is a suggestion and not a request.

      We have included a video showing all relevant states, which is now part of the Supplementary Material.

      (4) Please fix several typos that I found in the manuscript:

      (4A) Line 279: a comma is needed after i.e. when used in: "peculiar, i.e. a standard". However, this is possibly stylistic (discard suggestion if you are consistent in the manuscript).

      (4B) Line 382: extra period before "(Figure 3C)".

      (4C) Line 501: two periods at end of sentence "scientific purposes.." .

      We would like to thank the reviewer for pointing out these typos. We have corrected them and conducted an additional check throughout the manuscript for minor errors.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Petrovic et al. investigate CCR5 endocytosis via arrestin2, with a particular focus on clathrin and AP2 contributions. The study is thorough and methodologically diverse. The NMR titration data are particularly compelling, clearly demonstrating chemical shift changes at the canonical clathrin-binding site (LIELD), present in both the 2S and 2L arrestin splice variants. 

      To assess the effect of arrestin activation on clathrin binding, the authors compare: truncated arrestin (1-393), full-length arrestin, and 1-393 incubated with CCR5 phosphopeptides. All three bind clathrin comparably, whereas controls show no binding. These findings are consistent with prior crystal structures showing peptide-like binding of the LIELD motif, with disordered flanking regions. The manuscript also evaluates a non-canonical clathrin binding site specific to the 2L splice variant. Though this region has been shown to enhance beta2-adrenergic receptor binding, it appears not to affect CCR5 internalization. 

      Similar analyses applied to AP2 show a different result. AP2 binding is activation-dependent and influenced by the presence and level of phosphorylation of CCR5-derived phosphopeptides. These findings are reinforced by cellular internalization assays. 

      In sum, the results highlight splice-variant-dependent effects and phosphorylation-sensitive arrestin-partner interactions. The data argue against a (rapidly disappearing) one-size-fits-all model for GPCR-arrestin signaling and instead support a nuanced, receptor-specific view, with one example summarized effectively in the mechanistic figure. 

      We thank the referee for this positive assessment of our manuscript. Indeed, by stepping away from the common receptor models for understanding internalization (b2AR and V2R), we revealed the phosphorylation level of the receptor as a key factor in driving the sequestration of the receptor from the plasma membrane. We hope that the proposed mechanistic model will aid further studies to obtain an even more detailed understanding of forces driving receptor internalization.

      Reviewer #2 (Public review): 

      Summary: 

      Based on extensive live cell assays, SEC, and NMR studies of reconstituted complexes, these authors explore the roles of clathrin and the AP2 protein in facilitating clathrin-mediated endocytosis via activated arrestin-2. NMR, SEC, proteolysis, and live cell tracking confirm a strong interaction between AP2 and activated arrestin using a phosphorylated C-terminus of CCR5. At the same time, a weak interaction between clathrin and arrestin-2 is observed, irrespective of activation. 

      These results contrast with previous observations of class A GPCRs and the more direct participation by clathrin. The results are discussed in terms of the importance of short and long phosphorylated bar codes in class A and class B endocytosis. 

      Strengths: 

      The 15N,1H, and 13C, methyl TROSY NMR and assignments represent a monumental amount of work on arrestin-2, clathrin, and AP2. Weak NMR interactions between arrestin-2 and clathrin are observed irrespective of the activation of arrestin. A second interface, proposed by crystallography, was suggested to be a possible crystal artifact. NMR establishes realistic information on the clathrin and AP2 affinities to activated arrestin, with both kD and description of the interfaces. 

      We sincerely thank the referee for this encouraging evaluation of our work and appreciate the recognition of the NMR efforts and insights into the arrestin–clathrin–AP2 interactions.

      Weaknesses: 

      This reviewer has identified only minor weaknesses with the study.

      (1) Arrestin-2 1-418 resonances all but disappear with CCR5pp6 addition. Are they recovered with Ap2Beta2 addition, and is this what is shown in Supplementary Figure 2D? 

      We believe the reviewer is referring to Figure 3 - figure supplement 1. In this figure, panels E and F show that the resonances of arrestin2<sup>1-418</sup> (apo state shown with black outline) disappear upon the addition of CCR5pp6 (arrestin2<sup>1-418</sup>•CCR5pp6 complex spectrum in red). Panels C and D show the resonances of arrestin2<sup>1-418</sup> (apo state shown with black outline), which remain unchanged upon addition of AP2b2<sup>701-937</sup> (orange), indicating no complex formation. We also recorded a spectrum of the arrestin2<sup>1-418</sup>•CCR5pp6 complex upon addition of AP2b2<sup>701-937</sup> (not shown), but the arrestin2 resonances in the arrestin2<sup>1-418</sup>•CCR5pp6 complex were already too broad for further analysis. This had already been explained in the text:

      “In agreement with the AP2b2 NMR observations, no interaction was observed in the arrestin2 methyl and backbone NMR spectra upon addition of AP2b2 in the absence of phosphopeptide (Figure 3-figure supplement 1C, D). However, the significant line broadening of the arrestin2 resonances upon phosphopeptide addition (Figure 3-figure supplement 1E, F) precluded a meaningful assessment of the effect of the AP2b2 addition on arrestin2 in the presence of phosphopeptide”.

      (2) I don't understand how methyl TROSY spectra of arrestin2 with phosphopeptide could look so broadened unless there are sample stability problems. 

      We thank the referee for this comment. We would like to clarify that in general a broadened spectrum beyond what is expected from the rotational correlation time does not necessarily correlate with sample stability problems. It is rather evidence of conformational intermediate exchange on the micro- to millisecond time scale.

      The displayed <sup>1</sup>H-<sup>15</sup>N spectra of apo arrestin2 already suffer from line broadening due to such intrinsic mobility of the protein. These spectra were recorded with acquisition times of 50 ms (<sup>15</sup>N) and 55 ms (<sup>1</sup>H) and resolution-enhanced by a 60°-shifted sine-bell filter for <sup>15</sup>N and a 60°-shifted squared sine-bell filter for <sup>1</sup>H, respectively, which leads to the observed resolution with still reasonable sensitivity. The <sup>1</sup>H-<sup>15</sup>N resonances in Fig. 1b (arrestin2<sup>1-393</sup>) look particularly narrow. However, this region contains a large number of flexible residues. The full spectrum, e.g. Figure 1-figure supplement 2, shows the entire situation with a clear variation of linewidths and intensities. The linewidth variation becomes stronger when omitting the resolution enhancement filters.
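
      For readers unfamiliar with such filters, the following is a minimal sketch of a shifted sine-bell window as it is commonly defined in NMR processing. It assumes that the window ends at 180° and uses a hypothetical number of points; it is not a record of the exact processing parameters used for the spectra.

```python
import math


def shifted_sine_bell(n_points, shift_deg=60.0, power=1):
    """Return sin(phi0 + (pi - phi0) * i / (n - 1)) ** power for i = 0..n-1.

    shift_deg : starting phase of the sine bell in degrees (60 here)
    power     : 1 for a sine bell, 2 for a squared sine bell
    The end phase of 180 degrees is an assumed, commonly used convention.
    """
    phi0 = math.radians(shift_deg)
    step = (math.pi - phi0) / (n_points - 1)
    return [math.sin(phi0 + step * i) ** power for i in range(n_points)]


# Hypothetical FID length of 129 points, chosen only for illustration.
window_15n = shifted_sine_bell(129, shift_deg=60.0, power=1)  # sine bell
window_1h = shifted_sine_bell(129, shift_deg=60.0, power=2)   # squared sine bell
```

      With a 60° shift the window starts at sin(60°) ≈ 0.87, peaks a quarter of the way through the acquisition, and decays to zero at the end. It therefore de-emphasises the earliest points relative to the peak, which narrows the lines at a moderate cost in sensitivity.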

      The addition of the CCR5pp6 phosphopeptide does not change protein stability, which we assessed by measuring the melting temperature of arrestin2<sup>1-418</sup> and the arrestin2<sup>1-418</sup>•CCR5pp6 complex (Tm = 57 °C in both cases). We believe that the explanation for the increased broadening of the arrestin2 resonances is that addition of CCR5pp6, possibly due to the release of the arrestin2 strand β20, amplifies the mentioned intermediate timescale protein dynamics. This results in the disappearance of arrestin2 resonances. 

      We have now included the assessment of arrestin2<sup>1-418</sup> and arrestin2<sup>1-418</sup>•CCR5pp6 stability in the manuscript:

      “The observed line broadening of arrestin2 in the presence of phosphopeptide must be a result of increased protein motions and is not caused by a decrease in protein stability, since the melting temperature of arrestin2 in the absence and presence of phosphopeptide are identical (56.9 ± 0.1 °C)”.

      (3) At one point, the authors added an excess fully phosphorylated CCR5 phosphopeptide (CCR5pp6). Does the phosphopeptide rescue resolution of arrestin2 (NH or methyl) to the point where interaction dynamics with clathrin (CLTC NTD) are now more evident on the arrestin2 surface? 

      Unfortunately, when we titrate arrestin2 with CCR5pp6 (please see Isaikina & Petrovic et al., Mol. Cell, 2023 for more details), the arrestin2 resonances undergo fast-to-intermediate exchange upon binding. In the presence of phosphopeptide excess, very few resonances remain, the majority of which are in the disordered region, including resonances from the clathrin-binding loop. Due to the peak overlap, we could not unambiguously assign arrestin2 resonances in the bound state, which precluded our assessment of the arrestin2-clathrin interaction in the presence of phosphopeptide. We have now made this clearer in the paragraph ‘The arrestin2-clathrin interaction is independent of arrestin2 activation’:

      “Due to significant line broadening and peak overlap of the arrestin2 resonances upon phosphopeptide addition, the influence of arrestin activation on the clathrin interaction could not be detected on either backbone or methyl resonances”.

      (4) Once phosphopeptide activates arrestin-2 and AP2 binds, can phosphopeptide be exchanged off? In this case, would it be possible for the activated arrestin-2 AP2 complex to re-engage a new (phosphorylated) receptor?

      This would be an interesting mechanism. In principle, this should be possible as long as the other (phosphorylated) receptor outcompetes the initial phosphopeptide with higher affinity towards the binding site. However, we do not have experiments to assess this process directly. Therefore, we would rather not speculate further.

      (5) Did the authors ever try SEC measurements of arrestin-2 + AP2beta2+CCR5pp6 with and without PIP2, and with and without clathrin (CLTC NTD)? The question becomes what the active complex is and how PIP2 modulates this cascade of complexation events in class B receptors. 

      We thank the referee for this question. Indeed, we tested whether PIP2 can stabilize the arrestin2•CCR5pp6•AP2 complex by SEC experiments. Unfortunately, the addition of PIP2 increased the formation of arrestin2 dimers and higher oligomers, presumably due to the presence of additional charges. The resolution of SEC experiments was not sufficient to distinguish arrestin2 in oligomeric form or in arrestin2•CCR5pp6•AP2 complex. We now mention this in the text: 

      “We also attempted to stabilize the arrestin2-AP2b2-phosphopeptide complex through the addition of PIP2, which can stabilize arrestin complexes with the receptor (Janetzko et al., 2022). The addition of PIP2 increased the formation of arrestin2 dimers and higher oligomers, presumably due to the presence of additional charges. Unfortunately, the resolution of the SEC experiments was not sufficient to separate the arrestin2 oligomers from complexes with AP2b2”.

      Reviewer #3 (Public review): 

      Summary: 

      Overall, this is a well-done study, and the conclusions are largely supported by the data, which will be of interest to the field. 

      Strengths: 

      (1) The strengths of this study include experiments with solution NMR that can resolve high-resolution interactions of the highly flexible C-terminal tail of arr2 with clathrin and AP2. Although mainly confirmatory in defining the arr2 CBL 376LIELD380 as the clathrin binding site, the use of the NMR is of high interest (Figure 1). The 15N-labeled CLTC-NTD experiment with arr2 titrations reveals a span from 39-108 that mediates an arr2 interaction, which corroborates previous crystal data, but does not reveal a second area in CLTC-NTD that in previous crystal structures was observed to interact with arr2.

      (2) SEC and NMR data suggest that full-length arr2 (1-418) binding with the β2-adaptin subunit of AP2 is enhanced in the presence of CCR5 phospho-peptides (Figure 3). The pp6 peptide shows the highest degree of arr2 activation and β2-adaptin binding, compared to less phosphorylated peptides or not phosphorylated at all. It is interesting that the arr2 interaction with CLTC NTD and pp6 cannot be detected using the SEC approach, further suggesting that clathrin binding is not dependent on arrestin activation. Overall, the data suggest that receptor activation promotes arrestin binding to AP2, not clathrin, suggesting the AP2 interaction is necessary for CCR5 endocytosis. 

      (3) To validate the solid biophysical data, the authors pursue validation experiments in a HeLa cell model by confocal microscopy. This requires transient transfection of tagged receptor (CCR5-Flag) and arr2 (arr2-YFP). CCR5 displays a "class B"-like behavior in that arr2 is rapidly recruited to the receptor at the plasma membrane upon agonist activation, which forms a stable complex that internalizes into endosomes (Figure 4). The data suggest that complex internalization is dependent on AP2 binding, not clathrin (Figure 5). 

      We thank the referee for the careful and encouraging evaluation of our work. We appreciate the recognition of the solidity of our data and the support for our conclusions regarding the distinct roles of AP2 and clathrin in arrestin-mediated receptor internalization.

      Weaknesses:

      The interaction of truncated arr2 (1-393) was not impacted by CCR5 phospho-peptide pp6, suggesting the interaction with clathrin is not dependent on arrestin activation (Figure 2). This raises some questions.

      We thank the referee for raising this concern, as we were also surprised by the discovery that the interaction does not depend on arrestin activation. However, the NMR data clearly show at atomic resolution that arrestin activation does not influence the interaction with clathrin in vitro. Evolutionarily, the arrestin-clathrin interaction appears not to be conserved, as visual arrestin completely lacks a clathrin-binding motif. For that reason, we believe that the weak arrestin-clathrin interaction plays a supportive role during internalization, in contrast to the regulatory interaction with AP2, which requires and quantitatively depends on arrestin2 activation. We have reflected on this in the Discussion:

      “Although the generalization of this mechanism from CCR5 to other arr-class B receptors has to be explored further, it is indirectly corroborated in the visual rhodopsin-arrestin1 system. The arr-class B receptor rhodopsin (Isaikina et al., 2023) also undergoes CME (Moaven et al., 2013) with arrestin1 harboring the conserved AP2 binding motif, but missing the clathrin-binding motif (Figure 1-figure supplement 1A)”.

      Overall, the data are solid, but for added rigor, can these experiments be repeated without tagged receptor and/or arr2? My concern stems from the fact that the stability of the interaction between arr2 and the receptor may be related to the position of the tags.

      We thank the referee for this suggestion, which refers to the cellular experiments; the biophysical experiments were carried out without tags. To eliminate the possibility of tags contributing to receptor-arrestin2 binding in the cellular experiments, we also performed the experiments in the presence of the CCR5 antagonist [5P12]CCL5 (Figure 4). These data show that in the case of inactive CCR5, arrestin2 is not recruited to CCR5, nor does it form internalization complexes, which would be the case if the tags were increasing the receptor-arrestin interaction. Conversely, if the tags were decreasing the interaction, we would not expect such a strong internalization. As indicated below, we have also attempted to perform our cellular experiments using an N-terminally SNAP-tagged CCR5. Unfortunately, this construct did not express in HeLa cells, indicating that SNAP-CCR5 was either toxic or degraded.

      Reviewing Editor Comments: 

      Overall, the reviewers did not suggest much by way of additional experiments. They do suggest several aspects of the manuscript that would benefit from further clarification. 

      Reviewer #1 (Recommendations for the authors): 

      (1) The distinction between arrestin 2S and arrestin 2L as relates to the canonical and non-canonical clathrin binding sites would benefit from clarification, particularly because the second binding site depends on the splice variant. This is something that some readers may not be familiar with (particularly young ones that are hopefully part of the intended readership).

      We thank the referee for this suggestion. We would like to emphasize that in our work, only the long arrestin2 splice variant was used, which contains both binding sites. We have now introduced the splice variants and their relation to the clathrin binding sites in the text. 

      In section ‘Localizing and quantifying the arrestin2-clathrin interaction by NMR spectroscopy’:

      “Clathrin and arrestin interact in their basal state (Goodman et al., 1996), and a structure of a complex between arrestin2 and the clathrin heavy chain N-terminal domain (residues 1-363, named clathrin-N in the following) has been solved by X-ray crystallography (PDB:3GD1) in the absence of an arrestin2-activating phosphopeptide (Kang et al., 2009). This structure (Figure 1-figure supplement 1B) suggests a 2:1 binding model between arrestin2 and clathrin-N. The first interaction (site I) is observed between the <sup>376</sup>LIELD<sup>380</sup> clathrin-binding motif of the arrestin2 CBL and the edge of the first two β-sheet blades of clathrin-N, whereas the second interaction (site II) occurs between arrestin2 residues <sup>334</sup>LLGDLA<sup>339</sup> and the 4th and 5th blade of clathrin-N. The latter arrestin interaction site is not present in the arrestin2 splice variant arrestin2S (for short) where an 8-amino acid insert (residues 334-341) between β-strands 18 and 19 is removed (Kang et al., 2009)”.

      Section ‘The arrestin2-clathrin interaction is independent of arrestin2 activation’

      “Figure 2A (left) shows the intensity changes (full spectra in Figure 2-figure supplement 1A) of the clathrin-N <sup>1</sup>H-<sup>15</sup>N TROSY resonances [assignments transferred from BMRB, ID:25403 (Zhuo et al., 2015)] upon addition of a one-molar equivalent of arrestin2<sup>1-393</sup>. A significant intensity reduction due to line broadening is detected for clathrin-N residues 39-40, 48-50, 62-72, 83-90, 101-106, and 108. These residues form a clearly defined binding region at the edges of blade 1 and blade 2 of clathrin-N (Figure 2A, right), which corresponds to interaction site I in the 3GD1 crystal structure, involving the conserved arrestin2 <sup>376</sup>LIELD<sup>380</sup> motif. However, no significant signal attenuation was observed for clathrin-N residues in blade 4 and blade 5, which would correspond to the crystal interaction site II with arrestin2 residues <sup>334</sup>LLGDLA<sup>339</sup> that are absent in the arrestin2S splice variant. Thus only one arrestin2 binding site in clathrin-N is detected in solution, and site II of the crystal structure may be a result of crystal packing”.

      (2) Acronym density is high throughout. While many are standard in the clathrin literature, this could hinder accessibility for readers with a GPCR or arrestin focus.

      We agree with the referee. The acronyms were hard to avoid. The most non-obvious acronym seems ‘CLTC-NTD’ for the N-terminal domain of the clathrin heavy chain, which uses the non-obvious, but common gene name CLTC for the clathrin heavy chain. We have now replaced ‘CLTC-NTD’ by ‘clathrin-N’ and hope that this makes the text easier to follow.

      (3) The NMR section, while impressive in scope, had writing that was more difficult to follow than the rest. I am curious what percentage of resonance could be assigned. 

      We apologize if the NMR sections of this manuscript were unclear. We attempted to provide a very detailed description of the experimental setup and the spectral results. Being experienced NMR spectroscopists, we have tried very hard to obtain good 3D triple resonance spectra for assignments, but their sensitivity is very low. We believe that this is due to the microsecond dynamics present in the system, which makes the heteronuclear transfers inefficient. So far, we have been able to assign ~30% of the visible arrestin2 resonances. We are still validating the assignments and are working on the analysis and an explanation for this arrestin2 behavior. Therefore, at this point, we want to refrain from stronger statements besides that considerable intrinsic microsecond dynamics is impeding the assignment process.

      (4) It may be worth noting in the main text that truncated arrestins have slightly higher basal activation. I was curious why the truncated arrestin was not chosen for the AP2 NMR titrations. Presumably, an effect would be more likely to be seen.

      While some truncated arrestin2 variants (comprising residues 1-382 or 1-360) indeed show higher basal activity than the full-length arrestin2, they typically completely lack the β20 strand (residues 386-390), which is crucial for the formation of a parallel β-sheet with strand β1, and whose release governs arrestin activation. Our truncated arrestin2 construct comprises residues 1-393 and contains strand β20. In our experience, no significant difference in basal activity, as assessed by Fab30 binding, was detected for arrestin2<sup>1-393</sup> and arrestin2<sup>1-418</sup> (Author response image 1).

      Author response image 1.

      SEC profiles showing arrestin2<sup>1–393</sup> (left) and arrestin2<sup>1-418</sup> (right) activation by the CCR5pp6 phosphopeptide as assayed by Fab30 binding. The active ternary arrestin2-phosphopeptide-Fab30 complex elutes at a lower volume than the inactive apo arrestin2 or the binary arrestin2-phosphopeptide complex. Both arrestin2 constructs are activated by the phosphopeptide to a similar level as assessed by the integrated SEC volumes.

      We want to emphasize that we used full-length arrestin2<sup>1-418</sup> in order to assess the AP2 interaction, as the crystal structure of arrestin2 peptide-AP2 (PDB:2IV8) shows residues past the residue 393 involved in binding.

      PDB codes are currently not accompanied by corresponding literature citations throughout. Please add these. 

      Thank you for this suggestion. In the manuscript, we were careful to provide the full literature citation the first time each PDB code is mentioned. To avoid redundancy and maintain clarity, we would rather not repeat the citations with every subsequent mention of the PDB code.

      (5) The AlphaFold model could benefit from a more transparent discussion of prediction confidence and caveats. The younger crowd (part of the presumed intended readership) tends to be more certain that computational output is 'true'. Figure 1A shows long loops that are likely regions of low confidence in the prediction. Displaying expected disordered regions as transparent or color-coded would help highlight these as flexible rather than stable, especially for that same younger readership. 

      We need to explain that the AlphaFold model of arrestin2 was only used to visualize the clathrin-binding loop and the 344-loop of the arrestin2 C-domain, which are not detected in the available apo bovine (PDB:1G4M) and apo human (PDB:8AS4) arrestin2 crystal structures. However, the AlphaFold model of arrestin2 is basically identical to the crystal structures in the regions that are visible in the crystal structures. We have clarified this now in the caption to Figure 1.

      “The model was used to visualize the clathrin-binding loop and the 344-loop of the arrestin2 C-domain, which are not detected in the available crystal structures of apo arrestin2 [bovine: PDB 1G4M (Han et al., 2001), human: PDB 8AS4 (Isaikina et al., 2023)]. In the other structured regions, the model is virtually identical to the crystal structures”.

      (6) Several figure panels were difficult to interpret due to their small size. Especially microscopy insets, where I needed to simply trust that the authors were accurately describing the data. Enlarging panels is essential, and this may require separating them into different figures.

      We appreciate the referee’s concern regarding figure readability. However, we want to indicate that all our figures are provided as either high-resolution pixel or scalable vector graphics, which allow for zooming in to very fine detail, either electronically or in print. This ensures that microscopy insets and other small panels can be examined clearly when viewed appropriately. We believe the current layout of the figures is necessary to be able to efficiently compare the data between different conditions.

      Many figure panels had text size that was too small. Font inconsistencies across figures also stand out. 

      We apologize for this. We have now enlarged the font size in the figures and made the styles more consistent.

      For Fig. 1F, consider adding individual data points and error bars.

      Thank you for this suggestion. However, Figure 1F already contains the individual data points, with colored circles corresponding to the titration condition. As we did not have replicates of the titration, no error bars are shown. However, the close agreement of the theoretical fit with the individual measured data points stemming from different experiments shows that the statistical errors are indeed very small. We have estimated an overall error for the Kd (as indicated in panel F, right) by error propagation based on an estimate of the chemical shift error as obtained in the NMR software POKY (based on spectral noise). 
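
      As an illustration of the kind of fit involved, the following is a minimal sketch of a 1:1 binding isotherm with ligand depletion, fitted to chemical shift changes by a simple grid search. It is not the analysis performed in POKY. The concentrations, the Kd, and the maximal shift change are hypothetical values chosen only to generate a synthetic, noise-free test data set.

```python
import math


def fraction_bound(protein_um, ligand_um, kd_um):
    """Fraction of protein bound in a 1:1 complex, accounting for ligand depletion."""
    s = protein_um + ligand_um + kd_um
    return (s - math.sqrt(s * s - 4.0 * protein_um * ligand_um)) / (2.0 * protein_um)


def fit_kd(protein_um, ligand_um, observed_ppm, kd_grid_um):
    """Grid search over Kd; the maximal shift is solved in closed form per Kd."""
    best = None
    for kd in kd_grid_um:
        model = [fraction_bound(p, l, kd) for p, l in zip(protein_um, ligand_um)]
        # Least-squares optimal scale factor (maximal shift change) for this Kd.
        dmax = sum(m * o for m, o in zip(model, observed_ppm)) / sum(m * m for m in model)
        sse = sum((o - dmax * m) ** 2 for m, o in zip(model, observed_ppm))
        if best is None or sse < best[0]:
            best = (sse, kd, dmax)
    return best[1], best[2]


# Synthetic, noise-free titration (all values hypothetical).
protein = [100.0] * 6                             # µM, constant protein
ligand = [0.0, 50.0, 100.0, 200.0, 400.0, 800.0]  # µM, titrated partner
true_kd, true_dmax = 300.0, 0.1                   # µM, ppm
shifts = [true_dmax * fraction_bound(p, l, true_kd) for p, l in zip(protein, ligand)]

kd_fit, dmax_fit = fit_kd(protein, ligand, shifts, range(10, 1001, 10))
```

      With real data, the uncertainty of the fitted Kd could then be estimated by repeating the fit after perturbing the shifts within their estimated error, which is one simple way to propagate a chemical shift error into the Kd.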

      Reviewer #2 (Recommendations for the authors):

      (1) I don't observe two overlapping spectra of Arrestin2 (1-393) +/- CLTC NTD in Supplementary Figure 1.

      As explained above, all the spectra are shown as scalable vector graphics. The overlapping spectra are visible when zoomed in.

      (2) I'd be tempted to move the discussion of class A and class B GPCRs and their presumed differences to the intro and then motivate the paper with specific questions.

      We appreciate the referee’s suggestion and had a similar idea previously. However, as we do not have data on other class A or class B receptors, we would rather not motivate the entire manuscript with this question.

      Reviewer #3 (Recommendations for the authors): 

      (1) What happens with full-length arr2 (1-418) when the phospho-peptide pp6 is added to the reaction? It's unclear to me that 1-418 would behave the same as 1-393 because the arr2 tail of 1-393 is likely sufficiently mobile to accommodate binding to CLTC NTD. I suggest attempting this experiment for added rigor.

      We believe that there is a misunderstanding. The 1-393 and 1-418 constructs differ by the disordered C-terminal tail, which is not involved in the clathrin interaction with the arrestin2 376-380 (LIELD) residues. Accordingly, both 1-393 and 1-418 constructs show almost identical interactions with clathrin (Figure 2A and 2C). Moreover, the phospho-activated arrestin2<sup>1-393</sup> (Figure 2B) interacts identically with clathrin as inactive arrestin2<sup>1-393</sup> and inactive arrestin2<sup>1-418</sup>. We believe that this comparison is sufficient for the conclusion that arrestin activation does not play a role in arrestin-clathrin binding.

      (2) If the tags were moved to the N-terminus of the receptor and/or arr2, I wonder if the complex is as stable (Figure 4)? 

      We thank the referee for their suggestion. We have indeed attempted to perform our experiments using an N-terminally SNAP-tagged CCR5. Unfortunately, this construct did not express in the HeLa cells, indicating that SNAP-CCR5 was either toxic or degraded. As the lab is closing due to the retirement of the PI, we are not able to repeat these experiments with other tag positions. We also refer to our answer above that the experiments with the antagonist [5P12]CCL5 provide a degree of control.

      (3) A biochemical assay to measure receptor internalization, in addition to the cell biological approach (Figure 5), would add additional rigor to the study and conclusions.

      We tried to measure internalization using a biochemical approach. We tried to pull-down CCR5 from HeLa cells and assess arrestin binding. Unfortunately, even using different buffer conditions, we found that CCR5 was aggregating once solubilized from membranes, preventing us from doing this analysis. We had a similar problem when we exogenously expressed CCR5 in insect cells for purification purposes. We have long experience with CCR5, and this receptor is very aggregation-prone due to extended charged surfaces, which interact with the chemokines.

      As an alternative, and in support of the cellular immunofluorescence assays, we also attempted to obtain internalization data via FACS using a CCR5 surface antibody (CD195 Monoclonal Antibody eBioT21/8). CD195 recognizes the N-terminus of the receptor. Unfortunately, the presence of the chemokine ligand (~ 8 kDa) interferes with antibody binding, precluding the quantitative biochemical assessment of the arrestin2 mutants on the CCR5 internalization.

      For these reasons, we were particularly careful to quantify CCR5 internalization from the immunofluorescence microscopy data using colocalization coefficients as well as puncta counting (Figures 4 and 5).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewing Editor Comments:

      (A) Revisions related to the first part, regarding data mining and curation:

      (1) One question that arises with the part of the manuscript that discusses the identification and classification of ion channels is whether these will be made available to the wider public. For the 419 human sequences, making a small database to share this result so that these sequences can be easily searched and downloaded would be desirable. There are a variety of acceptable formats for this: GitHub/figshare/zenodo/university website that allows a wider community to access their hard work. Providing such a resource would greatly expand the impact of this paper. The same question can be asked of the 48,000+ ion channels from diverse organisms.

      We thank the reviewer for providing this important feedback. While the long term plan is to provide access to these sequences and annotations through a knowledge base resource like Pharos, we agree with the comments that it would be beneficial to have these sequences made available with the manuscript as well. We have compiled 3 fasta files containing the following: 1) Full length sequences for the curated 419 ion channel sequences. 2) Pore containing domain sequences for the 343 pore domain containing human ion channel sequences. 3) All the identified orthologs for the human ion channels.

      For each sequence in these files, we have extended the ID line to include the most pertinent annotation information to make it readily available. For example, the ID line >sp|P48995|TRPC1_HUMAN|TRP:VGIC--TRP-TRPC|pore-forming|dom:387-637 provides the classification, unit, and domain bounds for human TRPC1 in the FASTA file itself.
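
      To illustrate how such an extended ID line could be consumed downstream, the following is a minimal sketch in Python. It assumes that every record carries exactly six '|'-separated fields laid out as in the TRPC1 example above; the class name, field names, and function name are hypothetical, and records without a pore domain may well use a different layout in the actual files.

```python
from dataclasses import dataclass


@dataclass
class ChannelHeader:
    """Annotation fields carried by one extended FASTA ID line (names are illustrative)."""
    database: str
    accession: str
    entry_name: str
    classification: str
    unit: str
    domain_start: int
    domain_end: int


def parse_header(line: str) -> ChannelHeader:
    """Split an extended ID line into its annotation fields.

    Assumes the six-field layout seen in the TRPC1 example; raises
    ValueError for anything that does not match that layout.
    """
    fields = line.lstrip(">").strip().split("|")
    if len(fields) != 6:
        raise ValueError(f"expected 6 '|'-separated fields, got {len(fields)}")
    database, accession, entry_name, classification, unit, domain = fields
    if not domain.startswith("dom:"):
        raise ValueError(f"unexpected domain field: {domain!r}")
    # The domain bounds are written as 'dom:START-END'.
    start, end = domain[len("dom:"):].split("-")
    return ChannelHeader(
        database, accession, entry_name, classification, unit, int(start), int(end)
    )
```

      Calling parse_header on the TRPC1 line yields the accession P48995, the unit "pore-forming", and the domain bounds 387 and 637 as integers, so the pore-domain segment can be sliced from the full-length sequence without consulting a separate annotation table.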

      These files have been uploaded to Zenodo and are available for download with doi 10.5281/zenodo.16232527. We have included this in the Data Availability statement of the manuscript as well.

      (2) Regarding the 48,000+ sequences, what checks have been done to confirm that they all represent bona fide, full-length ion channel sequences? Uniprot contains a good deal of unreviewed sequences, especially from single-celled organisms. The process by which true orthologues were identified and extraneous hits discarded should be discussed in more detail, and all inclusion criteria should be described and justified, clearly illustrating that the risk of gene duplicates and fragments in this final set of ion channel orthologues has been avoided. Related to this, does this analysis include or exclude isoforms?

      We thank the reviewer for raising this important point. Our selection of curated proteomes and the KinOrtho pipeline for orthology detection return, to an extent, reliable orthologous sequence sets. In brief, our database sequences are retrieved from full proteomes that only include proteins that are part of an official proteome release. Thus, they are mapped from a reference genome to ensure species-specific relevance and avoid redundancy. The >1500 proteomes in this analysis were selected based on their wider use in other orthology detection pipelines like OMA and InParanoid. Our orthology detection pipeline, KinOrtho, performs a full-length and a domain-based orthology detection, which ensures that the orthologous relationships are defined based on the pore-domain sequence similarity.

      However, we agree with the reviewer that this might leave room for extraneous, fragmentary, or misannotated sequences to be included in our results. Taking this into careful consideration, we have expanded our sequence validation pipeline to include additional checks, such as checking the UniProt entry type and protein existence evidence, and sequence-level checks such as evaluating the compositional bias, non-standard codons, and sequence lengths. These validation steps are now described in detail in the Methods section under orthology analysis (lines 768-808). All the originally listed orthologous sequences passed this validation pipeline, which provides additional confidence that they are bona fide full-length ion channel sequences.
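
      As a rough illustration of what sequence-level checks of this kind can look like, the sketch below flags sequences that are very short, contain characters outside the 20 standard amino acids, or are dominated by a single residue. It is not the validation pipeline described in the Methods: the function name, the thresholds (min_len, max_fraction), and the reading of "non-standard" as "outside the 20 standard one-letter codes" are all assumptions made for the example.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def sequence_flags(seq: str, min_len: int = 50, max_fraction: float = 0.5) -> list:
    """Return a list of warning flags for one protein sequence.

    min_len and max_fraction are illustrative thresholds, not values
    taken from the manuscript.
    """
    seq = seq.upper()
    flags = []
    # Length check: very short entries are likely fragments.
    if len(seq) < min_len:
        flags.append("too_short")
    # Alphabet check: anything outside the 20 standard residues (e.g. X, B, Z, U).
    if set(seq) - STANDARD_AA:
        flags.append("non_standard_residue")
    # Crude compositional-bias check: one residue dominating the sequence.
    if seq:
        most_common_count = Counter(seq).most_common(1)[0][1]
        if most_common_count / len(seq) > max_fraction:
            flags.append("compositional_bias")
    return flags
```

      A sequence that returns an empty list passes all three checks; any non-empty result can be inspected manually before the sequence is kept or discarded.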

      We have also expanded this section (lines 758 – 766) to provide more details of the KinOrtho pipeline for orthology detection, which is a previously published method used for orthology detection in kinases by our lab.

      Finally, our orthology analysis excludes isoforms and only spans the primary canonical sequences that are part of the UniProt Proteomes annotated sequence set. The isoforms that are generally available in UniProt Proteomes in a separate file named *_additional.fasta were not included in this analysis.

      (3) The decision to show the families of ion channels in Figure 1 as pie charts within a UMAP embedding is intriguing but somewhat non-intuitive and difficult to understand. Illustrating these results with a standard tree-like visualization of the relationship of these channels to each other would be preferred.

      We appreciate the feedback provided by the reviewer, and understand that a standard tree-like visualization would be much easier to interpret and familiar than a bubble chart based on UMAP embeddings. However, we opted to use the bubble chart for the following reasons:

      Low sequence similarity: the 419 human ICs share very minimal sequence similarity, falling in the twilight zone or lower (Doolittle, 1992; PMID:1339026). Thus, traditional multiple sequence alignment and phylogenetic reconstruction methods perform very poorly and generate unreliable or even misleading results. To explore the practicality of this option, we performed a multiple sequence alignment of just 3 of the possibly related IC families suggested by reviewer 2 (CALHM, Pannexins, and Connexins) using the state-of-the-art structure-based sequence alignment method Foldmason (doi: https://doi.org/10.1101/2024.08.01.606130). Even then, the sequence alignment and the resulting tree for just these 3 families were poor and unreliable, as illustrated in the attached Author response image 2.

      Protein embeddings based clustering: Novel LLM based approaches such as the protein language model embeddings offer ways to overcome these limitations by capturing sequence, structure, function and evolutionary properties in a high-dimensional space. Thus, we employed this model using DEDAL followed by UMAP for dimensionality reduction, which preserves biologically meaningful local and global relationships.

      Abstraction at family level: In Figure 1, we aggregate individual channels into family bubbles with their positions representing the average UMAP coordinates of their members. This offers a balance between an intuitive view of how IC families are distributed in the embedding space and reflects potential functional and evolutionary proximities, while not being impeded by individual IC relationships across families.
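
      The family-level aggregation described above amounts to averaging the two-dimensional coordinates of each family's members. A minimal sketch of that step, assuming the per-channel UMAP coordinates are already available as (family, x, y) tuples, is given below; the function name and the input layout are hypothetical and are not taken from the manuscript's code.

```python
from collections import defaultdict


def family_centroids(points):
    """Average 2D embedding coordinates per family.

    points: iterable of (family, x, y) tuples, one per channel.
    Returns {family: (mean_x, mean_y, n_members)}; n_members could be
    used to scale the size of each family bubble.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for family, x, y in points:
        acc = sums[family]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {fam: (sx / n, sy / n, n) for fam, (sx, sy, n) in sums.items()}
```

      Because each bubble sits at the mean of its members, a family whose channels are spread widely in the embedding will still be drawn as a single point, which is the trade-off between readability and per-channel detail noted above.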

      We have revised the figure legend (lines 1221 – 1234) with additional description of the visualization and the process used to generate it, and the manuscript text (lines 248-270) provides the rationale behind the selection of this method.

      (4) A strength of this paper is the visualization of 'dark' ion channels. However, throughout the paper, this could be emphasized more as the key advantage of this approach and how this or similar approaches could be used for other families of proteins. Specifically, in the initial statement describing 'light' vs 'dark channels', the importance of this distinction and the historical preference in science to study that which has already been studied can be discussed more, even including references to other studies that take this kind of approach. An example of a relevant reference here is to the Structural Genomics Consortium and its goals to achieve structures of proteins for which functions may not be well-characterized. Clarifying these motivations throughout the entire paper would strengthen it considerably.

      We thank the reviewer for this constructive comment and agree that highlighting the strength of visualizing “dark” channels and prioritizing them for future studies would strengthen the paper. As suggested, we have revised the text throughout the paper (lines 84-89, 176-180) to contextualize and emphasize this distinction. We have also added a reference for the Structural Genomics Consortium, which, along with resources like IDG, has provided significant resources for prioritizing understudied proteins.

      (5) Since the authors have generated the UMAP visualization of the channome, it would be interesting to understand how the human vs orthologue gene sets compare in this space.

      We appreciate the reviewer’s input. It is an interesting idea to explore the UMAP embedding space for the human ICs along with their orthologs. The large number of orthologous sequences (>37,000) would certainly impose a computational challenge to generate embeddings-based pairwise alignments across all of them. Downstream dimensionality reduction from such a large set and the subsequent visualization would also suffer from accuracy and interpretability concerns. However, to follow up on the reviewer’s comments, we selected orthologous sequences from a subset of 12 model organisms spanning all taxa (such as mouse, zebrafish, fruit fly, C. elegans, A. thaliana, S. cerevisiae, E. coli, etc.). This increased the number of sequences for analysis from 343 to 1094, which is still manageable for UMAP. Using the exact same method, we generated the UMAP embeddings plot for this set as shown below.

      Author response image 1.

      UMAP embeddings of the human ICs alongside orthologs from 12 model organisms

      As shown above, we observed that each orthologous set forms tight, well-defined clusters, preserving local relationships among closely related sequences. For example, a large number of VGICs cluster more closely together compared to Supplementary Figure 1 (with only the human ICs). However, families that were previously distant from others now appear to be even more scattered or pushed further away, indicating a loss of global structure. This pattern suggests that while local distances are well preserved, the global topology of the embedding space could be compromised. Moreover, we find that the placement of ICs with respect to other families is highly sensitive to the parameter choices (e.g., n_neighbors and min_dist), an issue which we did not encounter when using only the human IC sequences. The inclusion of a large number of orthologous sequences that are highly similar to a single human IC but dissimilar to others skews the embedding space, emphasizing local structure at the expense of global relationships.

      Since UMAP and similar dimensionality reduction methods prioritize local over global structure, the resulting embeddings accurately reflect strong ortholog clustering but obscure broader interfamily relationships. Consequently, interpreting the spatial arrangement of human IC families with respect to one another becomes unreliable. We have made this plot available as part of this response, and anyone interested can access this in the response document.   

      (6) Figure 1 should say more clearly that this is an analysis of the human gene set and include more of the information in the text: 419 human ion channel sequences, 75 sequences previously unidentified, 4 major groups and 55 families, 62 outliers, etc. Clearer visualizations of these categories and numbers within the UMAP (and newly included tree) visualization would help guide the reader to better understand these results. Specifically, which are the 75 previously unidentified sequences?

      We thank the reviewer for the comments. To address this, we have revised Figure 1 and added more information, including a clear header that states that these are only human IC sets, numbers showing the total number of ICs, and the number of ICs in each group. We have further included new Supplementary Figure 2 and Supplementary Table 2, which show the overlap of IC sequences across the different resources. Supplementary Figure 2 is an upset plot that provides a snapshot of the overlap between curated human ICs in this study compared to KEGG, GtoP, and Pharos. Supplementary Table 2 provides more details on this overlap by listing, for each human IC, whether they are curated as an IC in the 3 IC annotation resources. We believe these additions should provide all the information, including the unidentified sequences we are adding to this resource.

      (7) Overall, the manuscript needs to provide a clearer description of the need for a better-curated sequence database of ion channels, as well as how existing resources fall short.

      We thank the reviewer for pointing out this important gap in the description. As suggested, we have revised the text thoroughly in the Introduction section to address this comment. Specifically, we have added sections to describe existing resources at sequence and structure levels that currently provide details and/or classification of human ion channels. Then, we highlight the facts that these resources are missing some characterized pore-containing ICs, do not include any information on auxiliary channels, and lack a holistic evolutionary perspective, which raises the need for a better-curated database of ion channels. Please refer to lines 57-63, 73-79, and 95 – 119 for these changes and additions.

      (8) Some of the analysis pipeline is unclear. Specifically, the RAG analysis seems critical, but it is unclear how this works - is it on top of the GPT framework and recursively inquires about the answer to prompts? Some example prompts would be useful to understand this.

      We thank the reviewer for highlighting this gap in explanation. We understand that the details provided in the Methods and Supplementary Figure 1 may not have sufficiently explained the pipeline, and are missing some important details. The RAG pipeline leverages vector-based retrieval integrated with OpenAI’s GPT-4o model to systematically search literature and generate evidence-based answers. The process is as follows:

      Literature sources (PubMed articles) relevant to the annotated ion channels were converted into vector representations stored in a Qdrant database.

      Queries constructed from the annotated IC dataset were submitted to the vector database, retrieving contextually relevant literature segments.

      Retrieved contexts served as inputs to the GPT-4o model, which produced structured JSON-formatted responses containing direct evidence regarding ion selectivity and gating mechanisms, along with associated confidence scores.

      To clarify this further, we have rewritten the relevant subsection in lines 649 - 718. Now, this section provides a detailed description of the RAG pipeline. Also, we have improved Supplementary Figure 1 to provide a clearer description of the pipeline. We have also provided an example prompt template to illustrate the query. These additions clarify how the pipeline functions and demonstrate its practical utility for IC annotation.
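
      The retrieve-then-generate flow described in the steps above can be sketched with a toy, in-memory stand-in. The example below does not call Qdrant or GPT-4o: the bag-of-words "embedding", the cosine ranking, and the prompt wording are all illustrative assumptions, and the prompt is not the template provided in the manuscript. It only shows how retrieved passages are ranked against a query and then packed into the context of a single prompt.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline would use a learned model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query (stand-in for a vector DB search)."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(channel, passages):
    """Assemble a hypothetical prompt asking for a structured, evidence-based answer."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Using only the context below, report the ion selectivity and gating "
        f"mechanism of {channel} as JSON with a confidence score.\n"
        f"Context:\n{context}"
    )
```

      In a full pipeline the string returned by build_prompt would be sent to the language model, and the JSON in its reply would be parsed and stored alongside the channel annotation.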

      (9) The existence of 76 auxiliary non-pore containing 'ion channel' genes in this analysis is a little confusing, as it seems a part of the pipeline is looking for pore-lining residues. Furthermore, how many of these are picked up in the larger orthologues search? Are these harder to perform checks on to ensure that they are indeed ion channel genes? A further discussion of the choice to include these auxiliary sequences would be relevant. This could just be further discussion of the literature that has decided to do this in the past.

      We thank the reviewer for this comment, and agree that further clarification of our selection and definition of auxiliary IC sequences would be helpful. As the reviewer has pointed out, one of the annotation pipeline steps is indeed looking for the pore-lining residues. Any sequences that do not have a pore-containing domain are then considered to be auxiliary, and we search for additional evidence of their binding with one of the annotated pore-containing ICs. If such evidence is not found in the literature, we remove them from our curated IC list. 

      In response to the above comment, we have revised the manuscript text to provide these details. In the Introduction section, we have added references to previous literature that have described auxiliary ICs and also pointed out that the existing ion channel resources do not account for such auxiliary channels (lines 73-79, 107-108,148-149). We have also expanded the Methods section to describe the selection and definition of auxiliary channels (lines 640-646).

      With regards to the orthology analysis, since auxiliary channels do not have a pore domain, and our orthology pipeline requires a pore domain similarity search and hit, we did not include them in this part of the analysis. We have clarified the text in the Results section to ensure this is communicated properly throughout the manuscript (lines 212-215, 260-263). 

      (10) Why are only evolutionary relationships between rat, mouse, and human shown in Figure 3A? These species are all close on the evolutionary timeline.

      We thank the reviewer for this comment. Figure 3A currently provides a high-level evolutionary relationship across the 6 human CALHM members as context for the pattern-based Bayesian analysis. However, since this analysis is based on a wider set of orthologs that span taxa, we agree that a larger tree that includes more orthologs is warranted.

      We have now revised Figure 3A to include an expanded tree that includes 83 orthologs from all 6 human CALHM members spanning 14 organisms from different taxa, including mammals, fishes, birds, nematodes, and cnidarians. The overall structure of the tree is still consistent with 2 major clades as before, with CALHM 1 and 3 in the first clade and CALHM 2, 4, 5, and 6 in the second clade, with good branch support.

      (B) Revisions related to the second part, regarding the analysis of CALHM channel mutations:

      (1) It would strengthen the manuscript if it included additional discussion and references to show that previous methods to analyze conserved residues in CALHM were significantly lacking. What results would previous methods give, and why was this not enough? Were there just not enough identified CALHM orthologues to give strong signals in conservation analysis? Also, the amino acid conservation between CLHM-1 and CALHM1 is extremely low. Thus, there are other CALHM orthologs that give strong signals in conservation analysis. There are ~6 papers that perform in-depth analysis of the role of conserved residues in the gating of CALHM channels (human and C. elegans) that were not cited (Ma et al, Am J Physiol Cell Physiol, 2025; Syrjanen et al, Nat Commun, 2023; Danielli et al, EMBO J, 2023; Kwon et al, Mol Cells, 2021; Tanis et al, Am J Physiol Cell Physiol, 2017; Tanis et al, J Neurosci, 2013; Ma et al, PNAS, 2013) - these data need to be discussed in the context of the present work.

      We thank the reviewer for the comment and agree that these are excellent studies that have advanced understanding of conserved residues in CALHM gating. While their analyses compared a limited set of sequences, focusing on residues conserved in specific CALHM homologs or species like C. elegans, our analysis encompasses thousands of sequences across the entire CALHM family, allowing us to identify residues conserved across all family members over evolution. We also coupled this sequence analysis with hypotheses derived from our published structural studies (Choi et al., Nature, 2019), which highlighted the NTH/S1 region as a critical element in channel gating. Based on this, we focused on evolutionarily conserved residues in the S1–S2 linker and at the interface of S1 with the rest of the TMD, reasoning that if S1 movement is essential for gating, these two structural elements (acting as a hinge and stabilizing interface, respectively) would be key determinants of the conformational dynamics of S1. These regions have been largely overlooked in previous studies. As a result, the residues highlighted in our study do not overlap with those previously reported but instead provide complementary insights into gating mechanisms in this unique channel family. Together, our study and the published literature suggest that many regions and residues in CALHM proteins are critical for gating: while some are conserved across the entire family evolutionarily, others appear conserved only within certain species or subfamilies.

      To address the reviewer’s comment, and to highlight the points mentioned above, we have added a brief discussion of these studies and the relevant citations in the revised manuscript (lines 378– 385, 563–576).

      (2) Whereas the current-voltage relations for WT channels are clearly displayed, the data that is shown for the mutants does not allow for determining if their gating properties are indeed different than WT.

      First, the current amplitudes for the mutants were quantified at just one voltage, which makes it impossible to determine if their voltage-dependence was different than WT, which would be a strong indicator for an effect in gating. Current-voltage relations as done for the WT channels should be included for at least some key mutations, which should include additional relevant controls like the use of Gd3+ as an inhibitor to rule out the contribution of some endogenous currents.

      We thank the reviewer for this comment. To address this, we performed additional experiments using a multi-step pulse protocol to obtain current-voltage relations for WT CALHM1, CALHM1(I109W), WT CALHM6, and CALHM6(W113A). Our initial two-step protocol (−80 mV and +120 mV) covers both the physiological voltage range and the extended range commonly used in biophysical characterization of ion channels. Most mutants did not exhibit channel activation even within this broad range. We therefore focused on the three mutants that did show substantial activation to perform full I–V analysis as suggested. In all groups, currents activated at 37 °C were significantly inhibited by Gd<sup>3+</sup>, consistent with published reports (Ma et al., AJP 2025; Danielli et al., EMBO J 2023; Syrjänen et al., Nat Commun 2023). Notably, for CALHM6(Y51A), while this mutation did not significantly alter current amplitudes at positive membrane potentials, it markedly reduced currents at negative potentials, rendering the channel outwardly rectifying and altering its voltage dependence. These new data are incorporated into Figure 5 (panels A–O) and discussed in the manuscript. Figure 5 now also shows current amplitudes at both +120 mV and −80 mV in 0 mM Ca<sup>2+</sup> at 37 °C to facilitate direct comparison between WT and mutants. The previous data at 5 mM Ca<sup>2+</sup> and 0 mM Ca<sup>2+</sup> at 22 °C have been moved to Supplementary Figure 5 as requested.

      Second, it is unclear whether the three experimental conditions (5 mM Ca<sup>2+</sup>, and 0 Ca<sup>2+</sup>, at 22 and 37C) were measured in the same cell in each experiment, or if they represent different experiments. This should be clarified. If measurements at each condition were done in the same experiment, direct comparison between the three conditions within each individual experiment could further help identify mutations with altered gating.

      We thank the reviewer for pointing this out and apologize for the confusion. All three conditions (5 mM Ca<sup>2+</sup> at 22 °C, 0 mM Ca<sup>2+</sup> at 22 °C, and 0 mM Ca<sup>2+</sup> at 37 °C) were sequentially measured in the same cell within each experiment. The currents were then averaged across cells and plotted for each group.

      Third, in line 334, the authors state that "expression levels of wild-type proteins and mutants are comparable." However, Western blots showing CALHM protein abundance (Supplementary Fig. 3) are not of acceptable quality; in the top blot, WT CALHM1 appears too dim, representative blots were not shown for all mutants, and individual data points should be included on the group data quantitation of the blots, together with a statistical test comparing mutants with the WT control.

      We thank the reviewer for the comment and agree that representative blots were not shown for all mutants. Supplementary Figure 4 (previously Supplementary Figure 3) has been updated to include representative blots for all mutants, individual data points in the quantification, and statistical tests comparing each mutant to the WT control.

      A more serious concern is that the total protein quantitation is not very informative about the functional impact of mutations in ion channels, because mutations can severely impact channel localization in the plasma membrane without reducing the total protein that is translated. In mammalian cells, CALHM6 is localized to intracellular compartments and only translocates to the plasma membrane in response to an activating stimulus (Danielli et al, EMBO J, 2023). Thus, if CALHM6 is only intracellular, the protein amount would not change, but the measured current would. Abundant intracellular CALHM1 has also been observed in mammalian cells transfected with this protein (Dreses-Werringloer et al., Cell, 2008). Quantitation of surface-biotinylated channels would provide information on whether there are differences between the constructs in relation to surface expression rather than gating. An alternative approach to biotinylation would be to express GFP-tagged constructs in Xenopus oocytes and look for surface expression. This is what has been done in previous CALHM channel studies.

      Without evidence for the absence of defects in localization or clear alterations in gating properties, it is not possible to conclude whether mutant channels have altered activity. Does the analysis of sequences provide any testable hypotheses about substitutions with different side chains at the same position in the sequence?

      We thank the reviewer for this very important comment. We agree that total protein levels alone do not distinguish between intracellular retention and proper trafficking to the plasma membrane. To address this, we performed surface biotinylation assays for all WT and mutant CALHM1 and CALHM6 constructs to assess their plasma membrane localization. The results show that mutants have either comparable or substantially higher surface expression levels than WT, consistent with the Western blot data. Together, these findings support our original interpretation that the observed differences in electrophysiological currents are not due to trafficking defects but reflect functional effects. These new data are presented in Supplementary Figure 5.

      (3) Line 303 - 13 aligned amino acids were conserved across all CALHM homologs - are these also aligned in related connexin and pannexin families? It is likely that cysteines and proline in TM2 are since CALHM channels overall share a lot of similarities with connexins and pannexins (Siebert et al, JBC, 2013). As in line 207, it would be expected that pannexins, connexins, and CALHM channel families would group together. Related to this, see Line 406 - in connexins, there is also a proline kink in TM2 that may play a role in mediating conformational changes between channel states (Ri et al, Biophysical Journal, 1999). This should be discussed.

      We thank the reviewer for the suggestion. We attempted a structure based sequence alignment of representative structures from all 3 families (CALHM, connexins and pannexins), but the resulting alignments are very poor and have a lot of gapped regions, making it very difficult to comment on the similarities mentioned in this comment. This is actually expected, as although CALHM, connexins, and pannexins are all considered “large-pore” channels, the TMD arrangement and conformation of CALHM are distinct from those of connexins and pannexins. Below, we have included a snapshot of the alignment at the conserved cysteine regions of the CALHM homologs, along with the resulting tree, which has very low support values and has difficulty placing the connexins properly, making it difficult to interpret.

      Author response image 2.

      Structure based sequence alignment and phylogenetic analysis of available crystal structures of members from the CALHM, Pannexin and Connexin families. Top: The resulting sequence alignment is very sparse and does not show conservation of residues in the TM regions. The CPC motif with conserved cysteines in CALHM family is shown. Bottom: Phylogenetic tree based on the alignment has low support values making it difficult to interpret.

      (4) Line 36 - This work does not have experimental evidence to show that the selected evolutionarily conserved residues alter gating functions.

      Our electrophysiology data demonstrate that the selected evolutionarily conserved residues have a major impact on CALHM1 and CALHM6 gating. As shown in Figure 5, mutations at these residues produce two distinct phenotypes: (1) nonconductive channels, and (2) altered voltage dependence, resulting in outward rectification. Importantly, these functional changes occur despite normal total expression and surface trafficking, as confirmed by Western blotting and surface biotinylation (Supplementary Figure 4). These findings indicate that the affected residues are critical for the conformational dynamics underlying channel gating rather than for protein expression or localization.

      (5) Line 296-297 - This could also be put in the context of what we already know about CALHM gating. While all cryo-EM structures of CALHM channels are in the open state, we still do understand some things about the gating mechanism (Tanis et al, Am J Physiol Cell Physiol, 2017; Ma et al, Am J Physiol Cell Physiol, 2025), with the NT modulating voltage dependence and stabilizing closed channel states and the voltage-dependent gate being formed by proximal regions of TM1.

      Thank you for providing this suggestion. As suggested, we have revised the text to place our findings in the context of current knowledge about CALHM gating and have added the relevant citations (lines 370-373).

      (6) Lines 314-315 - Just because residues are conserved does not mean that they play a role in channel gating. These residues could also be important for structure, ion selectivity, etc.

      We agree that evolutionary conservation alone does not imply a role in gating. However, our hypothesis derives from the positioning of these conserved residues, and previous studies that have indicated the importance of the NTH/S1 region for channel gating function. More importantly, our electrophysiology data indicate that these conserved residues specifically impact channel gating in CALHM1 and CALHM6. We have revised the text in lines 404-406 to clarify this further.

      (7) Line 333 - while CALHM6 is less studied than CALHM1, there is knowledge of its function and gating properties. Should CALHM6 be considered a "dark" channel? The IDG development level in Pharos is Tbio. There have been multiple papers published on this channel (ex: Ebihara et al, J Exp Med, 2010; Kasamatsu et al, J Immunol 2014; Danielli et al, EMBO J, 2023).

      We thank the reviewer for noting this important discrepancy. We have updated the text and labels related to CALHM6 to reflect its status as Tbio in the manuscript.

      (8) Please cite Jeon et al., (Biochem Biophys Res Commun, 2021), who have already shown temperature-dependence of CALHM1.

      Thank you for the comment. We have added the citation.  

      (9) It would be helpful to have a schematic showing amino acid residues, TM domains, highlighted residues mutated, etc.

      Thank you for the suggestion. We have revised the figure and added labels for the TM domains, and highlighted the mutated residues.

      Reviewer #1 (Recommendations for the authors):

      (1) Why in the title is 'ion-channels' hyphenated but in the text it is not?

      This has been changed.

      (2) Line 78: 'Cryo-EM' is not defined before the acronym is used.

      This has been fixed.

      (3) Typo in line 519: KinOrthto.

      This has been fixed.

      (4) Capitalizing 'Tree of Life' is a bit strange in section 2 of the results and the Discussion.

      We have removed the capitalization as suggested.

      (5) In Figure 3 and Supplementary Figure 4A, the gene names in the tree are CAHM and not CALHM - I assume this is an error.

      This has been made consistent to CALHM.

      (6) Font sizes throughout all figures, with the exception of Figure 1, need to be more legible. The X-axis labels in Figure 2A are hard to read, for example (though I can see that there is also the CAHM/CALHM typo here...). A good rule of thumb is that they should be the same size as the manuscript text. Furthermore, the grey backgrounds of Figure 4 and Figure 5 are off-putting; just having a white background here should be sufficient.

      This has been addressed. We have increased the font size in all figures with these revisions. The styling for Figure 4 and 5 has also been made consistent with other figures.

      Reviewer #2 (Recommendations for the authors):

      (1) Line 36 - This work does not have experimental evidence to show that the selected evolutionarily conserved residues alter gating functions.

      Addressed in comment #4 for Part B Revisions related to the second part, regarding the analysis of CALHM channel mutations above.

      (2) Line 168 - should also be Supplemental Table 1.

      This has been addressed.

      (3) Line 170 - 419 human ion channel sequences were identified and this was an increase of 75 sequences over previous number. Which 75 proteins are these?

      This is now shown in Supplementary Figure 2 and Supplementary Table 2. Supplementary Figure 2 shows an UpSet plot with the number of sequences that overlap across databases and the novel sequences that we have added as part of this study. The 75 refers specifically to the sequences that were not included in Pharos; we chose Pharos as the reference because it lists the largest number of ICs of the resources compared. Further, Supplementary Table 2 now provides a list of individual ICs and whether they were present in each of the 3 databases compared.

      (4) Line 289 - Ca2+ (not Ca); other similar mistakes throughout the manuscript

      These have been fixed.

      (5) Line 291-292 - Please include more about functions for CALHM channels; ex. CALHM1 regulates cortical neuron excitability (Ma et al, PNAS 2012), CLHM-1 regulates locomotion and induces neurodegeneration in C. elegans (Tanis et al. Journal of Neuroscience 2013); see above for references on CALHM6 function.

      We have added the functions as suggested.

      (6) Line 296-297 - This could also be put in the context of what we already know about CALHM gating. While all cryo-EM structures of CALHM channels are in the open state, we still do understand some things about the gating mechanism (Tanis et al, Am J Physiol Cell Physiol, 2017; Ma et al, Am J Physiol Cell Physiol, 2025), with the NT modulating voltage dependence and stabilizing closed channel states and the voltage-dependent gate being formed by proximal regions of TM1.

      Addressed in comment #5 for Part B Revisions related to the second part, regarding the analysis of CALHM channel mutations above.

      (7) Lines 314-315 - Just because residues are conserved does not mean that they play a role in channel gating. These residues could also be important for structure, ion selectivity, etc.

      Addressed in comment #6 for Part B Revisions related to the second part, regarding the analysis of CALHM channel mutations above.

      (8) Line 333 - While CALHM6 is less studied than CALHM1, there is knowledge of its function and gating properties. Should CALHM6 be considered a "dark" channel? The IDG development level in Pharos is Tbio. There have been multiple papers published on this channel (ex: Ebihara et al, J Exp Med, 2010; Kasamatsu et al, J Immunol 2014; Danielli et al, EMBO J, 2023).

      Addressed in comment #7 for Part B Revisions related to the second part, regarding the analysis of CALHM channel mutations above.

      (9) Line 627 - Do you mean that 5 mM CaCl2 was replaced with 5 mM EGTA in 0 Ca2+ solution?

      This is correct.  

      (10) Why are only evolutionary relationships between rat, mouse, and human shown in Figure 3A? These species are all close on the evolutionary timeline.

      Addressed in comment #10 for Part A Revisions related to the first part, regarding data mining and curation above.

      (11) Figure 5 - no need to show the currents at room temperature in the main text since there are robust currents at 37 degrees; this could go into the supplement. Also, please cite Jeon et al. (Biochem Biophys Res Commun, 2021), who have already shown temperature-dependence of CALHM1.

      Addressed in comment #8 for Part B Revisions related to the second part, regarding the analysis of CALHM channel mutations above.

      (12) It would be helpful to have a schematic showing amino acid residues, TM domains, highlighted residues mutated etc.

      Addressed in comment #9 for Part B Revisions related to the second part, regarding the analysis of CALHM channel mutations above.

      (13) Use of S1-S4 to refer to the transmembrane "segments" is not standard; rather, TM1-TM4 would generally be used to refer to transmembrane domains.

      We have used the S1–S4 helix notation to maintain consistency with the nomenclature employed in our previous study (Choi et al., Nature, 2019).

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review): 

      The manuscript by Ivan et al aimed to identify epitopes on the Abeta peptide for a large set of anti-Abeta antibodies, including clinically relevant antibodies. The experimental work was well done and required a major experimental effort, including peptide mutational scanning, affinity determinations, molecular dynamics simulations, IP-MS, WB, and IHC. Therefore, it is of clear interest to the field. The first part of the work is mainly based on an assay in which peptides (15-18-mers) based on the human Abeta sequence, including some containing known PTMs, are immobilized, thus preventing aggregation. Although some results are in agreement with previous experimental structural data (e.g. for 3D6), and some responses to disease-associated mutations were different when compared to wild-type sequences (e.g. in the case of Aducanumab) - which may have implications for personalized treatment - I have concerns about the lack of consideration of the contribution of conformation (as in small oligomers and large aggregates) in antibody recognition patterns. The second part of the study used full-length Abeta in monomeric or aggregated forms to further investigate the differential epitope interaction between Aducanumab, Donanemab, and Lecanemab (Figures 5-7). Interestingly, these results confirmed the expected preference of these antibodies for aggregated Abeta, thus reinforcing my concerns about the conclusions drawn from the results obtained using shorter and immobilized forms of Abeta. Overall, I understand that the work is of interest to the field and should be published without the need for additional experimental data. However, I recommend a thorough revision of the structure of the manuscript in order to make it more focused on the results with the highest impact (second part).

      We thank the reviewer for highlighting this critical aspect. Our rationale for beginning with the high-resolution, aggregation-independent peptide microarray was to systematically dissect sequence requirements, including PTMs, truncations, and elongations, at single–amino acid resolution. This platform defines linear epitope preferences without the confounding influence of aggregation and enabled analyses that would not have been technically feasible with full-length Aβ. This rationale is now clarified in the Introduction (lines 72–77).

      At the same time, the physiological relevance of antibody binding can only be assessed in the context of aggregation. Prompted by the reviewer’s comments, we restructured the manuscript to foreground the full-length, aggregation-dependent data (Figures 5–7). These assays demonstrate that Aducanumab preferentially recognizes aggregated peptide over monomers and that pre-adsorption with fibrils, but not monomers, blocks tissue reactivity (lines 585–599; Fig. 5B). They also show that Lecanemab can capture soluble Aβ in CSF by IP-MS (lines 544–547; Fig. 4B, Fig. 6–Supplement 1), and that Donanemab strongly binds low-molecular-weight pyroGlu-Aβ while also recognizing highly aggregated Aβ1-42 (lines 668–684; Fig. 7).

      The revised Conclusion now explicitly states the complementarity of the two approaches: microarrays for precise sequence and modification mapping, and full-length aggregation assays for context and physiological relevance (lines 705–714).

      Finally, prompted by the reviewer’s feedback, we refined the discussion of therapeutic antibodies to move beyond a descriptive dataset and provide mechanistic clarity. Specifically, the dimerization-supported, valency-dependent binding mode of Aducanumab and the additional structural contributions required for Lecanemab binding to aggregated Aβ are now integrated into the reworked Conclusion (lines 725–741).

      Reviewer #2 (Public review):  

      This paper investigates binding epitopes of different anti-Abeta antibodies. Background information on the clinical outcome of some of the antibodies in the paper, which might be important for readers to know, is lacking. There are no references to clinical outcomes from antibodies that have been in clinical trials. This paper would be much more complete if the status of the antibodies were included. The binding characteristics of aducanumab, donanemab, and Lecanemab should be compared with data from clinical phase 3 studies. 

      Aducanumab was identified at Neurimmune in Switzerland and licensed to Biogen and Eisai. Aducanumab was retracted from the market due to a very high frequency of the side-effect amyloid-related imaging abnormalities-edema (ARIA-E). Gantenerumab was developed by Roche and had two failed phase 3 studies, mainly due to a high frequency of ARIA-E and low efficacy of Abeta clearance. Lecanemab was identified at Uppsala University, humanized by BioArctic, and licensed to Eisai, who performed the clinical studies. Eisai and Biogen are now marketing Lecanemab as Leqembi on the world market. Donanemab was developed by Eli Lilly and is sold in the US as Kisunla. 

      We thank the reviewer for this valuable suggestion. In the revised manuscript, we have included a concise overview of the clinical status and outcomes of the therapeutic antibodies in the Introduction. This new section (lines 81–99) summarizes the origins, phase 3 trial outcomes, and current regulatory status of Aducanumab, Lecanemab, and Donanemab, as well as mentioning Gantenerumab as a comparator. Key aspects such as ARIA-E incidence, amyloid clearance efficacy, and regulatory decisions are now referenced to provide the necessary clinical context.

      These additions directly link our epitope mapping data with the clinical performance and safety profiles of the antibodies, thereby making the translational implications of our results clearer for both research and therapeutic applications.

      Limitations: 

      (1) Conclusions are based on Abeta antigens that may not be the primary targets for some conformational antibodies like aducanumab and Lecanemab. There is an absence of binding data for soluble aggregated species.

      We thank the reviewer for raising this important point. To address the absence of data on soluble aggregated species, we added IP-MS experiments using pooled human CSF as a physiologically relevant source of endogenous Aβ. Lecanemab enriched several endogenous soluble Aβ variants (Aβ1–40, Aβ1–38, Aβ1–37, Aβ1–39, and Aβ1–42), whereas Aducanumab did not yield detectable signals (Figure 4B; lines 544–547). These results directly distinguish between synthetic and patient-derived Aβ and highlight Lecanemab’s capacity to capture soluble Aβ species under biologically relevant conditions.

      (2) Quality controls and characterization of different Abeta species are missing. The authors need to verify if monomers remain monomeric in the blocking studies for Figures 5 and 6. 

      We thank the reviewer for this comment. In Figure 5 we show that pre-adsorption with monomeric Aβ1–42 does not prevent Aducanumab binding, whereas fibrillar Aβ1–42 completely abolishes staining, consistent with Aducanumab’s avidity-driven preference for higher-order aggregates.

      For Lecanemab (Figure 6), we observed a partial preference for aggregated Aβ1–42 over HFIP-treated monomeric and low-n oligomeric forms. We note, as now stated in the revised manuscript (lines 622–623), that monomeric preparations may partially re-aggregate under blocking conditions, which represents an inherent limitation of such experiments.

      To further address this, we performed additional blocking experiments using shorter Aβ peptides, which are less prone to aggregation. These peptides did not block immunohistochemical staining (Figure 6 – Supplement 1), underscoring that both epitope length and conformational state contribute to Lecanemab binding. This conclusion is also consistent with recent data presented at AAIC 2023.

      (3) The authors should discuss the limitations of studying synthetic Abeta species and how aggregation might hide or reveal different epitopes. 

      We thank the reviewer for this important comment. We now explicitly discuss the limitations of using synthetic Aβ peptides, including that aggregation state can mask or expose epitopes in ways that differ from endogenous species. This discussion has been added in the revised manuscript (lines 737–742).

      As noted in our replies to Points (2) and (4) here, and to Reviewer #1, we addressed this experimentally by complementing the high-resolution, aggregation-independent mapping with blocking studies using aggregated and monomeric Aβ preparations, and by validating key findings with IP-MS of human CSF as a physiologically relevant source of soluble Aβ. Together, these complementary approaches mitigate the limitations of synthetic peptides and provide a more comprehensive picture of antibody–Aβ interactions.

      (4) The authors should elaborate on the differences between synthetic Abeta and patient-derived Abeta. There is a potential for different epitopes to be available. 

      We thank the reviewer for this comment. In the revised manuscript we now discuss how comparisons between synthetic and patient-derived Aβ species reveal additional, likely conformational epitopes that are not accessible in short or monomeric synthetic forms. To address this directly, we performed IP-MS with pooled human CSF. Lecanemab enriched a diverse set of endogenous soluble Aβ1–X species (Aβ1–40, Aβ1–38, Aβ1–37, Aβ1–39, and Aβ1–42), whereas Aducanumab did not yield measurable pull-down (Figure 4B; lines 544–547). These results emphasize that patient-derived Aβ displays distinct aggregation dynamics and epitope accessibility.

      We have expanded on this point in the Conclusion (lines 737–742), underscoring the importance of integrating both synthetic and native Aβ sources to capture the full range of antibody targets. 

      Reviewer #1 (Recommendations for the authors): 

      This revision should prioritize the presentation of results obtained using the full-length Abeta peptide, given its more direct relevance to expected antibody recognition patterns in physiological contexts, and discuss the evidence for using synthetic Abeta. 

      We thank the reviewer for this recommendation. The revised manuscript now places stronger emphasis on results obtained with full-length Aβ peptides, particularly in Figures 5–7, which analyze binding preferences across monomeric, oligomeric, and fibrillar states (lines 585–599, 609–623, 668–684). We also expanded the Discussion to outline both the rationale and the limitations of using synthetic Aβ. The microarray approach provides high-resolution, aggregation-independent sequence and modification mapping, but must be complemented by experiments with full-length Aβ1–42 under physiologically relevant conditions, such as IP-MS from CSF (lines 544–547) and blocking in IHC (lines 585–599, 622–623, 684), to capture conformational epitopes and validate functional relevance.

      Figure 6: Please review/better explain the following statement: "Lecanemab recognized Aβ1-40, Aβ1-42, Aβ3-40, Aβ-3-40 and phosphorylated pSer8-Aβ1-40 on CIEF-immunoassay and Bicine-Tris SDS-PAGE/Western blot, indicating that the Lecanemab’s epitope is located in the N-terminal region of the Aβ sequence". Is it possible that N-truncated peptides do not form aggregates as efficiently as (or conformationally distinct from) full-length ones? 

      In the revised text we now clarify that Lecanemab recognized Aβ1-40, Aβ1-42, Aβ3-40, Aβ-3-40, and phosphorylated pSer8-Aβ1-40 on CIEF-immunoassay (Figure 6A; lines 612–619) and Bicine-Tris SDS-PAGE/Western blot (Figure 6C; lines 639–640). In contrast, shorter N-truncated variants such as Aβ4-40 and Aβ5-40 did not generate detectable signals under the tested conditions. This is consistent with our initial microarray data (Figure 1), which indicated that Lecanemab binding depends on residues 3–7 of the N-terminus.

      On gradient Bistris SDS-PAGE/Western blot, Lecanemab showed a partial but not exclusive preference for aggregated Aβ1-42 over monomeric or low-n oligomeric forms in the HFIP-treated preparation (Figure 6B; lines 632–633). Immunohistochemical detection of Aβ deposits in AD brain sections was efficiently blocked by pre-adsorption with monomerized, oligomeric, or fibrillar Aβ1-42 (Figure 6E; lines 643–645), but not by shorter synthetic peptides such as Aβ1-16, Aβ1-34, or Aβ1-38 (Figure 6 – Supplement 1; lines 654–663).

      We also note, as now stated in the Results, that re-aggregation of HFIP-treated Aβ1-42 monomers during incubation cannot be entirely excluded (lines 622–623). Taken together, these experiments indicate that both N-terminal sequence length and conformational context are critical for Lecanemab binding, and that truncated peptides may indeed fail to reproduce the aggregate-associated conformations required for full recognition.

      Reviewer #2 (Recommendations for the authors): 

      Introduction: 

      (1) Include examples of Lecanemab, donanemab, and gantenerumab, along with relevant references. 

      We expanded the clinical-context paragraph that already covers Aducanumab, Lecanemab, and Donanemab (lines 81–96) and added Gantenerumab. 

      (2) Address why gantenerumab was not included in the study. 

      Due to the focus of our current study on antibodies with recently approved or late-stage clinical use (Aducanumab, Donanemab, Lecanemab), Gantenerumab was not included. 

      (3) Table 1: Correct the reference for Lecanemab, should be reference 44. 

      Table 1 has been updated to correct the Lecanemab reference.

      (4) Line 84: Add Uppsala University and Eisai alongside Biogen for Lecanemab. 

      Line 84 has been revised to acknowledge Uppsala University and Eisai alongside Biogen for the development of Lecanemab (lines 90–96).

      (5) Line 539: Include the reference: "Lecanemab, Aducanumab, and Gantenerumab - Binding Profiles to Different Forms of Amyloid-Beta Might Explain Efficacy and Side Effects in Clinical Trials for Alzheimer's Disease. doi: 10.1007/s13311-022-01308-6. 

      We thank the reviewer for drawing attention to this important reference (now cited as Ref. 83), which provides a state-of-the-art comparison of the binding profiles of Lecanemab, Aducanumab, and Gantenerumab. We have now incorporated it into our manuscript. 

      (6) Line 657-659: State that the findings are also applicable to Lecanemab. 

      Discrepancies between the analyses of the short synthetic fragments and of full-length Abeta are now resolved for both Aducanumab and Lecanemab and put into context in the Results section and the Conclusion (lines 725–740). 

      (7) Figures 5 and 6: Discuss how to ensure that monomers remain monomers under the study conditions, considering the aggregation-prone nature of Abeta1-42. This aggregation could impact Lecanemab's binding to "monomers." To our knowledge, Lecanemab does not bind to monomers. The binding properties observed diverge from previously described properties for Lecanemab. Explore reasons for these discrepancies and suggest conducting complementary experiments using a solution-based assay, as per Söderberg et al, 2023. In Figure 6, note that Lecanemab is strongly avidity-driven, potentially causing densely packed monomers to expose Abeta as aggregated, affecting binding interpretation on SDS-PAGE. 

      We thank the reviewer for this important point. In the revised Results and Discussion we explicitly note that HFIP-treated Aβ1–42 monomers may partially re-aggregate during incubation, which cannot be fully excluded (lines 622–623).

      To complement these data, we show that Lecanemab successfully enriched soluble endogenous Aβ species (Aβ1–40, Aβ1–38, Aβ1–37, Aβ1–39, and Aβ1–42) in IP-MS from pooled CSF (lines 544–547; Fig. 4B), demonstrating its ability to bind soluble Aβ under physiologically relevant conditions.

      We also now cite the Söderberg et al. (2023, PMID: 36253511) study, which reported weak but detectable binding of Lecanemab to monomeric Aβ (their Fig. 1 and Table 6). This supports our interpretation that Lecanemab is aggregation-sensitive rather than strictly aggregation-dependent, in contrast to Aducanumab.

      To further address sequence and conformational contributions, we performed blocking experiments with shorter, non-HFIP-treated Aβ peptides (Aβ1–16, Aβ1–34, Aβ1–38). These peptides did not block Lecanemab staining in IHC (lines 654–657; Fig. 6 – Supplement 1), indicating that both extended sequence and conformational context are necessary for recognition.

      Finally, our findings are in line with preliminary data by Yamauchi et al. (AAIC 2023, DOI: 10.1002/alz.065104), who proposed that Lecanemab recognizes either a conformational epitope spanning the N-terminus and mid-region, or a structural change in the mid-region induced by the N-terminus.

    1. Author response:

      We thank you for your efforts in reviewing our manuscript.  We sincerely appreciate that the reviewers were all enthusiastic about our comparison of native chemical ligation (NCL) and non-canonical amino acid (ncAA) mutagenesis methods for installing acetyl lysine (AcK) in alpha-synuclein, as well as the wide variety of biochemical experiments enabled by our ncAA approach.  We respond to the critiques specific to each reviewer here.

      Reviewer #1:

      Expressed concern that in vitro studies of effects on membrane binding were not followed up with neurotransmitter trafficking experiments.  While we certainly think that such studies would be interesting, they would presumably require the use of acetylation mimic mutants (Lys-to-Gln mutations), which we would want to validate by comparison to our semi-synthetic proteins with authentic AcK.  Such experiments are planned for a follow-up manuscript, and we will investigate the reviewer’s suggested experiment at that time.

      Reviewer #1 noted that the method of in vitro seeding really reports on the impact of acetylation on the elongation phase of aggregation.  We will clarify this in our revisions.  They also expressed concern that this was different than the role that acetylation would play in seeding cellular aggregation with pre-acetylated fibrils.  We will also acknowledge and clarify this in our revisions.  Having the monomer population acetylated in cells presents technical challenges that might also be addressed with Gln mutant mimics, and we plan to pursue such experiments in the follow-up manuscript noted above.

      Reviewer #1 criticized the fact that the pre-formed fibrils used in seeding would not have the same polymorph as PD or MSA fibrils derived from patient material.  They were also critical of how our cryo-EM structure of AcK80 fibrils related to the PD and MSA polymorphs.  Finally, while the reviewer liked the MS experiments used to quantify acetylation levels from patient samples, they felt that our findings then threw the physiological relevance of our structural and biochemical experiments into question.  We believe that all of these critiques can be addressed by clarifying our purpose.  We are not necessarily trying to claim that our AcK80 fold is populated in health or disease, but that by driving Lys80 acetylation, one could push fibrils to adopt this conformation, which is less aggregation-prone.  A similar argument has been made in investigations of alpha-synuclein glycosylation and phosphorylation.  Our results in Figure 9 imply that this could be done with HDAC8 inhibition.  We will revise the manuscript to make these ideas clearer, while being sure to acknowledge the limitations noted by Reviewer #1.

      Reviewer #2:

      Expressed concern over our use of SDS micelles for initial investigation of the 12 AcK variants, rather than the phospholipid vesicles used in later FCS and NMR experiments.  We will note this shortcoming in revisions of our manuscript, but we do not believe that using vesicles instead would change the conclusions of these experiments (that only AcK43 produces an effect, and a modest one at that).

      We will add additional detail to the figure captions, as requested by Reviewer #2.

      Reviewer #2 shared some of the concerns of Reviewer #1 regarding the distinctions of which phase of aggregation we were investigating in our in vitro experiments.  As noted above, we will clarify this language.

      Finally, Reviewer #2 stated that “It is not clear from the EM data that the structures of the different lysine acetylated variants are different.”  We feel that it is quite clear from structures in Figure 8 and the EM density maps in Figure S38 that the AcK80 fold is indeed different.  Although the overall polymorphs are somewhat similar to WT, the position of K80 clearly changes upon acetylation, altering the local fold significantly and the global fold more moderately.

      Reviewer #3:

      Found the results convincing, including the potential therapeutic implications.  The only concern noted was that they found the difficulties in semi-synthesis of AcK-modified alpha-synuclein surprising given that it has been made many times before through NCL.  Indeed, our own laboratory has made alpha-synuclein through NCL, and the yields reported here are in keeping with our own previous results.  However, since NCL did not give higher yields than ncAA methods, and it is significantly easier to scan AcK positions using ncAAs, we felt that ncAAs are the method of choice in this case.  We will clarify this position in the revised manuscript.

      In conclusion, on behalf of all authors, I again thank the reviewers for both their positive and negative observations in helping us to improve our manuscript.  We will revise it to strive for greater clarity as we have noted in this letter.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Joint Public Review:

      Summary:

      The major issues are the need for more information concerning WNK expression in brain regions and additional confirmation of the role of sortilin in WNK signaling. There is a lack of sufficient evidence supporting sortilin's involvement in insulin- and WNK-dependent GLUT4 regulation. The recommendation is to examine what WNK kinase is selectively expressed in the region of interest and then explore its engagement with the sortilin and GLUT4 pathways. Further identification of components of the WNK/OSR1/SPAK-sortilin pathway that regulate GLUT4 in brain slices or primary neurons will be helpful in confirming the results. The use of knock-down or knock-out models would be helpful to explore the direct interaction of the pathways. Immortalized and primary cells also represent useful models.

      Together our results indicate that one or more WNK family members regulate insulin sensitivity.  As all WNK family members are expressed in relevant brain regions, whether the results are due to actions of a single WNK family member or more likely due to their combined impact will be an important question to ask in the future.  

      There are multiple publications describing how sortilin is involved in insulin-dependent GLUT4 trafficking; thus, we did not further address that issue.  We have added data on an additional action of WNK463, which indicate that it can block association of OSR1 with sortilin.  While these results do not delve further into how sortilin works, they support the conclusion that WNK/OSR1/SPAK can influence insulin-dependent glucose transport via distinct cellular events (AS160, sortilin, Akt) which are WNK463 sensitive.

      Altogether we added 12 new panels of data from new and previously performed experiments and we modified 3 existing subfigures in response to comments.

      Weaknesses:

      (1) The study used a WNK463 inhibitor as the only tool to manipulate WNK1-4 activity. This inhibitor seems selective; however, it has been reported that it inhibits the individual WNK kinases with differing efficiency (e.g. PMID: 31017050, PMID: 36712947). Additionally, the authors neither analyze nor report the expression profiles or activity levels of WNK1, WNK2, WNK3, and WNK4 within the relevant brain regions (i.e. hippocampus, cortex, amygdala). Combined, these weaknesses raise concerns about the direct involvement of WNK kinases within the selected brain regions and behavior circuits. It would be beneficial if the authors provided gene profiling for WNK1, 2, 3, and 4 (e.g. using Allen brain atlas). To confirm the observations, the authors should either add results from using other WNK inhibitors or, preferentially, analyze knock-down or knock-out animals/tissue targeting the single kinases.

      Thank you for the excellent suggestion to include mRNA data for the four WNKs. We have included a supplementary figure showing expression of WNK1-4 mRNAs in prefrontal cortex and the hippocampus curated from the Allen Brain Atlas. As per the Allen Brain Atlas, all four WNKs are detected in these regions with WNK4 mRNA the most highly expressed followed by WNK2, WNK3 and then WNK1 (Figure S1A).   

      With regard to the use of WNK463, we continue to use it because we have examined its actions in cell lines that express only WNK1, e.g. A549 (Haman Center lung cancer RNA-seq data), and in A549 cells with WNK1 deleted using CRISPR, in which WNK463 had no effect in several assays we use for WNK1, including suppression of autophagy.  WNK463 was reported in the literature to inhibit only the four WNKs out of more than 400 kinases tested, indicating greater selectivity than many small molecules used to target other enzymes.  In other cell lines, we also use WNK1 knockdown, which replicates the effect of WNK463 (Figure S7A-D). However, in SH-SY5Y cells, WNK1 knockdown did not replicate the effect of WNK463 on pAKT levels (Figure S7E-F), suggesting cooperativity among WNK family members in neuronal cells. This makes WNK463 an ideal tool to test our hypotheses in this study, as it targets all four WNKs (WNK1-4).

      (2) The authors do not report any data on whether the global inhibition of WNKs affects insulin levels. Since the authors wish to demonstrate the synergistic effect of simultaneous insulin treatment and WNK1-4 inhibition, such data are missing.

      Thank you for this comment. To obtain this information, we treated C57BL/6J mice with WNK463 for 3 days once daily at a dose of 6 mg/kg and then fasted overnight. Plasma insulin levels were measured. Results showed that the plasma insulin levels trended upwards in the WNK463 treated animals compared to the vehicle treated groups but failed to reach any statistical significance. We have now included these data in supplementary figure S5A.

      The study discovered that the Sortilin receptor binds to OSR1, leading the authors to speculate that Sortilin may be involved in the insulin-dependent GLUT4 surface trafficking. However, the authors do not provide any evidence supporting Sortilin's involvement in insulin- or WNK-dependent GLUT4 trafficking. Thus, this conclusion should be qualified, rephrased, or additional data included.

      Work from several groups has shown that sortilin is involved in insulin-dependent GLUT4 trafficking, for example [9-11,135-139], as we noted in the manuscript. We now show that WNK463 blocks co-immunoprecipitation of Flag-tagged sortilin with endogenous OSR1 in HEK293T cells. This result supports our model for WNK/OSR1/SPAK-insulin-mediated regulation of sortilin.  We included these data in Figures 5M, 5N.

      Minor issues:

      (1) The method and result sections lack information regarding the gender and age of mice used in the behavioral experiments. This information should be added.

      Thank you for pointing this out. We apologize for the omission. The requested information has now been added in the methods section.

      (2) The authors present an analysis of relative protein levels in Figure 1B and Figure 4B, however, the original immunoblots (?) are not included in the study. These data should be added to provide complete and transparent evidence for the analysis.

      Thank you for this request. The blots have now been included in the supplementary figure S2A and Figure 4B, respectively.  

      (3) The basis for Figure 3A needs to be explained and supported with suitable references either in the background or in the result section.

      Thank you for pointing this out. Figure 3A has been moved to Figure 3H as it represents the model summary of the data presented in Figure 3. Other figure numbers have been changed accordingly.  This figure 3A (now 3H) and the model diagram of Figure 5 (now Figure 5O) are now cited in the Discussion, where the results are considered in detail.      

      (4) Figure 4E should be labeled as 'Primary cortical neurons' for clarity, as the major focus is on the hippocampus. To increase consistency, the authors should consider performing the same experiment on hippocampal cultures or explaining using cortical neurons.

      Thank you for the suggestion. Figure 4E (now 4F) has been labelled as Primary cortical neurons for clarity. The major focus of this study is to understand the WNK-mediated regulation of insulin signaling in the areas of the brain that are insulin sensitive, such as the hippocampus and the prefrontal cortex. Therefore, we included cortical neurons to test this hypothesis.

      (5) Figure 5B: The use of whole brain extracts is inconsistent with the rest of the study, especially considering the indication of differing insulin activity in selected brain regions. The authors should explain why they could not use only hippocampal tissue.

      In this manuscript, we are trying to test our hypothesis in insulin-sensitive neuronal cells, which include, but are not limited to, those in the hippocampus. Figure 5B used whole brain extracts, which contain insulin-sensitive as well as insulin-insensitive brain regions, to show the association between OSR1 and AS160. However, this observation was replicated in the insulin-sensitive SH-SY5Y cell model, suggesting that association of OSR1 and AS160 is modulated in the presence of insulin, as shown in Figure 5B, 5C. We added data from SH-SY5Y cells showing effects of WNK463. These data support the concept that this is an interaction that is modulated by WNKs and will occur as long as both OSR1/SPAK and AS160 are expressed.

      (6) Figure 5B-C - Knock-out or knock-down condition should be included in the co-IP experiment. This is especially straightforward to generate in the SH-SY5Y cells. Moreover, these figures lack loading controls.

      If we understand correctly, the issue with regard to including knockdown conditions stems from the issues raised regarding specificity of the antibody which we have addressed in point 10 below. We have now included input blots for both AS160 and OSR1 which serve as the loading control for the IP experiment in figure 5B and 5C.

      (7) Figure 5C-D - A condition with WNK463 inhibition alone is missing. This condition is necessary for evaluating the effects of WNK463 inhibition with and without insulin stimulation.

      Thank you for this observation. We have now added the data for that condition.  The aim of this experiment in Figure 5C (now 5B and 5C) is to show that insulin is important to facilitate interaction between OSR1 and AS160 in differentiated SH-SY5Y cells and that WNK463 diminishes this insulin-dependent interaction. With only WNK463, there was minimal interaction between AS160 and OSR1, as now shown in Figure 5B, 5C.

      (8) Figure 5G - This figure shows the overexpression of plasmids in HEK cells, however, it lacks samples that overexpress the plasmid individually (single expression). Such data should be added, especially when the addition of the blocking peptide does not fully disable the interaction between AS160 and SPAK. Additionally, this figure also lacks a loading control, which is essential for validating the results.

      Thank you for this comment. Figure 5G (now Figure 5F, 5G) is an in vitro IP in which we mixed a purified Flag-SPAK fragment (residues 50-545) with a lysate from cells expressing Myc-AS160 (residues 193-446). Because this is not an IP from lysates of cells in which both plasmids were overexpressed, it does not require a loading control. The lysates were divided in half: one half received the blocking peptide and the other did not, providing a control. In our experience, this blocking peptide does not completely block interactions between SPAK/OSR1 and NKCC2 fragments, which are well-characterized interacting partners [a]. The partial block could also be attributed to the multivalent nature of the interaction between these proteins. We have noted the confusion about our methodology and have tried to explain it more clearly in the Methods, Results and figure legend. Our Commun. Biol. paper [134], which describes this assay and uses it extensively, is now available online.

      (a) Piechotta K, Lu J, Delpire E. Cation chloride cotransporters interact with the stress-related kinases Ste20-related proline-alanine-rich kinase (SPAK) and oxidative stress response 1 (OSR1). J Biol Chem. 2002;277:50812–50819. doi: 10.1074/jbc.M208108200.

      (9) Figure 5J, L - These figures are missing negative controls. The authors should add Sortilin knock-down or knock-out conditions for the immunoprecipitation experiments. Also, the figures lack loading controls. Moreover, the labeling "Control" should be specified, as it is unclear what this condition represents.

      Thank you for noting the lack of clarity in the controls provided. Controls in Figure 5J and 5L refer to IgG Control which serves as the negative control in this case. This has now been specified in the figures (and added Figures 5M and 5N, as well). The issue with OSR1 and sortilin antibody specificity and cross-reaction has been addressed in point 10.

      (10) Figure 5I - The fluorescent signals for the individual channels of OSR1 and Sortilin appear identical (even within the background signal). This raises concerns about potential antibody cross-reaction. One potential solution would be to include additional stainings with different antibodies and perform staining of each protein alone to ensure the specificity of the colocalization.

      Thank you for pointing this out and giving us an opportunity to provide better images that will address the issues raised regarding antibody cross-reaction and antibody specificity. We realize that the images that we originally provided appeared to show all the puncta colocalize which could give rise to the concern about potential antibody cross-reaction. We have replaced them with more appropriate representative images that clearly show some selected regions of common staining as well as regions where there is no overlap.  

      (11) Figures 5D, 5F, 5H, 5L, 5M: These analyses should be first normalized to the loading control such as GAPDH.

      In Figure 5F (now 5E), the analysis has been normalized to the total AS160 protein levels. Because we are reporting changes in pAS160 protein, normalizing it to the total AS160 gives a better idea about the changes in the phosphorylated AS160 form compared to the whole protein and this is more appropriate compared to other loading controls such as GAPDH.  

      In Figure 5H (now Figure 5G), the analysis is an in vitro IP assay using purified protein fragments. Therefore, using GAPDH as a control is not applicable in this case. Please refer to our response to comment 8 for details.

      In Figures 5L, 5M and 5D (now 5K, 5L, 5C) shown, the IP proteins have been normalized to the input protein levels serving as a loading control for the IP experiment. 

      (12) Figure 5K: The significance/meaning of the red star is unclear. It should be explained in the figure legend.

      Thank you for the opportunity to enhance the readability of our manuscript. The red star denotes the condition in the yeast two-hybrid assay that shows binding of the CCT of OSR1 to the C-terminus of sortilin. This has now been clarified in the figure legend.

      (13) Differences in WNK463 dosage and administration periods can affect the results. There is a lack of explanation with regard to the divergent WNK463 treatments of mice across different behavior conditions of fear conditioning, the novel object test, and the elevated plus maze test. This should be considered.

      Thank you for pointing out that the explanation regarding the WNK463 dosage and timing was unclear. WNK463 was dosed once daily at 6 mg/kg, starting 3 days before the behavior experiment and continuing throughout the test protocol. This is the same protocol used for all experiments.  The text describing the protocol has been reworded with more clarity on dosage and timing in the Methods and Results sections.

    1. Author Response:

      We thank all reviewers for their time and effort in carefully reviewing our paper and for their constructive comments on our manuscript. Below we outline our planned revisions in response to the public reviews of the three reviewers.

      In our revision, we will include more details regarding our ABR measurements (including temperature, animal metadata), analysis (including filter settings) and lay out a much more detailed motivation for our ABR signal design. Furthermore, we will provide a more detailed discussion on the caveats of the technique and the interpretation of ABR data in general and our data specifically. Furthermore, we will add more discussion on differences between ABR based audiograms and behavioural data. The authors have extensive experience with the ABR technique and are well aware of its limitations, but also its strengths for use in animals that cannot be trained on behavioural tasks such as the very young zebra finches in this study. These additions will strengthen our paper. We think our conclusions remain justified by our data.

      Reviewer #1 and #2:

      We thank both reviewers for their positive words and suggested improvements. The planned general improvements listed above will take care of all suggestions and comments in the public review.

      Reviewer #3:

      We thank the reviewer for the detailed critique of our manuscript and many suggestions for improvement. The planned general improvements listed above will take care of many of the suggestions and comments listed in the public review. Here we will highlight a few first responses that we will address in detail in our resubmission.

      The reviewer’s major critiques can be condensed to the following four points.

      (1) ABR cannot be done in such small animals.

      This critique is unfounded. ABR measures the summed activity in the auditory pathway, and because the distance from brainstem to electrodes is shorter in small animals, ABR signals are expected to have higher amplitude and consequently better SNR.  We have successfully recorded ABR in animals smaller than 2 DPH zebra finches to support this claim: zebrafish (Jørgensen et al., 2012), 10 mm froglets (Goutte et al., 2017) and 5 mm salamanders (Capshaw et al., 2020). It is more surprising that the technique still provides robust signals even in very large animals such as Minke whales (Houser et al., 2024).

      (2) The ABR methods used do not follow the protocol of other published work in birds. Particularly the 25 ms long duration tone bursts may have underestimated high frequency hearing.

      There is no fixed protocol for ABR measurements, and several studies of bird ABR have used as long or even longer durations. Longer-duration signals were chosen deliberately and are necessary to have a sufficient number of cycles and avoid frequency splatter at our lowest frequencies used (see Lauridsen et al., 2021).
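
      The cycle-count argument can be made concrete with a little arithmetic. The sketch below is only an illustration: the 25 ms duration comes from the review, while the test frequencies, the 5 ms comparison duration and the function names are hypothetical and are not the stimulus set used in the study. The bandwidth estimate uses the common rule of thumb that spectral spread scales roughly as the inverse of burst duration.

```python
def cycles_in_burst(frequency_hz: float, duration_ms: float) -> float:
    """Number of carrier cycles that fit inside a tone burst."""
    return frequency_hz * duration_ms / 1000.0


def approx_spectral_spread_hz(duration_ms: float) -> float:
    """Rule-of-thumb spectral spread (about 1/duration) of a tone burst."""
    return 1000.0 / duration_ms


# Hypothetical frequencies and durations, chosen only for illustration.
for duration_ms in (5.0, 25.0):
    spread = approx_spectral_spread_hz(duration_ms)
    for frequency_hz in (100.0, 250.0, 1000.0, 4000.0):
        n_cycles = cycles_in_burst(frequency_hz, duration_ms)
        print(
            f"{duration_ms:>4.0f} ms burst at {frequency_hz:>6.0f} Hz: "
            f"{n_cycles:6.2f} cycles, spread ~{spread:.0f} Hz"
        )
```

      Under these assumptions a 5 ms burst at 100 Hz contains only half a cycle, whereas a 25 ms burst contains several cycles with a much narrower spectral spread, which is the trade-off the response describes.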

      (3) Sensitivity data should be corrected from ABR to behavioural data.

      We present the results of our measurements of hearing sensitivity using ABR, and ABR-based thresholds are generally less sensitive than thresholds based on behavioural studies (presented in Fig 2c). Correcting these measurements to behavioural thresholds is of course possible, but presenting only the corrected thresholds would misrepresent our sensitivity data. Even so, it should be done only within species and age group, and such data are currently not available. In our revision, we will include an elaborate discussion of this topic.

      (4) Results are inconsistent with papers in developing songbirds.

      We agree that our results do not support and even question the claims in earlier work. These papers, however, either 1) do not measure hearing physiology or 2) do so in different species. To the best of our knowledge, there are presently no published data on the development of auditory physiology in songbird embryos. Our data are consistent with what is known about the physiology of auditory development in all birds studied so far. We will provide a detailed discussion on this topic in our revision.

      References

      Capshaw et al. (2020) J Exp Biol 223: jeb236489.

      Goutte et al. (2017) Sci Rep 7: 12121. doi: 10.1038/s41598-017-12145-5

      Houser et al. (2024) Science 386: 902-906. doi: 10.1126/science.ado7580

      Jørgensen et al. (2012) Adv Exp Med Biol 730: 117-119.

      Lauridsen et al. (2021) J Exp Biol 224: jeb237313. doi: 10.1242/jeb.237313

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      This Reviewer was positive about the study, stating ‘The findings are interesting and important to increase the understanding both of the synaptic transmissions in the main olfactory bulb and the DA neuron diversity.’ They provided a number of helpful suggestions for improving the paper, which we have incorporated as follows:

      (1) It is known that there are two types of DA neurons in the glomerular layer with different diameters and capacitances (Kosaka and Kosaka, 2008; Pignatelli et al., 2005; Angela Pignatelli and Ottorino Belluzzi, 2017). In this manuscript, the authors need to articulate better which layer the imaging and ephys recordings took place, all glomerular layers or with an exception. Meanwhile, they have to report the electrophysiological properties of their recordings, including capacitances, input resistance, etc.

      We thank the Reviewer for this clarification. Indeed, the two dopaminergic cell types we study here correspond directly to the subtypes previously identified based on cell size. Our previous work showed that axon-bearing OB DA neurons have significantly larger somas than their anaxonic neighbours (Galliano et al. 2018), and we replicate this important result in the present study (Figure 3D). In terms of electrophysiological correlates of cell size, we now provide full details of passive membrane properties in the new Supplementary Figure 4, as requested. Axon-bearing DA neurons have significantly lower input resistance and show a non-significant trend towards higher cell capacitance. Both features are entirely consistent with the larger soma size in this subtype. We apologise for the oversight in not fully describing previous categorisations of OB DA neurons, and have now added this information and the appropriate citations to the Introduction (lines 56 to 59 of the revised manuscript). 

      In terms of cell location, all cells in this study were located in the OB glomerular layer. We sampled the entire glomerular layer in all experiments, including the glomerular/EPL border where the majority of axon-bearing neurons are located (Galliano et al. 2018). This is now clarified in the Materials and Methods section (lines 535 to 537 and 614 to 616 of the revised manuscript).

      (2) It is understandable that recording the DA neurons in the glomerular layer is not easy. However, the authors still need to increase their n's and repeat the experiments at least three times to make their conclusion more solid. For example (but not limited to), Fig 3B, n=2 cells from 1 mouse. Fig.4G, the recording only has 3 cells.

      Despite the acknowledged difficulty of these experiments, we have now added substantial extra data to the study as requested. We have increased the number of cells and animals to further support the following findings:

      Fig 3B: we now have n=5 cells from N=3 mice. We have created a new Supplementary Figure 1 to show all the examples.

      Figure 4G: we now have n=6 cells from N=4 mice.

      Figure 5G: we now have n=3 cells from N=3 mice.

      The new data now provide stronger support for our original conclusions. In the case of auto-evoked inhibition after the application of D1 and D2 receptor antagonists, a non-significant trend in the data suggests that, while dopamine is clearly not necessary for the response, it may play a small part in its strength. We have now included this consideration in the Results section (lines 256 to 264 of the revised manuscript).

      (3) The statistics also use pseudoreplicates. It might be better to present the biology replicates, too.

      Indeed, in a study focused on the structural and functional properties of individual neurons, we performed all comparisons with cell as the unit of analysis. This did often (though not always) involve obtaining multiple data points from individual mice, but in these low-throughput experiments n was never much larger than N. The potential impact of pseudoreplicates and their associated within-animal correlations was therefore low. We checked this in response to the Reviewer’s comment by running parallel nested analyses for all comparisons that returned significant differences in the original submission. These are the cases in which we would be most concerned about potential false positive results arising from intra-animal correlations, which nested tests specifically take into account (Aarts et al., 2013). In every instance we found that the nested tests also reported significant differences between anaxonic and axon-bearing cell types, thus fully validating our original statistical approach. We now report this in the relevant section of the Materials and Methods (lines 686 to 691 of the revised manuscript).
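
      The distinction between cells (n) and animals (N) can be illustrated by collapsing cell-level values to one mean per mouse and checking that a group difference survives. The sketch below uses made-up numbers and hypothetical mouse IDs and function names; it is a cruder check than the nested analyses described above (Aarts et al., 2013) and is not the authors' analysis code.

```python
from collections import defaultdict
from statistics import mean


def per_animal_means(records):
    """Collapse (mouse_id, value) pairs to one mean value per mouse."""
    by_mouse = defaultdict(list)
    for mouse_id, value in records:
        by_mouse[mouse_id].append(value)
    return {mouse_id: mean(values) for mouse_id, values in by_mouse.items()}


# Made-up soma areas in arbitrary units; each tuple is (mouse, cell value).
anaxonic = [("m1", 50.0), ("m1", 54.0), ("m2", 48.0), ("m3", 52.0)]
axon_bearing = [("m1", 80.0), ("m2", 90.0), ("m2", 86.0), ("m3", 84.0)]

for label, records in (("anaxonic", anaxonic), ("axon-bearing", axon_bearing)):
    cell_values = [value for _, value in records]
    mouse_means = list(per_animal_means(records).values())
    print(
        f"{label}: n={len(cell_values)} cells, cell-level mean={mean(cell_values):.1f}; "
        f"N={len(mouse_means)} mice, mean of mouse means={mean(mouse_means):.1f}"
    )
```

      When n is only slightly larger than N, as in this toy data set, the cell-level and animal-level summaries stay close, which is the intuition behind the low expected impact of pseudoreplication.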

      (4) In Figure 4D, the authors report the values in the manuscript. It is recommended to make a bar graph to be more intuitive.

      This plot does already exist in the original manuscript. We originally describe these data to support the observation that an auto-evoked inhibition effect exists in anaxonic neurons (corresponding to now lines 240 to 245 of the revised manuscript). We then show them visually in their entirety when we compare them to the lack of response in axon-bearing neurons, depicted in Figure 5C. We still believe that this order of presentation is most appropriate for the flow of information in the paper, so have maintained it in our revised submission.

      (5) In Figure 4F and G, although the data with three cells suggest no phenotype, the kinetics looked different. So, the authors might need to explore that aside from increasing the n.

      We thank the Reviewer for this suggestion. To quantify potential changes in the auto-evoked inhibition response kinetics, we fitted single exponential functions and compared changes in the rate constant (k; Methods, lines 650 to 652 of the revised manuscript). Overall, we observed no consistent or significant change in rate constant values after adding DA receptor antagonists. This finding is now reported in the Results section (lines 260 to 263 of the revised manuscript) and shown in a new Supplementary Figure 3.
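
      A minimal sketch of how a rate constant k can be extracted from a decaying response is given below. It assumes a baseline-subtracted, positive response magnitude of the form A * exp(-k * t) and fits it by linear regression on the log-transformed values; the fitting routine actually used in the study is not described here, and the traces, units and function name are hypothetical.

```python
import math


def fit_single_exponential(times, values):
    """Least-squares fit of values ~ A * exp(-k * t) on log-transformed data.

    Returns (A, k). All values must be positive, i.e. the response is assumed
    to be baseline-subtracted and expressed as a magnitude.
    """
    logs = [math.log(v) for v in values]
    n = len(times)
    t_mean = sum(times) / n
    log_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - log_mean) for t, y in zip(times, logs))
    slope = sxy / sxx
    amplitude = math.exp(log_mean - slope * t_mean)
    return amplitude, -slope


# Synthetic, noise-free traces (hypothetical units: seconds and pA).
times = [i * 0.05 for i in range(1, 21)]
before = [100.0 * math.exp(-5.0 * t) for t in times]
after = [80.0 * math.exp(-5.0 * t) for t in times]

amp_before, k_before = fit_single_exponential(times, before)
amp_after, k_after = fit_single_exponential(times, after)
print(f"before: A = {amp_before:.1f} pA, k = {k_before:.3f} 1/s")
print(f"after:  A = {amp_after:.1f} pA, k = {k_after:.3f} 1/s")
```

      In this toy case the amplitude differs between conditions while k does not, which is the kind of dissociation the comparison of rate constants is meant to detect. With noisy recordings, a nonlinear fit to the untransformed trace is usually preferable to the log-linear shortcut used here.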

      (6) Similarly, for Figure 4I and J, L and M, it is better to present and analyze it like F and G, instead of showing only the after-antagonist effect.

      We agree that the ideal scenario would have been to perform the experiments in Figure 4J and 4M the same way as those in Figure 4G, with a before vs after comparison. Unfortunately, however, this was not practically possible. 

      When attempting to apply carbenoxolone to already-patched cells, we found that this drug severely disrupted the overall health and stability of our recordings immediately after its application. This is consistent with previous reports of similar issues with this compound (e.g. Connors 2012, Epilepsy Currents; Tovar et al., 2009, Journal of Neurophysiology). After many such attempts, the total yield of this experiment was one single cell from one animal. Even so, as shown in the traces below, we were able to show that the auto-evoked inhibition response was not eliminated in this specific case:

      Author response image 1.

      Traces of an AEI response recorded before (magenta) and after (green) the application of carbenoxolone (n=1 cell from N=1 mouse).

      In light of these issues, we instead followed published protocols in applying the carbenoxolone directly in the bath for 20 minutes without prior recording (following Samoilova et al., 2003, Journal of Neurochemistry) and ran the protocol after that time. Given that our main question was whether gap junctions were strictly necessary for the presence of any auto-evoked inhibition response, our positive findings in these experiments still allowed us to draw clear conclusions.

      In contrast, the issue with the NKCC1 antagonist bumetanide was time. As acknowledged by this Reviewer, obtaining and maintaining high-quality patch recordings from OB DA neurons is technically challenging. Bumetanide is a slow-acting drug when used to modify neuronal chloride concentrations, because in addition to the time it takes to reach the neurons and effectively block NKCC1, the intracellular levels of chloride subsequently change slowly. Studies using this drug in slice physiology experiments typically use an incubation time of at least 20 minutes (e.g. Huberfeld et al., 2007, Journal of Neuroscience), which was incompatible with productive data collection in OB DA neurons. Again, after many unsuccessful efforts, we were forced instead to include bumetanide in the bath without prior recording for 20-30 minutes. As with the carbenoxolone experiment, our goal here was to establish whether auto-evoked inhibition was in any way retained in the presence of this drug, so our positive result again allowed us to draw clear conclusions.

      Reviewer #1 (Recommendations for the authors):

      (1) I suggest the authors reconsider the terminology. For example, they use "strikingly" in their title. The manuscript reported two different transmitter release strategies but not the mechanisms, and the word "strikingly" is not professional, either.

      We appreciate the Reviewer’s attention to clarity and tone in the manuscript title, and have nevertheless decided to retain the original wording. The almost all-or-nothing differences between closely related cell types shown in structural and functional properties here (Figures 3F & 5C) are pronounced, extremely clear and easily spotted – all properties appropriate for the word ‘striking.’ In addition, we note that the use of this term is not at all unprofessional, with a PubMed search for ‘strikingly’ in the title of publications returning over 200 hits.

      (2) Similarly, almost all confocal scopes are 3D because images can be taken at stacks. So "3D confocal" is misleading.

      We understand that this is misleading. We have now replaced the sentence ‘Example snapshot of a 3D confocal stack of…’ by ‘Example confocal images of…’ in all the figure legends that apply.

      (3) It is recommended to present the data in bar graphs with data dots instead of showing the numbers in the manuscript directly.

      We agree entirely, and now present data plots for all comparisons reported in the study (Supplementary Figures 2, 4 and 5).

      Reviewer #2 (Recommendations for the authors):

      (1) Several experiments report notably small sample sizes, such as in Figures 3B and 5G, where data from only 2 cells derived from 1-2 mice are presented. Figures 4E-G also report the experimental result only from 3 cells derived from 3 mice. To enhance the statistical robustness and reliability of the findings, these experiments should be replicated with larger sample sizes.

      As per our response to Reviewer 1’s comment #2 above, and to directly address the concern that some evidence was ‘incomplete’, we have now added significant extra data and analysis to this revised submission (Figures 4 and 5; and Supplementary Figure 1). We believe that this has further enhanced the robustness and reliability of our findings, as requested.

      (2) The authors utilize vGAT-Cre for Figures 1-3 and DAT-tdTomato for Figures 4-5, raising concerns about consistency in targeting the same population of dopaminergic neurons. It remains unclear whether all OB DA neurons express vGAT and release GABA. Clarification and additional evidence are needed to confirm whether the same neuronal population was studied across these experiments.

      Although we indeed used different mouse lines to investigate structural and functional aspects of transmitter release, we can be very confident that both approaches allowed us to study the same two distinct DA cell types being compared in this paper. Existing data to support this position are already clear and strong, so in this revision we have focused on the Reviewer’s suggestion to clarify the approaches we chose.

      First, it is well characterised that in mouse and many other species all OB DA neurons are also GABAergic. This has been demonstrated comprehensively at the level of neurochemical identity and in terms of dopamine/GABA co-release, and is true across both small-soma/anaxonic and large-soma/axon-bearing subclasses (Kosaka & Kosaka 2008; 2016; Maher & Westbrook 2008; Borisovska et al., 2013; Vaaga et al., 2016; Liu et al. 2013). To specifically confirm vGAT expression, we have also now provided additional single-cell RNAseq data and immunohistochemical label in a revised Figure 1 (see also Panzanelli et al., 2007, now referenced in the paper, who confirmed endogenous vGAT colocalisation in TH-positive OB neurons). Most importantly, by using vGAT-cre mice here we were able to obtain sufficient numbers of both anaxonic and axon-bearing DA neurons among the vGAT-cre-expressing OB population. We could unambiguously identify these cells as dopaminergic because of their expression of TH protein which, due to the absence of noradrenergic neurons in the OB, is a specific and comprehensive marker for dopaminergic cells in this brain region (Hokfelt et al., 1975; Rosser et al., 1986; Kosaka & Kosaka 2016). Crucially, both axon-bearing and anaxonic OB DA subtypes strongly express TH (Galliano et al., 2018, 2021). We have now added additional text to the relevant Results section (lines 99 to 108 of the revised manuscript) to clarify these reasons for studying vGAT-cre mice here.

      We were also able to clearly identify and sample both subtypes of OB DA neuron using DAT-tdT mice. Our previous published work has thoroughly characterised this exact mouse line at the exact ages studied in the present paper (Galliano et al., 2018; Byrne et al., 2022). We know that DAT-tdT mice provide rather specific label for TH-expressing OB DA neurons (75% co-localisation; Byrne et al., 2022), but most importantly we know which non-DA neurons are labelled in this mouse line and how to avoid them. All non-TH-expressing but tdT-positive cells in juvenile DAT-tdT mice are small, dimly fluorescent and weakly spiking neurons of the calretinin-expressing glomerular subtype (Byrne et al., 2022). These cells are easily detected during physiological recordings, and were excluded from our study here. This information is now provided in the relevant Methods section (lines 616 to 619 of the revised manuscript, also referenced in lines 236 to 240 of the Results section), and we apologise for its previous omission. Finally, we have shown both structurally and functionally that both axon-bearing and anaxonic OB DA subtypes are labelled in DAT-tdT mice (Galliano et al., 2018; Tufo et al., 2025; present study). Overall, these additional clarifications firmly establish that the same neuronal populations were indeed studied across our experiments.

      (3) The low TH+ signal in Figure 1D raises questions regarding the successful targeting of OB DA neurons. Further validation, such as additional staining, is required to ensure that the targeted neurons are accurately identified.

      As noted in our response to the previous comment, TH is a specific marker for dopaminergic neurons in the mouse OB, and is widely used for this purpose. Labelling for TH in our tissue is extremely reliable, and in fact gives such strong signal that we were forced to reduce the primary antibody concentration to 1:50,000 to prevent bleedthrough into other acquisition channels. Even at this concentration it was extremely straightforward to unambiguously identify TH-positive cells based on somatic immunofluorescence. We recognise, however, that the original example image in Figure 1D was not sufficiently clear, and have now provided a new example which illustrates the TH-based identification of these cells much more effectively. 

      (4) Estimating the total number of dopaminergic neurons in the olfactory bulb, along with the relative proportions of anaxonic and axon-bearing neuron subtypes, would provide valuable context for the study. Presenting such data is crucial to underscore the biological significance of the findings.

      This information has already been well characterised in previous studies. Total dopaminergic cell number in the OB is ~90,000 (Maclean & Shipley, 1988; Panzanelli et al., 2007; Parrish-Aungst et al., 2007). In terms of proportions, anaxonic neurons make up the vast majority of these cells, with axon-bearing neurons representing only ~2.5% of all OB dopaminergic neurons at P28 (Galliano et al., 2018). Of course, the relatively low number of the axon-bearing subtype does not preclude its having a potentially large influence on glomerular networks and sensory processing, as demonstrated by multiple studies showing the functional effects of inter-glomerular inhibition (Kosaka & Kosaka, 2008; Liu et al., 2013; Whitesell et al., 2013; Banerjee et al., 2015). This information has now been added to the Introduction (line 47 and lines 59 to 62 of the revised manuscript).

      (5) The authors report that in-utero injection was performed based on the premise that the two subclasses of dopaminergic neurons in the olfactory bulb are generated during embryonic development. However, it remains unclear whether in-utero injection is essential for distinguishing between these two subclasses. While the manuscript references a relevant study, the explanation provided is insufficient. A more detailed justification for employing in-utero injection would enhance the manuscript's clarity and methodological rigor.

      We apologise for the lack of clarity in explaining the approach. In utero injection is not absolutely essential for distinguishing between the two subclasses, but it does have two major advantages. 1) Because infection happens before cells migrate to their final positions, it produces sparse labelling which permits later unambiguous identification of individual cells’ processes; and 2) Because both subclasses are generated embryonically (compared to the postnatal production of only anaxonic DA neurons), it allows effective targeting of both cell types. We have now expanded the relevant section of the Results to explain the rationale for our approach in more detail (lines 109 to 116 of the revised manuscript).

      (6) In Figures 1A and 4A, it appears that data from previously published studies were utilized to illustrate the differential mRNA expression in dopaminergic neurons of the olfactory bulb. However, the Methods section and the manuscript lack a detailed description of how these dopaminergic neurons were classified or analyzed. Given that these figures contribute to the primary dataset, providing additional explanation and context is essential to ensure clarity of the findings.

      We apologise for the lack of clarity. We have now extended the part of the methods referring to the RNAseq data analysis (lines 666 to 678 of the revised manuscript). 

      (7) In Figure 2C, anaxonic dopamine neurons display considerable variability in the number of neurotransmitter release sites, with some neurons exhibiting sparse sites while others exhibit numerous sites. The authors should address the potential biological or methodological reasons for this variability and discuss its significance.

      We thank the Reviewer for highlighting this feature of our data. We have now outlined potential methodological reasons for the variability, whilst also acknowledging that it is consistent with previous reports of presynaptic site distributions in these cells (Kiyokage et al., 2017; Results, lines 169 to 172 of the revised manuscript). We have also added a brief discussion of the potential biological significance (Discussion, lines 446 to 450).

      (8) In the images used to differentiate anaxonic and axon-bearing neurons, the soma, axons, and dendrites are intermixed, making it difficult to distinguish structures specific to each subclass. Employing subclass-specific labeling or sparse labeling techniques could enhance clarity and accuracy in identifying these structures.

      Distinguishing these structures is indeed difficult, and was the main reason we used viral label to produce sparse labelling (see response to comment #5 above). In all cases we were extremely careful, including cells only when we could be absolutely certain of their anaxonic or axon-bearing identity, and could also be certain of the continuity of all processes. Crucially, while the 2D representations we show in our figures may suggest a degree of intermixing, we performed all analyses on 3D image stacks, significantly improving our ability to accurately assign structures to individual cells. We have now added extra descriptions of this approach in the relevant Methods section (lines 546 to 548 of the revised manuscript).

      (9) In Figure 3, the soma area and synaptophysin puncta density are compared between axon-bearing and anaxonic neurons. However, the figure only presents representative images of axon-bearing neurons. To ensure a fair and accurate comparison, representative images of both neuron subtypes should be included.

      The original figures did include example images of puncta density (or lack of puncta) in both cell types (Figure 2B and Figure 3E). For soma area, we have now included representative images of axon-bearing and anaxonic neurons with an indication of soma area measurement in a new Supplementary Figure 2A.

      (10) In Figure 4B, the authors state that gephyrin and synaptophysin puncta are in 'very close proximity.' However, it is unclear whether this proximity is sufficient to suggest the possibility of self-inhibition. Quantifying the distance between gephyrin and synaptophysin puncta would provide critical evidence to support this claim. Additionally, analyzing the distribution and proportion of gephyrin-synaptophysin pairs in close proximity would offer further clarity and strengthen the interpretation of these findings.

      We thank the Reviewer for raising this issue. We entirely agree that the example image previously shown did not constitute sufficient evidence to claim either close proximity of gephyrin and synaptophysin puncta or the possibility of self-inhibition. We are not in a position to perform a full quantitative analysis of these spatial distributions, nor do we think this is necessary given previous direct evidence for auto-evoked inhibition in OB dopaminergic cells (Smith and Jahr, 2002; Murphy et al., 2005; Maher and Westbrook, 2008; Borisovska et al., 2013) and our own demonstration of this phenomenon in anaxonic neurons (Figure 4). We have therefore removed the image and the reference to it in the text.

      (11) In Figures 4J and 4M, the effects of the drugs are presented without a direct comparison to the control group (baseline control?). Including these baseline control data is essential to provide a clear context for interpreting the drug effects and to validate the conclusions drawn from these experiments.

      We appreciate the Reviewer’s attention to this important point. As this concern was also raised by Reviewer 1 (their point #6), we have provided a detailed response fully addressing it in our replies to Reviewer 1 above. 

      (12) In Lines 342-344, the authors claim that VMAT2 staining is notoriously difficult. However, several studies (e.g., Weihe et al., 2006; Cliburn et al., 2017) have successfully utilized VMAT2 staining. Moreover, Zhang et al., 2015 - a reference cited by the authors - demonstrates that a specific VMAT2 antibody effectively detects VMAT2. Providing evidence of VMAT2 expression in OB DA neurons would substantiate the claim that these neurons are GABA-co-releasing DA neurons and strengthen the study's conclusions.

      As noted in response to this Reviewer’s comment #2 above, there is clear published evidence that OB DA neurons are GABA- and dopamine-releasing cells. These cells are also known to express VMAT2 (Cave et al., 2010; Borisovska et al., 2013; Vergaña-Vera et al., 2015). We do not therefore believe that additional evidence of VMAT2 expression is necessary to strengthen our study’s conclusions. We did make every effort to label VMAT2-positive release sites in our neurons, but unfortunately all commercially available antibodies were ineffective. The successful staining highlighted by the Reviewer was either performed in the context of virally driven overexpression (Zhang et al., 2015) or was obtained using custom-produced antibodies (Weihe et al., 2006; Cliburn et al., 2017). We have now modified the Discussion text to provide more clarification of these points (lines 393 to 395 of the revised manuscript).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      This paper investigates the physical mechanisms underlying cell intercalation, which then enables collective cell flows in confluent epithelia. The authors show that T1 transitions (the topological transitions responsible for cell intercalation) correspond to the unbinding of groups of hexatic topological defects. Defect unbinding, and hence cell intercalation and collective cell flows, are possible when active stresses in the tissue are extensile. This result helps to rationalize the observation that many epithelial cell layers have been found to exhibit extensile active nematic behavior.

      Strengths

      The authors obtain their results based on a combination of active hexanematic hydrodynamics and a multiphase field (MPF) model for epithelial layers, whose connection is a strength of the paper. With the hydrodynamic approach, the authors find the active flow fields produced around hexatic topological defects, which can drive defect unbinding. Using the MPF simulations, the authors show that T1 transitions tend to localize close to hexatic topological defects.

      We are grateful to Reviewer #1 for appreciating and highlighting the strengths of our work.

      Weaknesses

      Citations are sometimes not comprehensive. Cases of contractile behavior found in collective cell flows, which would seemingly contradict some of the authors’ conclusions, are not discussed.

      I encourage the authors to address the comments and questions below.

      We are thankful to Reviewer #1 for their questions and comments. We have addressed them point by point below, and have amended the manuscript accordingly.

      (1) In Equation 1, what do the authors mean by the cluster’s size ℓ? How is this quantity defined? The calculations in the Methods suggest that ℓ indicates the distance between the p-atic defects and the center of the T1 cell cluster, but this is not clearly defined.

      We thank Reviewer #1 for their question. We define the cluster size as the initial distance between the center of the quadrupole and any defect (see Methods). In a primary cell cluster, where cells themselves are the defects, the cluster’s size is the distance between the center of the central junction and the center of any cell in the cluster. Hence, this is half the diameter of a cell which, for example in a typical confluent MDCK epithelial monolayer, would be about 10 µm. We have added this clarification in the definition of the cluster size, above Eq. (1).

      (2) The multiphase field model was developed and reviewed already, before the Loewe et al. 2020 paper that the authors cite. Earlier papers include Camley et al. PNAS 2014, Palmieri et al. Sci. Rep. 2015, Mueller et al. PRL 2019, and Peyret et al. Biophys. J. 2019, as reviewed in Alert and Trepat. Annu. Rev. Condens. Matter Phys. 2020.

      We thank the referee for their suggestion to incorporate further MPF literature. We have done so in the amended manuscript.

      (3) At what time lag is the mean-squared displacement in Figure 3f calculated? How does the choice of a lag time affect these data and the resulting conclusions?

      The scatter plot in Fig. 3f was constructed by dividing the system into square subregions of size ∆ℓ = 35 l.u., each containing approximately 4 cells. For each subregion, we analyzed a time window of ∆t = 25 × 10³ iterations, measuring both the normalized mean square displacement of cells (relative to the subregion area ∆ℓ²) and the average defect density. The normalized displacement is calculated as the m.s.d. accumulated between t∗ and t∗ + ∆t, divided by ∆ℓ², where t∗ denotes the start time of the observation window. We chose the time window ∆t used to compute the mean square displacement to match the characteristic duration of T1 events and defect lifetimes in our simulations. Observation times much longer than the typical T1 event duration (∆t > 35 × 10³) would cause the two sets of data points to merge into a single group, suggesting no correlation between cell motility and defect density beyond the defect lifetime.
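
      The binning step described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the analysis code used for Fig. 3f: the function name `normalized_msd_by_subregion`, the toy coordinates, and the choice to assign each cell to the subregion containing its position at t∗ are all hypothetical, and periodic boundaries are ignored.

```python
from collections import defaultdict


def normalized_msd_by_subregion(start_pos, end_pos, box_size):
    """Return {(ix, iy): mean squared displacement / box_size**2}.

    start_pos and end_pos are lists of (x, y) cell positions at t* and
    t* + dt. Each cell is assigned to the square subregion that contains
    its start position (an assumption made for this sketch).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (x0, y0), (x1, y1) in zip(start_pos, end_pos):
        key = (int(x0 // box_size), int(y0 // box_size))
        sums[key] += (x1 - x0) ** 2 + (y1 - y0) ** 2
        counts[key] += 1
    # Average over the cells in each subregion, then normalize by its area.
    return {key: sums[key] / counts[key] / box_size ** 2 for key in sums}


# Toy positions (hypothetical): two motile cells in one subregion and
# one stationary cell in the neighbouring subregion.
start = [(1.0, 1.0), (2.0, 3.0), (40.0, 5.0)]
end = [(1.0, 8.0), (9.0, 3.0), (40.0, 5.0)]
msd = normalized_msd_by_subregion(start, end, box_size=35.0)
```

      Pairing each subregion's value with its average defect density over the same window would then give one point of the scatter plot.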

      (4) The authors argue that their results provide an explanation for the extensile behavior of cell layers. However, there are also examples of contractile behavior, such as in Duclos et al., Nat. Phys., 2017 and in Pérez-González et al., Nat. Phys., 2019. In both cases, collective cell flows were observed, which in principle require cell intercalations. How would these observations be rationalized with the theory proposed in this paper? Can these experiments and the theory be reconciled?

      The contractile or extensile nature of stress in epithelia depends crucially on the specific tissue type and its biological context. Different cell populations, depending on their position along the epithelial/mesenchymal spectrum, can exhibit either contractile or extensile behaviors. Our theory applies to tissues where hexatic order dominates at the cellular scale, particularly in confluent systems where neighbor exchanges occur primarily through T1 transitions. In contrast, the systems studied by Duclos et al., Nat. Phys. (2018) and Perez-Gonzalez et al. (Nat. Phys., 2019) exhibit nematic order at the cellular level, meaning their dynamics are governed by fundamentally different mechanisms. Since our framework is derived for hexatic-dominated tissues, it does not directly apply to those cases, though a hybrid hexanematic description previously developed by some of the authors in Armengol-Collado et al. eLife 13:e86400 (2024) could help reconcile these observations. In general, a key distinction must be made between the contractility of individual cells and the extensile/contractile nature of the collective force network. To illustrate this, consider a cell exerting a 6-fold symmetric force distribution: each vertex force arises from an imbalance in junctional tensions with neighboring cells, which are themselves contractile due to actomyosin activity. However, the resulting vertex forces can be either contractile or extensile depending on network geometry and tension distribution. This is captured in our coarse-grained description [see Armengol-Collado et al. eLife 13:e86400 (2024)], where the active stress emerges from higher-order moments of cellular forces. Specifically, the deviatoric part of the hexatic active stress tensor is expressed in terms of the cell radius, the cell number density, and the intensity of cellular tension. The negative sign of the coefficient of the active stress shows that the active stress is extensile, consistent with observations in various epithelial systems (e.g., Saw et al., Nature 2017; Blanch-Mercader et al., Phys. Rev. Lett. 2018). Finally, we note that the connection between cellular-scale forces and large-scale extensility has been rationalized in other contexts, such as active nematics (Balasubramaniam et al., Nat. Mater. 2021).

      Reviewer #2 (Public Review):

      This paper studies the role of hexatic defects in the collective migration of epithelia. The authors emphasize that epithelial migration is driven by cell intercalation events and not just isolated T1 events, and analyze this through the lens of hexatic topological defects. Finally, the authors study the effect of active and passive forces on the dynamics of hexatic defects using analytical results, and numerical results in both continuum and phase-field models.

      The results are very interesting and highlight new ways of studying epithelial cell migration through the analysis of the binding and unbinding of hexatic defects.

      We are grateful to Reviewer #2 for their interest and for emphasizing the novelty of our work.

      Strengths

      (1) The authors convincingly argue that intercalation events are responsible for collective cell migration, and that these events are accompanied by the formation and unbinding of hexatic topological defects.

      (2) The authors clearly explain the dynamics of hexatic defects during T1 transitions, and demonstrate the importance of active and passive forces during cell migration.

      (3) The paper thoroughly studies the T1 transition through the viewpoint of hexatic defects. A continuum model approach to study T1 transitions in cell layers is novel and can lead to valuable new insights.

      We thank the Reviewer for their kind and supportive words, and for highlighting the clarity, persuasiveness, and thoroughness of our work.

      Weaknesses

      (1) The authors could expand on the dynamics of existing hexatic defects during epithelial cell migration, in addition to how they are created during T1 transitions.

      We thank the referee for their comment. The detailed analysis of dislocation-pair unbinding modes and their statistical impact on the transition to collective migration is comprehensively addressed in our subsequent work Puggioni et al., arXiv:2502.09554. In the present study, we focus specifically on the fundamental mechanism enabling dislocation unbinding: active extensile stresses generate flows that drive dislocation pairs apart, while passive elastic stresses tend to pull them together (Krommydas et al., Phys. Rev. Lett. 2023; Armengol-Collado et al., arXiv:2502.13104). When active forces dominate over passive restoring forces, the dislocations unbind. This represents a crucial distinction from classical Berezinskii–Kosterlitz–Thouless or Kosterlitz–Thouless–Halperin–Nelson–Young transitions, where thermal fluctuations drive defect unbinding. In our system, the process is fundamentally activity-driven. Nevertheless, the resulting state, characterized by unbound defects and collective migration, bears strong analogy to the melting transition in equilibrium systems. We emphasize that the dynamics of passive defects has been previously examined in Krommydas et al., Phys. Rev. Lett. 2023. A discussion of these aspects can be found in the Appendix “Numerical simulations of defect annihilation and unbinding”.

      (2) The different terms in the MPF model used to study cell layer dynamics are not fully justified. In particular, it is not clear why the model includes self-propulsion and rotational diffusion in addition to nematic and hexatic stresses, and how these quantities are related to each other.

      We thank the referee for their comment. The MPF model’s terms (e.g., self-propulsion, rotational diffusion) reflect the stochastic, deformable nature of cells as active droplets migrating with near-constant speed. We emphasize that self-propulsion is the only non-equilibrium mechanism in our model; no additional active stresses (nematic or hexatic) are imposed. We have clarified this point in the revised manuscript and expanded our discussion of the MPF model.
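
      To illustrate how self-propulsion and rotational diffusion combine, the following minimal sketch evolves a single point-like cell centre that moves at constant speed along a polarity angle which diffuses in time. It is not the MPF model itself (there, the propulsion enters the phase-field dynamics of a deformable cell); the function `step` and the parameter names `v0`, `d_r` and `dt`, together with their values, are assumptions introduced only for this example.

```python
import math
import random


def step(x, y, theta, v0, d_r, dt, rng):
    """Advance one self-propelled cell centre by one Euler-Maruyama step."""
    # Self-propulsion: constant speed v0 along the current polarity angle.
    x += v0 * math.cos(theta) * dt
    y += v0 * math.sin(theta) * dt
    # Rotational diffusion: the polarity angle performs a random walk
    # with rotational diffusion coefficient d_r.
    theta += math.sqrt(2.0 * d_r * dt) * rng.gauss(0.0, 1.0)
    return x, y, theta


rng = random.Random(0)  # fixed seed so the trajectory is reproducible
x, y, theta = 0.0, 0.0, 0.0
path_length = 0.0
for _ in range(1000):
    x_new, y_new, theta = step(x, y, theta, v0=1.0, d_r=0.1, dt=0.01, rng=rng)
    path_length += math.hypot(x_new - x, y_new - y)
    x, y = x_new, y_new
```

      In this sketch the speed is fixed and only the direction of motion is stochastic, so the contour length of the trajectory is v0 times the elapsed time while the net displacement is reduced by the random reorientation.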

      (3) The authors could provide some physical intuition on what an active extensile or contractile term in the hexatic order parameter means, and how this is related to extensility and contractility in active nematics and/or for cell layers.

      We thank the referee for their comment. As we explain in the reply to comment [4] of Reviewer #1, the contractile or extensile nature of stress in epithelia depends crucially on the specific tissue type and its biological context. Different cell populations, depending on their position along the epithelial/mesenchymal spectrum, can exhibit either contractile or extensile behaviors. Our theory applies to tissues where hexatic order dominates at the cellular scale, particularly in confluent systems where neighbor exchanges occur primarily through T1 transitions. In contrast, the systems studied by Duclos et al., Nat. Phys. (2018) and Perez-Gonzalez et al. (Nat. Phys., 2019) exhibit nematic order at the cellular level, meaning their dynamics are governed by fundamentally different mechanisms. Since our framework is derived for hexatic-dominated tissues, it does not directly apply to those cases, though a hybrid hexanematic description previously developed by some of the authors in Armengol-Collado et al. eLife 13:e86400 (2024) could help reconcile these observations. In general, a key distinction must be made between the contractility of individual cells and the extensile/contractile nature of the collective force network. To illustrate this, consider a cell exerting a 6-fold symmetric force distribution: each vertex force arises from an imbalance in junctional tensions with neighboring cells, which are themselves contractile due to actomyosin activity. However, the resulting vertex forces can be either contractile or extensile depending on network geometry and tension distribution. This is captured in our coarse-grained description [see Armengol-Collado et al. eLife 13:e86400 (2024)], where the active stress emerges from higher-order moments of cellular forces. Specifically, the deviatoric part of the hexatic active stress tensor is expressed in terms of the cell radius, the cell number density, and the intensity of cellular tension. The negative sign of the coefficient of the active stress shows that the active stress is extensile, consistent with observations in various epithelial systems (e.g., Saw et al., Nature 2017; Blanch-Mercader et al., Phys. Rev. Lett. 2018). Finally, we note that the connection between cellular-scale forces and large-scale extensility has been rationalized in other contexts, such as active nematics (Balasubramaniam et al., Nat. Mater. 2021).

      Recommendations for the Authors: Reviewer #2 (Recommendations for the Authors):

      (1) The authors point out that hexatic topological defects are produced in quadrupoles (L109). Does this also mean that these defects can be annihilated only in quadrupoles as well? In the same vein, are hexatic defects always bound in pairs, as suggested by the schematics, or is it possible to observe an isolated hexatic defect?

      We thank the referee for their question. Hexatic disclinations (the defect monopoles discussed in this work), much like electrons and positrons, can annihilate in any charge-neutral configuration (dipole, quadrupole, octupole, etc.). Unbinding a pair of hexatic disclinations, however, costs much more energy than unbinding a quadrupole into dipoles. Hence isolated defects appear in abundance only in the late, fully disordered phase, where the system has completely “melted”. For more details on how defect unbinding modes affect tissue dynamics, please see our subsequent work Puggioni et al., arXiv:2502.09554.

      (2) Could you clarify if the flows described in Figures 2(a)-(b), panel (i) are driven by a passive backflow term without activity? Could you compare the magnitudes of these flows compared to the typical active terms?

      We thank the referee for their question. In panel 2(b) there is only passive backflow. In 2(a) instead, both terms are included, in a regime of parameters where the active flow overcomes the passive flow (and hence the active force overcomes the passive force, as delineated in the Discussion section). The magnitude of the passive flows is studied in detail in our previous work Krommydas et al. (Phys. Rev. Lett. 2023).

      (3) Could you clarify how the continuum hexatic model and MPF model are related to each other? What are the similarities and differences in the dynamics of these models?

      We thank the referee for this insightful question. A key point of our work is precisely that the continuum hexatic model and the MPF (Multi-Phase Field) model are distinct in nature.

      The MPF model is an established agent-based framework used to simulate tissue dynamics at the cellular level. It captures individual cell behaviors and interactions through phase-field variables. In our work, we use the MPF model as a benchmark to extract statistical features of tissue dynamics, such as defect motion and orientational correlations. In contrast, our continuum hexatic model is a coarse-grained hydrodynamic theory that describes the dynamics of orientational order in active tissues. It is built on symmetry principles and conservation laws, and it does not rely on microscopic cell-level details. Instead, it captures the collective behavior of the system through a hexatic order parameter and its coupling to flow and activity.

      Despite their conceptual differences, the MPF model and our hydrodynamic theory exhibit similar statistical features. This agreement—also observed in the independent study by Jain et al. (Phys. Rev. Res. 2024)—provides strong support for the validity and generality of our continuum description.
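
      For readers unfamiliar with hexatic order, the sketch below computes a textbook bond-orientational order parameter, the average of exp(6iθ) over the directions θ from a cell centre to its neighbours. It is offered only as an illustration of what a 6-fold order parameter measures; it is an assumption of this example and not necessarily the definition used in the manuscript or in the MPF analysis, and the function name `psi6` and the test configurations are hypothetical.

```python
import cmath
import math


def psi6(center, neighbors):
    """Average of exp(6i * theta) over the directions to the neighbours."""
    cx, cy = center
    total = sum(
        cmath.exp(6j * math.atan2(y - cy, x - cx)) for x, y in neighbors
    )
    return total / len(neighbors)


# Six neighbours on a regular hexagon: perfect 6-fold order.
hexagon = [
    (math.cos(k * math.pi / 3), math.sin(k * math.pi / 3)) for k in range(6)
]
# Four neighbours on a square: the 6-fold contributions cancel.
square = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

hex_order = abs(psi6((0.0, 0.0), hexagon))
square_order = abs(psi6((0.0, 0.0), square))
```

      The magnitude is close to one for the hexagonal arrangement and close to zero for the square one, while the phase of the complex number encodes the local 6-fold orientation.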

      (4) When multiple references by the same author and year are cited using alphabets, the second alphabet is not in bold e.g. Giomi et al., 2022b, a in Line 75, and others.

      We are grateful to the referee for carefully going through the manuscript and pointing out these typos. We have corrected them in the amended manuscript.

      Reviewer #3 (Public Review):

      In this manuscript, the authors discuss epithelial tissue fluidity from a theoretical perspective. They focus on the description of topological transitions whereby cells change neighbors (T1 transitions). They explain how such transitions can be described by following the fate of hexatic defects. They first focus on a single T1 transition and the surrounding cells using a hydrodynamic model of active hexatics. They show that successful T1 intercalations, which promote tissue fluidity, require a sufficiently large extensile hexatic activity in the neighborhood of the cells attempting a T1 transition. If such activity is contractile or not sufficiently extensile, the T1 is reversed, hexatic defects annihilate, and the epithelial network configuration is unchanged. They then describe a large epithelium, using a phase field model to describe cells. They show a correlation between T1 events and hexatic defects unbinding, and identify two populations of T1 cells: one performing T1 cycles (failed T1), and not contributing to tissue migration, and one performing T1 intercalation (successful T1) and leading to the collective cell migration.

      Strengths

      The manuscript is scientifically sound, and the variety of numerical and analytical tools they use is impressive. The approach and results are very interesting and highlight the relevance of hexatic order parameters and their defects in describing tissue dynamics.

      We thank the Reviewer for recognizing the scientific soundness of the manuscript, the breadth of numerical and analytical tools employed, as well as their interest in our work.

      Weaknesses

      (1) Goal and message of the paper. (a) In my opinion, the article is mainly theoretical and should be presented as such. For instance, their conclusions and the consequences of their analysis in terms of biology are not extremely convincing, although they would be sufficient for a theory paper oriented to physicists or biophysicists. The choice of journal and potential readership should be considered, and I am wondering whether the paper structure should be re-organized, in order to have side-by-side the methods and the results, for instance (see also below).

      We thank the referee for their criticism. In response, we have made an effort to reword certain parts of the manuscript. As with any theoretical study, the biological implications of our work can only be fully assessed through experimental validation — a prospect we look forward to. Nevertheless, we have submitted our work to the subsection of Physics of Life, which we believe is perfectly suited to our content.

      (b) Currently, the two main results sections are somewhat disconnected, because they use different numerical models, and because the second section only marginally uses the results from the first section to identify/distinguish T1.

      We thank the referee for their comment. In the second section we use statistics from the MPF model to support the analytical and numerical findings of our hydrodynamic theory of cell intercalation. Since our submission, further qualitative evidence has been brought to light in the work of Jain et al. (Phys. Rev. Res. 2024).

      (2) Quite surprisingly, the authors use a cell-based model to describe the macroscopic tissue-scale behavior, and a hydrodynamic model to describe the cell-based events. In particular, their hydrodynamic description (the active hexatic model) is supposed to be a coarse-grained description, valid to capture the mesoscopic physics, and yet, they use it to describe cell-scale events (T1 transitions). For instance, what is the meaning of the velocity field they are discussing in Figure 2? This makes me question the validity of the results of their first part.

      We thank the referee for their comment. There are many excellent discrete models of epithelial tissues in the literature (e.g., Bi et al., Phys. Rev. X 2016; Pasupalak et al., Soft Matter 2020; Graner et al., Phys. Rev. Lett. 1992), each capturing essential biological features such as cell division, apoptosis and sorting. While these models have provided invaluable insights, our work takes a different approach by developing a continuum theory aimed at describing epithelial dynamics at two levels: (1) mesoscopic intercalation events and (2) macroscopic collective migration. Crucially, our goal is not to replicate a specific discrete model, which would risk constructing a “model of a model”, but rather to derive a hydrodynamic description of tissue dynamics grounded in symmetry principles and conservation laws. Following this logic, the velocity field in our theory should be interpreted as an Eulerian (continuum) velocity, representing the coarse-grained flow of the tissue rather than the Lagrangian motion of individual cells. This distinction is central to our framework, which operates at scales where cellular details are averaged out, yet retains the essential physics of hexatic order and active stresses. We validate our predictions against the Multiphase Field (MPF) model. [We thank Reviewer 1 for their suggestion to incorporate further MPF literature.] Furthermore, Jain et al. (Phys. Rev. Res. 2024) have used the MPF to predict flow patterns around T1 transitions and obtained results compatible with those of our hydrodynamic theory. From this comparison we can conclude that both the MPF and our theory are able to capture the same aspect of cell intercalation in epithelial layers. This, however, does not imply that other discrete models of epithelia can reproduce this aspect too, nor that our theory is specifically tailored to the MPF model. We have clarified these points in the revised manuscript and expanded our discussion of the MPF model.

      (3) The quality of the numerical results presented in the second part (phase field model) could be improved. (a) In terms of analysis of the defects. It seems that they have all the tools to compare their cell-resolved simulations and their predictions about how a T1 event translates into defects unbinding. However, their analysis in Figure 3e is relatively minimal: it shows a correlation between T1 cells and defects. But it says nothing about the structure and evolution of the defects, which, according to their first section, should be quite precise.

      We thank the referee for their comment. Further qualitative evidence has been brought to light in the work of Jain et al. (Phys. Rev. Res. 2024), where the exact flow pattern predicted by our hydrodynamic theory is obtained, in the MPF, around cells undergoing T1 rearrangements.

      (b) In terms of clarity of the presentation. For instance, in Figure 3f, they plot the mean-square displacement as a function of a defect density. I thought that MSD was a time-dependent quantity: they must therefore consider MSD at a given time, or averaged over time. They should be explicit about what their definition of this quantity is.

      We thank the referee for raising this point. As clarified in our response to Reviewer 1, point 3, the mean square displacement (MSD) plotted in Fig. 3f is computed over a fixed time window of ∆t = 25×10<sup>3</sup> iterations, chosen to match the typical duration of T1 events and defect lifetimes. The MSD is normalized by the subregion area and averaged over time within each window. We have now made this explicit in the amended version of the manuscript.
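
      The fixed-window MSD described in this reply can be illustrated with a short sketch. It is not the authors' analysis code: the function name `windowed_msd`, the trajectory layout, and the toy numbers are assumptions made for illustration. Only the ingredients (a fixed lag, averaging over time, and normalization by the subregion area) are taken from the reply.

```python
def windowed_msd(trajectories, lag, area):
    """Time-averaged mean-square displacement at a fixed lag.

    trajectories: list of per-cell tracks, each a list of (x, y) positions
                  sampled at equal intervals (hypothetical data layout).
    lag:          fixed window length in samples (stands in for the fixed dt).
    area:         subregion area used for normalization.
    """
    total, count = 0.0, 0
    for track in trajectories:
        # Slide the window over every admissible start time in the track.
        for t in range(len(track) - lag):
            dx = track[t + lag][0] - track[t][0]
            dy = track[t + lag][1] - track[t][1]
            total += dx * dx + dy * dy
            count += 1
    if count == 0:
        raise ValueError("tracks are shorter than the lag")
    return (total / count) / area


# Toy example: one cell moving one unit per step along x.
track = [(float(i), 0.0) for i in range(6)]
msd = windowed_msd([track], lag=2, area=4.0)
```

      Because the lag is fixed, the result is a single number per subregion, which is what makes it possible to plot the MSD against defect density.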

      (c) In terms of statistics. For instance, Figure 3g is used to study the role of rotational diffusion on the average time between T1s. The error bars in this figure are huge and make their claims hardly supported. Their claim of a “monotonic decay” of the average time between intercalations is also not fully supported given their statistics.

      We appreciate the Reviewer’s comment regarding the statistical robustness of Fig. 3g. While we acknowledge that the error bars are substantial, reflecting the inherent variability in cell intercalation dynamics, the yellow curve does exhibit a consistent downward trend in the average time between T1 transitions as rotational diffusion increases. This monotonic decrease is visible across the entire range of variation of the rotational diffusion Dr, and is statistically supported when considering the trend over independent simulations. To address this concern, we have adjusted the wording in the main text: instead of stating that “the former is a monotonically decreasing function of Dr,” we now write that “the former displays a decreasing trend with Dr,” which better reflects the statistical variability while preserving the observed behavior.

      Reviewer #3 (Recommendations for the Authors):

      (1) Section 1 is difficult to follow due to multiple reasons: early but delayed definitions, unclear use of T1 intercalation vs. T1 cycles, disconnected figures and unclear simulation descriptions. We recommend including simulation setup details earlier and restructuring the flow of arguments.

      We thank the referee for their comment. We have made an effort to reword and clarify the text in our amended manuscript. We are slightly confused by what they mean by “early but delayed definitions”. If they could clarify, we would be happy to amend the position and phrasing of these definitions accordingly.

      (2) It could be useful to have an additional figure early on defining schematically hexatic defects and an illustration showing an epithelium (or a simulation), similar to what the authors have produced in some of their other publications on this topic.

      We thank the referee for their comment. Figures 3c and 3d show what a hexatic defect looks like in a simulation of the epithelium. Following the referee’s recommendation, we have added a note in the caption of Figure 3, citing our work where we show the same defects in MDCK epithelial monolayers (Armengol et al., Nat. Phys. 2023).

      (3) Minor points and typos:

      Line 88: the bond between vertices shrinks, not the vertices.

      Figure 1: the 1/6 is displayed as 1 6 (fraction bar missing).

      Line 232: “and order” → “one/an order”.

      Line 237: Fig. 3g) → Fig. 3g

      Line 298: “nu” and “v” hard to distinguish in eLife font.

      Methods: define all notation clearly (e.g., tensor product exponent, D/Dt in Eq. 3c).

      Methods: “cell orientation, coarse-graining and topological defects” section is difficult to follow, schematic would help.

      Line 457 onward: unclear how panels (ii-iv) of Fig. 2ab are obtained.

      Line 480 onward: not referenced in main text.

      Figure 2: “avalancHe” typo.

      Figure 2 caption: “cell intercalaTION” typo.

      Movies are neither referenced nor explained.

      Figure 5 and 6 are not referenced in the main text.

      We thank the referee for their detailed read of the paper. We have corrected all typos.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review)

      Summary:

      This study by Park and colleagues uses longitudinal saliva viral load data from two cohorts (one in the US and one in Japan from a clinical trial) in the pre-vaccine era to subset viral shedding kinetics and then use machine learning to attempt to identify clinical correlates of different shedding patterns. The stratification method identifies three separate shedding patterns discriminated by peak viral load, shedding duration, and clearance slope. The authors also assess micro-RNAs as potential biomarkers of severity but do not identify any clear relationships with viral kinetics.

      Strengths:

      The cohorts are well developed, the mathematical model appears to capture shedding kinetics fairly well, the clustering seems generally appropriate, and the machine learning analysis is a sensible, albeit exploratory approach. The micro-RNA analysis is interesting and novel.

      Weaknesses:

      The conclusions of the paper are somewhat supported by the data but there are certain limitations that are notable and make the study's findings of only limited relevance to current COVID-19 epidemiology and clinical conditions.

      We sincerely appreciate the reviewer’s thoughtful and constructive comments, which have been invaluable in improving the quality of our study. We have carefully revised the manuscript to address all points raised.

      (1) The study only included previously uninfected, unvaccinated individuals without the omicron variant. It has been well documented that vaccination and prior infection both predict shorter duration shedding. Therefore, the study results are no longer relevant to current COVID-19 conditions. This is not at all the authors' fault but rather a difficult reality of much retrospective COVID research.

      Thank you for your comment. We agree with the reviewer’s comment that some of our results may not provide insight into current COVID-19 conditions, since most people have either already been infected with COVID-19 or been vaccinated. We revised our manuscript to discuss this (page 22, lines 364-368). Nevertheless, we believe it is novel that we have extensively investigated the relationship between viral shedding patterns in saliva and a wide range of clinical and microRNA data, and that developing a method to do so remains important for providing insight into early responses to novel emerging viral diseases in the future. Therefore, we still believe that our findings are valuable.

      (2) The target cell model, which appears to fit the data fairly well, has clear mechanistic limitations. Specifically, if such a high proportion of cells were to get infected, then the disease would be extremely severe in all cases. The authors could specify that this model was selected for ease of use and to allow clustering, rather than to provide mechanistic insight. It would be useful to list the AIC scores of this model when compared to the model by Ke.

      Thank you for your feedback and suggestion regarding our mathematical model. As the reviewer pointed out, in this study we adopted a simple model (the target cell-limited model) to focus on reconstructing viral dynamics and stratifying shedding patterns rather than exploring the mechanism of viral infection in detail. Nevertheless, we believe that the target cell-limited model provides reasonable reconstructed viral dynamics, as it has been used in many previous studies. We revised the manuscript to clarify this point (page 10, lines 139-144). We also revised our manuscript to provide a more detailed description of the model comparison along with information about AIC (page 10, lines 130-135).

      (3) Line 104: I don't follow why including both datasets would allow one model to work better than the other. This requires more explanation. I am also not convinced that non-linear mixed effects approaches can really be used to infer early model kinetics in individuals from one cohort by using late viral load kinetics in another (and vice versa). The approach seems better for making population-level estimates when there is such a high amount of missing data.

      Thank you for your feedback. Your comment made us realize that our explanation was insufficient. We intended to say that, by including both datasets, data fitting can be performed to the same level for both models, not to compare the performance of the two models. We revised the manuscript to clarify this point (page 10, lines 135-139).

      Additionally, we agree that nonlinear mixed effects models are a useful approach for making population-level estimates when data are missing. A nonlinear mixed effects model also has the advantage of yielding reasonable parameter estimates for individuals with too few data points, by drawing on the distribution of parameters across the other individuals. With these advantages in mind, we adopted a nonlinear mixed effects model in our study. We also revised the manuscript to clarify this (page 27, lines 472-483).

      (4) Along these lines, the three clusters appear to show uniform expansion slopes whereas the NBA cohort, a much larger cohort that captured early and late viral loads in most individuals, shows substantial variability in viral expansion slopes. In Figure 2D: the upslope seems extraordinarily rapid relative to other cohorts. I calculate a viral doubling time of roughly 1.5 hours. It would be helpful to understand how reliable of an estimate this is and also how much variability was observed among individuals.

      We appreciate your detailed feedback on the estimated up-slope of viral dynamics. As the reviewer noted, the pattern differs from that observed in the NBA cohort, which may be due to their measurement of viral load from upper respiratory tract swabs. In our estimation, the mean and standard deviation of the doubling time (defined as ln2/(𝛽𝑇<sub>0</sub>𝑝𝑐<sup>−1</sup> − 𝛿)) were 1.44 hours and 0.49 hours, respectively. Although direct validation of these values is challenging, several previous studies, including our own, have reported that viral loads in saliva increase more rapidly than in the upper respiratory tract swabs, reaching their peak sooner. Thus, we believe that our findings are consistent with those of previous studies. We revised our manuscript to discuss this point with additional references (page 20, lines 303-311).
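
      As a worked illustration of the doubling-time expression quoted in this reply, the sketch below evaluates ln2/(βT0·p/c − δ) and converts the result to hours. The parameter values are hypothetical and chosen only to give a round growth rate. The assumption that all rates are expressed per day is ours and is not stated in the reply.

```python
import math


def doubling_time_hours(beta, T0, p, c, delta):
    """Doubling time ln(2) / (beta*T0*p/c - delta), converted from days to hours.

    All rate parameters are assumed to be expressed per day (an assumption
    of this sketch, not a statement about the fitted model).
    """
    growth_rate = beta * T0 * p / c - delta  # early exponential growth rate
    if growth_rate <= 0:
        raise ValueError("no exponential growth: rate is not positive")
    return 24.0 * math.log(2) / growth_rate


# Hypothetical values chosen only so that beta*T0*p/c equals 12 per day.
example = doubling_time_hours(beta=1.2e-6, T0=1.0e7, p=20.0, c=20.0, delta=1.0)
```

      Under these illustrative values the net growth rate is 11 per day, which gives a doubling time of about 1.5 hours, the same order as the reported mean of 1.44 hours.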

      (5) A key issue is that a lack of heterogeneity in the cohort may be driving a lack of differences between the groups. Table 1 shows that Sp02 values and lab values that all look normal. All infections were mild. This may make identifying biomarkers quite challenging.

      Thank you for your comment regarding heterogeneity in the cohort. Although the NFV cohort was designed for COVID-19 patients who were either mild or asymptomatic, we have addressed this point and revised the manuscript to discuss it (page 21, lines 334-337).

      (6) Figure 3A: many of the clinical variables such as basophil count, Cl, and protein have very low pre-test probability of correlating with virologic outcome.

      Thank you for your comment regarding some clinical information we used in our study. We revised our manuscript to discuss this point (page 21, lines 337-338).

      (7) A key omission appears to be microRNA from pre and early-infection time points. It would be helpful to understand whether microRNA levels at least differed between the two collection timepoints and whether certain microRNAs are dynamic during infection.

      Thank you for your comment regarding the collection of micro-RNA data. As suggested by the reviewer, we compared micro-RNA levels between two time points using pairwise t-tests and Mann-Whitney U tests with FDR correction. As a result, no micro-RNA showed a statistically significant difference. This suggests that micro-RNA levels remain relatively stable during the course of infection, at least for mild or asymptomatic infection, and may therefore serve as a biomarker independent of sampling time. We have revised the manuscript to include this information (page 17, lines 259-262).
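
      The multiple-testing step mentioned in this reply can be sketched as follows. The reply states only that an FDR correction was applied. The Benjamini-Hochberg procedure shown here is an assumption, as are the function name and the example p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per microRNA.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adj]
```

      In this toy case three raw p-values fall below 0.05, but only one remains significant after adjustment, which shows why a screen across many microRNAs can end with no significant hits.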

      (8) The discussion could use a more thorough description of how viral kinetics differ in saliva versus nasal swabs and how this work complements other modeling studies in the field.

      We appreciate the reviewer’s thoughtful feedback. As suggested, we have added a discussion comparing our findings with studies that analyzed viral dynamics using nasal swabs, thereby highlighting the differences between viral dynamics in saliva and in the upper respiratory tract. To ensure a fair and rigorous comparison, we referred to studies that employed the same mathematical model (i.e., Eqs.(1-2)). Accordingly, we revised the manuscript and included additional references (page 20, lines 303-311).

      Furthermore, we clarified the significance of our study in two key aspects. First, it provides a detailed analysis of viral dynamics in saliva, reinforcing our previous findings from a single cohort by extending them across multiple cohorts. Second, this study uniquely examines whether viral dynamics in saliva can be directly predicted by exploring diverse clinical data and micro-RNAs. Notably, cohorts that have simultaneously collected and reported both viral load and a broad spectrum of clinical data from the same individuals, as in our study, are exceedingly rare. We revised the manuscript to clarify this point (page 20, lines 302-311).

      (9) The most predictive potential variables of shedding heterogeneity which pertain to the innate and adaptive immune responses (virus-specific antibody and T cell levels) are not measured or modeled.

      Thank you for your comment. We agree that antibody and T cell related markers may serve as the most powerful predictors, as supported by our own study [S. Miyamoto et al., PNAS (2023), ref. 24] as well as previous reports. While this point was already discussed in the manuscript, we have revised the text to make it more explicit (page 21, lines 327-328).

      (10) I am curious whether the models infer different peak viral loads, duration, expansion, and clearance slopes between the 2 cohorts based on fitting to different infection stage data.

      Thank you for your comment. We compared features between the two cohorts as the reviewer suggested. A statistically significant difference between the two cohorts (i.e., p-value ≤ 0.05 from the t-test) was observed only for the peak viral load, with overall trends being largely similar. At the peak, the mean value was 7.5 log<sub>10</sub> (copies/mL) in the Japan cohort and 8.1 log<sub>10</sub> (copies/mL) in the Illinois cohort, with variances of 0.88 and 0.87, respectively, indicating comparable variability.

      Reviewer #2 (Public review)

      Summary:

      This study argues that it has stratified viral kinetics for saliva specimens into three groups by the duration of "viral shedding"; the authors could not identify clinical data or microRNAs that correlate with these three groups.

      Strengths:

      The question of whether there is a stratification of viral kinetics is interesting.

      Weaknesses:

      The data underlying this work are not treated rigorously. The work in this manuscript is based on PCR data from two studies, with most of the data coming from a trial of nelfinavir (NFV) that showed no effect on the duration of SARS-CoV-2 PCR positivity. This study had no PCR data before symptom onset, and thus exclusively evaluated viral kinetics at or after peak viral loads. The second study is from the University of Illinois; this data set had sampling prior to infection, so has some ability to report the rate of "upswing." Problems in the analysis here include:

      We are grateful to the reviewer for the constructive feedback, which has greatly enhanced the quality of our study. In response, we have carefully revised the manuscript to address all comments.

      The PCR Ct data from each study is treated as equivalent and referred to as viral load, without any reports of calibration of platforms or across platforms. Can the authors provide calibration data and justify the direct comparison as well as the use of "viral load" rather than "Ct value"? Can the authors also explain on what basis they treat Ct values in the two studies as identical?

      Thank you for your comment regarding the description of the viral load data. Your comment made us realize that we had not explained how the viral load data were integrated. We calculated viral load from Ct values using a separate linear regression equation between Ct and viral load for each study's measurement method. We revised the manuscript to clarify this point in the Saliva viral load data section of the Methods.
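
      A minimal sketch of a per-platform conversion of this kind is given below. The slopes and intercepts are hypothetical placeholders, not the calibration values of either study, and the names `CALIBRATION` and `convert` are ours.

```python
def ct_to_log10_load(ct, slope, intercept):
    """Map a Ct value to log10 viral load (copies/mL) via a linear standard curve."""
    return slope * ct + intercept


# Hypothetical calibration coefficients for two platforms (NOT the studies' values).
CALIBRATION = {
    "platform_a": {"slope": -0.30, "intercept": 12.0},
    "platform_b": {"slope": -0.32, "intercept": 13.0},
}


def convert(ct, platform):
    """Apply the platform-specific standard curve to one Ct value."""
    coeffs = CALIBRATION[platform]
    return ct_to_log10_load(ct, coeffs["slope"], coeffs["intercept"])


# The same Ct maps to different loads on different platforms.
load_a = convert(25.0, "platform_a")
load_b = convert(25.0, "platform_b")
```

      The point of the sketch is that identical Ct values need not correspond to identical viral loads, so each platform is converted with its own curve before the datasets are pooled.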

      The limit of detection for the NFV PCR data was unclear, so the authors assumed it was the same as the University of Illinois study. This seems a big assumption, as PCR platforms can differ substantially. Could the authors do sensitivity analyses around this assumption?

      Thank you for your comment regarding the detection limit for the viral load data. As the reviewer suggested, we conducted a sensitivity analysis of the assumed detection limit for the NFV dataset. Specifically, we performed data fitting in the same manner for two scenarios: when the detection limit of NFV PCR was lower (0 log<sub>10</sub> copies/mL) or higher (2 log<sub>10</sub> copies/mL) than that of the Illinois data (1.08 log<sub>10</sub> copies/mL), and compared the results.

      As a result, we obtained largely comparable viral dynamics in most cases (Supplementary Fig 6). When comparing the AIC values, we observed that the AIC for the same censoring threshold was 6836, whereas it increased to 7403 under the lower censoring threshold and decreased to 6353 under the higher censoring threshold. However, this difference may be attributable to the varying number of data points treated as below the detection limit. Specifically, when the threshold is set higher, more data are treated as below the detection limit, which may result in a more favorable error calculation. To discuss this point, we have added a new figure (Supplementary Fig 6) and revised the manuscript accordingly (page 25, lines 415-418).
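
      The effect described in this reply, where a higher censoring threshold yields a lower AIC, can be reproduced in a toy setting. The sketch assumes a Gaussian error model on log10 viral load in which points at or below the detection limit contribute a cumulative probability rather than a density. That is a common treatment of left-censored data, but the reply does not confirm that it is the exact likelihood used in the fitting. All numbers are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_loglik(observed, predicted, sigma, lod):
    """Gaussian log-likelihood on log10 loads with left-censoring at lod."""
    ll = 0.0
    for y, mu in zip(observed, predicted):
        if y <= lod:
            # Censored point: only P(Y <= lod) enters the likelihood.
            ll += math.log(normal_cdf((lod - mu) / sigma))
        else:
            z = (y - mu) / sigma
            ll += -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
    return ll


def aic(loglik, n_params):
    """Akaike information criterion."""
    return 2.0 * n_params - 2.0 * loglik


# Hypothetical observations and model predictions (log10 copies/mL).
observed = [6.0, 4.0, 1.5, 0.5]
predicted = [5.8, 4.3, 2.0, 1.2]
aic_low = aic(censored_loglik(observed, predicted, 1.0, lod=0.0), n_params=3)
aic_high = aic(censored_loglik(observed, predicted, 1.0, lod=2.0), n_params=3)
```

      Here the higher threshold censors two of the four points and produces the smaller AIC, even though the data and the predictions are unchanged. This is why AIC values obtained under different thresholds are hard to compare directly.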

      The authors refer to PCR positivity as viral shedding, but it is viral RNA detection (very different from shedding live/culturable virus, as shown in the Ke et al. paper). I suggest updating the language throughout the manuscript to be precise on this point.

      We appreciate the reviewer’s feedback regarding the terminology used for viral shedding. In response, we have revised all instances of “viral shedding” to “viral RNA detection” throughout the manuscript as suggested.

      Eyeballing extended data in Figure 1, a number of the putative long-duration infections appear to be likely cases of viral RNA rebound (for examples, see S01-16 and S01-27). What happens if all the samples that look like rebound are reanalyzed to exclude the late PCR detectable time points that appear after negative PCRs?

      We sincerely thank the reviewer for the valuable suggestion. In response, we established a criterion to remove data that appeared to exhibit rebound and subsequently performed data fitting (see Author response image 1 below). The criterion was defined as: “any data that increase again after reaching the detection limit in two measurements are considered rebound and removed.” As a result, 15 out of 144 cases were excluded due to insufficient usable data, leaving 129 cases for analysis. Using a single detection limit as the criterion would have excluded too many data points, while defining the criterion solely based on the magnitude of increase made it difficult to establish an appropriate “threshold for increase.”
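
      One possible reading of the rebound criterion is sketched below. Two details are our assumptions: that the two measurements at the detection limit must be consecutive, and that every point from the rebound onward is dropped. The function name and the example series are hypothetical. The detection limit of 1.08 log10 copies/mL is the Illinois value quoted earlier in the response.

```python
def drop_rebound(loads, lod, n_below=2):
    """Truncate a series at the first rebound.

    loads: chronologically ordered log10 viral loads for one participant.
    A rebound is assumed to be any value above lod that follows
    n_below consecutive measurements at or below lod.
    """
    below_run = 0
    for idx, value in enumerate(loads):
        if value <= lod:
            below_run += 1
        elif below_run >= n_below:
            return list(loads[:idx])  # keep everything before the rebound
        else:
            below_run = 0
    return list(loads)


# Two consecutive measurements at the limit, then a rise: treated as rebound.
series = [6.2, 4.1, 1.0, 1.0, 3.5, 2.8]
cleaned = drop_rebound(series, lod=1.08)

# A single measurement at the limit does not trigger removal.
untouched = drop_rebound([6.2, 1.0, 3.5, 1.0], lod=1.08)
```

      Changing `n_below` to 1 shows how much more data a single-measurement rule would discard, which is the trade-off described in this reply.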

      The fitting result indicates that the removal of rebound data may influence the fitting results; however, direct comparison of subsequent analyses, such as clustering, is challenging due to the reduced sample size. Moreover, the results can vary substantially depending on the criterion used to define rebound, and establishing a consistent standard remains difficult. Accordingly, we retained the current analysis and have added a discussion of rebound phenomena in the Discussion section as a limitation (page 22, lines 355-359). We once again sincerely appreciate the reviewer’s insightful and constructive suggestion.

      Author response image 1.

      Comparison of model fits before and after removing data suspected of rebound. Black dots represent observed measurements, and the black and yellow curves show the fitted viral dynamics for the full dataset and the dataset with rebound data removed, respectively.

      There's no report of uncertainty in the model fits. Given the paucity of data for the upslope, there must be large uncertainty in the up-slope and likely in the peak, too, for the NFV data. This uncertainty is ignored in the subsequent analyses. This calls into question the efforts to stratify by the components of the viral kinetics. Could the authors please include analyses of uncertainty in their model fits and propagate this uncertainty through their analyses?

      We sincerely appreciate the reviewer’s detailed feedback on model uncertainty. To address this point, we revised Extended Fig 1 (now renumbered as Supplementary Fig 1) to include 95% credible intervals computed using a bootstrap approach. In addition, to examine the potential impact of model uncertainty on stratified analyses, we reconstructed the distance matrix underlying stratification by incorporating feature uncertainty. Specifically, for each individual, we sampled viral dynamics within the credible interval, averaged the resulting features, and built the distance matrix from them. We then compared this uncertainty-adjusted matrix with the original one using the Mantel test, which showed a strong correlation (r = 0.72, p < 0.001). Given this result, we did not replace the current stratification but revised the manuscript to provide this information in the Results and Methods sections (page 11, lines 159-162 and page 28, lines 512-519). Once again, we are deeply grateful for this insightful comment.
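
      For readers unfamiliar with the Mantel test, a minimal permutation version is sketched below. It is a generic illustration using Pearson correlation and a one-sided permutation p-value, not the authors' implementation. The function names, the number of permutations, and the toy distance matrices are assumptions.

```python
import math
import random


def upper_triangle(matrix, order):
    """Flatten the upper triangle of a matrix after reordering rows and columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(d1, d2, n_perm=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(d1)))
    x = upper_triangle(d1, identity)
    r_obs = pearson(x, upper_triangle(d2, identity))
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)
        # Permute rows and columns of d2 together to break the pairing.
        if pearson(x, upper_triangle(d2, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)


# Toy data: two noisy versions of the same one-dimensional feature.
points_a = [0.0, 1.0, 2.0, 4.0, 7.0]
points_b = [0.1, 1.2, 1.9, 4.3, 6.8]
d_a = [[abs(p - q) for q in points_a] for p in points_a]
d_b = [[abs(p - q) for q in points_b] for p in points_b]
r, p_value = mantel(d_a, d_b)
```

      Rows and columns are permuted together because the entries of a distance matrix are not independent, so an ordinary correlation test on the flattened matrices would overstate significance.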

      The clinical data are reported as a mean across the course of an infection; presumably vital signs and blood test results vary substantially, too, over this duration, so taking a mean without considering the timing of the tests or the dynamics of their results is perplexing. I'm not sure what to recommend here, as the timing and variation in the acquisition of these clinical data are not clear, and I do not have a strong understanding of the basis for the hypothesis the authors are testing.

      We appreciate the reviewer’s feedback on the clinical data. Your comment made us realize that the manuscript did not describe how the clinical data were handled. In this research, we focused on finding “early predictors” that could provide insight into viral shedding patterns. Thus, we used clinical data measured at the earliest time point (date of admission) for each patient. Another reason is that the date of admission is almost the only time point at which complete clinical data without any missing values are available for all participants. We revised our manuscript to clarify this point (page 5, lines 90-95).

      It's unclear why microRNAs matter. It would be helpful if the authors could provide more support for their claims that (1) microRNAs play such a substantial role in determining the kinetics of other viruses and (2) they play such an important role in modulating COVID-19 that it's worth exploring the impact of microRNAs on SARS-CoV-2 kinetics. A link to a single review paper seems insufficient justification. What strong experimental evidence is there to support this line of research?

      We appreciate the reviewer’s comments regarding microRNA. Based on this feedback, we recognized the need to clarify our rationale for selecting microRNAs as the analyte. The primary reason was that our available specimens were saliva, and microRNAs are among the biomarkers that can be reliably measured in saliva. At the same time, previous studies have reported associations between microRNAs and various diseases, which led us to consider the potential relevance of microRNAs to viral dynamics, beyond their role as general health indicators. To better reflect this context, we have added supporting references (page 17, lines 240-243).

      Reviewer #3 (Public review)

      The article presents a comprehensive study on the stratification of viral shedding patterns in saliva among COVID-19 patients. The authors analyze longitudinal viral load data from 144 mildly symptomatic patients using a mathematical model, identifying three distinct groups based on the duration of viral shedding. Despite analyzing a wide range of clinical data and micro-RNA expression levels, the study could not find significant predictors for the stratified shedding patterns, highlighting the complexity of SARS-CoV-2 dynamics in saliva. The research underscores the need for identifying biomarkers to improve public health interventions and acknowledges several limitations, including the lack of consideration of recent variants, the sparsity of information before symptom onset, and the focus on symptomatic infections. 

      The manuscript is well-written, with the potential for enhanced clarity in explaining statistical methodologies. This work could inform public health strategies and diagnostic testing approaches. However, there is a thorough development of new statistical analysis needed, with major revisions to address the following points:

      We sincerely appreciate the thoughtful feedback provided by Reviewer #3, particularly regarding our methodology. In response, we conducted additional analyses and revised the manuscript accordingly. Below, we address the reviewer’s comments point by point.

      (1) Patient characterization & selection: Patient immunological status at inclusion (and if it was accessible at the time of infection) may be the strongest predictor for viral shedding in saliva. The authors state that the patients were not previously infected by SARS-COV-2. Was Anti-N antibody testing performed? Were other humoral measurements performed or did everything rely on declaration? From Figure 1A, I do not understand the rationale for excluding asymptomatic patients. Moreover, the mechanistic model can handle patients with only three observations, why are they not included? Finally, the 54 patients without clinical data can be used for the viral dynamics fitting and then discarded for the descriptive analysis. Excluding them can create a bias. All the discarded patients can help the virus dynamics analysis as it is a population approach. Please clarify. In Table 1 the absence of sex covariate is surprising.

      We appreciate the detailed feedback from the reviewer regarding patient selection. We relied on the patient's self-declaration to determine the patient's history of COVID-19 infection and revised the manuscript to specify this (page 6, lines 83-84).

      In parameter estimation, we used the date of symptom onset for each patient to establish the baseline of the time axis as clearly as possible, as in our previous works. Accordingly, asymptomatic patients, who have no date of symptom onset, were excluded from the analysis. Additionally, in the cohort we analyzed, most patients excluded for a limited number of observations (i.e., fewer than 3 points) already had a viral load close to the detection limit at the time of the first measurement. This is due to the design of the clinical trial: if a negative result was obtained twice in a row, no further follow-up sampling was performed. These patients were excluded from the analysis because it was hard to obtain reasonable fitting results. Also, we used 54 patients for the viral dynamics fitting and then only used the NFV cohort for clinical data analysis. We acknowledge that our description may have confused readers. We revised our manuscript to clarify these points regarding patient selection for data fitting (page 6, lines 96-102, page 24, lines 406-407, and page 7, lines 410-412). In addition, we realized, thanks to the reviewer’s comment, that gender information was missing in Table 1. We appreciate this observation and have revised the table to include gender (we used gender in our analysis).

      (2) Exact study timeline for explanatory covariates: I understand the idea of finding « early predictors » of long-lasting viral shedding. I believe it is key and a great question. However, some samples (Figure 4A) seem to be taken at the end of the viral shedding. I am not sure it is really easier to micro-RNA saliva samples than a PCR. So I need to be better convinced of the impact of the possible findings. Generally, the timeline of explanatory covariate is not described in a satisfactory manner in the actual manuscript. Also, the evaluation and inclusion of the daily symptoms in the analysis are unclear to me.

      We appreciate the reviewer’s feedback regarding the collection of explanatory variables. As noted, of the two microRNA samples collected from each patient, one was obtained near the end of viral shedding. This was intended to examine potential differences in microRNA levels between the early and late phases of infection. No significant differences were observed between the two time points, and using microRNA from either phase alone or both together did not substantially affect predictive accuracy for stratified groups. Furthermore, microRNA collection was motivated primarily by the expectation that it would be more sensitive to immune responses, rather than by ease of sampling. We have revised the manuscript to clarify these points regarding microRNA (page 17, lines 243-245 and 259-262).

      Furthermore, as suggested by the reviewer, we have also strengthened the explanation regarding the collection schedule of clinical information and the use of daily symptoms in the analysis (page 6, lines 90-95 and page 14, lines 218-220).

      (3) Early Trajectory Differentiation: The model struggles to differentiate between patients' viral load trajectories in the early phase, with overlapping slopes and indistinguishable viral load peaks observed in Figures 2B, 2C, and 2D. The question arises whether this issue stems from the data, the nature of Covid-19, or the model itself. The authors discuss the scarcity of pre-symptom data, primarily relying on Illinois patients who underwent testing before symptom onset. This contrasts earlier statements on pages 5-6 & 23, where they claim the data captures the full infection dynamics, suggesting sufficient early data for pre-symptom kinetics estimation. The authors need to provide detailed information on the number or timing of patient sample collections during each period.

      Thank you for the reviewer’s thoughtful comments. The model used in this study [Eqs.(1-2)] has been employed in numerous prior studies and has successfully identified viral dynamics at the individual level. In this context, we interpret the rapid viral increase observed across participants as attributable to characteristics of SARS-CoV-2 in saliva, an interpretation that has also been reported by multiple previous studies. We have added the relevant references and strengthened the corresponding discussion in the manuscript (page 20, lines 303-311).

      We acknowledge that our explanation of how the complementary relationship between the two cohorts contributes to capturing infection dynamics was not sufficiently clear. As described in the manuscript, the Illinois cohort provides pre-symptomatic data, whereas the NFV cohort offers abundant end-phase data, thereby compensating for each other’s missing phases. By jointly analyzing the two cohorts with a nonlinear mixed-effects model, we estimated viral dynamics at the individual-level. This approach first estimates population-level parameters (fixed effects) using data from all participants and then incorporates random effects to account for individual variability, yielding the most plausible parameter values.

      Thus, even when early-phase data are lacking in the NFV cohort, information from the Illinois cohort allows us to infer most reasonable dynamics, and the reverse holds true for the end phase. In this context, we argued that combining the two cohorts enables mathematical modeling to capture infection dynamics at the individual level. Recognizing that our earlier description could be misleading, we have carefully reinforced the relevant description (page 27, lines 472-483). In addition, as suggested by the reviewer, we have added information on the number of data samples available for each phase in both cohorts (page 7, lines 106-109).

      (4) Conditioning on the future: Conditioning on the future in statistics refers to the problematic situation where an analysis inadvertently relies on information that would not have been available at the time decisions were made or data were collected. This seems to be the case when the authors create micro-RNA data (Figure 4A). First, the sampling times need to be clarified by the authors (for clinical outcomes as well). Second, proper causal inference relies on the assumption that the cause precedes the effect. This conditioning on the future may result in overestimating the model's accuracy. This happens because the model has been exposed to the outcome it's supposed to predict. This could question the - already weak - relation with mir-1846 level.

      We appreciate the reviewer’s detailed feedback. As noted in our reply to Comment 2, we collected micro-RNA samples at two time points, near the peak of infection dynamics and at the end stage, and found no significant differences between them. This suggests that micro-RNA levels are not substantially affected by sampling time. Indeed, analyses conducted using samples from the peak, late stage, or both yielded nearly identical results in relation to infection dynamics. To clarify this point, we revised the manuscript by integrating this explanation with our reply to Comment 2 (page 17, lines 259-262). In addition, we have also revised the manuscript to clarify the sampling times of clinical information and micro-RNA (page 6, lines 90-95).

      (5) Mathematical Model Choice Justification and Performance: The paper lacks mention of the practical identifiability of the model (especially for tau regarding the lack of early data information). Moreover, it is expected that the immune effector model will be more useful at the beginning of the infection (for which data are the more parsimonious). Please provide AIC for comparison, saying that they have "equal performance" is not enough. Can you provide at least in a point-by-point response the VPC & convergence assessments?

      We appreciate the reviewer’s detailed feedback regarding the mathematical model. We acknowledge the potential concern regarding the practical identifiability of tau (incubation period), particularly given the limited early-phase data. In our analysis, however, the nonlinear mixed-effects model yielded a population-level estimate of 4.13 days, which is similar to previously reported incubation periods for COVID-19. This concordance suggests that our estimate of tau is reasonable despite the scarcity of early data.

      For model comparison, we have first added information on the AIC of the two models to the manuscript, as suggested by the reviewer (page 10, lines 130-135). One point we would like to emphasize is that we adopted a simple target cell-limited model in this study, aiming to focus on the reconstruction of viral dynamics and the stratification of shedding patterns rather than exploring the mechanism of viral infection in detail. Nevertheless, we believe that the target cell-limited model provides reasonable reconstructed viral dynamics, as it has been used in many previous studies. We revised the manuscript to clarify this (page 10, lines 135-144).

      Furthermore, as suggested, we have added the VPC and convergence assessment results for both models, together with explanatory text, to the manuscript (Supplementary Fig 2, Supplementary Fig 3, and page 10, lines 130-135). In the VPC, the observed 5th, 50th, and 95th percentiles were generally within the corresponding simulated prediction intervals across most time points. Although minor deviations were noted in certain intervals, the overall distribution of the observed data was well captured by the models, supporting their predictive performance (Supplementary Fig 2). In addition, the log-likelihood and SAEM parameter trajectories stabilized after the burn-in phase, confirming appropriate convergence (Supplementary Fig 3).
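
      The VPC criterion described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the response: the time bins, the toy observations, the Gaussian simulator, and the helper names `percentile` and `vpc_check` are all hypothetical, and a real VPC would draw its replicates from the fitted mixed-effects model.

```python
import random
from statistics import quantiles


def percentile(values, q):
    """Return the q-th percentile (q = 1..99) using inclusive interpolation."""
    return quantiles(values, n=100, method="inclusive")[q - 1]


def vpc_check(observed, simulated, q, interval=(5, 95)):
    """For each time bin, report whether the observed q-th percentile lies inside
    the interval spanned by the same percentile across simulated replicates."""
    inside = {}
    for time_bin, obs_values in observed.items():
        obs_q = percentile(obs_values, q)
        sim_q = [percentile(replicate[time_bin], q) for replicate in simulated]
        low = percentile(sim_q, interval[0])
        high = percentile(sim_q, interval[1])
        inside[time_bin] = low <= obs_q <= high
    return inside


# Hypothetical toy data: the first bin agrees with the simulator, the second does not.
rng = random.Random(1)
observed = {"day 0-3": [-1.0, -0.5, 0.0, 0.5, 1.0], "day 4-7": [9.0, 10.0, 11.0]}
simulated = [
    {time_bin: [rng.gauss(0.0, 1.0) for _ in range(50)] for time_bin in observed}
    for _ in range(200)
]
result = vpc_check(observed, simulated, q=50)
print(result)
```

      Running the same check with `q=5` and `q=95` would cover the other two percentiles mentioned in the response.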

      (6) Selected features of viral shedding: I wonder to what extent the viral shedding area under the curve (AUC) and normalized AUC should be added as selected features.

      We sincerely appreciate the reviewer’s valuable suggestion regarding the inclusion of additional features. Following this recommendation, we considered AUC (or normalized AUC) as an additional feature when constructing the distance matrix used for stratification. We then evaluated the similarity between the resulting distance matrix and the original one using the Mantel test, which showed a very high correlation (r = 0.92, p < 0.001). This indicates that incorporating AUC as an additional feature does not substantially alter the distance matrix. Accordingly, we have decided to retain the current stratification analysis, and we sincerely thank the reviewer once again for this interesting suggestion.
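
      For readers unfamiliar with the Mantel test, the comparison of two distance matrices can be sketched as follows. This is a minimal permutation version under stated assumptions: the function names, the toy matrices, and the one-sided p-value with 999 permutations are illustrative choices, not details taken from the response.

```python
import random
from math import sqrt


def upper_triangle(matrix, order):
    """Flatten the strict upper triangle after reordering rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mantel_test(d1, d2, n_permutations=999, seed=0):
    """Correlate two distance matrices; the p-value comes from jointly permuting
    the rows and columns of the second matrix (one-sided, r >= observed)."""
    identity = list(range(len(d1)))
    x = upper_triangle(d1, identity)
    r_observed = pearson(x, upper_triangle(d2, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        perm = identity[:]
        rng.shuffle(perm)
        if pearson(x, upper_triangle(d2, perm)) >= r_observed:
            hits += 1
    return r_observed, (hits + 1) / (n_permutations + 1)


# Hypothetical toy matrices: d2 is a monotone distortion of d1.
points = [0, 1, 3, 6, 10, 15]
d1 = [[abs(a - b) for b in points] for a in points]
d2 = [[abs(a - b) ** 2 for b in points] for a in points]
r, p = mantel_test(d1, d2)
print(round(r, 2), p)
```

      Permuting rows and columns together keeps each permuted matrix a valid distance matrix, which is why the test is preferred over correlating the entries as if they were independent.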

      (7) Two-step nature of the analysis: First you fit a mechanistic model, then you use the predictions of this model to perform clustering and prediction of groups (unsupervised then supervised). Thus you do not propagate the uncertainty intrinsic to your first estimation through the second step, ie. all the viral load selected features actually have a confidence bound which is ignored. Did you consider a one-step analysis in which your covariates of interest play a direct role in the parameters of the mechanistic model as covariates? To pursue this type of analysis SCM (Johnson et al. Pharm. Res. 1998), COSSAC (Ayral et al. 2021 CPT PsP), or SAMBA ( Prague et al. CPT PsP 2021) methods can be used. Did you consider sampling on the posterior distribution rather than using EBE to avoid shrinkage?

      Thank you for the reviewer’s detailed suggestions regarding our analysis. We agree that the current approach does not adequately account for the impact of uncertainty in viral dynamics on the stratified analyses. As a first step, we have revised Extended Data Fig 1 (now renumbered as Supplementary Fig 1) to include 95% credible intervals computed using a bootstrap approach, to present the model-fitting uncertainty more explicitly. Then, to examine the potential impact of model uncertainty on stratified analyses, we reconstructed the distance matrix underlying stratification by incorporating feature uncertainty. Specifically, for each individual, we sampled viral dynamics within the credible interval, averaged the resulting features, and built the distance matrix from them. We then compared this uncertainty-adjusted matrix with the original one using the Mantel test, which showed a strong correlation (r = 0.72, p < 0.001). Given this result, we did not replace the current stratification but revised the manuscript to provide this information (page 11, lines 159-162 and page 28, lines 512-519).

      Furthermore, we carefully considered the reviewer’s proposed one-step analysis. However, implementation was constrained by data-fitting limitations. Concretely, clinical information is available only in the NFV cohort. Thus, if these variables are to be entered directly as covariates on the parameters, the Illinois cohort cannot be included in the data-fitting process. Yet the NFV cohort lacks any pre-symptomatic observations, so fitting the model to that cohort alone does not permit a reasonable (well-identified/robust) fitting result. While we were unable to implement the suggestion under the current data constraints, we sincerely appreciate the reviewer’s thoughtful and stimulating proposal.

      (8) Need for advanced statistical methods: The analysis is characterized by a lack of power. This can indeed come from the sample size that is characterized by the number of data available in the study. However, I believe the power could be increased using more advanced statistical methods. At least it is worth a try. First considering the unsupervised clustering, summarizing the viral shedding trajectories with features collapses longitudinal information. I wonder if the R package « LongituRF » (and associated method) could help, see Capitaine et al. 2020 SMMR. Another interesting tool to investigate could be latent class models R package « lcmm » (and associated method), see Proust-Lima et al. 2017 J. Stat. Softw. But the latter may be more far-reached.

      Thank you for the reviewer’s thoughtful suggestions regarding our unsupervised clustering approach. The R package “LongituRF” is designed for supervised analysis, requiring a target outcome to guide the calculation of distances between individuals (i.e., between viral dynamics). In our study, however, the goal was purely unsupervised clustering, without any outcome variable, making direct application of “LongituRF” challenging.

      Our current approach (summarizing each dynamic into several interpretable features and then using Random Forest proximities) allows us to construct a distance matrix in an unsupervised manner. Here, the Random Forest is applied in “proximity mode,” focusing on how often dynamics are grouped together in the trees, independent of any target variable. This provides a practical and principled way to capture overall patterns of dynamics while keeping the analysis fully unsupervised.

      Regarding the suggestion to use latent class mixed models (R package “lcmm”), we also considered this approach. In our dataset, each subject has dense longitudinal measurements, and at many time points, trajectories are very similar across subjects, resulting in minimal inter-individual differences. Consequently, fitting multi-class latent class mixed models (ng ≥ 2) with random effects or mixture terms is numerically unstable, often producing errors such as non-positive definite covariance matrices or failure to generate valid initial values. Although one could consider using only the time points with the largest differences, this effectively reduces the analysis to a feature-based summary of dynamics. Such an approach closely resembles our current method and contradicts the goal of clustering based on full longitudinal information.

      Taken together, although we acknowledge that incorporating more longitudinal information is important, we believe that our current approach provides a practical, stable, and informative solution for capturing heterogeneity in viral dynamics. We would like to once again express our sincere gratitude to the reviewer for this insightful suggestion.

      (9) Study intrinsic limitation: All the results cannot be extended to asymptomatic patients and patients infected with recent VOCs. It definitively limits the impact of results and their applicability to public health. However, for me, the novelty of the data analysis techniques used should also be taken into consideration.

      We appreciate your positive evaluation of our research approach and acknowledge that, as noted in the Discussion section as our first limitation, our analysis may not provide valid insights into recent VOCs or all populations, including asymptomatic individuals. Nonetheless, we believe it is novel that we extensively investigated the relationship between viral shedding patterns in saliva and a wide range of clinical and micro-RNA data. Our findings contribute to a deeper and more quantitative understanding of heterogeneity in viral dynamics, particularly in saliva samples. To discuss this point, we revised our manuscript (page 22, lines 364-368).

      Strengths are:

      Unique data and comprehensive analysis.

      Novel results on viral shedding.

      Weaknesses are:

      Limitation of study design.

      The need for advanced statistical methodology.

      Reviewer #1 (Recommendations For The Authors):

      Line 8: In the abstract, it would be helpful to state how stratification occurred.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 2, lines 8-11).

      Line 31 and discussion: It is important to mention the challenges of using saliva as a specimen type for lab personnel.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 3, lines 36-41).

      Line 35: change to "upper respiratory tract".

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 3, line 35).

      Line 37: "Saliva" is not a tissue. Please hazard a guess as to which tissue is responsible for saliva shedding and if it overlaps with oral and nasal swabs.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 3, lines 42-45).

      Line 42, 68: Please explain how understanding saliva shedding dynamics would impact isolation & screening, diagnostics, and treatments. This is not immediately intuitive to me.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 3, lines 48-50).

      Line 50: It would be helpful to explain why shedding duration is the best stratification variable.

      We thank the reviewer for the feedback. We acknowledge that our wording was ambiguous. The clear differences in the viral dynamics patterns pertain to findings observed following the stratification, and we have revised the manuscript to make this explicit (page 4, lines 59-61).

      Line 71: Dates should be listed for these studies.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly (page 6, lines 85-86).

      Reviewer #2 (Recommendations For The Authors):

      Please make all code and data available for replication of the analyses.

      We appreciate the suggestion. Due to ethical considerations, it is not possible to make all data and code publicly available. We have stated this clearly in the manuscript (Data availability section in Methods).

      Reviewer #3 (Recommendations For The Authors):

      Here are minor comments / technical details:

      (1) Figure 1B is difficult to understand.

      Thank you for the comment. We updated Fig 1B to incorporate more information to aid interpretation.

      (2) Did you analyse viral load or the log10 of viral load? The latter is more common. You should consider it. In SI Figure 1, please plot in log10 and use a different point shape for censored data. The file quality of this figure should be improved. State in the Materials and Methods whether SEs with Monolix are computed with linearization or importance sampling.

      Thank you for the comment. We conducted our analyses using log10-transformed viral load. Also, we revised Supplementary Fig 1 (now renumbered as Supplementary Fig 4) as suggested. We also added Supplementary Fig 3 and clarified in the Methods that standard errors (SE) were obtained in Monolix from the Fisher information matrix using the linearization method (page 28, lines 498-499).

      (3) Table 1 and Figure 3A could be collapsed.

      Thank you for the comment, and we carefully considered this suggestion. Table 1 summarizes clinical variables by category, whereas Fig 3A visualizes them ordered by p-value of statistical analysis. Collapsing these into a single table would make it difficult to apprehend both the categorical summaries and the statistical ranking at a glance, thereby reducing readability. We therefore decided to retain the current layout. We appreciate the constructive feedback again. 

      (4) Figure 3 legend could be clarified to understand what is 3B and 3C.

      We thank the reviewer for the feedback and have reinforced the description accordingly.

      (5) Why use AIC instead of BICc?

      Thank you for your comment. We also think BICc is a reasonable alternative. However, because our objective is predictive adequacy (reconstruction of viral dynamics), we judged AIC more appropriate. In NLMEM settings, the effective sample size required by BICc is ambiguous, making the penalty somewhat arbitrary. Moreover, since the two models reconstruct very similar dynamics, our conclusions are not sensitive to the choice of criterion.
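
      The sample-size ambiguity mentioned above can be made concrete with a short calculation. The sketch below uses the textbook AIC and plain BIC formulas (not Monolix's BICc), and every number in it (log-likelihoods, parameter counts, numbers of subjects and observations) is hypothetical; only the two model names come from the response.

```python
from math import log


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, sample_size):
    """Plain Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * log(sample_size) - 2 * log_likelihood


# Hypothetical fits of the two candidate models.
log_lik = {"target cell-limited": -500.0, "immune effector": -498.5}
n_params = {"target cell-limited": 8, "immune effector": 10}
n_subjects, n_observations = 50, 600  # two defensible choices of "sample size"

simple, rich = "target cell-limited", "immune effector"
aic_gap = aic(log_lik[rich], n_params[rich]) - aic(log_lik[simple], n_params[simple])
bic_gap_subjects = (bic(log_lik[rich], n_params[rich], n_subjects)
                    - bic(log_lik[simple], n_params[simple], n_subjects))
bic_gap_observations = (bic(log_lik[rich], n_params[rich], n_observations)
                        - bic(log_lik[simple], n_params[simple], n_observations))

# A positive gap favours the simpler model; the BIC gap depends on which n is used.
print(aic_gap, round(bic_gap_subjects, 2), round(bic_gap_observations, 2))
```

      In this toy setting the AIC gap is fixed by the parameter count alone, whereas the BIC-type gap roughly doubles when the number of observations replaces the number of subjects, which is the arbitrariness the response refers to.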

      (6) Bibliography. Most articles are with et al. (which is not standard) and some are with an extended list of names. Provide DOI for all.

      We thank the reviewer for the feedback, and have revised the manuscript accordingly.

      (7) Extended Table 1&2 - maybe provide a color code to better highlight some lower p-values (if you find any interesting).

      We thank the reviewer for the feedback. Since no clinical information and micro-RNAs other than mir-1846 showed low p-values, we highlighted only mir-1846 with color to make it easier to locate.

      (8) Please make the replication code available.

      We appreciate the suggestion. Due to ethical considerations, it is not possible to make all data and code publicly available. We have stated this clearly in the manuscript (Data availability section in Methods).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      In this work, van Paassen et al. have studied how CD8 T cell functionality and levels predict HIV DNA decline. The article touches on interesting facets of HIV DNA decay, but ultimately comes across as somewhat hastily done and not convincing due to the major issues. 

      (1) The use of only 2 time points to make many claims about longitudinal dynamics is not convincing. For instance, the fact that raw data do not show decay in intact, but do for defective/total, suggests that the present data is underpowered. The authors speculate that rising intact levels could be due to patients who have reservoirs with many proviruses with survival advantages, but this is not the parsimonious explanation vs the data simply being noisy without sufficient longitudinal follow-up. n=12 is fine, or even reasonably good for HIV reservoir studies, but to mitigate these issues would likely require more time points measured per person. 

      (1b) Relatedly, the timing of the first time point (6 months) could be causing a number of issues because this is in the ballpark for when the HIV DNA decay decelerates, as shown by many papers. This unfortunate study design means some of these participants may already have stabilized HIV DNA levels, so earlier measurements would help to observe early kinetics, but also later measurements would be critical to be confident about stability. 

      The main goal of the present study was to understand the relationship of the HIV-specific CD8 T-cell responses early on ART with the reservoir changes across the subsequent 2.5-year period on suppressive therapy. We have revised the manuscript in order to clarify this. We chose these time points because the 24-week time point is past the initial steep decline of HIV DNA, which takes place in the first weeks after ART initiation. It is known that HIV DNA continues to decay for years afterwards (Besson, Lalama et al. 2014, Gandhi, McMahon et al. 2017).

      (2) Statistical analysis is frequently not sufficient for the claims being made, such that overinterpretation of the data is problematic in many places. 

      (2a) First, though plausible that cd8s influence reservoir decay, much more rigorous statistical analysis would be needed to assert this directionality; this is an association, which could just as well be inverted (reservoir disappearance drives CD8 T cell disappearance). 

      To correlate different reservoir measures between themselves and with CD8+ T-cell responses at 24 and 156 weeks, we now performed non-parametric (Spearman) correlation analyses, as they do not require any assumptions about the normal distribution of the independent and dependent variables. Benjamini-Hochberg corrections for multiple comparisons (false discovery rate, 0.25) were included in the analyses and did not change the results. 

      Following this comment we would like to note that the association between the T-cell response at 24 weeks and the subsequent decrease in the reservoir cannot be bi-directional (that can only be the case when both variables are measured at the same time point). Therefore, to model the predictive value of T-cell responses measured at 24 weeks for the decrease in the reservoir between 24 and 156 weeks, we fitted generalized linear models (GLM), in which we included age and ART regimen, in addition to three different measures of HIV-specific CD8+ T-cell responses, as explanatory variables, and changes in total, intact, and total defective HIV DNA between 24 and 156 weeks ART as dependent variables.

      (2b) Words like "strong" for correlations must be justified by correlation coefficients, and these heat maps indicate many comparisons were made, such that p-values must be corrected appropriately. 

      We have now used Spearman correlation analysis, provided correlation coefficients to justify the wording, and adjusted the p-values for multiple comparisons (Fig. 1, Fig. 3, Table 2). Benjamini-Hochberg corrections for multiple comparisons (false discovery rate, 0.25) were included in the analyses and did not change the results.
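
      As an illustration of the correction named above, the Benjamini-Hochberg step-up procedure at a false discovery rate of 0.25 can be sketched as follows. The function name and the example p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values, fdr=0.25):
    """Return one boolean per p-value: True if the hypothesis is rejected
    by the Benjamini-Hochberg step-up procedure at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected


# Hypothetical p-values from five correlation tests.
example_p = [0.01, 0.04, 0.03, 0.30, 0.60]
decisions = benjamini_hochberg(example_p, fdr=0.25)
print(decisions)
```

      With these inputs the thresholds are 0.05, 0.10, 0.15, 0.20 and 0.25 for ranks 1 to 5, so the three smallest p-values are retained as discoveries and the other two are not.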

      (3) There is not enough introduction and references to put this work in the context of a large/mature field. The impacts of CD8s in HIV acute infection and HIV reservoirs are both deep fields with a lot of complexity. 

      Following this comment we have revised and expanded the introduction to put our work more in the context of the field (CD8s in acute HIV and HIV reservoirs). 

      Reviewer #2 (Public review): 

      Summary: 

      This study investigated the impact of early HIV specific CD8 T cell responses on the viral reservoir size after 24 weeks and 3 years of follow-up in individuals who started ART during acute infection. Viral reservoir quantification showed that total and defective HIV DNA, but not intact, declined significantly between 24 weeks and 3 years post-ART. The authors also showed that functional HIV-specific CD8⁺ T-cell responses persisted over three years and that early CD8⁺ T-cell proliferative capacity was linked to reservoir decline, supporting early immune intervention in the design of curative strategies. 

      Strengths: 

      The paper is well written, easy to read, and the findings are clearly presented. The study is novel as it demonstrates the effect of HIV specific CD8 T cell responses on different states of the HIV reservoir, that is HIV-DNA (intact and defective), the transcriptionally active and inducible reservoir. Although small, the study cohort was relevant and well-characterized as it included individuals who initiated ART during acute infection, 12 of whom were followed longitudinally for 3 years, providing unique insights into the beneficial effects of early treatment on both immune responses and the viral reservoir. The study uses advanced methodology. I enjoyed reading the paper. 

      Weaknesses: 

      All participants were male (acknowledged by the authors), potentially reducing the generalizability of the findings to broader populations. A control group receiving ART during chronic infection would have been an interesting comparison. 

      We thank the reviewer for their appreciation of our study. Although we had indeed acknowledged the fact that all participants were male, we have clarified why this is a limitation of the study (Discussion, lines 296-298). The reviewer raises the point that it would be useful to compare our data to a control group. Unfortunately, these samples are not yet available, but our study protocol allows for a control group (chronic infection), so we can include one in the future.

      Reviewer #1 (Recommendations for the authors): 

      Minor: 

      On the introduction: 

      (1) One large topic that is mostly missing completely is the emerging evidence of selection on HIV proviruses during ART from the groups of Xu Yu and Matthias Lichterfeld, and Ya Chi Ho, among others. 

      Previously, it was only touched upon in the Discussion. Now we have also included this in the Introduction (lines 77-80).

      (2) References 4 and 5 don't quite match with the statement here about reservoir seeding; we don't completely understand this process, and certainly, the tissue seeding aspect is not known. 

      Line 61-62: references were changed and this paragraph was rewritten to clarify.

      (3) Shelton et al. showed a strong relationship with HIV DNA size and timing of ART initiation across many studies. I believe Ananworanich also has several key papers on this topic. 

      References by Ananworanich are included (lines 91-94).

      (4) "the viral levels decline within weeks of AHI", this is imprecise, there is a peak and a decline, and an equilibrium. 

      We agree and have rewritten the paragraph accordingly.

      (5) The impact of CD8 cells on viral evolution during primary infection is complex and likely not relevant for this paper. 

      We have left viral evolution out of the introduction in order to keep a focus on the current subject.

      (6) The term "reservoir" is somewhat polarizing, so it might be worth mentioning somewhere exactly what you think the reservoir is, I think, as written, your definition is any HIV DNA in a person on ART? 

      Indeed, we refer to the reservoir when we talk about the several aspects of the reservoir that we have quantified with our assays (total HIV DNA, unspliced RNA, intact and defective proviral DNA, and replication-competent virus). In most instances we try to specify which measurement we are referring to. We have added additional reservoir explanation to clarify our definition to the introduction (lines 55-58).

      (7) I think US might be used before it is defined. 

      We thank the reviewer for pointing this out; we have now also defined it in the Results section (line 131).

      (8) In Figure 1 it's also not clear how statistics were done to deal with undetectable values, which can be tricky but important. 

      We have now clarified this in the legend to Figure 2 (former Figure 1). Paired Wilcoxon tests were performed to test the significance of the differences between the time points. Pairs where both values were undetectable were always excluded from the analysis. Pairs where one value was undetectable and its detection limit was higher than the value of the detectable partner, were also excluded from the analysis. Pairs where one value was undetectable and its detection limit was lower than the value of the detectable partner, were retained in the analysis.
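
      The pair-selection rule described above can be written out explicitly. The sketch below is only an illustration: the `Measurement` type, the function name, and the example values are hypothetical, and treating a detection limit exactly equal to the detectable value as an exclusion is an assumption, since the response does not cover that case.

```python
from typing import NamedTuple, Optional


class Measurement(NamedTuple):
    value: Optional[float]    # None when the result is undetectable
    detection_limit: float    # assay detection limit for this sample


def keep_pair(first: Measurement, second: Measurement) -> bool:
    """Decide whether a pair of time points enters the paired Wilcoxon test."""
    first_undetectable = first.value is None
    second_undetectable = second.value is None
    if first_undetectable and second_undetectable:
        return False  # both undetectable: always excluded
    if not first_undetectable and not second_undetectable:
        return True   # both detectable: always retained
    undetectable, detectable = (first, second) if first_undetectable else (second, first)
    # Retain only if the detection limit lies below the detectable partner's value,
    # so the direction of change is unambiguous (equality is excluded by assumption).
    return undetectable.detection_limit < detectable.value


# Hypothetical pairs (24-week sample, 156-week sample).
pairs = [
    (Measurement(120.0, 5.0), Measurement(40.0, 5.0)),   # both detectable
    (Measurement(120.0, 5.0), Measurement(None, 10.0)),  # limit below partner
    (Measurement(8.0, 5.0), Measurement(None, 10.0)),    # limit above partner
    (Measurement(None, 5.0), Measurement(None, 10.0)),   # both undetectable
]
kept = [keep_pair(a, b) for a, b in pairs]
print(kept)
```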

      In the discussion: 

      (1) "This confirms that the existence of a replication-competent viral reservoir is linked to the presence of intact HIV DNA." I think this statement is indicative of many of the overinterpretations without statistical justification. There are 4 of 12 individuals with QVOA+ detectable proviruses, which means there are 8 without. What are their intact HIV DNA levels? 

      We thank the reviewer for the question that is raised here. We have now compared the intact DNA levels (measured by IPDA) between participants with positive vs. negative QVOA output, and observed a significant difference. We rephrased the wording as follows: “We compared the intact HIV DNA levels at the 24-week timepoint between the six participants, from whom we were able to isolate replicating virus, and the fourteen participants, from whom we could not. Participants with positive QVOA had significantly higher intact HIV DNA levels than those with negative QVOA (p=0.029, Mann-Whitney test; Suppl. Fig. 3). Five of six participants with positive QVOA had intact DNA levels above 100 copies/10⁶ PBMC, while thirteen of fourteen participants with negative QVOA had intact HIV DNA below 100 copies/10⁶ PBMC (p=0.0022, Fisher’s exact test). These findings indicate that recovery of replication-competent virus by QVOA is more likely in individuals with higher levels of intact HIV DNA in IPDA, reaffirming a link between the two measurements.”
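
      The Fisher's exact test quoted above can be reproduced from the stated counts. The sketch below assumes the 2×2 table implied by the text (five of six QVOA-positive participants above the threshold, one of fourteen QVOA-negative participants above it) and a two-sided test that sums all tables no more probable than the observed one; the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def probability(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    p_observed = probability(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        probability(k)
        for k in range(k_min, k_max + 1)
        if probability(k) <= p_observed * (1 + 1e-9)
    )


# Rows: QVOA positive / QVOA negative.
# Columns: intact HIV DNA above / below 100 copies per 10^6 PBMC.
p_value = fisher_exact_two_sided(5, 1, 1, 13)
print(round(p_value, 4))  # 0.0022
```

      The result equals 85/38760, which rounds to the p=0.0022 reported in the rephrased passage.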

      (2) "To determine whether early HIV-specific CD8+ T-cell responses at 24 weeks were predictive for the change in reservoir size". This is a fundamental miss on correlation vs causation... it could be the inverse. 

      We thank the reviewer for the remark. We calculated the change in reservoir size (the difference between the reservoir size at 24 weeks and at 156 weeks of ART) and analyzed whether the HIV-specific CD8+ T-cell responses at 24 weeks of ART are predictive of this change. We do not think the relationship can be inverse, because the two measurements are chronologically ordered (CD8+ responses at week 24 predict the subsequent change in the reservoir).

      (3) "This may suggest that active viral replication drives the CD8+ T-cell response." I think to be precise, you mean viral transcription drives CD8s, we don't know about the full replication cycle from these data. 

      We agree with the reviewer and have changed “replication” to “transcription” (line 280).

      (4) "Remarkably, we observed that the defective HIV DNA levels declined significantly between 24 weeks and 3 years on ART. This is in contrast to previous observations in chronic HIV infection (30)". I don't find this remarkable or in contrast: many studies have analyzed and/or modeled defective HIV DNA decay, most of which have shown some negative slope to defective HIV DNA, especially within the first year of ART. See White et al., Blankson et al., Golob et al., Besson et al., etc In addition, do you mean in long-term suppressed? 

      The point we would like to make is that we found a significant, prominent decrease in defective DNA (and not in intact DNA) over the course of 3 years, whereas in other studies the decrease in intact DNA is usually significant and the decrease in defective DNA less prominent. We have rephrased the wording (lines 227-230) as follows:

      “We observed that the defective HIV DNA levels decreased significantly between 24 and 156 weeks of ART. This is different from studies in CHI, where no significant decrease during the first 7 years of ART (Peluso, Bacchetti et al. 2020, Gandhi, Cyktor et al. 2021), or only a significant decrease during the first 8 weeks on ART, but not in the 8 years thereafter, was observed (Nühn, Bosman et al. 2025).”

      Reviewer #2 (Recommendations for the authors): 

      (1) Page 4, paragraph 2 - will be informative to report the statistics here. 

      (2) Page 4, paragraph 4 - "General phenotyping of CD4+ (Suppl. Fig. 3A) and CD8+ (Supplementary Figure 3B) T-cells showed no difference in frequencies of naïve, memory or effector CD8+ T-cells between 24 and 156 weeks." - What did the CD4+ phenotyping show? 

      We thank the reviewer for the remark. Indeed, there were also no differences in frequencies of naïve, memory or effector CD4+ T-cells between 24 and 156 weeks. We have added this to the paragraph (now Suppl. Fig 4), lines 166-168.

      (3) Page 5, paragraph 3 - "Similarly, a broad HIV-specific CD8+ T-cell proliferative response to at least three different viral proteins was observed in the majority of individuals at both time points" - should specify n=? for the majority of individuals. 

      At time point 24 weeks, 6/11 individuals had a response to env, 10/11 to gag, 5/11 to nef, and 4/11 to pol. At 156 weeks, 8/11 to env, 10/11 to gag, 8/11 to nef and 9/11 to pol. We have added this to the text (lines 188-191).

      (4) Seven of 22 participants had non-subtype B infection. Can the authors explain the use of the IPDA designed by Bruner et. al. for subtype B HIV, and how this may have affected the quantification in these participants? 

      Intact HIV DNA was detectable in all 22 participants. We cannot completely exclude an influence of primer/probe-template mismatches on the quantification results. However, such mismatches could also have occurred in subtype B participants, and droplet digital PCR, on which the IPDA is based, is generally much less sensitive to these mismatches than qPCR.

      (5) Page 7, paragraph 2 - the authors report a difference in findings from a previous study ("a decline in CD8 T cell responses over 2 years" - reference 21), but only provide an explanation for this on page 9. The authors should consider moving the explanation to this paragraph for easier understanding. 

      We agree with the reviewer that this causes confusion. Therefore, we have revised and changed the order in the Discussion.

      (6) Page 7, paragraph 2 - Following from above, the previous study (21) reported this contradicting finding "a decline in CD8 T cell responses over 2 years" in a CHI (chronic HIV) treated cohort. The current study was in an acute HIV treated cohort. The authors should explain whether this may also have resulted in the different findings, in addition to the use of different readouts in each study.

      We thank the reviewer for this attentiveness. Indeed, the study by Takata et al. investigates the reservoir and HIV-specific CD8+ T-cell responses in both the RV254/SEARCH010 cohort, who initiated ART during AHI, and the RV304/SEARCH013 cohort, who initiated ART during CHI. We had not realized that the decline in CD8 T cell responses was found solely in RV304/SEARCH013 (the CHI cohort). It appears that functional HIV-specific immune responses were only measured in AHI at 96 weeks, so we have clarified this in the Discussion.

      Besson, G. J., C. M. Lalama, R. J. Bosch, R. T. Gandhi, M. A. Bedison, E. Aga, S. A. Riddler, D. K. McMahon, F. Hong and J. W. Mellors (2014). "HIV-1 DNA decay dynamics in blood during more than a decade of suppressive antiretroviral therapy." Clin Infect Dis 59(9): 1312-1321.

      Gandhi, R. T., J. C. Cyktor, R. J. Bosch, H. Mar, G. M. Laird, A. Martin, A. C. Collier, S. A. Riddler, B. J. Macatangay, C. R. Rinaldo, J. J. Eron, J. D. Siliciano, D. K. McMahon and J. W. Mellors (2021). "Selective Decay of Intact HIV-1 Proviral DNA on Antiretroviral Therapy." J Infect Dis 223(2): 225-233.

      Gandhi, R. T., D. K. McMahon, R. J. Bosch, C. M. Lalama, J. C. Cyktor, B. J. Macatangay, C. R. Rinaldo, S. A. Riddler, E. Hogg, C. Godfrey, A. C. Collier, J. J. Eron and J. W. Mellors (2017). "Levels of HIV-1 persistence on antiretroviral therapy are not associated with markers of inflammation or activation." PLoS Pathog 13(4): e1006285.

      Nühn, M. M., K. Bosman, T. Huisman, W. H. A. Staring, L. Gharu, D. De Jong, T. M. De Kort, N. Buchholtz, K. Tesselaar, A. Pandit, J. Arends, S. A. Otto, E. Lucio De Esesarte, A. I. M. Hoepelman, R. J. De Boer, J. Symons, J. A. M. Borghans, A. M. J. Wensing and M. Nijhuis (2025). "Selective decline of intact HIV reservoirs during the first decade of ART followed by stabilization in memory T cell subsets." AIDS 39(7): 798-811.

      Peluso, M. J., P. Bacchetti, K. D. Ritter, S. Beg, J. Lai, J. N. Martin, P. W. Hunt, T. J. Henrich, J. D. Siliciano, R. F. Siliciano, G. M. Laird and S. G. Deeks (2020). "Differential decay of intact and defective proviral DNA in HIV-1-infected individuals on suppressive antiretroviral therapy." JCI Insight 5(4).

    1. Author response:

      Reviewer #1 (Public Review):

      Summary

      We thank the reviewer for the constructive and thoughtful evaluation of our work. We appreciate the recognition of the novelty and potential implications of our findings regarding UPR activation and proteasome activity in germ cells.

      (1) The microscopy images look saturated, for example, Figure 1a, b, etc. Is this a normal way to present fluorescent microscopy?

      The apparent saturation was not present in the original images, but likely arose from image compression during PDF generation. While the EMA granule was still apparent, in the revised submission, we will provide high-resolution TIFF files to ensure accurate representation of fluorescence intensity and will carefully optimize image display settings to avoid any saturation artifacts.

      (2) The authors should ensure that all claims regarding enrichment/higher vs. lower values have indicated statistical tests.

      We fully agree. In the revised version, we will correct any quantitative comparisons where statistical tests were not already indicated, with a clear statement of the statistical tests used, including p-values in figure legends and text.

      (a) In Figure 2f, the authors should indicate which comparison is made for this test. Is it comparing 2 vs. 6 cyst numbers?

      We acknowledge that the description was not sufficiently detailed. Indeed, the test was not between 2 vs 6 cyst numbers, but between all possible ways 8-cell cysts or the larger cysts studied could fragment randomly into two pieces, and produce by chance 6-cell cysts in 13 of 15 observed examples. We will expand the legend and main text to clarify that a binomial test was used to determine that the proportion of cysts producing 6-cell fragments differed very significantly from chance.

      Revised text:

      “A binomial test was used to assess whether the observed frequency of 6-cell cyst products differed from random cyst breakage. Production of 6-cell cysts was strongly preferred (13/15 cysts; ****p < 0.0001).”
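
      The kind of calculation behind such a binomial test can be sketched as follows. This is an illustration only: the response does not state the null probability that random breakage yields a 6-cell fragment, so `p0` below is a hypothetical placeholder, the function name is invented, and a one-sided upper tail is shown as one possible formulation.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0): exact one-sided binomial test."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# 13 of 15 observed cysts produced a 6-cell fragment (counts from the response).
# The null probabilities below are hypothetical values chosen for illustration.
for p0 in (0.1, 0.25, 0.5):
    print(f"p0 = {p0}: one-sided p = {binomial_upper_tail(13, 15, p0):.2e}")
```

      The smaller the chance of obtaining a 6-cell fragment under random breakage, the more extreme 13 of 15 becomes, so the reported significance depends on how that null probability is derived.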

      (b) Figures 4d and 4e do not have a statistical test indicated.

      We will include the specific statistical test used and report the corresponding p-values directly in the figure legends.

      (3) Because the system is developmentally dynamic, the major conclusions of the work are somewhat unclear. Could the authors be more explicit about these and enumerate them more clearly in the abstract?

      We will revise the abstract to better clarify the findings of this study. We will also replace the term Visham with mouse fusome to reflect its functional and structural analogy to the Drosophila and Xenopus fusomes, making the narrative more coherent and conclusive.

      (4) The references for specific prior literature are mostly missing (lines 184-195, for example).

      We appreciate this observation; the problem arose inadvertently when an earlier version was shortened. We will add 3–4 relevant references to appropriately support this section.

      (5) The authors should define all acronyms when they are first used in the text (UPR, EGAD, etc).

      We will ensure that all acronyms are spelled out at first mention (e.g., Unfolded Protein Response (UPR), Endosome and Golgi-Associated Degradation (EGAD)).

      (6)  The jumping between topics (EMA, into microtubule fragmentation, polarization proteins, UPR/ERAD/EGAD, GCNA, ER, balbiani body, etc) makes the narrative of the paper very difficult to follow.

      We are not jumping between topics, but following a narrative relevant to the central question of whether female mouse germ cells develop using a fusome. EMA, microtubule fragmentation, polarization proteins, ER, and the Balbiani body are all topics with a known connection to fusomes. This is explained in the general introduction and in relevant subsections. We appreciate the feedback that further explanation of these connections would be helpful. In the revised manuscript, use of the unified term mouse fusome will also help connect the narrative across sections. UPR/ERAD/EGAD are processes that have been studied in repair and maintenance of somatic cells and in yeast meiosis. We show that the major regulator Xbp1 is found in the fusome, and that the fusome and these rejuvenation pathway genes are expressed and maintained throughout oogenesis, rather than only during limited late stages as suggested in previous literature.

      (7) The heading title "Visham participates in organelle rejuvenation during meiosis" in line 241 is speculative and/or not supported. Drawing upon the extensive, highly rigorous Drosophila literature, it is safe to extrapolate, but the claim about regeneration is not adequately supported.

      We believe this statement is accurate given the broad scope of the term "participates." It is supported by localization of the UPR regulator Xbp1 to the fusome. Xbp1 is the ortholog of Hac1, a key gene in UPR-mediated rejuvenation during yeast meiosis. We also showed that rejuvenation pathway genes are expressed throughout most of meiosis (not previously known) and expanded the cytological evidence of stage-specific organelle rejuvenation later in meiosis, such as mitochondrial-ER docking, in regions enriched in fusome antigens. However, we recognize the current limitations of this evidence in the mouse and want to convey this appropriately, without going to what we believe would be the unjustified extreme of saying there is no evidence.

      Reviewer #2 (Public Review):

      We thank the reviewer for the comprehensive summary and for highlighting both the technical achievement and biological relevance of our study. We greatly appreciate the thoughtful suggestions that have helped us refine our presentation and terminology.

      (1) Some titles contain strong terms that do not fully match the conclusions of the corresponding sections.

      (1a) Article title “Mouse germline cysts contain a fusome-like structure that mediates oocyte development”

      We will change the statement to: “Mouse germline cysts contain a fusome that supports germline cyst polarity and rejuvenation.”

      (1b) Result title “Visham overlaps centrosomes and moves on microtubules”

      We acknowledge that “moves” implies dynamics. We will include additional supplementary images showing small vesicular components of the mouse fusome on spindle-derived microtubule tracks.

      (1c) Result title “Visham associates with Golgi genes involved in UPR beginning at the onset of cyst formation”

      We will revise this title to: “The mouse fusome associates with the UPR regulatory protein Xbp1 beginning at the onset of cyst formation” to reflect the specific UPR protein that was immunolocalized. 

      (1d) Result title “Visham participates in organelle rejuvenation during meiosis”

      We will revise this to: “The mouse fusome persists during organelle rejuvenation in meiosis.”

      (2) The authors aim to demonstrate that Visham is a fusome-like structure. I would suggest simply referring to it as a "fusome-like structure" rather than introducing a new term, which may confuse readers and does not necessarily help the authors' goal of showing the conservation of this structure in Drosophila and Xenopus germ cells. Interestingly, in a preprint from the same laboratory describing a similar structure in Xenopus germ cells, the authors refer to it as a "fusome-like structure (FLS)" (Davidian and Spradling, BioRxiv, 2025).

      We appreciate the reviewer’s insightful comment. To maintain conceptual clarity and align with existing literature, we will refer to the structure as the mouse fusome throughout the manuscript, avoiding introduction of a new term.

      Reviewer #3 (Public Review):

      We thank the reviewer for emphasizing the importance of our study and for providing constructive feedback that will help us clarify and strengthen our conclusions.

      (1) Line 86 - the heading for this section is "PGCs contain a Golgi-rich structure known as the EMA granule" 

      We agree that the enrichment of Golgi within the EMA PGCs was not shown until the next section. We will revise this heading to:

      “PGCs contain an asymmetric EMA granule.”

      (2)  Line 105-106, how do we know if what's seen by EM corresponds to the EMA1 granule?

      We will clarify that this identification is based on co-localization with Golgi markers (GM130 and GS28) and response to Brefeldin A treatment, which will be included as supplementary data. These findings support that the mouse fusome is Golgi-derived and can therefore be visualized by EM. The Golgi regions in E13.5 cyst cells move close together and associate with ring canals as visualized by EM (Figure 1E), the same as the mouse fusomes identified by EMA.

      (3) Line 106-107-states "Visham co-stained with the Golgi protein Gm130 and the recycling endosomal protein Rab11a1". This is not convincing as there is only one example of each image, and both appear to be distorted.

      Space is at a premium in these figures, but we have no limitation on data documenting this absolutely clear co-localization. We will replace the existing images with high-resolution, non-compressed versions for the final figures to clearly illustrate the co-staining patterns for GM130 and Rab11a1.

      (4) Line 132-133---while visham formation is disrupted when microtubules are disrupted, I am not convinced that visham moves on microtubules as stated in the heading of this section.

      We will include additional supplementary data showing small mouse fusome vesicles aligned along microtubules.

      (5) Line 156 - the heading for this section states that Visham associates with polarity and microtubule genes, including pard3, but only evidence for pard3 is presented.

      We agree and will revise the heading to: “Mouse fusome associates with the polarity protein Pard3.” We are adding data showing association of small fusome vesicles on microtubules.  

      (6)  Lines 196-210 - it's strange to say that UPR genes depend on DAZ, as they are upregulated in the mutants. I think there are important observations here, but it's unclear what is being concluded.

      UPR genes are not upregulated in Dazl mutants, in the sense that we have never documented an increase in their expression. We show that UPR genes during this time behave like pluripotency genes and normally decline, but in Dazl mutants their decline is slowed. We will rephrase the paragraph to clarify that Dazl mutation partially decouples developmental processes that are normally linked, which alters UPR gene expression relative to cyst development.

      (7) Line 257-259 - wave 1 and 2 follicles need to be explained in the introduction, and how these fit with the observations here clarified.

      Follicle waves are too small a focus of the current study to explain in the introduction, but we will refer readers to the relevant cited literature (Yin and Spradling, 2025) for further details.

      We sincerely thank all reviewers for their insightful and constructive feedback. We believe that the planned revisions—particularly the refined terminology, improved image quality, clarified statistics, and restructured abstract—will substantially strengthen the manuscript and enhance clarity for readers.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      In this paper, the authors conduct both experiments and modeling of human cytomegalovirus (HCMV) infection in vitro to study how the infectivity of the virus (measured by cell infection) scales with the viral concentration in the inoculum. A naïve thought would be that this is linear in the sense that doubling the virus concentration (and thus the total virus) in the inoculum would lead to doubling the fraction of infected cells. However, the authors show convincingly that this is not the case for HCMV, using multiple strains, two different target cells, and repeated experiments. In fact, they find that for some regimens (inoculum concentration), infected cells increase faster than the concentration of the inoculum, which they term "apparent cooperativity". The authors then provided possible explanations for this phenomenon and constructed mathematical models and simulations to implement these explanations. They show that these ideas do help explain the cooperativity, but they can't be conclusive as to what the correct explanation is. In any case, this advances our knowledge of the system, and it is very important when quantitative experiments involving MOI are performed.

      Strengths:

      Careful experiments using state-of-the-art methodologies and advancing multiple competing models to explain the data.

      Weaknesses:

      There are minor weaknesses in explaining the implementation of the model. However, some specific assumptions, which to this reviewer were unclear, could have a substantial impact on the results. For example, whether cell infection is independent or not. This is expanded below.

      Suggestions to clarify the study:

      (1) Mathematically, it is clear what "increase linearly" or "increase faster than linearly" (e.g., line 94) means. However, it may be confusing for some readers to then look at plots such as in Figure 2, which appear linear (but on the log-log scale) and about which the authors also say (line 326) "data best matching the linear relationship on a log-log scale". 

      This is a good point. In our revision, we will include a clarification that a linear relationship on the log-log scale does not imply a linear relationship on the linear scale.
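
      The distinction can be made concrete with a small numerical sketch. It assumes a pure power law y = c·x^n, which is a simplification for illustration and not necessarily the exact form of the model in the paper; the function name and all numerical values are hypothetical. Any such curve is a straight line on log-log axes with slope n, but doubling x doubles y only when n = 1.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x, mean_y = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# Hypothetical dilution series in genomes/cell with a dilution factor of 1.4.
genomes_per_cell = [0.01 * 1.4**i for i in range(10)]

for n in (1.0, 1.5):
    infected = [0.05 * g**n for g in genomes_per_cell]
    slope = loglog_slope(genomes_per_cell, infected)
    # Both curves are straight lines on log-log axes; only n = 1 is
    # proportional on the linear scale (doubling x multiplies y by 2**n).
    print(f"n = {n}: log-log slope = {slope:.3f}, fold change on doubling = {2**n:.3f}")
```

      A log-log slope above 1 therefore corresponds to a faster-than-linear increase on the linear scale, even though the plotted points fall on a straight line.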

      (2) One of the main issues that is unclear to me is whether the authors assume that cell infection is independent of other cells. This could be a very important issue affecting their results, both when analyzing the experimental data and running the simulations. One possible outcome of infection could be the generation of innate mediators that could protect (alter the resistance) of nearby cells. I can imagine two opposite results of this: i) one possibility is that resistance would lead to lower infection frequencies and this would result in apparent sub-linear infection (contrary to the observations); or ii) inoculums with more virus lead to faster infection, which doesn't allow enough time for the "resistance" (innate effect) to spread (potentially leading to results similar to the observations, supra-linear infection). 

      In our models we assumed cells to be independent of each other (see also responses to other similar points). Because we measure infection in individual cells, assuming cells are independent is a reasonable first approximation. However, the reviewer makes an excellent point that there may be some between-cell signaling happening in the culture that “alerts” or “conditions” cells to change their “resistance”. It is also possible that at higher genome/cell numbers, exposure of cells to virions or virion debris may change the state of cells in the culture, and more cells become “susceptible” to infection. This is a good point that we will list in the Limitations subsection of the Discussion; it is a good hypothesis to test in our future experiments.

      (3) Another unclear aspect of cell infection is whether each cell only has one chance to be infected or multiple chances, i.e., do the authors run the simulation once over all the cells or more times? 

      Each cell has only one chance to be infected. Algorithm 1 clearly states that; we will add an extra sentence in “Agent-based simulations” to indicate this point.

      (4) On the other hand, the authors address the complementary issue of the virus acting independently or not, with their clumping model (which includes nice experimental measurements). However, it was unclear to me what the assumption of the simulation is in this case. In the case of infection by a clump of virus or "viral compensation", when infection is successful (the cell becomes infected), how many viruses "disappear" and what happens to the rest? For example, one of the viruses of the clump is removed by infection, but the others are free to participate in another clump, or they also disappear. The only thing I found about this is the caption of Figure S10, and it seems to indicate that only the infected virus is removed. However, a typical assumption, I think, is that viruses aggregate to improve infection, but then the whole aggregate participates in infection of a single cell, and those viruses in the clump can't participate in other infections. Viral cooperativity with higher inocula in this case would be, perhaps, the result of larger numbers of clumps for higher inocula. This seems in agreement with Figure S8, but was a little unclear in the interpretation provided. 

      This is a good point. We did not remove the clump if one of the virions in the clump manages to infect a cell, and indeed, this could be the reason why in some simulations we observe apparent cooperativity when modeling viral clumping. This is something we will explore in our revision.

      (5) In algorithm 1, how does P_i, as defined, relate to equation 1? 

      These are unrelated: eqn. (1) is a phenomenological model that links infection per cell to genomes per cell, whereas P_i in Algorithm 1 is a “physics-inspired” potential barrier.

      (6) In line 228, and several other places (e.g., caption of Table S2), the authors refer to the probability of a single genome infecting a cell p(1)=exp(-lambda), but shouldn't it be p(1)=1-exp(-lambda) according to equation 1?

      Indeed, it was a typo, p(1)=1-exp(-lambda) per eqn 1. Thank you, it will be corrected in the revised paper.
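
      A quick numerical check shows why the corrected expression is the sensible one. The sketch below assumes that eqn. (1) has the form p(g) = 1 − exp(−λ·gⁿ); this form is an assumption made for illustration, since the equation itself is not reproduced in this response, and the function name and the value of λ are hypothetical.

```python
import math


def p_infect(genomes: float, lam: float, n: float = 1.0) -> float:
    """Probability that a cell is infected at a given genomes/cell.

    Assumed form (not confirmed by the response): 1 - exp(-lam * genomes**n).
    """
    return 1.0 - math.exp(-lam * genomes**n)


lam = 0.01  # hypothetical per-genome infectivity parameter

# At one genome per cell the exponent n drops out, so p(1) = 1 - exp(-lam).
print(p_infect(1.0, lam), p_infect(1.0, lam, n=2.0))

# For small lam the corrected expression is close to lam itself, whereas the
# mistyped exp(-lam) would be close to 1, i.e. almost certain infection.
print(1.0 - math.exp(-lam), math.exp(-lam))
```

      Under this assumed form, p(1) does not depend on the cooperativity parameter n, which is consistent with the corrected expression being stated without reference to n.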

      (7) In line 304, the accrued damage hypothesis is defined, but it is stated as a triggering of an antiviral response; one would assume that exposure to a virion should increase the resistance to infection. Otherwise, the authors are saying that evolution has come up with intracellular viral resistance mechanisms that are detrimental to the cell. As I mentioned above, this could also be a mechanism for non-independent cell infection. For example, infected cells signal to neighboring cells to "become resistant" to infection. This would also provide a mechanism for saturation at high levels.

      We do not know how exposure of a cell to one virion would change its “antiviral state”, i.e., whether it becomes more or less resistant to the next infection. If a cell becomes more resistant, apparent cooperativity in infection of cells cannot arise, so this hypothesis cannot explain our observations with n>1. Whether this mechanism plays a role in the saturation of the cell infection rate at a value below 1 when genomes/cell is large is unclear, but it is a possibility. We will add this point to the Discussion in the revision.

      (8) In Figure 3, and likely other places, t-tests are used for comparisons, but with only an n=5 (experiments). Many would prefer a non-parametric test. 

      We repeated the analyses in Fig 3 with the Mann-Whitney test and the results were the same, so we would like to keep the t-test results in the paper.
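
      One practical consequence of small samples for a rank-based test can be illustrated with an exact permutation version of the Mann-Whitney test. The sketch assumes two independent groups of five values each, which is an assumption about the design rather than something stated in this response; the data and function names are hypothetical.

```python
from itertools import combinations


def u_statistic(x, y):
    """Mann-Whitney U: number of (x, y) pairs with x > y, ties counted as 0.5."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)


def exact_mann_whitney_two_sided(x, y):
    """Exact two-sided p-value by enumerating every relabelling of the pooled data."""
    pooled = list(x) + list(y)
    n_x = len(x)
    mean_u = n_x * len(y) / 2
    observed = abs(u_statistic(x, y) - mean_u)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n_x):
        chosen = set(idx)
        group_x = [pooled[i] for i in chosen]
        group_y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(u_statistic(group_x, group_y) - mean_u) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical, fully separated groups of five measurements each.
group_a = [1.1, 1.3, 1.2, 1.4, 1.5]
group_b = [2.1, 2.4, 2.2, 2.6, 2.3]
print(exact_mann_whitney_two_sided(group_a, group_b))  # 2/252, the smallest possible value
```

      With five values per group the exact two-sided p-value cannot fall below 2/252 (about 0.008), so agreement between the two tests at conventional thresholds is informative, but very small p-values can only come from the parametric test.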

      Reviewer #2 (Public review):

      In their article, Peterson et al. wanted to show to what extent the classical "single hit" model of virion infection, where one virion is required to infect a cell, does not match empirical observations based on human cytomegalovirus in vitro infection model, and how this would have practical impacts in experimental protocols.

      They first used a very simple experimental assay, where they infected cells with serially diluted virions and measured the proportion of infected cells with flow cytometry. From this, they could elegantly show how the proportion of infected cells differed from a "single hit" model, which they simulated using a simple mathematical model ("powerlaw model"), and better fit a model where virions need to cooperate to infect cells. They then explore which mechanism could explain this apparent cooperation:

      (1) Stochasticity alone cannot explain the results, although I am unsure how generalizable the results are, because the mathematical model chosen cannot, by design, explain such observations only by stochasticity. 

      Our null model simulations are not just about stochasticity; they also include variability in virion infectivity and cell resistance to infection. We agree that simulations cannot truly prove that such variability cannot result in apparent cooperativity; however, we also provide a mathematical proof that the increase in the frequency of infected cells should be linear in virion concentration at small genome/cell numbers.

      (2) Virion clumping seemed not to be enough either to generally explain such a pattern. For that, they first use a mathematical model showing that the apparent cooperation would be small. However, I am unsure how extreme the scenario of simulated virion clumping is. They then used dynamic light scattering to measure the distribution of the sizes of clumps. From these estimates, they show that virion clumps cannot reproduce the observed virion cooperation in serial dilution assays. However, the authors remain imprecise on how the uncertainty of these clumps' size distribution would impact the results, as most clumps have a size smaller than a single virion, leaving therefore a limited number of clumps truly containing virions.

      As we stated in the paper, clumping may explain apparent cooperativity in simulations, depending on how stock dilution impacts the distribution of virions per clump. This could be explored further; however, better experimental measurements of virions per clump would be highly informative (we do not have the resources to do these experiments at present). Our point is that the degree of apparent cooperativity depends on the target cell used (n is smaller on epithelial cells than on fibroblasts), which is difficult to explain by clumping because clumping is a virion property. Per the comment by reviewer 1, we will do further analyses of the clumping model to investigate how clump removal after a successful infection affects the detected degree of apparent cooperativity.

      The two models remain unidentifiable from each other but could explain the apparent virion cooperativity: either due to an increase in susceptibility of the cell each time a virion tries to infect it, or due to viral compensation, where lesser fit viruses are able to infect cells in co-infection with a better fit virion. Unfortunately, the authors here do not attempt to fit their mathematical model to the experimental data but only show that theoretical models and experimental data generate similar patterns regarding virion apparent cooperation. 

      In the revision we will provide examples of simulations that “match” experimental data with a relatively high degree of apparent cooperativity; we have run these before but excluded them from the current version because they are somewhat messy. Fitting simulations to data may be overkill.

      Finally, the authors show that this virion cooperation could make the relationship between the estimated multiplicity of infection and viruses/cell deviate from the 1:1 relationship. Consequently, the dilution of a virion stock would lead to an even stronger decrease in infectivity, as more diluted virions can cooperate less for infection.

      Overall, this work is very valuable as it raises the general question of how the estimate of infectivity can be biased if extrapolated from a single virus titer assay. The observation that HCMV virions often cooperate and that this cooperation varies between contexts seems robust. The putative biological explanations would require further exploration.

      This topic is very well known in the case of segmented viruses and the semi-infectious particles, leading to the idea of studying "sociovirology", but to my knowledge, this is the first time that it was explored for a nonsegmented virus, and in the context of MOI estimation. 

      Thank you.

      Reviewer #3 (Public review): 

      Summary:

      The authors dilute fluorescent HCMV stocks in small steps (df ≈ 1.3-1.5) across 23 points, quantify infections by flow cytometry at 3 dpi, and fit a power-law model to estimate a cooperativity parameter n (n > 1 indicates apparent cooperativity). They compare fibroblasts vs epithelial cells and multiple strains/reporters, and explore alternative mechanisms (clumping, accrued damage, viral compensation) via analytical modeling and stochastic simulations. They discuss implications for titer/MOI estimation and suggest a method for detecting "apparent cooperativity," noting that for viruses showing this behavior, MOI estimation may be biased.

      Strengths:

      (1) High-resolution titration & rigor: The small-step dilution design (23 serial dilutions; tailored df) improves dose-response resolution beyond conventional 10× series.

      (2) Clear quantitative signal: Multiple strain-cell pairs show n > 1, with appropriate model fitting and visualization of the linear regime on log-log axes.

      (3) Mechanistic exploration: Side-by-side modeling of clumping vs accrued damage vs compensation frames testable hypotheses for cooperativity. 

      Thank you.

      Weaknesses:

      (1) Secondary infection control: The authors argue that 3 dpi largely avoids progeny-mediated secondary infection; this claim should be strengthened (e.g., entry inhibitors/control infections) or add sensitivity checks showing results are robust to a small secondary-infection contribution. 

      This is an important point. We believe that current knowledge of HCMV virion production time (it takes 3-4 days to make virions, per multiple papers; see Fig 7 in Vonka and Benyesh-Melnick JB 1966; Fig 3B in Stanton et al JCI 2010; and Fig 1A in Li et al. PNAS 2015) is sufficient to justify our experimental design, but we agree that an additional control that blocks new infections would be useful. We had previously performed experiments with an HCMV TB-gL-KO that cannot make infectious virions (although stock virions can be made from complemented target cells). We will investigate whether our titration experiments with this virus strain have sufficient resolution to detect apparent cooperativity. However, at present we do not have the resources to perform new experiments.

      (2) Discriminating mechanisms: At present, simulations cannot distinguish between accrued damage and viral compensation. The authors should propose or add a decisive experiment (e.g., dual-color coinfection to quantify true coinfection rates versus "priming" without coinfection; timed sequential inocula) and outline expected signatures for each mechanism. 

      Excellent suggestion. Because infection of a cell is a result of the joint viral infectivity and cell resistance, it may be hard to discriminate between these alternatives unless we specify them as particular molecular mechanisms. But we will try our best and list potential future experiments in the revised version of the paper.

      (3) Decline at high genomes/cell: Several datasets show a downturn at high input. Hypotheses should be provided (cytotoxicity, receptor depletion, and measurement ceiling) and any supportive controls. 

      Another good point. We do not have a good explanation, but we do not believe this is because of saturation of available target cells. It seemed to happen only (or was most pronounced) with the ME stocks, which are typically lower in titer, so the higher MOIs were nearly undiluted stock. It may be an effect of the conditioned medium. Or perhaps there are non-infectious particles, such as dense bodies (enveloped particles that lack a capsid and genome) and non-infectious enveloped particles (NIEPs), that compete for receptors or otherwise damage cells, and these are not diluted out at the higher doses. We plan to include these points in the Discussion of the revised version of the paper.

      (4) Include experimental data: In Figure 6, please include the experimentally measured titers (IU/mL), if available. 

      This is a model-simulated scenario, and as such, there are no measured titers.

      (5) MOI guidance: The practical guidance is important; please add a short "best-practice box" (how to determine titer at multiple genomes/cell and cell densities; when single-hit assumptions fail) for end-users. 

      Good suggestion. In the revised version of the paper, we will include a best-practice box using guidelines developed in the Ryckman lab over the years.
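      To make the "single-hit" assumption in this comment concrete, here is a small sketch contrasting the standard Poisson single-hit relationship (fraction infected = 1 - exp(-MOI)) with the dilution behaviour implied by a power law with exponent n. The function names are hypothetical and the numbers are illustrative only; this is not the guidance the authors plan to provide.

```python
import math

def single_hit_fraction(moi):
    """Expected fraction of infected cells if one virion suffices (Poisson)."""
    return 1.0 - math.exp(-moi)

def apparent_moi(fraction_infected):
    """Invert the single-hit relationship to get the apparent MOI."""
    return -math.log(1.0 - fraction_infected)

def fold_drop_after_dilution(dilution, n):
    """Fold decrease in infection when the dose is diluted `dilution`-fold,
    assuming fraction infected scales as dose**n in the low-dose regime."""
    return dilution ** n

# With n = 1 a 10-fold dilution lowers infection 10-fold; with n = 1.5
# the same dilution lowers it roughly 31.6-fold, so a titer measured at
# one dose would not extrapolate linearly to another.
for n in (1.0, 1.5):
    print(n, fold_drop_after_dilution(10.0, n))
```

      Under these assumptions, a titer extrapolated from a single dilution would overestimate infectivity at lower doses whenever n > 1.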

      Overall note to all reviews: We have deposited our code and data on GitHub; yet, none of the reviewers commented on it.

    1. Author response:

      We thank the editor and reviewers for their thoughtful feedback. We agree with eLife’s overall assessment that, while profiling terminating ribosomes is informative in revealing termination dynamics, the underlying mechanisms require more evidence. Our revision will focus on three conceptual points.

      (1) We will tone down the statement that putative mRNA:rRNA interaction contributes to sequence-specific termination pausing.

      (2) We will clarify the potential role of Rps26 in regulating translation termination.

      (3) We will expand the discussion of tissue-specific termination pausing.

      Reviewer #1 (Public Review):

      (1) We admit that the modest effects of ABCE1 were partly due to the incomplete ABCE1 knockdown in HEK293 cells. Since the elevated ribosome density occurred at all stop codons, we argue that the action of ABCE1 is likely independent of the sequence context. We will rephrase relevant statements in the revised manuscript.

      (2) In terms of Rps26 structures, we agree the structural rearrangement in the absence of Rps26 is highly speculative. However, we do not believe the Rps26 stoichiometry is solely dependent on stress. We will clarify this important point in the revised manuscript.

      (3) We apologize for the confusion about 18S rRNA “scanning” and will revise the sentence in the main text.

      (4) We agree that functional significance of testis-specific termination dynamics is unclear. Since other reviewers raised similar concern, we will expand the discussion of tissue-specific termination pausing in the revised manuscript.

      Reviewer #2 (Public Review):

      We appreciate the Reviewer’s time and efforts in reviewing our manuscript. We are grateful for the insightful comments and many recommendations made by the reviewer to improve our manuscript. We feel that the reviewer may have some misunderstanding in terms of the sequence motif associated with the termination pausing, partly because of the lack of clarity in our original description of the results from MPRA and reporter assays. We will ensure that the reviewer’s points are fully addressed in the revised manuscript.

      Reviewer #3 (Public Review):

      We thank the reviewer for the positive comment on our manuscript. We agree that the tissue-specific termination differences were poorly described in the main text. Notably, other reviewers raised similar concerns. We will expand the relevant discussion in the revised manuscript, outlining this as a limitation and a future direction.

      Reviewer #4 (Public Review):

      We believe the reviewer mixed the public review with recommendation comments. The reviewer appears to be preoccupied by previous studies and questioned some inconsistency in our results. With the development of new technology such as eRF1-seq, we are encouraged to present “new” and “different” findings. All other reviewers appreciate the development of eRF1-seq to profile terminating ribosomes. In fact, we do not believe our data are fundamentally different from the established principles. Rather, our data provide new perspectives to further our understanding of ribosome dynamics at stop codons. We thank the reviewer for understanding.

      The reviewer is quite confused by our sequencing analysis based on peak height, or read density, which is commonly used to infer ribosome dynamics such as pausing. Regarding the sequencing analysis and reporter assays in cells expressing 18S mutant (Figure 5) and Rps26 (Figure 7), we feel that the reviewer has some misunderstanding. In the revised manuscript, we will do our best to clarify those relevant issues. Finally, the reviewer’s comment on base pairing is well-received and we will thoroughly revise the main text and discussion in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, Bisht et al address the hypothesis that protein folding chaperones may be implicated in aggregopathies and in particular Tau aggregation, as a means to identify novel therapeutic routes for these largely neurodegenerative conditions.

      The authors conducted a genetic screen in the Drosophila eye, which facilitates the identification of mutations that either enhance or suppress a visible disturbance in the nearly crystalline organization of the compound eye. They screened by RNA interference all 64 known Drosophila chaperones and revealed that mutations in 20 of them exaggerate the Tau-dependent phenotype, while 15 ameliorated it. The enhancer of the degeneration group included 2 subunits of the typically heterohexameric prefoldin complex and other co-translational chaperones.

      The authors characterized in depth one of the prefoldin subunits, Pfdn5, and convincingly demonstrated that this protein functions in the regulation of microtubule organization, likely due to its regulation of proper folding of tubulin monomers. They demonstrate convincingly using both immunohistochemistry in larval motor neurons and microtubule binding assays that Pfdn5 is a bona fide microtubule-associated protein contributing to the stability of the axonal microtubule cytoskeleton, which is significantly disrupted in the mutants.

      Similar phenotypes were observed in larvae expressing Frontotemporal dementia with Parkinsonism on chromosome 17-associated mutations of the human Tau gene V377M and R406W. On the strength of the phenotypic evidence and the enhancement of the TauV377M-induced eye degeneration, they demonstrate that loss of Pfdn5 exaggerates the synaptic deficits upon expression of the Tau mutants. Conversely, the overexpression of Pfdn5 or Pfdn6 ameliorates the synaptic phenotypes in the larvae, the vacuolization phenotypes in the adult, and even memory defects upon TauV377M expression.

      Strengths

      The phenotypic analyses of the mutant and its interactions with TauV377M at the cell biological, histological, and behavioral levels are precise, extensive, and convincing and achieve the aims of characterization of a novel function of Pfdn5. 

      Regarding this memory defect upon V377M tau expression. Kosmidis et al (2010), PMID: 20071510, demonstrated that pan-neuronal expression of Tau<sup>V377M</sup> disrupts the organization of the mushroom bodies, the seat of long-term memory in odor/shock and odor/reward conditioning. If the novel memory assay the authors use depends on the adult brain structures, then the memory deficit can be explained in this manner. 

      (1) If the mushroom bodies are defective upon Tau<sup>V377M</sup> expression, does overexpression of Pfdn5 or 6 reverse this deficit? This would argue strongly in favor of the microtubule stabilization explanation.

      We thank the reviewer for this insightful comment. Consistent with Kosmidis et al. (2010), we confirm that expression of hTau<sup>V377M</sup> disrupts the architecture of mushroom bodies.   In addition, we find, as suggested by the reviewer, that coexpression of either Pfdn5 or Pfdn6 with hTau<sup>V377M</sup> significantly restores the organization of the mushroom bodies. These new findings strongly support the hypothesis that Pfdn5 or Pfdn6 mitigate hTau<sup>V377M</sup> -induced memory deficits by preserving the structure of the mushroom body, likely through stabilizing the microtubule network. This data has now been included in the revised manuscript (Figure 7H-O).

      (2) The discovery that Pfdn5 (and 6 most likely) affects tauV377M toxicity is indeed a novel and important discovery for the Tauopathies field. It is important to determine whether this interaction affects only the FTDP-17-linked mutations or also WT Tau isoforms, which are linked to the rest of the Tauopathies. Also, insights on the mode(s) that Pfdn5/6 affect Tau toxicity, such as some of the suggestions above, are aiming at will likely be helpful towards therapeutic interventions.

      We agree that determining whether prefoldin modulates the toxicity of both mutant and wildtype Tau is critical for understanding its broader relevance to Tauopathies. We have now performed the additional experiments required to address this issue. These new data show that loss of Pfdn5 also exacerbates toxicity associated with wildtype Tau (hTau<sup>WT</sup>), in a manner similar to that observed with hTau<sup>V337M</sup> or hTau<sup>R406W</sup>. Specifically, overexpression of hTau<sup>WT</sup> in a Pfdn5 mutant background leads to Tau aggregate formation (Figure S7G-I), and coexpression of Pfdn5 with hTau<sup>WT</sup> reduces the associated synaptic defects (Figure S11F-L). These findings underscore a general role for Pfdn5 in modulating diverse Tauopathy-associated phenotypes and suggest that it could be a broadly relevant therapeutic target.

      Weakness

      (3) What is unclear, however, is how Pfdn5 loss or even overexpression affects the pathological Tau phenotypes. Does Pfdn5 (or 6) interact directly with TauV377M? Colocalization within tissues is a start, but immunoprecipitations would provide additional independent evidence that this is so.

      We appreciate this important suggestion. To investigate a potential direct interaction between Pfdn5 and Tau<sup>V377M</sup>, we performed co-immunoprecipitation experiments using lysates from adult fly brain expressing hTau<sup>V337M</sup>. Under the conditions tested, we did not detect a direct physical interaction. While this does not support a direct interaction, it does not strongly refute it either. We note that Pfdn5 and Tau are colocalized within axons (Figure S13J-K). At this stage, we are unable to resolve the issue of direct vs indirect association. If indirect, then Tau and Pfdn5 act within the same subcellular compartments (axon); if direct, then either only a small fraction of the total cellular proteins is in the Tau-Pfdn5 complex and therefore difficult to detect in bulk protein westerns, or the interactions are dynamic or occur in conditions that we have not been able to mimic in vitro. 

      (4) Does Pfdn5 loss exacerbate Tau<sup>V377M</sup> phenotypes because it destabilizes microtubules, which are already at least partially destabilized by Tau expression? Rescue of the phenotypes by overexpression of Pfdn5 agrees with this notion. 

      However, Cowan et al (2010) pmid: 20617325 demonstrated that wildtype Tau accumulation in larval motor neurons indeed destabilizes microtubules in a Tau phosphorylation-dependent manner. So, is Tau<sup>V377M</sup> hyperphosphorylated in the larvae?? What happens to Tau<sup>V377M</sup> phosphorylation when Pfdn5 is missing and presumably more Tau is soluble and subject to hyperphosphorylation as predicted by the above?

      We completely agree that it is important to link Tau-induced phenotypes with microtubule destabilization and the phosphorylation state of Tau. We performed immunostaining with a Futsch antibody to check microtubule organization at the NMJ and observed a severe reduction in Futsch intensity when Tau<sup>V337M</sup> was expressed in the Pfdn5 mutant (Elav-Gal4>Tau<sup>V337M</sup>; ΔPfdn5<sup>15/40</sup>), suggesting that the absence of Pfdn5 exacerbates the hTau<sup>V337M</sup> defects through greater microtubule destabilization (Figure S6F-J).

      We have performed additional experiments to examine the phosphorylation state of hTau in Drosophila larval axons. Immunocytochemistry indicated that only a subset of hTau aggregates in Pfdn5 mutants (Elav-Gal4>Tau<sup>V337M</sup>; ΔPfdn5<sup>15/40</sup>) are recognized by phospho-hTau antibodies. For instance, the AT8 antibody (targeting pSer202/pThr205) (Goedert et al., 1995) labelled only a subset of aggregates identified by the total hTau antibody (D5D8N) (Figure S9A-E). Moreover, feeding these larvae (Elav-Gal4>Tau<sup>V337M</sup>; ΔPfdn5<sup>15/40</sup>) with LiCl, which blocks GSK3β, still showed robust Tau aggregation (Figure S9F-J).

      These results imply that: a) soluble phospho-hTau levels in Pfdn5 mutants are low and not reliably detected with a single phosphorylation-specific antibody; b) loss of Pfdn5 results in Tau aggregation in a hyperphosphorylation-independent manner, similar to what has been reported earlier (LI et al. 2022); and c) the destabilization of microtubules in Elav-Gal4>Tau<sup>V337M</sup>; ΔPfdn5<sup>15/40</sup> results in Tau dissociation and aggregate formation. These data and conclusions have been incorporated into the revised manuscript.

      (5) Expression of WT human Tau (which is associated with most common Tauopathies other than FTDP-17) as Cowan et al suggest has significant effects on microtubule stability, but such Tau-expressing larvae are largely viable. Will one mutant copy of the Pfdn5 knockout enhance the phenotype of these larvae?? Will it result in lethality? Such data will serve to generalize the effects of Pfdn5 beyond the two FTDP-17 mutations utilized.

      We have now examined whether heterozygous loss of Pfdn5 (∆Pfdn5/+) enhances the effect of Tau expression. Each genotype alone (hTau<sup>V337M</sup>, hTau<sup>WT</sup> or ∆Pfdn5/+) is viable, and Elav-Gal4-driven expression of hTau<sup>V337M</sup> or hTau<sup>WT</sup> in a Pfdn5 heterozygous background does not cause lethality.

      (6) Does the loss of Pfdn5 affect TauV377M (and WTTau) levels?? Could the loss of Pfdn5 simply result in increased Tau levels? And conversely, does overexpression of Pfdn5 or 6 reduce Tau levels?? This would explain the enhancement and suppression of Tau<sup>V377M</sup> (and possibly WT Tau) phenotypes. It is an easily addressed, trivial explanation at the observational level, which, if true, begs for a distinct mechanistic approach.

      To test whether Pfdn5 modulates Tau phenotypes by altering Tau protein levels, we performed western blot analysis under Pfdn5 or Pfdn6 overexpression conditions and observed no change in hTau<sup>V337M</sup> levels (Figure 6O). However, in the absence of Pfdn5, both hTau<sup>V337M</sup> and hTau<sup>WT</sup> form large, insoluble aggregates that are not detected in soluble lysates by standard western blotting but are visualized by immunocytochemistry (Figure S7G-I). Thus, the apparent reduction in Tau levels on western blots reflects a solubility shift, not an actual decrease in Tau expression. These findings argue against a simple model in which Pfdn5 regulates Tau abundance and instead support a mechanism in which Pfdn5 loss changes Tau conformation, leading to its sequestration away from already destabilized microtubules.

      (7) Finally, the authors argue that Tau<sup>V377M</sup> forms aggregates in the larval brain based on large puncta observed especially upon loss of Pfdn5. This may be so, but protocols are available to validate molecularly the presence of insoluble Tau aggregates (for example, pmid: 36868851) or soluble Tau oligomers, as these apparently differentially affect Tau toxicity. Does Pfdn5 loss exaggerate the toxic oligomers, and overexpression promote the more benign large aggregates??

      We have performed additional experiments to analyze the nature of these aggregates using 1,6-hexanediol (1,6-HD). 1,6-HD can dissolve the Tau aggregate seeds formed by Tau droplets, but cannot dissolve stable Tau aggregates (WEGMANN et al. 2018). We observed that 5% 1,6-hexanediol failed to dissolve these Tau aggregates (Figure S8), demonstrating the formation of stable, filamentous, flame-shaped NFT-like aggregates in the absence of Pfdn5 (Figure 5D and Figure S9).

      Reviewer #2 (Public review):

      Bisht et al detail a novel interaction between the chaperone, Prefoldin 5, microtubules, and tau-mediated neurodegeneration, with potential relevance for Alzheimer's disease and other tauopathies. Using Drosophila, the study shows that Pfdn5 is a microtubule-associated protein, which regulates tubulin monomer levels and can stabilize microtubule filaments in the axons of peripheral nerves. The work further suggests that Pfdn5/6 may antagonize Tau aggregation and neurotoxicity. While the overall findings may be of interest to those investigating the axonal and synaptic cytoskeleton, the detailed mechanisms for the observed phenotypes remain unresolved and the translational relevance for tauopathy pathogenesis is yet to be established. Further, a number of key controls and important experiments are missing that are needed to fully interpret the findings.

      The strength of this study is the data showing that Pfdn5 localizes to axonal microtubules and the loss-of-function phenotypic analysis revealing disrupted synaptic bouton morphology. The major weakness relates to the experiments and claims of interactions with Tau-mediated neurodegeneration. 

      In particular, it is unclear whether knockdown of Pfdn5 may cause eye phenotypes independent of Tau. 

      Our new experiments confirm that knockdown of Pfdn5 alone does not cause eye phenotypes.

      Further, the GMR>tau phenotype appears to have been incorrectly utilized to examine age-dependent neurodegeneration.

      In response, we have modulated and explained our conclusions in this regard as described later in our “rebuttal.”

      This manuscript argues that its findings may be relevant to thinking about mechanisms and therapies applicable to tauopathies; however, this is premature given that many questions remain about the interactions from Drosophila, the detailed mechanisms remain unresolved, and absent evidence that Tau and Pfdn may similarly interact in the mammalian neuronal context. Therefore, this work would be strongly enhanced by experiments in human or murine neuronal culture or supportive evidence from analyses of human data.

      The reviewer is correct that the impact would be greater if Pfdn5-Tau interactions were also examined in human tissue.   While we have not attempted these experiments ourselves, we hope that our observations will stimulate others to test the conservation of phenomena we describe. There are, however, several lines of circumstantial evidence from human Alzheimer’s disease datasets that implicate PFDN5 in disease pathology. For example, recent compilations and analyses of proteomic data show reductions of CCT components, TBCE, as well as Prefoldin subunits, including PFDN5, in AD tissue (HSIEH et al. 2019; TAO et al. 2020; JI et al. 2022; ASKENAZI et al. 2023; LEITNER et al. 2024; SUN et al. 2024). Furthermore, whole blood mRNA expression data from Alzheimer's patients revealed downregulation of PFDN5 transcript (JI et al. 2022). Together, these findings from human data are consistent with the roles of PFDN5 in suppressing diverse neurodegenerative processes. We have incorporated these points into the discussion section of the revised manuscript.

      Reviewer #1 (Recommendations for the authors):

      See public review for experimental recommendations focusing on the Tau Pfdn interactions.  I would refrain from using the word aggregates, I would call them puncta, unless there is molecular or visual (ie AFM) evidence that they are indeed insoluble aggregates.  Finally, although including the full genotypes written out below the axis in the bar graphs is appreciated, it nevertheless makes them difficult to read due to crowding in most cases and somewhat distracting from the figure. 

      In my opinion, a more reader-friendly manner of reporting the phenotypes will be highly helpful. For example, listing each component of the genotype on the left of each bar graph and adding a cross or a filled circle under the bar to inform of the full genotype of the animals used.

      As described in the response to the previous comment, we now have strong direct evidence to support our view that the observed puncta are stable Tau aggregates. Thus, we feel justified in using the term Tau aggregates in preference to Tau puncta.

      We have tried to write the genotypes to make them more reader-friendly.

      Reviewer #2 (Recommendations for the authors):

      (1) Lines 119-121: 35 modifiers from 64 seem like an unusually high hit rate. Are these individual genes or lines? Were all modifiers supported by at least 2 independent RNAi strains targeting non-overlapping sequences? A supplemental table should be included detailing all genes and specific strains tested, with corresponding results.

      We agree with the reviewer that 35 modifiers from 64 genes is an unusually high hit rate. However, the genes knocked down in the study are chaperones, which are crucial for maintaining proteostasis, and this may explain the unusually high number of hits. The information related to individual genes and lines is provided in Supplemental Table 1. We have now included an additional Supplemental Table 3, which lists the genes and the RNAi lines used in Figure 1, detailing the sequence target information. The table also specifies the number of independent RNAi strains used and the corresponding results.

      (2) Figure 1: The authors quantify the areas of ommatidial fusion and necrosis as degeneration, but it is difficult to appreciate the aberrations in the photos provided. Was any consideration given to also quantifying eye size?

      We have processed the images to enhance their contrast and make the aberrations clearer. The percentage of degenerated eye area (Figure 1M) was normalized with total eye area. The method for quantifying degenerated area has been explained in the materials and methods section.

      (3) Figure 1: (a) Only enhancers of rough eyes are shown but no controls are included to evaluate whether knockdown of these genes causes eye toxicity in the absence of Tau. These are important missing controls. All putative Tau enhancers, including Pfdn5/6, need to be tested with GMR-GAL4 independently of Tau to determine whether they cause a rough eye. In a previous publication from some of the same investigators (Raut et al 2017), knockdown of Pfdn using ey-GAL4 was shown to induce severe eye morphology defects - this raises questions about the results shown here.

      We agree that assessing the effects of HSP knockdown independent of Tau is essential to confirm modifier specificity. We have now performed these knockdowns, and the data are reported in Supplemental Table 1. For RNAi lines represented in Figure 1, which enhanced Tau-induced degeneration/eye developmental defect, except for one of the RNAi lines against Pfdn6 (GD34204), no detectable eye defects were observed when knocked down with GMR-Gal4 at 25°C, suggesting that enhancement is specific to the Tau background. 

      The use of a more eye-specific GMR-Gal4 driver at 25°C here, versus the more broadly expressed ey-Gal4 at 29°C in prior work (Raut et al. 2017), likely explains the differences in the eye morphological defects.

      (b) Besides RNAi, do the classical Pfdn5 deletion alleles included in this work also enhance the tau rough eye when heterozygous? Please also consider moving the Pfdn5/6 overexpression studies to evaluate possible suppression of the Tau rough eye to Figure 1, as it would enhance the interpretation of these data (but see also below).

      GMR-Gal4-driven expression of hTau<sup>V337M</sup> or hTau<sup>WT</sup> in a Pfdn5 heterozygous background does not enhance the rough eye phenotype.

      (4) For genes of special interest, such as Pfdn5, and other genes mentioned in the results, the main figure, or discussion, it is also important to perform quantitative PCR to confirm that the RNAi lines used actually knock down mRNA expression and by how much. These studies will establish specificity.

      We agree that confirming RNAi efficiency via quantitative PCR (qPCR) is essential for validating the knockdown efficiency. We have now included qPCR data, especially for key modifiers, confirming effective knockdown (Figure S2).

      (5) Lines 235-238: how do you conclude whether the tau phenotype is "enhanced" when Pfdn5 causes a similar phenotype on its own? Could the combination simply be additive? Did overexpression of Pfdn5 suppress the UAS-hTau NMJ bouton phenotype (see below)?

      Although Pfdn5 mutants and hTau expression individually increase satellite boutons, their combination produces a significantly more severe phenotype with additional defects, such as significantly decreased bouton size and increased bouton number, indicating an enhancing rather than purely additive interaction (Figure 4 and Figure S6C). Moreover, we now show that overexpression of Pfdn5 significantly suppressed the hTau<sup>V337M</sup>-induced NMJ phenotypes. These new data have been incorporated as Figure S11F-L in the revised manuscript.

      Alternatively, did the authors consider reducing fly tau in the Pfdn5 mutant background?

      In new additional experiments, we observe that double mutants for Drosophila Tau (dTau) and Pfdn5 also exhibit severe NMJ defects, suggesting genetic interactions between dTau and Pfdn5. This data is shown below for the reviewer.

      Author response image 1.

      A double mutant combination of dTau and Pfdn5 aggravates the synaptic defects at the Drosophila NMJ. (A-D') Confocal images of NMJ synapses at muscle 4 of A2 hemisegment showing synaptic morphology in (A-A') control, (B-B') ΔPfdn5<SUP>15/40</SUP>, (C-C') dTauKO/dTauKO (Drosophila Tau mutant), (D-D') dTauKO/dTauKO; ∆Pfdn5<SUP>15/40</SUP> double immunolabeled for HRP (green), and CSP (magenta). The scale bar in D for (A-D') represents 10 µm. 

      (6) It may be important to further extend the investigation to the actin cytoskeleton. It is noted that Pfdn5 also stabilizes actin. Importantly, tau-mediated neurodegeneration in Drosophila also disrupts the actin cytoskeleton, and many other regulators of actin modify tau phenotypes.

      We appreciate the suggestion to examine the actin cytoskeleton. While prior studies indicate that Pfdn5 might regulate the actin cytoskeleton and that Tau<sup>V377M</sup> hyperstabilizes the actin cytoskeleton, we did not observe altered actin levels in Pfdn5 mutants (Figure 2G). However, actin dynamics may represent an additional mechanism through which Pfdn5 might temporally influence Tauopathy. Future work will address potential actin-related mechanisms in Tauopathy.

      (7) Figure 2: in the provided images, it is difficult to appreciate the futsch loops. Please include an image with increased magnification. It appears that fly strains harboring a genomic rescue BAC construct are available for Pfdn-this would be a complementary reagent to test besides Pfdn overexpression.

      We have updated Figure 2 to include high magnification NMJ images as insets, clearly showing the Futsch loops. While we have not yet tested a genomic rescue BAC construct for Pfdn5, we plan to use the fly line harboring this construct in future work.

      (8) Figure 3: Some of the data is not adequately explained. The use of Ran as a loading control seems rather unusual. What is the justification? Pfdn appears to only partially co-localize with a-tubulin in the axon; can the authors discuss or explain this? Further, in Pfdn5 mutants, there appears to be a loss of a-tubulin staining (3b'); this should also be discussed.

      We appreciate the reviewer's concern regarding the choice of loading control for our Western blot analysis. Importantly, since Tubulin levels and related pathways were the focus of our analysis, traditional loading controls such as α- or β-tubulin or actin were deemed unsuitable due to potential co-regulation. Ran, a nuclear GTPase involved in nucleocytoplasmic transport, is not known to be transcriptionally or post-translationally regulated by Tubulin-associated signaling pathways. To ensure its reliability as a loading control, we confirmed by densitometric analysis that Ran expression showed minimal variability across all samples. Hence, we used Ran for accurate normalization in the Western blot data represented in this manuscript. We have also used GAPDH as a loading control and found no difference with respect to Ran as a loading control across samples.

      We appreciate the reviewer's comment regarding the interpretation of our Pearson's correlation coefficient (PCC) results. While the mean colocalization value of 0.6 represents a moderate positive correlation (MUKAKA 2012), which may not reach the conventional threshold for "high positive" colocalization (usually considered 0.7-0.9), it nonetheless indicates substantial spatial overlap between the proteins of interest. Importantly, colocalization analysis provides supportive but indirect evidence for molecular proximity.  To further validate the interaction, we performed a microtubule binding assay, which directly demonstrates the binding of Pfdn5 to stabilized microtubules.
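      For readers unfamiliar with the Pearson's correlation coefficient used for colocalization here, the following is a minimal sketch of how it can be computed from two intensity channels. It assumes two equally sized image arrays and an optional region-of-interest mask; the function name and the synthetic test images are hypothetical and do not reflect the authors' actual image-analysis pipeline.

```python
import numpy as np

def pearson_colocalization(channel_a, channel_b, mask=None):
    """Pearson's correlation coefficient between two image channels.

    channel_a, channel_b: arrays of pixel intensities with the same shape.
    mask: optional boolean array selecting the pixels to include.
    """
    a = np.asarray(channel_a, dtype=float).ravel()
    b = np.asarray(channel_b, dtype=float).ravel()
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).ravel()
        a, b = a[keep], b[keep]
    a = a - a.mean()  # center each channel on its mean intensity
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

# Synthetic 4x4 "images": a linearly rescaled copy correlates perfectly,
# and an inverted copy anti-correlates perfectly.
img = np.arange(16, dtype=float).reshape(4, 4)
pcc_same = pearson_colocalization(img, 2.0 * img + 3.0)
pcc_inverted = pearson_colocalization(img, -img)
print(pcc_same, pcc_inverted)
```

      Because the coefficient is insensitive to linear rescaling of either channel, a value of about 0.6 reflects shared spatial variation in intensity rather than absolute signal levels, which is one reason it is treated above as supportive but indirect evidence.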

      In accordance with the western blot analysis shown in Figure 2G-I, the levels of Tubulin are reduced in the Pfdn5 mutants (Figure 3B''). We have incorporated and discussed this in the revised manuscript.

      (9) Figure 4: Overexpression of Pfdn appears to rescue the supernumerary satellite bouton numbers induced by human Tau; however, interpretation of this experiment is somewhat complicated as it is performed in Pfdn mutant genetic background. Can overexpression of Pfdn on its own rescue the Tau bouton defect in an otherwise wildtype background?

      We have now coexpressed Pfdn5 and hTau<SUP>V337M</SUP> in an otherwise wild-type background. As shown in Figure S11F-L, Pfdn5 overexpression suppresses Tau-induced bouton defects. We have incorporated the data in the Results section to support the role of Pfdn5 as a modifier of Tau toxicity.

      (10) Lines 256-263 / Figure 5: (a) What exactly are these tau-positive structures (punctae) being stained in larval brains in Fig 5C-E? Most prior work on tau aggregation using Drosophila models has been done in the adult brain, and human wildtype or mutant Tau is not known to form significant numbers of aggregates in neurons (although aggregates have been described following glia tau expression). 

      Therefore, the results need to be further clarified. Besides the provided schematic, a zoomed-out image showing the whole larval brain is needed here for orientation. Have these aggregates been previously characterized in the literature? 

      We agree with the reviewer that the expression of the wildtype or mutant form of human Tau in Drosophila is not known to form aggregates in the larval brain, in contrast to the adult brain (JACKSON et al. 2002; OKENVE-RAMOS et al. 2024). Consistent with previous reports, we also observed that Tau expression on its own does not form aggregates in the Drosophila larval brain.

      However, in the absence of Pfdn5, microtubule disruption is severe, leading to reduced Tau-microtubule binding and formation of globular/round or flame-shaped tangle-like aggregates in the larval brain. Previous studies have reported that 1,6-hexanediol can dissolve the Tau aggregate seeds formed by Tau droplets, but cannot dissolve stable Tau aggregates (WEGMANN et al. 2018). We observed that 5% 1,6-hexanediol failed to dissolve these Tau puncta, demonstrating the formation of stable aggregates in the absence of Pfdn5. Additionally, we have now performed a Tau solubility assay and show that, in the absence of Pfdn5, a significant amount of Tau partitions into the pellet fraction. This pelleted Tau could not be detected by the phospho-specific AT8 Tau antibody (targeting pSer202/pThr205) but was detected by the total hTau antibody (D5D8N) on western blots (Figure S8). These data further reinforce our conclusion that Pfdn5 prevents the transition of hTau from a soluble and/or microtubule-associated state to an aggregated, insoluble, and pathogenic state. These new data have been incorporated into the revised manuscript.

      (b) Can additional markers (nuclei, cell membrane, etc.) be used to highlight whether the tau-positive structures are present in the cell body or at synapses?

      We performed the co-staining of Tau and Elav to assess the aggregated Tau localization. We found that in the presence of Pfdn5, Tau is predominantly cytoplasmic and localised to the cell body and axons. In the absence of Pfdn5, Tau forms aggregates but is still localized to the cell body or axons. However, some of the aggregates are very large, and the subcellular localization could not be determined (Figure S8M-N'). These might represent brain regions of possible nuclear breakdown and cell death (JACKSON et al. 2002).

      (c) It would also be helpful to perform western blots from larval (and adult) brains examining tau protein levels, phospho-tau species, possible higher-molecular weight oligomeric forms, and insoluble vs. soluble species. These studies would be especially important to help interpret the potential mechanisms of observed interactions.

      Western blot analysis revealed that overexpression of Pfdn5 does not alter total Tau levels (Figure 6O). In Pfdn5 mutants, however, hTau<sup>V337M</sup> levels were reduced in the supernatant fraction and increased in the pellet fraction, indicating a shift from soluble monomeric Tau to aggregated Tau.

      (d) Does overexpression of Pdn5 (UAS-Pdn5) suppress the formation of tau aggregates? I would therefore recommend that additional experiments be performed looking at adult flies (perhaps in Pfdn5 heterozygotes or using RNAi due to the larval lethality of Pdn5 null animals).

      Overexpression of Pfdn5 significantly reduced the Tau aggregates (Elav-Gal4/UAS-Tau<sup>V337M</sup>; UAS-Pfdn5; DPfdn5<sup>15/40</sup>) observed in Pfdn5 mutants (Figure 5E). Coexpression of Pfdn5 and hTau<sup>V337M</sup> suppresses the Tau aggregates/puncta in the 30-day adult brain. Since heterozygous DPfdn<sup>15</sup>/+ did not show a reduction in Pfdn5 levels, we did not test the suppression of Tau aggregates in DPfdn<sup>15</sup>/+; Elav>UAS-Pfdn5, UAS-Tau<sup>V337M</sup>.

      (11) Figure 6, panels A-N: The GMR>Tau rough eye is not a "neurodegenerative" but rather a predominantly developmental phenotype. It results from aberrant retinal developmental patterning and the subsequent secretion/formation of the overlying eye cuticle (lenslets). I am confused by the data shown suggesting a "shrinking eye size" and increasing roughened surface over time (a GMR>tau eye similar to that shown in panel B cannot change to appear like the one in panel H with aging). The rough eye can be quite variable among a population of animals, but it is usually fixed at the time the adult fly ecloses from the pupal case, and quite stable over time in an individual animal. Therefore, any suppression of the Tau rough eye seen at 30 days should be appreciable as soon as the animals eclose. These results need to be clarified. If indeed there is robust suppression of Tau rough eye, it may be more intuitive and clearer to include these data with Figure 1, when first showing the loss-of-function enhancement of the Tau rough eye. Also, why is Pfdn6 included in these experiments but not in the studies shown in Figures 2-5?

      We thank the reviewer for their careful and knowledgeable assessment of the GMR>Tau rough eye model. We appreciate the clarification that the rough eye phenotype could be “developmental” rather than “neurodegenerative.” Our initial observations regarding "shrinking eye size" and "increased surface roughness" clearly show age-related progression of structural change. Such progression has been observed and reported by others (IIJIMA-ANDO et al. 2012; PASSARELLA AND GOEDERT 2018). We observed an age-dependent increase in the number of fused ommatidia in GMR-Gal4>Tau flies, an increase that was rescued by Pfdn5 or Pfdn6 expression. We noted that adult-specific induction of hTau<sup>V337M</sup> using the Gal80<sup>ts</sup> and GMR-GeneSwitch (GMR-GS) systems was not sufficient to induce a significant eye phenotype; thus, early expression of Tau in the developing eye imaginal disc appears to be required for the adult progressive phenotype that we observe. We feel that it is inadequate to refer to this adult progressive phenotype as “developmental,” although it is admittedly arguable whether it can be termed “degenerative.”

      To address neurodegeneration more directly, we focused on 30-day-old adult fly brains and demonstrated that Pfdn5 overexpression suppresses age-dependent Tau-induced neurodegeneration in the central nervous system (Figure 6H-N and Figure S12). This supports our central conclusion regarding the neuroprotective role of Pfdn5 in age-associated Tau pathology. Since we found an enhancement in the Tau-induced synaptic and eye phenotypes by Pfdn6 knockdown, we also generated CRISPR/Cas9-mediated loss-of-function mutants for Pfdn6. However, loss of Pfdn6 resulted in embryonic/early first instar lethality, which precluded its detailed analysis at the larval stages.

      (12) Figure 6, panels O-T: the elav>tau image appears to show a different frontal section plane compared to the other panels. It is advisable to show images at a similar level in all panels since vacuolar pathology can vary by region. It is also useful to be able to see the entire brain at a lower power, but the higher power inset view is obscuring these images. I would recommend creating separate panels rather than showing them as insets.

      In the revised figure, we now display the low- and high-magnification images as separate, clearly labeled panels instead of using insets. This improves visibility of the brain morphology while providing detailed views of the vacuolar pathology (Figure 6H-L).

      (13) Figure 6/7: For the experiments in which Pfdn5/6 is overexpressed and possibly suppresses tau phenotypes (brain vacuoles and memory), it is important to use controls that normalize the number of UAS binding sites, since increased UAS sites may dilute GAL4 and reduced Tau expression levels/toxicity. Therefore, it would be advisable to compare with Elav>Tau flies that also include a chromosome with an empty UAS site or other transgenes, such as UAS-GFP or UAS-lacZ.

      We thank the reviewer for the suggestion. Now we have incorporated proper controls in the brain vacuolization, the mushroom body, and ommatidial fusion rescue experiments. Also, we have independently verified whether Gal4 dilution has any effect on the Tau phenotypes (Figure 6H-L, Figure 7, and Figure S11A-B).

      (14) Lines 311-312: the authors say vacuolization occurs in human neurodegenerative disease, which is not really true to my knowledge and definitely not stated in the citation they use. Please re-phrase.

      Now we have made the appropriate changes in the revised manuscript.

      (15) Figure 7: The authors claim that Pfdn5/6 expression does not impact memory behavior, but there in fact appears to be a decrease in preference index (panel D vs panel B). Does this result complicate the interpretation of the potential interaction with Tau (panel F). Are data from wildtype control flies available?

      In our memory assay, a decrease in performance index (PI) of the trained flies compared to the naïve flies indicates memory formation (normal memory in control flies, Figure 7B). In contrast, a lack of significant difference in PI indicates a memory defect (Figure 7C: hTau<sup>V337M</sup> overexpressed flies). "Decrease in preference index (panel D vs panel B)" is not a sign of memory defect; it may be interpreted as a better memory instead. Hence, neuronal overexpression of Pfdn5 (Figure 7D) or Pfdn6 (Figure 7E) in wildtype neurons does not cause memory deficits. In addition, coexpression of Pfdn5/6 and hTau<sup>V337M</sup> successfully rescues the Tau-induced memory defect (significant drop in PI compared to the PI of naïve flies in Figure 7F-G). Moreover, almost complete rescue of the Tau-induced mushroom body defect on Pfdn5 or Pfdn6 expression further establishes potential interaction between Pfdn5/6 and Tau. This data has been incorporated into the revised manuscript.

      The memory assay itself, with extensive data on wildtype flies and various other genotypes, will shortly be submitted for publication in another manuscript (Majumder et al, manuscript under preparation); however, we can confirm for the reviewer that wildtype flies, trained and assayed by the protocol described, show a significant decrease in performance index compared to the naïve flies, indicative of strong learning and memory performance, very similar to the control genotype data shown in Figure 7B.

      Additional minor considerations

      (16) Lines 50-52: there are many therapeutic interventions for treating tauopathies, but not curative or particularly effective ones.

      Now we have made the appropriate changes in the revised manuscript.

      (17) Lines 87-106 seem like a duplication of the abstract. Consider deleting or condensing.

      We have made the appropriate changes in the revised manuscript.

      (18) Where is pfdn5 expressed? Development v. adult? Neuron v. glia? Conservation?

      Prefoldin5 is expressed throughout development but strongly localized to the larval trachea and neuronal axons. Drosophila Pfdn5 shows 35% overall identity with human PFDN5. 

      (19) Line 187: is pfdn5 truly "novel"?

      The role of Pfdn5 as microtubule-binding and stabilizing is a new finding and has not been predicted or described before. Hence, it is a novel neuronal microtubule-associated protein.  

      (20) Figure 5, panel F, genotype labels on the x-axis are confusing; consider simplifying to Control, DPfdn, and Rescue.

      We have made appropriate changes in the figure for better readability.

      (21) Figures 5/8: it might be preferable to use consistent colors for Tau/HRP--Tau is labeled green in Figure 5 and then purple in Figure 8.

      We have made these changes where possible. 

      (22) Lines 311-312: Vacuolar neuropathology is NOT typically observed in human Tauopathy.

      We thank the reviewer for pointing this out. We have made the appropriate changes in the revised manuscript.

      (23) Lines 328-349: The explanation could be made more clear. Naïve flies should not necessarily be called controls. Also, a more detailed explanation of how the preference index is computed would be helpful. Why are some datapoints negative values?

      (a) We have rewritten this paragraph to make the description and explanation clearer. The detailed method and formula to calculate the Preference index have been incorporated in the Materials and Methods section.

      (b) We have replaced the term Control with Naïve. 

      (c) Datapoints with negative values appeared in some of the 'Trained' group flies. They indicate that, post-CuSO<sub>4</sub> training, some groups showed repulsion towards the otherwise attractive odor 2,3B. As 2,3B is an attractive odorant, naïve or control flies show attraction towards it compared to air, which is evident from a higher number of flies in the Odor arm (O) compared to that of the Air arm (A) of the Y-maze; thus, the PI [(O-A)/(O+A)*100] is positive in the case of naïve fly groups. Training of the flies led to an association of the attractive odorant with bitter food, leading to a decrease of attraction, and even repulsion towards the odorant in a few instances, resulting in a lower fly count in the odor arm compared to the air arm. Hence, the PI becomes negative, as (O-A) is negative in such instances. Thus, it is not an anomaly but indicates strong learning.
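
      The sign behavior of the PI described in (c) can be illustrated with a minimal sketch. The function name and the fly counts below are hypothetical and chosen only for illustration; they are not data from the study. The formula is the one stated above, PI = (O-A)/(O+A)*100.

```python
def preference_index(odor_count: int, air_count: int) -> float:
    """Return PI = (O - A) / (O + A) * 100 for one group of flies.

    odor_count: flies counted in the Odor arm (O) of the Y-maze.
    air_count:  flies counted in the Air arm (A) of the Y-maze.
    """
    total = odor_count + air_count
    if total == 0:
        raise ValueError("no flies were counted in either arm")
    return 100 * (odor_count - air_count) / total


# Hypothetical counts, for illustration only (not data from the study).
naive_pi = preference_index(odor_count=30, air_count=10)    # more flies in the odor arm
trained_pi = preference_index(odor_count=8, air_count=12)   # fewer flies in the odor arm

print(naive_pi)    # positive: attraction to the odorant
print(trained_pi)  # negative: repulsion after training
```

      With these assumed counts, the naïve group gives a positive PI and the trained group a negative PI, because (O-A) changes sign once fewer flies enter the odor arm than the air arm.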

      (24) Line 403: misspelling "Pdfn"

      We have corrected this.

      (25) Lines 423-425: recommend re-phrasing, since tauopathies are human diseases. Mice and other animal models may be susceptible to tau-mediated neuronal dysfunction but not Tauopathy, per se.

      We have made the appropriate changes in the revised manuscript.

      (26) Lines 468-469: "tau neuropathology" rather than "tau associated neuropathies".

      We have made the appropriate changes in the revised manuscript. 

      References

      Askenazi, M., T. Kavanagh, G. Pires, B. Ueberheide, T. Wisniewski et al., 2023 Compilation of reported protein changes in the brain in Alzheimer's disease. Nat Commun 14: 4466.

      Hsieh, Y. C., C. Guo, H. K. Yalamanchili, M. Abreha, R. Al-Ouran et al., 2019 Tau-Mediated Disruption of the Spliceosome Triggers Cryptic RNA Splicing and Neurodegeneration in Alzheimer's Disease. Cell Rep 29: 301-316 e310.

      Iijima-Ando, K., M. Sekiya, A. Maruko-Otake, Y. Ohtake, E. Suzuki et al., 2012 Loss of axonal mitochondria promotes tau-mediated neurodegeneration and Alzheimer's disease-related tau phosphorylation via PAR-1. PLoS Genet 8: e1002918.

      Jackson, G. R., M. Wiedau-Pazos, T. K. Sang, N. Wagle, C. A. Brown et al., 2002 Human wildtype tau interacts with wingless pathway components and produces neurofibrillary pathology in Drosophila. Neuron 34: 509-519.

      Ji, W., K. An, C. Wang and S. Wang, 2022 Bioinformatics analysis of diagnostic biomarkers for Alzheimer's disease in peripheral blood based on sex differences and support vector machine algorithm. Hereditas 159: 38.

      Leitner, D., G. Pires, T. Kavanagh, E. Kanshin, M. Askenazi et al., 2024 Similar brain proteomic signatures in Alzheimer's disease and epilepsy. Acta Neuropathol 147: 27.

      Li, L., Y. Jiang, G. Wu, Y. A. R. Mahaman, D. Ke et al., 2022 Phosphorylation of Truncated Tau Promotes Abnormal Native Tau Pathology and Neurodegeneration. Mol Neurobiol 59: 6183-6199.

      Mershin, A., E. Pavlopoulos, O. Fitch, B. C. Braden, D. V. Nanopoulos et al., 2004 Learning and memory deficits upon TAU accumulation in Drosophila mushroom body neurons. Learn Mem 11: 277-287.

      Mukaka, M. M., 2012 Statistics corner: A guide to appropriate use of correlation coefficient in medical research. Malawi Med J 24: 69-71.

      Okenve-Ramos, P., R. Gosling, M. Chojnowska-Monga, K. Gupta, S. Shields et al., 2024 Neuronal ageing is promoted by the decay of the microtubule cytoskeleton. PLoS Biol 22: e3002504.

      Passarella, D., and M. Goedert, 2018 Beta-sheet assembly of Tau and neurodegeneration in Drosophila melanogaster. Neurobiol Aging 72: 98-105.

      Sun, Z., J. S. Kwon, Y. Ren, S. Chen, C. K. Walker et al., 2024 Modeling late-onset Alzheimer's disease neuropathology via direct neuronal reprogramming. Science 385: adl2992.

      Tao, Y., Y. Han, L. Yu, Q. Wang, S. X. Leng et al., 2020 The Predicted Key Molecules, Functions, and Pathways That Bridge Mild Cognitive Impairment (MCI) and Alzheimer's Disease (AD). Front Neurol 11: 233.

      Wegmann, S., B. Eftekharzadeh, K. Tepper, K. M. Zoltowska, R. E. Bennett et al., 2018 Tau protein liquid-liquid phase separation can initiate tau aggregation. EMBO J 37.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Epigenetic regulation complex (PRC2) is essential for neural crest specification, and its misregulation has been shown to cause severe craniofacial defects. This study shows that Eed, a core PRC2 component, is critical for craniofacial osteoblast differentiation and mesenchymal proliferation after neural crest induction. Using mouse genetics and single-cell RNA sequencing, the researcher found that conditional knockout of Eed leads to significant craniofacial hypoplasia, impaired osteogenesis, and reduced proliferation of mesenchymal cells in post-migratory neural crest populations.

      Overall, the study is superficial and descriptive. No in-depth mechanism was analyzed and the phenotype analysis is not comprehensive.

      We thank the reviewer for sharing their expertise and for taking the time to provide helpful suggestions to improve our study. We are gratified that the striking phenotypes we report from Eed loss in post-migratory neural crest craniofacial tissues were appreciated. The breadth and depth of our phenotyping techniques, including skeletal staining, micro-CT, echocardiogram, immunofluorescence, histology, and primary craniofacial cell culture, provide comprehensive data in support of our hypothesis that PRC2 is required for epigenetic control of craniofacial osteoblast differentiation. To provide mechanistic data in support of this hypothesis, we have now performed CUT&Tag H3K27me3 chromatin profiling on nuclei harvested from E12.5 or E16.5 Sox10-Cre Eed<sup>Fl/WT</sup> and Sox10-Cre Eed<sup>Fl/Fl</sup> craniofacial tissue. These new data, which are presented in Fig. 5, Supplementary Fig. 9, and Supplementary Tables 7-10 of our revised manuscript, validate our hypothesis that epigenetic regulation of chromatin architecture downstream of PRC2 activity underlies craniofacial osteoblast differentiation. In particular, we now show that Eed-dependent H3K27me3 methylation is associated with correct temporal expression of transcription factors that are necessary for craniofacial differentiation and patterning, such as Msx1, Pitx1, and Pax7, which were initially nominated by single-cell RNA sequencing of E12.5 Sox10-Cre Eed<sup>Fl/WT</sup> and Sox10-Cre Eed<sup>Fl/Fl</sup> craniofacial tissues in Fig. 4, Supplementary Fig. 5-7, and Supplementary Tables 1-6.

      Reviewer #2 (Public review):

      Summary:

      The role of PRC2 in post-neural crest induction was not well understood. This work developed an elegant mouse genetic system to conditionally deplete EED upon SOX10 activation. Substantial developmental defects were identified for craniofacial and bone development. The authors also performed extensive single-cell RNA sequencing to analyze differentiation gene expression changes upon conditional EED disruption.

      Strengths:

      (1) Elegant genetic system to ablate EED post neural crest induction.

      (2) Single-cell RNA-seq analysis is extremely suitable for studying the cell type-specific gene expression changes in developmental systems.

      We thank the reviewer for their generous and helpful comments on our study. We are happy that our mouse genetic and single-cell RNA sequencing approaches were appropriate in pairing the craniofacial phenotypes we report with distinct gene expression changes in post-migratory neural crest tissues upon Eed deletion.

      Weaknesses:

      (1) Although this study is well designed and contains state-of-the-art single-cell RNA-seq analysis, it lacks the mechanistic depth in the EED/PRC2-mediated epigenetic repression. This is largely because no epigenomic data was shown.

      Thank you for this suggestion. As described in response to Reviewer #1, we have now performed CUT&Tag H3K27me3 chromatin profiling on nuclei harvested from E12.5 or E16.5 Sox10-Cre Eed<sup>Fl/WT</sup> and Sox10-Cre Eed<sup>Fl/Fl</sup> craniofacial tissues to provide mechanistic epigenomic data in support of our hypothesis that PRC2 is required for craniofacial osteoblast differentiation. These new data, which are presented in Fig. 5, Supplementary Fig. 9, and Supplementary Tables 7-10 of our revised manuscript, integrate genome-wide and targeted metaplot visualizations across genotypes with in-depth analyses of methylation-rich regions and genes associated with methylation-rich loci. Broadly, these new data reveal that changes in H3K27me3 occupancy correlate with gene expression changes from single-cell RNA sequencing of E12.5 Sox10-Cre Eed<sup>Fl/WT</sup> and Sox10-Cre Eed<sup>Fl/Fl</sup> craniofacial tissues in Fig. 4, Supplementary Fig. 5-7, and Supplementary Tables 1-6.

      (2) The mouse model of conditional loss of EZH2 in neural crest has been previously reported, as the authors pointed out in the discussion. What is novel in this study to disrupt EED? Perhaps a more detailed comparison of the two mouse models would be beneficial.

      We acknowledge and cite the study the reviewer has indicated (Schwarz et al. Development 2014) in our initial and revised manuscripts. This elegant investigation uses Wnt1-Cre to delete Ezh2 and reports a phenotype similar to the one we observed with Sox10-Cre deletion of Eed, but our study adds depth to the understanding of PRC2’s vital role in neural crest development by ablating Eed, which has a unique function in the PRC2 complex by binding to H3K27me3 and allosterically activating Ezh2. In this sense, our study sheds light on whether phenotypes arising from deletion of Eed, the PRC2 “reader”, differ from phenotypes arising from deletion of Ezh2, the PRC2 “writer”, in neural crest derived tissues. Moreover, we provide the first single-cell RNA sequencing and epigenomic investigations of craniofacial phenotypes arising from PRC2 activity in the developing neural crest. Due to limitations associated with the Wnt1-Cre transgene (Lewis et al. Developmental Biology 2013), which targets pre-migratory neural crest cells, our investigations used Sox10-Cre, which targets the migratory neural crest and is completely recombined by E10.5. We have included a detailed comparison of these mouse models in the Discussion section of our revised manuscript, and we thank the reviewer for this thoughtful suggestion.

      (3) The presentation of the single-cell RNA-seq data may need improvement. The complexity of the many cell types blurs the importance of which cell types are affected the most by EED disruption.

      We thank the reviewer for the opportunity to improve the presentation of our single-cell RNA sequencing data. In response, we have added Supplementary Fig. 8 to our revised manuscript, which shows the cell clusters most affected by EED disruption in UMAP space across genotypes. Because we wanted to capture the full diversity of cell types underlying the phenotypes we report, we did not sort Sox10+ cells (via FACS, for example) from craniofacial tissues before single-cell RNA sequencing. Our resulting single-cell RNA sequencing data are therefore inclusive of a diversity of cell types in UMAP space, and the prevalence of many of these cell types was unaffected by epigenetic disruption of neural crest derived tissues. The prevalence of the cell clusters that are most affected across genotypes and which are most relevant to our analyses of the developing neural crest are shown in Fig. 4c (and now also in Supplementary Fig. 8), including C0 (differentiating osteoblasts), C4 (mesenchymal stem cells), C5 (mesenchymal stem cells), and C7 (proliferating mesenchymal stem cells). Marker genes and pseudobulked differential expression analyses across these clusters are shown in Fig. 4d and Fig. 4e-h, respectively.

      (4) While it's easy to identify PRC2/EED target genes using published epigenomic data, it would be nice to tease out the direct versus indirect effects in the gene expression changes (e.g Figure 4e).

      We agree with the reviewer that the single-cell RNA sequencing data in our initial submission do not provide insight into direct versus indirect changes in gene expression downstream of PRC2. In contrast, the CUT&Tag chromatin profiling data that we have generated for this revision provides mechanistic insight into H3K27me3 occupancy and direct effects on gene expression resulting from PRC2 inactivation in our mouse models.

      REVIEWING EDITOR COMMENTS

      The following are recommended as essential revisions

      (1) The study is overall superficial and primarily descriptive, lacking in-depth mechanistic analysis and comprehensive phenotype evaluation.

      Please see responses to Reviewer #1 and Reviewer #2 (weaknesses 1 and 4) above. 

      (2) The authors did not investigate the temporal and spatial expression of Eed during cranial neural crest development, which is crucial for explaining the observed phenotypes.

      The temporal and spatial expression of Eed during embryogenesis is well studied. Eed is ubiquitously expressed starting at E5.5, peaks at E9.5, and is downregulated but maintained at a high basal expression level through E18.5 (Schumacher et al. Nature 1996). Although comprehensive analysis of Eed expression in neural crest tissues has not been reported (to our knowledge), Eed physically and functionally interacts with Ezh2 (Sewalt et al. Mol Cell Biol 1998), which is enriched at a diversity of timepoints throughout all developing craniofacial tissues (Schwarz et al. Development 2014). In our study, we confirmed enrichment of Eed expression in craniofacial tissues throughout development using QPCR, and have provided a more detailed description of these published and new findings in the Discussion section of our revised manuscript. 

      (3) There is no apoptosis analysis provided for any of the samples.

      We evaluated the presence of apoptotic cells in E12.5 craniofacial sections using immunofluorescence for Cleaved Caspase 3 in Supplementary Fig. 3d. Although we found a modest increase in the labeling index of apoptotic cells, there was insufficient evidence to conclude that apoptosis is a substantial factor in craniofacial hypoplasia resulting from Eed loss in post-migratory neural crest craniofacial tissues. We have clarified these findings in the Results and Discussion sections of our revised manuscript. 

      (4) As Eed is a core component of the PRC2 complex, were any other components altered in the Eed cKO mutant? How does Eed regulation influence osteogenic differentiation and proliferation through known pathways?

      We thank the editors for this thoughtful inquiry. Although we did not specifically investigate expression or stability of other PRC2 components in Eed conditional mutants, and little is known about how Eed regulates osteogenic differentiation or proliferation through any pathway, our single-cell RNA sequencing data presented in Fig. 4, Supplementary Fig. 5-7, and Supplementary Tables 1-6 provide a significant conceptual advance with mechanistic implications for understanding bone development downstream of Eed and do not reveal any alterations in the expression of other PRC2 components across genotypes. We have clarified these important details in the Discussion section of our revised manuscript. 

      (5) The authors may compare the Eed cKO phenotype with that of the previous EZH2 cKO mouse model since both Eed and EZH2 are essential subunits of PRC2.

      Please see responses to editorial comment 2 above and the last paragraph of the Discussion section of our revised manuscript for comparisons between Eed and Ezh2 knockout phenotypes.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors validate the contribution of RAP2A to GB progression. RAP2A participates in asymmetric cell division and in the localization of several cell polarity markers, including cno and Numb.

      Strengths:

      The use of human data, Drosophila models, and cell culture or neurospheres is a good scenario to validate the hypothesis using complementary systems.

      Moreover, the mechanisms that determine GB progression, and in particular glioma stem cells biology, are relevant for the knowledge on glioblastoma and opens new possibilities to future clinical strategies.

      Weaknesses:

      While the manuscript presents a well-supported investigation into RAP2A's role in GBM, several methodological aspects require further validation. The major concern is the reliance on a single GB cell line (GB5), which limits the generalizability of the findings. Including multiple GBM lines, particularly primary patient-derived 3D cultures with known stem-like properties, would significantly enhance the study's relevance.

      Additionally, key mechanistic aspects remain underexplored. Further investigation into the conservation of the Rap2l-Cno/aPKC pathway in human cells through rescue experiments or protein interaction assays would be beneficial. Similarly, live imaging or lineage tracing would provide more direct evidence of ACD frequency, complementing the current indirect metrics (odd/even cell clusters, Numb asymmetry).

      Several specific points require attention:

      (1) The specificity of Rap2l RNAi needs further confirmation. Is Rap2l expressed in neuroblasts or intermediate neural progenitors? Can alternative validation methods be employed?

      There are no available antibodies/tools to determine whether Rap2l is expressed in NB lineages, and we have not been able to develop any either. However, to further prove the specificity of the Rap2l phenotype, we have now analyzed two additional and independent RNAi lines of Rap2l along with the original RNAi line. We have validated the results observed with the original line and found a similar phenotype in the two additional RNAi lines. These results have been added to the text ("Results section", page 6, lines 142-148) and are shown in Supplementary Figure 3.

      (2) Quantification of phenotypic penetrance and survival rates in Rap2l mutants would help determine the consistency of ACD defects.

      In the experiment previously mentioned (repetition of the original Rap2l RNAi line analysis along with two additional Rap2l RNAi lines) we have substantially increased the number of samples analyzed (both the number of NB lineages and the number of different brains analyzed). With that, we have been able to determine that the penetrance of the phenotype was 100% or almost 100% in the 3 different RNAi lines analyzed (n>14 different brains/larvae analyzed in all cases). Details are shown in the text (page 6, lines 142-148), in Supplementary Figure 3 and in the corresponding figure legend.

      (3) The observations on neurosphere size and Ki-67 expression require normalization (e.g., Ki-67+ cells per total cell number or per neurosphere size). Additionally, apoptosis should be assessed using Annexin V or TUNEL assays.

      The Ki-67 experiment was quantified as the % of Ki-67+ cells with respect to the total cell number in each neurosphere. This is indicated in the "Materials and methods" section: "The number of Ki67+ cells with respect to the total number of nuclei labelled with DAPI within a given neurosphere were counted to calculate the Proliferative Index (PI), which was expressed as the % of Ki67+ cells over total DAPI+ cells".
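      The normalization quoted above can be illustrated with a minimal sketch. The function name and the example counts are hypothetical and are not taken from the study; the only thing drawn from the text is the definition of the Proliferative Index as the % of Ki67+ cells over total DAPI+ cells in a given neurosphere.

```python
def proliferative_index(ki67_positive: int, dapi_total: int) -> float:
    """Return the % of Ki67+ cells over total DAPI+ nuclei in one neurosphere."""
    if dapi_total <= 0:
        raise ValueError("a neurosphere must contain at least one DAPI+ nucleus")
    if not 0 <= ki67_positive <= dapi_total:
        raise ValueError("Ki67+ count must lie between 0 and the DAPI+ total")
    return 100.0 * ki67_positive / dapi_total


# Hypothetical (Ki67+, DAPI+) counts per neurosphere, for illustration only.
example_counts = [(3, 8), (2, 5), (0, 4)]
per_sphere_pi = [proliferative_index(k, d) for k, d in example_counts]
```

      Computing the index per neurosphere, rather than pooling all cells, keeps the measurement independent of neurosphere size, which is the normalization the reviewer asked about.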

      Perhaps it was not clearly shown in the graph of Figure 5A. We have now changed the Y-axis label to "% of Ki67+ cells/ neurosphere".

      Unfortunately, we currently cannot carry out neurosphere cultures to address the apoptosis experiments. 

      (4) The discrepancy in Figures 6A and 6B requires further discussion.

      We agree that those pictures can lead to confusion. In the analysis of the "% of neurospheres with even or odd number of cells", we included the 2-cell neurospheres in both the control and the experimental condition (RAP2A). The proportion of these 2-cell neurospheres was very similar in both conditions (27.7% and 27% of the total neurospheres analyzed in each condition). They can result from a previous symmetric or asymmetric division, and we cannot distinguish between the two unless they are stained with Numb, for example, as shown in Figure 6B. As a consequence, in both the control and the experimental condition, the 2-cell neurospheres included in the "even" group (Figure 6A) can represent symmetric or asymmetric divisions. However, the experiment in Figure 6B shows that, among these 2-cell neurospheres, there are more cases of asymmetric division in the experimental condition (RAP2A) than in the control.

      Nevertheless, to make the conclusions clearer and more accurate, we have reanalyzed the data taking into account only the neurospheres with 3, 5 or 7 cells (odd) or 4, 6 or 8 cells (even). Likewise, we have now added further clarification in the Methods regarding how the experiment was analyzed.
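      The reanalysis described above (counting only neurospheres with 3, 5 or 7 cells as odd and 4, 6 or 8 cells as even, and leaving out the ambiguous 2-cell neurospheres) can be sketched as follows. This is an illustrative sketch only: the function name and the input list are hypothetical, and it is not the analysis code used in the study.

```python
from collections import Counter

ODD_SIZES = {3, 5, 7}    # neurosphere sizes scored as "odd"
EVEN_SIZES = {4, 6, 8}   # neurosphere sizes scored as "even"


def odd_even_percentages(cell_counts):
    """Return the % of odd and even neurospheres, ignoring all other sizes."""
    tally = Counter()
    for n_cells in cell_counts:
        if n_cells in ODD_SIZES:
            tally["odd"] += 1
        elif n_cells in EVEN_SIZES:
            tally["even"] += 1
        # 2-cell neurospheres (and any other size) are excluded from the tally.
    scored = tally["odd"] + tally["even"]
    if scored == 0:
        raise ValueError("no neurospheres with 3-8 cells to score")
    return {kind: 100.0 * tally[kind] / scored for kind in ("odd", "even")}


# Hypothetical cell counts per neurosphere, for illustration only.
example = odd_even_percentages([2, 2, 3, 4, 5, 6])
```

      Excluding the 2-cell neurospheres removes the class whose division mode cannot be inferred from cell number alone, so the remaining odd/even split is easier to interpret.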

      (5) Live imaging of ACD events would provide more direct evidence.

      We agree that live imaging would provide further evidence. Unfortunately, we currently cannot carry out neurosphere cultures to approach those experiments.

      (6) Clarification of terminology and statistical markers (e.g., p-values) in Figure 1A would improve clarity.

      We thank the reviewer for pointing out this issue. To improve clarity, we have now included a Supplementary Figure (Fig. S1) with the statistical parameters used. Additionally, we have performed a hierarchical clustering of genes showing significant or non-significant changes in their expression levels.

      (7) Given the group's expertise, an alternative to mouse xenografts could be a Drosophila genetic model of glioblastoma, which would provide an in vivo validation system aligned with their research approach.

      The established Drosophila genetic model of glioblastoma is an excellent model system for gaining deep insight into different aspects of human GBM. However, the main aim of our study was to determine whether an imbalance in the mode of stem cell division, favoring symmetric divisions, could contribute to the expansion of the tumor. We chose neurospheres derived from human GBM cell lines because the existence of cancer stem cells (glioblastoma or glioma stem cells, GSCs) has been demonstrated in human GBM, and these GSCs, like all stem cells, can divide symmetrically or asymmetrically. In the case of the Drosophila model of GBM, the neoplastic transformation observed after overexpressing the EGF receptor and PI3K signaling is due to the activation of downstream genes that promote cell cycle progression and inhibit cell cycle exit. It has also been suggested that the neoplastic cells in this model come from committed glial progenitors, not from stem-like cells.

      Taken together, it would be difficult to determine the causes of any potential effects of manipulating Rap2l levels in this Drosophila model of GBM. We do not rule out this analysis in the future (we have the full setup in the lab). However, this would probably require a new project to comprehensively analyze and understand the mechanism by which Rap2l (and other ACD regulators) might be acting in this context, if it has any effect.

      However, as we mentioned in the Discussion, we agree that the results obtained in this study must be definitively validated in vivo in the future using xenografts with 3D primary patient-derived cell lines.

      Reviewer #2 (Public review):

      This study investigates the role of RAP2A in regulating asymmetric cell division (ACD) in glioblastoma stem cells (GSCs), bridging insights from Drosophila ACD mechanisms to human tumor biology. The focus on RAP2A, a human homolog of Drosophila Rap2l, as a novel ACD regulator in GBM is innovative, given its underexplored role in cancer stem cells (CSCs). The hypothesis that ACD imbalance (favoring symmetric divisions) drives GSC expansion and tumor progression introduces a fresh perspective on differentiation therapy. However, the dual role of ACD in tumor heterogeneity (potentially aiding therapy resistance) requires deeper discussion to clarify the study's unique contributions against existing controversies. Some limitations and questions need to be addressed.

      (1) Validation of RAP2A's prognostic relevance using TCGA and Gravendeel cohorts strengthens clinical relevance. However, differential expression analysis across GBM subtypes (e.g., MES, DNA-methylation subtypes) should be included to confirm specificity.

      We have now included a Supplementary figure (Supplementary Figure 2), in which we show the analysis of RAP2A levels in the different GBM subtypes (proneural, mesenchymal and classical) and their prognostic relevance (i.e., the proneural subtype, which presents significantly higher RAP2A levels than the others, is also the subtype with the better prognosis).

      (2) Rap2l knockdown-induced ACD defects (e.g., mislocalization of Cno/Numb) are well-designed. However, phenotypic penetrance and survival rates of Rap2l mutants should be quantified to confirm consistency.

      We have now analyzed two additional and independent RNAi lines of Rap2l along with the original RNAi line. We have validated the results observed with this line and found a similar phenotype in the two additional RNAi lines now analyzed. To determine the phenotypic penetrance, we have substantially increased the number of samples analyzed (both the number of NB lineages and the number of different brains analyzed). With that, we have been able to determine that the penetrance of the phenotype was 100% or almost 100% in the 3 different Rap2l RNAi lines analyzed (n>14 different brains/larvae analyzed in all cases). These results have been added to the text ("Results section", page 6, lines 142-148) and are shown in Supplementary Figure 3 and in the corresponding figure legend. 

      (3) While GB5 cells were effectively used, justification for selecting this line (e.g., representativeness of GBM heterogeneity) is needed. Experiments in additional GBM lines (especially the addition of 3D primary patient-derived cell lines with known stem cell phenotype) would enhance generalizability.

      We tried to explain this point in the paper (Results). As we mentioned, we tested six different GBM cell lines and found similar RAP2A mRNA levels in all of them, significantly lower than in control Astros (Fig. 3A). We decided to focus on the GBM cell line called GB5 for further analyses because it grew well (better than the others) in neurosphere culture conditions. We agree that repeating at least some of the GB5 analyses in other lines (ideally in primary patient-derived cell lines, as the reviewer mentions) would reinforce the results. Unfortunately, we cannot currently perform experiments in cell lines in the lab. We will consider all of this for future experiments.

      (4) Indirect metrics (odd/even cell clusters, NUMB asymmetry) are suggestive but insufficient. Live imaging or lineage tracing would directly validate ACD frequency.

      We agree that live imaging would provide further evidence. Unfortunately, we cannot approach those experiments in the lab currently.

      (5) The initial microarray (n=7 GBM patients) is underpowered. While TCGA data mitigate this, the limitations of small cohorts should be explicitly addressed and need to be discussed.

      We completely agree with this comment. The microarray was already available to us, so we used it as a first approach and starting point for the study, out of curiosity about whether (and how) the expression levels of those human homologs of Drosophila ACD regulators were affected in this small sample. We were conscious of the limitations of this analysis, which is why we followed it up on a larger scale in the datasets. We already mentioned the limitations of the array in the Discussion:

      "The microarray we interrogated with GBM patient samples had some limitations. For example, not all the human genes homologs of the Drosophila ACD regulators were present (i.e. the human homologs of the determinant Numb). Likewise, we only tested seven different GBM patient samples. Nevertheless, the output from this analysis was enough to determine that most of the human genes tested in the array presented altered levels of expression" [...] "In silico analyses, taking advantage of the existence of established datasets, such as the TCGA, can help to more robustly assess, in a bigger sample size, the relevance of those human genes expression levels in GBM progression, as we observed for the gene RAP2A."

      (6) Conclusions rely heavily on neurosphere models. Xenograft experiments or patient-derived orthotopic models are critical to support translational relevance, and such basic research work needs to be included in journals.

      We completely agree. As we already mentioned in the Discussion, the results obtained in this study must be definitively validated in vivo in the future using xenografts with 3D primary patient-derived cell lines.

      (7) How does RAP2A regulate NUMB asymmetry? Is the Drosophila Rap2l-Cno/aPKC pathway conserved? Rescue experiments (e.g., Cno/aPKC knockdown with RAP2A overexpression) or interaction assays (e.g., Co-IP) are needed to establish molecular mechanisms.

      The mechanism by which RAP2A is regulating ACD is beyond the scope of this paper. We do not even know how Rap2l is acting in Drosophila to regulate ACD. In past years, we did analyze the function of another Drosophila small GTPase, Rap1 (homolog to human RAP1A) in ACD, and we determined the mechanism by which Rap1 was regulating ACD (including the localization of Numb): interacting physically with Cno and other small GTPases, such as Ral proteins, and in a complex with additional ACD regulators of the "apical complex" (aPKC and Par-6). Rap2l could be also interacting physically with the "Ras-association" domain of Cno (domain that binds small GTPases, such as Ras and Rap1). We have added some speculations regarding this subject in the Discussion:

      "It would be of great interest in the future to determine the specific mechanism by which Rap2l/RAP2A is regulating this process. One possibility is that, as it occurs in the case of the Drosophila ACD regulator Rap1, Rap2l/RAP2A is physically interacting or in a complex with other relevant ACD modulators."

      (8) Reduced stemness markers (CD133/SOX2/NESTIN) and proliferation (Ki-67) align with increased ACD. However, alternative explanations (e.g., differentiation or apoptosis) must be ruled out via GFAP/Tuj1 staining or Annexin V assays.

      We agree with these possibilities. Regarding differentiation, an increase in differentiation markers would in fact be a logical consequence of increased ACD and reduced stemness markers. Unfortunately, we cannot currently carry out those experiments in the lab.

      (9) The link between low RAP2A and poor prognosis should be validated in multivariate analyses to exclude confounding factors (e.g., age, treatment history).

      We have now added this information in the "Results section" (page 5, lines 114-123).

      (10) The broader ACD regulatory network in GBM (e.g., roles of other homologs like NUMB) and potential synergies/independence from known suppressors (e.g., TRIM3) warrant exploration.

      The present study was designed as a "proof-of-concept" study to start analyzing the hypothesis that the expression levels of human homologs of known Drosophila ACD regulators might be relevant in human cancers that contain cancer stem cells, if those human homologs were also involved in modulating the mode of (cancer) stem cell division. 

      Extending the findings of this work to the whole ACD regulatory network would be the logical and ideal path to follow in the future.

      We already mentioned this point in the Discussion:

      "....it would be interesting to analyze in the future the potential consequences that altered levels of expression of the other human homologs in the array can have in the behavior of the GSCs. In silico analyses, taking advantage of the existence of established datasets, such as the TCGA, can help to more robustly assess, in a bigger sample size, the relevance of those human genes expression levels in GBM progression, as we observed for the gene RAP2A."

      (11) The figures should be improved. Statistical significance markers (e.g., p-values) should be added to Figure 1A; timepoints/culture conditions should be clarified for Figure 6A.

      Regarding the statistical significance markers, we have now included a Supplementary Figure (Fig. S1) with the statistical parameters used. Additionally, we have performed a hierarchical clustering of genes showing significant or non-significant changes in their expression levels.

      Regarding the experimental conditions corresponding to Figure 6A, those have now been added in more detail in "Materials and Methods" ("Pair assay and Numb segregation analysis" paragraph).

      (12) Redundant Drosophila background in the Discussion should be condensed; terminology should be unified (e.g., "neurosphere" vs. "cell cluster").

      As we did not mention much about Drosophila ACD and NBs in the "Introduction", we needed to explain in the "Discussion" at least some very basic concepts and information about this, especially for "non-drosophilists". We have reviewed the Discussion to maintain this information to the minimum necessary.

      We have also reviewed the terminology that the Reviewer mentions and have unified it.

      Reviewer #1 (Recommendations for the authors):

      To improve the manuscript's impact and quality, I would recommend:

      (1) Expand Cell Line Validation: Include additional GBM cell lines, particularly primary patient-derived 3D cultures, to increase the robustness of the findings.

      (2) Mechanistic Exploration: Further examine the conservation of the Rap2l-Cno/aPKC pathway in human cells using rescue experiments or protein interaction assays.

      (3) Direct Evidence of ACD: Implement live imaging or lineage tracing approaches to strengthen conclusions on ACD frequency.

      (4) RNAi Specificity Validation: Clarify Rap2l RNAi specificity and its expression in neuroblasts or intermediate neural progenitors.

      (5) Quantitative Analysis: Improve quantification of neurosphere size, Ki-67 expression, and apoptosis to normalize findings.

      (6) Figure Clarifications: Address inconsistencies in Figures 6A and 6B and refine statistical markers in Figure 1A.

      (7) Alternative In Vivo Model: Consider leveraging a Drosophila glioblastoma model as a complementary in vivo validation approach.

      Addressing these points will significantly enhance the manuscript's translational relevance and overall contribution to the field.

      We have been able to address points 4, 5 and 6. Others are either out of the scope of this work (2) or we do not have the possibility to carry them out at this moment in the lab (1, 3 and 7). However, we will complete these requests/recommendations in other future investigations.

      Reviewer #2 (Recommendations for the authors):

      Major revision required to address methodological and mechanistic gaps.

      (1) Enhance Clinical Relevance

      Validate RAP2A's prognostic significance across multiple GBM subtypes (e.g., MES, DNA-methylation subtypes) using datasets like TCGA and Gravendeel to confirm specificity.

      Perform multivariate survival analyses to rule out confounding factors (e.g., patient age, treatment history).

      (2) Strengthen Mechanistic Insights

      Investigate whether the Rap2l-Cno/aPKC pathway is conserved in human GBM through rescue experiments (e.g., RAP2A overexpression with Cno/aPKC knockdown) or interaction assays (e.g., Co-IP).

      Use live-cell imaging or lineage tracing to directly validate ACD frequency instead of relying on indirect metrics (odd/even cell clusters, NUMB asymmetry).

      (3) Improve Model Systems & Experimental Design

      Justify the selection of GB5 cells and include additional GBM cell lines, particularly 3D primary patient-derived cell models, to enhance generalizability.

      It is essential to perform xenograft or orthotopic patient-derived models to support translational relevance.

      (5) Address Alternative Interpretations

      Rule out other potential effects of RAP2A knockdown (e.g., differentiation or apoptosis) using GFAP/Tuj1 staining or Annexin V assays.

      Explore the broader ACD regulatory network in GBM, including interactions with NUMB and TRIM3, to contextualize findings within known tumor-suppressive pathways.

      (6) Improve Figures & Clarity

      Add statistical significance markers (e.g., p-values) in Figure 1A and clarify timepoints/culture conditions for Figure 6A.

      Condense redundant Drosophila background in the discussion and ensure consistent terminology (e.g., "neurosphere" vs. "cell cluster").

      We have been able to address points 1, 3 (partially) and 6. The others are either outside the scope of this work or cannot be carried out in the lab at this moment. However, we are very interested in completing these requests/recommendations and will pursue these types of experiments in future investigations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      This study builds on previous work demonstrating that several beta connexins (Cx26, Cx30, and Cx32) have a carbamylation motif which renders them sensitive to CO<sub>2</sub>. In response to CO<sub>2</sub>, hemichannels composed of these connexins open, enabling diffusion of small molecules (such as ATP) between the cytosol and extracellular environment. Here, the authors have identified that an alpha connexin, Cx43, also contains a carbamylation motif, and they demonstrate that CO<sub>2</sub> opens Cx43 hemichannels. Most of the study involves using transfected cells expressing wildtype and mutant Cx43 to define amino acids required for CO<sub>2</sub> sensitivity. Hippocampal tissue slices in culture were used to show that CO<sub>2</sub>-induced synaptic transmission was affected by Cx43 hemichannels, providing a physiological context. The authors point out that the Cx43 gene significantly diverges from the beta connexins that are CO<sub>2</sub> sensitive, suggesting that the conserved carbamylation motif was present before the alpha and beta connexin genes diverged. 

      Strengths: 

      (1) The molecular analysis defining the amino acids that contribute to the CO<sub>2</sub> sensitivity of Cx43 is a major strength of the study. The rigor of analysis was strengthened by using three independent assays for hemichannel opening: dye uptake, patch clamp channel measurements, and ATP secretion. The resulting analysis identified key lysines in Cx43 that were required for CO<sub>2</sub>-mediated hemichannel opening. A double K to E Cx43 mutant produced a construct that produced hemichannels that were constitutively open, which further strengthened the analysis. 

      (2) Using hippocampal tissue sections to demonstrate that CO<sub>2</sub> can influence field excitatory postsynaptic potentials (fEPSPs) provides a native context for CO<sub>2</sub> regulation of Cx43 hemichannels. Cx43 mutations associated with Oculodentodigital Dysplasia (ODDD) inhibited CO<sub>2</sub>-induced hemichannel opening, although the mechanism by which this occurs was not elucidated. 

      Weaknesses: 

      (1) Cx43 channels are sensitive to cytosolic pH, which will be affected by CO<sub>2</sub>. Cytosolic pH was not measured, and how this affects CO<sub>2</sub>-induced Cx43 hemichannel activity was not addressed. 

      We have now addressed this with intracellular pH measurements and removal of the C-terminal pH sensor from Cx43; the hemichannel remains CO<sub>2</sub> sensitive.

      (2) Cultured cells are typically grown in incubators containing 5% CO<sub>2</sub>, which is ~40 mmHg. It is unclear how cells would be viable if Cx43 hemichannels are open at this PCO<sub>2</sub>.

      The cells look completely healthy with normal morphology and no sign of excessive cell death in the cultures. Presumably they have ways of compensating for the effects of partially open Cx43 hemichannels.
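      For reference, the "5% CO<sub>2</sub> is ~40 mmHg" figure in the comment above can be checked with a short calculation. This is a back-of-the-envelope sketch under stated assumptions that are not in the source: sea-level barometric pressure (760 mmHg) and, for the humidified case, an incubator atmosphere saturated with water vapour at 37 °C (47 mmHg). The function name is hypothetical.

```python
ATMOSPHERIC_PRESSURE_MMHG = 760.0        # assumption: sea-level barometric pressure
WATER_VAPOUR_PRESSURE_37C_MMHG = 47.0    # assumption: saturated water vapour at 37 degrees C


def pco2_mmhg(co2_fraction: float, humidified: bool = True) -> float:
    """Partial pressure of CO2 (mmHg) for a given fractional CO2 content."""
    if not 0.0 <= co2_fraction <= 1.0:
        raise ValueError("co2_fraction must be between 0 and 1")
    gas_pressure = ATMOSPHERIC_PRESSURE_MMHG
    if humidified:
        # Water vapour occupies part of the total pressure in a humidified incubator.
        gas_pressure -= WATER_VAPOUR_PRESSURE_37C_MMHG
    return co2_fraction * gas_pressure


dry_pco2 = pco2_mmhg(0.05, humidified=False)    # 0.05 * 760
humid_pco2 = pco2_mmhg(0.05, humidified=True)   # 0.05 * (760 - 47)
```

      Under these assumptions the result is about 38 mmHg for dry gas and about 36 mmHg for humidified gas, in line with the approximate ~40 mmHg figure and with the physiological resting level mentioned elsewhere in the reviews.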

      (3) Experiments using Gap26 to inhibit Cx43 hemichannels in fEPSP measurements used a scrambled peptide as a control. Analysis should also include Gap peptides specifically targeting Cx26, Cx30, and Cx32 as additional controls. 

      We don’t feel this is necessary given the extensive prior literature in hippocampus showing the effect of ATP release via open Cx43 hemichannels on fEPSP amplitude, which used astrocyte-specific knockout of Cx43 and Gap26 (doi: 10.1523/jneurosci.0015-14.2014).

      (4) The mechanism by which ODDD mutations impair CO<sub>2</sub>-mediated hemichannel opening was not addressed. Also, the potential roles for inhibiting Cx43 hemichannels in the pathology of ODDD are unclear.

      These pathological mutations that alter CO<sub>2</sub> sensitivity are similar to pathological mutations in Cx26 and Cx32, which also remove CO<sub>2</sub> sensitivity. Our cryo-EM studies on Cx26 give clues as to why these mutations have this effect: they alter conformational mobility of the channel (Brotherton et al 2022 doi: 10.1016/j.str.2022.02.010 and Brotherton et al 2024 doi: 10.7554/eLife.93686). We assume that similar considerations apply to Cx43, but this requires improved cryo-EM structures of Cx43 hemichannels at differing levels of PCO<sub>2</sub>.

      We agree that the link between loss of CO<sub>2</sub> sensitivity of Cx43 and ODDD is not established and have revised the text to make this clear.

      (5) CO<sub>2</sub> has no effect on Cx43-mediated gap junctional communication as opposed to Cx26 gap junctions, which are inhibited by CO<sub>2</sub>. The molecular basis for this difference was not determined.

      Cx26 gap junction channels are so far unique amongst CO<sub>2</sub> sensitive connexins in being closed by CO<sub>2</sub>. We have addressed the mechanism by which this occurs in Nijjar et al 2025 (doi: 10.1113/JP285885): carbamylation of K108 in Cx26 (in addition to K125) is required for GJC closure.

      (6) Whether there are other non-beta connexins that have a putative carbamylation motif was not addressed. Additional discussion/analysis of how the evolutionary trajectory for Cx43 maintaining a carbamylation motif is unique for non-beta connexins would strengthen the study. 

      We have performed a molecular phylogenetic survey to show that the carbamylation motif occurs across the alpha connexin clade and have shown that Cx50 is indeed CO<sub>2</sub> sensitive (doi: 10.1101/2025.01.23.634273). This is now in Fig 12.

      Reviewer #2 (Public review): 

      Summary: 

      This paper examines the CO<sub>2</sub> sensitivity of Cx43 hemichannels and gap junctional channels in transiently transfected HeLa cells using several different assays, including ethidium dye uptake, ATP release, whole cell patch clamp recordings, and an imaging assay of gap junctional dye transfer. The results show that raising pCO<sub>2</sub> from 20 to 70 mmHg (at a constant pH of 7.3) causes an increase in opening of Cx43 hemichannels but does not block Cx43 gap junctions. This study also showed that raising pCO<sub>2</sub> from 20 to 35 mmHg resulted in an increase in synaptic strength in hippocampal rat brain slices, presumably due to downstream ATP release, suggesting that the CO<sub>2</sub> sensitivity of Cx43 may be physiologically relevant. As a further test of the physiological relevance of the CO<sub>2</sub> sensitivity of Cx43, it was shown that two pathological mutations of Cx43 that are associated with ODDD caused loss of Cx43 CO<sub>2</sub>-sensitivity. Cx43 has a potential carbamylation motif that is homologous to the motif in Cx26. To understand the structural changes involved in CO<sub>2</sub> sensitivity, a number of mutations were made in Cx43 sites thought to be the equivalent of those known to be involved in the CO<sub>2</sub> sensitivity of Cx26, and the CO<sub>2</sub> sensitivity of these mutants was investigated.

      Strengths: 

      This study shows that the apparent lack of functional Cx43 hemichannels observed in a number of previous in vitro function studies may be due to the use of HEPES to buffer the external pH. When Cx43 hemichannels were studied in external solutions in which CO<sub>2</sub>/bicarbonate was used to buffer pH instead of HEPES, Cx43 hemichannels showed significantly higher levels of dye uptake, ATP release, and ionic conductance. These findings may have major physiological implications since Cx43 hemichannels are found in many organs throughout the body, including the brain, heart, and immune system.

      Weaknesses: 

      (1) Interpretation of the site-directed mutation studies is complicated. Although Cx43 has a potential carbamylation motif that is homologous to the motif in Cx26, the results of site-directed mutation studies were inconsistent with a simple model in which K144 and K105 interact following carbamylation to cause the opening of Cx43 hemichannels. 

      The mechanism of opening of Cx43 is more complex than that of Cx26, Cx32 and Cx50 and involves more Lys residues. The 4 Lys residues in Cx43 that are involved in opening the hemichannel have their equivalents in Cx26, but in Cx26 these additional residues seem to be involved in the closing of the GJC rather than opening of the hemichannel (see above). Cx50 is simpler and involves only two Lys residues (doi: 10.1101/2025.01.23.634273), which are equivalent to those in Cx26.

      (2) Secondly, although it is shown that two Cx43 ODDD-associated mutations show a loss of CO<sub>2</sub> sensitivity, there is no evidence that the absence of CO<sub>2</sub> sensitivity is involved in the pathology of ODDD.

      We agree, but this is probably because this has not been directly tested by experiment, as the CO<sub>2</sub> sensitivity of Cx43 was not previously known. As mentioned above, we have revised the text to ensure that this is clear.

      Reviewer #3 (Public review): 

      In this paper, the authors aimed to investigate carbamylation effects on the function of Cx43-based hemichannels. Such effects have previously been characterized for other connexins, e.g., for Cx26, which display increased hemichannel (HC) opening and closure of gap junction channels upon exposure to increased CO<sub>2</sub> partial pressure (accompanied by increased bicarbonate to keep pH constant). 

      The authors used HeLa cells transiently transfected with Cx43 to investigate CO<sub>2</sub>-dependent carbamylation effects on Cx43 HC function. In contrast to Cx43-based gap junction channels that are reported here to be insensitive to PCO<sub>2</sub> alterations, they provide evidence that Cx43 HC opening is highly dependent on the PCO<sub>2</sub> in the bath solution, over a range of 20 up to 70 mmHg encompassing the physiologically normal resting level of around 40 mmHg. They furthermore identified several Cx43 residues involved in Cx43 HC sensitivity to PCO<sub>2</sub>: K105, K109, K144 & K234; mutation of 2 or more of these AAs is necessary to abolish CO<sub>2</sub> sensitivity. The subject is interesting and the results indicate that a fraction of HCs is open at a physiological 40 mmHg PCO<sub>2</sub>, which differs from the situation under HEPES buffered solutions where HCs are mostly closed under resting conditions. The mechanism of HC opening with CO<sub>2</sub> gassing is linked to carbamylation, and the authors pinpointed several Lys residues involved in this process.

      Overall, the work is interesting as it shows that Cx43 HCs have a significant open probability under resting conditions of physiological levels of CO<sub>2</sub> gassing, probably applicable to the brain, heart, and other Cx43 expressing organs. The paper gives a detailed account of various experiments performed (dye uptake, electrophysiology, ATP release to assess HC function) and results concluded from those. They further consider many candidate carbamylation sites by mutating them to negatively charged Glu residues. The paper ends with hippocampal slice work showing evidence for connexin-dependent increases of the EPSP amplitude that could be inhibited by HC inhibition with Gap26 (Figure 10). Another line of evidence comes from the Cx43-linked ODDD genetic disease, whereby L90V as well as the A44V mutations of Cx43 prevented the CO<sub>2</sub>-induced hemichannel opening response (Figure 11). Although the paper is interesting, in its present state, it suffers from (i) a problematic Figure 3, precluding interpretation of the data shown, and (ii) the poor use of hemichannel inhibitors that are necessary to strengthen the evidence in the crucial experiment of Figure 2 and others. 

      The panels in Figure 3 were mislabelled in the accompanying legend, possibly leading to some confusion. This has now been corrected.

      We disagree that hemichannel blockers are needed to strengthen the evidence in Figure 2 and other figures. Our controls show that the CO<sub>2</sub>-sensitive responses absolutely require expression of Cx43 and are modified by mutations of Cx43. It is hard to see how this evidence would be strengthened by use of peptide inhibitors or other blockers of hemichannels that may not be completely selective.

      Reviewing Editor Comments:

      (1) Improve electrophysiological evidence, addressing concerns about the initial experiment and including peptide inhibitor data where applicable. 

      We think the concerns about the electrophysiological evidence arise from a misunderstanding because we gave insufficient information about how we conducted the experiments. We have now provided a much more complete legend, added explanations in the text and given more detail in the Methods. We further respond to the reviewer below.

      We do not agree on the necessity of the peptide inhibitor to demonstrate dependence on Cx43. We have shown that parental HeLa cells do not release ATP in response to changes in PCO<sub>2</sub> or voltage (Fig 2D; Butler & Dale 2023, 10.3389/fncel.2023.1330983; Lovatt et al 2025, 10.1101/2025.03.12.642803, 10.1101/2025.01.23.634273). Our previous papers have shown many times that parental HeLa cells do not load with dye in response to CO<sub>2</sub> or zero Ca<sup>2+</sup> (e.g. Huckstepp et al 2010, 10.1113/jphysiol.2010.192096; Meigh et al 2013, 10.7554/eLife.01213; Meigh et al 2014, 10.7554/eLife.04249), and we have shown that parental HeLa cells do not exhibit the same CO<sub>2</sub>-dependent change in whole cell conductance that the Cx43-expressing cells do (Fig 2B). In addition, we have shown that mutating key residues in Cx43 alters both CO<sub>2</sub>-sensitive release of ATP and the CO<sub>2</sub>-dependent dye loading without affecting the respective positive control. To bolster this, we have included data for the K144R mutation as a supplement to Fig 3. Given the expense of Gap26, it is impractical to include this as a standard control, and it is unnecessary given the comprehensive controls outlined.

      Collectively, these data show that the responses to CO<sub>2</sub> require expression of Cx43 and can be modified by mutation of Cx43.

      (2) Strengthen the manuscript by measuring the effects of CO<sub>2</sub> on cytosolic pH and Cx43 hemichannel opening. Consider using tail truncation mutants to assess the role of the C-terminal pH sensor in CO<sub>2</sub>-mediated channel opening.

      We agree and have performed the suggested experiments to address this issue.

      (3) Investigate the effect of expressing the K105E/K109E Cx43 double mutant on cell viability.

      In our experiments the cells look completely healthy based on their morphology in brightfield microscopy and growth rates. 

      (4) Discuss and analyze the uniqueness of Cx43 among alpha connexins in maintaining the carbamylation motif.

      We now discuss this: Cx43 is not unique. We have added a molecular phylogenetic survey of the alpha connexin clade in Fig 12. Apart from Cx37, the carbamylation motif appears in all the other members of the clade (but not necessarily in the human orthologue). In a different MS, currently posted on bioRxiv, we have documented the CO<sub>2</sub> sensitivity of Cx50 and its dependence on the motif.

      (5) Consider omitting data on ODDD-associated mutations unless there is evidence linking CO<sub>2</sub> sensitivity to disease pathology.

      This experiment is observational, and we are not making claims that there is a direct causal link. Removing the ODDD mutant findings would lose potentially useful information for anyone studying how these mutations alter channel function. We have reworded the text to ensure that we say that the link between loss of CO<sub>2</sub> sensitivity and ODDD remains unproven.

      (6) Justify the choice of high K<sup>+</sup> and low external calcium as a positive control in ATP release experiments.

      These two manipulations can open the hemichannel independently of the CO<sub>2</sub> stimulus. Extracellular Ca<sup>2+</sup> is well known to block all connexin hemichannels, and Cx43 is known to be voltage sensitive. The depolarisation from high K<sup>+</sup> is effective at opening the hemichannel and we preferred this as a more physiological way of opening the Cx43 hemichannel. We have added some explanatory text.

      (7) Clarify whether Cx43A44V or Cx43L90V mutations block gap junctional coupling.

      This is an interesting point. Since Cx43 GJCs are not CO<sub>2</sub> sensitive we feel this is beyond the scope of our paper. 

      (8) Discuss the potential implications of pCO₂ changes on myocardial function through alterations in intracellular pH.

      We have modified the discussion to consider this point.

      Reviewer #1 (Recommendations for the authors):

      (1) Measurements of the effects of CO<sub>2</sub> on cytosolic pH/Cx43 hemichannel opening would strengthen the manuscript. Since the pH sensor of Cx43 is on the C terminus, the authors could consider making tail truncation mutants to see how this affects CO<sub>2</sub>-mediated Cx43 channel opening.

      We have done this (truncating after residue 256): the channel remains highly CO<sub>2</sub> and voltage sensitive. We have also documented the effect of the hypercapnic solutions on intracellular pH measured with BCECF. These new data are now included as figure supplements to Figure 2.

      (2) What is the impact of expressing the K105E / K109E Cx43 double mutant on cell viability?

      There was no obvious impact: cell density was as expected (no evidence of increased cell death), and brightfield and fluorescence visualisation indicated normal healthy cells. We have added a movie (Fig 9, movie supplement 1) to show the effect of La<sup>3+</sup> on the GRAB<sub>ATP</sub> signal in cells expressing Cx43<sup>K105E, K109E</sup> so readers can appreciate the morphology and its stability during the recording.

      (3) A quick look at other alpha connexins suggested that Cx43 was unique among alpha connexins in maintaining the carbamylation motif. This merits additional discussion/ analysis.

      This is an interesting point. Cx43 is not unique in the alpha clade in having the carbamylation motif: a number of other human alpha connexins (Cx50, Cx59 and Cx62) also possess it, as do non-human alpha connexins (Cx40, Cx59, Cx46). We have shown that Cx50 is CO<sub>2</sub> sensitive. We have performed a brief molecular phylogenetic analysis of the alpha connexin clade to highlight the occurrence of the carbamylation motif. This is now presented as Fig 12 to go with the accompanying discussion.

      (4) There were some minor writing issues that should be addressed. For instance, fEPSP is not defined. Also, insets showing positive controls in some experiments were not described in the figure legends.

      We have corrected these issues.

      Reviewer #2 (Recommendations for the authors):

      (1) I would omit the data on the ODDD-associated mutations since there is no evidence that loss of CO<sub>2</sub> sensitivity plays an important role in the underlying disease pathology.

      We are not making the claim CO<sub>2</sub> loss leads to the underlying pathology and have reviewed the text to ensure that we clearly express that this is a correlation not a cause. We think this is worth retaining as many pathological mutations in other CO<sub>2</sub> sensitive connexins (Cx26, Cx32 and Cx50) cause loss of CO<sub>2</sub> sensitivity, and this information may be helpful to other researchers.

      (2) Why is high K+ rather than low external calcium used as a positive control in ATP release experiments?

      We used high K<sup>+</sup> and depolarisation as a positive control because we regard this as a more physiological stimulus than low external Ca<sup>2+</sup>.

      (3) Does Cx43A44V or Cx43L90V block gap junctional coupling?

      An interesting question but we have not examined this.

      (4) Provide references for biophysical recordings of Cx43 hemichannels performed in HEPES-buffered salines, which document Cx43 hemichannels as being shut.

      We have added the original and some later references that examine Cx43 hemichannel gating in HEPES buffer and show the need for substantial depolarisation to induce channel opening.

      (5) In the heart muscle, changes in PCO<sub>2</sub> have long been hypothesized to cause changes in myocardial function by changing pHi.

      This is true and we now add some discussion of this point. Now that we know that Cx43 is directly sensitive to CO<sub>2</sub> a direct action of CO<sub>2</sub> cannot be ruled out and careful experimentation is required to test this possibility. 

      Reviewer #3 (Recommendations for the authors):

      (1) Page 3: "... homologs of K125 and R104 ... ": the context is linked to Cx26, so Cx26 needs to be added here.

      Done

      (2) Page 4 text and related Figure 2:

      (a) Figure 2A&B: PCO2-dependent Cx43 HC opening is clearly present in the carboxy-fluorescein dye uptake experiments (Figure 2A) as well as in the electrophysiological experiments (Figure 2B). The curves look quite different between these two distinct readouts: dye uptake doubles from 20 to 70 mmHg in Figure 2A while the electrophysiological data double from 45 to 70 mmHg in Figure 2B. These responses look quite distinct and may be linked to a non-linearity of the dye uptake assay or a problem in the electrophysiological measurements of Figure 2B discussed in the next point.

      Different molecules/ions may have different permeabilities through the channel, which could explain the observed difference. Also, there is some contamination of the whole cell conductance change with another conductance (evident in recordings from parental HeLa cells). This is evident particularly at 70 mmHg. If this contaminating conductance were subtracted from the total conductance in the Cx43 expressing cells, then the dose response relations would be more similar. However, we are reluctant to add this additional data processing step to the paper.

      (b) The traces in Figure 2B show that the HC current is inward at 20 mmHg PCO2, while it switches to an outward current at 55mmHg PCO2. HCs are non-selective channels, so their current should switch direction around 0 mV but not at -50 mV. As such, the -50 mV switching point indicates involvement of another channel distinct from non-selective Cx43 hemichannels.

      We think that our incomplete description in the legend led to this misunderstanding. We used a baseline of 35 mmHg (where the channels will be slightly open) and changed to 20 mmHg to close them (or to higher PCO<sub>2</sub> to open them from this baseline), hence a decrease in conductance and loss of outward current for 20 mmHg. The holding potential for the recordings and voltage steps were the same in all recordings. We have now edited the legend and added more information into the methods to clarify this and how we constructed the dose response curve.

      We agree that Cx43 hemichannels are relatively nonselective and would normally be expected to have a reversal potential around 0 mV, but we are using K-Gluconate and the lowered reversal potential (~-65 mV) is likely due to poor permeation of this anion via Cx43.

      (c) A Hill slope of 6 is reported for this curve, which is extremely steep. The paper does not provide any further consideration, making this an isolated statement without any theoretical framework to understand the present finding in such context (i.e., in relation to the PCO2 dependency of Cx channels).

      Yes, we agree. For all the CO<sub>2</sub>-sensitive connexins that we have looked at, the Hill coefficient versus CO<sub>2</sub> is >4. Hemichannels are of course hexameric, so there is potential for 6 CO<sub>2</sub> molecules to be bound and for extensive cooperativity. We have modified the text to give greater context.
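
      To give a feel for what a Hill slope of 6 implies, the following minimal sketch evaluates the standard Hill equation at two PCO<sub>2</sub> values. It is illustrative only and is not the fitting procedure used in the paper: the midpoint `EC50_MMHG` and the function name `hill_response` are hypothetical, and only the slope of 6 comes from the discussion above.

```python
def hill_response(pco2_mmhg, ec50_mmhg, hill_coefficient, max_response=1.0):
    """Fractional response predicted by the standard Hill equation."""
    if pco2_mmhg <= 0:
        return 0.0
    ratio = (pco2_mmhg / ec50_mmhg) ** hill_coefficient
    return max_response * ratio / (1.0 + ratio)


# Hypothetical midpoint, chosen only for illustration (not a fitted value).
EC50_MMHG = 45.0

# Compare a non-cooperative curve (h = 1) with the steep slope discussed (h = 6).
for h in (1, 6):
    low = hill_response(35.0, EC50_MMHG, h)
    high = hill_response(55.0, EC50_MMHG, h)
    print(f"h={h}: response at 35 mmHg = {low:.2f}, at 55 mmHg = {high:.2f}")
```

      With a larger Hill coefficient, the same 20 mmHg change around the midpoint covers a much larger fraction of the full response range, which is what "steep" means in this context.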

      (d) A further remark to Figure 2 is that it does not contain any experiment showing the effect of Cx43 hemichannel inhibition with a reliable HC inhibitor such as Gap26, which is only used in the penultimate illustration of Figure 10. Gap26 should be used in Figure 2 and most of the other figures to show evidence of HC contribution. The lanthanum ions used in Figure 9 are a very non-specific hemichannel blocker and should be replaced by experiments with Gap26.

      We have addressed the first part of this comment above.

      We agree that La<sup>3+</sup> blocks all hemichannels, but in the context of our experiments and the controls we have performed it is entirely adequate and supports our conclusions. Our controls (mentioned above and below) show that the expression of Cx43 is absolutely required for CO<sub>2</sub>-dependent ATP release (and dye loading). In Figure 9 our use of La<sup>3+</sup> was to show the presence of a constitutively open Cx43 mutant hemichannel. Gap26 would add little to this. Our further controls show that with expression of Cx43<sup>WT</sup>, La<sup>3+</sup> did nothing to the ATP signal under baseline conditions (20 mmHg), supporting our conclusion that the mutant channels are constitutively open.

      (e) As the experiments of Figure 2 form the basis of what is to follow, the above remarks cast doubt on the robustness of the experiments and the data produced.

      We disagree; our results are extremely robust: 1) we have used three independent assays to confirm the presence of the response; 2) parental HeLa cells do not release ATP, dye load or show large conductance changes in response to CO<sub>2</sub>, showing the absolute requirement for expression of Cx43; 3) mutations of Cx43 (in the carbamylation motif) alter the CO<sub>2</sub>-evoked ATP release and dye loading, giving further confirmation of Cx43 as the conduit for ATP release and dye loading; and 4) we use standard positive controls (0 Ca<sup>2+</sup>, high K<sup>+</sup>) to confirm that cells still have functional channels for those mutations that modified CO<sub>2</sub> sensitivity.

      (f) The sentence "Cells transfected with GRAB-ATP only, showed ... " should be modified to "In contrast, cells not expressing Cx43 showed no responses to any applied CO2 concentration as concluded from GRAB-ATP experiments"

      We have modified the text.

      (3) Page 5 and Figures 3 & 4:

      (a) Figure 3 illustrates results obtained with mutations of 4 distinct Lys residues. However, the corresponding legend indicates mutations that are different from the ones shown in the corresponding illustrations, making it impossible to reliably understand and interpret the results shown in panels A-E.

      Thanks for pointing this out. Our apologies, we modified the figure so that the order of the images matched the order of the graph (and the legend) but then forgot to put the new version of the figure in the text. We have now corrected this so that Figure and legend match.

      (b) Figure 4 lacks control WT traces!

      The controls for this (showing that parental HeLa cells do not release ATP in response to CO<sub>2</sub> or depolarisation) are shown in Figure 2.

      (c) Figure 4, Supplement 1: High Hill coefficients of 10 are shown here, but they are not discussed anywhere, as is also the case for the remark on p.4. A Hill steepness of 10 is huge and points to many processes potentially involved. As reported above, these data are floating around in the manuscript without any connection.

      Yes, we agree this is very high and surprising. It may reflect as mentioned above the hexameric nature of the channel and that 4 Lys residues seem to be involved. We have used this equation to give some quantitative understanding of the effect of the mutations on CO<sub>2</sub> sensitivity and still think this is useful. We have no further evidence to interpret these values one way or the other.

      (4) Page 6: Carbamate bridges are proposed to be formed between K105 and K144, and between K109 and K234. The first three of these Lysine residues are located in the 55aa long cytoplasmic loop of Cx43, while K234 is in the juxta membrane region involved in tubulin interactions. Both K144 and K234 are involved in Cx43 HC inhibition: K144 is the last aa of the L2 peptide (D119-K144 sequence) that inhibits Cx43 hemichannels while K234 is the first aa of the TM2 peptide that reduces hemichannel presence in the membrane (sequence just after TM4, at the start of the C-tail). This context should be added to increase insight and understanding of the CO2 carbamylation effects on Cx43 hemichannel opening.

      Thanks for suggesting this. We have added some discussion of CT to CL interactions in the context of regulation by pH and [Ca<sup>2+</sup>].

      (5) Page 7: The Cx43 ODDD A44V and L90V mutations lead to loss of pCO2 sensitivity in dye loading and ATP assays. However, A44V located in EL1 is reportedly associated with Cx43 HC activation, while L90V in TM2 is associated with HC inhibition. Remarkably, these mutations are focused on non-Lys residues, which brings up the question of how to link this to the paper's main thread.

      This follows the pattern that we have seen for other mutations such as A40V, A88V in Cx26 and several CMTX mutations of Cx32. Our cryoEM structures of Cx26 suggest that these mutations alter the flexibility of the molecule and hence abolish CO<sub>2</sub> sensitivity. We have reworded the text to avoid giving the impression that there is a demonstrated link between loss of CO<sub>2</sub> sensitivity of Cx43 and pathology.

      (6) Page 8: HCs constitutively open - 'constitutively' perhaps does not have the best connotation as it is not related to HC constitution but CO2 partial pressure.

      Yes, we agree and have reworded this.

      (7) Page 9: "in all subtypes" -> not clear what is meant - do you mean "in all cell types"?

      We agree this is unclear; it refers to all astrocytic subtypes. We have amended the text.

      (8) Page 10: Composition of hypocapnic recording solution: bubbling description is incomplete "95%O2/5%" and should be "95%O2/5%CO2".

      Changed.

      (9) Page 11: Composition of zero Ca<sup>2+</sup> hypocapnic recording solution: perhaps better to call this "nominally Ca<sup>2+</sup>-free hypocapnic recording solution" as no Ca<sup>2+</sup> buffer is included in this solution

      Thanks for pointing this out. We did in fact add 1 mM EGTA to the solutions but omitted this from the recipe; this has now been corrected.

      (10) Page 11: in M&M I found that the NaHCO3- is lowered to 10 mM in the zero Ca<sup>2+</sup> condition, while the control experimental condition has 26 mM NaHCO3-. The zero Ca condition should be kept at a physiologically normal 26 mM NaHCO3- concentration, so why was this done? Lowering NaHCO3- during hemichannel stimulation may result in smaller responses and introduce non-linearities.

      For the dye loading we used 20 mmHg as the baseline condition and increased PCO<sub>2</sub> from this. Hence for the zero Ca<sup>2+</sup> positive control we modified the 20 mmHg hypocapnic solution by substituting Mg<sup>2+</sup> for Ca<sup>2+</sup> and adding EGTA. We have modified the text in the Methods to clarify this.

      Further remarks on the figures:

      (1) Figure 2A: Add 20 & 70 mmHg to the images, to improve the readability of this illustration.

      Done

      (2) Figure 3: WT responses are shown in panel F, but experimental data (images and curves) are lacking and should be included in a revised version.

      The wild type data is shown in Fig 2A. We have some sympathy for the comment, but we felt that Fig 2 should document CO<sub>2</sub> sensitivity, and then the subsequent Figs should analyse its basis. Hence the separation of Cx43<sup>WT</sup> data from the mutant data. In panel F, we state that we have recalculated the WT data from Fig 2A to allow the comparison.

      (3) Figures 4, 6, 8: Color codes for mmHg CO<sub>2</sub> pressure make reading these figures difficult; perhaps better to add mmHg values directly in relation to the traces.

      We have considered this suggestion but feel that the figures would become very cluttered with the additional labelling.

      (4) I wouldn't use colored lines when not necessary, e.g., Figure 9 100 µM La3+; Figure 10 (add 20->35 mmHg PCO2 switch; add scrGap26 above blue bars); Figure 11C & D.

      We agree and can see that in Figs 9 and 10 this muddles our colour scheme in other figures so have modified these figures. There was not space to put the suggested labels.

      (5) The mechanism of increased HC opening is not clear.

      We agree and have discussed various options and the analogy with what we know about Cx26. Ultimately new cryo-EM data is required.

      (6) Figure 10: 35G/35S are weird abbreviations for 35 mmHg Gap26 and scrambled Gap26.

      Yes, but we used these to fit into the available space.

      (7) Figure 11, legend: '20 mmHg PCO2 for each transfection for 70 mmHg PCO2'. It is not clear what is meant here.

      Thanks for pointing this out, we have reworded this to ensure clarity.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      One of the most novel things of the manuscript is the use of a relatively quick photoablation system. Could this technique be applied in other laboratories? While the revised manuscript includes more technical details as requested, the description remains difficult to follow for readers from a biology background. I recommend revising this section to improve clarity and accessibility for a broader scientific audience.

      As suggested, we have adapted the paragraph related to the photoablation technique in the Material & Method section, starting line 1147. We believe it is now easier to follow.

      The authors suggest that in the animal model, early 3h infection with Neisseria do not show increase in vascular permeability, contrary to their findings in the 3D in vitro model. However, they show a non-significant increase in permeability of 70 KDa Dextran in the animal xenograft early infection. As a bioengineer this seems to point that if the experiment would have been done with a lower molecular weight tracer, significant increases in permeability could have been detected. I would suggest to do this experiment that could capture early events in vascular disruption.

      Comparing permeability under healthy and infected conditions using Dextran smaller than 70 kDa is challenging. Previous research (1) has shown that molecules below 70 kDa already diffuse freely in healthy tissue. Given this high baseline diffusion, we believe that no significant difference would be observed before and after N. meningitidis infection, and these experiments were not carried out. As discussed in the manuscript, bacteria-induced permeability in mice occurs at later time points, 16h post-infection, as shown previously (2). As discussed in the manuscript, this difference between the xenograft model and the chip could reflect the absence of various cell types present in the tissue parenchyma or simply vessel maturation time.

      One of the great advantages of the system is the possibility of visualizing infection-related events at high resolution. The authors show the formation of actin in a honeycomb structure beneath the bacterial microcolonies. This only occurred in 65% of the microcolonies. Is this result similar to in vitro 2D endothelial cultures in static and under flow? Also, the group has shown in the past positive staining of other cytoskeletal proteins, such as ezrin, in the ERM complex. Does this also occur in the 3D system?

      We imaged monolayers of endothelial cells in the flat regions of the chip (the two lateral channels) using the same microscopy conditions (i.e., Obj. 40X N.A. 1.05) that have been used to detect honeycomb structures in the 3D vessels in vitro. We showed that more than 56% of infected cells present these honeycomb structures in 2D, which is 13% less than in 3D, and is not significant due to the distributions of both populations. Thus, we conclude that under both in vitro conditions, 2D and 3D, the amount of infected cells exhibiting cortical plaques is similar. These results are in Figure 4E and S4B.

      We also performed staining of ezrin in the chip and imaged both the 3D and 2D regions. Although ezrin staining was visible in 3D (Author response image 1), it was not as obvious as other markers under these infected conditions, and we did not include it in the main text. Interpretation of this result is not straightforward, as the substrate of the cells is different, and it would require further studies on the behavior of ERM proteins in these different contexts.

      Author response image 1.

      F-actin (red) and ezrin (yellow) staining after 3h of infection with N. meningitidis (green) in 2D (top) and 3D (bottom) vessel-on-chip models.

      Recommendation to the authors:

      Reviewer #1 (Recommendation to the authors):

      I appreciate that the authors addressed most of my comments, of special relevance are the change of the title and references to infection-on-chip. I think that the current choice of words better acknowledges the incipient but strong bioengineering infection community. I also appreciate the inclusion of a limitation paragraph that better frames the current work and proposes future advancements.

      The addition of more methodological details has improved the manuscript. Although as mentioned earlier the wording needs to be accessible for the biology community. I also appreciated the addition of the quantification of binding under the WSS gradient in the different geometries and shown in Fig 3H. However, the description of the figure and the legend is not clear. What does "vessel" mean on the graph and "normalized histograms ...(blue)" in the figure legend. Could the authors rephrase it?

      In Figure 3F, we investigated whether Neisseria meningitidis exhibits preferential sites of infection. We hypothesized that, if bacteria preferentially adhered to specific regions, the local shear stress at these sites would differ from the overall distribution. To test this, we compared the shear stress at bacterial adhesion sites in the VoC (orange dots and curve) with the shear stress along the entire vascular edges (blue dots and curve). The high Spearman correlation indicates that there is no distinct shear stress value associated with bacterial adhesion. This suggests that bacteria can adhere across all regions, independently of local shear stress. To enhance clarity, the legend of Figure 3 and the related text have been rephrased in the revised manuscript (L289-314).
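
      The comparison described above can be sketched in a few lines. This is a hedged illustration, not the analysis code used for the manuscript: the shear stress values, bin edges and function names are all hypothetical, and we assume that the comparison was made by correlating the two normalized histograms bin by bin with a Spearman rank correlation.

```python
def normalized_histogram(values, bin_edges):
    """Fraction of values in each bin; the last bin includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (is_last and v == bin_edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def average_ranks(xs):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical wall shear stress values (arbitrary units), for illustration only.
bin_edges = [0.0, 0.5, 1.0, 1.5, 2.0]
edge_shear = [0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 1.1, 1.2, 1.3, 1.7]
adhesion_shear = [0.1, 0.4, 0.6, 0.7, 0.9, 1.2, 1.4, 1.8]

rho = spearman(
    normalized_histogram(edge_shear, bin_edges),
    normalized_histogram(adhesion_shear, bin_edges),
)
print(f"Spearman correlation between the two distributions: {rho:.2f}")
```

      A correlation close to 1 means the distribution of shear stress at adhesion sites follows the distribution available along the vessel edges, which is the pattern expected if adhesion shows no preference for a particular shear stress.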

      Line 415. Should reference Fig S5B, not Fig 5B. Also, the titles in Supplementary Figure 4 and 5 are duplicated, and the description of the legend in Fig S5 seems a bit off. A and B seem to be swapped.

      Indeed, the reference to the right figure has been corrected. Also, the title of Figure S4 has been adapted to its contents, and the legend of Figure S5 has been corrected.

      Reviewer #2 (Recommendation to the authors):

      Minor comments to the authors:

      Line 163 "they formed" instead of "formed".

      Line 212 "two days" instead of "two day"

      Line 269 a space between two words is missing.

      These three comments have been addressed in the revised manuscript.

      In addition, I appreciate answering the comments, especially those requiring hypothesizing about including further cells. However, when discussing which other cells could be relevant for the model (lines 631 to 632) it would be beneficial to discuss not only the role of those cells but also how could they be included in the model. I think for the reader, inclusion of further cells could be seen as a challenge or limitation, and addressing these technical points in the discussion could be helpful.

      We thank Reviewer #2 for the insightful suggestion. Indeed, the method of introducing cells into the VoC depends on their type. Fibroblasts and dendritic cells, which are resident tissue cells, should be embedded in the collagen gel before polymerization and UV carving. This requires careful optimization to preserve chip integrity, as these cells exert pulling forces while migrating within the collagen matrix. In contrast, T cells and macrophages should be introduced through the vessel lumen to mimic their circulation in vivo. Pericytes can be co-seeded with endothelial cells, as they have been shown to self-organize within a few hours post-seeding. This important information is now included in the manuscript (L577-587).

      Reviewer #3 (Recommendation to the authors):

      Suggestions and Recommendations

      Some suggestions related to the VOC itself:

      Figure 1, Fig S1, paragraph starting line 1071: More information would be helpful for the laser photoablation. For instance, is a non-standard UV laser needed? Which form of UV light is used? What is the frequency of laser pulsing? How many pulses/how long is needed to ablate the region of interest?

      The photoablation process requires a focused UV-laser, with high frequency (10 kHz) to lower the carving time while providing the required intensity to degrade collagen gel. To carve a reproducible number of 30 µm-large vessels, we used a 2 µm-large laser beam at an energy of 10 mW and moved the stage (i.e., sample) at a maximum speed of 1 mm/s. This information has been added to the related paragraph starting on line 1147 of the revised manuscript.

      It is difficult to understand the geometry of the VOC. In Figure 1C, is the light coloration representing open space through which medium can flow, and the dark section the collagen? On a single chip, how many vessels are cut through the collagen? It looks as if at least two are cut in Figure 1C in the righthand photo.

      In Figure 1C, the light coloration is the F-actin staining. The horizontal upper and lower parts are the 2D lateral channels that also contain endothelial cells, and are connected to inlets and outlets, respectively. In the middle, two vertically carved 3D vessels are shown in the confocal image.

      Technically, we designed the PDMS structures to allow carving of 1 to 3 channels, maximizing the number of vessels that can be imaged while minimizing any loss of permeability at the PDMS/collagen/cells interface. This information has been added in the revised manuscript (L. 1147).

      If multiple vessels are cut in the center channel between the lateral channels, how do you ensure that medium flow is even between all vessels? A single chip with multiple different vessel architectures through the center channel would be expected to have different hydrostatic resistance with different architectures, thereby causing differences in flow rates in each vessel.

      To ensure a consistent flow rate regardless of the number of carved vessels, we opted to control the flow rate directly across the chip with a syringe pump. During experiments, one inlet and one outlet were closed, and a syringe pump was used. Because the carved vessels are arranged in parallel (derivation), the flow rate remains the same in each vessel. If a pressure controller had been used instead, the flow would have been distributed evenly across the different channels. This has been added to the revised manuscript in the paragraph starting on line 1210.

      The figures imply that the laser ablation can be performed at depth within the collagen gel, rather than just etching the surface. If this is the case, it should be stated explicitly. If not, this needs to be clarified.

      One of the main advantages of the photoablation technique is carving the collagen gel in volume, and not only etching the surface. Thanks to the 3D UV degradation, we can form the 3D architecture surrounded by the bulk collagen. This has been added to the revised manuscript, lines 154-155.

      Is the in-vivo-like vessel architecture connected to the lateral channel at an oblique angle, or is the image turned to fit the entire structure? (Figure 1F and 3E). Is that why there is high shear stress at its junction with the lateral channel depicted in Figure 3E?

      All structures require connection to the lateral channels to ensure media circulation and nutrient supply. The in vivo-like design must be rotated to allow the upper and lower branches of the complex structure to pass between the fixed PDMS pillars. To remain consistent with the image and the flow direction, we have kept the same orientation as in the COMSOL simulation. This leads to a locally higher shear stress at the top of the architecture. This has been added in the revised manuscript, in the paragraph starting on line 1474.

      Figure S1F,G: In the legend, shapes are circles, not squares. On the graphs, what do the numbers in parentheses mean?

      Indeed, the term "squares" has been replaced by "circles" in Figure 1. (1) and (2) refer to the providers of the collagen, FujiFilm and Corning, respectively. We have added this information to the legend of Figure S1.

      Figure 3B: how do the images on the left and right differ? Each of the 4 images needs to be explained.

      The four images represent the infected VoC from different viewing angles, illustrating the three-dimensional spread of infection throughout the vessel. A more detailed description has been added in the legend of Figure 3.

      Figure S3C is not referenced but should be, likely before sentence starting on line 299.

      Indeed, the reference to Figure S3C has been added on line 301 of the revised manuscript.

      Results in Figure 3 with the pilD mutant are very interesting. It is worth commenting in the Discussion about how T4P functionality in addition to the presence of T4P contributes to Nm infection, and how in the future this could be probed with pilT mutants.

      We thank Reviewer #3 for this relevant insight. Following adhesion, a key functionality of Neisseria meningitidis for colony formation and enhanced infection is twitching motility. As suggested, we have added in the Discussion the idea of using a PilT mutant, which can adhere but cannot retract its pili, in the VoC model to investigate the role of motility in colonization in vitro under flow conditions (L611–623).

      Which vessel design was used for the data presented in Figures 4, 5, and 6 and associated supplemental figures?

      Straight channels were used for most of the data in Figures 4, 5, and 6. Occasionally, we used the branched in vivo-like designs to look for infection patterns similar to those seen in vivo, and for the related neutrophil activity. This has been added in the revised manuscript, lines 1435-1439.

      Figure 4B-D: the images presented in Figure 4C are not representative of the averages presented in Figures 4B,D. For instance, the aggregates appear much larger and more elongated in the animal model in Figure 4C, but the animal model and VOC have the colony doubling time (implying same size) in Figure 4B, and same average aggregate elongation in Figure 4D.

      The images in Figure 4C were selected to illustrate the elongation of colonies quantified in Figure 4D. The elongation angles are consistent between both images and align with the channel orientation. Representative images of colony expansion over time, corresponding to Figure 4A and 4B, are provided in Figure S4A.

      Figures 4E-F: dextran does not appear to diffuse in the VOC in response to histamine in these images, yet there is a significant increase in histamine-induced permeability in Figure 4F. Dotted lines should be used to indicate vessel walls for histamine, and/or a more representative image should be selected. A control set of images should also be included for comparison.

      We thank Reviewer #3 for the insightful comment. We confirm that we have carefully selected representative images for the histamine condition and adjusted them to display the same range of gray levels. The apparent increase in permeability with histamine is explained by a slight rise in background fluorescence, combined with the smaller channel size shown in Figure 4E.

      Figure S4 title is a duplicate of Figure S5 and is unrelated to the content of Figure S4. Suggest rewording to mention changes in permeability induced by Nm infection in the VOC and animal model.

      Indeed, the title of Figure S4 did not correspond to its content. We have, thus, changed it in the revised manuscript.

      Line 489 "...our Vessel-on-Chip model has the potential to fully capture the human neutrophil response during vascular infections, in a species-matched microenvironment", is an overstatement. As presented, the VOC model only contains endothelial cells and neutrophils. Many other cell types and structures can affect neutrophil activity. Thus, it is an overstatement to claim that the model can fully capture the human neutrophil response.

      We agree with Reviewer #3 that neutrophil activity is fully recapitulated only in the presence of other cell types, such as platelets, pericytes, macrophages, dendritic cells, and fibroblasts, that secrete important molecules such as cytokines, chemokines, TNF-α, and histamine. In our simplified model, we were able to reconstitute the complex interaction of neutrophils with endothelial cells and with bacteria. The text was modified accordingly.

      Supplemental Figure 6 - Does CD62E staining overlap with sites of Nm attachment

      E-selectin staining does not systematically colocalize with Neisseria meningitidis colonies although bacterial adhesion is required. Its overall induced expression is heterogeneous across the tissue and shows heterogeneity from cell to cell as seen in vivo.

      Line 475, Figure 6E- Phagocytosis of Nm is described, but it is difficult to see. An arrow should be added to make this clear. Perhaps the reference should have been to Figure 6G? Consider changing the colors in Figure 6G away from red/green to be more color-blind friendly.

      Indeed, the correct reference is Figure 6G, where the phagocytosis event is zoomed in. We have changed it in the text. Changing the colors of Figure 6G would also require changing the color code throughout the manuscript, as red has been used for actin and green for Neisseria meningitidis.

      Lines 621-632 - This important discussion point should be reworked. Some suggested references to cite and discuss include PMID: 7913984, 15186399, 17991045, 18640287, 19880493.

      We have introduced the suggested references (3–7) in the Discussion, and discussed further the importance of introducing immune cells to study immune cell-bacteria interactions and the related immune response (L659-678).

      Minor corrections:

      •  Line 8 - suggest "photoablation-generated" instead of "photoablation-based"

      •  Line 57- remove the word "either", or modify the sentence

      •  Sentence on lines 162-165 needs rewording

      •  Lines 204-205- "loss of vascular permeability" should read "increase in vascular permeability"

      •  Line 293- "Measured" shear stress, should be "computed", since it was not directly measured (according to the Materials & Methods)

      •  Line 304- "consistently" should be "consistent"

      •  Fig. 3 legend, second line: replace "our" with "the VoC"

      •  Line 371, change "our" to "the"

      •  Line 415- Figure 5B doesn’t appear to show 2-D data. Is this in Figure S5B? Some clarification is needed. The quantification of Nm vessel association in both the VOC and the animal model should be shown in Figure 5, for direct comparison.

      •  Supplementary Figure 5C: correlation coefficient with statistical significance should be calculated.

      •  Figure 6 title, rephrase to "The infected VOC model"

      •  Line 450, replace "important" with "statistically significant"

      •  Line 459, suggest rephrasing to "bacterial pilus-mediated adhesion"

      •  Line 533- grammar needs correction

      •  Line 589- should be "sheds"

      •  Line 1106- should be "pellet"

      •  Lines 1223-1224 - is the antibody solution introduced into the inlet of the VOC for staining? Please clarify.

      •  Line 1295-unclear why Figure 2B is being referenced here

      All the suggested minor corrections have been taken into account in the revised manuscript.

      References

      (1) Gyohei Egawa, Satoshi Nakamizo, Yohei Natsuaki, Hiromi Doi, Yoshiki Miyachi, and Kenji Kabashima. Intravital analysis of vascular permeability in mice using two-photon microscopy. Scientific Reports, 3(1):1932, Jun 2013. ISSN 2045-2322. doi: 10.1038/srep01932.

      (2) Valeria Manriquez, Pierre Nivoit, Tomas Urbina, Hebert Echenique-Rivera, Keira Melican, Marie-Paule Fernandez-Gerlinger, Patricia Flamant, Taliah Schmitt, Patrick Bruneval, Dorian Obino, and Guillaume Duménil. Colonization of dermal arterioles by Neisseria meningitidis provides a safe haven from neutrophils. Nature Communications, 12(1):4547, Jul 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-24797-z.

      (3) Katherine A. Rhodes, Man Cheong Ma, María A. Rendón, and Magdalene So. Neisseria genes required for persistence identified via in vivo screening of a transposon mutant library. PLOS Pathogens, 18(5):1–30, 05 2022. doi: 10.1371/journal.ppat.1010497.

      (4) Heli Uronen-Hansson, Liana Steeghs, Jennifer Allen, Garth L. J. Dixon, Mohamed Osman, Peter Van Der Ley, Simon Y. C. Wong, Robin Callard, and Nigel Klein. Human dendritic cell activation by Neisseria meningitidis: phagocytosis depends on expression of lipooligosaccharide (LOS) by the bacteria and is required for optimal cytokine production. Cellular Microbiology, 6(7):625–637, 2004. doi: 10.1111/j.1462-5822.2004.00387.x.

      (5) M. C. Jacobsen, P. J. Dusart, K. Kotowicz, M. Bajaj-Elliott, S. L. Hart, N. J. Klein, and G. L. Dixon. A critical role for ATF2 transcription factor in the regulation of E-selectin expression in response to non-endotoxin components of Neisseria meningitidis. Cellular Microbiology, 18(1):66–79, 2016. doi: 10.1111/cmi.12483.

      (6) Andrea Villwock, Corinna Schmitt, Stephanie Schielke, Matthias Frosch, and Oliver Kurzai. Recognition via the class A scavenger receptor modulates cytokine secretion by human dendritic cells after contact with Neisseria meningitidis. Microbes and Infection, 10(10):1158–1165, 2008. ISSN 1286-4579. doi: 10.1016/j.micinf.2008.06.009.

      (7) Audrey Varin, Subhankar Mukhopadhyay, Georges Herbein, and Siamon Gordon. Alternative activation of macrophages by il-4 impairs phagocytosis of pathogens but potentiates microbial-induced signalling and cytokine secretion. Blood, 115(2):353–362, Jan 2010. ISSN 0006-4971. doi: 10.1182/blood-2009-08-236711.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The manuscript by Choi and colleagues investigates the impact of variation in cortical geometry and growth on cortical surface morphology. Specifically, the study uses physical gel models and computational models to evaluate the impact of varying specific features/parameters of the cortical surface. The study makes use of this approach to address the topic of malformations of cortical development and finds that cortical thickness and cortical expansion rate are the drivers of differences in morphogenesis.

      The study is composed of two main sections. First, the authors validate numerical simulation and gel model approaches against real cortical postnatal development in the ferret. Next, the study turns to modelling malformations in cortical development using modified tangential growth rate and cortical thickness parameters in numerical simulations. The findings investigate three genetically linked cortical malformations observed in the human brain to demonstrate the impact of the two physical parameters on folding in the ferret brain.

      This is a tightly presented study that demonstrates a key insight into cortical morphogenesis and the impact of deviations from normal development. The dual physical and computational modeling approach offers the potential for unique insights into mechanisms driving malformations. This study establishes a strong foundation for further work directly probing the development of cortical folding in the ferret brain. One weakness of the current study is that the interpretation of the results in the context of human cortical development is at present indirect, as the modelling results are solely derived from the ferret. However, these modelling approaches demonstrate proof of concept for investigating related alterations more directly in future work through similar approaches to models of the human cerebral cortex.

      We thank the reviewer for the very positive comments. While the current gel and organismal experiments focus on the ferret only, we want to emphasize that our analysis does consider previous observations of human brains and morphologies therein (Tallinen et al., Proc. Natl. Acad. Sci. 2014; Tallinen et al., Nat. Phys. 2016), which we compare and explain. This allows us to draw broader implications from our study for explaining cortical malformations in humans, with the ferret serving as the motivating model. Further analysis of normal human brain growth using computational and physical gel models can be found in our companion paper (Yin et al., 2025), now also published in eLife: S. Yin, C. Liu, G. P. T. Choi, Y. Jung, K. Heuer, R. Toro, L. Mahadevan, Morphogenesis and morphometry of brain folding patterns across species. eLife, 14, RP107138, 2025. doi:10.7554/eLife.107138

      In future work, we plan to obtain malformed human cortical surface data, which would allow us to further investigate related alterations more directly. We have added a remark on this in the revised manuscript (please see page 8–9).

      Reviewer #2 (Public review):

      Summary:

      Based on MRI data of the ferret (a gyrencephalic non-primate animal, in whom folding happens postnatally), the authors create in vitro physical gel models and in silico numerical simulations of typical cortical gyrification. They then use genetic manipulations of animal models to demonstrate that cortical thickness and expansion rate are primary drivers of atypical morphogenesis. These observations are then used to explain cortical malformations in humans.

      Strengths:

      The paper is very interesting and original, and combines physical gel experiments, numerical simulations, as well as observations in MCD. The figures are informative, and the results appear to have good overall face validity.

      We thank the reviewer for the very positive comments.

      Weaknesses:

      On the other hand, I perceived some lack of quantitative analyses in the different experiments, and currently, there seems to be rather a visual/qualitative interpretation of the different processes and their similarities/differences. Ideally, the authors also quantify local/pointwise surface expansion in the physical and simulation experiments, to more directly compare these processes. Time courses of eg, cortical curvature changes, could also be plotted and compared for those experiments. I had a similar impression about the comparisons between simulation results and human MRI data. Again, face validity appears high, but the comparison appeared mainly qualitative.

      We thank the reviewer for the comments. Besides the visual and qualitative comparisons between the models, we would like to point out that we have included the quantification of the shape difference between the real and simulated ferret brain models via spherical parameterization and the curvature-based shape index as detailed in main text Fig. 4 and SI Section 3. We have also utilized spherical harmonics representations for the comparison between the real and simulated ferret brains at different maximum order N. In our revision, we have included more calculations for the comparison between the real and simulated ferret brains at more time points in the SI (please see SI page 6). As for the comparison between the malformation simulation results and human MRI data in the current work, since the human MRI data are two-dimensional while our computational models are three-dimensional, we focus on the qualitative comparison between them. In future work, we plan to obtain malformed human cortical surface data, from which we can then perform the parameterization-based and curvature-based shape analysis for a more quantitative assessment.
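
      For readers unfamiliar with a curvature-based shape index, the following is a minimal sketch of one common definition computed from the two principal curvatures at a surface point. The sign convention, the function name, and the handling of planar points are assumptions made for illustration; the authors' exact definition is the one given in their SI Section 3 and may differ.

```python
from math import atan2, pi

def shape_index(k1, k2):
    """Shape index in [-1, 1] from the two principal curvatures.

    Assumed convention: s = (2/pi) * atan((k1 + k2) / (k1 - k2)) with k1 >= k2.
    Whether s = +1 is a dome or a cup depends on the surface-normal orientation.
    Returns None at planar points (k1 == k2 == 0), where the index is undefined.
    """
    k1, k2 = max(k1, k2), min(k1, k2)
    if k1 == 0 and k2 == 0:
        return None
    # atan2 handles the umbilic case k1 == k2 without dividing by zero.
    return (2 / pi) * atan2(k1 + k2, k1 - k2)

print(shape_index(1.0, 1.0))    # umbilic (sphere-like) point
print(shape_index(1.0, 0.0))    # ridge-like (cylindrical) point
print(shape_index(1.0, -1.0))   # symmetric saddle
```

      Because the index depends only on the ratio of the curvatures, it describes local shape independently of scale, which is what makes it convenient for comparing surfaces of different sizes.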

      I felt that MCDs could have been better contextualized in the introduction.

      We thank the reviewer for the comment. In our revision, we have revised the description of MCDs in the introduction (please see page 2).

      Reviewer #1 (Recommendations for the authors):

      The study is beautifully presented and offers an excellent complement to the work presented by Yin et al. In its current form, the malformation portion of the study appears predominantly reliant on the numerical simulations rather than the gel model. It might be helpful, therefore, to further incorporate the results presented in Figure S5 into the main text, as this seems to be a clear application of the physical gel model to modelling malformations. Any additional use of the gel models in the malformation portion of the study would help to further justify the necessity and complementarity of the dual methodological approaches.

      We thank the reviewer for the suggestion. We have moved Fig. S5 and the associated description to the main text in the revised manuscript (please see the newly added Figure 5 on page 6 and the description on page 5–7). In particular, we have included a new section on the physical gel and computational models for ferret cortical malformations right before the section on the neurology of ferret and human cortical malformations.

      One additional consideration is that the analyses in the current study focus entirely on the ferret cortex. Given the emphasis in the title on the human brain, it may be worthwhile to either consider adding additional modelling of the human cortex or to consider modifying the title to more accurately align with the focus of the methods/results.

      We thank the reviewer for the suggestion. While the current gel and organismal experiments focus on the ferret only, we want to emphasize that our analysis does consider previous observations of human brains and morphologies therein (Tallinen et al., Proc. Natl. Acad. Sci. 2014; Tallinen et al., Nat. Phys. 2016), which we compare and explain. This allows us to draw broader implications from our study for explaining cortical malformations in humans, with the ferret serving as the motivating model. Therefore, we think that the title of the paper is reasonable. To further highlight the connection between the ferret brain simulations and human brain growth, we have included an additional comparison between human brain surface reconstructions adapted from a prior study and the ferret simulation results in the SI (please see SI Section S4 and SI Fig. S5 on page 9–10).

      Two additional minor points:

      Table S1 seems sufficiently critical to the motivation for the study and organization of the results section to justify inclusion in the main text. Of course, I would leave any such minor changes to the discretion of the authors.

      We thank the reviewer for the suggestion. We have moved Table S1 and the associated description to the main text in the revised manuscript (please see Table 1 on page 7).

      Page 7, Column 1: “macacques” → “macaques”.

      We thank the reviewer for pointing out the typo. We have fixed it in the revised manuscript (please see page 8).

      Reviewer #2 (Recommendations for the authors):

      The methods lack details on the human MRI data and patients.

      We thank the reviewer for the comment. Note that the human MRI data and patients were from prior works (Smith et al., Neuron 2018; Johnson et al., Nature 2018; Akula et al., Proc. Natl. Acad. Sci. 2023) and were used for the discussion on cortical malformations in Fig. 6. In the revision, we have included a new subsection in the Methods section and provided more details and references of the MRI data and patients (please see page 9–10).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The statistically adequate way of testing the biases is a hierarchical regression model (LMM) with a distance of the physical location from the nipple as a predictor, and a distance of the reported location from the nipple as a dependent variable. Either variable can be unsigned or signed for greater power, for example, coding the lateral breast as negative and the medial breast as positive. The bias will show in regression coefficients smaller than 1.

      Thank you for this suggestion. We have subsequently replaced the relevant ANOVA analyses with LMM analyses. Specifically, we use an LMM for breast and back separately to show the different effects of distance, then use a combined LMM to compare the interaction. Finally, we use an LMM to assess the differences between precision and bias on the back and breast. The new analysis confirms earlier statements and does not change the results/interpretation of the data.
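
      As a rough illustration of the regression logic suggested by the reviewer, the sketch below simulates reported distances that are compressed toward the nipple and recovers that compression as a slope below 1. It is not the authors' LMM: it fits an ordinary least-squares line per participant and averages the slopes, a simple two-stage stand-in for a model with per-participant random slopes. All names, the simulated compression factor (0.8), the noise level, and the tested distances are assumptions chosen for illustration.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_reports(rng, slope, noise_sd, distances):
    """Reported signed distance = slope * physical distance + Gaussian noise."""
    return [slope * d + rng.gauss(0.0, noise_sd) for d in distances]

rng = random.Random(0)
# Signed physical distances from the nipple: lateral negative, medial positive.
physical = [-6, -4, -2, 2, 4, 6] * 10

slopes = []
for _ in range(10):  # 10 hypothetical participants
    participant_slope = rng.gauss(0.8, 0.05)  # assumed compression toward the nipple
    reported = simulate_reports(rng, participant_slope, 1.0, physical)
    slopes.append(ols_slope(physical, reported))

mean_slope = sum(slopes) / len(slopes)
print(f"mean slope across participants: {mean_slope:.2f}")
```

      A mean slope below 1 means that reports lie closer to the nipple than the physical touches did; a slope of 1 would indicate no such compression.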

      Moreover, any bias towards the nipple could simply be another instance of regression to the mean of the stimulus distribution, given that the tested locations were centered on the nipple. This confound can only be experimentally solved by shifting the distribution of the tested locations. Finally, given that participants indicated the locations on a 3D model of the body part, further experimentation would be required to determine whether there is a perceptual bias towards the nipple or whether the authors merely find a response bias.

      A localization bias toward the nipple in this context does not show that the nipple is the anchor of the breast's tactile coordinate system. The result might simply be an instance of regression to the mean of the stimulus distribution (also known as experimental prior). To convincingly show localization biases towards the nipple, the tested locations should be centered at another location on the breast.

      Another problem is the visual salience of the nipple, even though Blender models were uniformly grey. With this type of direct localization, it is very difficult to distinguish perceptual from response biases even if the regression to the mean problem is solved. There are two solutions to this problem: 1) Varying the uncertainty of the tactile spatial information, for example, by using a pen that exerts lighter pressure. A perceptual bias should be stronger for more uncertain sensory information; a response bias should be the same across conditions. 2) Measure bias with a 2IFC procedure by taking advantage of the fact that sensory information is noisier if the test is presented before the standard.

      We believe that the fact that we explicitly tested two locations with equally distributed test locations, both of which had landmarks, makes this unlikely. Indeed, testing on the back is exactly what the reviewer suggests. It would also be impossible to test this “on another location on the breast” as we are sampling across the whole breast. Moreover, as markers persisted on the model within each block, the participants were generating additional landmarks on each trial. Thus, if there were any regression to the mean, this would be observed for both locations. Nevertheless, we recognize that this test cannot distinguish between a sensory bias towards the nipple and consistent response bias that is always in the direction of the nipple, though to what extent these are the same thing is difficult to disentangle. That said, if we had restricted testing to half of the breast such that the distribution of points was asymmetrical this would allow us to test the hypothesis put forward by the reviewer. We recognize that this is a limitation of the data and have downplayed statements and added caveats accordingly.

      We have changed the appropriate heading and text in the discussion to downplay the finding:

      “Reports are biased towards the nipple”

      “suggesting that the nipple plays a pivotal role in the mental representation of the breast.”
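
      The regression-to-the-mean concern raised by the reviewer can be made concrete with a toy Gaussian observer model. This is a sketch under stated assumptions, not an analysis from the manuscript: the observer is assumed to combine an unbiased noisy measurement with a prior centered on the mean of the tested locations, and all parameter values are hypothetical.

```python
def posterior_mean(measurement, prior_mean, prior_sd, sensory_sd):
    """Gaussian prior times Gaussian likelihood: a precision-weighted average."""
    w = prior_sd**2 / (prior_sd**2 + sensory_sd**2)  # weight given to the measurement
    return w * measurement + (1 - w) * prior_mean

def expected_bias(location, prior_mean, prior_sd, sensory_sd):
    """Expected report minus true location (the measurement is unbiased on average)."""
    return posterior_mean(location, prior_mean, prior_sd, sensory_sd) - location

# Tested locations centered on the nipple (position 0, arbitrary units).
centered = expected_bias(4.0, prior_mean=0.0, prior_sd=3.0, sensory_sd=1.5)
# Same touch, but the tested locations are shifted so that their mean is at +6.
shifted = expected_bias(4.0, prior_mean=6.0, prior_sd=3.0, sensory_sd=1.5)
# Noisier touch (for example, lighter pressure) with the centered distribution.
noisier = expected_bias(4.0, prior_mean=0.0, prior_sd=3.0, sensory_sd=3.0)

print(centered, shifted, noisier)
```

      In this toy model the bias points toward the mean of the tested locations, not toward the nipple as such, and it grows with sensory noise. A bias that kept pointing at the nipple after the tested distribution was shifted would therefore argue against a pure regression-to-the-mean account.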

      it might be harder to learn the range of locations on the back given that stimulation is not restricted to an anatomically defined region as it is the case for the breast.

      We apologize for any confusion but the point distribution is identical between tasks, as described in the methods.

      The stability of the JND differences between body parts across subjects is already captured in the analysis of the JNDs; the ANOVA and the post-hoc testing would not be significant if the order were not relatively stable across participants. Thus, it is unclear why this is being evaluated again with reduced power due to improper statistics.

      We apologize for any confusion here. Only one ANOVA with post-hoc testing was performed on the data. The second parenthetical describing the test was perhaps redundant and confusing, so we have removed it.

      “(Figure [unresolved cross-reference] A, B, 1-way ANOVA with Tukey’s HSD post-hoc t-test: p = 0.0284)”

      The null hypothesis of an ANOVA is that all mean values are equal, and the alternative is that at least one of the mean values is different from the others; adding participants as a factor does not provide evidence for similarity.

      We agree with this statement and have removed the appropriate text.

      The pairwise correlations between body parts seem to be exploratory in nature. Like all exploratory analyses, the question arises of how much potential extra insights outweigh the risk of false positives. It would be hard to generate data with significant differences between several conditions and not find any correlations between pairs of conditions. Thus, the a priori chance of finding a significant correlation is much higher than what a correction accounts for.

      We broadly agree with this statement. However, we believe that the analyses were important to determine if participants were systematically more or less acute across body parts. Moreover, the fact that we did not observe any other significant relationships, together with the post-hoc correction we performed, implies that no false positives were observed. Indeed, for the one relationship that was observed, dismissing it would require assuming an FDR more than 10x higher than the existing post-hoc correction does, which implies a true relationship.

      If the JND at mid breast (measured with locations centered at the nipple) is roughly the same size as the nipple, it is not surprising that participants have difficulty with the categorical localization task on the nipple but perform better than chance on the significantly larger areola.

      We agree that it is not surprising given the previously shown data, however, the initial finding is surprising to many and this experiment serves to reinforce the previous finding.

      Neither signed nor absolute localization error can be compared to the results of the previous experiments. The JND should be roughly proportional to the variance of the errors.

      We apologize for any confusion, however we are not comparing the values, merely observing that the results are consistent.

      Reviewer #2 (Public review):

      I had a hard time understanding some parts of the report. What is meant by "broadly no relationship" in line 137?

      We have removed the qualifier to simplify the text.

      It is suggested that spatial expansion (which is correlated with body part size) is related between medial breast and hand - is this to say that women with large hands have large medial breast size? Nipple size was measured, but hand size was not measured, is this correct?

      Correct. We have added text stating this.

      It is furthermore unclear how the authors differentiate medial breast and NAC. The sentence in lines 140-141 seems to imply the two terms are considered the same, as a conclusion about NAC is drawn from a result about the medial breast. This requires clarification.

      Thank you for catching this, we have corrected it in the text.

      Finally, given that the authors suspect that overall localization ability (or attention) may be overshadowed by a size effect, would not an analysis be adequate that integrates both, e.g. a regression with multiple predictors?

      If the reviewer means that participants would be consistently “acute” then we believe that SF1 would have stronger correlations. Consequently, we see no reason to add “overall tactile acuity” as a predictor.

      In the paragraph about testing quadrants of the nipple, it is stated that only 3 of 10 participants barely outperformed chance with a p < 0.01. It is unclear how a significant t-test is an indication of "barely above chance".

      We have adjusted the text to clarify our meaning.

      “On the nipple, however, participants were consistently worse at locating stimuli than on the breast (paired t-test, t = 3.42, p < 0.01), and only 3 of the 10 participants outperformed chance, though the group as a whole outperformed chance (Figure [unresolved cross-reference] B, 36% ± 13%; Z = 5.5, p < 0.01).”
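
      To show what a per-participant test against chance could look like for a four-quadrant task, here is a minimal sketch of a one-sided exact binomial test. The chance level of 25% follows from there being four quadrants; the number of trials and the number of correct responses are hypothetical, and the test the authors actually used is not specified here.

```python
from math import comb

def binomial_p_at_least(k, n, p):
    """One-sided exact binomial p-value: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

chance = 0.25    # four quadrants
n_trials = 40    # hypothetical number of trials for one participant
n_correct = 14   # hypothetical: 35% correct, close to the reported group mean

p_value = binomial_p_at_least(n_correct, n_trials, chance)
print(f"P(X >= {n_correct} | n = {n_trials}, p = {chance}) = {p_value:.3f}")
```

      With these hypothetical numbers, a participant performing near the group mean does not individually reach p < 0.01, which illustrates how few individuals may exceed chance even when the pooled group does.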

      The final part of the paragraph on nipple quadrants (starting line 176) explains that there was a trend (4 of 10 participants) for lower tactile acuity being related to the inability to differentiate quadrants. It seems to me that such a result would not be expected: The stated hypothesis is that all participants have the same number of tactile sensors in their nipple and areola, independent of NAC size. In this section, participants determine the quadrant of a single touch. Theoretically, all participants should be equally able to perform this task, because they all have the same number of receptors in each quadrant of nipple and areola. Thus, the result in Figure 2C is curious.

      We agree that this result seemingly contradicts observations from the previous experiment, however we believe that it relates to the distinction between the ability to perform relative distinctions and absolute localizations. In the first experiment, the presentation of two sequential points provides an implicit reference whereas in the quadrant task there is no reference. With the results of the third experiment in mind, biases towards the nipple would effectively reduce the ability of participants to identify the quadrant. What this result may imply is that the degree of bias is greater for women with greater expansion. We have added text to the discussion to lay this out.

      “This negative trend implicitly contradicts the previous result where one might expect equal performance regardless of size as the location of the stimuli was scaled to the size of the nipple and areola. However, given the absence of a reference point, systematic biases are more likely to occur and thus may reflect a relationship between localization bias and breast size.”

      This section reports an Anova (line 193/194) with a factor "participant". This doesn't appear sensible. Please clarify. The factor distance is also unclear; is this a categorical or a continuous variable? Line 400 implies a 6-level factor, but Anovas and their factors, respectively, are not described in methods (nor are any of the other statistical approaches).

      We believe this comment has been addressed above with our replacement of the ANOVA with an LMM. We have also added descriptions of the analysis throughout the methods.

      The analysis on imprecision using mean pairwise error (line 199) is unclear: does pairwise refer to x/y or to touch vs. center of the nipple?

      We have clarified this to now read:

      “To measure the imprecision, we computed the mean pairwise distance between each of the reported locations for a given stimulus location and the mean reported location.”
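
      The imprecision measure described in the quoted sentence can be sketched as follows, alongside a companion bias measure for contrast. The function names, the use of 2D coordinates, and the example values are assumptions for illustration; only the definition of imprecision (mean distance between each report and the mean report) is taken from the text.

```python
from math import dist

def centroid(points):
    """Mean location of a set of points (any dimensionality)."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def imprecision(reports):
    """Mean Euclidean distance between each reported location and the mean report."""
    c = centroid(reports)
    return sum(dist(p, c) for p in reports) / len(reports)

def bias(reports, stimulus):
    """Distance between the mean reported location and the true stimulus location."""
    return dist(centroid(reports), stimulus)

# Four hypothetical reports for one stimulus location.
reports = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(imprecision(reports))        # spread of the reports around their own mean
print(bias(reports, (4.0, 5.0)))   # offset of the mean report from the stimulus
```

      Separating the two quantities matters because reports can be tightly clustered (low imprecision) yet systematically displaced from the stimulus (high bias), or the reverse.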

      p8, upper text, what is meant by "relative over-representation of the depth axis"? Does this refer to the breast having depth but the equivalent area on the back not having depth? What are the horizontal planes (probably meant to be singular?) - do you simply mean that depth was ignored for the calculation of errors? This seems to be implied in Figure 3AB.

      This is indeed what we meant. We have attempted to clarify in the text.

      “Importantly, given the relative over-representation of the depth axis for the breast, we only considered angles in the horizontal planes such that the shape of the breast did not influence the results.” Became:

      “Importantly, because the back is a relatively flat surface in comparison to the breast, errors were only computed in the horizontal plane and depth was excluded when computing the angular error.”

      Lines 232-241, I cannot follow the conclusions drawn here. First, it is not clear to a reader what the aim of the presented analyses is: what are you looking for when you analyze the vectors? Second, "vector strength" should be briefly explained in the main text. Third, it is not clear how the final conclusion is drawn. If there is a bias of all locations towards the nipple, then a point closer to the nipple cannot exhibit a large bias, because the nipple is close-by. Therefore, one would expect that points close to the nipple exhibit smaller errors, but this would not imply higher acuity - just less space for localizing anything. The higher acuity conclusion is at odds with the remaining results, isn't it: acuity is low on the outer breast, but even lower at the NAC, so why would it be high in between the two?

      Thank you for pointing out the circular logic. We have replaced this sentence with a more accurate statement.

      “Given these findings, we conclude that the breast has lower tactile acuity than the hand and is instead comparable to the back. Moreover, localization of tactile events to both the back and breast are inaccurate but localizations to the breast are consistently biased towards the nipple.”

      The discussion makes some concrete suggestions for sensors in implants (line 283). It is not clear how the stated numbers were computed. Also, why should the 4 nipple quadrants receive individual sensors if the result here was that participants cannot distinguish these quadrants?

      Thank you for catching this, it should have been 4 sensors for the NAC, not just the nipple. We have fixed this in the text.

      I would find it interesting to know whether participants with small breast measurement delta had breast acuity comparable to the back. Alternatively, it would be interesting to know whether breast and back acuity are comparable in men. Such a result would imply that the torso has uniform acuity overall, but any spatial extension of the breast is unaccounted for. The lowest single participant data points in Figure 1B appear similar, which might support this idea.

      We agree that this is an interesting question and, as you point out, the data does indicate that in cases of minimal expansion, acuity may be constant on the torso. However, in the comparison of the JNDs, post-hoc testing revealed no significant difference between the back and either breast region. Consequently, subsampling the group would yield the same result. We have added a sentence to the discussion stating this.

      “Consequently, the acuity of the breast is likely determined initially by torso acuity and then any expansion.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The authors only report the quality of the classification considering the number of videos used for training, but not considering the number of mice represented or the mouse strain. Therefore, it is unclear if the classification model works equally well in data from all the mouse strains tested, and how many mice are represented in the classifier dataset and validation.

      We agree that strain-level performance is critical for assessing generalizability. In the revision we now report per-strain accuracy and F1 for the grooming classifier, which was trained on videos spanning 60 genetically diverse strains (n = 1100 videos) and evaluated on test-set videos spanning 51 genetically diverse strains (n = 153 videos). Performance is uniform across most strains (median F1 = 0.94, IQR = 0.899–0.956), with only modest declines in albino lines that lack contrast under infrared illumination; this limitation and potential remedies are discussed in the text. The new per-strain metrics are presented in the Supplementary figure (corresponding to Figure 4).
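
      As an illustration of how per-strain F1 and its median/IQR summary could be computed, the sketch below assumes frame-level binary labels tagged with a strain name. The record layout, function name, and toy data are hypothetical and do not reproduce the evaluation code or the values reported above:

```python
from collections import defaultdict
from statistics import median, quantiles

def per_strain_f1(records):
    """records: iterable of (strain, true_label, predicted_label) with 0/1 labels per frame."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for strain, truth, pred in records:
        c = counts[strain]
        if pred == 1 and truth == 1:
            c["tp"] += 1
        elif pred == 1 and truth == 0:
            c["fp"] += 1
        elif pred == 0 and truth == 1:
            c["fn"] += 1
    scores = {}
    for strain, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        # F1 = 2TP / (2TP + FP + FN); undefined if the strain has no positives at all
        scores[strain] = 2 * c["tp"] / denom if denom else float("nan")
    return scores

# Toy data: (strain, truth, prediction)
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 0, 1), ("B", 0, 0),
    ("C", 1, 1), ("C", 1, 1),
]
scores = per_strain_f1(records)
values = sorted(scores.values())
q1, _, q3 = quantiles(values, n=4)  # quartiles for the IQR
print(scores, median(values), (q1, q3))
```

      Summarizing by strain rather than pooling all frames keeps a few heavily sampled strains from masking poor performance in sparsely sampled ones.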

      (2) The GUI requires pose tracking for classification, but the software provided in JABS does not do pose tracking, so users must do pose tracking using a separate tool. Currently, there is no guidance on the pose tracking recommendations and requirements for usage in JABS. The pose tracking quality directly impacts the classification quality, given that it is used for the feature calculation; therefore, this aspect of the data processing should be more carefully considered and described.

      We have added a section to the methods describing how to use the pose estimation models used in JABS. The reviewer is correct that pose tracking quality will impact classification quality. We recommend that classifiers should only be re-used on pose files generated by the same pose models used in the behavior classifier training dataset. We hope that the combination of sharing classifier training data and making a more unified framework for developing and comparing classifiers will get us closer to having foundational behavior classification models that work in many environments. We also would like to emphasize that deviating from using our pose model will also likely hinder re-using our shared large datasets in JABS-AI (JABS1200, JABS600, JABS-BxD).

      (3) Many statistical and methodological details are not described in the manuscript, limiting the interpretability of the data presented in Figures 4,7-8. There is no clear methods section describing many of the methods used and equations for the metrics used. As an example, there are no details of the CNN used to benchmark the JABS classifier in Figure 4, and no details of the methods used for the metrics reported in Figure 8.

      We thank the reviewer for bringing this to our attention. We have added a methods section to the manuscript to address this concern. Specifically, we now provide: (1) improved citation visibility of the source of CNN experiments such that the reader can locate the architecture information, (2) mathematical formulations for all performance metrics (precision, recall, F1, …) with explicit equations;  (3) detailed statistical procedures including permutation testing methods, power analysis and multiple testing corrections used throughout Figures 7-8. These additions facilitate reproducibility and proper interpretation of all quantitative results presented in the manuscript.
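
      For readers unfamiliar with permutation testing, a generic two-sided test on a difference in group means can be sketched as follows. This is an illustrative assumption about the general technique only; the test statistic, number of permutations, and function name are not taken from the manuscript's Methods:

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign group membership while keeping group sizes fixed
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return (hits + 1) / (n_permutations + 1)

p_separated = permutation_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
p_identical = permutation_test([1, 2, 3], [1, 2, 3])
print(p_separated, p_identical)
```

      When many such tests are run (for example, one per behavior or per strain), the resulting p-values would still need a multiple-testing correction.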

      Reviewer #2 (Public review):

      (1) The manuscript as written lacks much-needed context in multiple areas: what are the commercially available solutions, and how do they compare to JABS (at least in terms of features offered, not necessarily performance)? What are other open-source options?

      JABS adds to a list of commercial and open source animal tracking platforms. There are several reviews and resources that cover these technologies. JABS covers hardware, behavior prediction, a shared resource for classifiers, and genetic association studies. We’re not aware of another system that encompasses all these components. Commercial packages such as EthoVision XT and HomeCage Scan give users a ready-made camera-plus-software solution that automatically tracks each mouse and reports simple measures such as distance travelled or time spent in preset zones, but they do not provide open hardware designs, editable behavior classifiers, or any genetics workflow. At the open-source end, the >100 projects catalogued on OpenBehavior and summarised in recent reviews (Luxem et al., 2023; Işık & Ünal 2023) usually cover only one link in the chain—DIY rigs, pose-tracking libraries (e.g., DeepLabCut, SLEAP) or supervised and unsupervised behaviour-classifier pipelines (e.g., SimBA, MARS, JAABA, B-SOiD, DeepEthogram). JABS provides an open source ecosystem that integrates all four: (i) top-down arena hardware with parts list and assembly guide; (ii) an active-learning GUI that produces shareable classifiers; (iii) a public web service that enables sharing of the trained classifier and applies any uploaded classifier to a large and diverse strain survey; and (iv) built-in heritability, genetic-correlation and GWAS reporting. We have added a concise paragraph in the Discussion that cites these resources and makes this end-to-end distinction explicit.

      (2) How does the supervised behavioral classification approach relate to the burgeoning field of unsupervised behavioral clustering (e.g., Keypoint-MoSeq, VAME, B-SOiD)? 

      The reviewer raises an important point about the rapidly evolving landscape of automated behavioral analysis, where both supervised and unsupervised approaches offer complementary strengths for different experimental contexts. Unsupervised methods like Keypoint-MoSeq, VAME, and B-SOiD prioritize motif discovery from unlabeled data but may yield less precise alignments with expert annotations, as evidenced by lower F1 scores in comparative evaluations. Supervised approaches (like ours), by contrast, train classifiers on expert labels to deliver frame-accurate, behavior-specific scores that align directly with experimental hypotheses. Ultimately, a pragmatic hybrid strategy, starting with unsupervised pilots to identify motifs and transitioning to supervised fine-tuning with minimal labels, can minimize annotation burdens and enhance both discovery and precision in ethological studies. This has been added in the discussion section of the manuscript.

      (3) What kind of studies will this combination of open field + pose estimation + supervised classifier be suitable for? What kind of studies is it unsuited for? These are all relevant questions that potential users of this platform will be interested in.

      This approach is suitable for a wide array of neuroscience, genetics, pharmacology, preclinical, and ethology studies. We have published in the domains of action detection for complex behaviors such as grooming, gait and posture, frailty, nociception, and sleep. We feel these tools are indispensable for modern behavior analysis. 

      (4) Throughout the manuscript, I often find it unclear what is supported by the software/GUI and what is not. For example, does the GUI support uploading videos and running pose estimation, or does this need to be done separately? How many of the analyses in Figures 4-6 are accessible within the GUI?

      We have now clarified these. The JABS framework comprises two distinct GUI applications with complementary functionalities. The JABS-AL (active learning) desktop application handles video upload, behavioral annotation, classifier training, and inference; it does not perform pose estimation, which must be completed separately using our pose tracking pipeline (https://github.com/KumarLabJax/mouse-tracking-runtime). If a user does not want to use our pose tracking pipeline, we have provided utilities to convert SLEAP output to our JABS pose format. The web-based GUI enables classifier sharing and cloud-based inference on our curated datasets (JABS600, JABS1200), as well as downstream behavioral statistics and genetic analyses (Figures 4-6). The JABS-AL application also supports CLI (command line interface) operation for batch processing. We have clarified these distinctions and provided a comprehensive workflow diagram in the revised Methods section.

      (5) While the manuscript does a good job of laying out best practices, there is an opportunity to further improve reproducibility for users of the platform. The software seems likely to perform well with perfect setups that adhere to the JABS criteria, but it is very likely that there will be users with suboptimal setups - poorly constructed rigs, insufficient camera quality, etc. It is important, in these cases, to give users feedback at each stage of the pipeline so they can understand if they have succeeded or not. Quality control (QC) metrics should be computed for raw video data (is the video too dark/bright? are there the expected number of frames? etc.), pose estimation outputs (do the tracked points maintain a reasonable skeleton structure; do they actually move around the arena?), and classifier outputs (what is the incidence rate of 1-3 frame behaviors? a high value could indicate issues). In cases where QC metrics are difficult to define (they are basically always difficult to define), diagnostic figures showing snippets of raw data or simple summary statistics (heatmaps of mouse location in the open field) could be utilized to allow users to catch glaring errors before proceeding to the next stage of the pipeline, or to remove data from their analyses if they observe critical issues.

      These are excellent suggestions that align with our vision for improving user experience and data quality assessment. We recognize the critical importance of providing users with comprehensive feedback at each stage of the pipeline to ensure optimal performance across diverse experimental setups. Currently, we provide end-users with tools and recommendations to inspect their own data quality. In our released datasets (Strain Survey OFA and BXD OFA), we provide video-level quality summaries for coverage of our pose estimation models. 

      For behavior classification quality control, we employ two primary strategies to ensure proper operation: (a) outlier manual validation and (b) leveraging known characteristics about behaviors. For each behavior that we predict on datasets, we manually inspect the highest and lowest expressions of this behavior to ensure that the new dataset we applied it to maintains sufficient similarity. For specific behavior classifiers, we utilize known behavioral characteristics to identify potentially compromised predictions. As the reviewer suggested, high incidence rates of 1-3 frame bouts for behaviors that typically last multiple seconds would indicate performance issues.
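
      The short-bout check described above can be sketched as follows, assuming per-frame binary predictions for one behavior. The function names and the 3-frame threshold are illustrative and do not correspond to the in-house post-processing scripts:

```python
from itertools import groupby

def bout_lengths(frame_labels):
    """Lengths (in frames) of consecutive runs where the behavior is predicted (label 1)."""
    return [sum(1 for _ in run) for label, run in groupby(frame_labels) if label == 1]

def short_bout_fraction(frame_labels, max_short=3):
    """Fraction of predicted bouts that last max_short frames or fewer."""
    lengths = bout_lengths(frame_labels)
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n <= max_short) / len(lengths)

# Toy prediction trace: bouts of 1, 6, and 2 frames
labels = [0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0]
print(bout_lengths(labels), short_bout_fraction(labels))
```

      A high fraction for a behavior that normally lasts seconds would flag the video for manual inspection; the acceptable fraction depends on the behavior and the frame rate.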

      We currently maintain in-house post-processing scripts that handle quality control according to our specific use cases. Future releases of JABS will incorporate generalized versions of these scripts, integrating comprehensive QC capabilities directly into the platform. This will provide users with automated feedback on video quality, pose estimation accuracy, and classifier performance, along with diagnostic visualizations such as movement heatmaps and behavioral summary statistics.

      Reviewer #1 (Recommendations for the authors):

      (1) A weakness of this tool is that it requires pose tracking, but the manuscript does not detail how pose tracking should be done and whether users should expect that the data deposited will help their pose tracking models. There is no specification on how to generate pose tracking that will be compatible with JABS. The classification quality is directly linked to the quality of the pose tracking. The authors should provide more details of the requirements of the pose tracking (skeleton used) and what pose tracking tools are compatible with JABS. In the user website link, I found no such information. Ideally, JABS would be integrated with the pose tracking tool into a single pipeline. If that is not possible, then the utility of this tool relies on more clarity on which pose tracking tools are compatible with JABS.

      The JABS ecosystem was deliberately designed with modularity in mind, separating the pose estimation pipeline from the active learning and classification app (JABS-AL) to offer greater flexibility and scalability for users working across diverse experimental setups. Our pose estimation pipeline is documented in detail within the new Methods subsection, outlining the steps to obtain JABS-compatible keypoints with our recommended runtime (https://github.com/KumarLabJax/mouse-tracking-runtime) and frozen inference models (https://github.com/KumarLabJax/deep-hrnet-mouse). This pipeline is an independent component within the broader JABS workflow, generating skeletonized keypoint data that are then fed into the JABS-AL application for behavior annotation and classifier training.

      By maintaining this separation, users have the option to use their preferred pose tracking tools, such as SLEAP, while ensuring compatibility through provided conversion utilities to the JABS skeleton format. These details, including usage instructions and compatibility guidance, are now thoroughly explained in the newly added pose estimation subsection of our Methods section. This modular design approach ensures that users benefit from best-in-class tracking while retaining the full power and reproducibility of our active learning pipeline.

      (2) The authors should justify why JAABA was chosen to benchmark their classifier. This tool was published in 2013, and there have been other classification tools (e.g., SimBA) published since then.

      We appreciate the reviewer’s suggestion regarding SimBA. However, our comparisons to JAABA and a CNN are based on results from prior work (Geuther, Brian Q., et al. "Action detection using a neural network elucidates the genetics of mouse grooming behavior." eLife 10 (2021): e63207.), where both were used to benchmark performance on our publicly released dataset. In this study, we introduce JABS as a new approach and compare it against those established baselines. While SimBA may indeed offer competitive performance, we believe the responsibility to demonstrate this lies with SimBA’s authors, especially given the availability of our dataset for benchmarking.

      (3) I had a lot of trouble understanding the elements of the data calculated in JABS vs outside of JABS. This should be clarified in the manuscript.

      (a) For example, it was not intuitive that pose tracking was required and had to be done separately from the JABS pipeline. The diagrams and figures should more clearly indicate that.

      (b) In section 2.5, are any of those metrics calculated by JABS? Another software, GEMMA, is mentioned, but no citation is provided for this tool. This created ambiguity regarding whether this is an analysis that is separate from JABS or integrated into the pipeline.

      We acknowledge the confusion regarding the delineation between JABS components and external tools, and we have comprehensively addressed this throughout the manuscript. The JABS ecosystem consists of three integrated modules: JABS-DA (data acquisition), JABS-AL (active learning for behavior annotation and classifier training), and JABS-AI (analysis and integration via web application). Pose estimation, while developed by our laboratory, operates as a preprocessing pipeline that generates the keypoint coordinates required for subsequent JABS classifier training and annotation workflows. We have now added a dedicated Methods subsection that explicitly maps each analytical step to its corresponding software component, clearly distinguishing between core JABS modules and external tools (such as GEMMA for genetic analysis). Additionally, we have provided proper citations and code repositories for all external pipelines to ensure complete transparency regarding the computational workflow and enable full reproducibility of our analyses.

      (4) There needs to be clearer explanations of all metrics, methods, and transformations of the data reported.

      (a) There is very little information about the architecture of the classification model that JABS uses.

      (b) There are no details on the CNN used for comparing and benchmarking the classifier in JABS.

      (c) Unclear how the z-scoring of the behavioral data in Figure 7 was implemented.

      (d) There is currently no information on how the metrics in Figure 8 are calculated.

      We have added a comprehensive Methods section that not only addresses the specific concerns raised above but provides complete methodological transparency throughout our study. This expanded section includes detailed descriptions of all computational architectures (including the JABS classifier and grooming benchmark models and metrics), statistical procedures and data transformations (including the z-scoring methodology for Figure 7), downstream genetic analysis (including all measures presented in Figure 8), and preprocessing pipelines. 

      (5) The authors talk about their datasets having visual diversity, but without seeing examples, it is hard to know what they mean by this visual diversity. Ideally, the manuscript would have a supplementary figure with a representation of the variety of setups and visual diversity represented in the datasets used to train the model. This is important so that readers can quickly assess from reading the manuscript if the pre-trained classifier models could be used with the experimental data they have collected.

      The visual diversity of our training datasets has been comprehensively documented in our previous tracking work (https://www.nature.com/articles/s42003-019-0362-1), which systematically demonstrates tracking performance across mice with diverse coat colors (black, agouti, albino, gray, brown, nude, piebald), body sizes including obese mice, and challenging recording conditions with dynamic lighting and complex environments. Notably, Figure 3B in that publication specifically illustrates the robustness across coat colors and body shapes that characterize the visual diversity in our current classifier training data. To address the reviewer's concern and enable readers to quickly assess the applicability of our pre-trained models to their experimental data, we have now added this reference to the manuscript to ground our claims of visual diversity in published evidence.

      (6) All figures have a lot of acronyms used that are not defined in the figure legend. This makes the figures really hard to follow. The figure legends for Figures 1,2, 7, and 9 did not have sufficient information for me to comprehend the figure shown.

      We have fixed this in the manuscript. 

      (7) In the introduction, the authors talk about compression artifacts that can be introduced in camera software defaults. This is very vague without specific examples.

      This is a complex topic that balances the size and quality of video data and is beyond the scope of this paper. We have carefully optimized this parameter and given the user a balanced solution. A more detailed blog post on compression artifacts can be found at our lab’s webpage (https://www.kumarlab.org/2018/11/06/brians-video-compression-tests/). We have also added a comment about keyframes shifting temporal features in the main manuscript. 

      (8) More visuals of the inside of the apparatus should be included as supplementary figures. For example, to see the IR LEDs surrounding the camera.

      We have shared data from JABS as part of several papers, including the tracking paper (Geuther et al 2019) and the grooming, gait and posture, and mouse mass papers. We have also released entire datasets as part of this paper (JABS1800, JABS-BXD). We also have a step-by-step assembly guide that shows the location of the lights/cameras and other parts (see Methods, the JABS workflow guide, and this PowerPoint file in the GitHub repository: https://github.com/KumarLabJax/JABS-datapipeline/blob/main/Multi-day%20setup%20PowerPoint%20V3.pptx).

      (9) Figure 2 suggests that you could have multiple data acquisition systems simultaneously. Do each require a separate computer? And then these are not synchronized data across all boxes?

      Each JABS-DA unit has its own edge device (Nvidia Jetson). Each system (which we define as multiple JABS-DA arenas associated with one lab/group) can have multiple recording devices (arenas). The system requires only 1 control portal (RPi computer) and can handle as many recording devices as needed (an Nvidia computer with a camera is associated with each JABS-DA arena). To collect data, 1 additional computer is needed to visit the web control portal and initiate a recording session. Since this is a web portal, users can use any computer or a tablet. The recording devices are not strictly synchronized but can be controlled in a unified manner.

      (10) The list of parts on GitHub seems incomplete; many part names are not there.

      We thank the referee for bringing this to our attention. We have updated the GitHub repository (and its README), which now links out to the design files.

      (11) The authors should consider adding guidance on how tethers and headstages are expected to impact the use of JABS, as many labs would be doing behavioral experiments combined with brain measurements.

      While our pose estimation model was not specifically trained on tethered animals, published research demonstrates that keypoint detection models maintain robust performance despite the presence of headstages and recording equipment. Once accurate pose coordinates are extracted, the downstream behavior classification pipeline operates independently of the pose estimation method and would remain fully functional. We recommend users validate pose estimation accuracy in their specific experimental setup, as the behavior classification component itself is agnostic to the source of pose coordinates.

      Reviewer #2 (Recommendations for the authors):

      (1) "Using software-defaults will introduce compression artifacts into the video and will affect algorithm performance." Can this be quantified? I imagine most of the performance hit comes from a decrease in pose estimation quality. How does a decrease in pose estimation quality translate to action segmentation? Providing guidelines to potential users (e.g., showing plots of video compression vs classifier performance) would provide valuable information for anyone looking to use this system (and could save many labs countless hours replicating this experiment themselves). A relevant reference for the effect of compression on pose estimation is Mathis, Warren 2018 (bioRxiv): On the inference speed and video-compression robustness of DeepLabCut.

      Since our behavior classification approach depends on features derived from keypoints, changes in keypoint accuracy will affect behavior segmentation accuracy. We agree that it is important to try to understand this further, particularly given the shared bioRxiv paper investigating the effect of compression on pose estimation accuracy. Measuring the effect of compression on keypoint and behavior classification is a complex task to evaluate concisely, given the number of potential variables to inspect. A few variables that should be investigated are: discrete cosine transform quality (Mathis, Warren experiment), frame size (Mathis, Warren experiment), keyframe interval (new, unique to video data), inter-frame settings (new, unique to video data), behavior of interest, pose models with compression-augmentation used in training (https://arxiv.org/pdf/1506.08316?), and type of CNN used (under active development). The simplest recommendation that we can make at this time is that compression will affect behavior predictions and that users should be cautious about using our shared classifiers on compressed video data. To show that we are dedicated to sharing these results as we run those experiments, we have already begun to inspect how changing some factors affects behavior segmentation performance in a related work, an accepted CV4Animals conference paper (https://www.cv4animals.com/) that can be downloaded here: https://drive.google.com/file/d/1UNQIgCUOqXQh3vcJbM4QuQrq02HudBLD/view. In that work, we investigate the robustness of behavior classification across multiple behaviors using different keypoint subsets. Our finding is that classifiers are relatively stable across different keypoint subsets. We are actively working on follow-up efforts to investigate the effect of keypoint noise, CNN model architecture, and the other factors listed above on behavior segmentation tasks.

      (2) The analysis of inter-annotator variability is very interesting. I'm curious how these differences compare to two other types of variability:

      (a) intra-annotator variability; I think this is actually hard to quantify with the presented annotation workflow. If a given annotator re-annotated a set of videos, but using different sparse subsets of the data, it is not possible to disentangle annotator variability versus the effect of training models on different subsets of data. This can only be rigorously quantified if all frames are labeled in each video.

      We propose an alternative approach to behavior classifier development in the text associated with Figure 3C. We do not advocate for high inter-annotator agreement, since individual behavior experts have differing labeling styles (each reflecting an intuitive understanding of the behavior). Rather, we allow multiple classifiers for the same behavior and allow the end user to prioritize classifiers based on the heritability of the behavior from a classifier.

      (b) In lieu of this, I'd be curious to see the variability in model outputs trained on data from a single annotator, but using different random seeds or train/val splits of the data. This analysis would provide useful null distributions for each annotator and allow for more rigorous statistical arguments about inter-annotator variability. 

      JABS allows the user to use multiple classifiers (random forest, XGBoost). We do not expect the user to carry out hyperparameter tuning or other forms of optimization. We find that the major increase in performance comes from optimizing the size of the window features and folds of cross validation. However, future versions of JABS-AL could enable a complete hyper-parameter scan across seeds and data splits to obtain a null distribution for each annotator. 

      (c) I appreciate the open-sourcing of the video/pose datasets. The authors might also consider publicly releasing their pose estimation and classifier training datasets (i.e., data plus annotations) for use by method developers.

      We thank the referee for acknowledging our commitment to open data sharing practices. Building upon our previously released strain survey dataset, we have now also made our complete classifier training resources publicly available, including the experimental videos, extracted pose coordinates, and behavioral annotations. The repository link has been added to the manuscript to ensure full reproducibility and facilitate community adoption of our methods.  

      (3) More thorough discussion on the limitations of the top-down vs bottom-up camera viewpoint; are there particular scientific questions that are much better suited to bottom-up videos (e.g., questions about paw tremors, etc.).

      Top-down, bottom-up, and multi-view imaging each have a variety of pros and cons. Generally speaking, multi-view imaging will provide the most accurate pose models but requires increased resources for both hardware setup and data processing. Top-down imaging provides the advantage of flexibility in materials, since the floor doesn’t need to be transparent. Additionally, lighting and potential reflections are concerns with the bottom-up perspective. Since the paws are not occluded from the bottom-up perspective, models should have improved paw keypoint precision, allowing the model to observe more subtle behaviors. However, the appearance of the arena floor will change over time as the mice defecate and urinate. Care must be taken to clean the arena between recordings to ensure transparency is maintained. This has little impact on top-down imaging but will occlude or distort the view from the bottom-up perspective. Additionally, the inclusion of bedding for longer recordings, which is required by IACUC, will essentially render bottom-up imaging useless because the bedding will completely obscure the mouse. Overall, while bottom-up imaging may provide a precision benefit that will greatly enhance detection of subtle motion, top-down imaging is more robust for obtaining consistent imaging across large experiments for longer periods of time.

      (4) More thorough discussion on what kind of experiments would warrant higher spatial or temporal resolution (e.g., investigating slight tremors in a mouse model of neurodegenerative disease might require this greater resolution).

      This is an important topic that deserves its own perspective guide. We try to capture some of this in the paper on specifications. However, we only scratch the surface. Overall, there are tradeoffs between frame rate, resolution, color/monochrome, and compression. Labs have collected data at hundreds of frames per second to capture the kinetics of reflexive behavior for pain (Abdus-Saboor lab) or whisking behavior. Labs have also collected data at as low as 2.5 frames per second for tracking activity or centroid tracking (see Kumar et al., PNAS). The data collection specifications are largely dependent on the behaviors being captured. Our rule of thumb is the Nyquist limit, which states that the data capture rate needs to be at least twice the frequency of the event. For example, certain syntaxes of grooming occur at 7 Hz, so we need at least 14 FPS to capture them. JABS collects data at 30 FPS, which is a good compromise between data load and behavior rate. We use 800x800 pixel resolution, which is a good compromise to capture animal body parts while limiting data size. Thank you for providing the feedback that the field needs guidance on this topic. We will work on creating such guidance documents for video data acquisition parameters to capture animal behavior data for the community as a separate publication.
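
      The Nyquist rule of thumb can be written as a short calculation. The following is a minimal sketch, not part of JABS; the function names and the optional safety factor are assumptions introduced for illustration.

```python
def min_frame_rate(event_hz: float, safety_factor: float = 1.0) -> float:
    """Lowest frame rate (FPS) that samples an event at least twice per cycle.

    safety_factor is a hypothetical oversampling margin (1.0 = bare Nyquist limit).
    """
    if event_hz <= 0:
        raise ValueError("event_hz must be positive")
    if safety_factor < 1.0:
        raise ValueError("safety_factor must be >= 1.0")
    return 2.0 * event_hz * safety_factor


def max_resolvable_hz(fps: float) -> float:
    """Highest event frequency (Hz) that a given frame rate can resolve."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return fps / 2.0


# Values taken from the response: 7 Hz grooming, 30 FPS and 2.5 FPS recordings.
grooming_fps = min_frame_rate(7.0)
jabs_limit_hz = max_resolvable_hz(30.0)
slow_limit_hz = max_resolvable_hz(2.5)
```

      Under this rule, a 30 FPS recording resolves events up to 15 Hz, whereas a 2.5 FPS recording resolves only events slower than 1.25 Hz, which is why the latter suits activity or centroid tracking rather than fast kinematics.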

      (5) References 

      (a) Should add the following ref when JAABA/MARS are referenced: Goodwin et al. 2024, Nat Neuro (SimBA)

      (b) Could also add Bohnslav et al. 2021, eLife (DeepEthogram).

      (c) The SuperAnimal DLC paper (Ye et al. 2024, Nature Comms) is relevant to the introduction/discussion as well.

      We thank the referee for the suggestions. We have added these references.  

      (6) Section 2.2:

      While I appreciate the thoroughness with which the authors investigated environmental differences in the JABS arena vs standard wean cage, this section is quite long and eventually distracted me from the overall flow of the exposition; might be worth considering putting some of the more technical details in the methods/appendix.

      These are important data for adopters of JABS to gain IACUC approval in their home institution. These committees require evidence that any new animal housing environment has been shown to be safe for the animals. In the development of JABS, we spent a significant amount of time addressing the JAX veterinary and IACUC concerns. Therefore, we propose that these data deserve to be in the main text. 

      (7) Section 2.3.1:

      (a) Should again add the DeepEthogram reference here

      (b) Should reference some pose estimation papers: DeepLabCut, SLEAP, Lightning Pose. 

      We thank the referee for the suggestions. We have added these references.  

      (c) "Pose based approach offers the flexibility to use the identified poses for training classifiers for multiple behaviors" - I'm not sure I understand why this wouldn't be possible with the pixel-based approach. Is the concern about the speed of model training? If so, please make this clearer.

      The advantage lies not just in training speed, but in the transferability and generalization of the learned representations. Pose-based approaches create structured, low-dimensional latent embeddings that capture behaviorally relevant features which can be readily repurposed across different behavioral classification tasks, whereas pixel-based methods require retraining the entire feature extraction pipeline for each new behavior. Recent work demonstrates that pose-based models achieve greater data efficiency when fine-tuned for new tasks compared to pixel-based transfer learning approaches [1], and latent behavioral representations can be partitioned into interpretable subspaces that generalize across different experimental contexts [2]. While pixel-based approaches can achieve higher accuracy on specific tasks, they suffer from the "curse of dimensionality" (requiring thousands of pixels vs. 12 pose coordinates per frame) and lack the semantic structure that makes pose-based features inherently reusable for downstream behavioral analysis.

      (1) Ye, Shaokai, et al. "SuperAnimal pretrained pose estimation models for behavioral analysis." Nature communications 15.1 (2024): 5165.

      (2) Whiteway, Matthew R., et al. "Partitioning variability in animal behavioral videos using semi-supervised variational autoencoders." PLoS computational biology 17.9 (2021): e1009439.  

      (d) The pose estimation portion of the pipeline needs more detail. Do users use a pretrained network, or do they need to label their own frames and train their own pose estimator? If the former, does that pre-trained network ship with the software? Is it easy to run inference on new videos from a GUI or scripts? How accurate is it in compliant setups built outside of JAX? How long does it take to process videos?

      We have added guidance on pose estimation to the manuscript (section “2.3.1 Behavior annotation and classifier training” and the methods section titled “Pose tracking pipeline”).

      (e) The final paragraph describing how to arrive at an optimal classifier is a bit confusing - is this the process that is facilitated by the app, or is this merely a recommendation for best practices? If this is the process the app requires, is it indeed true that multiple annotators are required? While obviously good practice, I imagine there will be many labs that just want a single person to annotate, at least in the beginning prototyping stages. Will the app allow training a model with just a single annotator?

      We have clarified this in the text. 

      (8) Section 2.5:

      (a) This section contained a lot of technical details that I found confusing/opaque, and didn't add much to my overall understanding of the system; sec 2.6 did a good job of clarifying why 2.5 is important. It might be worth motivating 2.5 by including the content of 2.6 first, and moving some of the details of 2.5 to the method/appendix.

      We moved some of the technical details in section 2.5 to the methods section titled “Genetic analysis”. Furthermore, we have added a few statements to motivate the need for genetic analysis and to explain how the webapp, which is introduced in section 2.6, can facilitate it.

      (9) Minor corrections:

      (a) Bottom of first page, "always been behavior quantification task" missing "a".

      (b) "Type" column in Table S2 is undocumented and unused (i.e., all values are the same); consider removing.

      (c) Figure 4B, x-axis: add units.

      (d) Page 8/9: all panel references to Figure S1 are off by one

      We have fixed them in the updated manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      This paper by Poverlein et al reports the substantial membrane deformation around the oxidative phosphorylation super complex, proposing that this deformation is a key part of super complex formation. I found the paper interesting and well-written.

      We thank the Reviewer for finding our work interesting. 

      Analysis of the bilayer curvature is challenging on the fine lengthscales they have used and produces unexpectedly large energies (Table 1). Additionally, the authors use the mean curvature (Eq. S5) as input to the (uncited, but it seems clear that this is Helfrich) Helfrich Hamiltonian (Eq. S7). If an errant factor of one half has been included with curvature, this would quarter the curvature energy compared to the real energy, due to the squared curvature.

      We thank the Reviewer for raising this important issue. We have now clarified in the SI and main manuscript that we employ the Helfrich model. In our initial implementation, we indeed used the mean curvature H, thereby missing a factor of 2. As the Reviewer correctly noted, this resulted in curvature deformation energies that were underestimated by a factor of ~4. We have now corrected for this effect in the revised analysis and updated Table 1. Importantly, however, this correction does not alter the general conclusion of our work that supercomplex formation relieves membrane strain and stabilizes the system. We have added a paragraph in which we discuss the magnitude of the observed bending effects and compare it with previous estimates in the literature:

      SI: 

      “The local mean curvature of the membrane midplane was computed using the Helfrich model (4,5) …”

      (4) W. Helfrich, Elastic properties of lipid bilayers theory and possible experiments. Zeitschrift für Naturforschung 28c, 693-703 (1973).

      (5) F. Campelo et al., Helfrich model of membrane bending: From Gibbs theory of liquid interfaces to membranes as thick anisotropic elastic layers. Advances in Colloid and Interface Science 208, 25-33 (2014).

      Main Text: 

      “which measures the energetic cost of deforming the membrane from a flat geometry (ΔG<sub>curv</sub>) based on the Helfrich model (45, 46). …

      Our analysis suggests that both contributions are substantially reduced upon formation of the SC, with the curvature penalty decreasing by 79.2 ± 5.2 kcal mol<sup>-1</sup> (for a membrane area of ca. 1000 nm<sup>2</sup>) and the thickness penalty by 2.8 ± 2.0 kcal mol<sup>-1</sup> (Table 1).”

      “We note that the magnitude of the estimated bending energies (~10<sup>2</sup> kcal mol<sup>-1</sup>) (Table 1), while seemingly high at first glance, falls within the range expected for large-scale membrane deformation processes induced by large multi-domain proteins. For example, the Piezo mechanosensitive channel performs roughly 150k<sub>B</sub>T (≈ 90 kcal mol<sup>-1</sup>) of work to bend the bilayer into its dome-like shape (65). Comparable energies have also been estimated for the nucleation of small membrane pores (66), while vesicle formation typically requires bending energies on the order of 300 kcal mol<sup>-1</sup>, largely independent of vesicle size (67). When normalized by the affected membrane area (~1000 nm<sup>2</sup>), these values correspond to an energy density of approximately 0.1 kcal mol<sup>-1</sup> nm<sup>-2</sup>, which places our estimates within a biophysically reasonable regime. Notably, cryo-EM structures of several supercomplexes show that such assemblies can impose significant curvature on the surrounding bilayer (36, 50, 68), supporting the notion that respiratory chain organization is closely coupled to local membrane deformation. Nevertheless, we expect that the absolute deformation energies may be overestimated, as the continuum Helfrich model neglects molecular-level effects such as lipid tilt and local rearrangements, which can partially relax curvature stresses and reduce the effective bending penalty near protein–membrane interfaces (69, 70).”
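
      The factor of ~4 raised by the Reviewer can be traced in one line. The sketch below assumes the standard Helfrich bending energy with zero spontaneous curvature and no Gaussian-curvature term; these are illustrative assumptions, and the exact form of Eq. S7 is not reproduced here. The symbols c_1 and c_2 (principal curvatures) and E' (the energy obtained with the missing factor) are introduced for this illustration only.

```latex
% Helfrich bending energy written in terms of the mean curvature H = (c_1 + c_2)/2
\[
  E_{\mathrm{bend}}
  = \frac{\kappa}{2} \int (c_1 + c_2)^2 \, \mathrm{d}A
  = \frac{\kappa}{2} \int (2H)^2 \, \mathrm{d}A
  = 2\kappa \int H^2 \, \mathrm{d}A .
\]
% Inserting H where the total curvature 2H is expected gives
\[
  E'
  = \frac{\kappa}{2} \int H^2 \, \mathrm{d}A
  = \frac{1}{4} \, E_{\mathrm{bend}} ,
\]
% so the energy is underestimated by a factor of 4, because the curvature enters squared.
```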

      The bending modulus used (ca. 5 kcal/mol) is small on the scale of typically observed biological bending moduli. This suggests the curvature energies are indeed much higher even than the high values reported. Some of this may be due to the spontaneous curvature of the lipids and perhaps the effect of the protein modifying the nearby lipids properties.

      The SI initially included an incorrect value for the bending modulus (20 kJ mol<sup>-1</sup> instead of 20k<sub>B</sub>T), which has now been corrected. The revised value is consistent with experimentally reported bending moduli from X-ray scattering measurements, although there remains substantial uncertainty in the precise values across different experimental and computational studies.

      “The bending deformation energy was computed from the mean curvature field H(x,y), assuming a constant bilayer bending modulus κ (taken as 20k<sub>B</sub>T = 11.85 kcal mol<sup>-1</sup> (6)):”

      (6) S. Brown et al., Comparative analysis of bending moduli in one-component membranes via coarse-grained molecular dynamics simulations. Biophysical Journal 124, 1–13 (2025).
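
      The unit mix-up between 20 kJ mol<sup>-1</sup> and 20k<sub>B</sub>T is easy to check numerically. The following is a minimal sketch; the temperature of 298.15 K is an assumption (it is not stated in the response, but it reproduces the quoted 11.85 kcal mol<sup>-1</sup>), and the function names are introduced for illustration.

```python
# Molar gas constant in kcal mol^-1 K^-1 (8.314462618 J mol^-1 K^-1 / 4184 J kcal^-1).
R_KCAL_PER_MOL_K = 1.987204e-3
KJ_PER_KCAL = 4.184


def kbt_to_kcal_per_mol(n_kbt: float, temperature_k: float = 298.15) -> float:
    """Convert a multiple of k_B*T into kcal/mol (per mole, so k_B becomes R)."""
    return n_kbt * R_KCAL_PER_MOL_K * temperature_k


def kj_to_kcal_per_mol(kj_per_mol: float) -> float:
    """Convert kJ/mol into kcal/mol."""
    return kj_per_mol / KJ_PER_KCAL


corrected_modulus = kbt_to_kcal_per_mol(20.0)   # intended value, 20 k_B T
erroneous_modulus = kj_to_kcal_per_mol(20.0)    # value first reported, 20 kJ/mol
piezo_work = kbt_to_kcal_per_mol(150.0)         # Piezo example, 150 k_B T

print(f"20 kBT    = {corrected_modulus:.2f} kcal/mol")
print(f"20 kJ/mol = {erroneous_modulus:.2f} kcal/mol")
print(f"150 kBT   = {piezo_work:.1f} kcal/mol")
```

      Under the assumed temperature, 20 kJ mol<sup>-1</sup> corresponds to just under 5 kcal mol<sup>-1</sup>, which matches the "ca. 5 kcal/mol" value the Reviewer flagged as small.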

      It is unclear how CDL is supporting SC formation if its effect stabilizing the membrane deformation is strong or if it is acting as an electrostatic glue. While this is a weakness for a definite quantification of the effect of CDL on SC formation, the study presents an interesting observation of CDL redistribution and could be an interesting topic for future work.

      We agree with the Reviewer that future studies would be important to investigate the relationship between CDL-induced stabilization of membrane and its electrostatic effects.  

      In summary, the qualitative data presented are interesting (especially the combination of molecular modeling with simpler Monte Carlo modeling aiding broader interpretation of the results). The energies of the membrane deformations are quite large. This might reflect the roles of specific lipids stabilizing those deformations, or the inherent difficulty in characterizing nanometer-scale curvature.

      We thank the Reviewer for appreciating our work and for the help in further improving our findings.

      Reviewer #3 (Public review):

      Summary:

      In this contribution, the authors report atomistic, coarse-grained and lattice simulations to analyze the mechanism of supercomplex (SC) formation in mitochondria. The results highlight the importance of membrane deformation as one of the major driving forces for the SC formation, which is not entirely surprising given prior work on membrane protein assembly, but certainly of major mechanistic significance for the specific systems of interest.

      We thank Reviewer 3 for appreciating the importance of our study. 

      Strengths:

      The combination of complementary approaches, including an interesting (re)analysis of cryo-EM data, is particularly powerful, and might be applicable to the analysis of related systems. The calculations also revealed that SC formation has interesting impacts on the structural and dynamical (motional correlation) properties of the individual protein components, suggesting further functional relevance of SC formation. In the revision, the authors further clarified and quantified their analysis of membrane responses, leading to further insights into membrane contributions. They have also toned down the decomposition of membrane contributions into enthalpic and entropic contributions, which is difficult to do. Overall, the study is rather thorough, highly creative and the impact on the field is expected to be significant.

      Weaknesses:

      Upon revision, I believe the weakness identified in previous work has been largely alleviated.

      We thank the Reviewer for their previous remarks, which allowed us to significantly improve our manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Circannual timing is a phylogenetically widespread phenomenon in long-lived organisms and is central to the seasonal regulation of reproduction, hibernation, migration, fur color changes, body weight, and fat deposition in response to photoperiodic changes. Photoperiodic control of thyroid hormone T3 levels in the hypothalamus dictates this timing. However, the mechanisms that regulate these changes are not fully understood. The study by Stewart et al. reports that hypothalamic iodothyronine deiodinase 3 (Dio3), the major inactivator of the biologically active thyroid hormone T3, plays a critical role in circannual timing in the Djungarian hamster. Overall, the study yields important results for the field and is well-conducted, with the exception of the CRISPR/Cas9 manipulation.

      We appreciate the positive and supportive comment from the Reviewer. We have clarified the oversight in the CRISPR/Cas9 data representation below. Our correction should alleviate the concerns raised.

      Figure 1 lays the foundation for examining circannual timing by establishing the timing of induction, maintenance, and recovery phases of the circannual timer upon exposure of hamsters to short photoperiod (SP) by monitoring morphological and physiological markers. Measures of pelage color, torpor, body mass, plasma glucose, etc, established that the initiation phase occurred by weeks 4-8 in SP, the maintenance by weeks 12-20, and the recovery after week 20, where all morphological and physiological changes started to reverse back to long photoperiod phenotypes.

      The statistical analyses look fine, and the results are unambiguous.

      We thank the Reviewer for recognizing our attempts to highlight the phenomenon of circannual interval timing.

      Their representation could, however, be improved. In Figures 1d and 1e, two different measures are plotted on each graph and differentiated by dots and upward or downward arrowheads. The plots are so small, though, that distinguishing between the direction of the arrows is difficult. Some color coding would make it more reader-friendly. The same comment applies to Figure S4. 

      We have increased the panel size for Figures 1d and 1e. We have also changed the colour of the graphs in Figures 1d and 1e to facilitate the differentiation of the two dependent variables. For the circos plots, we attempted different ways to represent the data. We have opted to keep the figures in their current state. The overall aim is to provide a ‘gestalt’ view of the timing of changes in transcript expression and to highlight only a few key genes. The whole dataset is provided in the supplementary materials for Reviewer/Reader interrogation.

      The authors went on to profile the transcriptome of the mediobasal and dorsomedial hypothalamus, paraventricular nucleus, and pituitary gland (all known to be involved in seasonal timing) every 4 weeks over the different phases of the circannual interval timer. A number of transcripts displaying seasonal rhythms in expression levels in each of the investigated structures were identified, including transcripts whose expression peaks during each phase. This included two genes of particular interest due to their known modulation of expression in response to photoperiod, Dio3 and Sst, found among the transcripts upregulated during the induction and maintenance phases, respectively. The experiments are technically sound and properly analyzed, revealing interesting candidates. Again, my main issues lie with the representation in the figure. In particular, the authors should clarify what the heatmaps on the right of Figures 1f and 1g represent. I suspect they are simply heatmaps of averaged expression of all genes within a defined category, but a description is missing in the legend, as well as a scale for color coding near the figure.

      We have clarified the heatmap and density maps in the Figure legend (see lines 644-648). We apologise for the lack of information describing the figure panels.

      Figure 2 reveals that SP-programmed body mass loss is correlated to increased Dio3-dependent somatostatin (Sst) expression. First, to distinguish whether the body mass loss was controlled by rheostatic mechanisms and not just acute homeostatic changes in energy balance, experiments from hamsters fed ad lib or experiencing an acute food restriction in both LP and SP were tested. Unlike plasma insulin, food restriction had no additional effect on SP-driven epididymal fat mass loss (Figure S7). This clearly establishes a rheostatic control of body mass loss across weeks in SP conditions. Importantly, Sst expression in the mediobasal hypothalamus increased in both ad lib fed or restriction fed SP hamsters and this increase in expression could be reduced by a single subcutaneous injection of active T3, clearly suggesting that increase in Sst expression in SP is due to a decrease of active T3 likely via Dio3 increase in expression in the hypothalamus. The results are unambiguous

      We thank the Reviewer for the supportive and affirmative feedback.

      Figure 3 provides a functional test of Dio3's role in the circannual timer. Mediobasal hypothalamic injections of CRISPR-Cas9 lentiviral vectors expressing two guide RNAs targeting the hamster Dio3 led to a significant reduction in the interval between induction and recovery phases seen in SP as measured by body mass, and diminished the extent of pelage color change by weeks 15-20. In addition, hamsters that failed to respond to SP exposure by decreasing their body mass also had undetectable Dio3 expression in the mediobasal hypothalamus. Together, these data provide strong evidence that Dio3 functions in the circannual timer. I noted, however, a few problems in the way the CRISPR modification of Dio3 in the mediobasal hypothalamus was reported in Figure S8. One is in Figure S8b, where the PAM sites are reported to be 9bp and 11bp downstream of sgRNA1 and sgRNA2, respectively. Is this really the case? If so, I would have expected the experiment to fail to show any effect as PAM sites need to immediately follow the target genomic sequence recognized by the sgRNA for Cas9 to induce a DNA double-stranded break. It seems that each guide contains a 3' NGG sequence that is currently underlined as part of sgRNAs in both Fig S8b and in the method section. If this is not a mistake in reporting the experimental design, I believe that the design is less than optimal and the efficiencies of sgRNAs are rather low, if at all functional.

      We apologize for the oversight; the reporting in Figure S8b was indeed a mistake. The PAM site previously indicated was the ‘secondary PAM site’ (which, as the Reviewer notes, would likely have low efficiency). The primary PAM site is described within the gRNA in the figure. We use Adobe Illustrator to generate figures, and during the editing process, the layer for the PAM text was accidentally moved ‘back’ to a lower level. The oversight was not rectified before submission. We apologise for this unreservedly. The PAM site text has been moved forward to highlight the location of the primary site (i.e., immediately following the gRNA), and we have labelled the gRNA and PAM site in the ‘Target region’. The secondary PAM site text was removed to eliminate any confusion.
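
      The adjacency requirement at issue can be made concrete with a short check. The following is a minimal sketch, assuming SpCas9 with an NGG PAM and a 20-nt protospacer and scanning the forward strand only; the sequences are hypothetical placeholders and are not the Dio3 target or the guides used in the study.

```python
def find_spcas9_sites(seq: str, protospacer_len: int = 20):
    """Return (start, protospacer, pam) for every forward-strand site where
    an NGG PAM sits immediately 3' of a protospacer of the given length."""
    seq = seq.upper()
    sites = []
    last_start = len(seq) - protospacer_len - 3
    for start in range(last_start + 1):
        pam_start = start + protospacer_len
        pam = seq[pam_start:pam_start + 3]
        if pam[1:] == "GG":  # N can be any base; the next two must be G
            sites.append((start, seq[start:pam_start], pam))
    return sites


# Hypothetical 20-nt guide used only to illustrate the two layouts.
GUIDE = "ACGTACGTACGTACGTACGT"
adjacent = GUIDE + "TGG" + "AAAA"            # PAM immediately follows the guide
displaced = GUIDE + "ATATATATA" + "TGG"      # PAM 9 bp downstream of the guide

adjacent_sites = find_spcas9_sites(adjacent)
displaced_sites = find_spcas9_sites(displaced)
```

      In the displaced layout, the only site reported starts 9 nt into the guide, so the intended 20-nt guide is not paired with a usable PAM. This is the situation the Reviewer inferred from the original figure labelling.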

      The authors report efficiencies around 60% (line 325), but how these were obtained is not specified. 

      The efficiencies provided were based on bioinformatic analyses and not in vivo assays. To reduce any confusion, we have removed the text. The gRNAs were clearly effective at inducing mutations, based on the sequencing analyses.

      Another unclear point is the degree to which the mediobasal hypothalamus was actually mutated. Only one mutated (truncated) sequence in Figure S8c is reported, but I would have expected a range of mutations in different cells of the tissue of interest.

      The tissue punch would include multiple different cells (e.g., neuronal, glial, etc). We agree with the Reviewer that genomic samples from different cells would be included in the sequencing analyses. Given the large mutation in the target region, the gRNA was effective. We have only shown one representative sequence. If the Reviewer would like to see all mutations, we can easily show the other samples.

      Although the authors clearly find a phenotypic effect with their CRISPR manipulation, I suspect that they may have uncovered greater effects with better sgRNA design. These points need some clarification. I would also argue that repeating this experiment with properly designed sgRNAs would provide much stronger support for causally linking Dio3 in circannual timing.

      The gRNA was designed using the gold-standard tool, CHOPCHOP [Labun et al., 2019]. If the Reviewer’s concern regarding the design stems from the comment above on the PAM site, this issue has been clarified and there are no concerns about the gRNA design. The major challenge is that Dio3 is a single-exon gene with a very short sequence (approx. 412 bp). There is limited scope within this sequence length to design gRNAs.

      A proposed schematic model for mechanisms of circannual interval timing is presented in Figure S9. I think this represents a nice summary of the findings put in a broader context and should be presented as a main figure in the manuscript itself rather than being relayed in supplementary materials.

      We agree with the Reviewer’s position and have moved the figure to the main manuscript. The figure is now Figure 4.

      Reviewer #2 (Public review):

      Several animals and plants adjust their physiology and behavior to seasons. These changes are timed to precede the seasonal transitions, maximizing chances of survival and reproduction. The molecular mechanisms used for this process are still unclear. Studies in mammals and birds have shown that the expression of deiodinase type-1, 2, and 3 (Dio1, 2, 3) in the hypothalamus spikes right before the transition to winter phenotypes. Yet, whether this change is required or an unrelated product of the seasonal changes has not been shown, particularly because of the genetic intractability of the animal models used to study seasonality. Here, the authors show for the first time a direct link between Dio3 expression and the modulation of circannual rhythms.

      We appreciate the clear synthesis and support for the manuscript.

      Strengths:

      The work is concise and presents the data in a clear manner. The data is, for the most part, solid and supports the author's main claims. The use of CRISPR is a clear advancement in the field. This is, to my knowledge, the first study showing a clear (i.e., causal) role of Dio3 in the circannual rhythms in mammals. Having established a clear component of the circannual timing and a clean approach to address causality, this study could serve as a blueprint to decipher other components of the timing mechanism. It could also help to enlighten the elusive nature of the upstream regulators, in particular, on how the integration of day length takes place, maybe within the components in the Pars tuberalis, and the regulation of tanycytes.

      We thank the Reviewer for this positive summary.

      Weaknesses:

      Due to the nature of the CRISPR manipulation, the low N number is a clear weakness. This is compensated by the fact that the phenotypes shown here are strong enough. Also, this is the only causal evidence of Dio3's role; thus, additional evidence would have significantly strengthened the author's claims. The use of the non-responsive population of hamsters also helps, but it falls within the realm of correlations.

      We would also like to remind the Reviewer that one CRISPR-Cas9 Dio3<sup>cc</sup> treated hamster did not show any mutation in the genome. This hamster was observed to have a change in body mass and pelage colour like controls. This animal provides another positive control.

      We also conducted a statistical power analysis to examine whether n=3 is sufficient for the Dio3<sup>cc</sup> treatment group. Using the appropriate expected difference in means and standard deviations for an alpha of 0.05, we regularly observed power (1 − β) > 0.8 across the dependent variables.
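
      A power estimate of this kind can be sketched by simulation. The following is an illustrative Monte Carlo sketch, not the authors' analysis: it assumes two groups of n = 3 with equal variance, normally distributed data, and a pooled two-sample t-test. The effect sizes passed in are placeholders, because the expected differences and standard deviations are not reported in the response.

```python
import random

# Two-sided critical t value for alpha = 0.05 with df = 4 (two groups of n = 3).
# It must be replaced if the group sizes change.
T_CRIT_DF4 = 2.776


def _mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def simulated_power(mean_diff, sd, n_per_group=3, t_crit=T_CRIT_DF4,
                    n_sim=5000, seed=0):
    """Fraction of simulated experiments in which a pooled two-sample
    t-test rejects the null hypothesis (i.e., the estimated power)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(mean_diff, sd) for _ in range(n_per_group)]
        mean_c, var_c = _mean_and_variance(control)
        mean_t, var_t = _mean_and_variance(treated)
        pooled_var = (var_c + var_t) / 2.0  # equal group sizes
        std_err = (2.0 * pooled_var / n_per_group) ** 0.5
        if abs(mean_t - mean_c) / std_err > t_crit:
            rejections += 1
    return rejections / n_sim


# Placeholder effect sizes in units of the standard deviation (hypothetical).
for effect in (1.0, 3.0, 5.0):
    print(f"d = {effect}: estimated power = {simulated_power(effect, 1.0):.2f}")
```

      With groups this small, power exceeds 0.8 only when the difference in means is several standard deviations, which is consistent with the authors' point that the observed phenotypes were strong.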

      Additionally, the consequences of the mutations generated by CRISPR are not detailed; it is not clear if the mutations affect the expression of Dio3 or generate a truncation or deletion, resulting in a shorter protein.

      We agree with the Reviewer that transcript and protein assays would strengthen the genome mutation data. Due to the small brain region under investigation, we are limited in the amount of biological material we can extract. Dio3 is an intronless gene and very short, approximately 412 base pairs in length. We opted to devote our resources to sequencing the gene, as confirmation of the genetic mutation is paramount. Given the large size of the mutation in the treated hamsters, we expect that no transcript would be amplified and no protein translated.

      Reviewer #3 (Public review):

      The authors investigated SP-induced physiological and molecular changes in Djungarian hamsters and the endogenous recovery from it after circa half a year. The study aimed to elucidate the intrinsic mechanism and included nice experiments to distinguish between rheostatic effects on energy state and homeostatic cues driven by an interval timer. It also aimed to elucidate the role of Dio3 by introducing a targeted mutation in the MBH by ICV. The experiments and analyses are sound, and the amount of work is impressive. The impact of this study on the field of seasonal chronobiology is probably high.

      We thank the Reviewer for their positive comments and support for our work.

      Even though the general conclusions are well-founded, I have fundamental criticism concerning 3 points, which I recommend revising:

      (1) The authors talk about a circannual interval timer, but this is no circannual timer. This is a circasemiannual timer. It is important that the authors use precise wording throughout the manuscript.

      We agree with the Reviewer that the change in physiology and behaviour approximates only half a year rather than a full (i.e., annual) year. We opted to use circannual timer as this term is established in the field (see doi: 10.1177/0748730404266626; doi: 10.1098/rstb.2007.2143). We cannot identify any publication that has used the term ‘semiannual timer’. We do not feel this manuscript is the appropriate place to introduce a new term to the field; we will endeavour to push the field to consider the use of ‘semiannual timer’. A Review or Opinion paper is the best place for this discussion. We hope the Reviewer will understand our position.

      (2) The authors put their results in the context of clocks. For example, line 180/181 seasonal clock. But they have described and investigated an interval timer. A clock must be able to complete a full cycle endogenously (and ideally repeatedly) and not only half of it. In contrast, a timer steers a duration. Thus, it is well possible that a circannual clock mechanism and this circa-semiannual timer of photoperiodic species are 2 completely different mechanisms. The argumentation should be changed accordingly.

      We agree with the Reviewer’s definitions of circannual ‘clock’ and ‘timer’. We were careful to distinguish between the two concepts early in the manuscript (lines 41-46). We have added italics to emphasise the different terms. The use of seasonal clock on line 180/181 was imprecise; we appreciate the Reviewer highlighting our oversight, and the text has been revised. We have also revised the Abstract accordingly.

      (3) The authors chose as animal model the Djungarian hamster, which is a predominantly photoperiodic species and not a circannual species. A photoperiodic species has no circannual clock. That is another reason why it is difficult to draw conclusions from the experiment for circannual clocks. However, the Djungarian hamster is kind of "indifferent" concerning its seasonal timing, since a small fraction of them are indeed able to cycle (Anchordoquy HC, Lynch GR (2000), Evidence of an annual rhythm in a small proportion of Siberian hamsters exposed to chronic short days. J Biol Rhythms 15:122-125.). Nevertheless, the proportion is too small to suggest that the findings in the current study might reflect part of the circannual timing. Therefore, the authors should make a clear distinction between timers and clocks, as well as between circa-annual and circa-semiannual durations/periods.

      This comment is not clear to us. The Reviewer states that the hamsters are not a circannual species, but then highlights one study that shows circannual rhythmicity. We agree that circannual rhythmicity in Djungarian hamsters is dependent on the physiological process under investigation (e.g. body mass versus reproduction) and that the photoperiodic response system either dampens or masks robust cycles. We have corrected the text oversight highlighted above, and the manuscript is focused on interval timers. We have kept the term circannual over semicircannual due to its prior use in the scientific literature.

      Reviewing Editor Comments:

      The detailed suggestions of the reviewers are outlined below (or above in case of reviewer 1). In light of the criticism, we ask the authors to especially pay attention to the comments on the CRISPR-Cas9 experiment, raised by Reviewers 1 and 2. As currently described, there are serious questions on the design of the sgRNAs, and also missing critical methodological details. If the latter are diligently taken care of, they may resolve the questions on the sgRNA design. Please also reconsider the wording along the suggestions of Reviewer 3.

      We appreciate the Editor's time and support for the manuscript. We have clarified and corrected our oversight for the PAM site. This correction confirms the strength of the CRISPR-Cas9 gRNA used in the study and should remove all concerns. We have also considered using semicircannual in the text. As there is existing scientific literature using circannual interval timer, and there is no publication to our knowledge using ‘semicircannual’, we have opted to keep the current approach and use circannual. We feel a subsequent Opinion paper is more suitable to introduce a new term.

      Reviewer #2 (Recommendations for the authors):

      First, I want to commend the authors for their work. It is a clear advancement for our field. Below are a couple of comments and suggestions I have:

      We thank the Reviewer for the positive comment and support. We have endeavoured to incorporate their suggested improvements to the manuscript.

      (1) Looking at the results of Figure 1A and Figure S8, the control in S8 showed a lower pelage color score as compared to the hamsters in 1A. Is this a byproduct of the ICV injection?

      The difference between Figure 1 and 3 is likely due to the smaller sample sizes. The controls in Figure 1 had a higher proportion of hamsters showing complete white fur (score = 3) at 16–18 weeks compared to controls in Figure 3. It is possible, although unlikely, that the ICV injection would reduce the development of the winter phenotype. There was no substance in the ICV injection that would impact the prolactin signalling pathway. Our perspective is that the difference between the two figures is due to the different sampling populations. Overall, the timing of the change in pelage colour is the same between the figures, which suggests that the mechanisms of the interval timer were unaffected.

      (2) Is there a particular reason why the pelage color for the CRISPR mutants is relegated to the supplemental information? In my opinion, this is also important, even though the results might be difficult to explain. Additionally, did the authors check for food intake and adipose mass in these animals?

      We agree with the Reviewer that the pelage change is very interesting. We decided to have Figure 3 focus on body mass. The rationale was the robust nature of the data collection from the CRISPR-Cas9 study (Fig.3b), in addition to the non-responsive hamsters (Fig.3e). We disagree that the data patterns are hard to explain, as pelage changes were similar to the photoperiod-induced change in body mass. No differences were observed for food intake or adipose tissue. We have added this information in the text (see lines 162-163).

      (3) I might have missed it, but did the authors check for the expression of Dio3 on the CRISPR mutants? Does the deletion cause reduced expression or any other mRNA effect, such as those resulting in the truncation of a protein?

      Due to the limited biological material extracted from the anatomical punches, we decided to focus on genomic mutations. Dio3 has a very short sequence length and the size of the mutations identified indicate that no RNA could be transcribed.

      (4) Could the authors clarify which reference genome or partial CDS (i.e., accession numbers) they used to align the gRNA? Did they use the SSSS strain or the Psun_Stras_1 isolate?

      The gRNAs were designed using the online tool CHOPCHOP, using the Mus musculus Dio3 gene. The generated gRNAs were subsequently aligned via BLAST with the Phodopus sungorus Dio3 partial cds (GenBank: MF662622.1), to ensure alignment with the species. We are confident that the designed gRNAs align 100% in hamsters. Furthermore, we conducted BLAST to ensure there were no off-targets. The only gene identified in the BLAST was the rodent (i.e. hamster, mouse) Dio3 sequence.

      (5) Figure 3b. I do agree with the authors in pointing out that the decrease in body mass is occurring earlier in Dio3wt hamsters; however, the shape of the body mass dynamic is also different. Do the authors have any comments on the possible role of Dio3 in the process of exit from overwintering?

      This is a very interesting question. We do not have the data to evaluate the role of Dio3 for overwintering. We argue that disruption in Dio3 reduced the circannual interval period. For this interpretation, yes, Dio3 is necessary for overwintering. However, we would need to show the sufficiency of Dio3 to induce the winter phenotype in hamsters housed in long photoperiod. At this time, we do not have the technical ability to conduct this experiment.

      (6) In Figure 3d, the Dio3wt group does not show any dispersion. Is this correct? If that's true, and no dispersion is observed, no normality can be assumed, and a t-test can't be performed (Line 692). The Mann-Whitney test might be better suited.

      We conducted a Welch’s t-test to compare the difference in body mass period. We used the Welch’s test as the variances were not equal; the Mann-Whitney test is best for skewed distributions. To clarify the test used, we have added ‘Welch’s test’ to the Figure legend.
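
      As a minimal sketch of the statistic discussed here, the following computes Welch's t and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The function name `welch_t` and the period values are hypothetical illustrations, not the study's data; the first group is given zero dispersion to mirror the situation raised in the comment.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean (sample variance / n).
    va, vb = variance(a) / na, variance(b) / nb
    if va + vb == 0:
        raise ValueError("both groups have zero variance; t is undefined")
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical periods in weeks; the first group has no dispersion at all.
wild_type = [32.0, 32.0, 32.0, 32.0]
mutant = [24.0, 26.0, 28.0, 26.0]
t_stat, dof = welch_t(wild_type, mutant)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

      In this sketch the statistic remains computable when one group has zero variance, and the degrees of freedom then reduce to those of the other group. Obtaining a p-value requires the t distribution; with SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs Welch's test and `scipy.stats.mannwhitneyu(a, b)` provides the rank-based alternative.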

      (9) Figure 1 h. It might be convenient to add the words "Induction", "maintenance", and "recovery" over each respective line on the polar graph for easier reading.

      We have added the text as suggested by the Reviewer.

      Reviewer #3 (Recommendations for the authors):

      (1) Figure 1: Please enlarge all partial graphics at least to the size of Figure 2. In the print version, labels are barely readable

      We have enlarged the panels in Figures 1 and 3 by 20% to accommodate the Reviewer's suggestion.

      (2) Legend Figure 2: Add that the food restriction was 16h.

      We have added 16h to the text.

      (3) Figure 3b: enlarge font size. In the legend: Dio3cc hamsters delayed.... The delay might have been a week or so, but not more (and even that is unclear since the rise in body mass in that week seems to be rather a disturbance of the curve). Thus 'delay' might not be the most appropriate wording. Instead, the initial decline is slower, but both started at nearly the same week (=> no delay). Minimum body mass is reached at the identical week as in wt (=> no delay). Also, the increase started at the same week but was much faster in Dio3cc than in wt. Figure 3c: How can there be a period when there is no repeated cycle (rhythm)? This is rather a duration. Moreover, according to the displayed data, I am wondering which start point and which endpoint is used. The first and last values are the highest of the graph, but have they been the maximum? Especially for Dio3wt, it can be assumed that animals haven't reached the maximum at the end of the graph.

      We have increased the font size in Figure 3b. We have changed ‘delayed’ to ‘slower’ in the text. Period analyses, such as the Lomb-Scargle periodogram, measure the duration of a cycle (and of multiple cycles). The start point and end point used in the analyses were the initial data collection date (week 0) and the final data collection date (week 32). The Lomb-Scargle analysis determines the duration of the period that occurs within these phases of the cycle. We believe the Lomb-Scargle period analysis is the most suitable for the scientific question.
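
      To illustrate how such a period estimate is obtained, here is a minimal sketch of the classical Lomb-Scargle periodogram written with the Python standard library only. The weekly sampling from week 0 to week 32 follows the description above; the body mass values, the 16.5-week period, and the grid of trial periods are synthetic assumptions chosen for illustration and are not the study's measurements.

```python
import math

def lomb_scargle_power(t, y, period):
    """Classical Lomb-Scargle power of the series y(t) at one trial period."""
    w = 2 * math.pi / period
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]  # mean-centred signal
    # The time offset tau makes the sine and cosine terms orthogonal.
    tau = math.atan2(sum(math.sin(2 * w * ti) for ti in t),
                     sum(math.cos(2 * w * ti) for ti in t)) / (2 * w)
    c = [math.cos(w * (ti - tau)) for ti in t]
    s = [math.sin(w * (ti - tau)) for ti in t]
    yc_dot_c = sum(a * b for a, b in zip(yc, c))
    yc_dot_s = sum(a * b for a, b in zip(yc, s))
    return 0.5 * (yc_dot_c ** 2 / sum(v * v for v in c)
                  + yc_dot_s ** 2 / sum(v * v for v in s))

# Synthetic, noise-free weekly body mass (g) from week 0 to week 32.
weeks = list(range(33))
true_period = 16.5  # assumed period in weeks, for illustration only
body_mass = [40 + 5 * math.cos(2 * math.pi * wk / true_period) for wk in weeks]

# Scan trial periods from 10 to 40 weeks in 0.5-week steps.
trial_periods = [10 + 0.5 * i for i in range(61)]
best_period = max(trial_periods,
                  key=lambda p: lomb_scargle_power(weeks, body_mass, p))
print(best_period)
```

      The estimate is simply the trial period with the highest power, so it depends on the sampling window and on the grid of trial periods that is scanned. With SciPy installed, `scipy.signal.lombscargle` offers an equivalent computation on angular frequencies.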

      (4) Figure S9: This is a very nice graph and summarises your main results. It should appear in the main manuscript and not in the supplements.

      We appreciate the positive comment and suggestion. We agree with the Reviewer and have moved the graph to the main figures. The revised manuscript presents the graph as Figure 4.

    1. Author response:

      Reviewer #1

      We agree that further clarification of how elevated exercise disrupts blastema formation would strengthen the manuscript. Our data suggest a major contribution of proliferation. Exercise reduced the fraction of proliferative cells at 3 dpa, consistent with disrupted HA production and downstream Yap signaling. This interpretation aligns with prior studies showing that proliferation contributes to blastema establishment and is not restricted to the outgrowth phase of fin regeneration (Poleo et al, 2001; Poss et al, 2002; Wang et al, 2019; Pfefferli et al, 2014; Hou et al, 2020). We will explore additional experiments to reinforce these insights into the cellular mechanisms underlying exercise-disrupted blastema formation.

      We acknowledge that our analysis of ray branching abnormalities is limited in the current manuscript. We focus our study on introducing the zebrafish swimming and regeneration model and then characterizing ECM and signaling changes accounting for disrupted blastema establishment. For completeness, we included the observation of skeletal patterning defects (branching delays and bone fusions) but without detailed analysis. We note that decreased expression of shha and Shh-pathway components following early exercise corresponds with the branching defects. However, we recognize exercise could have additional effects during the outgrowth phase when branching morphogenesis actively occurs. Therefore, we will expand our discussion to outline future research directions related to exercise impacts on regenerative skeletal patterning.

      We will expand the Introduction and/or Discussion sections to provide more context on known HA roles across regeneration contexts, including in zebrafish fins. Finally, we will improve the text’s clarity and specificity throughout the manuscript, including to resolve or explain any apparent contradictions.

      Reviewer #2

      We appreciate the Reviewer's concern regarding the specificity of forced exercise as a model for mechanical loading. Forced exercise has been widely used in vivo to induce mechanical loading without the requirement for specialized implants or animal restraint, including in mouse (Wallace et al, 2015; Bomer et al, 2016), rat (Honda et al, 2003; Boerckel et al, 2011; Boerckel et al, 2012), and, most relevant to our study, zebrafish models (Fiaz et al, 2012; Fiaz et al, 2014; Suniaga et al, 2018). However, we will expand our discussion of this approach and ensure precise language distinguishing exercise from mechanical loading.

      We acknowledge the possibility that early shear stress disrupts the wound epidermis, which we will elaborate on in a revised Discussion. However, exercise-induced disruptions to the fin epidermis of early regenerates (1–2 dpa; Figure 2) typically resolve within one day, whereas fibroblast lineage cells still fail to establish a robust blastema. Therefore, sustained effects of mechanical loading and/or mechanosensation are likely major contributors to the observed regeneration phenotypes.

      We will explore whether HA acts as a general enhancer of fin regeneration by comparing blastemal HA supplementation vs. controls in non-exercised regenerating animals, if technically feasible. We will merge Figure S7 (HA supplementation) with Figure 5 (HA depletion) for clarity, as suggested.

      We will include a schematic and clear definitions for 'peripheral' and 'central' rays in a revised manuscript.

      Reviewer #3

      We included Hoechst and eosin fluorescent staining in the manuscript to show changes in tissue architecture following swimming exercise (Supplemental Figure 4). We will extend this histological analysis to include hematoxylin and eosin staining to provide additional tissue visualization.

      References

      Poleo G, Brown CW, Laforest L, Akimenko MA. Cell proliferation and movement during early fin regeneration in zebrafish. Dev Dyn. 2001 Aug;221(4):380-90.

      Poss KD, Nechiporuk A, Hillam AM, Johnson SL, Keating MT. Mps1 defines a proximal blastemal proliferative compartment essential for zebrafish fin regeneration. Development. 2002 Nov;129(22):5141-9.

      Wang YT, Tseng TL, Kuo YC, Yu JK, Su YH, Poss KD, Chen CH. Genetic Reprogramming of Positional Memory in a Regenerating Appendage. Curr Biol. 2019 Dec 16;29(24):4193-4207.e4.

      Pfefferli C, Müller F, Jaźwińska A, Wicky C. Specific NuRD components are required for fin regeneration in zebrafish. BMC Biol. 2014 Apr 29;12:30.

      Hou Y, Lee HJ, Chen Y, Ge J, Osman FOI, McAdow AR, Mokalled MH, Johnson SL, Zhao G, Wang T. Cellular diversity of the regenerating caudal fin. Sci Adv. 2020 Aug 12;6(33):eaba2084.

      Wallace IJ, Judex S, Demes B. Effects of load-bearing exercise on skeletal structure and mechanics differ between outbred populations of mice. Bone. 2015 Mar;72:1-8.

      Bomer N, Cornelis FM, Ramos YF, den Hollander W, Storms L, van der Breggen R, Lakenberg N, Slagboom PE, Meulenbelt I, Lories RJ. The effect of forced exercise on knee joints in Dio2(-/-) mice: type II iodothyronine deiodinase-deficient mice are less prone to develop OA-like cartilage damage upon excessive mechanical stress. Ann Rheum Dis. 2016 Mar;75(3):571-7.

      Honda A, Sogo N, Nagasawa S, Shimizu T, Umemura Y. High-impact exercise strengthens bone in osteopenic ovariectomized rats with the same outcome as Sham rats. J Appl Physiol (1985). 2003 Sep;95(3):1032-7.

      Boerckel JD, Kolambkar YM, Stevens HY, Lin AS, Dupont KM, Guldberg RE. Effects of in vivo mechanical loading on large bone defect regeneration. J Orthop Res. 2012 Jul;30(7):1067-75.

      Boerckel JD, Uhrig BA, Willett NJ, Huebsch N, Guldberg RE. Mechanical regulation of vascular growth and tissue regeneration in vivo. Proc Natl Acad Sci U S A. 2011 Sep 13;108(37):E674-80.

      Fiaz AW, Léon-Kloosterziel KM, Gort G, Schulte-Merker S, van Leeuwen JL, Kranenbarg S. Swim-training changes the spatio-temporal dynamics of skeletogenesis in zebrafish larvae (Danio rerio). PLoS One. 2012;7(4):e34072.

      Fiaz AW, Léon‐Kloosterziel KM, van Leeuwen JL, Kranenbarg S. Exploring the molecular link between swim‐training and caudal fin development in zebrafish (Danio rerio) larvae. Journal of Applied Ichthyology. 2014 Aug;30(4):753-61.

      Suniaga S, Rolvien T, Vom Scheidt A, Fiedler IAK, Bale HA, Huysseune A, Witten PE, Amling M, Busse B. Increased mechanical loading through controlled swimming exercise induces bone formation and mineralization in adult zebrafish. Sci Rep. 2018 Feb 26;8(1):3646.

    1. Author response:

      Reviewer #1 (Public review):

      In this manuscript, Qin and colleagues aim to delineate a neural mechanism by which the internal satiety levels modulate the intake of sugar solution. They identified a three-step neuropeptidergic system that downregulates the sensitivity of sweet-sensing gustatory sensory neurons in sated flies. First, neurons that release a neuropeptide Hugin (which is an insect homolog of vertebrate Neuromedin U (NMU)) are in an active state when the concentration of glucose is high. This activation does not require synaptic inputs, suggesting that Hugin-releasing neurons sense hemolymph glucose levels directly. Next, the Hugin neuropeptides activate Allatostatin A (AstA)-releasing neurons via one of Hugin's receptors, PK2-R1. Finally, the released AstA neuropeptide suppresses sugar response in sugar-sensing Gr5a-expressing gustatory sensory neurons through AstA-R1 receptor. Suppression of sugar response in Gr5a-expressing neurons reduces the fly's sugar intake motivation (measured by proboscis extension reflex). They also found that NMU-expressing neurons in the ventromedial hypothalamus (VMH) of mice (which project to the rostral nucleus of the solitary tract (rNST)) are also activated by high concentrations of glucose, independent of synaptic transmission, and that injection of NMU reduces the glucose-induced activity in the downstream of NMU-expressing neurons in rNST. These data suggest that the function of Hugin neuropeptide in the fly is analogous to the function of NMU in the mouse.

      Generally, their central conclusions are well-supported by multiple independent approaches. The parallel study in mice adds a unique comparative perspective that makes the paper interesting to a wide range of readers. It is easier said than done: the rigor of this study, which effectively combined pharmacological and genetic approaches to provide multiple lines of behavioral and physiological evidence, deserves recognition and praise.

      A perceived weakness is that the behavioral effects of the manipulations of Hugin and AstA systems are modest compared to a dramatic shift of sugar solution-induced PER (the behavioral proxy of sugar sensitivity) induced by hunger, as presented in Figure 1B and E. It is true that the mutation of tyrosine hydroxylase (TH), which synthesizes dopamine, does not completely abolish the hunger-induced PER change, but the remaining effect is small. Moreover, the behavioral effect of the silencing of the Hugin/AstA system (Figure Supplement 13B, C) is difficult to interpret, leaving a possibility that this system may not be necessary for shifting PER in starved flies. These suggest that the Hugin-AstA system accounts for only a minor part of the behavioral adaptation induced by the decreased sugar levels. Their aim to "dissect out a complete neural pathway that directly senses internal energy state and modulates food-related behavioral output in the fly brain" is likely only partially achieved. While this outcome is not a shortcoming of a study per se, the depth of discussion on the mechanism of interactions between the Hugin/AstA system and the other previously characterized molecular circuit mechanisms mediating hunger-induced behavioral modulation is insufficient for readers to appreciate the novelty of this study and future challenges in the field.

      We thank the reviewer for the thoughtful comment. We agree that the behavioral effects of manipulating the Hugin–AstA system alone were considerably weaker than the pronounced PER shifts induced by starvation. We will revise our Discussion to address it by positioning our findings within the broader context of energy regulation.

      More specifically, we will discuss that feeding behavior is controlled by two distinct, yet synergistic, types of mechanisms:

      (1) Hunger-driven 'accelerators': as the reviewer notes, pathways involving dopamine and NPF are powerful drivers of sweet sensitivity. These systems are strongly activated by hunger to promote food-seeking and consumption.

      (2) Satiety-driven 'brakes': our study identifies the counterpart to the systems above, namely a satiety-driven 'brake'. The Hugin–AstA pathway acts as a direct sensor of high internal energy (glucose), which is specifically engaged during satiety to actively suppress sweet sensation and prevent overconsumption.

      This framework explains the seeming discrepancy in effect size. The dramatic PER shift seen upon starvation is a combined result of engaging the 'accelerators' (hunger pathways like TH/NPF) while simultaneously releasing the 'brake' (our Hugin–AstA pathway being inactive).

      Our manipulations, which specifically target only the 'brake' system, are therefore expected to have a more modest effect than this combined physiological state. Thus, rather than being a "minor part," the Hugin–AstA pathway is a mechanistically defined, satiety-specific circuit that is essential for the precise "braking" required for energy homeostasis. We will update our Discussion to emphasize how these 'accelerator' and 'brake' circuits must work in concert to ensure precise energy regulation.

      In this context, authors are encouraged to confront a limitation of the study due to the lack of subtype-level circuit characterization, despite their intriguing finding that only a subtype of Hugin- and AstA-releasing neurons are responsive to the elevated level of bath-applied glucose.

      We thank the reviewer for highlighting the critical issue of subtype-level specialization within the Hugin and AstA populations.

      We fully agree that the Hugin system is known for its functional heterogeneity (pleiotropy), with different Hugin neuron subclusters implicated in regulating a variety of behaviors, including feeding, aversion, and locomotion (we will cite relevant literature here). Our finding that only a specific subcluster of Hugin neurons is responsive to glucose elevation provides a crucial first step in functionally dissecting this complexity. 

      We will add a dedicated paragraph to elaborate on this functional partitioning. We propose that this subtype-level specialization allows the Hugin system to precisely link specific physiological states (like high circulating glucose) to appropriate behavioral outputs (like the suppression of sweet taste), demonstrating an elegant solution to coordinating multiple survival behaviors. Future work using high-resolution tools such as split-GAL4 and single-cell sequencing will be invaluable in fully mapping the specific functional roles corresponding to each Hugin and AstA subcluster.

      Reviewer #2 (Public review):

      Summary:

      The question of how caloric and taste information interact and consolidate remains both active and highly relevant to human health and cognition. The authors of this work sought to understand how nutrient sensing of glucose modulates sweet sensation. They found that glucose intake activates hugin signaling to AstA neurons to suppress feeding, which contributes to our mechanistic understanding of nutrient sensation. They did this by leveraging the genetic tools of Drosophila to carry out nuanced experimental manipulations and confirmed the conservation of their main mechanism in a mammalian model. This work builds on previous studies examining sugar taste and caloric sensing, enhancing the resolution of our understanding.

      Strengths:

      Fully discovering neural circuits that connect body state with perception remains central to understanding homeostasis and behavior. This study expands our understanding of sugar sensing, providing mechanistic evidence for a hugin/AstA circuit that is responsive to sugar intake and suppresses feeding. In addition to effectively leveraging the genetic tools of Drosophila, this study further extends their findings into a mammalian model with the discovery that NMU neural signaling is also responsive to sugar intake.

      Weaknesses:

      The effect of Glut1 knockdown on PER in hugin neurons is modest, and does not show a clear difference between fed and starved flies as might be expected if this mechanism acts as a sensor of internal energy state. This could suggest that glucose intake through Glut1 may only be part of the mechanism.

      We thank the reviewer for this insightful comment and agree that the modest behavioral effect of Glut1 knockdown is a critical finding that warrants further clarification. This observation supports the idea that internal energy state is monitored by a robust network rather than a single, fragile component. We believe the effect size is modest for two main reasons, which we will further address in the revised Discussion.

      Firstly, the effect size is likely attenuated by technical and molecular redundancy. Specifically, the RNAi-mediated knockdown of Glut1 may be incomplete, leaving residual transporter function. Furthermore, Glut1 is likely only one part of the Hugin neuron's intrinsic sensing mechanism; other components, such as alternative glucose transporters or downstream K<sub>ATP</sub> channel signaling, may provide molecular redundancy, meaning that the full energy-sensing function is not easily abolished by a single manipulation.

      Secondly, and more importantly, the final feeding decision is an integrated output of competing circuits. While hunger-sensing pathways like the dopamine and NPF circuits act as powerful "accelerators" to drive sweet consumption, the Hugin–AstA pathway serves as a satiety-specific "brake". The modest effect of partially inhibiting just one component of this 'brake' system is the hallmark of a precisely regulated, multi-layered homeostatic system. We will further clarify in the Discussion that the Hugin pathway represents one essential inhibitory circuit within this cooperative network that works together with the hunger-promoting systems to ensure precise control over energy intake.

      Reviewer #3 (Public review):

      Summary:

      This study identifies a novel energy-sensing circuit in Drosophila and mice that directly regulates sweet taste perception. In flies, hugin+ neurons function as a glucose sensor, activated through Glut1 transport and ATP-sensitive potassium channels. Once activated, hugin neurons release hugin peptide, which stimulates downstream Allatostatin A (AstA)+ neurons via PK2-R1 receptors. AstA+ neurons then inhibit sweet-sensing Gr5a+ gustatory neurons through AstA peptide and its receptor AstA-R1, reducing sweet sensitivity after feeding. Disrupting this pathway enhances sweet taste and increases food intake, while activating the pathway suppresses feeding.

      The mammalian homolog of neuromedin U (NMU) was shown to play an analogous role in mice. NMU knockout mice displayed heightened sweet preference, while NMU administration suppressed it. In addition, VMH NMU+ neurons directly sense glucose and project to rNST Calb2+ neurons, dampening sweet taste responses. The authors suggested a conserved hugin/NMU-AstA pathway that couples energy state to taste perception.

      Strengths

      Interesting findings that extend from insects to mammals. Very comprehensive.

      Weaknesses:

      Coupling energy status to taste sensitivity is not a new story. Many pathways appear to be involved, and therefore, it raises a question as to how this hugin-AstA pathway is unique.

      The reviewer is correct that several energy-sensing pathways are known. However, we now clarify that these previously established mechanisms, such as the dopaminergic and NPF pathways, primarily function as hunger-driven "accelerators." They are activated by low energy states to promote sweet sensitivity and drive consumption.

      The crucial, missing piece of the puzzle—which our study provides—is the satiety-specific "brake" mechanism. We identify the Hugin–AstA circuit as one of the “brakes”: a dedicated, central sensor that responds directly to high circulating glucose (satiety) to suppress sweet sensation and prevent overconsumption.

      Thus, our work is unique because it defines the essential counterpart to the hunger pathways. In the revised Discussion, we will further explain how these 'accelerator' (hunger) and 'brake' (satiety) systems work in concert to allow for the precise, bidirectional regulation of energy intake. Furthermore, by demonstrating that this Hugin/NMU 'brake' circuit is evolutionarily conserved in mice, our findings reveal a fundamental energy-sensing strategy and suggest that this pathway could represent a promising new therapeutic target for managing conditions of excessive food intake.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      This study extends the previous interesting work of this group to address the potentially differential control of movement and posture. Their earlier work explored a broad range of data to make the case for a downstream neural integrator hypothesized to convert descending velocity movement commands into postural holding commands. Included in that data were observations from people with hemiparesis due to stroke. The current study uses similar data, but pushes into a different, but closely related direction, suggesting that these data may address the independence of these two fundamental components of motor control. I find the logic laid out in the second sentence of the abstract ("The paretic arm after stroke is notable for abnormalities both at rest and during movement, thus it provides an opportunity to address the relationships between control of reaching, stopping, and stabilizing") less than compelling, but the study does make some interesting observations. Foremost among them is the relation between the resting force postural bias and the effect of force perturbations during the target hold periods, but not during movement. While this interesting observation is consistent with the central mechanism the authors suggest, it seems hard to me to rule out other mechanisms, including peripheral ones. These limitations should be discussed.

      Thank you for summarizing our work. Note that we have improved the logic in our abstract (“…providing an opportunity to ask whether control of these behaviors is independently affected in stroke”) based on your comments, as outlined in our previous revision. We now discuss limitations and potential alternative mechanisms in greater detail, in a dedicated section (lines 846-895; see response to reviewer 2 for further details).

      Reviewer #2 (Public review):

      Summary:

      Here the authors address the idea that postural and movement control are differentially impacted with stroke. Specifically, they examined whether resting postural forces influenced several metrics of sensorimotor control (e.g., initial reach angle, maximum lateral hand deviation following a perturbation, etc.) during movement or posture. The authors found that resting postural forces influenced control only following the posture perturbation for the paretic arm of stroke patients, but not during movement. They also found that resting postural forces were greater when the arm was unsupported, which correlated with abnormal synergies (as assessed by the Fugl-Meyer). The authors suggest that these findings can be explained by the idea that the neural circuitry associated with posture is relatively more impacted by stroke than the neural circuitry associated with movement. They also propose a conceptual model that differentially weights the reticulospinal tract (RST) and corticospinal tract (CST) to explain greater relative impairments with posture control relative to movement control, due to abnormal synergies, in those with stroke.

      Thank you for the brief but comprehensive summary. We would like to clarify one point: we do not suggest that our findings are necessarily due to the neural circuitry associated with posture being more impacted than the neural circuitry associated with movement (rather, our conceptual model suggests that increased outflow through the (ipsilateral) RST, involved in posture, compensates for CST damage, at the expense of posture abnormalities spilling over into movement). Instead, we suggest that the neural circuitry for posture vs. movement control remains relatively separate in stroke, with impairments in posture control not substantially explaining impairments in movement control.

      Comments on revisions:

      The authors should be commended for being very responsive to comments and providing several further requested analyses, which have improved the paper. However, there are still some outstanding issues that make it difficult to fully support the provided interpretation.

      Thank you for appreciating our response to your earlier comments. We address the outstanding issues below.

      The authors say within the response, "We would also like to stress that these perturbations were not designed so that responses are directly compared to each other ***(though of course there is an *indirect* comparison in the sense that we show influence of biases in one type of perturbation but not the other)***." They then state in the first paragraph of the discussion that "Remarkably, these resting postural force biases did not seem to have a detectable effect upon any component of active reaching but only emerged during the control of holding still after the movement ended. The results suggest a dissociation between the control of movement and posture." The main issue here is relying on indirect comparisons (i.e., significant in one situation but not the other), instead of relying on direct comparisons. Using a well-known example, just because one group / condition might display a significant linear relationship (i.e., slope_1 > 0) and another group / condition does not (slope_2 = 0), does not necessarily mean that the two groups / conditions are statistically different from one another [see Figure 1 in Makin, T. R., & Orban de Xivry, J. J. (2019). Ten common statistical mistakes to watch out for when writing or reviewing a manuscript. eLife, 8, e48175.].
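
      The distinction drawn in this comment can be made concrete with a minimal sketch that compares two regression slopes directly instead of inspecting each slope's significance separately. The helper names (`slope_and_se`, `slope_difference_z`) and the data are hypothetical, and the large-sample z statistic for two independent samples is only one of several ways to run a direct comparison (an interaction term in a single model is another).

```python
import math

def slope_and_se(x, y):
    """Ordinary least-squares slope and its standard error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, math.sqrt(sse / (n - 2) / sxx)

def slope_difference_z(x1, y1, x2, y2):
    """z statistic for the difference between two independent slopes."""
    b1, se1 = slope_and_se(x1, y1)
    b2, se2 = slope_and_se(x2, y2)
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical data: resting bias (x) against response (y) in two conditions.
bias = [0, 1, 2, 3, 4, 5]
holding = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]  # tight relation
moving = [1.5, -0.5, 2.5, 0.5, 3.5, 1.0]  # noisy relation
print(slope_and_se(bias, holding))
print(slope_and_se(bias, moving))
print(slope_difference_z(bias, holding, bias, moving))
```

      The quantity of interest is the last one: it tests the difference between the slopes, whereas the first two lines only describe each condition on its own.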

      We agree and are well aware of the limitation posed by an indirect comparison – hence the language we used to comment on the data (“did not seem”, “suggest”, etc.). To address this limitation, we performed a more direct comparison of how the two types of perturbations (moving vs. holding) interact with resting biases. For this comparison, we calculated a Response Asymmetry Index (RAI):

      Above, r_A is the response in the direction where the resting bias is most aligned with the perturbation, and r_O is the response in the direction where the resting bias is most opposed to the perturbation.

      We calculated RAIs for two response metrics used for both moving and holding perturbations: maximum deviation and time to stabilization/settling time. For these two response metrics, positive RAIs indicate an asymmetry in line with an effect of resting bias.

      The idea behind the RAI is that, while the magnitude of responses may well differ between the two types of perturbations, this will be accounted for by the ratio used to calculate the asymmetry. The same approach has been used to assess symmetry/laterality across a variety of different modalities, such as gait asymmetry (Robinson et al., 1987), the relative fMRI activity in the contralateral vs. ipsilateral sensorimotor cortex while performing a motor task (Cramer et al., 1997), or the relative strength of ipsilateral vs. contralateral responses to transcranial magnetic stimulation (McPherson et al., 2018). Notably, the normalization also addresses potential differences in overall stiffness between holding vs. moving perturbations, which would similarly affect aligned and opposing cases (see our response to your following point).
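      The normalization argument above can be illustrated with a short sketch. The exact RAI formula is not reproduced in this copy of the letter, so the form used here, (r_A - r_O) / (r_A + r_O), is an assumption modeled on common laterality indices; the function name and the example values are hypothetical and are not taken from the study's data or code.

```python
import numpy as np

def response_asymmetry_index(r_aligned, r_opposed):
    """Normalized asymmetry between two responses.

    ASSUMED form (not confirmed by the letter): (r_A - r_O) / (r_A + r_O).
    r_aligned: response where the resting bias is most aligned with the perturbation.
    r_opposed: response where the resting bias is most opposed to the perturbation.
    """
    r_aligned = np.asarray(r_aligned, dtype=float)
    r_opposed = np.asarray(r_opposed, dtype=float)
    return (r_aligned - r_opposed) / (r_aligned + r_opposed)

# Hypothetical values: a tenfold change in overall response magnitude
# leaves the index unchanged, which is the point of the ratio.
small = response_asymmetry_index(3.0, 1.0)
large = response_asymmetry_index(30.0, 10.0)
print(small, large)
```

      Under this assumed form, a positive value means a larger response in the aligned direction, and a common scaling of both responses (for example, from a difference in overall stiffness) cancels out.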

      Figure 8 shows the RAIs we obtained for holding (red) vs. moving/pulse (blue) perturbations. For the maximum deviation (left), there is more asymmetry for the holding case, though the p-value is marginal (p=0.088), likely due to the large variability in the pulse case (individual values shown as black dots). For time to stabilization/settling time (right), the difference is significant (p=0.0048). Together, these analyses indicate that resting biases interact substantially more with holding than with movement control, in line with a relative independence between these two control modalities. We now include this panel as Figure 8, and describe it in Results (lines 587-611).

      Note that even a direct comparison does not prove that resting biases and active movement control are perfectly independent. We now discuss these issues in more depth, in the new Limitations section suggested by the Reviewer (lines 836-849).

      The authors have provided a reasonable rationale for why they chose different perturbation waveforms. Yet it still holds that these different waveforms would likely yield very different muscular responses, making it difficult to interpret the results, and this remains a limitation. From the paper it is unknown how these different perturbations would differentially influence a variety of classic neuromuscular responses, including short-range stiffness and stretch reflexes, which would be at play here.

      Much of the results can be interpreted when one considers classic neuromuscular physiology. In Experiment 1, differences in resting postural bias in supported versus unsupported conditions can readily be explained since there is greater muscle activity in the unsupported condition that leads to greater muscle stiffness to resist mechanical perturbations (Rack, P. M., & Westbury, D. R. (1974). The short-range stiffness of active mammalian muscle and its effect on mechanical properties. The Journal of physiology, 240(2), 331-350.). Likewise muscle stiffness would scale with changes in muscle contraction with synergies. Importantly for experiment 2, muscle stiffness is reduced during movement (Rack and Westbury, 1974) which may explain why resting postural biases do not seem to be impacting movement. Likewise, muscle spindle activity is shown to scale with extrafusal muscle fiber activity and forces acting through the tendon (Blum, K. P., Campbell, K. S., Horslen, B. C., Nardelli, P., Housley, S. N., Cope, T. C., & Ting, L. H. (2020). Diverse and complex muscle spindle afferent firing properties emerge from multiscale muscle mechanics. eLife, 9, e55177.). The concern here is that the authors have not sufficiently considered muscle neurophysiology, how that might relate to their findings, and how that might impact their interpretation. Given the differences in perturbations and muscle states at different phases, the concern is that it is not possible to disentangle whether the results are due to classic neurophysiology, the hypothesis they propose, or both. Can the authors please comment.

      It is possible that neuromuscular physiology may explain part of our results. However, this would not contradict our conceptual model.

      Regarding Experiment 1, it is possible that stiffness would scale with changes in background muscle contraction, as the reviewer suggests. Indeed, Bennett et al. (1992) used brief perturbations on the wrist to assess elbow stiffness, finding that, during movement, stiffness was increased in positions with a higher gravity load (and, in general, in positions where the net muscle torque was higher). However, during posture maintenance (as in our Experiment 1), they found that stiffness did not vary with (elbow) position or gravity load (two characteristics of our findings in Experiment 1):

      “The observed stiffness variation was not simply due to passive tissue or other joint angle dependent properties, as stiffnesses measured during posture were position invariant. Note that the minimum stiffness found in posture was higher than the peak stiffness measured during movement, and did not change much with the gravity load.” (illustrated in Fig. 5 of that paper)

      We thus find it very unlikely that stiffness explains the difference between the supported vs. unsupported conditions in Experiment 1.

      Even if stiffness modulation between the supported vs. unsupported conditions could explain our finding of stronger posture biases in the latter case, it would not be incompatible with our interpretation of increased RST drive: increased stiffness would potentially magnify the effects of the RST drive that we propose underlies these resting biases. It is possible that the increase in resting biases under conditions of increased muscle contraction (lack of arm support) is mediated through an increase in muscle stiffness. In other words, the increase in resting biases may not directly reflect additional RST outflow per se, but rather the scaling, through stiffness, of the same magnitude of RST outflow. Understanding this interaction was beyond the scope of our experiment design; in line with this, we briefly comment on it in our Limitations section.

      Regarding Experiment 2, stiffness has indeed been shown to be lower during movement, and we now comment on the potential effect of this on our results in the “Limitations” section (lines 815-830, replicated below). Importantly, for the case of holding perturbations, the increased stiffness associated with holding would increase resistance to both extension- and flexion-inducing perturbations. Thus, higher stiffness would be unlikely to explain our finding whereby resting biases resist or aggravate the effects of holding perturbations depending on perturbation direction. In addition, the framework in Blum et al., which describes how interactions between alpha and gamma drive can explain muscle activity patterns, does not rule out central neural control of stiffness: “muscle spindles have a unique muscle-within-muscle design such that their firing depends critically on both peripheral and central factors” (emphasis ours). It may be, for example, that gamma motoneurons controlling muscle spindles and stiffness are modulated by input from the reticular formation, making this a mechanism in line with our conceptual model.

      “Moreover, it has been shown that joint stiffness is reduced during movement compared to holding control (Rack and Westbury, 1974; Bennett et al., 1992). Along similar lines, muscle spindle activity – which may modulate stiffness – scales with extrafusal muscle fiber activity (such as muscle exertion involved in holding) and forces acting through the tendon (Blum et al., 2020). Such observations could, in principle, explain why we were unable to detect a relationship between resting biases and active movement control but we readily found a relationship between resting biases and active holding control: reduced joint stiffness during movement could scale down the influence of resting abnormalities. There are two issues with this explanation, however. First, it is debatable whether this should be considered an alternative explanation per se: stiffness modulation could be, in total or in part, the manifestation of a central movement/posture CST/RST mechanism similar to the one we propose in our conceptual model. For example, (Blum et al., 2020) argue that muscle spindle firing depends on both peripheral and central factors. Second, increased stiffness would not necessarily help detect differences in how active postural control responds to within-resting-posture vs. out-of-resting-posture perturbations. This is because an overall increase in stiffness would likely increase resistance to perturbations in any direction.”

      The authors should provide a limitations paragraph. They should address 1) how they used different perturbation force profiles, 2) the muscles were in different states which would change neuromuscular responses between trial phase / condition, 3) discuss a lack of direct statistical comparisons that support their hypothesis, and 4) provide a couple of paragraphs on classic neurophysiology, such as muscle stiffness and stretch reflexes, and how these various factors could influence the findings (i.e., whether they can disentangle whether the reported results are due to classic neurophysiology, the hypothesis they propose, or both).

      Thank you for your suggestion. We now discuss these points in a separate paragraph (lines 846-895), bringing together our previous discussion on stretch reflexes, our description of different perturbation types, and the additional issues raised by the reviewer above.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors have responded well to all my concerns, save two minor points.

      Figure 2 appears to be unchanged, although they describe appropriate changes in the response letter.

      Thank you for catching this error – we now include the updated figure (further updated to use the terms near/distant in place of proximal/distal).

      I still take issue with the use of proximal and distal to describe the locations of targets. Taking definitions somewhat randomly from the internet, "The terms proximal and distal are used in structures that are considered to have a beginning and an end," and "Proximal and distal are anatomical terms used to describe the position of a body part in relation to another part or its origin." In any case, the hand does not become proximal just because you bring it to your chest. Why not simply stick to the common and clearly defined terms "near" and "distant"?

      Point taken. We have updated the paper to use the terms near/distant.

      Additional changes/corrections not outlined above

      We now include a link to the data and code supporting our findings (https://osf.io/hufy8/). In addition, we made several minor edits throughout the text to improve readability, and corrected occasional mislabeling of CCW and CW pulse data. Note that this correction did not alter the (lack of) relationship between resting biases and responses to perturbations during active movement.

      Response letter references

      Bennett D, Hollerbach J, Xu Y, Hunter I (1992) Time-varying stiffness of human elbow joint during cyclic voluntary movement. Exp Brain Res 88:433–442.

      Blum KP, Campbell KS, Horslen BC, Nardelli P, Housley SN, Cope TC, Ting LH (2020) Diverse and complex muscle spindle afferent firing properties emerge from multiscale muscle mechanics. eLife 9:e55177.

      Cramer SC, Nelles G, Benson RR, Kaplan JD, Parker RA, Kwong KK, Kennedy DN, Finklestein SP, Rosen BR (1997) A functional MRI study of subjects recovered from hemiparetic stroke. Stroke 28:2518–2527.

      McPherson JG, Chen A, Ellis MD, Yao J, Heckman C, Dewald JP (2018) Progressive recruitment of contralesional cortico-reticulospinal pathways drives motor impairment post stroke. J Physiol 596:1211–1225 Available at: https://doi.org/10.1113/JP274968.

      Rack PM, Westbury D (1974) The short range stiffness of active mammalian muscle and its effect on mechanical properties. J Physiol 240:331–350.

      Robinson R, Herzog W, Nigg BM (1987) Use of force platform variables to quantify the effects of chiropractic manipulation on gait symmetry. J Manipulative Physiol Ther 10:172–176.

      Williams PE, Goldspink G (1973) The effect of immobilization on the longitudinal growth of striated muscle fibres. J Anat 116:45.

    1. Author response:

      Reviewer #1:

      In line with the reviewer’s suggestions, we will adjust the text to use more conservative language regarding the claims of maturation within the co-culture system, and emphasize that the conclusion is based on limited transcriptomic evidence. We acknowledge that the results from bulk RNA sequencing might contain contaminants across the gates, but would like to point out that the CD45+ CD14+ population is clear, and any resulting contamination would likely be small. We will address this caveat clearly in a new limitations section, as also suggested by reviewer 3. We will also take up the reviewer’s suggestion to look further into the stress response genes to further characterize the system. We apologise for any statistical annotations we may have missed and will take care to include them in the updated version.

      Reviewer #3:

      We acknowledge the reviewer’s concerns that the study was primarily focused on bulk RNA sequencing data and might not fully represent the complex metabolic and functional shifts, especially in a cell type like the hepatocyte, and will address these concerns in a new limitations section in the revised manuscript. We also apologise if it was unclear in the manuscript that the iHeps and iMacs were characterised prior to co-culturing: for example, the iMacs are routinely assessed for CD45, CD14 and CD163 prior to the start of any experiment, and likewise the iHeps are tested by qPCR, which also served as the baseline for the fold expression changes in Fig 3. The primary aim of the IL-6 assays is to demonstrate that the hepatocyte co-culture systems behave differently based on the source of the macrophages, and that primary macrophages might not be suitable for studying drug responses in vitro. We will clarify in the revised manuscript that the overall effect might not be directly related to specific Kupffer cell identity.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review): 

      Summary: 

      The presented study by Centore and colleagues investigates the inhibition of BAF chromatin remodeling complexes. The study is well written and includes comprehensive datasets, including compound screens, gene expression analysis, epigenetics, as well as animal studies. This is an important piece of work for the uveal melanoma research field, and sheds light on a new inhibitor class, as well as a mechanism that might be exploited to target this deadly cancer for which no good treatment options exist. 

      Strengths: 

      This is a comprehensive and well-written study. 

      Weaknesses: 

      There are minimal weaknesses. 

      Reviewer #2 (Public review): 

      Summary: 

      The authors generate an optimized small molecule inhibitor of SMARCA2/4 and test it in a panel of cell lines. All uveal melanoma (UM) cell lines in the panel are growth inhibited by the inhibitor, making them the focus of the paper. This inhibition is correlated with loss of promoter occupancy of key melanocyte transcription factors, e.g. SOX10. SOX10 overexpression and a point mutation in SMARCA4 can rescue growth inhibition exerted by the SMARCA2/4 inhibitor. Treatment of a UM xenograft model results in growth inhibition and regression, which correlates with reduced expression of SOX10 but no discernible toxicity in the mice. Collectively, the data suggest a novel treatment of uveal melanoma. 

      Strengths: 

      There are many strengths of the study, including the strong challenge of the on-target effect, the assays used and the mechanistic data. The results are compelling as are the effects of the inhibitor. The in vivo data is dose-dependent and doses are low enough to be meaningful and associated with evidence of target engagement. 

      Weaknesses: 

      The authors have addressed weaknesses in the revised version. 

      Reviewer #3 (Public review): 

      Summary: 

      This manuscript reports the discovery of new compounds that selectively inhibit SMARCA4/SMARCA2 ATPase activity and have pronounced effects on uveal melanoma cell proliferation. They induce apoptosis and suppress tumor growth, with no toxicity in vivo. The report provides biological significance by demonstrating that the drugs alter chromatin accessibility at lineage specific gene enhancer regions and decrease expression of lineage specific genes, including SOX10 and SOX10 target genes. 

      Strengths: 

      The study provides compelling evidence for the therapeutic use of these compounds and does a thorough job at elucidating the mechanisms by which the drugs work. The study will likely have a high impact on the chromatin remodeling and cancer fields. The datasets will be highly useful to these communities. 

      Weaknesses: 

      The authors have addressed all my concerns. 

      Recommendations for the authors: 

      We would, however, like to draw the authors' attention to two comments by the referees. 

      Referee 1 comments: While BAP1 mutant UM cell lines were included for some of the experiments, it seems the in vivo data mentioned in the response to the reviewer's comment is missing? The authors stated that "MP46 (Supplementary Fig. 3a) is BAP1-null uveal melanoma cell line with no detectable protein expression (Amirouchene-Angelozzi et al., Mol Oncol 2014), and we have observed strong tumor growth inhibition in this CDX model with our BAF ATPase inhibitor." But the CDX model data shown in Figure 4 is from 92.1 cells. If this data is available, then the manuscript would benefit from its addition. 

      We thank the reviewer for bringing this to our attention. As the reviewer mentioned, we show the 92-1 CDX model in our manuscript. Additionally, strong tumor growth inhibition was observed in the MP-46 CDX model treated with our BAF ATPase inhibitor; these data can be found in Vaswani et al., 2025 (PMID:39801091, https://pubmed.ncbi.nlm.nih.gov/39801091/).

      Referee 3 comments: 

      Supplementary Figure 2C 

      Is the T910M mutation in the parental MP41 cells heterozygous? If so, the authors should indicate this in the figure legend. If this is a homozygous mutation, the authors should explain how the inhibitors suppress SMARCA4 activity in cells that have a LOF mutation. 

      Could the authors please comment on these issues before a final version is posted online? 

      We thank the reviewer for bringing this to our attention. The T910M mutation is heterozygous, and the variant allele frequency for that mutation is 0.5. We updated the figure legend accordingly to reflect the genotype of the mutations highlighted in the table.

      Reviewer #1 (Recommendations for the authors): 

      The authors have addressed most of the questions in their review. 

      While BAP1 mutant UM cell lines were included for some of the experiments, it seems the in vivo data mentioned in the response to the reviewer's comment is missing? The authors stated that "MP46 (Supplementary Fig. 3a) is BAP1-null uveal melanoma cell line with no detectable protein expression (Amirouchene-Angelozzi et al., Mol Oncol 2014), and we have observed strong tumor growth inhibition in this CDX model with our BAF ATPase inhibitor." But the CDX model data shown in Figure 4 is from 92.1 cells. If this data is available, then the manuscript would benefit from its addition. 

      Reviewer #3 (Recommendations for the authors): 

      Supplementary Figure 2C 

      Is the T910M mutation in the parental MP41 cells heterozygous? If so, the authors should indicate this in the figure legend. If this is a homozygous mutation, the authors should explain how the inhibitors suppress SMARCA4 activity in cells that have a LOF mutation.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      In this manuscript, the authors performed an integration of 48 scRNA-seq public datasets and created a single-cell transcriptomic atlas for AML (222 samples comprising 748,679 cells). This is important since most AML scRNA-seq studies suffer from small sample size coupled with high heterogeneity. They used this atlas to further dissect AML with t(8;21) (AML1-ETO/RUNX1-RUNX1T1), which is one of the most frequent AML subtypes in young people. In particular, they were able to predict Gene Regulatory Networks in this AML subtype using pySCENIC, which identified the paediatric regulon defined by a distinct group of hematopoietic transcription factors (TFs) and the adult regulon for t(8;21). They further validated this in bulk RNA-seq with the AUCell algorithm and attributed the prenatal signature to 5 key TFs (KDM5A, REST, BCLAF1, YY1, and RAD21), and the postnatal signature to 9 TFs (ENO1, TFDP1, MYBL2, KLF1, TAGLN2, KLF2, IRF7, SPI1, and YBX1). They also used SCENIC+ to identify enhancer-driven regulons (eRegulons), forming an eGRN, and found that prenatal origin shows a specific HSC eRegulon profile, while a postnatal origin shows a GMP profile. They also did an in silico perturbation and found the AP-1 complex (JUN, ATF4, FOSL2), P300, and BCLAF1 to be important TFs for inducing differentiation. Overall, I found this study very important in creating a comprehensive resource for AML research. 

      Strengths: 

      (1) The generation of an AML atlas integrating multiple datasets with almost 750K cells will further support the community working on AML. 

      (2) Characterisation of t(8;21) AML proposes new interesting leads. 

      We thank the reviewer for a succinct summary of our work and highlighting its strengths.

      Weaknesses: 

      Were these t(8;21) TFs/regulons identified from any of the single datasets? For example, if the authors apply pySCENIC to any dataset, would they find the same TFs, or is it the increase in the number of cells that allows identification of these? 

      We ran pySCENIC on individual datasets and compared the TFs (defining the regulons) identified to those from the combined AML scAtlas analysis. Some TFs were identified in common, but these vary between individual studies. The union of all TFs identified makes a very large set, comprising around a third of all known TFs. AML scAtlas provides a more refined repertoire of TFs, perhaps because the underlying network inference approach is more robust with a higher number of cells. The findings of these investigations are included in Supplementary Figure 4D-E; we hope this is useful for other users of pySCENIC.

      Reviewer #2 (Public review): 

      Summary: 

      The authors assemble 222 publicly available bone marrow single-cell RNA sequencing samples from healthy donors and primary AML, including pediatric, adolescent, and adult patients at diagnosis. Focusing on one specific subtype, t(8;21), which, despite affecting all age classes, is associated with better prognosis and drug response for younger patients, the authors investigate if this difference is reflected also in the transcriptomic signal. Specifically, they hypothesize that the pediatric and part of the young population acquires leukemic mutations in utero, which leads to a different leukemogenic transformation and ultimately to differently regulated leukemic stem cells with respect to the adult counterpart. The analysis in this work heavily relies on regulatory network inference and clustering (via SCENIC tools), which identifies regulatory modules believed to distinguish the pre-, respectively, post-natal leukemic transformation. Bulk RNA-seq and scATAC-seq datasets displaying the same signatures are subsequently used for extending the pool of putative signature-specific TFs and enhancer elements. Through gene set enrichment, ontology, and perturbation simulation, the authors aim to interpret the regulatory signatures and translate them into potential onset-specific therapeutic targets. The putative pre-natal signature is associated with increased chemosensitivity, RNA splicing, histone modification, stemness marker SMARCA2, and potentially maintained by EP300 and BCLAF1. 

      Strengths: 

      The main strength of this work is the compilation of a pediatric AML atlas using the efficient Cellxgene interface. Also, the idea of identifying markers for different disease onsets, interpreting them from a developmental angle, and connecting this to the different therapy and relapse observations, is interesting. The results obtained, the set of putative up-regulated TFs, are biologically coherent with the mechanisms and the conclusions drawn. I also appreciate that the analysis code was made available and is well documented. 

      We thank the reviewer for evaluating our work, and highlighting its key features, including creation of AML atlas, downstream analysis and interpretation for t(8;21) subtype.

      Weaknesses:

      There were fundamental flaws in how methods and samples were applied, a general lack of critical examination of both the results and the appropriateness of the methods for the data at hand, and in how results were presented. In particular: 

      (1) Cell type annotation: 

      (a) The 2-phase cell type annotation process employed for the scRNA-seq sample collection raised concerns. Initially annotated cells are re-labeled after a second round with the same cell types from the initial label pool (Figure 1E). The automatic annotation tools were used without specifying the database and tissue atlases used as a reference, and no information was shown regarding the consensus across these tools. 

      Cell type annotations are heavily influenced by the reference profiles used and vary significantly between tools. To address this, we used multiple cell type annotation tools which predominantly encompassed healthy peripheral blood cell types and/or healthy bone marrow populations. This determined the primary cluster cell types assigned. 

      Existing tools and resources are not leukemia specific; thus, to identify AML-associated HSPC subpopulations, we created a custom SingleR reference using a CD34 enriched AML single-cell dataset. This was not suitable for the annotation of the full AML scAtlas, as it is derived from CD34 sorted cell types and so is biased towards these populations. 

      We have made this much clearer in the revised manuscript, by splitting Figure 1 into two separate figures (now Figure 1 and Figure 2) reflecting both different analyses performed. The methods have also been updated with more detail on the cell type annotations, and we have included the automated annotation outputs as a supplementary table, as this may be useful for others in the single-cell community. 

      (b) Expression of the CD34 marker is only reported as a selection method for HSPCs, which is not in line with common practice. The use of CD34 alone is acceptable as a surface marker, while robust annotation of HSPCs should be done on the basis of expression of gene sets. 

      Most of the cells used in the HSPC analysis were in fact annotated as HSPCs with some exceptions. In line with this feedback, we have re-worked this analysis and simply taken HSPC annotated clusters forward for the subsequent analysis, yielding the same findings. 

      (c) During several analyses, the cell types used were either not well defined or contradictory, such as in Figure 2D, where it is not clear if pySCENIC and AUC scores were computed on HSPCs alone or merged with CMPs. In other cases, different cell type populations are compared and used interchangeably: comparing the HSPC-derived regulons with bulk (probably not enriched for CD34+ cells) RNA samples could be an issue if there are no valid assumptions on the cell composition of the bulk sample. 

      We apologize for the lack of clarity regarding which cell types were used; the text has been updated to clarify that all myeloid progenitor cells were included in the pySCENIC analysis. 

      The bulk RNA-seq samples were used only to test the enrichment of our AML scAtlas derived regulons in an unbiased and large-scale way. While CD34 enriched samples could be preferable, these were not available to us. 

      We agree that more effort could be made to ensure the single-cell/myeloid progenitor derived regulons are comparable to the bulk RNA sequencing data. In the original bulk RNA-seq validation analysis, we used all bulk RNA sequencing timepoints (diagnostic, on-treatment, relapse) and included both bone marrow and peripheral blood. Upon reflection, and to better harmonize the bulk RNA-seq selection strategy with that of AML scAtlas, we revised our approach to include only diagnostic bone marrow samples. We expect that, since the leukemia blast count for pediatric AML is typically high at diagnosis, these samples will predominantly contain leukemic blasts. 

      (2) Method selection: 

      (a) The authors should explain why they use pySCENIC and not any other approach. They should briefly explain how pySCENIC works and what they get out in the main text. In addition they should explain the AUCell algorithm and motivate its usage. 

      pySCENIC is a state-of-the-art method for network inference from scRNA data and is widely used within the single-cell community (over 5000 citations for both versions of the SCENIC pipeline). The pipeline has been benchmarked as one of the top performers for GRN analysis (Nguyen et al, 2021. Briefings in Bioinformatics). AUCell is a module within the pySCENIC pipeline that summarizes the activity of a set of genes (a regulon) into a single number, which helps to compare and visualize different regulons. We have modified the manuscript (Results section 2 paragraph 2) to better explain this method and provided some rationale and accompanying citations to justify its use for this analysis. We thank the reviewer for highlighting this and hope our updates add some clarity.
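      To make the idea of summarizing a regulon's activity into a single number concrete, the following is a simplified sketch of an AUCell-style score for one cell. It is not the pySCENIC implementation: the function name, the `top_fraction` parameter and the toy expression vector are illustrative assumptions, and tie handling and other details of the real tool are ignored. Genes are ranked by expression, and the score is the area under the recovery curve of the regulon's genes within the top-ranked genes, divided by the best area achievable.

```python
import numpy as np

def aucell_like_score(expression, gene_set_idx, top_fraction=0.05):
    """Simplified AUCell-style enrichment score for a single cell.

    expression:   1D array of expression values, one entry per gene.
    gene_set_idx: integer indices of the genes in the regulon (hypothetical input).
    top_fraction: fraction of top-ranked genes over which the area is computed.
    """
    expression = np.asarray(expression, dtype=float)
    n_genes = expression.shape[0]
    cutoff = max(1, int(round(top_fraction * n_genes)))

    # Rank genes from highest to lowest expression.
    order = np.argsort(-expression, kind="stable")

    # Mark which of the top-ranked genes belong to the regulon.
    in_set = np.zeros(n_genes, dtype=bool)
    in_set[np.asarray(gene_set_idx, dtype=int)] = True
    hits = in_set[order][:cutoff]

    # Area under the recovery curve (cumulative number of regulon genes found).
    auc = np.cumsum(hits).sum()

    # Best possible area: all regulon genes ranked first.
    k = min(int(in_set.sum()), cutoff)
    if k == 0:
        return 0.0
    max_auc = np.minimum(np.arange(1, cutoff + 1), k).sum()
    return float(auc / max_auc)

# Toy cell with 20 genes, expression decreasing from gene 0 to gene 19.
cell = np.arange(20, 0, -1)
print(aucell_like_score(cell, [0, 1], top_fraction=0.25))    # regulon genes at the top
print(aucell_like_score(cell, [18, 19], top_fraction=0.25))  # regulon genes at the bottom
```

      Because the score depends only on within-cell gene ranks, it is comparable across cells with different sequencing depth, which is one reason a rank-based summary is convenient for comparing regulons.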

      (b) The obtained GRN signatures were not critically challenged on an external dataset. Therefore, the evidence that supports these signatures to be reliable and significant to the investigated setting is weak. 

      These signatures were inferred using the most suitable AML single-cell RNA datasets currently available. To validate our findings, we used two independent datasets (the TARGET AML bulk RNA sequencing cohort, and the Lambo et al. scRNA-seq dataset). To clarify this workflow in the manuscript, we have added a panel to Figure 3 outlining the analytical process. To our knowledge, there are no other better-suited datasets for validation. Experimental validations on patient samples, while valuable, are beyond the scope of this study.

      (3) There are some issues with the analysis & visualization of the data. 

      Based on this feedback, we have improved several aspects of the analysis, changed some visualizations, and improved figure resolution throughout the manuscript. 

      (4) Discussion: 

      (a) What exactly is the 'regulon signature' that the authors infer? How can it be useful for insights into disease mechanisms? 

      The ‘regulon signature’ here refers to a gene regulatory program (multiple gene modules, each defined by a transcription factor and its targets) that is specific to different age groups. Further investigation of these signatures could help explain why patients of different ages follow different clinical courses. We have amended the text to explain this.

      (b) The authors write 'Together this indicates that EP300 inhibition may be particularly effective in t(8;21) AML, and that BCLAF1 may present a new therapeutic target for t(8;21) AML, particularly in children with inferred pre-natal origin of the driver translocation.' I am missing a critical discussion of what is needed to further test the two targets. Put differently: Would the authors take the risk of a clinical study given the evidence from their analysis? 

      Indeed, many extensive studies would be required before these findings are clinically translatable. We have included a discussion paragraph (discussion paragraph 7) detailing what further work is required in terms of experimental validation and potential subsequent clinical study.

      Reviewer #1 (Recommendations for the authors): 

      In addition to the point raised above, Cytoscape files for the GRNs and eGRNs inferred would be useful to have. 

      We have now provided Cytoscape/eGRN tables in supplementary materials.

      Reviewer #2 (Recommendations for the authors): 

      (1) Figures 1F and 1G: You show the summed-up frequencies for all patients, right? It would be very interesting to see this per patient, or add error bars, since the shown frequencies might be driven by single patients with many cells. 

      While this type of plot could be informative, the large number of samples in the AML scAtlas rendered the output difficult to interpret. As a result, we decided not to include it in the manuscript.

      (2) An issue of selection bias has to be raised when only the two samples expressing the expected signatures are selected from the external scRNA dataset. Similarly, in the DepMap analysis, the age and nature of the other cell lines sensitive to EP300 and BCLAF1 should be reported. 

      Since the purpose of this analysis was to build on previously defined signatures, we selected the two samples for which we had preliminary hypotheses. It would indeed be interesting to explore those not matching these signatures; however, sample numbers are very small, so without preliminary findings robust interpretation and validation would be difficult. An expanded validation would be more appropriate once more data become available in the future. 

      We agree that investigating the age and nature of other BCLAF1/EP300 sensitive cell lines is a very valuable direction. Our analysis suggests that our BCLAF1 findings may also be applicable to other in-utero origin cancers, and we have now summarized these observations in Supplementary Figure 7H. 

      (3) Is there statistical evidence for your claim that "This shows that higher-risk subtypes have a higher proportion of LSCs compared to favorable risk disease."? At least intermediate and adverse look similar to me. How does this look if you show single patients?  

      We are grateful to the reviewer for noticing this oversight and have now included an appropriate statistical test in the revised manuscript. As before, while showing single patients may be useful, the large number of patients makes such a plot difficult to interpret. For this reason, we have chosen not to include it.

      (4) Specify the statistical test you used to 'identify significantly differentially expressed TFs' (line 192). 

      The methods used for differential expression analysis are now clearly stated in the text as well as in the methods section. We hope this addition improves clarity for the reader.

      (5) Figure 2B: You show the summed up frequencies for all patients, right? It would be intriguing to see this figure per patient, since the shown frequencies might be driven by single patients with many cells. 

      Yes, the plot includes all patients. Showing individual patients on a single plot is not easily interpretable. 

      (6) Y axis in 2D is not samples, but single cells? Please specify. 

      We thank the reviewer for bringing this to our attention and have now updated Figure 3D accordingly. 

      (7) Figure 3A: I don't get why the chosen clusters are designated as post- and prenatal, given the occurrence of samples in them. 

      This figure serves to validate the previously defined regulon signatures, so the cluster designations are based on this. We have amended the text to elaborate on this point, which will hopefully provide greater clarity.

      (8) Figure 3E: What is shown on the y axis? Did you correct your p-values for multiple testing? 

      We apologize for this oversight and have now added a y axis label. P values were not corrected for multiple testing, as only a few pairwise t-tests were performed.

      (9) Robustness: You find some gene sets up- and down-regulated. How would that change if you used an eg bootstrapped number of samples, or a different analysis approach? 

      To address this, we implemented both edgeR and DESeq2 for DE testing. Our findings (Supplementary Figure 5B) show that 98% of edgeR genes are also detected by DESeq2. We opted to use the smaller edgeR gene list for our analysis, because the substantial overlap indicates that the findings are robust. We thank the reviewer for this helpful suggestion, which has strengthened our analysis.
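
      For readers who want to repeat this kind of robustness check on their own gene lists, the overlap statistic can be computed as in the minimal sketch below. The function name `overlap_fraction` and the gene symbols in the usage note are hypothetical placeholders and are not taken from the manuscript's result tables.

```python
def overlap_fraction(reference_genes, other_genes):
    """Fraction of `reference_genes` that also appear in `other_genes`.

    With the edgeR list as the reference and the DESeq2 list as the other
    set, this is the quantity reported as a percentage in the response.
    """
    reference = set(reference_genes)
    other = set(other_genes)
    if not reference:
        raise ValueError("reference gene list is empty")
    return len(reference & other) / len(reference)
```

      For example, `overlap_fraction(["GENE_A", "GENE_B"], ["GENE_A", "GENE_B", "GENE_C"])` returns 1.0, because every gene of the smaller list is recovered by the larger one.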

      (10) Multiomics analysis:

      (a) Why only work on 'representative samples'? The idea of an integrated atlas is to identify robust patterns across patients, no? I'd love to see what regulons are robust, ie,  shared between patients.

      As discussed in point 2, there are very few samples available for the multiomics analysis. Therefore, we chose to focus on those samples which we had a working hypothesis for, as a validation for our other analyses. 

      (b) I don't agree that finding 'the key molecular processes, such as RNA splicing, histone modification, and TF binding' expressed 'further supports the stemness signature in presumed prenatal origin t(8;21) AML'.

      Following the improvements made on the bulk RNA-Seq analysis in response to the previous reviewer comments, we ended up with a smaller gene set. Consequently, the ontology results have changed. The updated results are now more specific and indicate that developmental processes are upregulated in presumed prenatal origin t(8;21) AML. 

      (c) Please clarify if the multiome data is part of the atlas.

      The multiome data is not a part of AML scAtlas, as it was published at a later date. We used this dataset solely for validation purposes and have updated the figures and text to clearly indicate that it is used as a validation dataset.  

      (d) Please describe the used data with respect to the number of patients, cells, age, etc.

      We clarified this point in the text and have also included supplementary tables detailing all samples used in the atlas and validation datasets. 

      (e) The four figures in Figure 4E look identical to me. What is the take-home message here? Do all perturbations have the same impact on driving differentiation? Please elaborate.

      The perturbation figure is intended to illustrate that other genes can behave similarly to members of the AP-1 complex (JUN and ATF4 here) following perturbation. Since the AP-1 complex is well known to be important in t(8;21) AML, we hypothesize that these other genes are also important. We apologize for the previous lack of interpretation here and have amended the text to clarify this point. 

      (11) Abstract: Please detail: how many of the 159 AML patients are t(8;21)? 

      We have now amended the abstract to include this. 

      (12) Figures: Increase font size where possible, eg age in 1B or risk group in 1G is super small and hard to read. 

      Extra attention has been given to improving the figure readability and resolution throughout the whole manuscript.  

      (13) Color codes in Figures 2B and 2C are all over the place and misleading: Sort 2C along age, indicate what is adult and adolescent, sort the x axis in 2B along age. 

      We have changed this figure accordingly.  

      (14) I suggest not coloring dendrograms, in my opinion this is highly irritating. 

      The dendrogram colors correspond to clusters which are referenced in the text. This coloring provides informative context and aids interpretation, making it a useful addition to the figure.

      (15) The resolution in Figure 4B is bad, I can't read the labels. 

      This visualization has been revised, to make presentation of this data clearer.  

      (16) In addition to selecting bulk RNA samples matching the two regulon signatures, some effort should have been put into investigating the samples not aligned with those, or assessing how unique these GRN signatures are to the specific cell type and disease of interest, excluding the influence of cell type composition and random noise. The late-onset signatures should also be excluded from being present in an external pre-natal cohort in a more statistically rigorous manner. 

      Our use of the bulk RNA-Seq data is solely intended for the validation of predefined regulon signatures, for which we already have a working hypothesis.  While we agree that further investigation of the samples that do not align with these signatures could yield interesting insights, we believe that such an analysis would extend beyond the scope of the current manuscript.

      (17) The specific bulk RNA samples used should be specified, along with the tissue of origin. The same goes for the Lambo dataset. 

      We have clarified this point in the text and provided a supplementary table detailing all samples used for validation, alongside the sample list from AML scAtlas.

      (18) In Supplementary Figure 5B, the axes should be defined. 

      We have updated this figure to include axis legends.

      (19) Supplementary Figure 4A. There is a mistake in the sex assignment for sample AML14D. Since chrY-genes are expressed, this sample is likely male, while the Xist expression is mostly zero. 

      We thank the reviewer for pointing out this error, which has now been corrected.  

      (20) Wording suggestions: 

      (a) Line 54: not compelling phrasing. 

      (b) Line 83: "allows to decipher". 

      (c) Line 88: repetition from line 85. 

      (d) Line 90: the expression "clean GRN" is not clear. 

      These wording suggestions have all been incorporated in the revised manuscript.

      (21) Supplementary Figure 3D is not interpretable, I suggest a different visualization. 

      We agree that the original figure was not the most informative and have replaced it with UMAPs displaying LSC6 and LSC17 scores.

    1. Author response:

      Reviewer 1 (Public review):

      (1) Figure 1B shows the PREDICTED force-extension curve for DNA based on a worm-like chain model. Where is the experimental evidence for this curve? This issue is crucial because the F-E curve will decide how and when a catch-bond is induced (if at all it is) as the motor moves against the tensiometer. Unless this is actually measured by some other means, I find it hard to accept all the results based on Figure 1B.

      The Worm-Like-Chain model for the elasticity of DNA was established by early work from the Bustamante lab (Smith et al., 1992)  and Marko and Siggia (Marko and Siggia, 1995), and was further validated and refined by the Block lab (Bouchiat et al., 1999; Wang et al., 1997). The 50 nm persistence length is the consensus value, and was shown to be independent of force and extension in Figure 3 of Bouchiat et al (Bouchiat et al., 1999). However, we would like to stress that for our conclusions, the precise details of the Force-Extension relationship of our dsDNA are immaterial. The key point is that the motor stretches the DNA and stalls when it reaches its stall force. Our claim of the catch-bond character of kinesin is based on the longer duration at stall compared to the run duration in the absence of load. Provided that the motor is indeed stalling because it has stretched out the DNA (which is strongly supported by the repeated stalling around the predicted extension corresponding to ~6 pN of force), then the stall duration depends on neither the precise value for the extension nor the precise value of the force at stall.
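
      As an illustration of the kind of predicted curve under discussion, the sketch below evaluates the Marko and Siggia interpolation formula for a worm-like chain. The 50 nm persistence length comes from the response above; the thermal energy of 4.11 pN·nm (room temperature) and the 1000 nm contour length used in the usage note are assumptions for illustration, and the refined Bouchiat et al. correction terms are not included.

```python
KBT_PN_NM = 4.11  # thermal energy near room temperature, in pN*nm (assumed)


def wlc_force_pn(extension_nm, contour_length_nm, persistence_length_nm=50.0):
    """Marko-Siggia interpolation for the force needed to stretch a WLC.

    F = (kBT / P) * [ 1 / (4 (1 - x/L)^2) - 1/4 + x/L ]
    """
    if not 0.0 <= extension_nm < contour_length_nm:
        raise ValueError("extension must satisfy 0 <= x < contour length")
    z = extension_nm / contour_length_nm  # fractional extension x/L
    return (KBT_PN_NM / persistence_length_nm) * (
        1.0 / (4.0 * (1.0 - z) ** 2) - 0.25 + z
    )
```

      With these assumed values the force stays near 0.1 pN at half extension and rises steeply only as the extension approaches the contour length, which is why a motor stalling at a few piconewtons is expected to stall within a narrow range of extensions.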

      (2) The authors can correct me on this, but I believe that all the catch-bond studies using optical traps have exerted a load force that exceeds the actual force generated by the motor. For example, see Figure 2 in reference 42 (Kunwar et al). It is in this regime (load force > force from motor) that the dissociation rate is reduced (catch-bond is activated). Such a regime is never reached in the DNA tensiometer study because of the very construction of the experiment. I am very surprised that this point is overlooked in this manuscript. I am therefore not even sure that the present experiments even induce a catch-bond (in the sense reported for earlier papers).

      It is true that Kunwar et al measured binding durations at super-stall loads and used that to conclude that dynein does act as a catch-bond (but kinesin does not) (Kunwar et al., 2011). However, we would like to correct the reviewer on this one. This approach of exerting super-stall forces and measuring binding durations is in fact less common than the approach of allowing the motor to walk up to stall and measuring the binding duration. This ‘fixed trap’ approach has been used to show catch-bond behavior of dynein (Leidel et al., 2012; Rai et al., 2013) and kinesin (Kuo et al., 2022; Pyrpassopoulos et al., 2020). For the non-processive motor Myosin I, a dynamic force clamp was used to keep the actin filament in place while the myosin generated a single step (Laakso et al., 2008). Because the motor generates the force, these are not superstall forces either.

      (3) I appreciate the concerns about the Vertical force from the optical trap. But that leads to the following questions that have not at all been addressed in this paper:

      (i) Why is the Vertical force only a problem for Kinesins, and not a problem for the dynein studies?

      Actually, we do not claim that vertical force is not a problem for dynein; our data do not speak to this question. There is debate in the literature as to whether dynein has catch bond behavior in the traditional single-bead optical trap geometry - while some studies have measured dynein catch bond behavior (Kunwar et al., 2011; Leidel et al., 2012; Rai et al., 2013), others have found that dynein has slip-bond or ideal-bond behavior (Ezber et al., 2020; Nicholas et al., 2015; Rao et al., 2019). This discrepancy may relate to vertical forces, but not in an obvious way.

      (ii) The authors state that "With this geometry, a kinesin motor pulls against the elastic force of a stretched DNA solely in a direction parallel to the microtubule". Is this really true? What matters is not just how the kinesin pulls the DNA, but also how the DNA pulls on the kinesin. In Figure 1A, what is the guarantee that the DNA is oriented only in the plane of the paper? In fact, the DNA could even be bending transiently in a manner that it pulls the kinesin motor UPWARDS (Vertical force). How are the authors sure that the reaction force between DNA and kinesin is oriented SOLELY along the microtubule?

      We acknowledge that “solely” is an absolute term that is too strong to describe our geometry. We will soften this term in our revision to “nearly parallel to the microtubule”. In the Geometry Calculations section of Supplementary Methods, we calculate that if the motor and streptavidin are on the same protofilament, the vertical force will be <1% of the horizontal force. We also note that if the motor is on a different protofilament, there will be lateral forces and forces perpendicular to the microtubule surface, but these are oriented toward rather than away from the microtubule. The DNA can surely bend due to thermal forces, but because inertia plays a negligible role at the nanoscale (Howard, 2001; Purcell, 1977), any resulting upward forces will only be thermal forces, which the motor is already subjected to at all times.

      (4) For this study to be really impactful and for some of the above concerns to be addressed, the data should also have included DNA tensiometer experiments with Dynein. I wonder why this was not done?

      As much as we would love to fully characterize dynein here, this paper is about kinesin and it took a substantial effort. The dynein work merits a stand-alone paper.

      While I do like several aspects of the paper, I do not believe that the conclusions are supported by the data presented in this paper for the reasons stated above.

      The three key points the reviewer makes are the validity of the worm-like-chain model, the question of superstall loads, and the role of DNA bending in generating vertical forces. We hope that we have fully addressed these concerns in our responses above.

      Reviewer #2 (Public review):

      Major comments:

      (1) The use of the term "catch bond" is misleading, as the authors do not really mean consistently a catch bond in the classical sense (i.e., a protein-protein interaction having a dissociation rate that decreases with load). Instead, what they mean is that after motor detachment (i.e., after a motor protein dissociating from a tubulin protein), there is a slip state during which the reattachment rate is higher as compared to a motor diffusing in solution. While this may indeed influence the dynamics of bidirectional cargo transport (e.g., during tug-of-war events), the used terms (detachment (with or without slip?), dissociation, rescue, ...) need to be better defined and the results discussed in the context of these definitions. It is very unsatisfactory at the moment, for example, that kinesin-3 is at first not classified as a catch bond, but later on (after tweaking the definitions) it is. In essence, the typical slip/catch bond nomenclature used for protein-protein interaction is not readily applicable for motors with slippage.

      We appreciate the reviewer’s point and we will work to streamline and define terms in our revision.

      (2) The authors define the stall duration as the time at full load, terminated by >60 nm slips/detachments. Isn't that a problem? Smaller slips are not detected/considered... but are also indicative of a motor dissociation event, i.e., the end of a stall. What is the distribution of the slip distances? If the slip distances follow an exponential decay, a large number of short slips are expected, and the presented data (neglecting those short slips) would be highly distorted.

      The reviewer brings up a good point that there may be undetected slips. To address this question, we plotted the distribution of slip distances for kinesin-3, which by far had the most slip events. As the reviewer suggested, it is indeed an exponential distribution. Our preliminary analysis suggests that roughly 20% of events are missed due to this 60 nm cutoff. This will change our unloaded duration numbers slightly, but this will not alter our conclusions.
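
      The relationship between a detection cutoff and the fraction of missed events can be sketched for the idealized case of a purely exponential slip-distance distribution. The function names below are hypothetical, and the mean slip distance is a free parameter rather than a value taken from the data.

```python
import math


def missed_fraction(cutoff_nm, mean_slip_nm):
    """Fraction of exponentially distributed slips shorter than the cutoff."""
    return 1.0 - math.exp(-cutoff_nm / mean_slip_nm)


def mean_slip_for_missed_fraction(cutoff_nm, fraction):
    """Mean slip distance implied by a given missed fraction (inverse map)."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return -cutoff_nm / math.log(1.0 - fraction)
```

      The two functions are inverses of each other, so an estimate of the missed fraction at a 60 nm cutoff can be converted into the mean slip distance it would imply under this assumption, and vice versa.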

      (3) Along the same line: Why do the authors compare the stall duration (without including the time it took the motor to reach stall) to the unloaded single motor run durations? Shouldn't the times of the runs be included?

      The elastic force of the DNA spring is variable as the motor steps up to stall, and so if we included the entire run duration then it would be difficult to specify what force we were comparing to unloaded. More importantly, if we assume that any stepping and detachment behavior is history independent, then it is mathematically proper to take any arbitrary starting point (such as when the motor reaches stall), start the clock there, and measure the distribution of detachment durations relative to that starting point.
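
      The history-independence argument is the memoryless property of an exponential waiting time, which can be checked numerically with the short simulation below. This is a sketch under the assumption of a single constant detachment rate; the function name, the 3 s mean duration, and the 2 s delayed start are arbitrary illustrative values.

```python
import numpy as np


def residual_mean_after_delay(mean_duration_s, delay_s, n=200_000, seed=0):
    """Compare the mean duration with the mean REMAINING duration.

    Durations are drawn from an exponential distribution. For events that
    last longer than `delay_s`, the clock is restarted at `delay_s`.
    Returns (mean of all durations, mean remaining duration after the delay).
    """
    rng = np.random.default_rng(seed)
    durations = rng.exponential(scale=mean_duration_s, size=n)
    survivors = durations[durations > delay_s]
    return durations.mean(), (survivors - delay_s).mean()
```

      Calling `residual_mean_after_delay(3.0, 2.0)` returns two values that agree to within sampling error, showing that restarting the clock at an arbitrary point does not bias the estimated duration when detachment is history independent.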

      In addition, in Fig. 3 we separate out the ramps from the stalls and, using a statistical model, compute a separate duration parameter (which is the inverse of the off-rate) for the ramp and the stall. What we find is that the relationship between ramp, stall, and unloaded durations is different for the three motors, which is interesting in itself.

      (4) At many places, it appears too simple that for the biologically relevant processes, mainly/only the load-dependent off-rates of the motors matter. The stall forces and the kind of motor-cargo linkage (e.g., rigid vs. diffusive) do likely also matter. For example: "In the context of pulling a large cargo through the viscous cytoplasm or competing against dynein in a tug-of-war, these slip events enable the motor to maintain force generation and, hence, are distinct from true detachment events." I disagree. The kinesin force at reattachment (after slippage) is much smaller than at stall. What helps, however, is that due to the geometry of being held close to the microtubule (either by the DNA in the present case or by the cargo in vivo) the attachment rate is much higher. Note also that upon DNA relaxation, the motor is likely kept close to the microtubule surface, while, for example, when bound to a vesicle, the motor may diffuse away from the microtubule quickly (e.g., reference 20).

      We appreciate the reviewer’s detailed thinking here, and we offer our perspective. As to the first point, we agree that the stall force is relevant and that the rigidity of the motor-cargo linkage will play a role. The goal of the sentence on pulling cargo that the reviewer highlights is to set up our analysis of slips, which we define as rearward displacements that don’t return to the baseline before force generation resumes. We agree that force after slippage is much smaller than at stall, and we plan to clarify that section of text. However, as shown in the model diagram in Fig. 5, we differentiate between the slip state (and recovery from this slip state) and the detached state (and reattachment from this detached state). This delineation is important because, as the reviewer points out, if we are measuring detachment and reattachment with our DNA tensiometer, then the geometry of a vesicle in a cell will be different and diffusion away from the microtubule or elastic recoil perpendicular to the microtubule will suppress this reattachment.

      Our evidence for a slip state in which the motor maintains association with the microtubule comes from optical trapping work by Toleikis et al (Toleikis et al., 2020) and Sudhakar et al (Sudhakar et al., 2021). In particular, Sudhakar used small, high index germanium microspheres that had a low drag coefficient. They showed that during ‘slip’ events, the relaxation time constant of the bead back to the center of the trap was nearly 10-fold slower than the trap response time, consistent with the motor exerting drag on the microtubule. (With larger beads, the drag of the bead swamps the motor-microtubule friction.) Another piece of support for the motor maintaining association during a slip is work by Ramaiya et al. who used birefringent microspheres to exert and measure rotational torque during kinesin stepping (Ramaiya et al., 2017). In most traces, when the motor returned to baseline following a stall, the torque was dissipated as well, consistent with a ‘detached’ state. However, a slip event is shown in S18a where the motor slips backward while maintaining torque. This is best explained by the motor slipping backward in a state where the heads are associated with the microtubule (at least sufficiently to resist rotational forces). Thus, we term the resumption after slip to be a rescue from the slip state rather than a reattachment from the detached state.

      To finish the point, with the complex geometry of a vesicle, during slip events the motor remains associated with the microtubule and hence primed for recovery. This recovery rate is expected to be the same as for the DNA tensiometer. Following a detachment, however, we agree that there will likely be a higher probability of reattachment in the DNA tensiometer due to proximity effects, whereas with a vesicle any elastic recoil or ‘rolling’ will pull the detached motor away from the microtubule, suppressing reattachment. We plan to clarify these points in the text of the revision.

      (5) Why were all motors linked to the neck-coil domain of kinesin-1? Couldn't it be that for normal function, the different coils matter? Autoinhibition can also be circumvented by consistently shortening the constructs.

      We chose this dimerization approach to focus on how the mechanochemical properties of kinesins vary between the three dominant transport families. We agree that in cells, autoinhibition of both kinesins and dynein likely plays a role in regulating bidirectional transport, as will the activity of other regulatory proteins. The native coiled-coils may act as ‘shock absorbers’ due to their compliance, or they might slow the motor reattachment rate due to the relatively large search volumes created by their long lengths (10s of nm). These are topics for future work. By using the neck-coil domain of kinesin-1 for all three motors, we eliminate any differences in autoinhibition or other regulation between the three kinesin families and focus solely on differences in the mechanochemistry of their motor domains.

      (6) I am worried about the neutravidin on the microtubules, which may act as roadblocks (e.g. DOI: 10.1039/b803585g), slip termination sites (maybe without the neutravidin, the rescue rate would be much lower?), and potentially also DNA-interaction sites? At 8 nM neutravidin and the given level of biotinylation, what density of neutravidin do the authors expect on their microtubules? Can the authors rule out that the observed stall events are predominantly the result of a kinesin motor being stopped after a short slippage event at a neutravidin molecule?

      We will address these points in our revision.

      (7) Also, the unloaded runs should be performed on the same microtubules as in the DNA experiments, i.e., with neutravidin. Otherwise, I do not see how the values can be compared.

      We will address this point in our revision.

      (8) If, as stated, "a portion of kinesin-3 unloaded run durations were limited by the length of the microtubules, meaning the unloaded duration is a lower limit." corrections (such as Kaplan-Meier) should be applied, DOI: 10.1016/j.bpj.2017.09.024.

      (9) Shouldn't Kaplan-Meier also be applied to the ramp durations ... as a ramp may also artificially end upon stall? Also, doesn't the comparison between ramp and stall duration have a problem, as each stall is preceded by a ramp ...and the (maximum) ramp times will depend on the speed of the motor? Kinesin-3 is the fastest motor and will reach stall much faster than kinesin-1. Isn't it obvious that the stall durations are longer than the ramp duration (as seen for all three motors in Figure 3)?

      The reviewer rightly notes the many challenges in estimating the motor off-rates during ramps. To estimate ramp off-rates and as an independent approach to calculating the unloaded and stall durations, we developed a Markov model coupled with Bayesian inference methods to estimate a duration parameter (equivalent to the inverse of the off-rate) for the unloaded, ramp, and stall duration distributions. With the ramps, we have left censoring due to the difficulty in detecting the start of the ramps in the fluctuating baseline, and we have right censoring due to reaching stall (with different censoring of the ramp duration for the three motors due to their different speeds). The Markov model assumes a constant detachment probability and history independence, and thus is robust even in the face of left and right censoring (details in the Supplementary section). This approach is preferred over Kaplan-Meier because, although these non-parametric methods make no assumptions for the distribution, they require the user to know exactly where the start time is.
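
      The reason a constant-rate model tolerates censoring can be seen in a much simpler setting than the Markov model with Bayesian inference described above. The sketch below is a maximum-likelihood stand-in and not the authors' method: for exponentially distributed durations with right censoring, the estimated duration parameter is the total observed time divided by the number of observed detachments. The function name and argument names are hypothetical.

```python
import numpy as np


def censored_exponential_duration(durations_s, ended_by_detachment):
    """Maximum-likelihood duration parameter (1 / off-rate) with censoring.

    durations_s         : observed time for every event, censored or not
    ended_by_detachment : True where the event ended in a detachment,
                          False where observation stopped for another reason
                          (for example, a ramp that ended by reaching stall)
    """
    durations = np.asarray(durations_s, dtype=float)
    detached = np.asarray(ended_by_detachment, dtype=bool)
    n_detachments = int(detached.sum())
    if n_detachments == 0:
        raise ValueError("at least one uncensored detachment is required")
    # Censored events contribute observation time but no detachment count.
    return durations.sum() / n_detachments
```

      Because the process is assumed memoryless, an uncertain start time only shortens the observed time of an event without biasing this ratio, which is the property that a Kaplan-Meier style analysis cannot rely on.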

      Regarding the potential underestimate of the kinesin-3 unloaded run duration due to finite microtubule lengths, we make two points. The first is that the unloaded duration data in Fig. 2C are quite linear up to 6 s and are well fit by the single-exponential fit (the points above 6 s don’t affect the fit very much). The second is that when we used our Markov model (which is robust against right censoring) to estimate the unloaded and stall durations, the results agreed with the single-exponential fits very well (Table S2). For instance, the single-exponential fit for the kinesin-3 unloaded duration was 2.74 s (2.33 – 3.17 s 95% CI) and the estimate from the Markov model was 2.76 s (2.28 – 3.34 s 95% CI). Thus, we chose not to make any corrections due to finite microtubule lengths.

      (10) It is not clear what is seen in Figure S6A: It looks like only single motors (green, w/o a DNA molecule) are walking ... Note: the influence of the attached DNA onto the stepping duration of a motor may depend on the DNA conformation (stretched and near to the microtubule (with neutravidin!) in the tethered case and spherically coiled in the untethered case).

      In Figure S6A kymograph, the green traces are GFP-labeled kinesin-1 without DNA attached (which are in excess) and the red diagonal trace is a motor with DNA attached. There are also two faint horizontal red traces, which are labeled DNA diffusing by (smearing over a large area during a single frame). Panel S6B shows run durations of motors with DNA attached. We agree that the DNA conformation will differ if it is attached and stretched (more linear) versus simply being transported (random coil), but by its nature this control experiment is only addressing random coil DNA.

      (11) Along this line: While the run time of kinesin-1 with DNA (1.4 s) is significantly shorter than the stall time (3.0 s), it is still larger than the unloaded run time (1.0 s). What do the authors think is the origin of this increase?

      Our interpretation of the unloaded kinesin-DNA result is that the much slower diffusion constant of the DNA relative to the motor alone enables motors to transiently detach and rebind before the DNA cargo has diffused away, thus extending the run duration. In contrast, such detachment events for motors alone normally result in the motor diffusing away from the microtubule, terminating the run. This argument has been used to reconcile the longer single-motor run lengths in the gliding assay versus the bead assay (Block et al., 1990). Notably, this slower diffusion constant should not play a role in the DNA tensiometer geometry because if the motor transiently detaches, then it will be pulled backward by the elastic forces of the DNA and detected as a slip or detachment event. We will address this point in the revision.

      (12) "The simplest prediction is that against the low loads experienced during ramps, the detachment rate should match the unloaded detachment rate." I disagree. I would already expect a slight increase.

      Agreed. We will change this text to: “The prediction for a slip bond is that against the low loads experienced during ramps, the detachment rate should be equal to or faster than the unloaded detachment rate.”

      (13) Isn't the model over-defined by fitting the values for the load-dependence of the strong-to-weak transition and fitting the load dependence into the transition to the slip state?

      Yes, the model is overdefined, but that is by design and it is still very useful. Our goal here was to make as simple a model as possible that could account for the data and use it to compare model parameters for the different motor families. Ignoring the complexity of the slip and detached states, a model with a strong and weak state in the stepping cycle and a single transition out of the stepping cycle is the simplest formulation possible. Moreover, having rate constants (k<sub>S-W</sub> and k<sub>slip</sub> in our case) that vary exponentially with load makes thermodynamic sense for modeling mechanochemistry (Howard, 2001). Thus, we were pleasantly surprised that this bare-bones model could recapitulate the unloaded and stall durations for all three motors (Fig. 5C-E).
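
      As an illustration of what "rate constants that vary exponentially with load" means, here is a minimal Python sketch assuming the common Bell-type form k(F) = k0 · exp(F · δ / kBT). The function name, the unloaded rate `k0`, and the distance parameter `delta_nm` are hypothetical placeholders chosen for illustration; they are not the fitted values or the implementation from the manuscript.

```python
import math

KT_PN_NM = 4.11  # thermal energy near room temperature, in pN*nm


def load_dependent_rate(k0, force_pn, delta_nm, kt=KT_PN_NM):
    """Bell-type rate constant: k(F) = k0 * exp(F * delta / kT).

    k0       -- rate at zero load (1/s), hypothetical value
    force_pn -- hindering load in pN
    delta_nm -- characteristic distance in nm (hypothetical value);
                positive values make the rate rise with load,
                negative values make it fall with load
    """
    return k0 * math.exp(force_pn * delta_nm / kt)


# Illustrative values only: a transition whose rate rises with load.
k_unloaded = load_dependent_rate(k0=1.0, force_pn=0.0, delta_nm=1.0)
k_loaded = load_dependent_rate(k0=1.0, force_pn=6.0, delta_nm=1.0)
```

      In this form a positive distance parameter gives slip-bond-like behavior (faster exit under load), while a negative one slows the transition as load increases.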

      (14) "When kinesin-1 was tethered to a glass coverslip via a DNA linker and hydrodynamic forces were imposed on an associated microtubule, kinesin-1 dissociation rates were relatively insensitive to loads up to ~3 pN, inconsistent with slip-bond characteristics (37)." This statement appears not to be true. In reference 37, very similar to the geometry reported here, the microtubules were fixed on the surface, and the stepping of single kinesin motors attached to large beads (to which defined forces were applied by hydrodynamics) via long DNA linkers was studied. In fact, quite a number of statements made in the present manuscript have been made already in ref. 37 (see in particular sections 2.6 and 2.7), and the authors may consider putting their results better into this context in the Introduction and Discussion. It is also noteworthy to discuss that the (admittedly limited) data in ref. 37 does not indicate a "catch-bond" behavior but rather an insensitivity to force over a defined range of forces.

      The reviewer misquoted our sentence. The actual wording of the sentence was: “When kinesin-1 was connected to micron-scale beads through a DNA linker and hydrodynamic forces parallel to the microtubule imposed, dissociation rates were relatively insensitive to loads up to ~3 pN, inconsistent with slip-bond characteristics (Urbanska et al., 2021).” The sentence the reviewer quoted was in a previous version that is available on bioRxiv, and perhaps they were reading that version. Nonetheless, in the revision we will note in the Discussion that this behavior was indicative of an ideal bond (not a catch-bond), and we will also add a sentence in the Introduction highlighting this work.

      Reviewer #3 (Public review):

      The authors attribute the differences in the behaviour of kinesins when pulling against a DNA tether compared to an optical trap to the differences in the perpendicular forces. However, the compliance is also much different in these two experiments. The optical trap acts like a ~ linear spring with stiffness ~ 0.05 pN/nm. The dsDNA tether is an entropic spring, with negligible stiffness at low extensions and very high compliance once the tether is extended to its contour length (Fig. 1B). The effect of the compliance on the results should be addressed in the manuscript.

      This is an interesting point. To address it, we calculated the predicted stiffness of the dsDNA by taking the slope of the theoretical force-extension curve in Fig. 1B. Below 650 nm extension, the stiffness is <0.001 pN/nm; it reaches 0.01 pN/nm at 855 nm, and at 960 nm, where the force is 6 pN, the stiffness is roughly 0.2 pN/nm. That value is higher than the quoted 0.05 pN/nm trap stiffness, but for reference, at this stiffness, an 8 nm step leads to a 1.6 pN jump in force, which is reasonable. Importantly, the stiffness of kinesin motors has been estimated to be in the range of 0.3 pN/nm (Coppin et al., 1996; Coppin et al., 1997). Granted, this stiffness is also nonlinear, but what this means is that even at stall, our dsDNA tether has a similar predicted compliance to the motor that is pulling on it. We will address this point in our revision.
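
      The stiffness estimate above (the slope of a force-extension curve) can be sketched in Python. This sketch assumes the Marko-Siggia worm-like-chain interpolation formula and hypothetical tether parameters (persistence length 50 nm, contour length 1020 nm, kBT = 4.11 pN·nm); these values are illustrative assumptions, not parameters taken from the manuscript or from Fig. 1B.

```python
KT_PN_NM = 4.11         # thermal energy near room temperature, pN*nm
PERSISTENCE_NM = 50.0   # assumed dsDNA persistence length (hypothetical)
CONTOUR_NM = 1020.0     # assumed tether contour length (hypothetical)


def wlc_force(x_nm, contour=CONTOUR_NM, persistence=PERSISTENCE_NM, kt=KT_PN_NM):
    """Marko-Siggia interpolation: force (pN) at extension x_nm (nm)."""
    u = x_nm / contour
    return (kt / persistence) * (1.0 / (4.0 * (1.0 - u) ** 2) - 0.25 + u)


def wlc_stiffness(x_nm, contour=CONTOUR_NM, persistence=PERSISTENCE_NM, kt=KT_PN_NM):
    """Analytical slope dF/dx (pN/nm) of the interpolation formula."""
    u = x_nm / contour
    return (kt / (persistence * contour)) * (1.0 / (2.0 * (1.0 - u) ** 3) + 1.0)


# Stiffness at the three extensions discussed in the text.
stiffness_by_extension = {x: wlc_stiffness(x) for x in (650.0, 855.0, 960.0)}

# Force jump produced by one 8 nm step at 960 nm extension (linearized).
force_jump_pn = 8.0 * wlc_stiffness(960.0)
```

      With these assumed parameters, the formula returns stiffness values close to those quoted above, and the linearized force jump for an 8 nm step is simply the local stiffness multiplied by the step size.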

      Compared to an optical trapping assay, the motors are also tethered closer to the microtubule in this geometry. In an optical trap assay, the bead could rotate when the kinesin is not bound. The authors should discuss how this tethering is expected to affect the kinesin reattachment and slipping. While likely outside the scope of this study, it would be interesting to compare the static tether used here with a dynamic tether like MAP7 or the CAP-GLY domain of p150glued.

      Please see our response to Reviewer #2 Major Comment #4 above, which asks this same question in the context of intracellular cargo. We plan to address this in our revision. Regarding a dynamic tether, we agree that’s interesting – there are kinesins that have a second, non-canonical binding site that achieves this tethering (ncd and Cin8); p150glued likely does this naturally for dynein-dynactin-activator complexes; and we speculated in a review some years ago (Hancock, 2014) that during bidirectional transport kinesin and dynein may act as dynamic tethers for one another when not engaged, enhancing the activity of the opposing motor.

      In the single-molecule extension traces (Figure 1F-H; S3), the kinesin-2 traces often show jumps in position at the beginning of runs (e.g., the four runs from ~4-13 s in Fig. 1G). These jumps are not apparent in the kinesin-1 and -3 traces. What is the explanation? Is kinesin-2 binding accelerated by resisting loads more strongly than kinesin-1 and -3?

      Due to the compliance of the dsDNA, the 95% limits for the initial attachment position are +/- 290 nm (Fig. S2). Thus, some apparent ‘jumps’ from the detached state are expected. We will take a closer look at why there are jumps for kinesin-2 that aren’t apparent for kinesin-1 or -3.

      When comparing the durations of unloaded and stall events (Fig. 2), there is a potential for bias in the measurement, where very long unloaded runs cannot be observed due to the limited length of the microtubule (Thompson, Hoeprich, and Berger, 2013), while the duration of tethered runs is only limited by photobleaching. Was the possible censoring of the results addressed in the analysis?

      Yes. Please see response to Reviewer #2 points (8) and (9) above.

      The mathematical model is helpful in interpreting the data. To assess how the "slip" state contributes to the association kinetics, it would be helpful to compare the proposed model with a similar model with no slip state. Could the slips be explained by fast reattachments from the detached state?

      In the model, the slip state and the detached states are conceptually similar; they differ only in the sequence (slip to detached) and the transition rates into and out of them. The simple answer is: yes, the slips could be explained by fast reattachments from the detached state. In that case, the slip state and recovery could be called a “detached state with fast reattachment kinetics”. However, the key data for defining the kinetics of the slip and detached states is the distribution of Recovery times shown in Fig. 4D-F, which required a triple exponential to account for all of the data. If we simplified the model by eliminating the slip state and incorporating fast reattachment from a single detached state, then the distribution of Recovery times would be a single exponential with a time constant equivalent to t<sub>1</sub>, which would be a poor fit to the experimental distributions in Fig. 4D-F.
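
      The distinction drawn above between a single-exponential and a multi-exponential distribution can be illustrated with a short Python sketch. It compares the slope of the log-survival curve at early and late times: a single exponential has a constant slope, whereas a mixture of exponentials flattens at long times. The amplitudes and time constants below are hypothetical values chosen for illustration and are not the fitted values from Fig. 4D-F.

```python
import math


def survival(t, amplitudes, taus):
    """Survival function of a mixture of exponentials (amplitudes sum to 1)."""
    return sum(a * math.exp(-t / tau) for a, tau in zip(amplitudes, taus))


def log_survival_slope(t0, t1, amplitudes, taus):
    """Average slope of log(survival) between times t0 and t1."""
    s0 = survival(t0, amplitudes, taus)
    s1 = survival(t1, amplitudes, taus)
    return (math.log(s1) - math.log(s0)) / (t1 - t0)


# Hypothetical parameters (time constants in seconds).
single = ([1.0], [0.1])
triple = ([0.6, 0.3, 0.1], [0.1, 1.0, 10.0])

# Single exponential: same slope (-1/tau) at early and late times.
early_single = log_survival_slope(0.0, 0.1, *single)
late_single = log_survival_slope(5.0, 5.1, *single)

# Triple exponential: steep early decay, much shallower late decay.
early_triple = log_survival_slope(0.0, 0.1, *triple)
late_triple = log_survival_slope(5.0, 5.1, *triple)
```

      A long tail of recovery times therefore cannot be captured by one time constant, which is the reason a model with a single detached state would fit such a distribution poorly.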

      We appreciate the efforts and helpful suggestions of all three reviewers and the Editor.

      References:

      Block, S.M., L.S. Goldstein, and B.J. Schnapp. 1990. Bead movement by single kinesin molecules studied with optical tweezers. Nature. 348:348-352.

      Bouchiat, C., M.D. Wang, J. Allemand, T. Strick, S.M. Block, and V. Croquette. 1999. Estimating the persistence length of a worm-like chain molecule from force-extension measurements. Biophys J. 76:409-413.

      Coppin, C.M., J.T. Finer, J.A. Spudich, and R.D. Vale. 1996. Detection of sub-8-nm movements of kinesin by high-resolution optical-trap microscopy. Proc Natl Acad Sci U S A. 93:1913-1917.

      Coppin, C.M., D.W. Pierce, L. Hsu, and R.D. Vale. 1997. The load dependence of kinesin's mechanical cycle. Proc Natl Acad Sci U S A. 94:8539-8544.

      Ezber, Y., V. Belyy, S. Can, and A. Yildiz. 2020. Dynein Harnesses Active Fluctuations of Microtubules for Faster Movement. Nat Phys. 16:312-316.

      Hancock, W.O. 2014. Bidirectional cargo transport: moving beyond tug of war. Nat Rev Mol Cell Biol. 15:615-628.

      Howard, J. 2001. Mechanics of Motor Proteins and the Cytoskeleton. Sinauer Associates, Inc., Sunderland, MA. 367 pp.

      Kunwar, A., S.K. Tripathy, J. Xu, M.K. Mattson, P. Anand, R. Sigua, M. Vershinin, R.J. McKenney, C.C. Yu, A. Mogilner, and S.P. Gross. 2011. Mechanical stochastic tug-of-war models cannot explain bidirectional lipid-droplet transport. Proc Natl Acad Sci U S A. 108:18960-18965.

      Kuo, Y.W., M. Mahamdeh, Y. Tuna, and J. Howard. 2022. The force required to remove tubulin from the microtubule lattice by pulling on its alpha-tubulin C-terminal tail. Nature communications. 13:3651.

      Laakso, J.M., J.H. Lewis, H. Shuman, and E.M. Ostap. 2008. Myosin I can act as a molecular force sensor. Science. 321:133-136.

      Leidel, C., R.A. Longoria, F.M. Gutierrez, and G.T. Shubeita. 2012. Measuring molecular motor forces in vivo: implications for tug-of-war models of bidirectional transport. Biophys J. 103:492-500.

      Marko, J.F., and E.D. Siggia. 1995. Stretching DNA. Macromolecules. 28:8759-8770.

      Nicholas, M.P., F. Berger, L. Rao, S. Brenner, C. Cho, and A. Gennerich. 2015. Cytoplasmic dynein regulates its attachment to microtubules via nucleotide state-switched mechanosensing at multiple AAA domains. Proc Natl Acad Sci U S A. 112:6371-6376.

      Purcell, E.M. 1977. Life at low Reynolds Number. Amer J. Phys. 45:3-11.

      Pyrpassopoulos, S., H. Shuman, and E.M. Ostap. 2020. Modulation of Kinesin's Load-Bearing Capacity by Force Geometry and the Microtubule Track. Biophys J. 118:243-253.

      Rai, A.K., A. Rai, A.J. Ramaiya, R. Jha, and R. Mallik. 2013. Molecular adaptations allow dynein to generate large collective forces inside cells. Cell. 152:172-182.

      Ramaiya, A., B. Roy, M. Bugiel, and E. Schaffer. 2017. Kinesin rotates unidirectionally and generates torque while walking on microtubules. Proc Natl Acad Sci U S A. 114:10894-10899.

      Rao, L., F. Berger, M.P. Nicholas, and A. Gennerich. 2019. Molecular mechanism of cytoplasmic dynein tension sensing. Nature communications. 10:3332.

      Smith, S.B., L. Finzi, and C. Bustamante. 1992. Direct mechanical measurements of the elasticity of single DNA molecules by using magnetic beads. Science. 258:1122-1126.

      Sudhakar, S., M.K. Abdosamadi, T.J. Jachowski, M. Bugiel, A. Jannasch, and E. Schaffer. 2021. Germanium nanospheres for ultraresolution picotensiometry of kinesin motors. Science. 371.

      Toleikis, A., N.J. Carter, and R.A. Cross. 2020. Backstepping Mechanism of Kinesin-1. Biophys J. 119:1984-1994.

      Urbanska, M., A. Ludecke, W.J. Walter, A.M. van Oijen, K.E. Duderstadt, and S. Diez. 2021. Highly-Parallel Microfluidics-Based Force Spectroscopy on Single Cytoskeletal Motors. Small. 17:e2007388.

      Wang, M.D., H. Yin, R. Landick, J. Gelles, and S.M. Block. 1997. Stretching DNA with optical tweezers. Biophys J. 72:1335-1346.

    1. Author response:

      Reviewer #1:

      The issue on validation of injection sites and viral spread is an important one, and we are fully aware of the risks associated with an incomplete assessment. Note that in the supplementary material, section on ‘Brain area identification’ we write the following: ‘In all neuroanatomical tracing experiments, correct placement of tracer injections into the four different areas (MEC, PER, PIR and LEC) was carefully evaluated based on known cytoarchitectonic features (see below). Electrophysiological experiments were initiated after our neuroanatomical experiments had verified the correct surgery coordinates for interrogating pathways to LEC from MEC, PIR, PER and cLEC. In patch-clamp experiments, viral injections were considered to hit the intended target area whenever the axonal innervation patterns in LEC were consistent with the patterns obtained in our neuroanatomical tracing experiments. To ensure that our injections were placed in MEC, without unintended spread to LEC, we examined the innervation patterns in DG.

      In agreement with the current understanding of entorhinal innervation of DG in rodents (Steward, 1976; van Groen et al., 2003), injections targeting MEC or LEC resulted in axonal labelling in the middle one-third or outer one-third of the molecular layer of DG, respectively. Cases where the injection had clearly spread to LEC, evident from the laminar distribution of labelling in DG and labelled cell bodies in LEC, were excluded from analysis.’

      In our view this provides sufficient assurance that we did not mistakenly include intrinsic LEC projections in our dataset. In the Results section, we addressed this issue as well by stating that: ‘We carefully checked all sections at and close to the levels we used for our experiments and did not observe any virally labelled neurons in LEC.’ In electrophysiological experiments, one normally does not retain whole-brain material to exclude viral spread, but for each animal we recorded from multiple adjacent thick slices, and in none did we find indications that the injection had included LEC. Finally, we included an analysis of SST projections originating from LEC (Supplementary Figure 1). As can be seen from panel C, the local SST axonal pattern in LEC is markedly different from that seen following an injection in MEC. We aim to provide additional supplementary detail on this and include it in the text of the revised version.

      Reviewer #2:

      The remark that the in vivo relevance of these connections remains to be determined is absolutely correct, and in the discussion we only speculated on this, since we currently do not have functional data of sufficient quality to address it. However, in an earlier version of the paper, still accessible on bioRxiv (https://biorxiv.org/cgi/content/short/2022.11.29.518323v1), we did include data on changes in expression of the immediate early gene cFos in LEC layer IIa cells upon manipulation of the SST projections from MEC within the context of conspecific memory. These data resulted in a non-significant trend, but we have neither the time nor the financial means to extend that dataset. Therefore, we cannot revise the paper in this respect.

    1. Author response:

      Reviewer #1 (Public review):

      Fombellida-Lopez and colleagues describe the results of an ART intensification trial in people with HIV infection (PWH) on suppressive ART to determine the effect of increasing the dose of one ART drug, dolutegravir, on viral reservoirs, immune activation, exhaustion, and circulating inflammatory markers. The authors hypothesize that ART intensification will provide clues about the degree to which low-level viral replication is occurring in circulation and in tissues despite ongoing ART, which could be identified if reservoirs decrease and/or if immune biomarkers change. The trial design is straightforward and well-described, and the intervention appears to have been well tolerated. The investigators observed an increase in dolutegravir concentrations in circulation, and to a lesser degree in tissues, in the intervention group, indicating that the intervention has functioned as expected (ART has been intensified in vivo). Several outcome measures changed during the trial period in the intervention group, leading the investigators to conclude that their results provide strong evidence of ongoing replication on standard ART. The results of this small trial are intriguing, and a few observations in particular are hypothesis-generating and potentially justify further clinical trials to explore them in depth. However, I am concerned about over-interpretation of results that do not fully justify the authors' conclusions.

      We thank Reviewer #1 for their thoughtful and constructive comments, which helped us clarify and improve the manuscript. Below, we address each of the reviewer’s points and describe the changes that we implemented in the revised version. We acknowledge the reviewer’s concern regarding potential overinterpretation of certain findings, and in the revised version we took particular care to ensure that all conclusions are supported by the data and framed within the exploratory nature of the study.

      (1) Trial objectives: What was the primary objective of the trial? This is not clearly stated. The authors describe changes in some reservoir parameters and no changes in others. Which of these was the primary outcome? No a priori hypothesis / primary objective is stated, nor is there explicit justification (power calculations, prior in vivo evidence) for the small n, unblinded design, and lack of placebo control. In the abstract (line 36, "significant decreases in total HIV DNA") and conclusion (lines 244-246), the authors state that total proviral DNA decreased as a result of ART intensification. However, in Figures 2A and 2E (and in line 251), the authors indicate that total proviral DNA did not change. These statements are confusing and appear to be contradictory. Regarding the decrease in total proviral DNA, I believe the authors may mean that they observed transient decrease in total proviral DNA during the intensification period (day 28 in particular, Figure 2A), however this level increases at Day 56 and then returns to baseline at Day 84, which is the source of the negative observation. Stating that total proviral DNA decreased as a result of the intervention when it ultimately did not is misleading, unless the investigators intended the day 28 timepoint as a primary endpoint for reservoir reduction - if so, this is never stated, and it is unclear why the intervention would then be continued until day 84? If, instead, reservoir reduction at the end of the intervention was the primary endpoint (again, unstated by the authors), then it is not appropriate to state that the total proviral reservoir decreased significantly when it did not.

      We agree with the reviewer that the primary objective of the study was not explicitly stated in the submitted manuscript. We clarified this in the revised manuscript (lines 361-364). As registered on ClinicalTrials.gov (NCT05351684), the primary outcome was defined as “To evaluate the impact of treatment intensification at the level of total and replication-competent reservoir (RCR) in blood and in tissues”, with a time frame of 3 months. Accordingly, our aim was to explore whether any measurable reduction in the HIV reservoir (total or replication-competent) occurred during the intensification period, including at day 28, 56, or 84. The protocol did not prespecify a single time point for this effect to occur, and the exploratory design allowed for detection of transient or sustained changes within the intensification window.

      We recognize that this scope was not clearly articulated in the original text and may have led to confusion in interpreting the transient drop in total HIV DNA observed at day 28. While total DNA ultimately returned to baseline by the end of intensification, the presence of a transient reduction during this 3-month window still fits within the framework of the study’s registered objective. Moreover, although the change in total HIV DNA was transient, it aligns with the consistent direction of changes observed across the multiple independent measures, including CA HIV RNA, RNA/DNA ratio and intact HIV DNA, collectively supporting a biological effect of intensification.

      We would also like to stress that this is the first clinical trial in which ART intensification is performed not by adding an extra drug but by increasing the dosage of an existing drug. Therefore, we were more interested in the overall, cumulative effect of intensification throughout the entire trial period than in differences between groups at individual time points. We clarified in the revised manuscript that this was a proof-of-concept phase 2 study, designed to reveal biological effects of ART intensification rather than confirm efficacy in a powered comparison. The absence of a prespecified statistical endpoint or sample size calculation reflects the exploratory nature of the trial.

      (2) Intervention safety and tolerability: The results section lacks a specific heading for participant safety and tolerability of the intervention. I was wondering about clinically detectable viremia in the study. Were there any viral blips? Was the increased DTG well tolerated? This drug is known to cause myositis, headache, CPK elevation, and hepatotoxicity. Were any of these observed? What is the authors' interpretation of the CD4:8 ratio change (line 198)? Is this a significant safety concern for a longer duration of intensification? Was there also a change in CD4% or only in absolute counts? Was there relative CD4 depletion observed in the rectal biopsy samples between days 0 and 84? Interestingly, T cells dropped at the same timepoints that reservoirs declined... how do the authors rule out that reservoir decline reflects transient T cell decline that is non-specific (not due to additional blockade of replication)?

      We improved the Methods section to clarify how safety and tolerability were assessed during the study (lines 389-396). Safety evaluations were conducted on day 28 and day 84 and included a clinical examination and routine laboratory testing (liver function tests, kidney function, and complete blood count). Medication adherence was also monitored through pill counts performed by the study nurses.

      No virological blips above 50 copies/mL were observed and no adverse events were reported by participants during the 3-month intensification period. Although CPK levels were not included in the routine biological monitoring, no participant reported muscle pain or other symptoms suggestive of muscle toxicity.

      The CD4:CD8 ratio decrease noted during intensification was not associated with significant changes in absolute CD4 or CD8 counts, as shown in Figure 5. We interpret this ratio change as a transient redistribution rather than an immunological risk; therefore, we do not consider it to represent a safety concern.

      We would like to clarify that CD4⁺ T-cell counts did not significantly decrease in any of the treatment groups, as shown in Figure 5. The apparent decline concerns the CD4/CD8 ratio, which transiently dropped, but not the absolute number of CD4⁺ T cells. Moreover, although the dynamics of total HIV DNA are indeed similar to those of the CD4/CD8 ratio (both declined transiently and then returned to baseline by day 84), the dynamics of unspliced RNA and of the unspliced RNA/total DNA ratio are clearly different: these markers showed a sustained decrease that was maintained throughout the trial period, even after the CD4/CD8 ratio had returned to baseline. We also observed a significant decrease in intact HIV DNA at day 84 compared to day 0. These effects cannot be easily explained by a transient decline in CD4⁺ cells.

      (3) The investigators describe a decrease in intact proviral DNA after 84 days of ART intensification in circulating cells (Figure 2D), but no changes to total proviral DNA in blood or tissue (Figures 2A and 2E; IPDA does not appear to have been done on tissue samples). It is not clear why ART intensification would result in a selective decrease in intact proviruses and not in total proviruses if the source of these reservoir cells is due to ongoing replication. These reservoir results have multiple interpretations, including (but not limited to) the investigators' contention that this provides strong evidence of ongoing replication. However, ongoing replication results in the production of both intact and mutated/defective proviruses that both contribute to reservoir size (with defective proviruses vastly outnumbering intact proviruses). The small sample size and well-described heterogeneity of the HIV reservoir (with regard to overall size and composition) raise the possibility that the study was underpowered to detect differences over the 84-day intervention period. No power calculations or prior studies were described to justify the trial size or the duration of the intervention. Readers would benefit from a more nuanced discussion of reservoir changes observed here.

      We sincerely thank the reviewer for this insightful comment. We fully agree that the reservoir dynamics observed in our study are open to several possible interpretations, and that their complexity, resulting from continuous cycles of expansion and contraction, reflects the heterogeneity of the latent reservoir.

      Total HIV DNA in PBMCs showed a transient decline during intensification (notably at day 28), ultimately returning to baseline by day 84. This biphasic pattern likely reflects the combined effects of suppression of ongoing low-level replication by an increased DTG dosage, followed by the expansion of infected cell clones (mostly harbouring defective proviruses). In other words, the transient decrease in total (intact + defective) DNA at day 28 may be due to an initial decrease in newly infected cells upon ART intensification; at the subsequent time points, however, this effect was masked by proliferation (clonal expansion) of infected cells with defective proviruses. Recent studies suggest that intact and defective proviruses are subjected to different selection pressures by the immune system on ART (PMID: 38337034) and that their decay on therapy differs (intact proviruses are cleared much more rapidly than defective ones). In addition, defective proviruses can be preferentially expanded as they can reprogram the host cell proliferation machinery (https://doi.org/10.1101/2025.09.22.676989). This explains why, in our study, the intact proviruses decreased but the total proviruses did not change between days 0 and 84 in the intensification group. Interestingly, in the control group, we observed a significant increase in total DNA at day 84 compared to day 0, with no difference for the intact DNA, which is also in line with the clonal expansion of defective proviruses.

      Importantly, we observed a significant decrease in intact proviral DNA between day 0 and day 84 in the intensification group (Figure 2D). This result directly addresses the study’s primary objective: assessing the impact of intensification on the replication-competent reservoir. In comparison, as the reviewer rightly points out, total HIV DNA includes over 90% defective genomes, which limits its interpretability as a biomarker of biologically relevant reservoir changes. In addition, other reservoir markers, such as cell-associated unspliced RNA and RNA/DNA ratios, also showed consistent trends supporting a biologically relevant effect of intensification. Even in the absence of sustained changes in total HIV DNA, the coherence across the different independent measures of the reservoir (intact DNA, unspliced RNA) suggests an effect indicative of ongoing replication pre-intensification.

      Regarding tissue reservoirs, the lack of substantial change in total HIV DNA between days 0 and 84 is also in line with the predominance of defective sequences in these compartments. Moreover, the limited increase in rectal tissue dolutegravir levels during intensification (from 16.7% to 20% of plasma concentrations) may have limited the efficacy of the intervention in this site.

      As for the IPDA on rectal biopsies, we attempted the assay using two independent DNA extraction methods (Promega Reliaprep and Qiagen Puregene), but both yielded high DNA shearing index values, and intact proviral detection was successful in only 3 of 40 samples. Given the poor DNA integrity, these results were not interpretable.

      That said, we fully acknowledge the limitations of our study, especially the small sample size, and we agree with the reviewer that caution is needed when interpreting these findings. In the revised manuscript, we adopted a more measured tone in the discussion (lines 340-346), stating that these observations are exploratory and hypothesis-generating, and require confirmation in larger, better-powered studies. Nonetheless, we believe that the convergence of multiple reservoir markers pointing in the same direction constitutes a meaningful biological effect that deserves further investigation.

      (4) While a few statistically significant changes occurred in immune activation markers, it is not clear that these are biologically significant. Lines 175-186 and Figure 3: The change in CD4 cells + for TIGIT looks as though it declined by only 1-2%, and at day 84, the confidence interval appears to widen significantly at this timepoint, spanning an interquartile range of 4%. The only other immune activation/exhaustion marker change that reached statistical significance appears to be CD8 cells + for CD38 and HLA-DR, however, the decline appears to be a fraction of a percent, with the control group trending in the same direction. Despite marginal statistical significance, it is not clear there is any biological significance to these findings; Figure S6 supports the contention that there is no significant change in these parameters over time or between groups. With most markers showing no change and these two showing very small changes (and the latter moving in the same direction as the control group), these results do not justify the statement that intensifying DTG decreases immune activation and exhaustion (lines 38-40 in the abstract and elsewhere).

      We agree with the reviewer that the observed changes in immune activation and exhaustion markers were modest. We revised the abstract and the manuscript text (including a section header) to reflect this more accurately (lines 39, 175, 185, 253). We noted that these differences, while statistically significant (e.g., in TIGIT+ CD4+ T cells and CD38+HLA-DR+ CD8+ T cells), were limited in magnitude. We explicitly acknowledged these limitations and interpreted the findings with appropriate caution.

      (5) There are several limitations of the study design that deserve consideration beyond those discussed at line 327. The study was open-label and not placebo-controlled, which may have led to some medication adherence changes that confound results (authors describe one observation that may be evidence of this; lines 146-148). Randomized/blinded/cross-over design would be more robust and help determine signal from noise, given relatively small changes observed in the intervention arm. There does not seem to be a measurement of key outcome variables after treatment intensification ceased - evidence of an effect on replication through ART intensification would be enhanced by observing changes once intensification was stopped. Why was intensification maintained for 84 days? More information about the study duration would be helpful. Table 1 indicates that participants were 95% male. Sex is known to be a biological variable, particularly with regard to HIV reservoir size and chronic immune activation in PWH. Worldwide, 50% of PWH are women. Research into improving management/understanding of disease should reflect this, and equal participation should be sought in trials. Table 1 shows differing baseline reservoir sizes between the control and intervention groups. This may have important implications, particularly for outcomes where reservoir size is used as the denominator.

      We expanded the limitations section to address several key aspects raised by the reviewer: the absence of blinding and placebo control, the predominantly male study population, and the lack of post-intervention follow-up. While we acknowledge that open-label designs can introduce behavioural biases, including potential changes in adherence, we now explicitly state that placebo-controlled, blinded trials would provide a more robust assessment and are warranted in future research (lines 340-346).

      The 84-day duration of intensification was chosen based on previous studies and provided sufficient time for observing potential changes in viral transcription and reservoir dynamics. However, we agree that including post-intervention follow-up would have strengthened the conclusions, and we highlighted this limitation and future direction in the revised manuscript (lines 340-346). 

      The sex imbalance is now clearly acknowledged as a limitation in the revised manuscript, and we fully support ongoing efforts to promote equitable recruitment in HIV research. We would like to add that, in our study, rectal biopsies were coupled with anal cancer screening through HPV testing. This screening is specifically recommended for younger men who have sex with men (MSM), as outlined in the current EACS guidelines (see: https://eacs.sanfordguide.com/eacs part2/cancer/cancerscreening-methods). As a result, MSM participants had both a clinical incentive and medical interest to undergo this procedure, which likely contributed to the higher proportion of male participants in the study.

      Lastly, although baseline total HIV DNA was higher in the intensified group, our statistical approach is based on a within-subject (repeated-measures) design, in which the longitudinal change of a parameter within the same participant during the study was the main outcome. In other words, we are not comparing absolute values of any marker between the groups; rather, we are looking at changes of parameters from baseline within participants, and these are not expected to be affected by baseline imbalances.

      (6) Figure 1: the increase in DTG levels is interesting - it is not uniform across participants. Several participants had lower levels of DTG at the end of the intervention. Though unlikely to be statistically significant, it would be interesting to evaluate if there is a correlation between change in DTG concentrations and virologic / reservoir / inflammatory parameters. A positive relationship between increasing DTG concentration and decreased cell-associated RNA, for example, would help support the hypothesis that ongoing replication is occurring.

      We agree with the reviewer that assessing correlations between DTG concentrations and virological, immunological, or inflammatory markers would be highly informative. In fact, we initially explored this question in a preliminary way by examining whether individuals who showed a marked increase in DTG levels after intensification also demonstrated stronger changes in the viral reservoir. While this exploratory analysis did not reveal any clear associations, we would like to emphasize that correlating biological effects with DTG concentrations measured at a single timepoint may have limited interpretability. A more comprehensive understanding of the relationship between drug exposure and reservoir dynamics would ideally require multiple pharmacokinetic measurements over time, including pre-intensification baselines. This is particularly important given that DTG concentrations vary across individuals and over time, depending on adherence, metabolism, and other individual factors.

      (7) Figure 2: IPDA in tissue- was this done? scRNA in blood (single copy assay) - would this be expected to correlate with usCaRNA? The most unambiguous result is the decrease in cell-associated RNA - accompanying results using single-copy assay in plasma would be helpful to bolster this result.

      As mentioned in our response to point 3, we attempted IPDA on tissue samples, but technical limitations prevented reliable detection of intact proviruses. Regarding residual viremia, we did perform ultra-sensitive plasma HIV RNA quantification, but a technical issue (an inadvertent PBMC contamination during plasma separation) affected the reliability of the results, so we felt uncomfortable including these data in the manuscript.

      The use of the US RNA / Total DNA ratio is not helpful/difficult to interpret since the control and intervention arms were unmatched for total DNA reservoir size at study entry.

      We respectfully disagree with this comment. The US RNA/total DNA ratio is commonly used to assess the relative transcriptional activity of the viral reservoir, rather than its absolute size. While we acknowledge that the total HIV-1 DNA levels differed at baseline between the two groups, the US RNA/total DNA ratio specifically reflects the relationship between transcriptional activity and reservoir size within each individual, and is therefore not directly confounded by baseline differences in total DNA alone.

      Moreover, our analyses focus on within-subject longitudinal changes from baseline, not on direct between-group comparisons of absolute marker values. As such, the observed changes in the US RNA/total DNA ratio over time are interpreted relative to each participant's baseline, mitigating concerns related to baseline imbalances between groups.
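      The within-subject reasoning in this response can be illustrated with a minimal sketch. It is not the authors' analysis code: the function name `ratio_fold_change` and all numbers are hypothetical. It shows that two participants whose reservoirs differ tenfold in absolute size yield the same fold change in the US RNA/total DNA ratio when their proportional changes are the same.

```python
def ratio_fold_change(rna_day0, dna_day0, rna_day84, dna_day84):
    """Fold change from baseline of the US RNA / total DNA ratio for one participant."""
    ratio_day0 = rna_day0 / dna_day0
    ratio_day84 = rna_day84 / dna_day84
    return ratio_day84 / ratio_day0

# Hypothetical participants (arbitrary units), not study data.
# The second has a 10-fold larger reservoir but the same proportional change.
small_reservoir = ratio_fold_change(rna_day0=200.0, dna_day0=100.0,
                                    rna_day84=50.0, dna_day84=100.0)
large_reservoir = ratio_fold_change(rna_day0=2000.0, dna_day0=1000.0,
                                    rna_day84=500.0, dna_day84=1000.0)

print(small_reservoir, large_reservoir)
```

      In this toy case the absolute baseline cancels out of the per-participant fold change. The sketch does not address whether the biological response itself depends on reservoir size, which is the reviewer's underlying concern.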

      Reviewer #2 (Public review):

      Summary:

      An intensification study with a double dose of a second-generation integrase inhibitor, on a background of nucleoside analog inhibitors of the HIV reverse transcriptase, in 20 individuals randomized with controls, with an impact on the levels of viral reservoirs and inflammation markers. Viral reservoirs in HIV are the main impediment to an HIV cure, and inflammation is associated with the development of co-morbidities.

      Strengths:

      The intervention that leads to a decrease of viral reservoirs and inflammation is quite straightforward, as a doubling of the INSTI is used in some individuals with INSTI resistance, with good tolerability.

      This is a very well documented study, both in blood and tissues, which is a great achievement due to the difficulty of body sampling in well-controlled individuals on antiretroviral therapy. The laboratory assays are performed by specialists in the field with state-of-the-art quantification assays. Both the introduction and the discussion are remarkably well presented and documented.

      The findings also have a potential impact on the management of chronic HIV infection.

      Weaknesses:

      I do not think that the size of the study can be considered a weakness, nor the fact that it is open-label either.

      We thank Reviewer #2 for their constructive and supportive comments. We appreciate their positive assessment of the study design, the translational relevance of the intervention, and the technical quality of the assays. We also take note of their perspective regarding sample size and study design, which supports our positioning of this trial as an exploratory, hypothesis-generating phase 2 study.

      Reviewer #3 (Public review):

      The introduction does a very good job of discussing the issue around whether there is ongoing replication in people with HIV on antiretroviral therapy. Sporadic, non-sustained replication likely occurs in many PWH on ART, related to adherence, drug-drug interactions, and possibly penetration of antivirals into sanctuary areas of replication. As the authors point out, proving it does not occur is likely not possible, and proving it does occur is likely very dependent on the population studied and the design of the intervention. Whether the consequences of this replication in the absence of evolution toward resistance have clinical significance is a challenging question to address.

      It is important to note that INSTI-based therapy may have a different impact on HIV replication events that results in differences in virus release for specific cell types (those responsible for "second phase" decay) by blocking integration in cells that have completed reverse transcription prior to ART initiation but have yet to be fully activated. In a PI or NNRTI-based regimen, those cells will release virus, whereas with an INSTI-based regimen, they will not.

      Given the very small sample size, there is a substantial risk of imbalance between the groups in important baseline measures. Unfortunately, with the small sample size, a non-significant P value is not helpful when comparing baseline measures between groups. One suggestion would be to provide the full range as opposed to the inter-quartile range (essentially only 5 or 6 values). The authors could also report the proportion of participants with baseline HIV RNA target not detected in the two groups.

      We thank Reviewer #3 for their thoughtful and balanced review. We are grateful for the recognition of the strength of the Introduction, the complexity of evaluating residual replication, and the technical execution of the assays. We also appreciate the insightful suggestions for improving the clarity and transparency of our results and discussion.

      We revised the manuscript to address several of the reviewer’s key concerns. We agree that the small sample size increases the risk of baseline imbalances. We acknowledged these limitations in the manuscript (lines 327-330). For transparency, we now provide both the full range and the IQR for all parameters in Table 1. However, we would like to stress that our statistical approach is based on a within-subject (repeated-measures) design, in which the longitudinal change of a parameter within the same participant during the study was the main outcome. In other words, we are not comparing absolute values of any marker between the groups; rather, we are looking at changes of parameters from baseline within participants, and these are not expected to be affected by baseline imbalances.

      A suggestion that there is a critical imbalance between groups is that the control group has significantly lower total HIV DNA in PBMC, despite the small sample size. The control group also has numerically longer time of continuous suppression, lower unspliced RNA, and lower intact proviral DNA. These differences may have biased the ability to see changes in DNA and US RNA in the control group.

      We acknowledge the significant baseline difference in total HIV DNA between groups, which we have clearly reported. However, the other variables mentioned, such as duration of continuous viral suppression, unspliced RNA levels, and intact proviral DNA, did not differ significantly between groups at baseline, despite differences in the median values (that are always present). These numerical differences do not necessarily indicate a critical imbalance.

      Notably, there was no significant difference in the change in US RNA/DNA between groups (Figure 2C).

      The nonsignificant difference in the change in US RNA/total DNA between groups is not unexpected, given the significant between-group differences for both US RNA and total DNA changes. Since the ratio combines both markers, it is likely to show attenuated between-group differences compared to the individual components. However, while the difference did not reach statistical significance (p = 0.09), we still observed a trend towards a greater reduction in the US RNA/total DNA ratio in the intervention group.
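      The attenuation argument can be made explicit with a short identity. The symbols are introduced here for illustration only: $R_t$ and $D_t$ denote a participant's US RNA and total DNA at day $t$.

```latex
\log\!\left(\frac{R_{84}/D_{84}}{R_{0}/D_{0}}\right)
  = \log\!\left(\frac{R_{84}}{R_{0}}\right)
  - \log\!\left(\frac{D_{84}}{D_{0}}\right)
```

      On the log scale, the change in the ratio is the change in US RNA minus the change in total DNA. If both markers decrease in the same participants, the second term offsets part of the first, so the ratio moves less than US RNA alone.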

      The fact that the median relative change appears very similar in Figure 2C, yet there is a substantial difference in P values, is also a comment on the limits of the current sample size. 

      Although we surely agree that in general, the limited sample size impacts statistical power, we would like to point out that in Figure 2C, while the medians may appear similar, the ranges do differ between groups. At days 56 and 84, the median fold changes from baseline are indeed close but the full interquartile range in the DTG group stays below 1, while in the control group, the interquartile range is wider and covers approximately equal distance above and below 1. This explains the difference in p values between the groups.
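      A small numerical sketch of this point follows. The fold changes are invented for illustration and are not the study data; `median_and_iqr` is a hypothetical helper. Two groups can have close medians while only one of them has an interquartile range lying entirely below 1.

```python
import statistics

def median_and_iqr(fold_changes):
    """Return (median, (Q1, Q3)) for a list of fold changes from baseline."""
    q1, q2, q3 = statistics.quantiles(fold_changes, n=4, method="inclusive")
    return q2, (q1, q3)

# Hypothetical fold changes from baseline (1.0 means no change); not study data.
intensified = [0.5, 0.6, 0.7, 0.8, 0.9]
control = [0.4, 0.6, 0.8, 1.4, 1.7]

for name, values in (("intensified", intensified), ("control", control)):
    median, (q1, q3) = median_and_iqr(values)
    spans_one = q1 < 1.0 < q3
    print(f"{name}: median={median:.2f}, IQR=({q1:.2f}, {q3:.2f}), "
          f"IQR spans 1: {spans_one}")
```

      In this toy data the medians are 0.7 and 0.8, yet only the control group's interquartile range straddles 1, which is the pattern the response describes for Figure 2C.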

      The text should report the median change in US RNA and US RNA/DNA when describing Figures 2A-2C.

      These data are already reported in the Results section (lines 164–166): "By day 84, US RNA and US RNA/total DNA ratio had decreased from day 0 by medians (IQRs) of 5.1 (3.3–6.4) and 4.6 (3.1–5.3) fold, respectively (p = 0.016 for both markers)."

      This statistical comparison of changes in IPDA results between groups should be reported. The presentation of the absolute values of all the comparisons in the supplemental figures is a strength of the manuscript.

      In the assessment of ART intensification on immune activation and exhaustion, the fact that none of the comparisons between randomized groups were significant should be noted and discussed.

      We would like to point out that a statistically significant difference between the randomized groups was observed for the frequency of CD4⁺ T cells expressing TIGIT, as shown in Figure 3A and reported in the Results section (p = 0.048).

      The changes in CD4:CD8 ratio and sCD14 levels appear counterintuitive to the hypothesis and are commented on in the discussion.

      Overall, the discussion highlights the significant changes in the intensified group, which are suggestive. There is limited discussion of the comparisons between groups where the results are less convincing.

      We observed statistically significant differences between the randomized groups for total DNA (p<0.001) and US RNA (p=0.01), as well as for the frequency of CD4⁺ T cells expressing TIGIT (p=0.048). We would like to stress that US RNA is a key marker of residual replication as it is very sensitive to de novo infection events. As discussed in the manuscript (lines 291-294), a newly infected CD4+ T lymphocyte can contain hundreds to thousands of US HIV RNA copies at the peak of infection. Therefore, a change in the US RNA level upon ART intensification is a very sensitive indicator of new infections. The fact that for US RNA we observed both a significant reduction in the intensified group and a significant difference between the groups is a strong indicator that some new infections had been occurring prior to intensification.

      The limitations of the study should be more clearly discussed. The small sample size raises the possibility of imbalance at baseline. The supplemental figures (S3-S5) are helpful in showing the differences between groups at baseline, and the variability of measurements is more apparent. The lack of blinding is also a weakness, though the PK assessments do help (note 3TC levels rise substantially in both groups for most of the time on study (Figure S2)).

      The many assays and comparisons are listed as a strength. The many comparisons raise the possibility of finding significance by chance. In addition, if there is an imbalance at baseline, outcomes measuring related parameters will move in the same direction.

      We agree that the multiple comparisons raise the possibility of chance findings but would like to stress that in an exploratory study like this it is very important to avoid a type II error. In addition, the consistent directionality of the most relevant outcomes (US RNA and intact DNA) lends biological plausibility to the observed effects.
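      The trade-off between chance findings and type II error can be sketched numerically. This is an illustration under stated assumptions, not a reanalysis: the count of 20 comparisons is hypothetical, and the formula assumes independent tests, which related virological markers are not.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise rate near alpha."""
    return alpha / n_tests

alpha = 0.05
n_tests = 20  # hypothetical number of comparisons, not taken from the manuscript

print(f"family-wise error rate: {familywise_error_rate(alpha, n_tests):.2f}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold(alpha, n_tests):.4f}")
```

      With these assumed numbers, roughly two in three such studies would show at least one nominally significant result by chance, while a Bonferroni correction would lower the per-test threshold to 0.0025 and so reduce power in a small exploratory trial.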

      The limited impact on activation and inflammation should be addressed in the discussion, as they are highlighted as a potentially important consequence of intermittent, not sustained replication in the introduction.

      The study is provocative and well executed, with the limitations listed above. Pharmacokinetic analyses help mitigate the lack of blinding. The major impact of this work is if it leads to a much larger randomized, controlled, blinded study of a longer duration, as the authors point out.

      Finally, we fully endorse the reviewer’s suggestion that the primary contribution of this study lies in its value as a proof-of-concept and foundation for future randomized, blinded trials of greater scale and duration. We highlighted this more clearly in the revised Discussion (lines 340-346).

      Reviewer #1 (Recommendations for the authors):

      (1) Lines 84-87: How would chronic immune activation/inflammation be expected to differ if viral antigen is being released from stable reservoirs rather than low-level replication?

      This is a very insightful question. Although release of viral antigens from stable reservoirs could certainly also trigger immune activation/inflammation, the reservoir cells in PWH on long-term ART are constantly being negatively selected by the immune system (PMID: 38337034; PMID: 36596305) so that after a number of years on therapy, most proviruses are either transcriptionally silent or express only a low amount of viral RNA/antigen. Recent evidence suggests that these selected cells possess specific biological properties that include mechanisms that limit proviral gene expression (PMID: 36599977; PMID: 36599978). In comparison, low-level replication would result in de novo infection of unselected, activated CD4+ cells that are expected to produce much more viral antigen than preselected reservoir cells.

      (2) Lines 249-253: There are multiple ways to explain this observation - alternatively, the total proviral DNA declined due to transient CD4 depletion.

      As discussed above, CD4⁺ T-cell counts did not significantly decrease in any of the treatment groups, as shown in Figure 5. The apparent decline observed concerns the CD4/CD8 ratio, which transiently dropped, but not the absolute number of CD4⁺ T cells. Moreover, although the dynamics of total HIV DNA is indeed similar to that of CD4/CD8 ratio (both declined transiently and then returned to baseline by day 84), the dynamics of unspliced RNA and unspliced RNA/total DNA ratio is clearly different, as these markers demonstrated a sustained decrease that was maintained throughout the trial period. Also, we observed a significant decrease in intact HIV DNA at day 84 compared to day 0. These effects cannot be easily explained by a transient decline in CD4+ cells.

      (3) Lines 301-305: This is a confusing explanation for not seeing an effect in tissue. Overall, there was no change in total proviral DNA in blood between days 0 and 84 either - yet the explanation for this observation is different (249-253). Was IPDA not performed on the tissue? Wouldn't this be the preferred test for reservoir depletion?

      We thank the reviewer for bringing this point to our attention. We modified the Discussion to prevent the confusion (lines 303-305). As for the IPDA on tissue, we attempted this assay on the tissue samples using two independent DNA extraction methods (Promega Reliaprep and Qiagen Puregene), but both yielded high DNA shearing index values, and intact proviral detection was successful in only 3 of 40 samples. Given the poor DNA integrity, these results were not interpretable.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Weaknesses:

      Only 1 gene (katG) gave a strong defect and 1 (Mab_1456c) exhibited a minor defect. Two of the clones did not show any persistence phenotype (blaR and recR) and one (pafA) showed a minor phenotype.

      We have now carried out more detailed validation studies on the Tn-Seq, with analysis of time-dependent killing over 14 d. This more comprehensive analysis shows that 4 of 5 genes analyzed do indeed have antibiotic tolerance defects under the conditions that Tn-Seq predicted a survival defect (Revised Figure 3). In addition, we found that even before actual cell death, several mutants had delayed resumption of growth after antibiotic removal (Figure 3 Supplemental).

      Fig 3 - Why is there such a huge difference in the extent of killing of the control strain in media, when exposed to TIG/LZD, when compared to Fig. 1C and Fig. 4? In Fig. 1C, M. abs grown in media decreases by >1 log by Day 3 and >4 log by Day 6, whereas in Fig. 3, the bacterial load decreases by <1 log by Day 3 and <2 log by Day 6. This needs to be clarified, if the experimental conditions were different, because if comparing to Fig. 1C data then the katG mutant strain phenotype is not very different.

      We agree with the reviewer that there is variability in the timing and extent of cell death from experiment to experiment. As noted by the reviewer, in Figure 1C the largest decrement in survival is between day 1 - day 3 (also seen in Figure 6A). As they noted, in Figure 4 the largest decrement is between day 3 – day 6 (also seen in Figure 3A, Figure 5F). In each experiment with katG mutants we carefully compare the mutant vs. the control strain within that experiment, which is more accurate than comparing the behavior of a mutant in one experiment to a control in another experiment.
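      The case for within-experiment comparison can be sketched with invented numbers. The CFU values and the helper names `log10_reduction` and `excess_killing` are hypothetical and are not taken from the figures.

```python
import math

def log10_reduction(cfu_start, cfu_end):
    """Log10 drop in viable counts between two time points."""
    return math.log10(cfu_start / cfu_end)

def excess_killing(control, mutant):
    """Extra log10 killing of the mutant relative to the control of the same experiment."""
    return log10_reduction(*mutant) - log10_reduction(*control)

# Hypothetical (start CFU, end CFU) pairs from two experiments with different overall killing.
experiment_a = {"control": (1e7, 1e6), "mutant": (1e7, 1e4)}
experiment_b = {"control": (1e7, 1e4), "mutant": (1e7, 1e2)}

for name, exp in (("A", experiment_a), ("B", experiment_b)):
    print(name, round(excess_killing(exp["control"], exp["mutant"]), 2))

# Comparing the mutant of experiment A with the control of experiment B hides the defect.
print(round(excess_killing(experiment_b["control"], experiment_a["mutant"]), 2))
```

      In this toy example the mutant shows 2 logs of excess killing in each experiment, but pairing the mutant from one experiment with the control from the other gives an apparent difference of zero.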

      Reviewer #2 (Public review):

      Weaknesses:

      First, word-choice decisions could better conform to the published literature. Alternatively, novel definitions could be included. In particular, the data support the concept of phenotypic tolerance, not persistence.

      We appreciate the reviewer's comments; text modified.

      Second, two of the novel observations could be explored more extensively to provide mechanistic explanations for the phenomena. 

      We have added several additional experiments, these are detailed below in response to specific comments.

      Reviewer #3 (Public review):

      Weaknesses:

      The findings could not be validated in clinical strains.

      We understand the reviewer’s concern that the katG phenotype was only observed in one of the two clinical strains we studied. We feel that our findings are relevant beyond the ATCC 19977 strain for two reasons:

      (1) We have performed additional analyses of the two clinical isolates and indeed find significant accumulation of ROS following antibiotic exposure in both of these strains (revised Figure 6A).

      (2) We do in fact see a role for katG in starvation-induced antibiotic tolerance in Mabs clinical strain-2. It is not surprising that different strains from a particular species may have some different responses to stresses – for example, there is wide strain-specific variability in susceptibility to different phages within a species based on which particular phage defense modules a given strain carries (for example PMID: 37160116). We speculate that different Mabs strains may express varying levels of other antioxidant factors and note that the genes encoding several such factors were identified by our Tn-Seq screen including the peroxidases ahpC, ahpD, and ahpE. Our analysis of the genetic interactions between katG and these other factors is ongoing. 

      Comments/Suggestions

      (1) In Fig1E, the authors show no difference in killing Mtb with or without adaptation in PBS. These data are contrary to the data presented in Figure 1B. These also do not align with the data of M. smegmatis and M. abscessus. Please discuss these observations in light of the Duncan model of persistence (Mol Microbiol. 2002 Feb;43(3):717-31.).

      The above referenced Duncan laboratory study found tolerance after prolonged starvation but did not actually examine tolerance at early time points. While some of the transcriptional and metabolic changes seen by Duncan and others are slow, other groups have described starvation responses in Mtb that are quite rapid. For example, the stringent response mediator ppGpp accumulates within a few hours after onset of starvation in Mtb (PMID: 30906866). We suspect that a rapid signaling response such as this underlies the phenotype we observe. Regarding the difference between Mtb and other mycobacterial species we also find it surprising that Mtb had a much more rapid starvation response. This is a clear species-specific difference that may reflect an adaptation of Mtb to the nutrient-limited physiologic niche within host macrophages.

      (2) Line 151, the authors state that they have used an M. abscessus Tn mutant library of ~ 55,000 mutant strains. The manuscript will benefit from the description of the coverage of total TA sites covered by the mutants.

      Text modified to add this detail. There are 91,559 TA sites in the abscessus genome. Thus, our Tn density is ~60%.
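      The arithmetic behind the ~60% figure can be reproduced in a few lines. This is a sketch that assumes every one of the ~55,000 mutants carries an insertion at a distinct TA site, so the result is an upper bound on coverage; the function name `insertion_density` is introduced here for illustration.

```python
def insertion_density(n_mutants, n_ta_sites):
    """Fraction of TA sites hit, assuming each mutant occupies a distinct site."""
    return n_mutants / n_ta_sites

# Numbers quoted in the review and response: ~55,000 mutants, 91,559 TA sites.
density = insertion_density(55_000, 91_559)
print(f"{density:.1%}")  # prints 60.1%
```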

      (3) Line 155: Please explain how long the cells were kept in an Antibiotic medium.

      This technical detail was noted above on line 153 in the original text: “…and then exposed them to TIG/LZD for 6 days”. To clarify the overall conditions, we have also revised the text of the manuscript and added the detail of how long cells were passaged after removal of antibiotics.

      (4) Line 201: data not shown. Delayed resumption of growth after removal of antibiotic would be helpful in indicating drug resilience. This data could enhance the manuscript.

      Data now provided in Figure 3 Supplemental

      (5) Figures 4C and 4F represent the kill curve. It will be good to show the data with CFU against the drug concentration in place of OD600. CFU rather than OD600 best reflects growth inhibition.

      Figures 4C and 4F are measuring the minimum inhibitory concentration (MIC) to stop the overall growth of the bacterial population. While we agree that CFU could be analyzed, this would be measuring a different outcome – cell death and the minimum bactericidal concentration (MBC). In these experiments we sought to specifically examine the MIC so as to separate growth inhibition from cell death. For this we used the standard method employed by clinical microbiology laboratories for MIC, which is optical density of the culture (PMID: 10325306).
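      How an MIC is read from optical density data can be sketched as follows. This is an illustration only: the OD600 cutoff of 0.1, the concentrations, and the readings are all assumptions and do not come from the manuscript or from a specific laboratory standard.

```python
def mic_from_od(readings, growth_cutoff=0.1):
    """Return the lowest concentration at and above which OD600 stays below the cutoff.

    readings: iterable of (drug_concentration, od600) pairs.
    Returns None if growth is seen at the highest concentration tested.
    """
    mic = None
    # Walk from the highest concentration downward until growth is detected.
    for concentration, od600 in sorted(readings, reverse=True):
        if od600 >= growth_cutoff:
            break
        mic = concentration
    return mic

# Hypothetical two-fold dilution series (concentration in ug/mL, OD600 after incubation).
dilution_series = [(0.25, 0.90), (0.5, 0.80), (1, 0.40), (2, 0.05), (4, 0.04), (8, 0.04)]
print(mic_from_od(dilution_series))
```

      The readout here is growth of the whole culture. Determining an MBC would instead require plating for CFU to count survivors, which is the distinction the response draws.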

      (6) Figure 5C. The authors shall show the effect of TIG/LZD on M. abscessus ROS production without the PBS adaptation. It is important to conclude that TIG/LZD induces ROS in cells. Authors should utilize ROS scavengers such as Thiourea, DFO, etc., to conclude ROS's contribution to bacterial killing following inhibition of transcription and translation.

      New data added (revised Figure 5 and Figure 5 Supplemental)  

      (7) Line 303. Remove "note".

      Text revised. We thank the reviewer for identifying this typographical error.  

      (8) The introduction and Discussion are very similar, and several lines are repeated.

      Text revised with overlapping content removed.

      Reviewer #1 (Recommendations for the authors):

      It appears that the same datasets for PBS adapted cultures were plotted in A-C and D-F. Either this should be specifically mentioned in the legend or it might be better to integrate the non-adapted plots into A-C which would also allow easier comparison.

      Appreciate the reviewer’s suggestion; text modified with added clarification to figure legend.

      This manuscript is focused on M. abs and the antibiotics TIG/LZD, so the Mtb data and the data using the antibiotics INH/RIF/EMB serve more as a distraction and can be removed.

      We appreciate the reviewer’s perspective. However, we wish to include these data to show the similarities (and differences) in starvation-induced tolerance between the three organisms.

      Fig 3 -As mentioned for Fig. 1, it appears that the same dataset was used for the control in all the figures A-E. This should be explicitly stated in the Figure legend.

      Appreciate the reviewer’s suggestion; text modified with added clarification to figure legend.

      The divergent results from the clinical strains are extremely interesting. It would be helpful to determine the oxidative stress levels (similar to the cellROX data shown in 5E), to tease out if the difference in katG role is because of lack of ROS induction in these strains or due to expression of alternate anti-oxidative stress defense mechanisms.

      We have performed additional cellROX analysis as suggested by the reviewer and found that the ROS induction is indeed present across all three Mabs strains, but that katG is only required in one of the two clinical strains (Strain #2). These data are now included in the revised Figure 6.

      Reviewer #2 (Recommendations for the authors):

      GENERAL COMMENTS

      This is a nice piece of work that uses the pathogen Mabs as a test subject.

      The work has findings that likely apply generally to antibiotics and mycobacteria: 1) phenotypic tolerance is associated with suppression of ROS, 2) lethal protein synthesis inhibitors act via accumulation of ROS, and 3) levofloxacin behaves in an unexpected way. Each is a new observation. However, I believe that each topic requires more work to be firmly established to be suitable for eLife.

      Phenotypic tolerance: Association with suppression of ROS is important but expected. I would solidify the conclusion by performing several additional experiments. For example, confirm the lethal effect of ROS by reducing it with an iron chelator and a radical scavenger. There is a large literature on effects of iron uptake, levels, etc. on antibiotic lethality that could be applied to this question. In 2013 Imlay argued against the validity of fluorescent probes. Perhaps getting the same results with another probe would strengthen the conclusion.

      We have carried out additional experiments with both an iron chelator and small molecule ROS scavengers to further test this idea but note that these experiments have several inherent limitations: 1) These compounds have highly pleiotropic effects. For example, while N-acetyl cysteine (NAC) is an antioxidant, it also increases mycobacterial respiration and was shown to paradoxically decrease antibiotic tolerance in M. tuberculosis (PMID: 28396391). 2) It has been shown by the Imlay group that small-molecule antioxidants are often ineffective in quenching ROS in bacteria (PMID: 388893820), making negative results difficult to interpret. Nonetheless, we present new experimental data showing that iron chelation does indeed improve the survival of antibiotic-treated Mabs (revised Figure 5). However, small molecule antioxidants such as thiourea do not restore antibiotic tolerance and actually increased bacterial cell death, suggesting that they may be affecting respiration in Mabs in a manner similar to that seen for NAC in Mtb. We also note that our genetic analysis, which identified numerous other genes encoding proteins with antioxidant function (Figure 2), is a strong additional argument in support of the importance of ROS in antibiotic-mediated lethality.

      Regarding the concern raised by Imlay about the validity of oxidation-sensitive dyes - this concern relates to bacterial autofluorescence induced by antibiotics, which can confound analyses in some species. We have ruled this out in our analyses by using bacteria unstained by cellROX as controls to confirm that there is negligible autofluorescence in Mabs (<0.1%, Figure 5E, Figure 6A).

      Protein synthesis inhibitors: At present, this is simply an observation. More work is needed to suggest a mechanism. For example, with E. coli the aminoglycosides are protein synthesis inhibitors that also cause membrane damage. Membrane damage is known to stimulate ROS-mediated killing. Your observation needs to be extended because chloramphenicol, another protein synthesis inhibitor, blocks ROS production. The lethality may be a property of mycobacteria: does it occur with E. coli (note that rifampicin is bacteriostatic with E. coli but lethal to Mtb)?

      We agree with the reviewer that the mechanism underlying ROS accumulation following transcription or translational inhibition in Mabs is of significant interest. It is likely to be a mechanism different from E. coli, because in E. coli tetracyclines and rifamycins are both bacteriostatic, whereas in Mabs they are both bactericidal. Determining the mechanism by which translation inhibitors cause ROS accumulation in Mabs is an ongoing effort in our laboratory using proteomics and metabolomics, but is outside the scope of this manuscript.

      Levofloxacin: This is also at the observational stage but is unexpected. In other studies, ROS is involved in quinolone-mediated killing of bacteria. Why is this not the case with Mabs? The observation should be solidified by showing the contrast with moxifloxacin, since this compound has been studied with mycobacteria (Shee 2022 AAC). With E. coli, quinolone structure can affect the relative contribution of ROS to killing (Malik 2007 AAC), as is also seen with Mtb (Malik 2006 AAC). What is happening in the present work with levofloxacin, an important anti-tuberculosis drug? Is there a structure explanation (compare with ofloxacin)?

      While these are interesting questions, a detailed exploration of the structure-function relationships between different fluoroquinolone antibiotics and their varying activities on Mtb and Mabs is outside the scope of this manuscript.  

      The writing is generally easy to follow. However, the concept of persistence should be changed to phenotypic tolerance with text changes throughout. I base this suggestion on the definitions of tolerance and persistence as stated in the consensus review (Balaban 2019 Nat Micro Rev). Experimentally, tolerance is seen as a gradual decline in survival following antibiotic addition; the decline is slower than seen with wild-type cells. The data presented in this paper fit that definition. In contrast, persistence refers to a rapid drop in survival followed by a distinct plateau (Balaban 2019 Nat Micro Rev; for example, see Wu Lewis AAC 2012 ). Moreover, to claim persistence, it would be necessary to demonstrate subpopulation status, which is not done. The Balaban review is an attempt to bring order to the field with respect to persistence and tolerance, since the two are commonly used without regard for a consistent definition.

      We appreciate the reviewer’s suggestion; text modified in multiple places to clarify.
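
      The tolerance/persistence distinction in the comment above can be illustrated with a minimal sketch. It assumes simple exponential killing: one rate for a uniformly tolerant population, and two rates plus a small slow-dying subpopulation for persistence. The function names and all rate values are hypothetical and chosen only for illustration; they are not taken from the manuscript or fitted to any data.

```python
import math


def survival_tolerant(t_hours, kill_rate):
    """Monophasic decline: the whole population dies at one (slow) rate."""
    return math.exp(-kill_rate * t_hours)


def survival_with_persisters(t_hours, fast_rate, slow_rate, persister_fraction):
    """Biphasic decline: the bulk dies fast, a small subpopulation dies slowly."""
    bulk = (1.0 - persister_fraction) * math.exp(-fast_rate * t_hours)
    persisters = persister_fraction * math.exp(-slow_rate * t_hours)
    return bulk + persisters


def log10_drop_per_hour(survival_fn, t_start, t_end):
    """Average log10 loss of survivors per hour between two time points."""
    drop = math.log10(survival_fn(t_start)) - math.log10(survival_fn(t_end))
    return drop / (t_end - t_start)


# Hypothetical parameters (per hour), for illustration only.
def tolerant(t):
    return survival_tolerant(t, kill_rate=0.3)


def persistent(t):
    return survival_with_persisters(
        t, fast_rate=2.0, slow_rate=0.05, persister_fraction=1e-3
    )


if __name__ == "__main__":
    for label, fn in (("tolerant", tolerant), ("persistent", persistent)):
        early = log10_drop_per_hour(fn, 0.0, 2.0)
        late = log10_drop_per_hour(fn, 10.0, 12.0)
        print(f"{label}: early {early:.3f}, late {late:.3f} log10/h")
```

      Under these assumptions the tolerant curve loses survivors at the same log rate early and late, whereas the persister curve drops steeply at first and then flattens into a plateau, which is the signature the consensus definition asks for.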

      Another issue requiring clarification is the relationship between resistance and tolerance. Killing by antibiotics is a two-step process, as most clearly seen with quinolones. First a reversible bacteriostatic event occurs. Resistance blocks that bacteriostatic damage. Then a lethal metabolic response to that damage occurs. Tolerance selectively blocks the second, killing event, a distinct process that often involves the accumulation of ROS. Direct antibiotic-mediated damage is an additional mode of killing that also stems from the reversible, bacteriostatic damage created by antibiotics. The authors recognize the distinction but could make it clearer. Take a look at Zheng (JJ Collins) 2020, 2022.

      Text modified to clarify this point

      Many readers would also like to see a bit more background on Mabs. For example, does it grow rapidly? Are there features that make it a good model for studying mycobacteria or bacteria in general? The more general, the better.

      Text modified, background added

      Below I have listed specific comments that I hope are useful in bringing the work to publication and making it highly cited.

      SPECIFIC COMMENTS

      Line 30 unexpectedly. I would delete this word because the result is expected from the ROS work of Shee et al 2022 with mycobacteria. Moreover, Zeng et al 2022 PNAS showed that ROS participates in antimicrobial tolerance, and persistence is a form of tolerance (Balaban et al, 2019, Nat Micro Rev).

      Text modified as per review suggestion

      Line 39 key goal: this is probably untrue in the general sense stated, since bacteriostatic antibiotics are sufficient to clear infection (Wald-Dickler 2019 Clin Infect Dis). However, it is likely to be the goal for Mtb infections.

      We agree with the reviewer that bacteriostatic antibiotics are effective in treating most types of infections and do not claim otherwise in the manuscript. However, from a clinical standpoint, eradication of the pathogen causing the infection is indeed the goal of antibiotic therapy in virtually all circumstances (with the exception of specific scenarios such as cystic fibrosis where it is recognized that the infecting organism cannot be fully eliminated). In most cases, the combination of bacteriostatic antibiotics and the host immune response is sufficient to achieve eradication. We have modified the manuscript text to reflect this nuance noted by the reviewer.

      Line 62 several: you list three, but hipAB works via ppGpp, so the sentence needs fixing

      Text modified  

      Line 70 uncertain: this uncertainty is unreferenced. Since everything is uncertain, this vague phrase does not add to the story.

      The reviewer makes an interesting philosophical argument. However, we would submit that some aspects of biology, for example the regulation of glycolysis, are understood in great detail. However, other mechanisms, such as the precise mechanisms of lethality for diverse antibiotics in different bacterial species, are far more uncertain and remain a subject of debate (for example PMID: 39910302). Text not modified.

      Line 72 somewhat controversial: I would delete this, because the points in the Science papers by Lewis and Imlay have been clarified and in some cases refuted by prior and subsequent work.

      Text modified

      Line 72 presumed: this suggests that it is wrong and perhaps a different idea has replaced it. Another, and more likely view is that there is an additional mode of killing. I suggest rephrasing to be more in line with the literature.

      Text modified for clarity. In this sentence “presume” refers to the historical concept that direct target inhibition was solely responsible for antibiotic lethality. As the reviewer notes, there is now significant literature that ROS (and perhaps other secondary effects) also contribute to bacterial killing.  

      Line 73 However and the following might also: this phrasing, plus the presumed, misleads the reader from your intent. I suggest rephrasing.

      See above re: line 72

      Line 75 citations: these are inappropriate and should be changed to fit the statement. I suggest the initial paper by Collins (Kohanski 2007 Cell), a recent paper by Zhao (Zeng PNAS 2022), and a review (Drlica Expert Rev Anti-infect Therapy 2021). The present citations are fine if you want to narrow the statement to mycobacteria, but the history is that the E. coli work came first and was then generalized to mycobacteria. A mycobacterial paper for ROS is Shee 2022 AAC.

      We thank the reviewer for noticing that we inadvertently omitted several important E. coli-related references. These have been added.

      Line 75 and 76: Conversely ... unresolved. Compelling arguments have been made that show major flaws in the two papers cited, and a large body of evidence has now accumulated showing the validity of the idea promoted by the Collins lab, beginning with Kohanski 2007. In addition to many papers by Collins, see Hong 2019 PNAS and Zeng 2022 PNAS. It is fine if you want to counter the arguments against the Lewis and Imlay papers (summarized in Drlica & Zhao 2021 Expert Rev Anti-infect Therapy), but making a blanket statement suggests that the authors are unfamiliar with the literature.

      We agree with the reviewer that the weight of the evidence supports a role for antibiotic-induced ROS as an important mechanism for antibiotic lethality under many (though not all) conditions. We have revised the text to better reflect this nuance.

      Line 78. Advantages over what?

      Text modified

      Line 80 exposure: to finish the logic you need to show that E. coli and S. aureus persisters fail to do this.

      We thank the reviewer for their suggestion but studying these other organisms is outside the scope of this study. 

      Line 82 whereas: this misdirects the reader. It would seem that a simple "and" is better

      Text modified

      Line 89 I think this paragraph is about the need to study Mabs, the subject of the present report. This paragraph could use a more appropriate topic sentence to guide the reader so that no guessing is involved. I suggest rephrasing this paragraph to make the case for studying more compelling.

      Text modified

      Line 96. I suggest citing several references after subinhibitory concentration of antibiotic.

      The references are in the following sentence alongside the key observations.

      Line 99. Genetic analysis: how does this phrase fit with the idea of persister cells arising stochastically?

      There are two issues: 1) We would argue that persister formation is not completely stochastic, but rather a probability that can be modified both genetically and by environment (for example hipA PMID: 6348026). 2) Even if persister formation were totally stochastic, the survival of these cells may depend on specific genes – as we indeed find in our Tn-Seq analysis of Mabs.  

      Line 106. In this paragraph you need to define persister. The consensus definition (Balaban 2019 Nat Micro Rev) is a subpopulation of tolerant cells. Tolerance is defined as the slowing or absence of killing while an antibiotic retains its ability to block growth. See Zeng 2022 PNAS for an example with rapidly growing cells. Phenotypic tolerance is the absence of killing due to environmental perturbations, most notably nutrient starvation, dormancy, and growth to stationary phase. By extension, phenotypic persistence would be subpopulation status of phenotypically tolerant cells. If you have a different definition, it is important to state it and emphasize that you disagree with the consensus statement.

      Text modified  

      Line 109 unexpectedly. I would delete this word, because the literature leads the reader to expect this result unless you make a clear case for Mabs being fundamentally different from other bacteria with respect to how antibiotics kill bacteria (this is unlikely, see Shee 2022 AAC). Indeed, lines 111-113 state extensions of E. coli work, although suppression of ROS in phenotypic tolerance and genetic persistence have not been demonstrated.

      Text modified

      Line 124 you might add, in parentheses and with references, that a property of persisters is cross-persistence to multiple antibiotic classes. This is also true for tolerance, both genetic and phenotypic. An addition will support your approach.

      Text modified

      Line 128 minimal

      Text not modified. We appreciate the reviewer’s preference, but “minimal” and “minimum” are both widely accepted terms. Indeed, the Balaban et al 2019 consensus statement on definitions cited by the reviewer above also uses “minimum” (PMID: 30980069), as do IDSA clinical guidelines (PMID: 39108079).

      Line 130 is MIC somehow connected to killing or did you also measure killing? Note that blocking growth and killing cells are mechanistically distinct phenomena, although they are related. By being upstream from killing, blockage of growth will also interfere with killing.

      Text modified

      Line 133 PBS is undefined

      Text modified

      Line 134 increase in persisters ... you need to establish that these are not phenotypically tolerant cells. Do they constitute the entire population (tolerance)? Your data would be more indicative of persisters if you saw a distinct plateau with the PBS samples, as such data are often used to document persistence (retardation of killing is a property of tolerance, Balaban 2019). Fig. 1B is clearly phenotypic tolerance, as the entire population grows. Your data suggest that you are not measuring persistence as defined in the literature (Balaban 2019). Line 139 persister should be tolerance

      Text modified

      Lines 142, 143, 144, 159, 163, 171, 181, 211, 226, 238, 246, 277, 279, 289 persistent should be tolerant

      Text modified

      Line 146 fig 1E Mtb does not show the adaptation phenomenon and it is clearly tolerant, not persistent. This should be pointed out. As stated, you may be misleading the reader.

      Text modified  

      Line 169. Please make it clear whether these genes are affecting antibiotic susceptibility (MIC will affect killing because blocking growth is upstream) or if you are dealing with tolerance (no change in MIC). These measurements are essential and should be included as a table. By antibiotic response, do you mean that antibiotics change expression levels?

      Regarding MICs, the data for MICs in control and katG mutant are presented in Figure 4C and 4F. Regarding ‘response’ we have clarified the text of this sentence.

      Line 174 Interestingly should be as expected

      Text not modified; tetracyclines do not induce ROS in E. coli and oxazolidinones have not been studied in this regard.

      Line 183 you need to include citations. You can cite the ability of chloramphenicol to block ROS-mediated killing of E. coli. That allows you to use the word unexpected

      Text modified

      Line 199. All of the data in Fig. 3 shows tolerance, not persistence, requiring word changes in this paragraph.

      Text modified

      Line 226. The MIC experiment is important. You can add that this result solidifies the idea that blocking growth and killing cells are distinct phenomena. You can cite Shee 2022 AAC for a mycobacterial paper

      Text modified

      Line 241. The result with levofloxacin is unexpected, because the fluoroquinolones are widely reported to induce ROS, even with mycobacteria (see Shee 2022 AAC). You need to point this out and perhaps redo the experiment to make sure it is correct.

      We appreciate the reviewer’s interest in this question. All experiments in this paper were repeated multiple times. This particular experiment was repeated 3 times, and in all replicates the katG mutant was sensitized to translation inhibitors but not levofloxacin. Shee et al examined Mtb treated with moxifloxacin and found ROS generation, but did not assess whether an Mtb katG mutant had impaired survival. Thus, in addition to differences in i) the species studied and ii) the particular fluoroquinolone used, the two sets of experiments were designed to address different questions (ROS accumulation vs protection by katG). A cell might accumulate ROS without a katG mutant having impaired survival if genetic redundancy exists, a result we indeed see in our clinical Mabs strains under some conditions (new data included in revised Figure 6A).

      Line 269 Additional controls would bolster the conclusion: use of an antioxidant such as thiourea and an iron chelator (dipyridyl) both should reduce ROS effects.

      New experiments performed, revised Figure 5.

      Line 276 the word no is singular

      Text modified

      Line 284 this suggested ... in fact previous work suggested. This summary paragraph might go better as the first paragraph of the Discussion

      Text modified to specify that this is in reference to the work in this manuscript

      Lines 294-299 Most of this is redundant and should be deleted.

      Text modified

      Line 299 this species is vague

      Text modified

      Line 310 Do you want to discuss spoT?

      Text not modified

      Line 313 paragraph is largely redundant

      Text modified

      Line 314 controversial. As above, I would delete this, especially since it is not referenced and is unlikely to be true. If you believe it, you have the obligation to show why the ROS-lethality idea is untrue. If you are referring to Lewis and Imlay, there were almost a dozen supporting papers before 2013 and many after. This statement does not make the present work more important, so deletion costs you nothing.

      Text modified

      Line 314 direct disruption of targets. This is clearly not a general principle, because the quinolones rapidly kill while inhibition of gyrase by temperature-sensitive mutations does not (Kreuzer 1979 J.Bact; Steck 1985). Indeed, formation of drug-gyrase-DNA complexes is reversible: death is not.

      Text modified

      Line 318 as pointed out above, you have not brought this story up to date. The two papers mainly focused on Kohanski 2007, ignoring other available evidence.

      Text modified

      Line 326 you need to cite Shee 2022 AAC

      Text modified

      Line 342 the idea of mutants being protective is not novel, as several have been reported with E. coli studies. Thus, there is a general principle involved.

      We agree that this suggests a potential general principle

      Line 344. It depends on the inhibitor. For example, aminoglycosides are translation inhibitors and they also cause the accumulation of ROS.

      We agree that ROS generation depends on the inhibitor, and indeed upon other variables including drug concentration, growth conditions, and bacterial species as well.  

      Line 347. You need to point out the considerable data showing that the absence of catalase increases killing

      Text modified

      Line 363 look at Shee 2022 AAC and Jacobs 2021 AAC

      Text modified, reference added.

      Line 585 I suggest having a colleague provide critical comments on the manuscript and acknowledge that person.

      Text not modified

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Pavel et al. analyzed a cohort of atrial fibrillation (AF) patients from the University of Illinois at Chicago, identifying TTN truncating variants (TTNtvs) and TTN missense variants (TTNmvs). They reported a rare TTN missense variant (T32756I) associated with adverse clinical outcomes in AF patients. To investigate its functional significance, the authors modeled the TTN-T32756I variant using human induced pluripotent stem cell-derived atrial cardiomyocytes (iPSC-aCMs). They demonstrated that mutant cells exhibit aberrant contractility, increased activity of the cardiac potassium channel KCNQ1 (Kv7.1), and dysregulated calcium homeostasis. Interestingly, these effects occurred without compromising sarcomeric integrity. The study further identified increased binding of the titin-binding protein Four-and-a-Half Lim domains 2 (FHL2) with KCNQ1 and its modulatory subunit KCNE1 in the TTN-T32756I iPSC-aCMs.

      Strengths:

      This work has translational potential, suggesting that targeting KCNQ1 or FHL2 could represent a novel therapeutic strategy for improving cardiac function. The findings may also have broader implications for treating patients with rare, disease-causing variants in sarcomeric proteins and underscore the importance of integrating genomic analysis with experimental evidence to advance AF research and precision medicine.

      Weaknesses

      (1) Variant Identification: It is unclear how the TTN missense variant (T32756I) was identified using REVEL, as none of the patients' parents reportedly carried the mutation or exhibited AF symptoms. Are there other TTN variants identified in the three patients carrying TTN-T32756I? Clarification on this point is necessary.  

      We thank the reviewer for their insightful comment. We have now clarified these points in the Methods section.

      Line 484-491: “The TTN-T32756I variant (REVEL Score: 0.58758, Supplementary Table 1) was prioritized due to its occurrence in multiple unrelated individuals within our clinical AF cohort, despite no reported family history of AF in affected individuals. While no parental inheritance was observed, the possibility of de novo origin cannot be excluded. Furthermore, this variant is located within a region overlapping a deletion mutation recently shown to cause AF in a zebrafish model, supporting its potential pathogenicity [37]. Notably, the affected individuals did not carry additional loss-of-function TTN variants.”

      (2) Patient-Specific iPSC Lines: Since the TTN-T32756I variant was modeled using only one healthy iPSC line, it is unclear whether patient-specific iPSC-derived atrial cardiomyocytes would exhibit similar AF-related phenotypes. This limitation should be addressed.

      We have now acknowledged this limitation in the revised manuscript.

      Line 505-509: “Because peripheral blood mononuclear cells (PBMCs) from the patients were unavailable, we utilized a healthy iPSC line and introduced the TTN-T32756I variant using CRISPR/Cas9 genome editing. This approach ensures an isogenic background, thereby minimizing genetic variability and providing a controlled system to study the direct effects of the mutation.”

      (3) Hypertension as a Confounding Factor: The three patients carrying TTN-T32756I also have hypertension. Could the hypertension associated with this variant contribute secondarily to AF? The authors should discuss or rule out this possibility.

      We have now explicitly discussed this in the revised manuscript.

      Line 362-367: “Hypertension is a common comorbidity in patients with AF and could contribute to disease progression. However, all three individuals carrying TTN-T32756I exhibited early-onset AF (onset before 66 years), with one case occurring as early as 36 years. This suggests a potential two-hit mechanism, where genetic predisposition and comorbidities influence disease risk. Importantly, our iPSC model isolates the genetic effects of TTN-T32756I from other factors, supporting a direct pathogenic role.”

      (4) FHL2 and KCNQ1-KCNE1 Interaction: Immunostaining data demonstrating the colocalization of FHL2 with the KCNQ1-KCNE1 (MinK) complex in TTN-T32756I iPSC-aCMs are needed to strengthen the mechanistic findings.

      We thank the reviewer for this insightful suggestion. We agree that additional immunostaining data would further strengthen the evidence for FHL2 colocalization with the KCNQ1-KCNE1 complex in TTN-T32756I iPSC-aCMs. In line with this, we have expanded our analysis to include both co-immunoprecipitation and confocal microscopy.  As described in the revised manuscript (Lines 282–287), the colocalization between KCNE1 and FHL2 was increased by approximately threefold in TTN-T32756I iPSC-aCMs compared with WT, supporting an enhanced interaction between these proteins (Figure 5A, Supplementary Figure 6). We are generating additional immunostaining data to validate and extend these findings, and we will incorporate them into the revised submission to further substantiate the mechanistic link proposed.

      Line 282-287: “…if TTN-T32756I increases I<sub>ks</sub> by modulating the interaction between KCNQ1-KCNE1 and FHL2, we performed co-immunoprecipitation studies and confocal microscopy in both WT and TTN-T32756I-iPSC-aCMs. The co-localization between KCNE1 and FHL2 increased ~3-fold in TTN-T32756I-iPSC-aCMs, suggesting an increased interaction between them (Figure 5A, Supplementary Figure 7).”

      (5) Functional Characterization of FHL2-KCNQ1-KCNE1 Interaction: To further validate the proposed mechanism, additional functional assays are necessary to characterize the interaction between FHL2 and the KCNQ1-KCNE1 complex in TTN-T32756I iPSC-aCMs.

      We thank the reviewer for this valuable suggestion. We agree that additional functional assays would provide further validation of the proposed mechanism. However, we believe such in-depth characterization warrants a dedicated follow-up study and is beyond the scope of the current revision. In this work, our primary objective is to establish that the TTN missense variant can exert a detrimental effect and serve as a substrate for AF. 

      Line 418-419: “Further study is needed to validate the proposed mechanism and determine if TTNmvs in other regions are associated with AF by a similar process.”

      Reviewer #2 (Public review):

      Summary:

      The authors present data from a single-center cohort of African-American and Hispanic/Latinx individuals with atrial fibrillation (AF). This study provides insight into the incidence and clinical impact of missense variants in the Titin (TTN) gene in this population. In addition, the authors identified a single amino acid TTN missense variant (TTN-T32756I) that was further studied using human induced pluripotent stem cell-derived atrial cardiomyocytes (iPSC-aCMs). These studies demonstrated that Four-and-a-Half Lim domains 2 (FHL2) has increased binding with KCNQ1 and its modulatory subunit KCNE1 in the TTN-T32756I-iPSC-aCMs, enhancing the slow delayed rectifier potassium current (Iks), which is a potential mechanism for atrial fibrillation. Finally, the authors demonstrate that suppression of FHL2 could normalize the Iks current.

      Strengths:

      The strengths of this manuscript/study are listed below:

      (1) This study includes a previously underrepresented population in the study of the genetic and mechanistic basis of AF.

      (2) The authors utilize current state-of-the-art methods to investigate the pathogenicity of a specific TTN missense variant identified in this underrepresented patient population.

      (3) The findings of this study identify a potential therapeutic for treating atrial fibrillation.

      Weaknesses:

      (1) The authors do not include a non-AF group when evaluating the incidence and clinical significance of TTN missense variants in AF patients.

      We appreciate the reviewer’s comment and acknowledge the limitation of not including a non-AF control group in our clinical analysis. As noted in the revised manuscript (Lines 347–353), our cohort was derived from a single-center registry of individuals with AF and therefore lacks a matched non-AF control population for direct comparison of TTN missense variant incidence. We agree that future studies incorporating larger, multiethnic validation cohorts with both AF and non-AF individuals, as well as evaluating AF-specific measures such as arrhythmia burden and treatment response, will be essential to fully elucidate the clinical significance of TTN missense variants in AF.

      Line 347-353: “Our cohort is derived from a single-center multi-ethnic registry of individuals with AF and lacks a matched cohort of non-AF controls to compare the incidence of TTN missense variants. Further study exploring these associations in multi-ethnic, larger validation cohorts that include both AF and non-AF individuals and examining AF-specific measures such as arrhythmia burden or treatment response will be necessary to fully understand the clinical importance of TTNmvs in AF.”

      (2) The authors do not provide evidence that TTN-T32756I-iPSC-aCMs are arrhythmogenic, only that there is an increase in the Iks current and associated action potential changes. More specifically, the authors report that "compared to the WT, TTN-T32756I-iPSC-aCMs exhibited increased arrhythmic frequency," yet it is unclear what they are referring to by "arrhythmic frequency."

      We thank the reviewer for this important point and for highlighting the need for clarification. In our study, the term “arrhythmic frequency” was intended to describe the increased spontaneous beating rate, irregular action potential patterns, and abnormal calcium handling observed in TTN-T32756I iPSC-aCMs compared with WT. These findings support the concept that the AF-associated TTN-T32756I variant promotes ion channel remodeling and perturbs excitation–contraction coupling, thereby creating a potential arrhythmogenic substrate for AF. To avoid ambiguity, we have removed the term “arrhythmic frequency” and revised the text for clarity and precision (Lines 222–223).

      Lines 222-223: “Compared to the WT, TTN-T32756I-iPSC-aCMs exhibited increased frequency along with a significant reduction of the time to 50% and 90% decline of calcium transients (Figure 3G-I, Supplementary Figure 4F).”

      (3) There seem to be discrepancies regarding the impact of the TTN-T32756I variant on mechanical function. Specifically, the authors report "both reduced contraction and abnormal relaxation in TTN-T32756I-iPSC-aCMs" yet, separately report "the contraction amplitude of the mutant was also increased . . . suggesting an increased contractile force by the TTN-T32756I-iPSC-aCMs and TTN-T32756I-iPSC-CMs exhibited similar calcium transient amplitudes as the WT."

      We thank the reviewer for highlighting this critical point and apologize for the lack of clarity. We intended to distinguish between changes in contractile force and contractile dynamics. Specifically, the increased contraction amplitude observed in TTN-T32756I iPSC-aCMs reflects enhanced contractile force, whereas the reduced contraction duration and impaired relaxation reflect abnormalities in contractile kinetics. Together, these findings indicate that the TTN-T32756I variant alters both the strength and the temporal dynamics of contraction, consistent with dysfunctional mechanical performance. We have revised the text accordingly to more accurately convey these results (Lines 187–192).

      Lines 187-192: “Compared to WT, the beating frequency of the TTN-T32756I-iPSC-aCMs was significantly increased (52 ± 7.8 vs. 98 ± 7.5 beats per min, P=0.001; Figure 2C) coupled with the reduction of the contraction duration (456.5 ± 61.45 vs 262.9 ± 48.16 msec, P=0.032; Figure 2D), the peak-to-peak time (1529 ± 195.5 vs 636.6 ± 135.8 msec, P=0.004; Supplementary Figure 3B),  and the relaxation (281.5 ± 42.95 vs 79.40 ± 21.14 msec, P=0.003; Supplementary Figure 3A).”

      Reviewer #3 (Public review):

      Summary:

      The authors describe the abnormal contractile function and cellular electrophysiology in an iPSC model of atrial myocytes with a titin missense variant. They provide contractility data by sarcomere length imaging, calcium imaging, and voltage clamp of the repolarizing current iKs. While each of the findings is interesting, the paper comes across as too descriptive because there is no data merging to support a cohesive mechanistic story/statement, especially from the electrophysiological standpoint. There is not enough support for the title "A Titin Missense Variant Causes Atrial Fibrillation", since there is no strong causative evidence. There is some interesting clinical data regarding the variant of interest and its association with HF hospitalization, which may lead to future important discoveries regarding atrial fibrillation.

      Strengths:

      The manuscript is well written, and a wide range of experimental techniques are used to probe this atrial fibrillation model.

      Weaknesses

      (1) While the clinical data is interesting, it is essential to rule out heart failure with preserved EF as a confounder. HFpEF leads to AF due to increased atrial remodeling, so the fact that patients with this missense variant have increased HF hospitalizations does not necessarily directly support the variant as causative of AF. It could be that the variant is associated directly with HFpEF instead, and this needs to be addressed and corrected in the analyses.

      We appreciate the reviewer’s insightful comment and agree that HFpEF-related atrial remodeling could represent a potential confounder in the association between TTN missense variants and AF. The primary aim of our clinical analysis was to assess the potential significance of TTNmv in AF, recognizing the inherent limitations of retrospective observational data in establishing causality. To complement this, our in vitro studies were specifically designed to demonstrate that TTNmv can alter the electrophysiological substrate, thereby predisposing to AF independent of clinical comorbidities.

      While HFpEF is an important consideration, to our knowledge, no existing literature directly implicates TTNmv in HFpEF pathogenesis. In contrast, loss-of-function TTN variants are more commonly associated with HFrEF and dilated cardiomyopathy, and even these associations remain an area of active debate. To address potential confounding in our cohort, we adjusted for reduced ejection fraction in multivariable analyses of clinical outcomes. Additionally, we performed a sensitivity analysis excluding patients with nonischemic dilated cardiomyopathy (Supplementary Table 6). Together, these approaches mitigate the potential impact of heart failure subtypes on our findings, while our mechanistic studies strengthen the argument that TTNmv may contribute directly to AF susceptibility.

      (2) All contractility and electrophysiologic data should be done with pacing at the same rate in both control and missense variant groups, to control for the effect of cycle length on APD and calcium loading. A shorter APD cannot be claimed when the firing rate of one set of cells is much faster than the other, since shorter APD is to be expected with a quicker rate. Similarly, contractility is affected by diastolic interval because of the influence of SR calcium content on the myocyte power stroke. So the cells need to be paced at the same rate in the IonOptix for any direct comparison of contractility. The authors should familiarize themselves with the concept of electrical restitution.

      We thank the reviewer for this crucial technical comment. iPSC-derived cardiomyocytes (iPSC-CMs) are known to exhibit spontaneous automaticity due to the presence of pacemaker-like currents and reduced I<sub>K1</sub>, which enables interrogation of their intrinsic electrophysiological properties and disease-relevant remodeling. In our study, we leveraged this feature to test the hypothesis that TTN missense variants alter electrophysiological properties through ion channel remodeling. That said, we fully agree with the reviewer that pacing iPSC-CMs at a controlled cycle length is essential for minimizing rate-dependent effects on APD, calcium handling, and contractility, and would improve the interpretability of group comparisons. While iPSC-CMs with matched genetic backgrounds are expected to display broadly comparable electrophysiological profiles, biological and technical variability can influence spontaneous beating rates, thereby confounding direct comparisons. To address this, we have incorporated pacing protocols into our revised experimental design to ensure that APD and contractility measurements are obtained under identical cycle lengths, consistent with the concept of electrical restitution.

      (3) It is interesting that the firing rate of the myocytes is faster with the missense variant. This should lead to a hypothesis and investigation of abnormal automaticity or triggered activity, which may also explain the increased contractility since all these mechanisms are related to the SR's calcium clock and calcium loading. See #2 above for suggestions on how to probe calcium handling adequately. Such an investigation into impulse initiation mechanisms would be compelling in supporting the primary statement of the paper since these are actual mechanisms thought to cause AF.

      We thank the reviewer for this insightful suggestion. We agree that the faster firing rate observed in TTN-T32756I iPSC-aCMs raises the possibility of abnormal automaticity or triggered activity, both of which are highly relevant to AF pathophysiology. As these mechanisms are tightly coupled to calcium handling and the SR calcium clock, further probing of calcium cycling abnormalities would provide valuable mechanistic insights. While this level of investigation is beyond the scope of the current study, we view it as a compelling future direction that could directly link TTN missense variants to impulse initiation abnormalities contributing to AF. 

      (4) The claim of shortened APD without correcting for cycle length is problematic. However, linking shortened APD in isolated cells alone to AF causation is more complicated. To have a setup for reentry, there must be a gradient of APD from short to long, and this can only be demonstrated at the tissue level, not at the cellular level, so reentry should not be invoked here. If shortened APD is demonstrated with correction of the cycle length problem, restitution curves can be made showing APD shortening at different cycle lengths. If restitution is abnormal (i.e. the APD does not shorten normally in relation to the diastolic interval), this may lead to triggered activity which is an arrhythmogenic mechanism. This would also tie in well with the finding of abnormally elevated iKs current since iKs is a repolarizing current directly responsible for restitution.

      We thank the reviewer for this necessary clarification. We agree that isolated cell studies cannot directly demonstrate reentrant circuits and that reentry should not be inferred solely from cellular APD data. Our observation of shortened APD and abnormal beating patterns in TTN-T32756I iPSC-aCMs suggests ion channel remodeling that may predispose to arrhythmogenic conditions. Still, we recognize that tissue-level gradients of APD are required to establish reentry as a mechanism. Accordingly, we have removed mention of “the reentrant mechanism” from the revised manuscript and limited our interpretation to the cellular findings. Future studies incorporating pacing protocols and restitution curve analyses will be valuable in determining whether abnormal APD restitution and elevated I<sub>Ks</sub> contribute to triggered activity, thereby providing a more direct mechanistic link to AF (Lines 101–105).

      Lines 101-105: “Our study showed that the TTN-T32756I iPSC-aCMs exhibited a striking AF-like EP phenotype in vitro, and transcriptomic analyses revealed that the TTNmv increases the activity of the FHL2, which then modulates the slow delayed rectifier potassium current (I<sub>Ks</sub>) to cause AF.” 

      Reviewer #1 (Recommendations for the authors):

      Electrophysiological Phenotype in Ventricular CMs: Has the iPSC line carrying TTN-T32756I been differentiated into ventricular cardiomyocytes (iPSC-vCMs)? The reported cellular phenotype in iPSC-aCMs does not seem to specifically reflect an AF phenotype. Does the variant produce similar electrophysiological alterations in iPSC-vCMs?

      We thank the reviewer for this thoughtful comment. To date, we have not differentiated the TTN-T32756I iPSC line into ventricular cardiomyocytes (iPSC-vCMs). Our current work focuses on iPSC-aCMs, where we demonstrate that the AF-associated TTN-T32756I variant induces ion channel remodeling and abnormal beating patterns, thereby creating a potential arrhythmogenic substrate relevant to AF. We agree that investigating whether this variant produces similar or distinct electrophysiological alterations in iPSC-vCMs would provide essential insights into chamber-specific effects and broaden our mechanistic understanding. We have acknowledged this as a future direction in the revised manuscript (Lines 422–425).

      Lines 422-425: “While we have not yet explored the effect of TTN-T32756I in iPSC-derived ventricular cardiomyocytes, it would be interesting to investigate whether this variant produces similar or distinct electrophysiological alterations in the ventricular cardiomyocytes.”

    1. Author response:

      (1) General Statements

      Our manuscript studies mechanisms of planar polarity establishment in vivo in the Drosophila pupal wing. Specifically we seek to understand mechanisms of ‘cell-scale signalling’ that is responsible for segregating core pathway planar polarity proteins to opposite cell edges. This is an understudied question, in part because it is difficult to address experimentally.

      We use conditional and restrictive expression tools to spatiotemporally manipulate core protein activity, combined with quantitative measurement of core protein distribution, polarity and stability. Our results provide evidence for a robust cell-scale signal, while arguing against mechanisms that depend on depletion of a limited pool of a core protein or polarised transport of core proteins on microtubules. Furthermore, we show that polarity propagation across a tissue is hard, highlighting the strong intrinsic capacity of individual cells to establish and maintain planar polarity.

      The original manuscript received three fair and thorough peer-reviews, which raised many important points. In response, we decided to embark on a full revision that attempts to answer all of the points. We have included new data to support our conclusions in Supplemental Figures 1, 2 and 5.

      Additionally in response to the reviewers we have revised the manuscript title, which is now ‘Characterisation of cell-scale signalling by the core planar polarity pathway during Drosophila wing development’.

      (2) Point-by-point description of the revisions

      We thank all of the reviewers for their thorough and thoughtful review of our manuscript. They raise many helpful points which have been extremely useful in assisting us to revise the manuscript.

      In response we have carried out a major revision of the manuscript, making numerous changes and additions to the text and also adding new experimental data. Specific changes are listed after our detailed response to each comment.

      Reviewer #1:

      […] Major points:

      The exact meaning of cell-scale signaling is not defined, but I infer that the authors use this term to describe how what happens on one side of a cell affects another side. The remainder of my critique depends on this understanding of the intended meaning.

      As the reviewer points out, it is important that the meaning of the term ‘cell-scale signalling’ is clear to the reader and in response to their comment we have had another go at defining it explicitly in the Introduction to the manuscript.

      Specifically, we use the term ‘cell-scale signalling’ to describe possible intracellular mechanisms acting on core protein segregation to opposite cell membranes during core pathway dependent planar polarisation. For example, this could be a signal from distal complexes at one side of the cell leading to segregation of proximal complexes to the opposite cell edge, or vice versa. See also our response to Reviewer #2 regarding the distinction between ‘molecular-scale’ and ‘cell-scale’ signalling. 

      Changes to manuscript: Revised definition of ‘cell-scale signalling’ in Introduction.

      The authors state that any tissue wide directional information comes from pre-existing polarity and its modification by cell flow, such that the de novo signaling paradigm "bypasses" these events and should therefore not be responsive to any further global cues. It is my understanding that this is not a universally accepted model, and indeed, the authors' data seem to suggest otherwise. For example, the image in Fig 5B shows that de novo induction restores polarity orientation to a predominantly proximal to distal orientation. If no global cue is active, how is this orientation explained?

      We assume that the reviewer’s point is that it is not universally accepted that de novo induction after hinge contraction leads to uncoupling from global cues (rather than that it is not accepted that hinge contraction remodels radial polarity to a proximodistal pattern). We are (we believe) the only lab that has used de novo induction as a tool, and we’re not aware of any debate in the literature about whether this bypasses global cues. Nevertheless, we accept that it is hard to prove there is no influence of global cues, when the nature of those cues and the time at which they act remain unclear. Below we summarise the reasons why we believe there are no significant effects of global cues in our experiments that would influence the interpretation of our results.

      First, our reading of the literature supports a broad consensus that an early radial core planar polarity pattern is realigned by cell flow produced by hinge contraction beginning at around 16h APF (e.g. Aigouy et al., 2010; Strutt and Strutt, 2015; Aw and Devenport, 2017; Butler and Wallingford, 2017; Tan and Strutt, 2025). Taken at face value, this suggests that there are ‘radial’ cues present prior to hinge contraction, maybe coming from the wing margin – arguably these radial cues could be Ft-Ds or Wnts or both, given they are expressed in patterns consistent with such a role (notwithstanding the published evidence arguing against roles for either of these cues). It then appears that hinge contraction supersedes these cues to convert a radial pattern to a proximodistal pattern – whether the radial cues that affect the core pathway earlier remain active after hinge contraction is unclear, although both Ft-Ds and Wnts appear to maintain their ‘radial’ patterns beyond the beginning of hinge contraction (e.g. Merkel et al., 2014; Ewen-Campen et al., 2020; Yu et al., 2020).

      We think that the reviewer is proposing the presence of a proximodistal cue that is active in the proximal region of the wing that we use for our experiments shown e.g. in Fig.5, and that this cue orients core polarity here (but not elsewhere in the wing) in a time window after 18h APF. Ft-Ds and Wnts do not seem to be plausible candidates as they are still in ‘radial’ patterns. This leaves either an unknown proximodistal cue (a gradient of some unknown signalling molecule?), or possibly some ability of hinge contraction to align proximodistal polarity specifically in this wing region but not elsewhere. We cannot definitively rule out either of these possibilities, but neither do we think there is sufficient evidence to justify invoking their existence to explain our observations.

      In particular, the reason that we don’t think there is a proximodistal cue in the proximal part of the wing after 18h APF, is that work from our lab shows that induction of Fz or Stbm expression at times around or after the start of hinge contraction (i.e. >16 h APF) results in increasing levels of trichome swirling with polarity not being coordinated with the tissue axis either proximally or distally (Strutt and Strutt, 2002; Strutt and Strutt 2007). Our simplest interpretation for this is that induction at these stages fails to establish the early radial pattern of core pathway polarity and hence hinge contraction cannot reorient radial to proximodistal. If hinge contraction alone could specify proximodistal polarity in the absence of the earlier radial polarity, then we would not expect to see swirling over much of the proximal wing (where the forces from hinge contraction are strongest (Etournay et al., 2015)).

      In this manuscript, our earliest de novo experiments begin with Fz induction at 18h APF (de novo 10h), then at 20h APF (de novo 8h) and at 22h APF (de novo 6h). The image in Fig. 5B, referred to by the reviewer, is of a wing where Fz is induced de novo at 22 h APF. In these wings, as expected, the core proteins localise asymmetrically in stereotypical swirling patterns throughout the wing surface (see Fig. 2M and also Strutt and Strutt, 2002; Strutt and Strutt 2007), but – usefully for our experiments – they broadly localise along the proximal-distal axis in the region analysed in Fig. 5B. Given the strong swirling in surrounding regions when inducing at >20h APF, we feel reasonably confident in assuming that the pattern is not due to a proximodistal cue present in the proximal wing.

      We appreciate that the original manuscript did not show images including the trichome pattern in adjacent regions, so this point would not have been clear, but we now include these in Supplementary Fig. 5. We have also added a note in the legend to Fig. 5B to clarify that the proximodistal pattern seen is local to this wing region. We apologise for this oversight and the confusion caused and appreciate the feedback.

      The 6 hr condition, that has only partial polarity magnitude, is quite disordered. Do the patterns at 8 and 10 hrs become more proximally-distally oriented? It is stated that they all show swirls, but please provide adult wing images, and the corresponding orientation outputs from QuantifyPolarity to help validate the notion that the global cues are indeed bypassed by this paradigm.

      In all three ‘normal’ de novo conditions (6h, 8h and 10h), regardless of the time of induction, the polarity orientation patterns of Fz-mKate2 in pupal and adult wings are very similar in the experimentally analysed region (Fig. S5B-E). The strong local hair swirling agrees with the previously published data (Strutt and Strutt, 2002; Strutt and Strutt 2007). Overall, we don’t see any evidence that the 10h de novo induction results in more proximodistally coordinated polarity than the 8h or 6h conditions. This is consistent with our contention that there is no global cue present at these stages, which presumably would have a stronger effect when core pathway activity was induced at earlier stages.

      Changes to manuscript: Added additional explanation of the ‘de novo induction’ paradigm and why we believe the resulting polarity patterns are unlikely to be influenced by any global signals in Introduction and Results section ‘Induced core protein relocalisation…’. Added quantification of polarity in the experiment region proximal to the anterior cross-vein in pupal wings (Fig.S5E-E’’’) and zoomed-out images of the surrounding region in adult wings showing that the polarity pattern does not become more proximodistal when induction time is longer, and also that there is not overall proximodistal polarity in proximal regions of the wing (Fig.S5B-D), arguing against an unknown proximodistal polarity cue at these stages of development.

      In the de novo paradigm, polarization is initiated immediately or shortly after heat shock induction. However, the results should be differently interpreted if the level of available Fz protein does not rise rapidly and then stabilize before the 6 hr time point, and instead continues to rise throughout the experiment. Western blots of the Fz::mKate2-sfGFP at time points after induction should be performed to demonstrate steady state prior to measurements. Otherwise, polarity magnitude could simply reflect the total available pool of Fz at different times after induction. Interpreting stability is complex, and could depend on the same issue, as well as the amount of recycling that may occur. Prior work from this lab using FRAP suggested that turnover occurs, and could result from recycling as well as replenishment from newly synthesized protein. 

      The reviewer raises an important point, which we agree could confound our experimental interpretations. As suggested we have now carried out western blotting and quantitation for Fz::mKate2-sfGFP levels and added these data to Fig.S1 (Fig. S1C,D). Quantified Fz is not significantly different between the three de novo polarity induction timings and not significantly different compared to constitutive Fz::mKate2-sfGFP expression (although there is a trend towards increasing Fz::mKate2-sfGFP protein levels with increasing induction times). These data are consistent with Fz::mKate2-sfGFP being at steady state in our experiments and that levels are sufficient to achieve normal polarity (as constitutive Fz::mKate2-sfGFP does so). Therefore it is unlikely that differing protein levels explain the differing polarity magnitudes at the different induction times. Interestingly, Fz::mKate2-sfGFP levels are lower than endogenous Fz levels, possibly due to lower expression or increased turnover/reduced recycling.

      Changes to manuscript: Added western blot analysis of Fz::mKate2-sfGFP expression under 10h, 8h and 6h induction conditions vs endogenous Fz expression and constitutive Fz::mKate2-sfGFP expression (Fig.S1C-D) and discussed in Results section ‘Planar polarity establishment is…’.

      From the Fig 3 results, the authors claim that limiting pools of core proteins do not explain cell-scale signaling, a result expected based on the lack of phenotypes in heterozygotes, but of course they do not test the possibility that Fz is limiting. They do note that some other contributing protein could be. 

      Previously published results from our lab (Strutt et al., 2016 Cell Reports; Supplemental Fig. S6E) show that in a heterozygous fz mutant background, Fz protein levels are not affected by halving the gene dosage when compared to wt, suggesting that Fz is most likely produced in excess and is not normally limiting, but that protein that cannot form complexes may be rapidly degraded. We have now added this information to the text.

      Changes to manuscript: Added explanation in text that Fz levels had previously been shown to not be dosage sensitive in Results section ‘Planar polarity establishment is…’ and also added a caveat to the Discussion about not directly testing Fz.

      In Fig 3, it is unclear why the authors chose to test dsh1/+ rather than dsh[null]/+. In any case, the statistically significant effect of Dsh dose reduction is puzzling, and might indicate that the other interpretation is correct. Ideally, a range including larger and smaller reductions would be tested. As is, I don't think limiting Dsh is ruled out. 

      Concerning the choice of dsh allele, we appreciate the query of the reviewer regarding use of dsh[1] instead of a null, as there might be a concern that dsh[1] would give a less strong phenotype. The answer is that over more than two decades we and others have never found any evidence that dsh[1] does not act as a ‘null’ for planar polarity in the pupal wing, and furthermore use of dsh[1] preserves function in Wg signalling – and we would prefer to rule out any phenotypic effects due to any potential cross-talk between the two pathways that might be seen using a complete null. To expand on this point, dsh[1] mutant protein is never seen at cell junctions (Axelrod 2001; Shimada et al., 2001; our own work), and by every criterion we have used, planar polarity is completely disrupted in hemizygous or homozygous mutants e.g. see quantifications of polarity in (Warrington et al., 2017 Curr Biol).

      In terms of the broader point, whether we can rule out Dsh being limiting, we were very careful to be clear that we did not see evidence for Dsh (or other core proteins) being limiting in terms of ‘rates of core pathway de novo polarisation’. When the reviewer says ‘the statistically significant effect of Dsh dose reduction is puzzling’ we believe they are referring to the data in Fig. 3J, showing a small but significantly different reduction in stable Fz in de novo 6h conditions (also seen in 8h de novo conditions, Fig. S3I). As Dsh is known to stabilise Fz in complexes (Strutt et al., 2011 Dev Cell; Warrington et al., 2017 Curr Biol), in itself this result is not wholly surprising. Nevertheless, while this shows that halving Dsh levels does modestly reduce Fz stability, it does not alter our conclusion that halving Dsh levels does not affect Fz polarisation rate under either 6h or 8h de novo conditions.

      Unfortunately, we do not have available to us a practical way of achieving consistent intermediate reductions in Dsh levels (e.g. a series of verified transgenes expressing at different levels). Levels of all the core proteins could be dialled down using transgenes, to see when the system breaks, and indeed we have previously published that lower levels of polarity are seen if Fmi levels are <<50% or if animals are transheterozygous for pk, stbm, dgo or dsh, pk, stbm, dgo simultaneously (Strutt et al., 2016 Cell Reports). However, it seems to be a trivial result that eventually the ability to polarise is lost if insufficient core proteins are present at the junctions. For this reason we have focused on a simple set of experiments reducing gene dosage singly by 50% under two de novo induction conditions, and have been careful to state our results cautiously. The assays we carried out were a great deal of work even for just the 5 heterozygous conditions tested.

      We believe that the experiments shown effectively make the point that there is no strong dosage sensitivity – and it remains our contention that if protein levels were the key to setting up cell-scale polarity, then a 50% reduction would be expected to show an effect on the rate of polarisation. We further note that as Fz::mKate2-sfGFP levels are lower than endogenous Fz levels (see above), the system might be expected to be sensitised to further dosage reductions, and despite this we failed to see an effect on rate of polarisation.

      We note that Reviewer #3 made a similar point about whether we can rule out dosage sensitivity on the basis of 50% reductions in protein level. To address the comments of both reviewers we have now added some further narrative and caveats in the text.

      In a similar vein, Reviewer #2 requested data on whether dosage reduction altered protein levels by the expected amount. We have now added further explanation/references and western blot data to address this.

      Changes to manuscript: Added more explanation of our choice of dsh[1] as an appropriate mutant allele to use in Results section ‘Planar polarity establishment is…’. Added some narrative and caveats regarding whether lowering levels more than 50% would add to our findings in the Discussion. Revised conclusions to be more cautious including altering section title to read ‘Planar polarity establishment is not highly sensitive to variation in protein levels of core complex components’.

      Also added westerns and text/references showing that for the tested proteins there is a reduction in protein levels upon removal of one gene dosage in Results section ‘Planar polarity establishment is…’ and Fig.S2.

      The data in Fig 5 are somewhat internally inconsistent, and inconsistent with the authors' interpretation. In both repolarization conditions, the authors claim that repolarization extends only to row 1, and row 1 is statistically different from non-repolarized row 1, but so too is row 3. Row 2 is not. This makes no sense, and suggests either that the statistical tests are inappropriate and/or the data is too sparse to be meaningful. 

      As we’re sure the reviewer appreciates, this was an extremely complex experiment to perform and analyse. We spent a lot of time trying to find the best way to illustrate the results (finally settling on a 2D vector representation of polarity) and how to show the paired statistical comparisons between different groups. Moreover, in the end we were only able to detect generally quite modest (statistically significant) changes in cell polarity under the experimental conditions.

      However, we note that failure to see large and consistent changes in polarity is exactly the expected result if it is hard to repolarise from a boundary – and this is of course the conclusion that we draw. Conversely, if repolarisation were easy, which was our expectation at least under de novo conditions without existing polarity, then we would have expected large and highly statistically significant changes in polarity across multiple cell rows. Hence we stand by our conclusion that ‘it is hard to repolarise from a boundary of Fz overexpression in both control and de novo polarity conditions’.

      Overall, we were trying to establish three points:

      (1) to demonstrate that repolarisation occurs from a boundary of overexpression i.e. from boundary 0 to row 0

      (2) to establish whether a wave of repolarisation occurs across rows 1, 2 and 3

      (3) to determine if in repolarisation in de novo condition it is easier to repolarise than in repolarisation in the control (already polarised) condition

      Taking each in turn:

      (1) To detect repolarisation from a boundary relative to the control condition, we have to compare row 0 in the repolarisation condition (Fig.5G,K) vs the control condition (Fig.5F,J). This comparison shows a significant repolarisation (p=0.0014). From here on, row 0 in the repolarisation condition is our reference for repolarisation occurring.

      (2) To determine if there is a wave of repolarisation in the repolarisation condition we have to compare row 0 vs row 1 to 3 in the repolarisation condition (Fig.5K). Row 1 is not significantly different to row 0, but rows 2 and 3 are different and the vectors show obviously lower polarity than row 0. Hence no wave of repolarisation is detected over rows 1 to 3.

      (3) To determine if it is easier to repolarise in the de novo condition, our reference for establishment of a repolarisation pattern is the polarisation condition in rows 0 to 3. So, we compare repolarisation condition vs repolarisation in de novo condition, row 0 vs row 0, row 1 vs row 1, row 2 vs row 2 and row 3 vs row 3 – in each case no significant difference in polarity is detected, supporting our conclusion that it is not easier to repolarise in the de novo condition.

      We agree that the variations in row 3 are puzzling, but there is no evidence that this is due to propagation of polarity from row 0, and so in terms of our three questions, it does not alter our conclusions.

      Changes to manuscript: We have extensively revised the text describing the results in Fig.5 to hopefully make the reasons for our conclusions clearer and also be more cautious in our conclusions in Results section ‘Induced core protein relocalisation…’. 

      For the related boundary intensity data in Fig 6, the authors need to describe exactly how boundaries were chosen or excluded from the analysis. Ideally, all boundaries would be classified as either medio-lateral (meaning anterior-posterior) or proximal-distal depending on angle. 

      We thank the reviewer for pointing out that this was not clear.

      All boundaries were classified following their orientation compared to the Fz over-expression boundary using hh-GAL4 expressed in the wing posterior compartment. Horizontal junctions were defined as parallel to the Fz over-expression boundary (between 0 and 45 degrees) and mediolateral junctions as junctions linking two horizontal boundaries (between 45 and 90 degrees).

      Changes to manuscript: The boundary classification detailed above has been added in the Materials and Methods.
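
      For illustration only, the angle-based classification described above could be sketched as follows. This is a minimal sketch and not the analysis code used for the manuscript: the function names, the representation of a junction by its two vertex coordinates, and the `boundary_angle_deg` parameter are all hypothetical. The description above assigns 0–45 degrees to horizontal and 45–90 degrees to mediolateral junctions; placing a junction at exactly 45 degrees in the mediolateral class is an assumption made here.

```python
import math


def junction_angle_deg(p1, p2, boundary_angle_deg=0.0):
    """Acute angle (0-90 degrees) between a junction and the reference boundary.

    p1, p2: (x, y) coordinates of the two junction vertices (hypothetical input format).
    boundary_angle_deg: orientation of the Fz over-expression boundary in the image.
    """
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    angle = math.degrees(math.atan2(dy, dx)) - boundary_angle_deg
    # A junction has an orientation but no direction, so fold into 0-180 ...
    angle = abs(angle) % 180.0
    # ... and then into 0-90 to obtain the acute angle to the boundary.
    return min(angle, 180.0 - angle)


def classify_junction(p1, p2, boundary_angle_deg=0.0, threshold_deg=45.0):
    """Label a junction 'horizontal' (below threshold) or 'mediolateral' (otherwise)."""
    angle = junction_angle_deg(p1, p2, boundary_angle_deg)
    # Assumption: a junction at exactly the threshold is counted as mediolateral.
    return "horizontal" if angle < threshold_deg else "mediolateral"
```

      Under these assumptions, a junction running nearly parallel to the boundary, such as one from (0, 0) to (10, 1), would be labelled horizontal, and one from (0, 0) to (1, 10) would be labelled mediolateral.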

      If the authors believe their Fig 5 and 6 analyses, how do they explain that hairs are reoriented well beyond where the core proteins are not? This would be a dramatic finding, because as far as I know, when core proteins are polarized, prehair orientation always follows the core protein distribution. Surprisingly, the authors do not so much as comment about this. The authors should age their wings just a bit more to see whether the prehair pattern looks more like the adult hair pattern or like that predicted by their protein orientation results.

      Again the reviewer makes an interesting point, and we agree that this is something that we should have more directly addressed in the manuscript.

      There are three reasons why we might expect adult trichomes to show a different effect from the measured core protein polarity pattern seen in our experiments:

      (i) we are assaying core protein polarity at 28h APF, but trichomes emerge at >32h APF, so there is still time for polarity to propagate a bit further from the boundary. We now have added data showing that by the point of trichome initiation, the wave of polarisation extends 3-4 cell rows (Fig.S5A).

      (ii) it has long been known that a strong localisation of core proteins at a cell edge is not required for polarisation of trichome polarity from a boundary. For instance, in Strutt & Strutt 2007 we show clones of cells overexpressing Fz causing propagation through pk[pk-sple] mutant tissue where there is no detectable core protein polarity. We were following up prior observations of Adler et al., 2000 in the wing and Lawrence et al., 2004 in the abdomen.

      (iii) there is evidence to suggest that the polarity of adult trichomes is locally coupled, possibly mechanically. This point is hard to prove without live imaging taking in both initial core protein localisation, the site of actin-rich trichome initiation and then the final orientation of the much larger microtubule filled trichome, and we’re not aware that such data exist. However, Wong & Adler 1993 (JCB) showed that over a number of hours trichomes become much larger and move towards the centre of the cell, presumably becoming decoupled from any core protein cue. The images in Guild … & Tilney, 2005 (MBoC)  are also interesting to look at in this regard. Finally, septate junction proteins have been implicated in local alignment of trichomes, independently of the core pathway (Venema … & Auld, 2004 Dev Biol).

      Changes to manuscript: Added new data in Fig.S5A showing where trichomes initiate under 6h de novo induction conditions, for comparison to core protein localisation and adult trichome data in Fig.5. Added some text explaining why adult trichome repolarisation might be stronger than the observed effects on core protein localisation in Discussion. 

      Minor points:

      As the authors know, there is a model in the literature that suggests microtubule trafficking provides a global cue to orient PCP. The authors' repolarization data in Fig 4 make a reasonably convincing case against a role for microtubules in cell-scale signaling, but do not rule out a role as a global cue. The authors should be careful of language such as "...MTs and core proteins being oriented independently of each other" that would appear to possibly also refer to a role as a global cue.

      Thank you for pointing out that this was not clear. We have now modified the text to hopefully address this.

      Changes to manuscript: Text updated in Results section ‘Microtubules do not provide…’.

      Significance:

      There are two negative conclusions and one positive conclusion made by the authors. Provided the above points are addressed, the negative conclusions, that core proteins are not limiting and that microtubules are not involved in cell-scale signaling, are solid. The positive conclusion is more nebulous - the authors say that cell-scale signaling is strong relative to cell-cell signaling - but how strong is strong? Strong relative to their prior expectations? I'm not sure how to interpret such a conclusion. Overall, we learn something from these results, though they fail to reveal anything about mechanism. These results will be of some interest to those studying PCP.

      The reviewer raises an interesting point, which is how do you compare the strength of two different processes, even if both processes affect the same outcome (in this case cell polarity). Repolarisation from a boundary has not been carefully studied at the level of core protein localisation in any previous study to our knowledge – this is one of the important novel aspects of this study. Hence there is not a baseline for defining strong repolarisation. Similarly, there has been no investigation of the nature of ‘cell-scale signalling’. This was a considerable challenge for us in writing the manuscript, and we have done our best to find appropriate language that hopefully conveys our message adequately. Minimally our work may provide a baseline for helping to define the ‘strengths’ of these processes in future studies.

      One of our main points is that we can generate an artificial boundary of Fz expression, where Fz levels are at least several fold higher than in the neighbouring cell (e.g. compare Fig.4N’ and O’) and only two rows of cells show a significant change in polarity relative to controls. Even when the tissue next to the overexpression domain is still in the process of generating polarity (de novo condition) then the boundary has little effect on polarity in neighbouring cell rows. This was a result that surprised us, and we tried to convey that by using language to suggest cell-scale signalling was stronger than cell-cell signalling i.e. stronger in terms of the ability to define the final direction of polarity.

      Changes to manuscript: In the revised manuscript we have reviewed our use of language and now avoid saying ‘strong’, instead using terms such as ‘effective’ and ‘robust’ (e.g. in the Results section ‘Induced core protein relocalisation…’ and in the Discussion). We have also changed the title of the manuscript to avoid claiming a ‘strong’ signal.

      Reviewer #2:

      […] Critique

      The experiments described in this paper are of high quality with a sophisticated level of design and analysis. However, there needs to be some recalibration of the extent of the conclusions that can be drawn (see below). Moreover, a limitation of this paper is that, despite the quality of their data, they cannot give a molecular hint about the nature of their proposed cell-scale signal. Below are two key points that the authors may want to clarify.

      (1) The first set of repolarisation experiments is performed after the global cell rearrangements that have been shown to act as a global signal. However, this approach does not exclude the possible contribution of an unknown diffusible global signal.

      A similar point was raised by Reviewer 1. For the convenience of this reviewer, we’ll summarise the arguments against such an unknown cue again below. More broadly, both reviewers asking a similar question indicates that we have failed to lay out the evidence in sufficient detail. In our defence, we have used the same ‘de novo’ paradigm in three previous publications (Strutt and Strutt 2002, 2007; Brittle et al 2022) without attracting (overt) controversy. We have now added text to the Introduction and Results that goes into more detail, as well as more experimental evidence (Fig.S5).

      Firstly, it is worth noting that the global cues acting in the wing are poorly understood, with mostly negative evidence against particular cues accruing in recent years. This makes it a hard subject to succinctly discuss. Secondly, we accept that it is hard to prove there is no influence of global cues, when the nature of those cues and the time at which they act remain unclear. Below we summarise the reasons why we believe there are no significant effects of global cues in our experiments that would influence the interpretation of our results.

      First, our reading of the literature supports a broad consensus that an early radial core planar polarity pattern is realigned by cell flow produced by hinge contraction beginning at around 16h APF (e.g. Aigouy et al., 2010; Strutt and Strutt, 2015; Aw and Devenport, 2017; Butler and Wallingford, 2017; Tan and Strutt, 2025). Taken at face value, this suggests that there are ‘radial’ cues present prior to hinge contraction, maybe coming from the wing margin – arguably these radial cues could be Ft-Ds or Wnts or both, given they are expressed in patterns consistent with such a role (notwithstanding the published evidence arguing against roles for either of these cues). It then appears that hinge contraction supersedes these cues to convert a radial pattern to a proximodistal pattern – whether the radial cues that affect the core pathway earlier remain active after hinge contraction is unclear, although both Ft-Ds and Wnts appear to maintain their ‘radial’ patterns beyond the beginning of hinge contraction (e.g. Merkel et al., 2014; Ewen-Campen et al., 2020; Yu et al., 2020).

      We think that the reviewers are proposing the presence of a proximodistal cue that is active in the proximal region of the wing that we use for our experiments shown e.g. in Fig.5, and that this cue orients core polarity here (but not elsewhere in the wing) in a time window after 18h APF. Ft-Ds and Wnts do not seem to be plausible candidates as they are still in ‘radial’ patterns. This leaves either an unknown proximodistal cue (a gradient of some unknown signalling molecule?), or possibly some ability of hinge contraction to align proximodistal polarity specifically in this wing region but not elsewhere. We cannot definitively rule out either of these possibilities, but neither do we think there is sufficient evidence to justify invoking their existence to explain our observations.

      In particular, the reason that we don’t think there is a proximodistal cue in the proximal part of the wing after 18h APF, is that work from our lab shows that induction of Fz or Stbm expression at times around or after the start of hinge contraction (i.e. >16 h APF) results in increasing levels of trichome swirling with polarity not being coordinated with the tissue axis either proximally or distally (Strutt and Strutt, 2002; Strutt and Strutt 2007). Our simplest interpretation of this is that induction at these stages fails to result in the early radial pattern of core pathway polarity being established and hence a failure of hinge contraction to reorient radial to proximodistal. If hinge contraction alone could specify proximodistal polarity in the absence of the earlier radial polarity, then we would not expect to see swirling over much of the proximal wing (where the forces from hinge contraction are strongest, Etournay et al., 2015).

      In this manuscript, our earliest de novo experiments begin at 18h APF (de novo 10h), then at 20h APF (de novo 8h) and at 22h APF (de novo 6h). The image in Fig. 5B referred to by Reviewer 1, is of a wing where Fz is induced de novo at 22 h APF. In these wings, as expected, the core proteins localise asymmetrically in stereotypical swirling patterns throughout the wing surface (see Fig. 2M and also Strutt and Strutt, 2002; Strutt and Strutt 2007), but – usefully for our experiments – they broadly localise along the proximal-distal axis in the region analysed in Fig. 5B. Given the strong swirling in surrounding regions when inducing at >20h APF, we feel reasonably confident in assuming that the pattern is not due to a proximodistal cue present in the proximal wing. We appreciate that the original manuscript did not show images including the trichome pattern in adjacent regions, so this point would not have been clear, but we now include these in Supplementary Fig.S5. We have also added a note in the legend to Fig. 5B to clarify that the proximodistal pattern seen is local to this wing region.

      Changes to manuscript: Text extended in Introduction and Results to better explain why we believe the de novo conditions that we use most likely result in a polarity pattern that is not significantly influenced by ‘global cues’. We now show zoomed-out images of adult wings covering the area surrounding the experimental region, which lies proximal to the anterior cross-vein. These show that the polarity pattern does not become more proximodistal when induction time is longer, and also that there is no overall proximodistal polarity in proximal regions of the wing, arguing against an unknown proximodistal polarity cue at these stages of development (Fig.S5B-E’’’).

      (2) The putative non-local cell scale signal must be more precisely defined (maybe also given a better name). It is not clear to me that one can separate cell-scale from molecular-scale signal.

      Local signals can redistribute within a cell (or membrane) so local signals are also cell-scale. Without a clear definition, it is difficult to interpret the results of the gene dosage experiments. The link between gene dosage and cell-scale signal is not rigorously stated. Related to this, the concluding statement of the introduction is too cryptic.

      We thank the reviewer for raising this, as again a similar comment was made by Reviewer 1, so we are clearly falling short in defining the term. We have now had another attempt in the Introduction.

      To more specifically answer the point made by the reviewer regarding molecular vs cellular, we are essentially being guided here by the prior computational modelling work, as at the biological level the details are still being worked out. A specific class of previous models only allowed ‘signals’ between core proteins to act ‘locally’, meaning within a cell junction, and within the models there was no explicit mechanism by which proteins on other junctions could ‘detect’ the polarity of a neighbouring junction (e.g. Amonlirdviman et al., 2005; Le Garrec et al., 2006; Fischer et al., 2013). Other models implicitly or explicitly encode a mechanism by which cell junctions can be influenced by the polarity of other junctions (e.g. Meinhardt, 2007; Burak and Shraiman, 2009; Abley et al., 2013; Shadkhoo and Mani, 2019), for instance by diffusion of a factor produced by localisation of particular planar polarity proteins.

      We agree with the reviewer that a cell-scale signal will depend on ‘molecules’ and thus could be called ‘molecular-scale’, but here by ‘molecular-scale’ we mean signals that act at the range of the sizes of molecules i.e. nanometers, rather than cell-scale signals that act at the size of cells i.e. micrometers. A caveat to our definition is that we implicitly include interactions that occur locally on cell junctions (<1 µm range) within ‘molecular-scale’, but this is a shorter range than ‘cellular-scale’ which requires signals acting over the diameter of a cell (3-5 µm). Nevertheless, we think the concept of ‘molecular-scale’ vs ‘cell-scale’ is a helpful one in this context, and have attempted to address the issue through a more careful definition of the terms.

      Changes to manuscript: Text revised in Introduction and legend to Fig.1 to more carefully define ‘cell-scale signalling’ and to distinguish it from ‘molecular-scale signalling’. Final sentence of Introduction also altered so we no longer cryptically speculate on the nature of the cell-scale signal but leave this to the Discussion.

      Minor comments. 

      Some of the (clever) genetic manipulation may need more details in the text. For example:

      - Need to specify if the hs-flp approach induces expression throughout the tissue.

      We apologise for the lack of clarity. In all the experiments, the hs-FLP transgene is present in all cells, and heat-shock results in ubiquitous expression. 

      Changes to manuscript: We have clarified this in the Results and Materials and Methods.

      - Need to specify in the text that in the unpolarised condition the tissue is both dsh and fz mutant.

      The reviewer is of course correct and we have updated this point in the text. The full genotype for the unpolarised condition is: w dsh[1] hsFLP22/y;; Act>>fz-mKate2sfGFP, fz[P21]/fz[P21] (see Table S1). So this line is mutant for dsh and fz with induced expression of Fz-mKate2sfGFP.

      Changes to manuscript: We have clarified this in the relevant part of the Results.

      - Need to specify in the text that the experiment illustrated in Fig 5 is with hh-gal4. 

      As noted by the reviewer, we continued to use the same hh-GAL4 repolarisation paradigm as in Fig.4 and this information was in the legend to Fig.5. However, we agree it is helpful to be explicit about this in the main text.

      Changes to manuscript: We have added this to this section of the Results.

      - Need to address a possible shortcoming of the hh experiment, that the AP boundary is a region of high tension.

      It is true that the AP boundary is under high tension in the wing disc (e.g. Landsberg et al., 2009), but we are not aware of any evidence that this higher tension persists into the pupal wing. In separate studies we have labelled for Myosin II in pupal wings (Trinidad et al 2025 Curr Biol; Tan & Strutt 2025 Nature Comms), and have not noticed preferentially higher levels on the AP boundary. If tension were higher, we would expect the cell boundaries to appear straighter than in surrounding cells (as seen in the wing disc), and this is not evident in our images.

      - Need to dispel the possibility that there is residual polarisation (e.g. of other components) in fz1 mutant (I assume this is the case).

      We use the null allele fz[P21] throughout this work, and we and others have consistently reported a complete loss of polarisation of other core proteins or downstream components in this background. The caveat to this is that core proteins that persist at cell junctions always appear at least slightly punctate in mutant backgrounds for other core proteins, and so any automated detection algorithm will always find evidence of individual cell polarity above a baseline level of uniform distribution. Hence we tend to use lack of local coordination of polarity (variance of cell polarity angle) as an additional measure of loss of polarisation, in addition to direct measures of average cell polarity. (We discuss this in the QuantifyPolarity manuscript Tan et al 2021 e.g. Fig.S6).

      Changes to manuscript: We now include in the Materials and Methods section ‘Fly genetics…’ a much more extensive explanation of the evidence for specific mutant alleles being ‘null’ for planar polarity function (including dsh1 as raised by Reviewer 1), specifically that they result in no detectable planar polarisation of either other core proteins or downstream effectors, and added appropriate references.
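
      To illustrate the idea of using the variance of cell polarity angle as a readout of local coordination, the following is a minimal sketch only. It assumes polarity angles are axial (an angle and the same angle plus 180° describe the same axis) and uses the standard circular variance on doubled angles; the function name `axial_circular_variance` is hypothetical, and this is not necessarily the exact measure implemented in QuantifyPolarity.

```python
import math


def axial_circular_variance(angles_deg):
    """Circular variance of axial polarity angles given in degrees.

    Returns a value near 0 when cells share one polarity axis and
    near 1 when polarity axes are uncoordinated.
    """
    if not angles_deg:
        raise ValueError("need at least one angle")
    # Double each angle so that theta and theta + 180 count as the same axis.
    doubled = [math.radians(2.0 * a) for a in angles_deg]
    mean_cos = sum(math.cos(t) for t in doubled) / len(doubled)
    mean_sin = sum(math.sin(t) for t in doubled) / len(doubled)
    # Length of the mean resultant vector: 1 = perfectly aligned, 0 = no common axis.
    resultant = math.hypot(mean_cos, mean_sin)
    return 1.0 - resultant
```

      In such a scheme, a group of cells that each score as individually polarised (for instance because of punctate junctional staining) but whose angles give a variance close to 1 would be read as lacking coordinated polarity.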

      - Need to provide evidence that 50% gene dosage commensurately affects protein level.

      This is a good suggestion. In the case of Stbm, we have already published a western blot showing that a reduction in gene dosage results in reduced protein levels (Strutt et al 2016, Fig.S6). We have now performed western blots to quantify protein levels upon reduction of fmi, pk and dgo levels (we actually used EGFP-Dgo for the latter, as we don’t have antibodies that can detect endogenous Dgo on western blots).

      Changes to manuscript: When presenting the dosage reduction experiments, we now refer back to Strutt et al., 2016 explicitly for Stbm, and have added western blot data for Fmi, Pk and EGFP-Dgo in new Fig.S2.

      - I am surprised that the relationship with microtubule polarity was never investigated. Is this true? 

      We agree this is a point that needed further clarification, as Reviewer 1 made a related point regarding the two possible roles for microtubules, one being as a mediator of a global cue upstream of the core pathway, and the second (which we investigate in this manuscript) as a mediator of a cell-scale signal downstream of the core pathway.

      Both the Uemura and Axelrod groups have published on potential upstream function as a global cue mediator in the Drosophila wing (e.g. Shimada et al., 2006; Harumoto et al., 2010; Matis et al., 2014).

      Both groups have also looked at whether core pathway components could affect orientation of microtubules (Harumoto et al., 2010; Olofsson et al., 2014; Sharp and Axelrod 2016). Notably Harumoto et al., 2010 observed that in 24h APF wings, loss of Fz or Stbm did not alter microtubule polarity from a proximodistal orientation, consistent with the microtubules aligning along the long cell axis in the absence of other cues. However, this did not rule out an instructive effect of Fz or Stbm on microtubule polarity during core pathway cell-scale signalling. The Axelrod lab manuscripts saw interesting effects of Pk protein isoforms on microtubule polarity, albeit not throughout the entire wing, which hinted at a potential role in cell-scale signalling. Taken together this prior work was the motivation for our directed experiments to specifically test whether the core pathway might generate cell-scale polarity by instructing microtubule polarity.

      Changes to manuscript: We have revised the Results section ‘Microtubules do not…’ to make a clearer distinction regarding possible ‘upstream’ and ‘downstream’ roles of microtubules in Drosophila core pathway planar polarity and the motivation for our experiments investigating the latter.

      - The authors suggest that polarity does not propagate as a wave. And yet the range measured in adult is longer than in the pupal wing. Explain. 

      Again an excellent point, also made by Reviewer 1, which we have now addressed explicitly in the manuscript. For the convenience of this reviewer, we lay out the reasons why we think the propagation of polarity seen in the adult is further than seen for core protein localisation.

      There are three reasons why we might expect adult trichomes to show a different effect from the measured core protein polarity pattern seen in our experiments:

      (i) we are assaying core protein polarity at 28h APF, but trichomes emerge at >32h APF, so there is still time for polarity to propagate a bit further from the boundary. We now have added data showing that by the point of trichome initiation, the wave of polarisation extends 3-4 cell rows (Fig.S5A).  

      (ii) it has long been known that a strong localisation of core proteins at a cell edge is not required for propagation of trichome polarity from a boundary. For instance, in Strutt & Strutt 2007 we show clones of cells overexpressing Fz causing propagation through pk[pk-sple] mutant tissue where there is no detectable core protein polarity. We were following up prior observations of Adler et al., 2000 in the wing and Lawrence et al., 2004 in the abdomen.

      (iii) there is evidence to suggest that the polarity of adult trichomes is locally coupled, possibly mechanically. This point is hard to prove without live imaging that captures the initial core protein localisation, the site of actin-rich trichome initiation and then the final orientation of the much larger microtubule-filled trichome, and we’re not aware that such data exist. However, Wong & Adler 1993 (JCB) showed that over a number of hours trichomes become much larger and move towards the centre of the cell, presumably becoming decoupled from any core protein cue. The images in Guild … & Tilney, 2005 (MBoC) are also interesting to look at in this regard. Finally, septate junction proteins have been implicated in local alignment of trichomes, independently of the core pathway (Venema … & Auld, 2004 Dev Biol).

      Changes to manuscript: Added new data in Fig.S5A showing where trichomes initiate under 6h de novo induction conditions, for comparison to core protein localisation and adult trichome data in Fig.5. Added some text explaining why adult trichome repolarisation might be stronger than the observed effects on core protein localisation in Discussion. 

      - The discussion states that the cell-intrinsic system remains to be fully characterised, implying that it has been partially characterised. What do we know about it? 

      As the reviewer probably realises, we were attempting to side-step a long speculative discussion about the various hints and ideas in the literature by grouping them under the umbrella of ‘remaining to be fully characterised’. We would argue that this current manuscript is the first to attempt to systematically investigate the nature of ‘cell-scale signalling’. The lack of prior work is probably due to two factors (i) pioneering theoretical work showed that a sufficiently strong global signal coupled with ‘local’ (i.e. confined to one cell junction) protein interactions was sufficient to polarise cells without the need to invoke the existence of a cell-scale signal; (ii) there is no easy way to identify cell-scale signals as their loss results in loss of polarity which will also occur if other (i.e. more locally acting) core pathway functions are compromised.

      The main investigation of the potential for cell-scale signalling has been another set of theory studies (Burak and Shraiman 2009; Abley et al., 2013; Shadkhoo and Mani 2019) which have considered the possibility of diffusible signals. In our present work we have further considered the possibility of a ‘depletion’ model, based on the pioneering theory work of Hans Meinhardt, and as discussed above the possibility that microtubules could mediate a cell-scale signal.

      Changes to manuscript: We have revised the Discussion to hopefully be clearer about the current state of knowledge.

      Reviewer #3:

      […] Major comments

      The data are clearly presented and the manuscript is well written. The conclusions are well supported by the data. 

      (1) The authors use a system to de novo establish PCP, which has the advantage of excluding global cues orienting PCP and thus to focus on the cell-intrinsic mechanisms. At the same time, the system has the limitation that it is unclear to what extent de novo PCP establishment reflects 'normal' cell scale PCP establishment, in particular because the Gal4/UAS expression system that is used to induce Fz expression will likely result in much higher Fz levels compared with the endogenous levels. The authors should briefly discuss this limitation. 

      We apologise if this wasn’t clear. We only used GAL4/UAS overexpression when we were generating an artificial boundary of Fz expression with hh-GAL4 to induce repolarisation. The de novo induction system involves Fz::mKate2-sfGFP being expressed directly under an Act5C promoter without use of GAL4/UAS. In response to a comment from Reviewer 1 we have now carried out western blot analysis which shows that Fz::mKate2-sfGFP levels under Act5C are actually lower than endogenous Fz levels. As we achieve normal levels of polarity, similar to what we measure in wild-type conditions when measured using QuantifyPolarity, we therefore assume that Fz levels are not limiting under these conditions. However, we note that lower than normal levels of Fz might sensitise the system to perturbation, which in fact would be advantageous in our study, as it might for instance have been expected to more readily reveal dosage sensitivity of other components.

      Changes to manuscript: We now describe the levels of expression achieved using the de novo induction system (Fig.S1C-D) and discuss possible consequences in the relevant Results sections and Discussion.

      (2) Fig. 3. The authors use heterozygous mutant backgrounds to test the robustness of de novo PCP establishment towards (partial) depletion in core PCP proteins. The authors conclude that de novo polarization is 'extremely robust to variation in protein level'. Since the authors (presumably) lowered protein levels by 50%, this conclusion appears to be somewhat overstated. The authors should tune down their conclusion. 

      Reviewer 1 makes a similar point about whether we can argue that the lack of sensitivity to a 50% reduction in protein levels actually rules out the depletion model. To address the comments of both reviewers we have now added some further narrative and caveats in the text.

      We nevertheless believe that the experiments shown effectively make the point that there is no strong dosage sensitivity – and it remains our contention that if protein levels were the key to setting up cell-scale polarity, then a 50% reduction would be expected to show an effect on the rate of polarisation. We further note that as Fz::mKate2-sfGFP levels are lower than endogenous Fz levels, the system might be expected to be sensitised to further dosage reductions, and despite this we fail to see an effect on rate of polarisation.

      In a similar vein, Reviewer 2 requested data on whether dosage reduction altered protein levels by the expected amount. We have now added further explanation/references and western blot data to address this.

      Changes to manuscript: Added to the Discussion some narrative and caveats regarding whether lowering levels by more than 50% would add to our findings. Revised conclusions to be more cautious, including altering the section title to read ‘Planar polarity establishment is not highly sensitive to variation in protein levels of core complex components’.

      Also added westerns and text/references in Results section ‘Planar polarity establishment is…’ and Fig.S2, showing that for the tested proteins there is a reduction in protein levels upon removal of one gene dosage.

      Minor comments :

      (1) Page 3. The authors mention and reference that they used the PCA method to quantify cell polarity magnification and magnitude. It would help the unfamiliar reader, if the authors would briefly describe the principle of this method. 

      Changes to manuscript: More details have been added in Materials & Methods.

      Significance:

      The manuscript contributes to our understanding of how planar cell polarity is established. It extends previous work by the authors (Strutt and Strutt, 2002,2007) that already showed that induction of core PCP pathway activity by itself is sufficient to induce de novo PCP. This manuscript further explores the underlying mechanisms. The authors test whether de novo PCP establishment depends on an 'inhibitory signal', as previously postulated (Meinhardt, 2007), but do not find evidence. They also test whether core PCP proteins help to orient microtubules (which could enhance cell intrinsic polarization of core PCP proteins), but, again, do not find evidence, corroborating previous work (Harumoto et al, 2010). The most significant finding of this manuscript, perhaps, is the observation that local de novo PCP establishment does not propagate far through the tissue. A limitation of the study is that the mechanisms establishing intrinsic cell scale polarity remain unknown. The work will likely be of interest to specialists in the field of PCP.